GIL Reporting

AI in the OSPool with emphasis on PelicanFS Python bindings

April 29, 2025

AI workloads are playing an increasingly important role in scientific research, so supporting this use case well is becoming more important in the OSPool, too. As with many other compute workloads, the data-handling part of AI workloads requires heightened attention on the HTC-oriented OSPool resources. Since the OSPool does not offer a shared file system, the Pelican software stack is the recommended way of handling data. To make access to Pelican-managed resources easier, the PelicanFS Python bindings have been developed, allowing transparent access to such data from typical AI workloads, e.g. PyTorch-based ones.
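
As a rough illustration, the sketch below reads OSDF-hosted data through the PelicanFS fsspec interface; the namespace path is a placeholder, and the exact class and protocol names should be checked against the pelicanfs documentation.

```python
# A minimal sketch of reading OSDF-hosted data via PelicanFS (an fsspec
# implementation). The namespace path below is a placeholder, not a real object.
import fsspec
from pelicanfs.core import PelicanFileSystem

# Point the filesystem at the federation via its discovery URL.
pelfs = PelicanFileSystem("pelican://osg-htc.org")

# List and read objects under a (placeholder) namespace, just like a local FS.
print(pelfs.ls("/example-namespace/dataset"))
with pelfs.open("/example-namespace/dataset/sample.bin", "rb") as f:
    data = f.read()

# Because PelicanFS plugs into fsspec, libraries that accept fsspec URLs
# (e.g. PyTorch or Hugging Face data loaders) can consume osdf:// paths too.
with fsspec.open("osdf:///example-namespace/dataset/sample.bin", "rb") as f:
    data = f.read()
```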

OSDF Cache Selection

April 4, 2025

Testing condor_ssh_to_job

December 23, 2024

An alternative pilot setup in Kubernetes

December 19, 2024

The existing OSPool pilot setup on Kubernetes in the NRP relies on nested containerization, which is not supported on the majority of Kubernetes deployments and has security drawbacks, too. This document proposes an alternative pilot setup that uses only Kubernetes-native mechanisms, making it usable on most Kubernetes-managed resources. In order to implement this new pilot setup, HTCondor will need to become Kubernetes-aware. This document provides pointers to the envisioned Kubernetes mechanisms to be used by various HTCondor processes, as well as a couple of simplified pilot-like Kubernetes examples that exercise most of those mechanisms.
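
For illustration only, a minimal pilot-like pod could be created with nothing more than the official Kubernetes Python client; the image name, namespace, command, and resource figures below are placeholders, not the document's actual pilot configuration.

```python
# A hedged sketch: launch a single pilot-like pod using only Kubernetes-native
# mechanisms via the official Python client. All names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(generate_name="ospool-pilot-"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="pilot",
                image="example.org/ospool-pilot:latest",   # placeholder image
                command=["/usr/sbin/condor_master", "-f"],  # placeholder command
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "8", "memory": "32Gi"},
                    limits={"cpu": "8", "memory": "32Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="osg-pilots", body=pod)
```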

Understanding HTCondor Logs

June 30, 2024

The ShadowLog provides a detailed description of a job as it moves through the HTCondor system, from when it is submitted until it terminates. For a typical job, the logs are relatively concise and understandable.
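
The ShadowLog itself is a daemon log on the access point, but as a rough illustration the same submit-to-terminate lifecycle can also be followed programmatically from a job's user event log via the htcondor Python bindings (the log path below is a placeholder).

```python
# A small sketch: walk a job's lifecycle events from its user log.
import htcondor

jel = htcondor.JobEventLog("job.log")        # placeholder path to the job's user log
for event in jel.events(stop_after=0):       # stop_after=0: do not wait for new events
    print(event.cluster, event.proc, event.type)
```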

OSDF Operational Improvement Ideas

March 27, 2024

The Open Science Data Federation (OSDF) is an OSG service designed to support the sharing of files staged in autonomous “origins”, providing efficient access to those files from anywhere in the world via a global namespace and a network of caches. The OSDF may be used either standalone, allowing data to be downloaded via HTTPS, or with HTCondor managing data transfer for compute jobs running on one of the many resource pools supported by OSG.
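
As a sketch of the standalone access path, a file can be fetched over plain HTTPS; the cache hostname and namespace path below are placeholders rather than a real OSDF object.

```python
# A minimal sketch of the "standalone" HTTPS access path to an OSDF cache.
# The URL below is a placeholder, not a real OSDF namespace.
import requests

url = "https://osdf-cache.example.org/example-namespace/dataset/sample.bin"
resp = requests.get(url, timeout=60)
resp.raise_for_status()

with open("sample.bin", "wb") as f:
    f.write(resp.content)
```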

HTCondor Annex - Scalability and Stability

January 31, 2024

One challenge for high-throughput computing (HTC) is integrating with computational clouds or other clusters to provide more computational power. HTCondor Annex can create worker nodes on clusters managed by schedulers such as HTCondor, Slurm, and others.

CRIU - Checkpoint Restore in Userspace

December 22, 2022

Scientific applications must sometimes be stopped mid-execution due to hardware problems or because a job reaches its end of life. In that case, some applications can create a set of files that saves their current state, to be loaded during a later restore. However, most applications do not have this feature. CRIU (Checkpoint Restore in Userspace, pronounced kree-oo) is a tool for checkpointing and restoring applications in a GNU/Linux environment. With CRIU it is possible to stop an application, save its working memory to disk, and later restore that state. As the OSPool is built around the notion of preemptable resources, this could be very useful for jobs that get preempted or exceed their allocated runtimes.
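
As a hedged sketch of how a job wrapper might drive CRIU, the snippet below dumps and later restores a running process; the PID and image directory are placeholders, and real applications (open sockets, TTYs, external files) typically need additional CRIU options.

```python
# A hedged sketch of driving CRIU from Python via its command-line interface.
import subprocess

pid = 12345             # PID of the running application (placeholder)
images = "ckpt-images"  # directory that will hold the checkpoint images (placeholder)

# Dump the process tree to disk, leaving the original process running.
subprocess.run(
    ["criu", "dump", "-t", str(pid), "--images-dir", images,
     "--shell-job", "--leave-running"],
    check=True,
)

# Later (possibly after preemption, on another node), restore the saved state.
subprocess.run(
    ["criu", "restore", "--images-dir", images, "--shell-job"],
    check=True,
)
```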

Accounting of provisioned resources on OSG pool

August 23, 2022

HTCondor and the Glidein Workflow Management System (GlideinWMS) are two of the tools that provide access to computational resources on the OSG pool. Our goal is to verify the information about computational resource provisioning using only HTCondor and GlideinWMS.
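
As an illustrative sketch, provisioning information can be pulled directly from an HTCondor collector by querying the glidein execute slots; the collector hostname and the glidein attribute names below are assumptions that may differ per pool.

```python
# A minimal sketch: sum the cores currently provisioned in a pool by querying
# the collector for execute-slot ads. Hostname and attributes are assumptions.
import htcondor

coll = htcondor.Collector("cm-1.ospool.osg-htc.org")  # placeholder collector host
slots = coll.query(
    htcondor.AdTypes.Startd,
    constraint='SlotType =!= "Dynamic"',   # skip dynamic slots to avoid double counting
    projection=["Name", "Cpus", "Memory", "GLIDEIN_Site"],
)

total_cpus = sum(int(ad.get("Cpus", 0)) for ad in slots)
print(f"{len(slots)} slots, {total_cpus} provisioned cores")
```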

Using PRP-developed provisioner for dynamic provisioning of non-Grid resources

June 30, 2022

The OSPool has mostly been relying on GlideinWMS for the provisioning of its execute resources. While that has worked reasonably well for Grid-type resources, GlideinWMS currently lacks support for many non-Grid resource types, including Kubernetes and Lancium. Adding support for those kinds of resources to GlideinWMS would require a non-trivial amount of work, so an alternative approach was investigated.

Limits of current GPU accounting in OSG

May 31, 2022

OSG currently accounts for GPU resources in “GPU chip hours” [[1][source1]]. However, GPUs are not all the same: some are small and some are big, so the science delivered by different models varies by an order of magnitude. Moreover, a single GPU can be shared between multiple jobs.

We thus propose that OSG start treating GPUs similarly to CPUs, i.e., switch to GPU-core-hours, analogous to the CPU-core-hours we already use for CPUs.
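
A hypothetical worked example of the proposed metric, weighting wall hours by the number of CUDA cores in each device (the jobs are made up; the core counts are published vendor figures):

```python
# Illustrative GPU-core-hours calculation; jobs and usage fractions are invented.
CUDA_CORES = {"V100": 5120, "A100": 6912, "T4": 2560}

jobs = [  # (gpu_model, wall_hours, fraction_of_gpu_used)
    ("A100", 2.0, 1.0),   # a whole A100 for 2 hours
    ("V100", 2.0, 1.0),   # a whole V100 for 2 hours
    ("A100", 4.0, 0.5),   # half an A100 (e.g. a shared/partitioned GPU) for 4 hours
]

for model, hours, frac in jobs:
    core_hours = CUDA_CORES[model] * frac * hours
    print(f"{model}: {hours} h x {frac} -> {core_hours:.0f} GPU-core-hours")
```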

Proposal for classifying the utilization of OSG

January 7, 2022

For this proposal, we are concerned only with wall-time measures from the combined GlideinWMS and HTCondor systems. The proposed classification may thus be understood largely as a walk through the “state machine” of how these two systems work together.

Multiple files vs. Tar.gz

January 7, 2022

The objective of these tests is to check whether it is faster to transfer each file separately or to compress all the files into a single tar.gz archive for a job on the Open Science Grid (OSG) using HTCondor.
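
As a small sketch of the two strategies being compared, the snippet below either lists the input files individually (as they would appear in transfer_input_files) or bundles them into a single compressed archive; the file names are placeholders.

```python
# A small sketch of the two input-transfer strategies under comparison.
import tarfile
from pathlib import Path

inputs = sorted(Path("inputs").glob("*.dat"))  # placeholder input files

# Option 1: list every file individually in the submit description, e.g.
#   transfer_input_files = inputs/a.dat, inputs/b.dat, ...
print("transfer_input_files =", ", ".join(str(p) for p in inputs))

# Option 2: bundle everything into one compressed archive and transfer that.
with tarfile.open("inputs.tar.gz", "w:gz") as tar:
    for p in inputs:
        tar.add(p, arcname=p.name)
```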

Data robustness and scalability on Open Science Grid

December 31, 2021

This document provides a set of results about data access in the Open Science Grid (OSG). The objective is to show the issues an OSG job can encounter when accessing data. The entire process follows a methodology for understanding how a user requests information and what problems arise along the way.

A100 MIG support in OSG, with emphasis on IceCube

September 27, 2021

This document provides an assessment of the feasibility of using the NVIDIA A100 GPU Multi-Instance GPU (MIG) capability inside OSG, with a particular emphasis on IceCube. MIG allows an A100 GPU to be split into several partitions that can be assigned to independent compute jobs. (MIG is also available on the NVIDIA A30 GPUs, but we currently have no hardware to test it on.)
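
As a rough illustration, a job or pilot wrapper could verify which MIG slices it was actually handed by parsing nvidia-smi -L output, which lists both whole GPUs and MIG devices:

```python
# A hedged sketch: count the MIG devices visible to this process by parsing
# the device list printed by `nvidia-smi -L`.
import subprocess

out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
mig_devices = [line.strip() for line in out.stdout.splitlines() if "MIG" in line]

print(f"{len(mig_devices)} MIG device(s) visible:")
for dev in mig_devices:
    print(" ", dev)
```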

Options for Jupyter Support in OSG

March 31, 2021

GIL conducted an open discussion with the entire OSG [email protected] mailing list on March 9, 2021. The team reviewed the proposal document and concluded the following.