AI in the OSPool with emphasis on PelicanFS Python bindings

by Igor Sfiligoi - University of California San Diego, as of April 29th, 2025

AI workloads are becoming more important in scientific research, so properly supporting this use case in the OSPool is increasingly important, too. As with many other compute workloads, the data-handling part of AI workloads requires heightened attention on the HTC-oriented OSPool resources. Since the OSPool does not offer a shared file system, the Pelican software stack is the recommended way of handling data. To ease access to Pelican-managed resources, the PelicanFS Python bindings have been developed, allowing for transparent access to such data from typical AI workloads, e.g. PyTorch-based ones.

Executive summary

In this document we report on an evaluation of the feasibility of running AI workloads on OSPool resources, with an emphasis on using the PelicanFS Python bindings for data access. We observe that PelicanFS is easy to install and use for public data, but that credential handling for private data, guidance on dealing with many small files, and inference-oriented examples all need improvement.

More details on the observations and the related recommendations are available below.

Itemized observations and recommendations

Observation 1:

PelicanFS Python bindings are easy to install with pip, right inside the job.

pip install pelicanfs
python3 …

Observation 2:

PelicanFS is easy to use for public data:

from pelicanfs.core import PelicanFileSystem

# Connect to the OSDF federation via its discovery URL
pelfs = PelicanFileSystem("pelican://osg-htc.org")

# cat() returns the file content as bytes
hello_world = pelfs.cat('/ospool/uc-shared/public/OSG-Staff/isfiligoi/test1.txt')
print(hello_world)

A clear example is available in the project repository:

https://github.com/PelicanPlatform/pelicanfs

However, that repository is not linked from the OSPool documentation, nor does it contain any HTCondor-related examples.

Recommendation for 1+2:

Provide some kind of pelicanfs-related documentation in the OSPool documentation area.

Possibly related to https://portal.osg-htc.org/documentation/software_examples/ai/tutorial-pytorch/

Observation 3:

PelicanFS does not have native support for credential handling.

Trying to access private data, e.g.

hello_world = pelfs.cat('/ospool/ap21/data/isfiligoi/test1.txt')

fails with

No working cache found
pelicanfs.exceptions.NoAvailableSource

Workaround for 3:

Explicitly load the credential and pass it to pelicanfs:

import json

# Path to the HTCondor-provided SciToken file (see below for its discovery)
scitokens_file_path = …
with open(scitokens_file_path, 'r') as file:
    data = json.load(file)
access_token = data['access_token']

# Pass the token as a standard HTTP Bearer authorization header
auth_token = f"Bearer {access_token}"
pelfs = PelicanFileSystem("pelican://osg-htc.org", headers={'Authorization': auth_token})
hello_world = pelfs.cat('/ospool/ap21/data/isfiligoi/test1.txt')
print(hello_world)

Note that the exact location of the token is not completely deterministic.
The following worked in April 2025, but was slightly different in March 2025.

import os
import json

# Get the directory path from the _CONDOR_CREDS environment variable
condor_creds_dir = os.getenv('_CONDOR_CREDS')
scitokens_file_path = os.path.join(condor_creds_dir, 'ap.use')
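Putting the two fragments together, a defensive sketch could look like the following. Note that this is an illustration, not an official API: the helper name is invented here, and the 'ap.use' file name is simply what was observed in April 2025.

```python
import json
import os


def build_auth_header(creds_dir=None, token_file='ap.use'):
    """Load an HTCondor-provided SciToken and build an Authorization header.

    creds_dir defaults to the _CONDOR_CREDS environment variable; the
    'ap.use' file name is what was observed in April 2025 and may differ
    on other access points or with future HTCondor versions.
    """
    creds_dir = creds_dir or os.getenv('_CONDOR_CREDS')
    if not creds_dir:
        raise RuntimeError("_CONDOR_CREDS is not set; "
                           "did the job request a credential?")
    token_path = os.path.join(creds_dir, token_file)
    with open(token_path) as f:
        data = json.load(f)
    return {'Authorization': f"Bearer {data['access_token']}"}


# Usage (paths as in the example above):
# from pelicanfs.core import PelicanFileSystem
# pelfs = PelicanFileSystem("pelican://osg-htc.org",
#                           headers=build_auth_header())
```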

Recommendation for 3:

The command line tool

pelican object get pelican://osg-htc.org/ospool/ap21/data/isfiligoi/test1.txt test1.txt

has baked-in support for auto-discovery of the appropriate credential to use.

The PelicanFS library should have an easy option that mimics that (ideally as the default behaviour).
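Until such an option exists, the auto-discovery could be approximated in user code. A minimal sketch, assuming the CLI follows the usual WLCG Bearer Token Discovery convention (BEARER_TOKEN, then BEARER_TOKEN_FILE, then XDG_RUNTIME_DIR); this is only an approximation of the expected behaviour, not the Pelican client's actual implementation:

```python
import os


def discover_bearer_token():
    """Best-effort token discovery, loosely following the WLCG
    Bearer Token Discovery convention."""
    # 1. Token passed directly in the environment
    token = os.getenv('BEARER_TOKEN')
    if token:
        return token.strip()
    # 2. Token file named in the environment
    token_file = os.getenv('BEARER_TOKEN_FILE')
    if token_file and os.path.exists(token_file):
        with open(token_file) as f:
            return f.read().strip()
    # 3. Well-known per-user location
    runtime_dir = os.getenv('XDG_RUNTIME_DIR')
    if runtime_dir:
        candidate = os.path.join(runtime_dir, f"bt_u{os.getuid()}")
        if os.path.exists(candidate):
            with open(candidate) as f:
                return f.read().strip()
    return None
```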

Observation 4:

There is no clear documentation on how to deal with many small files, which we know the OSDF does not handle well, nor any guidance about using tar or zip files.

It may be obvious to a python/fsspec expert to use one of the helper classes, e.g.

https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/tar.html

but I doubt a science user will have that knowledge.

There is an example in the tutorial Github area:

https://github.com/PelicanPlatform/PelicanPytorchTutorial/blob/main/benchmark/Benchmark1.ipynb

but it is buried deep inside the code, and would not be an obvious place to look.
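In the meantime, the basic pattern can be sketched with the standard library alone: pack the small files into a single tar archive before transfer, then read individual members directly from the archive inside the job. The function names below are illustrative, and with pelicanfs the archive could be opened via pelfs.open() instead of a local file:

```python
import tarfile


def pack_samples(paths, out_tar):
    """Pack many small files into one archive before transfer (done
    once, e.g. on the access point), so the OSDF moves a single
    large object instead of many small ones."""
    with tarfile.open(out_tar, 'w') as tar:
        for p in paths:
            tar.add(p)


def read_member(tar_path, member_name):
    """Inside the job, read one member without unpacking to disk.
    With pelicanfs, one could instead pass a remote file object:
      tarfile.open(fileobj=pelfs.open('/ospool/.../dataset.tar'))"""
    with tarfile.open(tar_path, 'r') as tar:
        f = tar.extractfile(member_name)
        return f.read()
```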

Recommendation for 4:

Add explicit guidelines and top-level examples on how to deal with large numbers of small files. Possibly in https://portal.osg-htc.org/documentation/software_examples/ai/tutorial-pytorch/

Observation 5:

There is a reasonable amount of AI/ML training documentation (even though it does not include any PelicanFS integration). However, all the AI examples I can find are exclusively based on training. Given the HTC nature of the OSPool, inference seems a much more natural fit for it, especially since modern, large-scale AI training requires true HPC resources beyond what the OSPool provides.

Recommendation for 5:

Add AI-inference focused examples in the OSPool documentation.

Also, given the much larger availability of CPUs, compared to GPUs, some CPU-based inference examples would likely be beneficial.
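As a starting point, a CPU-based inference example could be as simple as a submit file that requests no GPUs and fans the input out across many jobs. A hypothetical sketch (all file names are placeholders, not existing examples):

```
# inference.sub - hypothetical CPU-based inference job
executable      = run_inference.sh
arguments       = $(batch_id)

request_cpus    = 4
request_memory  = 8GB
request_disk    = 10GB
# no request_gpus line: the job can run on the much larger CPU pool

# one job per input batch, HTC-style
queue batch_id from batch_list.txt
```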