OSDF Operational Improvement Ideas

By Fabio Andrijauskas - UCSD

The Open Science Data Federation (OSDF) is an OSG service designed to support the sharing of files staged in autonomous “origins” for efficient access to those files from anywhere in the world via a global namespace and network of caches. The OSDF may be used either standalone - allowing data to be downloaded via HTTPS - or with HTCondor managing data transfer for compute jobs running on one of the many resource pools supported by OSG.

OSDF operations, GIL testing, and feedback from sysadmins, large users, and user facilitators surfaced a number of items requiring attention. Each item is rated for impact (1 to 5) and for the complexity of completing it (1 to 5). Impact reflects improvements to service availability, security, new features, user experience, etc.

1 Add chkmount support for origins and caches.
Impact: 5
Complexity: 1
Related components: XrootD, docker image.
Problem: When running under Kubernetes or Docker, the pod sometimes starts without the main volume for origins or caches mounted, so XrootD uses the base directory directly.
Solution: The chkmount option checks for the existence of a specific file on an XrootD volume; if the file is not present, XrootD will not start. It was tested using this option: oss.space public $(rootdir) chkmount thisfileshoulbeheretoxrootdststart - https://xrootd.slac.stanford.edu/doc/dev54/ofs_config.htm#_Toc89982406.
Pelican status: WIP - Tracked as issue #995. PR is #1003
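To make the behaviour concrete, the check that chkmount performs is roughly equivalent to the following Python sketch. It is an illustration only, not XRootD's implementation; the function name and the idea of calling it from a container entrypoint are assumptions made for this example.

```python
from pathlib import Path


def volume_is_mounted(root: Path, sentinel: str) -> bool:
    """Return True only if the sentinel file exists directly under root.

    If the data volume failed to mount, root is just an empty directory
    on the container's base filesystem and the sentinel is missing.
    """
    return (root / sentinel).is_file()
```

An entrypoint script could call `volume_is_mounted(...)` and refuse to launch the service when it returns False, which mirrors what the option does inside XrootD.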
2 Ability to fix a site or OSDF cache position on demand.
Impact: 5
Complexity: 5
Related components: OSDF client, Pelican director.
Problem: Correcting a cache's position in the GeoIP system takes three weeks or more; it should be possible to correct positions on the spot.
Solution: OSDF uses the GeoIP system provided for the CVMFS GeoIP services. This feature could be added to the Pelican director.
Pelican status: DONE. Delivered in Pelican 7.4.0.
3 Support public reading and scitoken-authenticated writing.
Impact: 5
Complexity: 3
Related components: OSG topology.
Problem: A use case requires writing files with SciTokens authentication while allowing the files to be read publicly.
Solution: Add this functionality to the OSG topology.
Pelican status: DONE. Can be configured in 7.4.0 but is “first class” in 7.5.0. Would appreciate testing/eval.
4 Add Perfsonar testing on the OSDF Docker images.
Impact: 5
Complexity: 4
Related components: OSDF docker image.
Problem: To debug network problems, it would be useful to have a MaDDash grid covering all the origins and caches.
Solution: Add the scripts used for the NRP OSDF hosts to the OSDF docker image. Those scripts can test the transfer rate between a cache and origins using multiple concurrent tests.
Pelican status: Not planned.
5 File eviction on OSDF caches.
Impact: 5
Complexity: 1
Related components: OSDF cache.
Problem: With XrootD 5.5.6, it is possible to evict a file from the cache. This makes it feasible to monitor whether the cache can access the origin.
Solution: Add to the OSDF docker image an environment variable defining the path of the file to be evicted, and a crontab entry forcing the eviction every 30 minutes. CheckMk will request this file from each cache, testing the OSDF cache, origin, and director. Be sure to remove the test files.
Pelican status: Not planned. Cache testing is built in using a different mechanism.
6 OSDF Probe.
Impact: 5
Complexity: 4
Related components: HTCondor and OSDF client.
Problem: We need to check the performance between a site and the OSDF caches.
Solution: Create software that submits several jobs to the OSG sites to check whether each site uses the proper cache and to measure its performance (transfer rate, latency, and others). Using this probe, build a data visualization matrix of sites, caches, and origins showing transfer rate, server load, IO wait, and other information.
Pelican status: SCHEDULED. We’ve identified a summer student to work on this project; starting in June 2024.
7 OSDF transfer rate classAds.
Impact: 5
Complexity: 5
Related components: HTCondor and OSDF client.
Problem: We need to check the performance between a site and the OSDF caches using jobs.
Solution: Add the transfer rate and other information to the job classAds.
Pelican status: DONE. Needs input and evaluation on whether all metrics we capture are useful and to help identify missing ones. Feature of HTCSS 23.5.0 and Pelican 7.5.0 (needs both).
8 Make the redirector line in the cache config customizable.
Impact: 2
Complexity: 2
Related components: OSDF docker image.
Problem: We need to check the performance between a site and the OSDF caches using jobs.
Solution: With the new Pelican redirector available, it is helpful to be able to change the redirector used by an OSDF cache.
Pelican status: DONE (obsolete). This line is auto generated in Pelican.
9 Multiple namespaces on OSDF origin for docker.
Impact: 1
Complexity: 2
Related components: OSG topology.
Problem: Several origins need to export several namespaces, which currently requires using a ConfigMap in Kubernetes.
Solution: Use environment variables to define the list of namespaces to be exported.
Pelican status: DONE. Feature for 7.6.0 (March 2024).
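The environment-variable approach can be sketched as follows. This is a minimal example, assuming a comma-separated list; the variable name `ORIGIN_EXPORT_NAMESPACES` is hypothetical and is not the name used by the OSDF image or by Pelican.

```python
import os


def parse_namespaces(raw: str) -> list[str]:
    """Split a comma-separated string into normalized namespace prefixes."""
    namespaces = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        prefix = "/" + item.strip("/")  # one leading slash, no trailing slash
        if prefix not in namespaces:
            namespaces.append(prefix)  # drop repeats, keep first-seen order
    return namespaces


def namespaces_from_env(var: str = "ORIGIN_EXPORT_NAMESPACES") -> list[str]:
    # The variable name is hypothetical, chosen only for this example.
    return parse_namespaces(os.environ.get(var, ""))
```

Normalizing and de-duplicating at parse time keeps the generated export list stable regardless of how the variable was typed.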
10 The topology endpoint (scitokens.conf) should not contain duplicate issuer sections.
Impact: 3
Complexity: 3
Related components: OSG topology.
Problem: XrootD behaves poorly with duplicate section names in scitokens.conf.
Solution: Change the OSG topology or the Pelican component to ensure there are no duplicate sections.
Pelican status: DONE. The Pelican process generates the scitokens.cfg and merges appropriately.
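For sites that still consume a generated file, a duplicate-section check is easy to script. The sketch below only detects repeated section headers in INI-style text; it does not merge them and is not the logic Pelican uses. The function name is an assumption made for this example.

```python
import re
from collections import Counter

# Matches an INI-style section header such as "[Issuer Demo]".
SECTION_RE = re.compile(r"^\s*\[(?P<name>[^\]]+)\]\s*$")


def duplicate_sections(config_text: str) -> list[str]:
    """Return section names that appear more than once, in first-seen order."""
    names = []
    for line in config_text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            names.append(match.group("name").strip())
    counts = Counter(names)
    # dict.fromkeys removes repeats while preserving order.
    return [name for name in dict.fromkeys(names) if counts[name] > 1]
```

Running this against the topology endpoint output before deploying it would flag the problem before XrootD reads the file.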
11 Automatically report discrepancies between OASIS, CVMFS repo, and OSDF.
Impact: 5
Complexity: 3
Related components: CVMFS repository configuration, OSG topology and OASIS configuration.
Problem: To add a new cache to OSDF, it must be added to the OSG topology, CVMFS configuration and OASIS configuration.
Solution: A script to check that every cache is present in the OSG topology and the CVMFS config, taking into account whether the cache is public or private.
Pelican status: Not planned. CVMFS config will always be separate.
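The comparison itself reduces to set differences. A minimal sketch, assuming the cache hostnames have already been extracted from each source (the extraction step and the source names used as keys are not shown and are assumptions of this example):

```python
def report_discrepancies(sources: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each source, list the caches known elsewhere but missing from it."""
    all_caches = set().union(*sources.values())
    report = {}
    for name, caches in sources.items():
        missing = all_caches - caches
        if missing:  # only report sources that actually lack something
            report[name] = missing
    return report
```

Private caches could be filtered out of the input sets before the comparison, so that their expected absence from a public configuration is not reported as a discrepancy.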
12 Use cache connections to validate the site positions.
Impact: 5
Complexity: 4
Related components: OSDF cache
Problem: We need to check whether a site using a given cache should in fact be using that cache.
Solution: The software will use the connections between the cache and the site to create a map reporting which site is using which cache.
Pelican status: Not planned. Could be useful for a summer student?
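The mapping described in this item could look like the sketch below. It assumes the connection records have already been reduced to (site, cache) pairs; how a client address is resolved to a site is not covered here, and the function names and the `expected` table are hypothetical.

```python
from collections import Counter, defaultdict


def site_cache_usage(connections: list[tuple[str, str]]) -> dict[str, Counter]:
    """Count, per site, how many connections went to each cache."""
    usage: dict[str, Counter] = defaultdict(Counter)
    for site, cache in connections:
        usage[site][cache] += 1
    return dict(usage)


def unexpected_caches(usage: dict[str, Counter], expected: dict[str, str]) -> dict[str, str]:
    """Return sites whose most-used cache differs from the expected one."""
    mismatches = {}
    for site, counts in usage.items():
        top_cache, _ = counts.most_common(1)[0]
        if site in expected and expected[site] != top_cache:
            mismatches[site] = top_cache
    return mismatches
```

Comparing only the most-used cache per site tolerates occasional fallbacks to a second cache while still surfacing sites that are consistently routed to the wrong place.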
13 Create a backup GeoIP system.
Impact: 5
Complexity: 5
Related components: CVMFS config.
Problem: The idea is to use another GeoIP system as a backup or confirmation about the OSDF and host positions.
Solution: The software will use the connections between the cache and the site to create a map reporting which site is using which cache. The idea is to work with a student to do this.
Pelican status: Not planned.
14 Load new configuration and certificates to XrootD without restarting the service.
Impact: 2
Complexity: 5
Related components: XRootD.
Problem: It would be useful to change the certificates and configuration without restarting the OSDF cache or origin.
Solution: Change XrootD code to load configuration and certificates automatically.
Pelican status: DONE/PARTIAL. Parts of the configuration will automatically reload for both XRootD and Pelican. Certificates, Authfile, and scitokens.cfg, for example, are already there. We should enumerate what we’d like to be able to change at runtime.
When a configuration is changed via the web interface, the service will auto-restart.
15 Stop using certificates.
Impact: 2
Complexity: 3
Related components: OSG topology.
Problem: We need a certificate for each OSDF host.
Solution: Use tokens for everything.
Pelican status: Not planned for caches. DONE for origins (but requires most caches to be upgraded to Pelican to use this functionality).
16 Create software to upload files.
Impact: 4
Complexity: 3
Related components: OSG topology.
Problem: Each user currently has to create their own tool to upload files; a common tool is needed to upload a file and check that it arrived intact.
Solution: Create a script that writes a file to the origin and checks that the file is intact, or create a plugin for rsync to upload files to OSDF.
Pelican status: DONE. Part of the “pelican object upload” toolset.
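The integrity-check half of this item can be sketched independently of the upload tool. The example below assumes the uploaded object has been read back to a local path by whatever client is in use; the function names are hypothetical and this is not how the Pelican toolset verifies transfers.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(local: Path, readback: Path) -> bool:
    """Compare the original file with the copy read back from the origin."""
    return sha256_of(local) == sha256_of(readback)
```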
17 Create a helm chart for OSDF caches.
Impact: 3
Complexity: 2
Related components: OSG topology.
Problem: All the deployments for the NRP OSDF hosts are in the Tiger repo, using flux to deploy to NRP.
Solution: Create a helm chart for OSDF caches; with it, all the caches can be migrated to a standard deployment.
Pelican status: Not planned in the near future. May be a topic for Year 2
18 Index all the OSDF logs.
Impact: 4
Complexity: 3
Related components: OSG docker image.
Problem: There are no central logs for OSDF. A central indexing service would help to debug problems or even to build a dashboard of errors and warnings for the OSDF hosts.
Solution: Create a service, on NRP or on Tiger, to gather all the logs from the OSDF hosts, and add an environment variable to the Docker image to configure the log location.
Pelican status: Not in the Pelican scope.
19 Create a dashboard using the OSDF logs.
Impact: 3
Complexity: 1
Related components: Grafana
Problem: We would like to detect operational problems before users do; analyzing the logs and the frequency of certain errors could help to detect problems.
Solution: Create a dashboard using the OSDF logs.
Pelican status: Not in the Pelican scope. Pelican can help provide additional inputs to the dashboard as needed (either via XRootD monitoring or Pelican itself)
20 OSDF catalog.
Impact: 5
Complexity: 5
Related components: OSDF hosts
Problem: There is no catalog for the OSDF; each origin is indexed in a different way.
Solution: Create a catalog integrating all the OSDF origins.
Pelican status: Not in the Pelican scope. In discussions with NDP.
21 OSDF NRP Dashboard.
Impact: 3
Complexity: 3
Related components: OSDF hosts
Problem: There is no dashboard for the OSDF hosts on NRP. The idea is to monitor the system load of the caches and origins.
Solution: Create a dashboard to monitor and alert when an OSDF host shows system load variation.
Pelican status: Not in the Pelican scope.
22 Ability to handle overloaded or failed origins and avoid bad caches.
Impact: 5
Complexity: 5
Related components: OSDF hosts
Problem: Some caches become overloaded, and currently the only way to solve this is to create downtimes and remove the caches from the CVMFS configuration repo.
Solution: Create an interface that shows the load, transfer rate, and other information for caches and origins. From this interface, it should be possible to remove or pause a cache from the OSDF. In addition, improve load balancing in XrootD.
Pelican status: Goal is to have some basic functionality by June 2024. Will report load metrics and take them into account into the cache selection. Will allow the cache to report overloads back to the client (to better distinguish from an unresponsive cache).
23 Group the XrootD logs.
Impact: 3
Complexity: 5
Related components: XrootD
Problem: The XrootD log has all the information about each transfer; however, there is no logical grouping or ID that ties together the steps of a single transfer.
Solution: Create an ID for each transaction, including it in the XrootD logs.
Pelican status: Willing to discuss. Note: XRootD developers are strongly against structured logging; we may have to do this through a plugin that will never be accepted upstream.
24 Include job information in the XrootD logs.
Impact: 3
Complexity: 5
Related components: XrootD and HTCondor.
Problem: It is not possible to track in a formal way which file was used from OSDF.
Solution: Include in the XrootD logs the HTCondor Job ID.
Pelican status: Project IDs are recorded as part of the user agent. We can add job IDs in as well. Opened as issue #996.
25 Improve the XrootD error and warning messages.
Impact: 5
Complexity: 2
Related components: XrootD.
Problem: Some XrootD error and warning messages do not include a comprehensive message or any hint about what the problem could be.
Solution: Include in the XrootD warning and error logs a comprehensive message with hints about what the problem could be and how to solve it.
Pelican status: Not in the Pelican scope but in-scope for the IRIS-HEP OSG-LHC activities.
26 Simple cache coherence.
Impact: 3
Complexity: 4
Related components: XrootD.
Problem: We don’t have any cache coherence in the OSDF; users sometimes change files and there is no way to flush or sync them.
Solution: Include a function using the XrootD evict to remove a file from all the OSDF caches.
Pelican status: Not in the Pelican scope. However, Pelican is making it easier for users to generate unique filenames as a HTCSS plugin; this effectively does a “cache eviction” when input files change.
27 Create an OSDF file indexer.
Impact: 4
Complexity: 5
Related components: XrootD.
Problem: Each dataset is indexed in its own way; there is no standard or recommendation.
Solution: Create an OSDF indexer, software that processes a dataset and creates an indexing scheme, or write a recommendation on how to do it.
Pelican status: Not in the Pelican scope. Could be a joint project with NDP.
28 Create an OSDF data challenge.
Impact: 4
Complexity: 5
Related components: XrootD.
Problem: Each OSDF host has a different hardware and network configuration, and there is no information about capacity or a baseline transfer rate.
Solution: Create a set of tests that explore each cache and origin to establish a baseline transfer rate.
Pelican status: Not in the Pelican plans; could be a good PATh or Pelican summer student activity.
29 Create alerts using the base transfer rate.
Impact: 4
Complexity: 5
Related components: checkmk
Problem: It is not possible to tell whether a cache or origin has a transfer-rate problem or another network-related problem.
Solution: Using the baseline transfer rate, create a set of alerts that indicate when a cache or origin could have a network problem.
Pelican status: Not clearly in the Pelican scope; could be done operationally within PATh.
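One simple way to turn a baseline into an alert is a threshold on the ratio between the current and the baseline rate. The sketch below is an assumption-laden illustration: using the median as the baseline and 50% as the threshold are choices made for this example, not values taken from OSDF operations, and the function names are hypothetical.

```python
from statistics import median


def baseline_rate(samples_mbps: list[float]) -> float:
    """Use the median of past measurements as the baseline transfer rate."""
    if not samples_mbps:
        raise ValueError("at least one sample is required")
    return median(samples_mbps)


def should_alert(current_mbps: float, baseline_mbps: float, threshold: float = 0.5) -> bool:
    """Alert when the current rate drops below a fraction of the baseline."""
    return current_mbps < threshold * baseline_mbps
```

The median is less sensitive than the mean to a single unusually fast or slow test run, which matters when the baseline comes from a small number of measurements.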