Data robustness and scalability on Open Science Grid

by Fabio Andrijauskas and Frank Würthwein - University of California San Diego, as of December 31, 2021

Objective

This document presents a set of results about data access on the Open Science Grid (OSG). The objective is to expose issues with data access from OSG jobs. The work follows a methodology designed to understand how a user requests data and to identify any problems in that process.

Executive summary

Robustness is the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions, and scalability is the ability of a service to keep a reasonable level of performance as its use increases. Underlying this activity is the notion that robustness may deteriorate as the scale of use of the service increases. Measuring such a possible deterioration requires a set of benchmarks and tests that access the data using different methods and evaluate robustness as a function of scale.

The OSG Data Federation consists of origins and caches; the origins hold data from various experiments, and the caches provide data "locally" to speed up access for the jobs. With numerous jobs running, it is necessary to detect when a job fails due to data access. After running more than 10,000 jobs with a methodology that exercises several data access methods, we found a number of issues in the OSG Data Federation infrastructure; they are detailed in the Results section below (Table 2).

Detailed description

Robustness is the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions, and scalability is the ability to keep reasonable or good performance on a sample problem as the computational resources increase commensurately [1,3]. One way to measure robustness is the procedure referred to as FePIA [2,4], where the abbreviation stands for identifying the performance Features, the Perturbation parameters, the Impact of the perturbation parameters on the performance features, and the Analysis to determine the robustness. For OSG, it is necessary to check how the system uses computational resources with respect to robustness and scalability. There are many ways to execute code and access data on the OSG Data Federation, and almost all interactions happen through jobs.
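
In FePIA terms, as a hedged illustration (mapping the general definitions of [2,4] onto this activity; the choice of feature and perturbation parameter here is ours, not prescribed by the references), a performance feature \(\phi_i\), such as the job failure rate, must stay within an acceptable range while a perturbation parameter \(\pi_j\), such as the number of concurrent jobs hitting the data federation, varies:

\[ \phi_i(\pi_j) \in \langle \beta_i^{\min},\, \beta_i^{\max} \rangle \]

The analysis step then asks for the smallest change in \(\pi_j\) that pushes \(\phi_i\) outside this range; the tests described below can be read as probing this boundary for data access.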

We define a set of benchmarks and tests that access the data using different methods to check all these measures of robustness and scalability. There are origins and caches in the OSG Data Federation, and different origins hold data from various organizations. XRootD caches can provide data "locally" and speed up the process for the jobs. However, detecting when a job fails due to data access requires numerous jobs running, and we do not have this monitoring. Figure 1 shows the basic idea of a job accessing data.

Figure 1: Data access and OSG job data request.

Methodology

To monitor data access, three data access methods need to be analyzed: stashcp, xrdcopy, and cvmfs.

As users of OSG may use any of these methods, we chose a testing methodology that tests all of them independently of each other. Checking robustness and scalability is therefore based on requesting files using stashcp, xrdcopy, and cvmfs in different orders, with different file sizes and other variables, following Table 1. We implemented two different ways of accessing data via cvmfs: first, a simple cp from cvmfs to the local worker node; second, a direct POSIX read from cvmfs. There are thus four access methods; three of them are full-file copies, and the fourth is a POSIX read. A sketch of these four access methods is shown after Table 1.

Table 1: Set of tests using a combination of tools per job.
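
The following is a minimal sketch of the four access methods, not the harness actually used for the tests; the file name, the XRootD endpoint stash.osgconnect.net, and the cvmfs export path under /cvmfs/stash.osgstorage.org are illustrative assumptions.

# A hedged sketch of the four access methods used per job; endpoints and
# paths are illustrative, not the exact ones from the tests.
import shutil
import subprocess

TEST_FILE = "myuser/random_4f2a9c.dat"  # hypothetical randomly named file

def access_stashcp(dest="stashcp_copy.dat"):
    # Full-file copy through the cache layer with stashcp.
    subprocess.run(["stashcp", f"/osgconnect/public/{TEST_FILE}", dest], check=True)

def access_xrdcopy(dest="xrd_copy.dat"):
    # Full-file copy with the XRootD client, pointed at a cache or origin.
    subprocess.run(["xrdcopy",
                    f"root://stash.osgconnect.net//osgconnect/public/{TEST_FILE}",
                    dest], check=True)

def access_cvmfs_cp(dest="cvmfs_copy.dat"):
    # Full-file copy of the cvmfs-exported path to the local worker node.
    shutil.copy(f"/cvmfs/stash.osgstorage.org/osgconnect/public/{TEST_FILE}", dest)

def access_cvmfs_posix():
    # Direct POSIX read from cvmfs, without copying the whole file first.
    with open(f"/cvmfs/stash.osgstorage.org/osgconnect/public/{TEST_FILE}", "rb") as f:
        while f.read(4 * 1024 * 1024):  # read in 4 MiB chunks
            pass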

The following steps are used to measure the failure statistics for each of the six job types in Table 1:

  1. Create the files in /osgconnect/public/ with random names;
  2. Request the files following Table 1, using a Condor job running on different sites;
  3. Collect data about the requests;
  4. Delete the files from /osgconnect/public/;
  5. Go back to step 1.

Implicit in these steps is that each job accesses four randomly named files, and each file needs to be copied from the origin via the cache. We are thus testing the situation of an "empty cache", which we consider a worst-case scenario for robustness: a "full cache" is less likely to fail than an empty cache because, while both cases exercise the cache, only the empty cache also exercises the cache's connection to the origin. A sketch of this test loop is shown below.
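
The sketch below outlines one round of steps 1 to 5 under stated assumptions: the staging path, the submit file data_access_test.sub, its job log file, and the file size are hypothetical, and the real harness may differ.

# A hedged sketch of one round of the test loop (steps 1-5), not the
# authors' actual script; paths, file names, and sizes are assumptions.
import os
import subprocess
import uuid

STAGING = "/osgconnect/public/myuser"   # hypothetical staging area on the origin

def one_round(n_files=4, size_mb=100):
    # Step 1: create files with random names in the public staging area.
    names = [f"test_{uuid.uuid4().hex}.dat" for _ in range(n_files)]
    for name in names:
        with open(os.path.join(STAGING, name), "wb") as f:
            f.write(os.urandom(size_mb * 1024 * 1024))

    # Step 2: request the files from different sites via an HTCondor job
    # described in a (hypothetical) submit file.
    subprocess.run(["condor_submit",
                    "-append", f"arguments = {' '.join(names)}",
                    "data_access_test.sub"], check=True)

    # Step 3: wait for the job to finish and collect its transfer logs.
    subprocess.run(["condor_wait", "data_access_test.log"], check=True)

    # Step 4: delete the files so the next round again starts from an
    # "empty cache": newly created random names are never already cached.
    for name in names:
        os.remove(os.path.join(STAGING, name))

# Step 5: go back to step 1 by calling one_round() again.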

Results and conclusions

These tests exposed several issues related to data access in the OSG environment. Figure 2 shows statistics about the executed tests.

Figure 2: Test statistics.

Table 2 shows some of the errors found during the execution of the jobs. Rows 1, 2, 4, 6, and 9 of Table 2 show network problems. The errors in rows 3 and 5 could be related to high load on the Chicago cache. The error in row 7 was not counted in the failure statistics; however, it is a source of problems in a user's workflow.

Table 2: Errors found in the execution of the test.

  #  Date (Y/M/D)  Error Message
1 2021-11-10 Unable to look up wlcg-wpad.fnal.gov
2 2021-11-10 Unable to resolve osg-kansas-city-stashcache.nrp.internet2.edu:1094: Name or service not known
3 2021-11-01 osg-chicago-stashcache.nrp.internet2.edu:1094 #0] elapsed = 0, pConnectionWindow = 120 seconds.
4 2021-11-01 [osg-kansas-city-stashcache.nrp.internet2.edu:1094 #0] Unable to resolve IP address for the host
5 2021-11-15 [osg-chicago-stashcache.nrp.internet2.edu:1094 #0.0] Unable to connect: Connection refused
6 2021-10-26 [osg-gftp2.pace.gatech.edu:1094 #0] Stream parameters: Network Stack: IPAuto, Connection Window: 30, ConnectionRetry: 2, Stream Error Widnow: 1800
7 2021-10-26 Write: failed Disk quota exceeded (from my user account)
8 2021-11-01 ERROR    Unable to look up wlcg-wpad.fnal.gov ERROR    Unable to look up wlcg-wpad.cern.ch ERROR    Unable to look up wlcg-wpad.fnal.gov ERROR    unable to get list of caches ERROR    Unable to look up wlcg-wpad.cern.ch ERROR    Unable to look up wlcg-wpad.fnal.gov
9 2021-11-01 Unable to look up wlcg-wpad.cern.ch; Unable to look up wlcg-wpad.fnal.gov
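
The failure statistics in Figure 2 require attributing each failed job to an error category. The sketch below shows one possible way to do that, not the monitoring actually in place; the log directory layout and file suffix are assumptions, while the patterns come from the messages in Table 2.

# A hedged sketch of classifying per-job stderr output into the error
# categories seen in Table 2; the logs/*.err layout is an assumption.
import glob
import re
from collections import Counter

PATTERNS = {
    "dns_lookup_failure": re.compile(r"Unable to (look up|resolve)"),
    "connection_refused": re.compile(r"Connection refused"),
    "disk_quota_exceeded": re.compile(r"Disk quota exceeded"),
}

def classify_logs(log_dir="logs"):
    counts = Counter()
    for path in glob.glob(f"{log_dir}/*.err"):   # one stderr file per job
        with open(path, errors="replace") as f:
            text = f.read()
        hit = False
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[label] += 1
                hit = True
        if not hit:
            counts["no_known_error"] += 1
    return counts
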
Figure 3: Data transfer ratio for each cache.

Table 3: Average latency between Chicago and each cache host.

Cache  Latency  Linear distance (between cities)
ucsd 58.8 ms 1731.81 miles
houston 24.6 ms 941.90 miles
its-condor-xrootd1.syr.edu 13.7 ms 590.11 miles
chicago 0.649 ms 0 miles
new-york 19.6 ms 710.75 miles
dtn2-daejeon.kreonet.net 159 ms 6597.25 miles
osg-sunnyvale-stashcache.nrp.internet2.edu 46.4 ms 1815.37 miles
fiona.uvalight.net 108 ms 4105.72 miles
stashcache.gravity.cf.ac.uk 113 ms 3831.08 miles
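
As a rough sanity check on Table 3 (an added illustration, assuming signal propagation at about two thirds of the speed of light in fiber, roughly 200,000 km/s, and ignoring routing detours), the linear Chicago to UCSD distance of 1731.81 miles, about 2787 km, bounds the round-trip time from below:

\[ \mathrm{RTT}_{\min} \approx \frac{2d}{v} = \frac{2 \times 2787\ \mathrm{km}}{200{,}000\ \mathrm{km/s}} \approx 28\ \mathrm{ms}, \]

so the measured 58.8 ms is about twice the straight-line optimum, which is plausible for a real routed path.
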
Figure 4: Iperf test between the Chicago OSG login node and the UCSD cache host.
Figure 5: One test executed to measure data access on OSG.

Recommendations

To prevent or mitigate the issues found, we make the following recommendations:

References

[1] IEEE Standard Glossary of Software Engineering Terminology. IEEE Std 610.12-1990, pages 1–84, 1990.

[2] E.A. Luke. Defining and measuring scalability. In Proceedings of Scalable Parallel Libraries Conference, pages 183–186, 1993.

[3] S. Ali, A.A. Maciejewski, H.J. Siegel, and Jong-Kook Kim. Definition of a robustness metric for resource allocation. In Proceedings of the International Parallel and Distributed Processing Symposium, 2003.

[4] S. Ali, A.A. Maciejewski, H.J. Siegel, and Jong-Kook Kim. Measuring the robustness of a resource allocation. IEEE Transactions on Parallel and Distributed Systems, 15(7):630–641, 2004.

[5] Derek Weitzel, Brian Bockelman, Duncan A. Brown, Peter Couvares, Frank Würthwein, and Edgar Fajardo Hernandez. Data Access for LIGO on the OSG. arXiv e-prints, arXiv:1705.06202, May 2017.