Proposal for classifying the utilization of OSG

by Fabio Andrijauskas, Igor Sfiligoi and Frank Würthwein - University of California San Diego, as of January 7, 2022

The Open Science Grid (OSG) uses a complex system to access all the available resources. Figure 1 shows an overview of how HTCondor and the Glidein Workload Management System (GlideinWMS) provide access to the compute infrastructure [1].

Figure 1: GlideinWMS for grid access with Condor [2].

As Figure 1 shows, the main idea is that when the Virtual Organization Frontend senses demand for more resources, the Glidein Factory (GF) submits Condor job execution daemons (aka glidein pilots) to the grid:

“This is known as “gliding in” to the grid. When they begin running, they “phone home” and join the glideinWMS Condor pool. Then they are available to run user jobs through the normal Condor mechanisms. Once demand for resources decreases to the point where some nodes have no work to do, the Condor job execution daemons on those idle nodes exit and the resources that were allocated to glideinWMS at the grid site are released for use by others” [2].

Even with this well-designed structure, it is a fact of life that the resources are not always perfectly utilized. In order to improve utilization of the available resources, we first must have a clear understanding of how to classify the activity of those resources.

For this proposal, we are concerned only with wall-time measures, owing to the nature of the combined GlideinWMS and HTCondor systems. The proposed classification thus may be understood largely as a walk through the “state machine” of how these two systems work together.

Note that we will be considering CPU and GPU resources separately for ease of understanding. We thus define the unit of “time” as “CPU core hours” for CPUs and “GPU chip hours” for GPUs.

Moreover, to account for the fine-grained partitioning inside pilot jobs, we also define a notion of “canonical unit”. At this point in time, we propose to define it as “1 CPU core + 2 GB of RAM” for CPU resources and “1 GPU chip” for GPU resources. It can then be used to compute “canonical time” in the same units as normal time. The detailed use of this unit is discussed below.

We postulate that the time of resources available to OSG can be partitioned among the following 8 categories, with the first 6 belonging to the pilot infrastructure:

There are two additional classifications of time available to OSG and not directly related to pilot infrastructure:

Figure 2 shows a simplified pilot lifetime and when each type of measurement happens.

Figure 2: Pilot lifetime and the measure classification.

Further clarification of “Pilot overhead” and canonical units

OSG jobs come with fine-grained requirements, while the pilots offer a fixed total number of CPU cores, GPUs, and memory; there is no optimal way to maximize the use of all the resources while keeping job priorities in consideration. OSG thus allows certain resources to remain idle, as long as some of the other equally important resources are fully utilized. For example, it is just as acceptable to “use all the CPU cores and only a fraction of the memory” as it is to “use all the memory and only a subset of the CPU cores”.

The “canonical unit” and “canonical time” definitions establish the smallest unit that is considered “true overhead”. We thus account “Pilot overhead” only in multiples of “canonical units”. For example, given the CPU canonical unit of “1 CPU core and 2 GB of memory”, an hour when we have 3 CPU cores and 3 GB of memory unused would count as “1 CPU core hour” (memory limited), the same period with 3 CPU cores and 1 GB of memory unused would count as “0 CPU core hours” (memory limited), and the same period with 2 CPU cores and 6 GB of memory unused would count as “2 CPU core hours” (CPU core limited).
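As a minimal sketch of this rule (in Python; the function and constant names below are ours, purely illustrative, not part of any OSG tooling), the overhead in canonical units is simply the number of whole “1 CPU core + 2 GB” bundles that fit into the unused resources:

    # Sketch: counting "Pilot overhead" in canonical units for CPU resources.
    # Assumes the canonical unit defined above: "1 CPU core + 2 GB of RAM".

    CANONICAL_CORES = 1
    CANONICAL_MEM_GB = 2.0

    def overhead_core_hours(unused_cores, unused_mem_gb, hours=1.0):
        """Overhead is accounted only in whole canonical units: the number
        of complete "1 core + 2 GB" bundles in the unused resources."""
        canonical_units = min(unused_cores // CANONICAL_CORES,
                              int(unused_mem_gb // CANONICAL_MEM_GB))
        return canonical_units * hours

    # The three worked examples from the text, each over a 1-hour period:
    print(overhead_core_hours(3, 3))  # 1.0 -> "1 CPU core hour" (memory limited)
    print(overhead_core_hours(3, 1))  # 0.0 -> "0 CPU core hours" (memory limited)
    print(overhead_core_hours(2, 6))  # 2.0 -> "2 CPU core hours" (CPU core limited)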

The use of “canonical time” in “Pilot overhead”, however, brings an accounting problem: what to do with the remainder of the time that stays unaccounted for. Using the first example above, when we have 3 CPU cores and 3 GB of memory unused, we have a remainder of “2 CPU core hours”.

In order to fully account for all the resources, we account that remainder proportionally among any jobs that were running at that point in time, as sketched below. For example, suppose we had a remainder of 2 CPU cores (and any amount of memory) and two jobs running during the considered time period, say 1 h for simplicity, one of which completed and one of which never did because the pilot was preempted sometime later. We would then account 1 CPU core hour to each of “Pilot goodput” and “Pilot preemption badput”. As a further clarification, any time the pilot runs no jobs at all, all the time is accounted to “Pilot overhead”, even if it is not a multiple of “canonical time”.
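Continuing the sketch above, the remainder split might look as follows; the outcome labels simply mirror the category names in this text, and an equal split is assumed for jobs of equal size:

    # Sketch: splitting the non-canonical remainder proportionally among
    # the jobs running in that period. Illustrative only, not OSG code.

    def split_remainder(remainder_core_hours, job_outcomes):
        """Assign an equal share of the remainder to each running job,
        credited to its outcome category ("Pilot goodput" for completed
        jobs, "Pilot preemption badput" for jobs lost to preemption)."""
        if not job_outcomes:
            # A pilot that runs no jobs at all has all its time counted
            # as "Pilot overhead", even beyond whole canonical units.
            return {"Pilot overhead": remainder_core_hours}
        share = remainder_core_hours / len(job_outcomes)
        totals = {}
        for outcome in job_outcomes:
            totals[outcome] = totals.get(outcome, 0.0) + share
        return totals

    # The example from the text: a 2 core-hour remainder over a 1-hour
    # period, with one completed job and one job lost to preemption.
    print(split_remainder(2.0, ["Pilot goodput", "Pilot preemption badput"]))
    # {'Pilot goodput': 1.0, 'Pilot preemption badput': 1.0}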

Note that for GPU accounting “canonical time” is currently defined the same as “time”, i.e. “GPU chip hours”, so there is never any leftover there.

A word about time classification vs measurement:

This document only provides the partitioned definitions of how the resources are being used. It does not aim to provide any guidance on how to classify resource usage using the existing monitoring tools.

As an example, it is currently unknown how to properly measure (a) and (b), but one could speculate that they could be approximated by measuring periods with no pilots waiting in the sites’ queues. Such clarifications and guidance are of course necessary, but will be the subject of a separate future document.