HTCondor Annex - Scalability and Stability
By Fabio Andrijauskas - UCSD
One challenge for high-throughput computing (HTC) is integrating with computational clouds or other clusters to provide more computational power. HTCondor Annex can create worker nodes on clusters managed by schedulers such as HTCondor, Slurm, and others.
Executive summary
HTCondor Annex provides a way to add computational power using the same structure as a standard HTCondor environment. About the HTCondor Annex and the test environment:
- All tests ran on ap40.uw.osg-htc.org, with ap1.facility.path-cc.io as the HTCondor Annex target.
- The methodology covered the absolute number of worker daemons, peak execute-daemon provisioning and termination, the external killing of execute daemons while the system is loaded, and the long-term stability of an extensive system.
- Over three months, we ran more than 250k jobs and created more than 100k Annexes.
- The HTCondor Annex creation rate could be improved; the results show that the Annex creation frequency decreases as the number of Annexes increases.
- An Annex can only be created if fewer jobs are queued; creation failed with 50k jobs already queued. The Appendix shows the output of the attempt to create the Annex after the job creation.
- It is not possible to integrate clusters other than those already hard-coded in HTCondor Annex.
- Job removal does not always remove the HTCondor Annexes from the target host, and SSH sessions toward the target are sometimes left behind.
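The orphaned SSH sessions noted above can be spotted on the submit host with a process scan along the following lines (a minimal sketch; the `leftover_ssh` helper name is ours, and the pattern assumes the target host used in these tests):

```shell
# List this user's ssh processes still pointing at the Annex target
# after jobs and Annexes were removed. Empty output means no orphaned
# sessions; any line printed is a leftover connection to investigate.
leftover_ssh() {
    pgrep -u "$(id -un)" -af 'ssh.*ap1\.facility\.path-cc\.io' || true
}
```

Any surviving session can then be closed by hand with `kill` before retrying the Annex removal.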
Recommendations
The recommendations are ordered by user impact.
- Improve the checks performed when an Annex or job is interrupted; in some cases the jobs remain on the target.
- Fix orphaned SSH connections toward the target after the Annexes have terminated.
- Support creating multiple jobs from the same submit script (“queue X”) when using HTCondor Annex.
- Show the options for a shared connection before the Annex creation starts. This could prevent problems with two-step or CILogon-style authentication; Figure 4 shows the output of a command to create an Annex and the resulting CILogon authentication request.
- Currently, all jobs must be created first and the Annexes afterward; it would be useful to be able to create the Annex and the jobs in either order.
- Integrate condor_store_cred into the Annex workflow, or warn that it is needed before starting Annexes.
- Allow adding other targets through configuration files; currently, only a hard-coded set of targets can be used.
- Give feedback about SSH command timeouts; this helps debug firewall issues.
- Consider an alternative to opening port 9618, for security and administration reasons.
- Create a command to check the status of the jobs on the target.
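For reference, the overall flow the tests followed can be sketched as below. This is an assumed reconstruction, not the exact scripts used: the submit-file contents and names are placeholders, the condor_store_cred lines are the ones Table 1 reports as required, and the HTCondor commands only run when a local installation is present.

```shell
#!/bin/sh
# Sketch of the test workflow: store the SciToken credential, queue a
# job, then create the Annex (jobs first, Annex afterward, as the
# current tooling requires).

# 1. Credential setup reported as required in Table 1.
if command -v condor_store_cred >/dev/null 2>&1; then
    echo | condor_store_cred add-oauth -i - -s scitokens
    echo | condor_store_cred query-oauth
fi

# 2. A one-job submit description (the simple sleep job used in the
#    tests); "sleep.sub" is a placeholder filename.
cat > sleep.sub <<'EOF'
executable = /bin/sleep
arguments  = 1
queue 1
EOF

# 3. Submit the job; the Annex itself would then be created with the
#    "htcondor annex" tooling against the target cluster (annex name,
#    queue, and system below are site-specific placeholders).
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit sleep.sub
    # htcondor annex create my-annex <queue>@<system>
fi
```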
Methodology
The methodology applied to test HTCondor Annex covers scalability, stability, and performance goals:
- Absolute number of worker daemons.
- Peak execute-daemon provisioning: show that many execute daemons can start within a short amount of time and that jobs start on them.
- Peak execute-daemon termination:
  - After many execute daemons are running, set up either job termination or Annex auto-termination to happen as closely together as possible.
- External killing of execute daemons while the system is loaded (e.g., over 20k daemons already running):
  - Measure the success/error rates with which the daemons de-register from the negotiator, clean up after themselves, and adequately reflect termination in Annex commands.
- Long-term stability of an extensive system:
  - After bringing the system to a large baseline (e.g., 20k daemons), check whether it can be sustained.
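The jobs-per-second metric reported later in Figures 1 and 2 can be computed from job completion timestamps. A minimal sketch follows; the `jobs_per_second` helper name is ours, and on a live pool the timestamps could come from, e.g., `condor_history -af CompletionDate`:

```shell
# Read one epoch timestamp per line on stdin and print the overall
# completion rate in jobs per second over the observed span.
jobs_per_second() {
    sort -n | awk '
        NR == 1 { first = $1 }
                { last = $1; n++ }
        END     { span = last - first; if (span < 1) span = 1
                  printf "%.2f\n", n / span }'
}

# Example: three jobs finishing over a 10-second span.
printf '100\n105\n110\n' | jobs_per_second   # prints 0.30
```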
HTCondor Annex Tests
Table 1 shows the tests run using HTCondor Annex, with ap40.uw.osg-htc.org (HTCondor Annex host) and ap1.facility.path-cc.io (worker target). Results are classified as positive, negative, or neutral. A positive result means the test ran without error messages or other problems; a neutral result means the test executed successfully but some aspects require further analysis; a negative result means the test could not run.
| Test | Result | Comments |
|---|---|---|
| Create a worker node using the OSG AP | Positive | Required running `echo \| condor_store_cred add-oauth -i - -s scitokens` and `echo \| condor_store_cred query-oauth`. |
| Run a job using an Annex | Positive | It is possible to create the jobs and run the Annex. |
| Run jobs using X Annexes and Y jobs | Neutral | Figures 1 and 2 show the results. |
| Create 50k jobs and use one Annex | Negative | The Annex could not be created; the Appendix shows more details. |
| Run 50k jobs and run one Annex | Negative | The Annex could not be created; the Appendix shows more details. |
| Kill an Annex while the job is running | Positive | |
| Kill a job while an Annex is running | Neutral | After killing, the Annex sometimes gets stuck on the target. |
| Long-run Annex (1 month) | Positive | Close to 100k jobs and 100k Annexes in one month. |
| Create jobs to run on the Annex using `queue X` | Negative | It is not possible to run multiple jobs from one submit script. |
| Delete all Annexes and all jobs using condor_rm | Negative | Some Annexes got stuck on the target. |
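The “queue X” case in the table refers to submit descriptions like the one below, which on a plain HTCondor pool queues many copies of a job from one script but, per the tests above, could not be used together with an Annex. The file contents are an illustrative sketch, not the exact script used in the tests:

```
executable = /bin/sleep
arguments  = 1
# On a regular pool this queues 50 identical jobs from one submit
# script; with HTCondor Annex the tests found this form does not work.
queue 50
```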
Figure 1 shows tests executed using HTCondor Annex with ap40.uw.osg-htc.org (HTCondor Annex host) and ap1.facility.path-cc.io (worker target). The left vertical axis shows the time to start the job (a simple job running sleep 1), and the right vertical axis shows the jobs executed per second; on the horizontal axis, the first line is the number of Annexes and the second line is the number of jobs. From 1 to 10 Annexes, the jobs-per-second rate increases; adding more Annexes, the rate starts to decrease.
Figure 2 shows the same results as Figure 1 in a 3D visualization, which makes it possible to identify the points where performance starts to decrease; the bar color represents jobs per second.
Appendix
The tests used two hosts: ap40.uw.osg-htc.org to create jobs and Annexes, and ap1.facility.path-cc.io to run the Annexes:
- On ap40.uw.osg-htc.org:
- 502GB RAM.
- AMD EPYC 7763 64-Core Processor.
- CentOS Stream release 8.
- CondorVersion: 23.3.0 2023-12-15 BuildID: 695193 PackageID: 23.3.0-0.695193 RC - CondorPlatform: x86_64_AlmaLinux8
- On ap1.facility.path-cc.io:
- 502GB RAM.
- AMD EPYC 7763 64-Core Processor.
- CentOS Stream release 8.
- CondorVersion: 23.3.0 2023-12-15 BuildID: 695193 PackageID: 23.3.0-0.695193 RC - CondorPlatform: x86_64_AlmaLinux8
Figure 3 shows the error when 50k jobs are created, and an attempt is made to create an Annex for those jobs.
Figure 4 shows the output of the Annex creation, where a 2-step authentication action is requested.