Multiple files vs. Tar.gz
by Fabio Andrijauskas, as of January 7, 2022
The objective of these tests is to check if it is faster to transfer each file separately or compress all the files on a tar.gz file for a job on the Open Science Grid (OSG) using the HTCondor. Was used one tar.gz file with 50000 50KB (each file has a random content, the compress ratio is ≈ 1 : 1) to copy to the node, and the MD5 hash was calculated using the code from Listing 1 e 2.
Listing 1: Job requesting the tar.gz file.
error = short$(ProcId).error Arguments = $(ProcId) output = short$(ProcId).output log = short$(ProcId).log Requirements = GLIDEIN_Site == ”CU − Research Computing” +DESIRED_Sites = ”CU − Research Computing” transfer_input_files = /home/fandri/data2.tar.gz should_transfer_files = yes request_cpus = 1 request_memory = 100 MB request_disk = 100 MB executable = 1.sh queue 20
Listing 2: Script to check the data
#!/bin/bash tar −xf data.tar.gz cd data for f in file *; do md5sum $f done
Code on Listing 3 e 4 was used to process 50000 files 50KB each (files have a random content, the compress ratio is ≈ 1 : 1) to copy to the node and calculate the MD5 hash.
Listing 3: Job
error = short$(ProcId).error Arguments = $(ProcId) output = short$(ProcId).output log = short$(ProcId).log Requirements = GLIDEIN_Site == ”CU − Research Computing” +DESIRED_Sites = ”CU − Research Computing” transfer_input_files = /home/fandri/data2/ should_transfer_files = yes request_cpus = 1 request_memory = 100 MB request_disk = 100 MB executable = 1.sh queue 20
Listing 4: Script
for f in file *; do md5sum $f done
Table 1 shows the results using 50000 50KB (each file has a random content) on a tar.gz and multiple files requesting from Chicago - IL to:
- SU-ITS (Syracuse University - New York - NY);
- SDSC (San Diego - CA);
- CU – Research Computing (Colorado University - Boulder - CO).
The tests were executed 20 times. Table 1 contains statistical data from all the tests:
- “AVG transfer” column shows the average time to transfer the files;
- “AVG decompress” show the average time to decompress the files;
- “AVG proc” indicates the average time to process the files
- “AVG total proc” show the sum of the average time for all process steps (transfer, decompress, and process);
- “STDEV transfer” represents the standard deviation for the transfer time;
- “STDEV untar” represents the standard deviation for the decompress time;
- “STDEV proc” means the standard deviation for the processing time.
Table 1: Results from tests on SU-ITS, SDSC, and CU site from the Chicago OSG login.
AVG total proc
SU-ITS - targz
SU-ITS - files
CU - targz
CU - files
SDSC - targz
SDSC - files
Figure 1 shows the time to transfer the data using multiple files or using only one tar.gz file. The error bars are the standard deviation. Due to the standard deviation of all measures, there is no statistical difference between the test using tar.gz or the multiple files.
All tests show no statistical difference when using the tar.gz or the multiple files. The investigations of the HTCondor source code (https://github.com/htcondor/htcondor/blob/master/src/condor_utils/file_transfer.cpp) and HTCondor developers comments, it is possible to conclude that there is no hugely overhead for the multiple file approach on HTCondor. The only vantage to use the tar.gz is the possibility to compress the data and have less network usage.