Testing condor_ssh_to_job
Author: Fabio Andrijauskas - University of California San Diego
Date: 12/23/2024
HTCondor allows users to establish an interactive session with any running job they own.
This is equivalent to opening an SSH session that connects to the job, where
the user can inspect the job logs, intermediate results, and other files. The
command involved is called condor_ssh_to_job.
The objective of this activity was to document how reliable this
functionality is on the OSPool.
Executive summary
Our tests show that only 30 of the 56 tested sites available through the OSPool
provided full condor_ssh_to_job functionality. Another 20 provided partial
functionality: commands could be run through condor_ssh_to_job, but interactive
sessions failed. Only 6 sites provided no condor_ssh_to_job functionality. We
did not see any difference between vanilla and container-based jobs.
The error reported in the vast majority of failed attempts was “Can't setns to
user namespace: Invalid argument”. This seems to be fatal for fully interactive
access, but is considered only a warning for scripted access.
Recommendations
We recommend that the HTCondor team handle the reported namespace-related
failure mode gracefully, as it does not seem to be essential for the
functionality of condor_ssh_to_job.
Further details
The tests were run from ap21.uc.osg-htc.org. The list of available sites in the
OSPool was obtained using condor_status -pool factory-1.osg-htc.org -any -const
'MyType=="glidefactory"' -af GLIDEIN_Site. Each site was tested 10 times using
one script that ran both condor_ssh_to_job jobid 'hostname' (command only) and
condor_ssh_to_job jobid (interactive session). The methodology checked whether
it was possible to log into the job, list files, and change to any directory,
using both vanilla and container jobs. We did not see any difference between
vanilla and container jobs; interactive logins, however, failed significantly
more often than command invocations through condor_ssh_to_job.
All the tests were executed between 11/15/24 and 12/10/24.
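The repeated command-only check described above can be sketched in Python as follows. This is a minimal illustration, not the script used for the report: the function names `run_command_check` and `success_rate`, the 120-second timeout, and the assumption that an exit status of 0 means the remote `hostname` command succeeded are all hypothetical choices made for this example. The interactive-session check is not covered here.

```python
import subprocess
from typing import Callable


def run_command_check(job_id: str, timeout_s: float = 120.0) -> bool:
    """Run `condor_ssh_to_job <job_id> hostname` once.

    Assumption: exit status 0 is treated as success; a timeout or a
    missing condor_ssh_to_job binary is treated as failure.
    """
    try:
        result = subprocess.run(
            ["condor_ssh_to_job", job_id, "hostname"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def success_rate(
    job_id: str,
    attempts: int = 10,
    check: Callable[[str], bool] = run_command_check,
) -> float:
    """Repeat the check `attempts` times and return the success percentage."""
    successes = sum(1 for _ in range(attempts) if check(job_id))
    return 100.0 * successes / attempts
```

Passing the check as a parameter makes it possible to exercise `success_rate` without a running job, for example `success_rate("1.0", check=lambda job_id: True)`.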
Tables 1a and 1b show the results: 56 sites were reached, and 30 sites
provided an SSH session, file listing, and directory change. The remaining 26
sites could not provide an interactive SSH session. More details about each site can be
found in the appendix (https://drive.google.com/file/d/17ASdER6tz5D8_kVwgIR42uWlAKWZrpFR/view?usp=sharing). The nodes showing “Can't setns to user namespace: Invalid argument” could not
provide shell sessions; however, it was still possible to run commands on them.
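The different impact of the namespace message on the two access modes can be expressed as a small Python sketch. The function name `classify_setns` and the returned labels are hypothetical; the only behavior taken from the report is that the message coincided with failed interactive sessions while commands still ran.

```python
# Substring of the message quoted in the report; matching on this part
# avoids depending on the exact apostrophe character in "Can't".
SETNS_MARKER = "setns to user namespace: Invalid argument"


def classify_setns(stderr_text: str, interactive: bool) -> str:
    """Label the observed impact of the setns message for one attempt.

    Assumption: the labels below are illustrative, not HTCondor terms.
    """
    if SETNS_MARKER not in stderr_text:
        return "no-setns-message"
    # Per the report: interactive sessions failed, commands still ran.
    return "fatal" if interactive else "warning"
```

For example, `classify_setns("Can't setns to user namespace: Invalid argument", interactive=True)` returns `"fatal"`, while the same message with `interactive=False` returns `"warning"`.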
Table 1a: Number of sites where interactive ssh-to-job works.
| Status | Quantity | Comments |
|---|---|---|
| Works | 30 | 3 slow SSH: NotreDame, UConn-HP, UIUC-TGI-RAILS |
| Broken | 26 | 21 sites: “Can't setns to user namespace: Invalid argument”; 1 site: “$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $” |
Table 1b: Number of sites where ‘command only’ ssh-to-job works.
| Status | Quantity | Comments |
|---|---|---|
| Works | 50 | 21 sites: warning for SSH commands “Can't setns to user namespace: Invalid argument” |
| Broken | 6 | 1 site: “/usr/bin/ssh-keygen: No such file or directory”; 1 site: “Failed, because sshd not correctly configured (SSH_TO_JOB_SSHD=/usr/sbin/sshd): No such file or directory”; 1 site: “Connection to condor-job.node827.dcs.ligo-wa.caltech.edu closed by remote host.”; 1 site: “/bin/bash: Permission denied”; 1 site: “Connection closed by UNKNOWN port 65535”; 1 site: “$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $” |
Table 2 shows the percentage of successes and failures for each site. Most
sites were homogeneous, but a few showed errors on only some of their nodes.
Table 2: Percentage of successes and
failures for each site.
| Site | Command only: success | Command only: failure | Interactive session: success | Interactive session: failure | Status |
|---|---|---|---|---|---|
| AMNH | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| Alabama-CHPC | 100% | 0% | 90% | 10% | ‘Can't setns to user namespace: Invalid argument’ |
| BEOCAT-SLATE | 0% | 100% | 0% | 100% | ‘$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $’ |
| CHTC | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ and ‘memory usage exceeded request_memory’ |
| CHTC-Spark | 100% | 0% | 100% | 0% | |
| Clemson-Palmetto | 100% | 0% | 100% | 0% | |
| Colorado | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| Duke-NCShare | 100% | 0% | 100% | 0% | |
| ELSA | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| FANDM-ITS | 100% | 0% | 100% | 0% | |
| FNAL | 0% | 100% | 0% | 100% | ‘/usr/bin/ssh-keygen: No such file or directory’ |
| FNAL_GPGrid | 0% | 100% | 0% | 100% | ‘Failed, because sshd not correctly configured (SSH_TO_JOB_SSHD=/usr/sbin/sshd): No such file or directory’ |
| GATech | 100% | 0% | 100% | 0% | |
| GRID_ce2 | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| GSU-ACIDS | 100% | 0% | 100% | 0% | |
| Hawaii-Koa | 100% | 0% | 100% | 0% | slow |
| LIGO-WA | 0% | 100% | 0% | 100% | ‘Connection to condor-job.node827.dcs.ligo-wa.caltech.edu closed by remote host.’ |
| Lehigh - Hawk | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| Langston-Lion | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| Lafayette-Firebird | 100% | 0% | 100% | 0% | |
| LSU-Deep_Bayou | 100% | 0% | 100% | 0% | |
| LSUHSC-Tigerfish | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| MI-HORUS | 0% | 100% | 0% | 100% | ‘/bin/bash: Permission denied’ |
| MSU-DataMachine | 100% | 0% | 0% | 100% | ‘Can't setns to pid namespace: Operation not permitted’ |
| MTState-Tempest | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| New Mexico State Discovery | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| NCSU-OSG | 100% | 0% | 100% | 0% | |
| NotreDame | 100% | 0% | 100% | 0% | |
| ORU-Titan | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| OSG_US_FSU_HNPGRID | 100% | 0% | 100% | 0% | slow |
| PSU-LIGO | 100% | 0% | 100% | 0% | |
| Purdue-Anvil | 100% | 0% | 100% | 0% | |
| PuertoRico | 100% | 0% | 100% | 0% | |
| PDX-Coeus | 100% | 0% | 100% | 0% | |
| SIUE-CC-production | 100% | 0% | 100% | 0% | |
| SPRACE | 100% | 0% | 100% | 0% | |
| SU-ITS | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| Swarthmore-Firebird | 100% | 0% | 100% | 0% | |
| ODU-Ubuntu | 100% | 0% | 0% | 100% | ‘Can't setns to pid namespace: Operation not permitted’ or timeout |
| UAH-Voyager | 100% | 0% | 100% | 0% | |
| UC-Denver | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| UWEC-BOSE | 100% | 0% | 100% | 0% | |
| UChicago | 60% | 40% | 60% | 40% | ‘Connection closed by UNKNOWN port 65535’ |
| UConn | 100% | 0% | 100% | 0% | |
| UConn-HPC | 100% | 0% | 100% | 0% | |
| UCR-HPCC | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| UIUC-TGI-RAILS | 100% | 0% | 100% | 0% | slow |
| UMT-Hellgate | 100% | 0% | 100% | 0% | |
| UNR-CC | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| UUCHPC | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| UND-Talon | 100% | 0% | 100% | 0% | |
| UWM-Mortimer | 100% | 0% | 0% | 100% | ‘Can't setns to user namespace: Invalid argument’ |
| UW-IT | 100% | 0% | 100% | 0% | |
| Wisconsin | 100% | 0% | 100% | 0% | |
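
To illustrate how the per-site percentages in Table 2 relate to the full, partial, and no-functionality categories used in the executive summary, the following Python sketch classifies a few rows copied from Table 2. The names `classify_site` and `tally` are hypothetical, and the rule that a mode counts as working only at a 100% success rate is an assumption made for this example; the report does not state the exact threshold that was applied.

```python
from collections import Counter

# (site, command-only success %, interactive success %), copied from Table 2.
SAMPLE_ROWS = [
    ("AMNH", 100, 0),
    ("CHTC-Spark", 100, 100),
    ("FNAL", 0, 0),
    ("GATech", 100, 100),
]


def classify_site(command_pct: int, interactive_pct: int) -> str:
    """Map a site's success percentages to a functionality category."""
    # Assumption: a mode "works" only if every attempt succeeded.
    if command_pct == 100 and interactive_pct == 100:
        return "full"
    if command_pct == 100:
        return "partial"
    return "none"


def tally(rows):
    """Count how many sites fall into each category."""
    return Counter(classify_site(cmd, inter) for _, cmd, inter in rows)


if __name__ == "__main__":
    for label, count in sorted(tally(SAMPLE_ROWS).items()):
        print(f"{label}: {count}")
```

Under this rule, sites with mixed results on some nodes would not be counted as working for the affected mode; a looser threshold would move such sites into a different category.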