Motivation
When Lustre is degraded, some savio3 compute nodes hang on I/O while others are fine. Before reserving a node (or targeting one with --node / --local-disk), it would help to see which nodes in a partition are actually free and healthy. Today that means shelling into a login node interactively (PIN+OTP) and running sinfo by hand.
Everything needed to do this non-interactively already exists in sucoder: each remote target carries a gateway and a 24h ControlMaster socket (~/.sucoder/ssh/<gateway>.sock), and the slurm targets carry a partition. A small query command could reuse that warm socket and run sinfo for us.
Proposal
A node-availability query, e.g. sucoder nodes [PARTITION] (this is the --nodes-available query I asked for; a subcommand is probably the more idiomatic typer shape, but either works):
sucoder -T savio-node nodes -> defaults the partition from the target's slurm.partition (savio3).
sucoder -T savio-htc nodes -> savio4_htc.
sucoder -T savio-node nodes savio3_gpu -> optional positional overrides the partition (gpu, bigmem, etc.).
Output: one row per node with state, CPUs Allocated/Idle/Other/Total, and load, plus the drain reasons. Roughly:
sinfo -p <partition> -N -o "%N %6t %.15C %.6O"
sinfo -p <partition> -R # drained/down nodes + reasons
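To make the "one row per node" output concrete, here is a rough sketch of parsing the first command's output into rows. The names `NodeRow` and `parse_sinfo_nodes` are hypothetical (nothing like them exists in sucoder as far as this issue knows), and the sample text uses invented node names and numbers purely for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class NodeRow(NamedTuple):
    """One node as reported by `sinfo -N -o "%N %6t %.15C %.6O"` (hypothetical type)."""
    name: str
    state: str
    allocated: int
    idle: int
    other: int
    total: int
    load: Optional[float]  # None when sinfo prints a non-numeric load such as N/A


# %C prints CPU counts as Allocated/Idle/Other/Total, e.g. 8/24/0/32.
_CPUS = re.compile(r"^(\d+)/(\d+)/(\d+)/(\d+)$")


def parse_sinfo_nodes(text: str) -> List[NodeRow]:
    """Parse whitespace-separated sinfo rows, skipping the header and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        name, state, cpus, load = fields
        match = _CPUS.match(cpus)
        if match is None:
            continue  # header row or anything else that is not a node line
        allocated, idle, other, total = (int(g) for g in match.groups())
        try:
            load_value = float(load)
        except ValueError:
            load_value = None
        rows.append(NodeRow(name, state, allocated, idle, other, total, load_value))
    return rows


# Invented sample output, for illustration only.
SAMPLE = """\
NODELIST STATE CPUS(A/I/O/T) CPU_LOAD
n0001.savio3 idle 0/32/0/32 0.01
n0002.savio3 mix 8/24/0/32 7.90
n0003.savio3 drain 0/0/32/32 N/A
"""

if __name__ == "__main__":
    for row in parse_sinfo_nodes(SAMPLE):
        print(row)
```

The parser skips any line whose CPU field is not four slash-separated integers, so it works whether or not the header row is present.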
Implementation sketch (reuse existing machinery)
- Resolve the target (-T) the way other subcommands do; partition = optional positional arg, else remote.slurm.partition.
- Run sinfo on the login node over the existing gateway ControlMaster socket (ControlPath=~/.sucoder/ssh/<gateway>.sock, ControlMaster=no, BatchMode=yes) so it costs no OTP when a tunnel is already warm; fall back to opening a tunnel if the socket is dead. tunnel.SshControl / _control_socket_path() already model this.
- No new config; read partitions from the target's slurm block.
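A minimal sketch of the ssh invocation the steps above describe, assuming the gateway string is usable directly as the ssh destination. `control_socket_path` and `build_sinfo_argv` are hypothetical stand-ins written for this issue; the real code would presumably go through tunnel.SshControl / _control_socket_path(), whose exact signatures are not assumed here. The dead-socket fallback is left out.

```python
import os
import shlex
from typing import List

# Same sinfo format string as in the proposal above.
SINFO_FORMAT = "%N %6t %.15C %.6O"


def control_socket_path(gateway: str) -> str:
    """Hypothetical helper: expand ~/.sucoder/ssh/<gateway>.sock for this user."""
    return os.path.expanduser(f"~/.sucoder/ssh/{gateway}.sock")


def build_sinfo_argv(gateway: str, partition: str) -> List[str]:
    """Build an ssh argv that runs sinfo over an already-open ControlMaster socket."""
    # ssh passes the remote command through the remote shell, so quote
    # the partition and the format string before joining them.
    remote = f"sinfo -p {shlex.quote(partition)} -N -o {shlex.quote(SINFO_FORMAT)}"
    return [
        "ssh",
        "-o", f"ControlPath={control_socket_path(gateway)}",
        "-o", "ControlMaster=no",  # reuse the socket, never become the master
        "-o", "BatchMode=yes",     # fail instead of prompting for PIN+OTP
        gateway,
        remote,
    ]


if __name__ == "__main__":
    # "login.example" is a placeholder gateway name.
    print(build_sinfo_argv("login.example", "savio3"))
```

With BatchMode=yes, a dead socket should make ssh exit non-zero rather than prompt, which is the signal to fall back to opening a tunnel. The argv can be handed to `subprocess.run(argv, capture_output=True, text=True)`.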
Composes naturally with the existing --node (adopt/target a node) and --local-disk flags: see which nodes are free, then aim at one.
Honest caveat
sinfo reports SLURM availability, not Lustre client health: a node can read idle while its Lustre mount is wedged. So -R (drain reasons, since admins often drain sick nodes) and an anomalous load on an otherwise-idle node are the best proxies the query can surface; it can't promise a node's filesystem is healthy. Worth a line in the output/help so the result isn't over-trusted.
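As a sketch of the "anomalous load on an otherwise-idle node" proxy: flag a node when its state is idle but its reported CPU load is above some threshold. The function name and the default threshold of 1.0 are assumptions made up for illustration, not tuned values, and a flag here is only a hint, never proof of a wedged Lustre mount.

```python
import re
from typing import Optional


def looks_suspicious(state: str, load: Optional[float], threshold: float = 1.0) -> bool:
    """Return True for an idle node whose CPU load exceeds the threshold.

    `state` is the compact sinfo state (e.g. "idle", "mix", "drain"); any
    trailing flag character sinfo appends (such as "*") is ignored.
    `load` is None when sinfo did not report a numeric load.
    """
    match = re.match(r"[a-z]+", state.lower())
    base = match.group(0) if match else ""
    if base != "idle":
        return False  # busy, drained, or down nodes are covered by state and -R
    return load is not None and load > threshold


if __name__ == "__main__":
    print(looks_suspicious("idle", 12.5))  # idle but loaded: worth a second look
    print(looks_suspicious("idle", 0.02))  # idle and quiet
```

Keeping the check this narrow means it never hides a node; it only adds a marker that the help text can describe as a heuristic.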
Filed by Sue at Ethan's request (Letta Code session), 2026-06-15.