Add a sucoder nodes [PARTITION] query: show available compute nodes without an interactive login #4

@ligon

Motivation

When Lustre is degraded, some savio3 compute nodes hang on I/O while others are fine. Before reserving a node (or targeting one with --node / --local-disk), it would help to see which nodes in a partition are actually free and healthy. Today that means shelling into a login node interactively (PIN+OTP) and running sinfo by hand.

Everything needed to do this non-interactively already exists in sucoder: each remote target carries a gateway and a 24h ControlMaster socket (~/.sucoder/ssh/<gateway>.sock), and the slurm targets carry a partition. A small query command could reuse that warm socket and run sinfo for us.

Proposal

A node-availability query, e.g. sucoder nodes [PARTITION] (this is the --nodes-available query I asked for; a subcommand is probably the more idiomatic typer shape, but either works):

  • sucoder -T savio-node nodes -> defaults the partition from the target's slurm.partition (savio3).
  • sucoder -T savio-htc nodes -> savio4_htc.
  • sucoder -T savio-node nodes savio3_gpu -> optional positional overrides the partition (gpu, bigmem, etc.).

Output: one row per node with state, CPU counts (allocated/idle/other/total), and CPU load, plus the reasons for any drained or down nodes. Roughly:

sinfo -p <partition> -N -o "%N %6t %.15C %.6O"
sinfo -p <partition> -R          # drained/down nodes + reasons
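
To make the per-node table concrete, here is a minimal sketch of how the output of the first command could be parsed into rows. The names `NodeRow` and `parse_sinfo_nodes` are hypothetical (nothing in sucoder is assumed to look like this), and the sample text is invented for illustration, not captured from a real cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeRow:
    """One line of `sinfo -N -o "%N %6t %.15C %.6O"` output (hypothetical type)."""
    name: str
    state: str
    alloc: int
    idle: int
    other: int
    total: int
    load: Optional[float]  # None when sinfo prints a non-numeric load such as N/A


def parse_sinfo_nodes(text: str) -> List[NodeRow]:
    """Parse whitespace-separated sinfo rows; skip anything that does not fit."""
    rows: List[NodeRow] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        name, state, cpus, load = fields
        try:
            # %C is printed as allocated/idle/other/total.
            alloc, idle, other, total = (int(x) for x in cpus.split("/"))
        except ValueError:
            continue  # header line or malformed CPU column
        try:
            load_value: Optional[float] = float(load)
        except ValueError:
            load_value = None
        rows.append(NodeRow(name, state, alloc, idle, other, total, load_value))
    return rows


# Invented sample output, for illustration only.
SAMPLE = """\
NODELIST STATE   CPUS(A/I/O/T) CPU_LO
n0001 idle 0/32/0/32 0.02
n0002 mix 8/24/0/32 7.90
n0003 down* 0/0/32/32 N/A
"""

if __name__ == "__main__":
    for row in parse_sinfo_nodes(SAMPLE):
        print(row)
```

The header is dropped because its CPU column does not parse as four integers, so the same function works with or without sinfo's `-h` (no header) flag.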

Implementation sketch (reuse existing machinery)

  • Resolve the target (-T) the way other subcommands do; partition = optional positional arg, else remote.slurm.partition.
  • Run sinfo on the login node over the existing gateway ControlMaster socket (ControlPath=~/.sucoder/ssh/<gateway>.sock, ControlMaster=no, BatchMode=yes) so it costs no OTP when a tunnel is already warm; fall back to opening a tunnel if the socket is dead. tunnel.SshControl / _control_socket_path() already model this.
  • No new config; the default partition comes from the target's existing slurm block.
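
The bullets above could be sketched roughly as follows. This is a sketch under assumptions, not sucoder's code: `control_socket`, `resolve_partition`, and `sinfo_over_socket` are hypothetical helpers, the gateway name is assumed to double as the ssh host, and the real implementation would presumably go through `tunnel.SshControl` / `_control_socket_path()`, whose signatures are not shown in this issue. The fallback of opening a tunnel when the socket is dead is left out.

```python
import shlex
from pathlib import Path
from typing import List, Optional

SINFO_FORMAT = "%N %6t %.15C %.6O"


def control_socket(gateway: str) -> Path:
    """Path of the gateway's ControlMaster socket, per the layout described above."""
    return Path.home() / ".sucoder" / "ssh" / f"{gateway}.sock"


def resolve_partition(positional: Optional[str], target_partition: Optional[str]) -> str:
    """The positional argument wins; otherwise use the target's slurm.partition."""
    partition = positional or target_partition
    if not partition:
        raise ValueError("no partition given and the target has no slurm.partition")
    return partition


def sinfo_over_socket(gateway: str, partition: str) -> List[str]:
    """Build an ssh argv that reuses an existing control socket to run sinfo.

    ControlMaster=no means never start a new master connection; BatchMode=yes
    makes ssh fail instead of prompting if the socket is not usable.
    Assumption: `gateway` is also a valid ssh host name.
    """
    remote = "sinfo -p {} -N -o {}".format(
        shlex.quote(partition), shlex.quote(SINFO_FORMAT)
    )
    return [
        "ssh",
        "-o", f"ControlPath={control_socket(gateway)}",
        "-o", "ControlMaster=no",
        "-o", "BatchMode=yes",
        gateway,
        remote,
    ]


if __name__ == "__main__":
    # "hpc-gw" is a placeholder gateway name.
    partition = resolve_partition(None, "savio3")
    print(sinfo_over_socket("hpc-gw", partition))
```

The argv could be handed to `subprocess.run(..., capture_output=True, text=True)`; a non-zero exit status would be the cue to fall back to opening a tunnel.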

Composes naturally with the existing --node (adopt/target a node) and --local-disk flags: see which nodes are free, then aim at one.

Honest caveat

sinfo reports SLURM availability, not Lustre client health: a node can read idle while its Lustre mount is wedged. So -R (drain reasons, since admins often drain sick nodes) and an anomalous load on an otherwise-idle node are the best proxies the query can surface; it can't promise a node's filesystem is healthy. Worth a line in the output/help so the result isn't over-trusted.
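
As one possible shape for the "anomalous load on an otherwise-idle node" proxy, a minimal sketch: flag nodes with no allocated CPUs whose load is above a threshold. The function name and the default threshold of 1.0 are assumptions for illustration, not tuned values. The reasoning behind the proxy is that Linux load average counts tasks in uninterruptible I/O wait, so a node with wedged I/O can show load with nothing allocated.

```python
from typing import Iterable, List, Optional, Tuple

# (node name, allocated CPUs, CPU load or None if sinfo reported no number)
NodeLoad = Tuple[str, int, Optional[float]]


def suspicious_idle_nodes(
    rows: Iterable[NodeLoad], max_idle_load: float = 1.0
) -> List[str]:
    """Return nodes with zero allocated CPUs but load above the threshold.

    This is only a hint: a clean result does not prove the filesystem is healthy.
    """
    return [
        name
        for name, alloc, load in rows
        if alloc == 0 and load is not None and load > max_idle_load
    ]


if __name__ == "__main__":
    # Invented values, for illustration only.
    sample = [
        ("n0001", 0, 0.02),   # idle and quiet
        ("n0002", 8, 7.90),   # busy, so load is expected
        ("n0003", 0, 12.40),  # idle but loaded: worth a warning
        ("n0004", 0, None),   # no load reported
    ]
    print(suspicious_idle_nodes(sample))
```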


Filed by Sue at Ethan's request (Letta Code session), 2026-06-15.


Labels: enhancement
