Add a `sucoder nodes [PARTITION]` query: show available compute nodes without an interactive login

## Motivation

When Lustre is degraded, some savio3 compute nodes hang on I/O while others are fine. Before reserving a node (or targeting one with `--node` / `--local-disk`), it would help to see which nodes in a partition are actually free and healthy. Today that means shelling into a login node interactively (PIN+OTP) and running `sinfo` by hand.

Everything needed to do this non-interactively already exists in sucoder: each remote target carries a `gateway` and a 24h ControlMaster socket (`~/.sucoder/ssh/<gateway>.sock`), and the slurm targets carry a `partition`. A small query command could reuse that warm socket and run `sinfo` for us.

## Proposal

A node-availability query, e.g. `sucoder nodes [PARTITION]` (this is the `--nodes-available` query I asked for; a subcommand is probably the more idiomatic Typer shape, but either form works):

- `sucoder -T savio-node nodes` -> defaults the partition from the target's `slurm.partition` (savio3).
- `sucoder -T savio-htc nodes` -> savio4_htc.
- `sucoder -T savio-node nodes savio3_gpu` -> optional positional overrides the partition (gpu, bigmem, etc.).

Output: one row per node with state, CPUs Allocated/Idle/Other/Total, and load, plus the drain reasons. Roughly:

```
sinfo -p <partition> -N -o "%N %6t %.15C %.6O"
sinfo -p <partition> -R          # drained/down nodes + reasons
```
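To render the first command as a table, its output needs to be split into fields. A minimal parsing sketch, assuming the four whitespace-separated columns produced by the format string above; `NodeRow` and `parse_sinfo_rows` are hypothetical names, not existing sucoder code:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeRow:
    name: str
    state: str
    cpus_alloc: int
    cpus_idle: int
    cpus_other: int
    cpus_total: int
    load: Optional[float]  # None when sinfo prints a non-numeric load (e.g. N/A)


def parse_sinfo_rows(text: str) -> List[NodeRow]:
    """Parse output of `sinfo -N -o "%N %6t %.15C %.6O"` into rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # blank or unexpected line
        name, state, cpus, load = fields
        try:
            # %C is printed as allocated/idle/other/total
            alloc, idle, other, total = (int(x) for x in cpus.split("/"))
        except ValueError:
            continue  # header line or malformed CPU column
        try:
            load_value: Optional[float] = float(load)
        except ValueError:
            load_value = None
        rows.append(NodeRow(name, state, alloc, idle, other, total, load_value))
    return rows
```

The header row is skipped implicitly because its CPU column does not parse as four integers, so the sketch does not depend on the exact header text.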

## Implementation sketch (reuse existing machinery)

- Resolve the target (`-T`) the way other subcommands do; partition = optional positional arg, else `remote.slurm.partition`.
- Run `sinfo` on the login node **over the existing gateway ControlMaster socket** (`ControlPath=~/.sucoder/ssh/<gateway>.sock`, `ControlMaster=no`, `BatchMode=yes`) so it costs no OTP when a tunnel is already warm; fall back to opening a tunnel if the socket is dead. `tunnel.SshControl` / `_control_socket_path()` already model this.
- No new config; read partitions from the target's `slurm` block.
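The steps above could be sketched roughly as follows. This only builds the `ssh` argument list and does not use `tunnel.SshControl`, whose interface is not shown here; `resolve_partition`, `build_sinfo_argv`, and the `login_host` parameter are hypothetical names introduced for illustration:

```python
import os
import shlex
from typing import List, Optional

SINFO_FORMAT = "%N %6t %.15C %.6O"


def resolve_partition(positional: Optional[str], target_default: str) -> str:
    """Optional positional argument wins; otherwise use the target's slurm.partition."""
    return positional if positional else target_default


def build_sinfo_argv(gateway: str, login_host: str, partition: str) -> List[str]:
    """Build an ssh command that runs sinfo over the existing ControlMaster socket.

    Assumption: `login_host` is whatever destination the target's gateway
    resolves to; the socket path follows the layout described in this issue.
    """
    socket = os.path.expanduser(f"~/.sucoder/ssh/{gateway}.sock")
    remote_cmd = "sinfo -p {} -N -o {}".format(
        shlex.quote(partition), shlex.quote(SINFO_FORMAT)
    )
    return [
        "ssh",
        "-o", f"ControlPath={socket}",
        "-o", "ControlMaster=no",  # reuse the socket, never create a new master
        "-o", "BatchMode=yes",     # fail instead of prompting for PIN+OTP
        login_host,
        remote_cmd,
    ]
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`. A non-zero exit status is one possible signal that the socket is dead and the tunnel fallback should kick in; OpenSSH's `ssh -O check` with the same `ControlPath` is another way to probe the socket first.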

Composes naturally with the existing `--node` (adopt/target a node) and `--local-disk` flags: see which nodes are free, then aim at one.

## Honest caveat

`sinfo` reports Slurm availability, not Lustre client health: a node can read `idle` while its Lustre mount is wedged. The best proxies the query can surface are the `-R` drain reasons (admins often drain sick nodes) and an anomalously high load on an otherwise idle node; it cannot promise that a node's filesystem is healthy. That is worth a line in the output/help so the result isn't over-trusted.
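One way the "anomalous load on an idle node" proxy could be expressed is a small flagging helper. This is a heuristic sketch only: `looks_suspicious` is a hypothetical name, and the default threshold of 1.0 is an arbitrary assumption that would need tuning against real savio3 data:

```python
from typing import Optional


def looks_suspicious(
    state: str,
    cpus_alloc: int,
    load: Optional[float],
    load_threshold: float = 1.0,  # assumed cutoff, not a measured value
) -> bool:
    """Flag a node that Slurm calls idle but that still reports notable load.

    This is only a hint: a clean result does not prove the Lustre mount is healthy.
    """
    if load is None:
        return False  # no load reported, nothing to compare
    # sinfo may append flag characters (such as '*') to the state; keep letters only.
    base_state = "".join(ch for ch in state if ch.isalpha())
    return base_state == "idle" and cpus_alloc == 0 and load > load_threshold
```

Flagged rows could be marked in the output alongside the `-R` reasons, with the help text making clear that an unflagged node is not guaranteed to be healthy.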

---
_Filed by Sue at Ethan's request (Letta Code session), 2026-06-15._

