Chester (chester-ml on PyPI) is a Python experiment launcher for ML workflows. Define your training function and parameter sweep — Chester handles dispatching jobs to local subprocesses, SSH servers, or SLURM clusters, with Singularity container support, code syncing, and reproducibility snapshots baked in.
```bash
pip install chester-ml
# or
uv add chester-ml
```

1. Create `.chester/config.yaml` in your project root:
```yaml
log_dir: data
package_manager: uv

backends:
  local:
    type: local
    prepare: .chester/backends/local/prepare.sh

  myserver:
    type: ssh
    host: myserver  # SSH alias from ~/.ssh/config
    remote_dir: /home/user/myproject
    prepare: .chester/backends/myserver/prepare.sh

  mycluster:
    type: slurm
    host: mycluster
    remote_dir: /home/user/myproject
    prepare: .chester/backends/mycluster/prepare.sh
    slurm:
      partition: gpu
      time: "24:00:00"
      gpus: 1
      cpus_per_gpu: 8
      mem_per_gpu: 32G
```
2. Write a launcher:

```python
from chester.run_exp import run_experiment_lite, VariantGenerator, detect_local_gpus, flush_backend


def run_task(variant, log_dir, exp_name):
    print(f"lr={variant['lr']}, batch={variant['batch_size']}")
    # ... your training code ...


vg = VariantGenerator()
vg.add('lr', [1e-3, 1e-4])
vg.add('batch_size', [32, 64])

for v in vg.variants():
    run_experiment_lite(
        stub_method_call=run_task,
        variant=v,
        mode='local',  # or 'myserver', 'mycluster'
        exp_prefix='sweep',
        max_num_processes=max(1, len(detect_local_gpus())),
    )

flush_backend('local')  # no-op for local; required after the loop in batch SSH mode
```
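The `run_task` callable receives the resolved variant plus a per-run `log_dir` and `exp_name`, so writing artifacts under `log_dir` keeps each run's outputs separated. A minimal sketch (`train_model` and the `metrics.json` layout are hypothetical, not a Chester convention):

```python
import json
import os


def run_task(variant, log_dir, exp_name):
    # Chester calls this with the resolved variant plus a per-run log dir and name.
    result = train_model(lr=variant['lr'], batch_size=variant['batch_size'])  # train_model: hypothetical stand-in for your training code
    # Keep each run's artifacts together under its own log_dir.
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, 'metrics.json'), 'w') as f:
        json.dump({'exp_name': exp_name, 'final_loss': result['loss']}, f)
```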
3. Run:

```bash
python launcher.py            # local
python launcher.py myserver   # SSH
python launcher.py mycluster  # SLURM
```
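The quick-start launcher pins `mode='local'`; to get the argv-style dispatch above, the launcher itself can pick the backend from the command line. A sketch that continues the launcher from step 2 (the argv handling is plain Python, not a Chester feature):

```python
import sys

from chester.run_exp import run_experiment_lite, flush_backend

# Backend name from the command line, defaulting to local:
#   python launcher.py            -> mode='local'
#   python launcher.py mycluster  -> mode='mycluster'
mode = sys.argv[1] if len(sys.argv) > 1 else 'local'

for v in vg.variants():  # vg and run_task as defined in step 2
    run_experiment_lite(
        stub_method_call=run_task,
        variant=v,
        mode=mode,
        exp_prefix='sweep',
    )

flush_backend(mode)
```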
- Three backend types: local subprocess, SSH (`nohup`), SLURM (`sbatch`)
- Singularity on all backends: GPU passthrough, persistent overlays, per-container `prepare.sh`
- VariantGenerator: cartesian product sweeps, dependent parameters, `order="serial"` (multi-step single job) and `order="dependent"` (chained SLURM jobs)
- Hydra integration: pass parameters as `key=value` overrides with OmegaConf interpolation support
- Git snapshot: saves `git_info.json` + `git_diff.patch` per run for full reproducibility
- Submodule commit pinning: pin specific submodule commits per job via remote git worktrees
- SSH batch-GPU mode: accumulate jobs across variants, fire one per GPU on `flush_backend()` (sketched below)
- Extra sync dirs: rsync additional paths (datasets, checkpoints) to remote before submission
- Per-experiment SLURM overrides: tune `time`, `gpus`, `mem_per_gpu`, etc. per `run_experiment_lite()` call
- Graceful Ctrl+C: local kills subprocesses and stops the queue; remote detaches and lets jobs keep running
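The batch-GPU flow mentioned above reuses the quick-start pattern: queue every variant against the SSH backend, then release them all at once. A sketch, assuming batch mode is enabled for `myserver` in `.chester/config.yaml` (the exact option lives in the Backends doc):

```python
from chester.run_exp import run_experiment_lite, flush_backend

# In batch-GPU mode each call only queues a job; nothing launches yet.
for v in vg.variants():
    run_experiment_lite(
        stub_method_call=run_task,
        variant=v,
        mode='myserver',
        exp_prefix='sweep',
    )

# One flush fires the queued jobs over SSH, one per available GPU.
flush_backend('myserver')
```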
Full reference in `docs/`:
| Doc | What it covers |
|---|---|
| Configuration | `.chester/config.yaml` — all fields, global `singularity` block, YAML anchors |
| Backends | Local, SSH, SLURM — all options, batch-GPU, extra sync dirs |
| Singularity | Mounts, overlays, PID namespace, fakeroot, runtime override |
| Parameter Sweeps | `VariantGenerator`, serial/dependent ordering, `derive`, `flush_backend` |
| Hydra | `hydra_enabled`, flags, OmegaConf interpolations |
| Git Snapshot | `git_info.json`, `git_diff.patch`, submodule tracking, recovery |
| Submodule Pinning | Per-job submodule commit pinning via worktrees |
| Examples | Annotated real-world config patterns |
See `docs/examples/` for annotated configs:

- `simple.yaml` — local + SSH + SLURM, no Singularity
- `singularity-slurm.yaml` — production SLURM + Singularity with NFS mounts
- `multi-gpu-ssh.yaml` — multi-GPU SSH workstation with batch mode
```
myproject/
├── .chester/
│   ├── config.yaml              # Main config
│   └── backends/
│       ├── local/
│       │   └── prepare.sh       # Local env setup
│       ├── mycluster/
│       │   └── prepare.sh       # Cluster setup (modules, paths)
│       └── myserver/
│           └── prepare.sh       # SSH server setup
├── launchers/
│   └── launch_sweep.py
└── src/
```
Chester searches for `.chester/config.yaml` upward from the current directory, stopping at the `.git` root. Override with `$CHESTER_CONFIG_PATH`.
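For a config outside the repo (or one of several), one option is to set the variable before Chester loads anything. A sketch (the path is illustrative):

```python
import os

# Point Chester at an alternative config before importing it,
# in case the path is resolved at import time.
os.environ['CHESTER_CONFIG_PATH'] = '/shared/configs/chester.yaml'

from chester.run_exp import run_experiment_lite  # noqa: E402
```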
MIT