ClusterPilot

AI-assisted HPC workflow manager for any SLURM-managed cluster. Website: clusterpilot.sh

Built by a computational physics PhD student who got tired of doing this manually.

What it does

ClusterPilot automates the full local → cluster → local research cycle:

Describe your job in plain English - ClusterPilot sends your description to an AI model to generate a correct, cluster-aware SLURM script
Upload and submit - files are rsynced to the cluster and sbatch is run over an existing SSH ControlMaster socket
Monitor without babysitting - a background poll daemon checks squeue every 5 minutes by default (configurable via poll_interval); no persistent SSH connection is held open
Get notified (optional) - push notifications to your phone on job start, completion, failure, and walltime warnings via ntfy.sh
Auto-sync results - on completion, output files are rsynced back to your local project directory
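
As a rough illustration of the monitoring step, the sketch below shows how a single poll might turn squeue output into a job state and decide which event to raise. This is not ClusterPilot's actual code: the function names, the event labels, and the rule that an empty squeue result means the job has left the queue are all assumptions. Only the `squeue -j <id> -h -o %T` invocation is standard SLURM.

```python
from typing import Optional

# Hypothetical helper names; only the squeue flags are real SLURM options.
FAILURE_STATES = {"FAILED", "TIMEOUT", "CANCELLED", "OUT_OF_MEMORY", "NODE_FAIL"}


def squeue_command(job_id: str) -> list[str]:
    """Build the status query: -h drops the header, %T prints the long state name."""
    return ["squeue", "-j", job_id, "-h", "-o", "%T"]


def parse_state(squeue_stdout: str) -> str:
    """Return the job state, or 'GONE' when squeue no longer lists the job (assumption)."""
    lines = [ln.strip() for ln in squeue_stdout.splitlines() if ln.strip()]
    return lines[0] if lines else "GONE"


def transition_event(previous: str, current: str) -> Optional[str]:
    """Map a state change to an illustrative event label, or None if nothing changed."""
    if previous == current:
        return None
    if previous == "PENDING" and current == "RUNNING":
        return "started"
    if current == "COMPLETED":
        return "completed"
    if current in FAILURE_STATES:
        return "failed"
    if current == "GONE":
        return "left-queue"  # the final state would need a follow-up query, e.g. sacct
    return None
```

A poll loop would run `squeue_command(...)` over the SSH socket, feed the output to `parse_state`, and compare the result with the state stored from the previous poll.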

Everything runs from a keyboard-driven terminal UI (amber phosphor aesthetic, naturally).

clusterpilot-demo.mp4

F2 — Describe your job and generate a SLURM script

F1 — Monitor jobs, tail logs in real time, sync results

Supported clusters

ClusterPilot works with any SLURM cluster — if you can SSH into it and run sbatch, it will work. There are three cluster types that control which SLURM quirks are injected into the generated script:

`cluster_type`	Use for
`generic`	Any SLURM cluster (default if omitted)
`drac`	Compute Canada / Alliance (Cedar, Narval, Graham, Beluga)
`grex`	University of Manitoba Grex

Both drac and grex are specialisations of generic — they add cluster-specific rules (mandatory --account=, correct $SCRATCH path, GPU syntax) that the AI would otherwise have to guess. For every other cluster, generic is the right choice and the probe handles the rest automatically.

Requirements

Python >= 3.9
System ssh binary with ControlMaster support (standard on macOS/Linux)
An API key for your chosen AI provider (Anthropic or OpenAI; not required for Ollama or the hosted tier)
(Optional) A free ntfy.sh topic for push notifications

Installation

# pip
pip install clusterpilot

# conda
conda install -c conda-forge clusterpilot

On first run, ClusterPilot creates a starter config at ~/.config/clusterpilot/config.toml, prints its location, and exits. Edit it to add your cluster username and account, then run clusterpilot again.

Hosted tier

A hosted tier is available at app.clusterpilot.sh for researchers who want zero API key setup and a web dashboard.

$3/month, 14-day free trial. The self-hosted version is always fully functional.

What you get

Feature	Self-hosted (free)	Hosted tier
TUI, job submission, monitoring	✓	✓
Push notifications (ntfy.sh)	✓	✓
AI script generation	Bring your own API key	Managed — no key needed
Web dashboard	—	✓ app.clusterpilot.sh
Multi-machine sync	—	✓

Managed API key

With the hosted tier you do not need an Anthropic (or OpenAI) account. ClusterPilot issues you a cp- token that the TUI uses to route generation requests through the managed proxy. Set it in your config and the AI tab works out of the box:

[hosted]
api_url = "https://api.clusterpilot.sh"
api_token = "cp-your-token-here"

Web dashboard and multi-machine sync

Every job you submit is synced to the web dashboard automatically — including the generated SLURM script, job status, and log output. The dashboard updates within seconds of each state change (PENDING → RUNNING → COMPLETED/FAILED).

The cp- token is the identity key. Any machine that has the same token in its config will sync jobs to the same dashboard view:

MacBook (writing + submission)       Linux workstation (heavy SSH work)
  api_token = "cp-abc123..."           api_token = "cp-abc123..."
         │                                    │
         └──────────────┬─────────────────────┘
                        ▼
             app.clusterpilot.sh
             (all jobs, all machines, one view)

To use ClusterPilot from multiple machines: generate a key once from the Account page of the dashboard, then paste the same token into ~/.config/clusterpilot/config.toml on each machine. No per-machine registration or pairing is required.

Configuration

~/.config/clusterpilot/config.toml:

[defaults]
model = "claude-sonnet-4-6"   # AI model to use for script generation
api_key = ""                  # or set ANTHROPIC_API_KEY env var
poll_interval = 300           # seconds between job status checks

[[clusters]]
name = "grex"
host = "yak.hpc.umanitoba.ca"
user = "your_username"
account = "def-yoursupervisor"
scratch = "$HOME/clusterpilot_jobs"

[notifications]
backend = "ntfy"
ntfy_topic = "your-topic-string"
ntfy_server = "https://ntfy.sh"
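
To make the structure of this file concrete, here is a minimal sketch that parses the same example with Python's TOML reader and pulls out the values the rest of this README refers to. The `summarise` function and its return shape are hypothetical, not part of ClusterPilot. The defaults it applies (`generic` when `cluster_type` is omitted, notifications off when `ntfy_topic` is empty) follow what this README states. `tomllib` needs Python 3.11 or newer; on older versions the `tomli` backport offers the same interface.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: assumes the tomli backport is installed
    import tomli as tomllib

EXAMPLE = """
[defaults]
model = "claude-sonnet-4-6"
api_key = ""
poll_interval = 300

[[clusters]]
name = "grex"
host = "yak.hpc.umanitoba.ca"
user = "your_username"
account = "def-yoursupervisor"
scratch = "$HOME/clusterpilot_jobs"

[notifications]
backend = "ntfy"
ntfy_topic = "your-topic-string"
ntfy_server = "https://ntfy.sh"
"""


def summarise(config_text: str) -> dict:
    """Parse a config string and return the settings discussed in this README."""
    cfg = tomllib.loads(config_text)
    defaults = cfg.get("defaults", {})
    clusters = cfg.get("clusters", [])  # [[clusters]] becomes a list of tables
    return {
        "model": defaults.get("model"),
        "poll_interval": defaults.get("poll_interval", 300),
        # cluster_type falls back to "generic" when omitted
        "clusters": [(c["name"], c.get("cluster_type", "generic")) for c in clusters],
        # an empty or missing topic means no notifications are sent
        "notifications_on": bool(cfg.get("notifications", {}).get("ntfy_topic")),
    }


summary = summarise(EXAMPLE)
print(summary)
```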

AI providers

`provider`	`model` examples	API key
`anthropic` (default)	`claude-sonnet-4-6`, `claude-opus-4-6`	`ANTHROPIC_API_KEY` env var or `api_key` in config
`openai`	`gpt-4o`, `gpt-4o-mini`, `o4-mini`	`OPENAI_API_KEY` env var or `api_key` in config
`ollama`	`llama3.2`, `qwen2.5-coder`, any local model	not required

For Ollama, ClusterPilot connects to http://localhost:11434 by default. To use a remote Ollama instance, set api_base_url = "http://your-host:11434/v1" in config.

Any OpenAI-compatible API (vLLM, LM Studio, etc.) also works with provider = "openai" and api_base_url pointing at the server.
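
The key lookup described in the table can be sketched as follows. The function name is hypothetical, and the precedence (config value first, then environment variable) is an assumption, since this README only says that either source works.

```python
import os
from typing import Mapping, Optional

# Environment variable per provider, as listed in the table above.
ENV_VARS = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_api_key(
    provider: str,
    config_key: str = "",
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the API key for a provider, or None when no key is set or needed."""
    if provider == "ollama":
        return None  # local models need no key
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    # Assumed precedence: api_key from config wins over the environment variable.
    key = config_key or env.get(ENV_VARS[provider], "")
    return key or None
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.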

To switch provider or model, edit ~/.config/clusterpilot/config.toml directly, or press EDIT CONFIG on the F9 screen. Changes take effect on the next script generation; no restart needed.

Adding multiple clusters

Add as many [[clusters]] blocks as you need. All configured clusters appear in the cluster dropdown on the F2 Submit screen and are connected to automatically on startup.

[[clusters]]
name = "grex"
host = "yak.hpc.umanitoba.ca"
user = "jsmith"
account = "def-supervisor"
scratch = "$HOME/clusterpilot_jobs"
cluster_type = "grex"

[[clusters]]
name = "narval"
host = "narval.alliancecan.ca"
user = "jsmith"
account = "def-supervisor"
scratch = "/scratch/jsmith"
cluster_type = "drac"

[[clusters]]
name = "myuni-hpc"
host = "hpc.myuniversity.edu"
user = "jsmith"
account = ""                     # omit if not required
scratch = "$HOME/jobs"
cluster_type = "generic"         # any other SLURM cluster

cluster_type values:

Value	Use for
`generic`	Any SLURM cluster (default if omitted)
`drac`	Compute Canada / DRAC (Cedar, Narval, Graham, Beluga)
`grex`	University of Manitoba Grex (same as `generic` in practice)

ClusterPilot probes $SCRATCH at connection time, so storage advice in generated scripts is accurate for any cluster without manual configuration:

What the probe finds	Storage advice injected into the AI prompt
`$SCRATCH` is set (e.g. `/scratch/jsmith`)	Use `$SCRATCH` for large output; `$SLURM_TMPDIR` for temp files
`$SCRATCH` is unset	Use `$HOME` or the job working directory; `$SLURM_TMPDIR` for temp files
`cluster_type = "drac"` (regardless of probe)	Hard rule: never `$HOME` (DRAC home quota is ~50 GB and jobs writing there get killed)

As far as storage advice is concerned, the only reason to set cluster_type = "drac" is to get that hard rule. For every other cluster, generic is correct; the probe handles the rest.
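
The decision table above is small enough to express directly. The sketch below is only an illustration of that logic: the function name and the exact advice strings are assumptions, and the real probe (presumably something like reading `$SCRATCH` over SSH) is not shown.

```python
from typing import Optional


def storage_advice(cluster_type: str, scratch: Optional[str]) -> str:
    """Pick the storage advice line for the AI prompt from the probe result."""
    if cluster_type == "drac":
        # Applies regardless of what the probe found.
        return "Hard rule: never write job output to $HOME."
    if scratch:
        return "Use $SCRATCH for large output; $SLURM_TMPDIR for temp files."
    return "Use $HOME or the job working directory; $SLURM_TMPDIR for temp files."
```

For example, `storage_advice("generic", "/scratch/jsmith")` selects the `$SCRATCH` advice, while `storage_advice("generic", None)` falls back to `$HOME`.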

Upload and download excludes

When uploading a project directory, ClusterPilot excludes files that are not needed on the cluster. When downloading results, it skips source files that are already on your machine and only pulls back output (SLURM logs, data files, etc.).

Both lists are configurable in the [defaults] section:

[defaults]
# Files/dirs excluded from upload to the cluster.
upload_excludes = [
    ".git/",
    ".julia/",
    "__pycache__/",
    "*.pyc",
    ".ipynb_checkpoints/",
    "node_modules/",
    "*.egg-info/",
    ".DS_Store",
    "CLAUDE.md",
    "clusterpilot_jobs/",
    # Large / media artefacts (add a specific file via EXTRA FILES if a job needs it).
    "*.jld2", "*.h5", "*.hdf5", "*.png", "*.pdf",
    "*.svg", "*.gif", "*.mp4", "*.zip", "*.tar*",
]

# Files/dirs excluded when syncing results back from the cluster.
# Everything not matched here is downloaded (SLURM logs, data output, etc.).
download_excludes = [
    "src/",
    "docs/",
    "examples/",
    "scripts/",
    "*.toml",
    "*.md",
    "*.sh",
    ".git/",
    "__pycache__/",
    ".DS_Store",
]

These are rsync glob patterns. If your job writes output to an unusual location, adjust download_excludes to avoid filtering it out.
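
To show how such a list maps onto rsync, here is a minimal sketch that turns an exclude list into an upload command. The function name and the choice of `-az` are assumptions about how ClusterPilot might call rsync; `--exclude=PATTERN` and `-e` are standard rsync options.

```python
def rsync_upload_command(
    local_dir: str,
    remote: str,
    excludes: list[str],
    ssh_command: str = "ssh",
) -> list[str]:
    """Build an rsync argument list with one --exclude per pattern."""
    cmd = ["rsync", "-az"]  # archive mode plus compression (assumed flags)
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    # -e tells rsync which remote shell to use, e.g. ssh with ControlMaster options.
    cmd += ["-e", ssh_command]
    # A trailing slash on the source copies the directory's contents, not the directory itself.
    cmd += [local_dir.rstrip("/") + "/", remote]
    return cmd


print(rsync_upload_command("~/proj", "jsmith@host:jobs/run1/", [".git/", "*.pyc"]))
```

Building the command as a list (rather than a shell string) avoids quoting problems with patterns such as `*.tar*`.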

Per-project ignore file

Add a .clusterpilotignore at the project root to exclude paths for that project on top of the built-in defaults (one pattern per line, gitignore-style syntax: data/ for a directory, *.h5 for a file type). Comments start with #. The older .clusterpilot_ignore name is still read and merged for backwards compatibility. Directory excludes are pruned entirely, so an ignored directory is never recreated on the cluster, not even as an empty folder.
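
Reading and merging the two ignore files could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and the merge order (new name first, duplicates dropped) is a guess at reasonable behaviour rather than documented fact.

```python
import tempfile
from pathlib import Path

# Current name first, then the older name kept for backwards compatibility.
IGNORE_FILES = (".clusterpilotignore", ".clusterpilot_ignore")


def read_ignore_patterns(project_root: Path) -> list[str]:
    """Collect patterns from both ignore files, skipping blanks and # comments."""
    patterns: list[str] = []
    for name in IGNORE_FILES:
        path = project_root / name
        if not path.is_file():
            continue
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line not in patterns:  # keep first occurrence only
                patterns.append(line)
    return patterns


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".clusterpilotignore").write_text("# big inputs\ndata/\n*.h5\n")
    (root / ".clusterpilot_ignore").write_text("*.h5\nscratch/\n")
    demo_patterns = read_ignore_patterns(root)

print(demo_patterns)
```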

Julia-project uploads

When the project root contains a Project.toml, ClusterPilot ships only what the job needs, preserving layout: Project.toml, Manifest.toml, the src/ tree, and the driver script. Everything else is left behind (your .clusterpilotignore still applies on top). This keeps uploads small even when the repo carries large data/ or output/ directories. Helper scripts the driver include()s, or any other file outside this set, should be listed under EXTRA FILES on the F2 screen, which uploads them at their correct relative path and bypasses the ignore rules.

Set PROJECT DIR to the project root (the folder holding Project.toml), not to a src/ subdirectory: ClusterPilot uploads the contents of PROJECT DIR as the job root, so pointing it inside src/ would flatten the package layout.
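
The selection rule for Julia projects can be sketched as a function that lists which relative paths would be shipped. The function name and parameters are hypothetical, and the sketch leaves out the ignore rules that still apply on top.

```python
import tempfile
from pathlib import Path


def julia_upload_set(root: Path, driver: str, extra_files: tuple = ()) -> list[str]:
    """Return the relative paths to upload: manifests, src/ tree, driver, extras."""
    keep: set[str] = set()
    for name in ("Project.toml", "Manifest.toml"):
        if (root / name).is_file():
            keep.add(name)
    src = root / "src"
    if src.is_dir():
        keep.update(p.relative_to(root).as_posix() for p in src.rglob("*") if p.is_file())
    # The driver script and anything listed under EXTRA FILES keep their relative path.
    for rel in (driver, *extra_files):
        if (root / rel).is_file():
            keep.add(Path(rel).as_posix())
    return sorted(keep)


# Demonstration with a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("Project.toml", "Manifest.toml", "src/Pkg.jl", "src/sub/util.jl",
                "run.jl", "helpers/plot.jl", "data/big.h5"):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    demo_files = julia_upload_set(root, "run.jl", ("helpers/plot.jl",))

print(demo_files)
```

In this demonstration `data/big.h5` is left behind, while `helpers/plot.jl` is included only because it was passed as an extra file.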

Usage

clusterpilot                 # launch the TUI
clusterpilot daemon run      # run the poll daemon in the foreground
clusterpilot daemon install  # install systemd user service (Linux)

TUI screens

Key	Screen
F1	Job list - status, log tail, cancel
F2	Submit - describe job, pick partition, generate + review script
F9	Settings - clusters, SSH, notifications, API key

Submitting a job (F2 workflow)

Select your cluster from the dropdown
Select a partition (populated from a live sinfo cache)
Type a plain-language description of your job, e.g.:

Train a small transformer on CIFAR-10 using PyTorch, 1 V100, 4 hours

ClusterPilot generates a complete sbatch script - review and edit as needed
Press Submit - files are uploaded and the job is queued

The partition you select is passed to the model as a hard constraint, not a suggestion. It will use the correct --gres syntax for that partition's hardware.

Project directory mode

If you set PROJECT DIR on the F2 screen, your project is rsynced to a job-specific directory on the cluster ($HOME/clusterpilot_jobs/<job-name>/), minus the built-in excludes and your .clusterpilotignore (and, for Julia projects, reduced to the manifest + src/ + driver). Each job gets its own isolated copy, so you can submit multiple jobs from the same local project without them interfering with each other. Modify a parameter, change the driver script, and submit again - each submission creates a fresh directory on the cluster.

When results are synced back, only output files are downloaded (SLURM logs, data files). Source code that was uploaded is skipped by default. See Upload and download excludes for details.

How SSH works

ClusterPilot uses your system ssh binary with ControlMaster multiplexing. You authenticate once (including MFA if required); all subsequent commands reuse the existing socket with sub-second latency.

No changes to ~/.ssh/config are required. ClusterPilot passes all ControlMaster flags directly on the command line. Your existing SSH config is left untouched.
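
Passing the multiplexing options on the command line might look like the sketch below. The function name and the socket file name are assumptions; the `ControlMaster`, `ControlPath`, `ControlPersist`, and `ServerAliveInterval` options are standard OpenSSH, and the values `4h` and `60` are the ones this README mentions elsewhere.

```python
import os


def ssh_command(user: str, host: str, remote_command: str, socket_dir: str = "~/.ssh") -> list[str]:
    """Build an ssh invocation that creates or reuses a ControlMaster socket."""
    # %r, %h and %p expand to remote user, host and port; the "cp-" prefix is hypothetical.
    control_path = os.path.join(os.path.expanduser(socket_dir), "cp-%r@%h:%p")
    return [
        "ssh",
        "-o", "ControlMaster=auto",          # reuse an existing master, else become one
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=4h",           # keep the master alive after the first session exits
        "-o", "ServerAliveInterval=60",      # keepalive probe every 60 seconds
        f"{user}@{host}",
        remote_command,
    ]


cmd = ssh_command("jsmith", "narval.alliancecan.ca", "squeue -u jsmith")
print(" ".join(cmd))
```

Because every option is supplied with `-o`, nothing has to be written to `~/.ssh/config`.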

Terminal colours

ClusterPilot uses 24-bit RGB colour throughout. Most modern terminal emulators support this, but the COLORTERM environment variable must be set to truecolor for Textual to detect it. Without it, colours fall back to the nearest 16 ANSI colours, which can look significantly different from the intended amber palette.

macOS (iTerm2, Terminal.app): truecolor works out of the box in a local window. No action needed.

Over SSH: the COLORTERM variable is often not forwarded to the remote session. Fix this by adding the following to ~/.bashrc (or ~/.zshrc) on the remote machine:

export COLORTERM=truecolor

Then reconnect, or run source ~/.bashrc in the current session.

To verify:

echo $COLORTERM   # should print: truecolor

iTerm2 users: you can also forward the variable automatically for all SSH sessions by adding COLORTERM = truecolor to the environment section of your iTerm2 profile (Profiles → Session → Environment).

The left screenshot below shows correct truecolor rendering. The right shows the 16-colour fallback over SSH without COLORTERM set — the amber backgrounds are approximated as red by the terminal.

Correct (truecolor)	16-colour fallback over SSH

Mouse support over SSH

ClusterPilot is fully keyboard-navigable (Tab, arrow keys, Enter, F1/F2/F9) and this is the recommended way to use it over SSH.

Mouse clicks work in local terminal windows and in most SSH sessions from macOS terminals. However, SSH into a Linux machine running Wayland is a known exception — mouse events are not reliably forwarded through the SSH connection in this configuration, regardless of terminal settings. This is a Wayland limitation, not a ClusterPilot bug, and affects most TUI applications.

Workaround: run ClusterPilot directly on the local machine and point it at the remote cluster via SSH ControlMaster, which is the intended workflow. If you need to run it on a remote Linux workstation, switching that session to an X11 fallback (ssh -X) may restore mouse support.

Notifications (optional)

Push notifications are entirely optional. If you prefer to just leave the TUI open and check job status from the F1 screen, that works perfectly well. The SSH connection stays alive as long as the TUI is running (ControlPersist 4h + ServerAliveInterval 60), the job list refreshes automatically every 10 seconds, and you can press TAIL or LOG at any time to see live output. No external service is needed for this workflow.

If you want push notifications to your phone (useful when you close the lid and walk away), ClusterPilot supports ntfy.sh.

Setting up ntfy (if you want it)

Pick a topic string - this is just a name, like a channel. Use something unique so strangers cannot read your notifications (e.g. clusterpilot-jfrank-a8f3, not test-jobs).

Add it to your config (~/.config/clusterpilot/config.toml):

[notifications]
backend = "ntfy"
ntfy_topic = "clusterpilot-jfrank-a8f3"   # your unique topic
ntfy_server = "https://ntfy.sh"           # or a self-hosted server

Subscribe on your phone - install the ntfy app (Android / iOS) and subscribe to the same topic string. No account or phone number is required.

That's it. You can also view notifications in a browser at https://ntfy.sh/your-topic-string.
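
For reference, publishing to ntfy is a plain HTTP POST to the topic URL. The sketch below builds such a request with the standard library; the function name is hypothetical and this is not necessarily how ClusterPilot sends its messages. The `Title` and `Priority` headers are part of ntfy's publishing API.

```python
import urllib.request
from typing import Optional


def build_ntfy_request(
    server: str,
    topic: str,
    title: str,
    message: str,
    priority: str = "default",
) -> Optional[urllib.request.Request]:
    """Build a POST request for an ntfy topic, or None when no topic is configured."""
    if not topic:
        return None  # an empty topic means notifications are disabled
    url = f"{server.rstrip('/')}/{topic}"
    return urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


req = build_ntfy_request(
    "https://ntfy.sh", "clusterpilot-jfrank-a8f3", "Job started", "job 123 is RUNNING"
)
# urllib.request.urlopen(req, timeout=10)  # uncomment to actually send the notification
```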

Disabling notifications

Leave ntfy_topic empty (or remove it) and no notifications will be sent:

[notifications]
backend = "ntfy"
ntfy_topic = ""

Notification events

When enabled, ClusterPilot notifies on:

Job started (PENDING to RUNNING)
Job completed - results are syncing
Job failed - includes the last 6 lines of the SLURM log
Walltime warning - less than 30 minutes remaining
ETA update - periodic estimate while running

A self-hosted ntfy server or any HTTP POST webhook also works; set ntfy_server in the config accordingly.
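
The walltime warning comes down to comparing the remaining time against a 30-minute threshold. A minimal sketch, assuming the remaining time arrives in SLURM's `[days-]hours:minutes:seconds` format (as printed by squeue's `%L` field); the function names are hypothetical, and special values such as `UNLIMITED` are not handled and would raise `ValueError`.

```python
def parse_slurm_time(text: str) -> int:
    """Convert [days-]hours:minutes:seconds (or minutes:seconds) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:  # pad missing leading fields, e.g. "29:59" -> 0:29:59
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def walltime_warning_due(time_left: str, threshold_minutes: int = 30) -> bool:
    """True when less than the threshold remains."""
    return parse_slurm_time(time_left) < threshold_minutes * 60
```

A real daemon would also need to remember that the warning was already sent, so that it fires once per job rather than on every poll.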

Architecture

clusterpilot/
  ssh/           system ssh/rsync subprocess wrappers (ControlMaster)
  cluster/       sinfo/module avail probe + 24h JSON cache
  jobs/          AI script generation, sbatch submit, state machine
  notify/        ntfy.sh HTTP push
  daemon/        async poll loop + systemd service installer
  tui/           Textual app (F1 jobs / F2 submit / F9 settings)
  config.py      ~/.config/clusterpilot/config.toml loader
  db.py          aiosqlite job history

All cluster-specific SLURM quirks (account requirements, scratch paths, GPU syntax) live in one place and are injected into the AI prompt automatically.
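
The 24h JSON cache for probe results listed under cluster/ can be pictured with the sketch below. It is an assumption-laden illustration: the function names, the use of the file's modification time as the timestamp, and the cached fields are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def save_probe(path: Path, data: dict) -> None:
    """Write probe results (e.g. partitions, modules) to the cache file."""
    path.write_text(json.dumps(data), encoding="utf-8")


def load_cached_probe(path: Path, now: Optional[float] = None) -> Optional[dict]:
    """Return cached probe data, or None if the file is missing or older than 24h."""
    if not path.is_file():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > CACHE_TTL_SECONDS:
        return None  # stale: the caller should re-run the probe
    return json.loads(path.read_text(encoding="utf-8"))


# Demonstration: a fresh cache is returned, a 25-hour-old one is not.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "grex.json"
    save_probe(cache_file, {"partitions": ["skylake", "gpu"]})
    fresh = load_cached_probe(cache_file)
    stale = load_cached_probe(cache_file, now=time.time() + 25 * 60 * 60)
```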

Development

git clone https://github.com/ju-pixel/clusterpilot
cd clusterpilot
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest          # 128 tests, no SSH required
ruff check .    # lint

Planned

~~Remote cleanup from F1: delete synced/terminal job directories on the cluster to reclaim scratch space without SSH-ing in manually~~
~~Support for additional AI providers (OpenAI, local models via Ollama, etc.)~~
~~Job array support in the submission UI~~
~~conda-forge package for HPC environments that prefer conda~~
~~Cost estimation before submission based on requested resources and account allocation~~
~~Hosted tier with managed API key and web dashboard~~ — live at app.clusterpilot.sh
Windows support (WSL2 path handling, no systemd dependency)

Support

ClusterPilot is free and open source. If it saves you time, consider sponsoring development.

Licence

MIT - free to use and self-host.
