xrun

ML experiment runner — one YAML manifest → provision GPU → upload data → train → track metrics → pull checkpoints.

Works with vast.ai and Kaggle. Keeps the full history in a local SQLite database. No third-party tracking service required.

xrun launch exp/resnet50.yaml --detach   # kick off on a vast.ai GPU
xrun events <id> --follow                # watch stages: provision → upload → running → done
xrun metrics <id> --ascii                # live loss / accuracy curves
xrun pull <id> --ckpt best               # download the best checkpoint

Install

macOS / Linux

curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh

Installs xrun to ~/.local/bin/xrun and installs the Python TUI (xrun-tui) with pip --user. Pass --prefix /usr/local to change the binary location.

If Python 3.11+ is present but pip is missing, let the installer try python -m ensurepip --upgrade:

curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --install-pip

For CLI-only install without the TUI:

curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --no-tui

Windows (PowerShell)

irm https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1 | iex

Installs xrun.exe to %LOCALAPPDATA%\xrun\bin\xrun.exe, adds it to your user PATH, and installs the Python TUI (xrun-tui) with pip --user.

If Python 3.11+ is present but pip is missing:

& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1'))) -InstallPip

For CLI-only install without the TUI:

& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1'))) -NoTui

Specific version

curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --version v0.7.1

From source

cargo install --git https://github.com/gerrux/xrun --branch master xrun-cli

Python TUI

The install scripts install the TUI by default because xrun without arguments opens it on a TTY. To install or repair only the TUI:

curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --tui-only

& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1'))) -TuiOnly

From a local clone during development:

pip install -e python/xrun_tui

After install, xrun without arguments opens the TUI automatically when stdout is a TTY.

Updates

On interactive startup (xrun or xrun tui), xrun checks GitHub Releases for a newer version. If one is available, it shows a confirmation prompt before running the official installer.

xrun update --check   # check only
xrun update           # ask, then install
xrun update --yes     # install without prompt

Use xrun update --no-tui to update only the Rust CLI. Set XRUN_NO_UPDATE_CHECK=1 to disable the startup check in scripted environments.

Agent skill (optional)

Teaches Codex or Claude Code how to use xrun correctly — which commands to call, how to parse output, what to avoid.

For a repository-local install, run this inside the project that uses xrun:

xrun install skill --codex   # writes .codex/skills/xrun/SKILL.md + AGENTS.md
xrun install skill --claude  # writes .claude/skills/xrun/SKILL.md + CLAUDE.md

The legacy global Claude installer is still available:

# macOS / Linux — install binary + skill together
curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --with-skill

# skill only (if xrun is already installed)
curl -sSf https://raw.githubusercontent.com/gerrux/xrun/master/install.sh | sh -s -- --skill-only

# Windows — install binary + skill together
& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1'))) -WithSkill

# skill only
& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/gerrux/xrun/master/install.ps1'))) -SkillOnly

The global installer writes ~/.claude/skills/xrun/SKILL.md.

Quick start

# 1. Check your environment
xrun doctor

# 2. Set your vast.ai API key (or use the TUI: xrun → V → i)
xrun config set vast.api_key <YOUR_KEY>

# 3. Create a manifest
cp exp/base.yaml exp/my_run.yaml   # edit gpu, data, run.cmd, …

# 4. Launch (detached background run)
xrun launch exp/my_run.yaml --detach
#  → prints run ID, e.g. 01J2KX...

# 5. Follow stages
xrun events <id> --follow
#  provision → upload → running → done

# 6. Watch metrics
xrun metrics <id> --ascii

# 7. Retrieve the best checkpoint
xrun pull <id> --ckpt best --into models/

Commands

Command	Description
`xrun launch <manifest>`	Provision → upload → exec; `--detach` returns immediately
`xrun ls`	List runs; `--status running\|done\|failed`, `--json`
`xrun show <id>`	Full run card from local DB
`xrun events <id>`	Stage timeline; `--follow` polls until terminal
`xrun logs <id>`	stdout log; `--follow` streams via SSH
`xrun metrics <id>`	Metrics table/chart; `--ascii`, `--json`, `--png`
`xrun pull <id>`	Download checkpoints; `--ckpt best\|latest\|all`
`xrun stop <id>`	Graceful stop → pull artifacts → destroy instance
`xrun rerun <id>`	Re-run with optional `--patch run.args.--lr=5e-4`
`xrun balance`	vast.ai account balance
`xrun doctor`	Check credentials, CLI tools, connectivity
`xrun config`	`init \| show \| set <key> <val>`
`xrun gc`	Remove orphan instances

All read commands support --json for scripting.
Full reference: docs/CLI.md

Manifest

name: resnet50_baseline
vendor: vast          # or: kaggle
gpu: RTX_4090

# budget guards (auto-destroy when exceeded)
max_cost: 5.0         # USD
max_hours: 8

data:
  - src: data/train.h5
    dst: /workspace/data/train.h5

run:
  cmd: python train.py
  args:
    --lr: 5e-4
    --epochs: 30
    --batch-size: 16

artifacts:
  patterns:
    - "checkpoints/best*.pt"
    - "logs/metrics.json"

Full schema: docs/MANIFEST.md

TUI

xrun        # opens TUI if stdout is a TTY
xrun-tui    # direct launch (after pip install)

Screens (chord navigation with g→X):

Key	Screen
`g d`	Dashboard — burn rate, active runs, runway warning
`g r`	Runs — filterable list with live status
`g i`	Instances — raw vast.ai / Kaggle instances
`g v`	Vendors — API key management, balance
`g l`	Launch — manifest picker
`g s`	Settings — config editor
`?`	Help
`:`	Command palette

Run detail opens on Enter and has tabs: Stages · Logs · Metrics · Artifacts · Manifest

Training hook

Add xrun_hook to your training script to emit structured events and metrics:

# pip install git+https://github.com/gerrux/xrun.git#subdirectory=python/xrun_hook
from xrun_hook import XRunHook

hook = XRunHook()
hook.event("epoch_start", stage="train", msg=f"epoch {epoch}")

for epoch in range(epochs):
    loss = train_one_epoch(...)
    hook.metric("train_loss", loss, step=epoch)
    hook.metric("val_f1",   eval_f1, step=epoch)

hook.event("done", stage="train")

Metrics appear in xrun metrics <id> and the TUI in real time.

Budget guards

xrun launch exp/foo.yaml \
  --max-cost 5.0      \  # auto-destroy after $5 spent
  --max-hours 8       \  # auto-destroy after 8 hours
  --idle-timeout 30      # auto-destroy if GPU idle for 30 minutes

The background poll-daemon monitors spend and destroys the instance automatically, writing auto_destroyed_reason to the local DB.

Architecture

┌─────────────────────────────────────────────────────┐
│                 Local machine                        │
│                                                      │
│  xrun (Rust CLI)  ──spawn──▶  xrun-tui (Python)     │
│        │                           │                 │
│        ▼                           ▼                 │
│     xrun-core  ◀──────────  SQLite runs.db           │
│        │                                             │
│   ┌────┴────┐   ┌────────────┐                       │
│   │xrun-vast│   │xrun-kaggle │                       │
│   └────┬────┘   └─────┬──────┘                       │
└────────┼──────────────┼──────────────────────────────┘
         ▼              ▼
    vast.ai GPU    Kaggle Kernel
    /workspace/    output/

xrun-cli — command routing, user-facing UX
xrun-core — manifest types, SQLite schema, vendor trait
xrun-vast — vast.ai: provision, SSH upload, exec, poll, transfer
xrun-kaggle — Kaggle: kernel push, status poll, output download
xrun-poller — background daemon: events/metrics → SQLite, budget enforcement
xrun-mlflow — optional MLflow REST mirror for metric storage
xrun-tui (Python) — Textual TUI, reads SQLite, calls CLI via subprocess

Requirements

For the CLI:

vastai CLI (pip install vastai) — for vast.ai runs
kaggle CLI (pip install kaggle) — for Kaggle runs
SSH key configured for vast.ai (checked by xrun doctor)

For the TUI:

Python ≥ 3.11
pip for that Python. The install scripts can try ensurepip via --install-pip / -InstallPip.
For local development: pip install -e python/xrun_tui

Building from source:

Rust stable (≥ 1.75) — install via rustup

Documentation

File	Contents
`docs/CLI.md`	All subcommands, flags, exit codes
`docs/MANIFEST.md`	Full YAML schema with examples
`docs/TUI.md`	Screens, key bindings, widgets
`docs/ARCHITECTURE.md`	Components and data flow
`docs/EVENTS.md`	events.jsonl protocol + Python hook
`docs/STATE.md`	SQLite schema
`docs/ROADMAP.md`	Version history and backlog
`CHANGELOG.md`	Release notes

Status: v0.7.1

✅ vast.ai: provision, upload, exec, poll, pull, destroy
✅ Kaggle: kernel push, status poll, output download
✅ Live events and metrics in SQLite
✅ MLflow mirror (metrics + UI link)
✅ Budget guards (caps, auto-destroy, spend dashboard)
✅ Python Textual TUI: 16 screens, chord navigation, Tokyo Night theme
✅ xrun events --follow, xrun logs --follow
✅ Install scripts for macOS, Linux, Windows
✅ Agent skill (xrun install skill --codex / --claude)
✅ xrun sweep — hyperparameter grid

License

MIT — see LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 148 Commits
.claude		.claude
.github/workflows		.github/workflows
checkpoints		checkpoints
claude		claude
crates		crates
docs		docs
exp		exp
python		python
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
ISSUES.md		ISSUES.md
LICENSE		LICENSE
README.md		README.md
install.ps1		install.ps1
install.sh		install.sh
issue.md		issue.md
run_plans.sh		run_plans.sh
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

xrun

Install

macOS / Linux

Windows (PowerShell)

Specific version

From source

Python TUI

Updates

Agent skill (optional)

Quick start

Commands

Manifest

TUI

Training hook

Budget guards

Architecture

Requirements

Documentation

Status: v0.7.1

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

xrun

Install

macOS / Linux

Windows (PowerShell)

Specific version

From source

Python TUI

Updates

Agent skill (optional)

Quick start

Commands

Manifest

TUI

Training hook

Budget guards

Architecture

Requirements

Documentation

Status: v0.7.1

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages