Isambard Batch Inference

vLLM batch inference pipeline for generating synthetic data on the Isambard AI supercomputer. Processes large-scale text generation workloads using SLURM array jobs with automatic parallel coordination across GPU workers.

Installation

Requires uv and Python 3.12+.

# Change into the repository checkout (the path below is specific to the author's account)
cd /home/a5k/kyleobrien.a5k/isambard-batch-inference

# Install all dependencies (including dev/test tools)
uv sync --extra dev

# Download the default test model (required for SLURM integration tests)
# Tests auto-download this if not cached, but you can pre-download:
uv run python -c "from huggingface_hub import snapshot_download; snapshot_download('nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16')"

The uv sync command creates a .venv/ in the repo root with all dependencies including PyTorch (cu129 for ARM aarch64), vLLM, and NCCL 2.27+.

Running Tests

# Run all tests (unit + SLURM integration)
uv run python -m pytest tests/ -v

# Run only unit tests (no GPU/SLURM required, <1s)
uv run python -m pytest tests/ -m "not slurm" -v

# Run only SLURM integration tests (submits real GPU jobs, ~8-10 min)
uv run python -m pytest tests/ -m slurm -v

The SLURM integration tests submit real sbatch jobs that:

Run single-GPU inference and verify output files
Run 2-worker array jobs and verify per-rank outputs
Test input sampling (--num_samples)
Verify SLURM log diagnostic markers

Creating Batch Files

Input is a JSONL file where each line is an OpenAI /v1/chat/completions request. The model field is overridden by --model at runtime, so use PLACEHOLDER.

For reasoning tasks, use temperature=1.0 and top_p=1.0.

{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "user", "content": "Explain photosynthesis."}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}

Each line must have:

custom_id: unique identifier used to match each response to its request
method: always "POST"
url: always "/v1/chat/completions"
body.messages: list of {"role": "system"|"user"|"assistant", "content": "..."} messages
body.max_completion_tokens: maximum number of tokens to generate (16384 recommended)
body.temperature and body.top_p: sampling parameters

See example_notebooks/ for Jupyter notebooks that create batch files for various use cases.
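
As a minimal sketch of the format described above, the following Python writes a batch file with one request per line. The helper names (`make_request`, `write_batch`) and the `req-NNN` numbering scheme are illustrative assumptions, not part of this repository; the notebooks may build their files differently.

```python
import json
from pathlib import Path


def make_request(custom_id, user_prompt, system_prompt=None, max_completion_tokens=16384):
    """Build one batch line in the OpenAI /v1/chat/completions request shape."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "PLACEHOLDER",  # overridden by --model at runtime
            "messages": messages,
            "max_completion_tokens": max_completion_tokens,
            "temperature": 1.0,
            "top_p": 1.0,
        },
    }


def write_batch(path, prompts):
    """Write one JSON object per line and return the number of requests written."""
    # Hypothetical id scheme: req-001, req-002, ... (unique within the file).
    requests = [make_request(f"req-{i:03d}", p) for i, p in enumerate(prompts, start=1)]
    with Path(path).open("w", encoding="utf-8") as f:
        for request in requests:
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
    return len(requests)
```

Calling `write_batch("batch.jsonl", ["What is 2+2?", "Explain photosynthesis."])` would produce a two-line file that can be passed to `--input`.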

Submitting Jobs

# Single GPU job
sbatch submit_vllm_batch.sbatch \
    --input /path/to/batch.jsonl \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code

# Parallel across 4 GPUs (independent workers, different seeds)
sbatch --array=0-3 submit_vllm_batch.sbatch \
    --input /path/to/batch.jsonl \
    --experiment-id my_experiment \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code \
    --use_wandb

# Sample 100 entries from a large batch file
sbatch submit_vllm_batch.sbatch \
    --input /path/to/large_batch.jsonl \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code \
    --num_samples 100

Output Structure

Results are written to /projects/a5k/public/data/vllm_batches/{experiment_id}/:

{experiment_id}/
├── input.jsonl                 # Copy of original input (from rank 0)
├── input_sampled.jsonl         # Present if --num_samples was used
├── input_rewritten.jsonl       # Batch file with model field substituted
├── results.jsonl               # Generated completions (single job)
├── metadata.json               # Timing, token counts, success rates (single job)
├── rank_0_results.jsonl        # Worker 0 results (array jobs)
├── rank_0_metadata.json        # Worker 0 metadata
├── rank_1_results.jsonl        # Worker 1 results
├── rank_1_metadata.json
└── ...

Each result line:

{
  "custom_id": "req-001",
  "response": {
    "status_code": 200,
    "body": {
      "choices": [{"message": {"content": "Generated text..."}}]
    }
  }
}
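
The sketch below shows one way to read these files back in Python, covering both the single-job layout (`results.jsonl`) and the array-job layout (`rank_{N}_results.jsonl`). The function names are hypothetical, and it assumes that failed requests carry a `status_code` other than 200; the actual failure record shape is not documented here. Because array workers run independently with different seeds, a `custom_id` may map to several completions.

```python
import json
from collections import defaultdict
from pathlib import Path


def iter_results(experiment_dir):
    """Yield (file name, parsed result line) for every results file present."""
    root = Path(experiment_dir)
    files = sorted(root.glob("rank_*_results.jsonl"))
    single = root / "results.jsonl"
    if single.exists():
        files.append(single)
    for path in files:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield path.name, json.loads(line)


def completions_by_id(experiment_dir):
    """Group generated text by custom_id, keeping only status_code 200 responses."""
    grouped = defaultdict(list)
    for _, record in iter_results(experiment_dir):
        response = record.get("response") or {}
        if response.get("status_code") != 200:  # assumed failure convention
            continue
        for choice in response.get("body", {}).get("choices", []):
            grouped[record["custom_id"]].append(choice["message"]["content"])
    return dict(grouped)
```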

CLI Arguments

Argument	Default	Description
`--input` / `-i`	(required)	Path to input batch JSONL file
`--model`	(required)	HuggingFace model name or path
`--experiment-id`	auto-generated	Identifier for the output directory
`--trust-remote-code`	off	Allow HuggingFace remote code execution
`--tensor-parallel-size` / `-tp`	`1`	Number of GPUs for tensor parallelism
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory to use
`--max-model-len`	auto	Maximum context length
`--dtype`	`auto`	Model weight dtype
`--num_samples`	all	Randomly sample N entries from input
`--rank`	none	Worker rank (auto-set by SLURM array jobs)
`--num_workers`	none	Total workers (auto-set by SLURM array jobs)
`--use_wandb`	off	Enable Weights & Biases logging
`--wandb_entity`	`geodesic`	W&B entity
`--wandb_project`	`Self-Fulfilling Model Organisms - Batch Inference`	W&B project

Architecture

example_notebooks/ (create batch JSONL) --> sbatch submit_vllm_batch.sbatch --> run_vllm_batch.py --> vLLM engine --> Results JSONL

Files

File	Description
`run_vllm_batch.py`	Main inference script. Handles input preparation, model override, parallel worker coordination, token counting, result analysis, and W&B logging.
`submit_vllm_batch.sbatch`	SLURM submission script. Configures the GPU environment (NCCL, CUDA compute capability 9.0, venv activation) and dispatches `run_vllm_batch.py`.
`cluster_status.sh`	Utility that logs cluster status at job start.
`example_notebooks/`	Jupyter notebooks that create batch input files and analyze results.

Parallel Worker Coordination

run_vllm_batch.py orchestrates multi-GPU jobs via SLURM array tasks. Rank 0 is the leader:

Rank 0 samples the input (if --num_samples), rewrites the batch file with the CLI --model, and drops marker files (.sampling_done, .rewrite_done)
Other ranks poll for marker files (up to 5 min for sampling, 60s for rewrite) then use rank 0's prepared input
Each rank gets a unique seed (nanosecond timestamp + rank offset) for stochastic variation
Results go to rank_{N}_results.jsonl and rank_{N}_metadata.json
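
The coordination steps above can be sketched roughly as follows. This is not the code in `run_vllm_batch.py`: the function names, the one-second poll interval, and the folding of the seed into 32 bits are assumptions made for illustration. Only the marker file names and the timeouts (5 min and 60s) come from the description above.

```python
import time
from pathlib import Path


def wait_for_marker(marker, timeout_s, poll_interval_s=1.0):
    """Block until `marker` exists; raise TimeoutError after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while not Path(marker).exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker} did not appear within {timeout_s}s")
        time.sleep(poll_interval_s)


def worker_seed(rank):
    """Nanosecond timestamp plus rank offset, folded into 32 bits (assumption)."""
    return (time.time_ns() + rank) % (2**32)


def prepare_input(output_dir, rank, used_sampling):
    """Rank 0 drops the marker files; every other rank waits for them."""
    out = Path(output_dir)
    if rank == 0:
        if used_sampling:
            # (sampling of the input would happen here)
            (out / ".sampling_done").touch()
        # (rewriting the model field would happen here)
        (out / ".rewrite_done").touch()
    else:
        if used_sampling:
            wait_for_marker(out / ".sampling_done", timeout_s=300)
        wait_for_marker(out / ".rewrite_done", timeout_s=60)
    return out / "input_rewritten.jsonl"
```

Marker files on a shared filesystem keep the workers decoupled: array tasks may start at different times, and no rank needs a network connection to another.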

Reasoning Trace Handling

Token counting strips reasoning traces before counting, supporting:

<think>...</think> tags
[BEGIN FINAL RESPONSE] markers
Trailing <|end|> tokens

The full generation (including reasoning) is preserved in results; only the token count metric excludes reasoning.
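
A minimal sketch of this kind of stripping is shown below. The exact rules in `run_vllm_batch.py` may differ: here it is assumed that everything before the last `[BEGIN FINAL RESPONSE]` marker counts as reasoning, and the whitespace tokenizer is only a stand-in for the model tokenizer. The names `strip_reasoning` and `count_answer_tokens` are hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
FINAL_MARKER = "[BEGIN FINAL RESPONSE]"
END_TOKEN = "<|end|>"


def strip_reasoning(text):
    """Return only the final-answer portion of a generation."""
    # Drop complete <think>...</think> blocks, including multi-line ones.
    text = THINK_BLOCK.sub("", text)
    # Keep only what follows the last final-response marker, if one is present.
    if FINAL_MARKER in text:
        text = text.rsplit(FINAL_MARKER, 1)[1]
    text = text.strip()
    # Remove any trailing end tokens.
    while text.endswith(END_TOKEN):
        text = text[: -len(END_TOKEN)].rstrip()
    return text


def count_answer_tokens(text, tokenize=str.split):
    """Count tokens in the stripped answer; `tokenize` is any str -> list callable."""
    return len(tokenize(strip_reasoning(text)))
```

The original string is left untouched, which mirrors the behavior described above: the stored result keeps the reasoning, and only the count is computed on the stripped text.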

Environment Quirks (sbatch script)

Forces NCCL v2.27.5 via LD_PRELOAD (system v2.21.5 causes ncclGroupSimulateEnd errors)
Disables Triton/TorchDynamo (ARM architecture workaround)
Sets HF_HUB_OFFLINE=1 to avoid API rate limits across parallel ranks (the model must therefore already be in the local Hugging Face cache)
Uses gcc-12/g++-12 compilers

Monitoring

# Check job status
squeue -u $USER | grep vllm-batch

# Tail logs
tail -f /projects/a5k/public/logs/vllm-batch/vllm-batch-<JOB_ID>_<ARRAY_TASK_ID>.out

# Check results
ls /projects/a5k/public/data/vllm_batches/<experiment_id>/

W&B Integration

When --use_wandb is enabled, the pipeline logs throughput metrics, success rates, and a sample_generations table with 20 random samples split into reasoning and generation columns. Parallel ranks are grouped under the same experiment ID.
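
To illustrate how such a table could be assembled, the sketch below picks up to 20 random results and splits each generation into a reasoning part and a generation part. It is an assumption-laden outline rather than the pipeline's code: splitting on the last `</think>` tag, the `custom_id` column, and both function names are hypothetical.

```python
import random


def split_generation(text):
    """Split on the last </think>; text without the tag is all generation."""
    head, sep, tail = text.rpartition("</think>")
    if not sep:
        return "", text.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


def sample_rows(results, k=20, seed=None):
    """Pick up to k random results and return [custom_id, reasoning, generation] rows."""
    rng = random.Random(seed)
    chosen = rng.sample(results, min(k, len(results)))
    rows = []
    for record in chosen:
        content = record["response"]["body"]["choices"][0]["message"]["content"]
        reasoning, generation = split_generation(content)
        rows.append([record["custom_id"], reasoning, generation])
    return rows
```

Rows in this shape could then be passed to `wandb.Table(columns=["custom_id", "reasoning", "generation"], data=rows)` and logged under the `sample_generations` key.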

Example Notebooks

Notebook	Description
`generate_scheming_mcqa.ipynb`	Creates batch files for scheming behavior MCQA data
`generate_alignment_sft_mix.ipynb`	Creates batch files for alignment SFT training data
`generate_dan_textbook_mcqa.ipynb`	Creates batch files for DAN textbook MCQA data
`generate_miri_docs_mcqa.ipynb`	Creates batch files for MIRI documentation MCQA data
`analyze_synthetic_docs.ipynb`	Analyzes and quality-checks generated synthetic documents
`generations_playground.ipynb`	Interactive exploration of generation outputs
