vLLM batch inference pipeline for generating synthetic data on the Isambard AI supercomputer. Processes large-scale text generation workloads using SLURM array jobs with automatic parallel coordination across GPU workers.
Requires uv and Python 3.12+.
# Clone and install
cd /home/a5k/kyleobrien.a5k/isambard-batch-inference
# Install all dependencies (including dev/test tools)
uv sync --extra dev
# Download the default test model (required for SLURM integration tests)
# Tests auto-download this if not cached, but you can pre-download:
uv run python -c "from huggingface_hub import snapshot_download; snapshot_download('nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16')"

The uv sync command creates a .venv/ in the repo root with all dependencies, including PyTorch (cu129 for ARM aarch64), vLLM, and NCCL 2.27+.
# Run all tests (unit + SLURM integration)
uv run python -m pytest tests/ -v
# Run only unit tests (no GPU/SLURM required, <1s)
uv run python -m pytest tests/ -m "not slurm" -v
# Run only SLURM integration tests (submits real GPU jobs, ~8-10 min)
uv run python -m pytest tests/ -m slurm -v

The SLURM integration tests submit real sbatch jobs that:
- Run single-GPU inference and verify output files
- Run 2-worker array jobs and verify per-rank outputs
- Test input sampling (--num_samples)
- Verify SLURM log diagnostic markers
Input is a JSONL file where each line is an OpenAI /v1/chat/completions request. The model field is overridden by --model at runtime, so use PLACEHOLDER.
For reasoning tasks, use temperature=1.0 and top_p=1.0.
{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "user", "content": "Explain photosynthesis."}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}

Each line must have:
- custom_id: unique identifier for matching requests to responses
- method: always "POST"
- url: always "/v1/chat/completions"
- body.messages: list of {"role": "system"|"user"|"assistant", "content": "..."} messages
- body.max_completion_tokens: maximum tokens to generate (16384 recommended)
- body.temperature and body.top_p: sampling parameters
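A batch file in this shape can be produced with a few lines of Python. The sketch below is illustrative and is not taken from the repository's notebooks: the helper names `make_request` and `write_batch` and the `req-NNN` ID scheme are assumptions, and the defaults simply mirror the recommended values above.

```python
import json
from pathlib import Path


def make_request(custom_id, messages, max_completion_tokens=16384,
                 temperature=1.0, top_p=1.0):
    """Build one batch line as a dict (hypothetical helper, not part of the repo)."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # Overridden by --model at runtime, so a placeholder is enough.
            "model": "PLACEHOLDER",
            "messages": messages,
            "max_completion_tokens": max_completion_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


def write_batch(prompts, path):
    """Write one JSON object per line; return the number of lines written."""
    path = Path(path)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts, start=1):
            # Assumed ID scheme: req-001, req-002, ...
            request = make_request(
                f"req-{i:03d}", [{"role": "user", "content": prompt}]
            )
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `write_batch(["What is 2+2?", "Explain photosynthesis."], "batch.jsonl")` would write two lines equivalent to the example requests, minus the system message on the first.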
See example_notebooks/ for Jupyter notebooks that create batch files for various use cases.
# Single GPU job
sbatch submit_vllm_batch.sbatch \
--input /path/to/batch.jsonl \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust-remote-code
# Parallel across 4 GPUs (independent workers, different seeds)
sbatch --array=0-3 submit_vllm_batch.sbatch \
--input /path/to/batch.jsonl \
--experiment-id my_experiment \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust-remote-code \
--use_wandb
# Sample 100 entries from a large batch file
sbatch submit_vllm_batch.sbatch \
--input /path/to/large_batch.jsonl \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust-remote-code \
--num_samples 100

Results are written to /projects/a5k/public/data/vllm_batches/{experiment_id}/:
{experiment_id}/
├── input.jsonl # Copy of original input (from rank 0)
├── input_sampled.jsonl # Present if --num_samples was used
├── input_rewritten.jsonl # Batch file with model field substituted
├── results.jsonl # Generated completions (single job)
├── metadata.json # Timing, token counts, success rates (single job)
├── rank_0_results.jsonl # Worker 0 results (array jobs)
├── rank_0_metadata.json # Worker 0 metadata
├── rank_1_results.jsonl # Worker 1 results
├── rank_1_metadata.json
└── ...
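Given this layout, results from a single job or from several array workers can be gathered with a short script. The following is a minimal sketch, not code from the repository: `load_results` is a hypothetical helper, and it assumes each result line carries `custom_id`, `response.status_code`, and `response.body.choices[].message.content` as in the result format documented in this README.

```python
import json
from pathlib import Path


def load_results(experiment_dir):
    """Return {custom_id: [generated texts]} across all result files found."""
    experiment_dir = Path(experiment_dir)
    # Array jobs write rank_{N}_results.jsonl; single jobs write results.jsonl.
    files = sorted(experiment_dir.glob("rank_*_results.jsonl"))
    single = experiment_dir / "results.jsonl"
    if single.exists():
        files.append(single)

    texts = {}
    for path in files:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                response = record.get("response") or {}
                if response.get("status_code") != 200:
                    continue  # skip failed requests
                for choice in response.get("body", {}).get("choices", []):
                    texts.setdefault(record["custom_id"], []).append(
                        choice["message"]["content"]
                    )
    return texts
```

Because every rank processes the same input with a different seed, a `custom_id` may map to several generations; the sketch keeps them all in a list rather than overwriting.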
Each result line:
{
"custom_id": "req-001",
"response": {
"status_code": 200,
"body": {
"choices": [{"message": {"content": "Generated text..."}}]
}
}
}

| Argument | Default | Description |
|---|---|---|
| --input / -i | (required) | Path to input batch JSONL file |
| --model | (required) | HuggingFace model name or path |
| --experiment-id | auto-generated | Identifier for the output directory |
| --trust-remote-code | off | Allow HuggingFace remote code execution |
| --tensor-parallel-size / -tp | 1 | Number of GPUs for tensor parallelism |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use |
| --max-model-len | auto | Maximum context length |
| --dtype | auto | Model weight dtype |
| --num_samples | all | Randomly sample N entries from input |
| --rank | none | Worker rank (auto-set by SLURM array jobs) |
| --num_workers | none | Total workers (auto-set by SLURM array jobs) |
| --use_wandb | off | Enable Weights & Biases logging |
| --wandb_entity | geodesic | W&B entity |
| --wandb_project | Self-Fulfilling Model Organisms - Batch Inference | W&B project |
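The --rank and --num_workers defaults are described as auto-set for array jobs. One plausible way to derive them is from SLURM's standard array variables, `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT`. The sketch below is an assumption about how that fallback could work, not a description of how the repository does it, and `resolve_worker` is a hypothetical name. It also assumes the array starts at 0 (as in `--array=0-3`), so the task ID can double as the rank.

```python
import os


def resolve_worker(rank=None, num_workers=None, env=None):
    """Return (rank, num_workers), falling back to SLURM array variables.

    Explicit arguments win; otherwise values come from the environment.
    Both stay None for a plain (non-array) job.
    """
    env = os.environ if env is None else env
    if rank is None and "SLURM_ARRAY_TASK_ID" in env:
        # Assumes a zero-based array such as --array=0-3.
        rank = int(env["SLURM_ARRAY_TASK_ID"])
    if num_workers is None and "SLURM_ARRAY_TASK_COUNT" in env:
        num_workers = int(env["SLURM_ARRAY_TASK_COUNT"])
    return rank, num_workers
```

Passing `env` explicitly keeps the function easy to exercise outside a SLURM allocation.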
example_notebooks/ (create batch JSONL) --> sbatch submit_vllm_batch.sbatch --> run_vllm_batch.py --> vLLM engine --> Results JSONL
| File | Description |
|---|---|
| run_vllm_batch.py | Main inference script. Handles input preparation, model override, parallel worker coordination, token counting, result analysis, and W&B logging. |
| submit_vllm_batch.sbatch | SLURM submission script. Configures the GPU environment (NCCL, CUDA 9.0, venv activation) and dispatches run_vllm_batch.py. |
| cluster_status.sh | Utility that logs cluster status at job start. |
| example_notebooks/ | Jupyter notebooks that create batch input files and analyze results. |
run_vllm_batch.py orchestrates multi-GPU jobs via SLURM array tasks. Rank 0 is the leader:
- Rank 0 samples the input (if --num_samples is set), rewrites the batch file with the CLI --model, and drops marker files (.sampling_done, .rewrite_done)
- Other ranks poll for the marker files (up to 5 min for sampling, 60s for rewrite), then use rank 0's prepared input
- Each rank gets a unique seed (nanosecond timestamp + rank offset) for stochastic variation
- Results go to rank_{N}_results.jsonl and rank_{N}_metadata.json
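The marker-file handshake and per-rank seeding described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in run_vllm_batch.py: the function names `wait_for_marker` and `rank_seed`, the one-second poll interval, and the folding of the seed into 32 bits are all hypothetical choices.

```python
import time
from pathlib import Path


def wait_for_marker(marker, timeout_s, poll_s=1.0):
    """Poll until `marker` exists; raise TimeoutError after `timeout_s` seconds."""
    marker = Path(marker)
    deadline = time.monotonic() + timeout_s
    while not marker.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker} not found after {timeout_s}s")
        time.sleep(poll_s)
    return marker


def rank_seed(rank, now_ns=None):
    """Nanosecond timestamp plus rank offset, folded into 32 bits (assumption)."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    return (now_ns + rank) % (2**32)
```

A non-leader rank would call something like `wait_for_marker(out_dir / ".sampling_done", timeout_s=300)` and then `wait_for_marker(out_dir / ".rewrite_done", timeout_s=60)` before reading rank 0's prepared input, where `out_dir` stands for the experiment's output directory. Using `time.monotonic()` for the deadline keeps the timeout unaffected by wall-clock adjustments.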
Token counting strips reasoning traces before counting, supporting:
- <think>...</think> tags
- [BEGIN FINAL RESPONSE] markers
- Trailing <|end|> tokens

The full generation (including reasoning) is preserved in results; only the token count metric excludes reasoning.
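To make the stripping step concrete, here is a minimal sketch of how the three formats might be handled before the text is passed to a tokenizer. It is an illustration, not the repository's implementation: `strip_reasoning` is a hypothetical name, and the exact rules (for example, treating everything before `[BEGIN FINAL RESPONSE]` as reasoning) are assumptions.

```python
import re

# Non-greedy match so multiple <think> blocks are removed independently.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FINAL_MARKER = "[BEGIN FINAL RESPONSE]"
END_TOKEN = "<|end|>"


def strip_reasoning(text):
    """Return only the final-response portion of a generation."""
    text = THINK_RE.sub("", text)
    if FINAL_MARKER in text:
        # Assumption: everything before the marker is reasoning.
        text = text.split(FINAL_MARKER, 1)[1]
    text = text.strip()
    while text.endswith(END_TOKEN):
        text = text[: -len(END_TOKEN)].rstrip()
    return text
```

For example, `strip_reasoning("<think>2+2 is 4</think>The answer is 4.")` returns `"The answer is 4."`. An unclosed `<think>` tag is left untouched by this sketch; a real implementation would need to decide how to treat truncated generations.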
submit_vllm_batch.sbatch configures the GPU environment as follows:
- Forces NCCL v2.27.5 via LD_PRELOAD (system v2.21.5 causes ncclGroupSimulateEnd errors)
- Disables Triton/TorchDynamo (ARM architecture workaround)
- Sets HF_HUB_OFFLINE=1 to avoid API rate limits across parallel ranks
- Uses gcc-12/g++-12 compilers
# Check job status
squeue -u $USER | grep vllm-batch
# Tail logs
tail -f /projects/a5k/public/logs/vllm-batch/vllm-batch-<JOB_ID>_<ARRAY_TASK_ID>.out
# Check results
ls /projects/a5k/public/data/vllm_batches/<experiment_id>/

When --use_wandb is enabled, the pipeline logs throughput metrics, success rates, and a sample_generations table with 20 random samples split into reasoning and generation columns. Parallel ranks are grouped under the same experiment ID.

| Notebook | Description |
|---|---|
| generate_scheming_mcqa.ipynb | Creates batch files for scheming behavior MCQA data |
| generate_alignment_sft_mix.ipynb | Creates batch files for alignment SFT training data |
| generate_dan_textbook_mcqa.ipynb | Creates batch files for DAN textbook MCQA data |
| generate_miri_docs_mcqa.ipynb | Creates batch files for MIRI documentation MCQA data |
| analyze_synthetic_docs.ipynb | Analyzes and quality-checks generated synthetic documents |
| generations_playground.ipynb | Interactive exploration of generation outputs |