GeodesicResearch/isambard-batch-inference

Isambard Batch Inference

vLLM batch inference pipeline for generating synthetic data on the Isambard AI supercomputer. Processes large-scale text generation workloads using SLURM array jobs with automatic parallel coordination across GPU workers.

Installation

Requires uv and Python 3.12+.

# Enter your checkout of the repository
cd /home/a5k/kyleobrien.a5k/isambard-batch-inference

# Install all dependencies (including dev/test tools)
uv sync --extra dev

# Download the default test model (required for SLURM integration tests)
# Tests auto-download this if not cached, but you can pre-download:
uv run python -c "from huggingface_hub import snapshot_download; snapshot_download('nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16')"

The uv sync command creates a .venv/ in the repo root with all dependencies including PyTorch (cu129 for ARM aarch64), vLLM, and NCCL 2.27+.

Running Tests

# Run all tests (unit + SLURM integration)
uv run python -m pytest tests/ -v

# Run only unit tests (no GPU/SLURM required, <1s)
uv run python -m pytest tests/ -m "not slurm" -v

# Run only SLURM integration tests (submits real GPU jobs, ~8-10 min)
uv run python -m pytest tests/ -m slurm -v

The SLURM integration tests submit real sbatch jobs that:

  • Run single-GPU inference and verify output files
  • Run 2-worker array jobs and verify per-rank outputs
  • Test input sampling (--num_samples)
  • Verify SLURM log diagnostic markers

Creating Batch Files

Input is a JSONL file where each line is an OpenAI /v1/chat/completions request. The model field is overridden by --model at runtime, so use PLACEHOLDER.

For reasoning tasks, use temperature=1.0 and top_p=1.0.

{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "PLACEHOLDER", "messages": [{"role": "user", "content": "Explain photosynthesis."}], "max_completion_tokens": 16384, "temperature": 1.0, "top_p": 1.0}}

Each line must have:

  • custom_id — unique identifier used to match each response to its request
  • method — always "POST"
  • url — always "/v1/chat/completions"
  • body.messages — list of {"role": "system"|"user"|"assistant", "content": "..."} messages
  • body.max_completion_tokens — maximum tokens to generate (16384 recommended)
  • body.temperature and body.top_p — sampling parameters

See example_notebooks/ for Jupyter notebooks that create batch files for various use cases.
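
As a rough illustration of the format above, a batch file can be produced with a few lines of Python. This is a minimal sketch, not the code used in the notebooks: make_request and write_batch are hypothetical helper names, and the req-NNN numbering simply mirrors the examples above.

```python
import json
from pathlib import Path


def make_request(custom_id, user_prompt, system_prompt=None, max_completion_tokens=16384):
    """Build one batch line in the request format described above."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "PLACEHOLDER",  # overridden by --model at runtime
            "messages": messages,
            "max_completion_tokens": max_completion_tokens,
            "temperature": 1.0,
            "top_p": 1.0,
        },
    }


def write_batch(path, prompts):
    """Write one JSON request per line with unique, zero-padded custom_ids."""
    with Path(path).open("w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts, start=1):
            f.write(json.dumps(make_request(f"req-{i:03d}", prompt)) + "\n")
```

Calling write_batch("batch.jsonl", ["What is 2+2?", "Explain photosynthesis."]) would write two lines shaped like the examples above, minus the system message.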

Submitting Jobs

# Single GPU job
sbatch submit_vllm_batch.sbatch \
    --input /path/to/batch.jsonl \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code

# Parallel across 4 GPUs (independent workers, different seeds)
sbatch --array=0-3 submit_vllm_batch.sbatch \
    --input /path/to/batch.jsonl \
    --experiment-id my_experiment \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code \
    --use_wandb

# Sample 100 entries from a large batch file
sbatch submit_vllm_batch.sbatch \
    --input /path/to/large_batch.jsonl \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code \
    --num_samples 100

Output Structure

Results are written to /projects/a5k/public/data/vllm_batches/{experiment_id}/:

{experiment_id}/
├── input.jsonl                 # Copy of original input (from rank 0)
├── input_sampled.jsonl         # Present if --num_samples was used
├── input_rewritten.jsonl       # Batch file with model field substituted
├── results.jsonl               # Generated completions (single job)
├── metadata.json               # Timing, token counts, success rates (single job)
├── rank_0_results.jsonl        # Worker 0 results (array jobs)
├── rank_0_metadata.json        # Worker 0 metadata
├── rank_1_results.jsonl        # Worker 1 results
├── rank_1_metadata.json
└── ...
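
For downstream analysis it can be convenient to load every results file in an experiment directory at once. The sketch below assumes only the file names shown in the tree above; load_all_results is a hypothetical helper, not part of the repository. Because this section does not say whether ranks share custom_ids, the sketch keeps the rank next to each record rather than deduplicating.

```python
import json
import re
from pathlib import Path

RANK_FILE = re.compile(r"rank_(\d+)_results\.jsonl$")


def _read_jsonl(path):
    """Parse a JSONL file, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_all_results(experiment_dir):
    """Return (rank, record) pairs; rank is None for a single-job results.jsonl."""
    experiment_dir = Path(experiment_dir)
    pairs = []
    single = experiment_dir / "results.jsonl"
    if single.exists():
        pairs += [(None, rec) for rec in _read_jsonl(single)]
    # Sort numerically so that rank 10 comes after rank 2.
    rank_files = []
    for path in experiment_dir.glob("rank_*_results.jsonl"):
        match = RANK_FILE.search(path.name)
        if match:
            rank_files.append((int(match.group(1)), path))
    for rank, path in sorted(rank_files):
        pairs += [(rank, rec) for rec in _read_jsonl(path)]
    return pairs
```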

Each result line:

{
  "custom_id": "req-001",
  "response": {
    "status_code": 200,
    "body": {
      "choices": [{"message": {"content": "Generated text..."}}]
    }
  }
}
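
A result line in this shape can be unpacked as follows. This is a minimal sketch: extract_text is a hypothetical helper, and treating any status_code other than 200 as a failure is an assumption, since the error format is not described here.

```python
import json


def extract_text(result_line):
    """Return (custom_id, text) for a successful result, else (custom_id, None)."""
    record = json.loads(result_line)
    custom_id = record.get("custom_id")
    response = record.get("response") or {}
    # Assumption: anything other than HTTP 200 counts as a failed request.
    if response.get("status_code") != 200:
        return custom_id, None
    choices = response.get("body", {}).get("choices", [])
    if not choices:
        return custom_id, None
    return custom_id, choices[0]["message"]["content"]
```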

CLI Arguments

| Argument | Default | Description |
|---|---|---|
| --input / -i | (required) | Path to input batch JSONL file |
| --model | (required) | HuggingFace model name or path |
| --experiment-id | auto-generated | Identifier for the output directory |
| --trust-remote-code | off | Allow HuggingFace remote code execution |
| --tensor-parallel-size / -tp | 1 | Number of GPUs for tensor parallelism |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use |
| --max-model-len | auto | Maximum context length |
| --dtype | auto | Model weight dtype |
| --num_samples | all | Randomly sample N entries from input |
| --rank | none | Worker rank (auto-set by SLURM array jobs) |
| --num_workers | none | Total workers (auto-set by SLURM array jobs) |
| --use_wandb | off | Enable Weights & Biases logging |
| --wandb_entity | geodesic | W&B entity |
| --wandb_project | Self-Fulfilling Model Organisms - Batch Inference | W&B project |

Architecture

example_notebooks/ (create batch JSONL) --> sbatch submit_vllm_batch.sbatch --> run_vllm_batch.py --> vLLM engine --> Results JSONL

Files

| File | Description |
|---|---|
| run_vllm_batch.py | Main inference script. Handles input preparation, model override, parallel worker coordination, token counting, result analysis, and W&B logging. |
| submit_vllm_batch.sbatch | SLURM submission script. Configures the GPU environment (NCCL, CUDA compute capability 9.0, venv activation) and dispatches run_vllm_batch.py. |
| cluster_status.sh | Utility that logs cluster status at job start. |
| example_notebooks/ | Jupyter notebooks that create batch input files and analyze results. |

Parallel Worker Coordination

run_vllm_batch.py orchestrates multi-GPU jobs via SLURM array tasks. Rank 0 is the leader:

  1. Rank 0 samples the input (if --num_samples), rewrites the batch file with the CLI --model, and drops marker files (.sampling_done, .rewrite_done)
  2. Other ranks poll for marker files (up to 5 min for sampling, 60s for rewrite) then use rank 0's prepared input
  3. Each rank gets a unique seed (nanosecond timestamp + rank offset) for stochastic variation
  4. Results go to rank_{N}_results.jsonl and rank_{N}_metadata.json
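
The leader/follower pattern in the steps above can be sketched as below. This is an illustration under stated assumptions, not the code in run_vllm_batch.py: wait_for_marker, make_seed, and prepare_input are hypothetical names, the one-second poll interval is a guess, and folding the seed into 32 bits is an assumption about the accepted seed range.

```python
import time
from pathlib import Path


def wait_for_marker(marker, timeout_s, poll_s=1.0):
    """Poll until `marker` exists; raise TimeoutError after `timeout_s` seconds."""
    marker = Path(marker)
    deadline = time.monotonic() + timeout_s
    while not marker.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker} not created within {timeout_s}s")
        time.sleep(poll_s)


def make_seed(rank):
    """Nanosecond timestamp plus rank offset, folded into 32 bits (assumed range)."""
    return (time.time_ns() + rank) % (2**32)


def prepare_input(out_dir, rank, prepare_fn, timeout_s=60):
    """Rank 0 runs prepare_fn and drops the marker; other ranks wait for it."""
    marker = Path(out_dir) / ".rewrite_done"
    if rank == 0:
        prepare_fn()
        marker.touch()  # signal followers that the rewritten input is ready
    else:
        wait_for_marker(marker, timeout_s=timeout_s)
```

Marker files keep the coordination on the shared filesystem, so array tasks that start at different times need no direct communication with each other.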

Reasoning Trace Handling

Token counting strips reasoning traces before counting, supporting:

  • <think>...</think> tags
  • [BEGIN FINAL RESPONSE] markers
  • Trailing <|end|> tokens

The full generation (including reasoning) is preserved in results; only the token count metric excludes reasoning.
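
One way to strip the three formats listed above before counting tokens is sketched here. The exact rules in run_vllm_batch.py may differ: strip_reasoning is a hypothetical name, and keeping only the text after the first [BEGIN FINAL RESPONSE] marker is an assumption about how that marker is used. An unclosed <think> tag is left untouched by this sketch.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
FINAL_MARKER = "[BEGIN FINAL RESPONSE]"
END_TOKEN = "<|end|>"


def strip_reasoning(text):
    """Return only the final-answer portion of a generation."""
    # Drop every closed <think>...</think> span, including multi-line ones.
    text = THINK_BLOCK.sub("", text)
    # Assumption: everything before the first marker is reasoning.
    if FINAL_MARKER in text:
        text = text.split(FINAL_MARKER, 1)[1]
    # Remove any run of trailing <|end|> tokens.
    text = text.rstrip()
    while text.endswith(END_TOKEN):
        text = text[: -len(END_TOKEN)].rstrip()
    return text.strip()
```

The stripped text would then be passed to the tokenizer for counting, while the unmodified generation is what gets written to the results file.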

Environment Quirks (sbatch script)

  • Forces NCCL v2.27.5 via LD_PRELOAD (system v2.21.5 causes ncclGroupSimulateEnd errors)
  • Disables Triton/TorchDynamo (ARM architecture workaround)
  • Sets HF_HUB_OFFLINE=1 to avoid API rate limits across parallel ranks
  • Uses gcc-12/g++-12 compilers

Monitoring

# Check job status
squeue -u $USER | grep vllm-batch

# Tail logs
tail -f /projects/a5k/public/logs/vllm-batch/vllm-batch-<JOB_ID>_<ARRAY_TASK_ID>.out

# Check results
ls /projects/a5k/public/data/vllm_batches/<experiment_id>/

W&B Integration

When --use_wandb is enabled, the pipeline logs throughput metrics, success rates, and a sample_generations table with 20 random samples split into reasoning and generation columns. Parallel ranks are grouped under the same experiment ID.

Example Notebooks

| Notebook | Description |
|---|---|
| generate_scheming_mcqa.ipynb | Creates batch files for scheming behavior MCQA data |
| generate_alignment_sft_mix.ipynb | Creates batch files for alignment SFT training data |
| generate_dan_textbook_mcqa.ipynb | Creates batch files for DAN textbook MCQA data |
| generate_miri_docs_mcqa.ipynb | Creates batch files for MIRI documentation MCQA data |
| analyze_synthetic_docs.ipynb | Analyzes and quality-checks generated synthetic documents |
| generations_playground.ipynb | Interactive exploration of generation outputs |
