Delta-HPC is a reinforcement learning framework for dynamically managing NVIDIA Multi-Instance GPU (MIG) resources to serve heterogeneous LLM workloads. It trains an RL agent (Maskable PPO) that learns to split, merge, and transfer MIG slices across GPUs in real time, optimising latency and throughput for concurrent Coding Agent and RAG Agent workloads. The project also includes a discrete-event simulator for training and evaluation, as well as tooling for profiling vLLM performance parameters and benchmarking policies both in simulation and on real hardware.
- Environment Preparation
- Dataset Preparation
- Simulation Configuration
- Profiling
- RL Model Training
- Benchmarking (Simulation)
- Benchmarking (Actual Deployment)
| Requirement | Notes |
|---|---|
| OS | Ubuntu (22.04 or later recommended) |
| GPU | NVIDIA GPU(s) with MIG support (e.g. A100, H100, B200) |
| MIG mode | Must be enabled manually before using this repo |
| Python | ≥ 3.12 (managed via uv) |
| Package manager | uv |
| vLLM Docker image | Custom build vllm/vllm-openai:v0.17.0.custom (see below) |
MIG mode must be turned on for each target GPU before running any code in this repo. Replace <GPU_INDEX> with the actual GPU index (e.g. 0, 1, …):
sudo nvidia-smi -i <GPU_INDEX> -mig 1
# Verify
nvidia-smi -L
Repeat for every GPU you intend to manage. A system reboot or driver reload may be required on some machines.
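When several GPUs are managed, the per-GPU command can be generated rather than typed by hand. The following is a minimal sketch that only builds the argument lists; the helper name `mig_enable_commands` is hypothetical and not part of this repo.

```python
def mig_enable_commands(gpu_indices):
    """Return one nvidia-smi argument list per GPU index, enabling MIG mode."""
    commands = []
    for index in gpu_indices:
        # Mirrors the documented command: sudo nvidia-smi -i <GPU_INDEX> -mig 1
        commands.append(["sudo", "nvidia-smi", "-i", str(index), "-mig", "1"])
    return commands


# Print the commands for review; nothing is executed here.
for argv in mig_enable_commands([0, 1]):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the commands have been reviewed.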
This project uses uv as the package manager. Install it first if you haven't:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then create the virtual environment and install all dependencies (including the CUDA-enabled PyTorch build):
uv sync
All just recipes automatically prepend .venv/bin to PATH, so you can run them without explicitly activating the environment. To verify the environment is set up correctly:
just test-env
The deployment module launches vLLM as Docker containers (one per MIG slice). The image name is configured in docker-compose.yaml.
The vLLM image is not included in this repository. You are expected to build or pull a vLLM image that is compatible with your own machine (CUDA version, driver, GPU architecture), and update the image: field in docker-compose.yaml accordingly.
# docker-compose.yaml — change this line to match your image name
image: your-vllm-image-name:tag
Refer to the vLLM documentation for build instructions. This project targets vLLM v0.17.
About agents. This repository currently defines two agents: CodingAgent (serves coding-assistant LLM requests) and RAGAgent (serves retrieval-augmented generation requests). All dataset preparation, LLM configuration, profiling, and training are organised around these two agents.
If you want to add more agents, the following files must be modified:
| File | What to change |
|---|---|
| src/share/models.py | Add a new value to the AgentId enum |
| src/share/models.py | Extend EnvironmentStateData with per-agent ratio fields if needed |
| src/share/models.py | Add new ResourceManagerAction entries for the new GPU |
| src/simulation/environment_state.py | Update observation construction for the new agent |
| src/training/config.py | Add workload / request-rate configuration for the new agent |
| configs/simulation_config.yaml | Add the new agent block under simulation.agents |
| configs/deployment.yaml | Assign the new agent to a GPU slot |
Note: Adding more than two agents has not been tested. Proceed with caution and expect to debug edge cases in the simulator and RL environment.
Datasets are stored under assets/ and must be preprocessed before use. Two datasets are required: one for the Coding Agent workload and one for the RAG Agent workload.
Download URL: (fill in)
The raw dataset uses multi-round conversations. Preprocess it so that each conversation round becomes an independent row:
python -m src.dataset.coder_preprocess
# Output saved to: assets/processed_code_feedback/
Download URL: (fill in)
Convert the dataset (both train and test splits) to ShareGPT format:
python -m src.dataset.rag_convert_to_sharegpt \
--hf-path assets/rag-dataset-sharegpt
# Output saved to: assets/rag-dataset-sharegpt/
Before profiling, you must decide which LLMs will run on each MIG profile and prepare their configuration files. This is split across three types of files:
- configs/<model_name>.yaml: vLLM server configuration for each LLM.
- configs/gpus/<GPU_MODEL>.py: defines the MIG profile set for a GPU model (e.g. A100 40 GB, B200).
- configs/simulation_config.yaml: maps models to MIG profiles and holds measured hardware parameters (filled in after profiling).
Each LLM needs a YAML file consumed by the vLLM server. Create one file per model you want to serve, using the naming convention configs/<model_name>.yaml (underscores replace dots/hyphens as needed). The file is passed directly to vLLM via --config.
Template:
# configs/qwen2_5-7b-instruct.yaml
model: "Qwen/Qwen2.5-7B-Instruct" # HuggingFace model ID (or local path)
max_model_len: 32767 # Maximum sequence length (tokens)
max_num_batched_tokens: 4096 # Max tokens processed per iteration
gpu_memory_utilization: 0.9 # Fraction of GPU memory vLLM may use
Lower max_model_len and max_num_batched_tokens on smaller MIG slices if vLLM fails to start because the KV cache is too small.
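A quick way to judge whether max_model_len is realistic for a slice is to compare it with the number of tokens the KV cache can hold, since vLLM needs room for at least one full-length sequence. The sketch below is a rough estimate under stated assumptions: it uses the kv_per_token_KB and kv_cache_GB quantities described later in this document, treats 1 GB as 1024 × 1024 KB, and ignores vLLM's block rounding. The function names are hypothetical.

```python
def max_cached_tokens(kv_cache_gb: float, kv_per_token_kb: float) -> int:
    """Approximate number of tokens whose K/V entries fit in the KV cache."""
    return int(kv_cache_gb * 1024 * 1024 / kv_per_token_kb)


def fits_in_kv_cache(max_model_len: int, kv_cache_gb: float, kv_per_token_kb: float) -> bool:
    """True if a single sequence of max_model_len tokens fits in the KV cache."""
    return max_model_len <= max_cached_tokens(kv_cache_gb, kv_per_token_kb)


# Example values: 56 KB/token with either 6.45 GB or 1 GB of KV cache.
print(fits_in_kv_cache(32767, 6.45, 56))
print(fits_in_kv_cache(32767, 1.0, 56))
```

If the check fails for a small slice, reducing max_model_len until it passes is a reasonable starting point before retrying the vLLM launch.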
Each GPU model that appears in your cluster must have a corresponding Python file under configs/gpus/ that enumerates its supported MIG profiles. The file name (without .py) is the key used in simulation_config.yaml.
Example — A100 40 GB (configs/gpus/A100_40GB.py):
from src.share.models import MIGProfileBase, MIGProfile, ProfileInfo

class MIGProfileA100(MIGProfileBase):
    # ProfileInfo(compute_slices, memory_GB, logical_profile_type)
    MIG_7G_40GB = ProfileInfo(7, 40, MIGProfile.MIG_7G)
    MIG_4G_20GB = ProfileInfo(4, 20, MIGProfile.MIG_4G)
    MIG_3G_20GB = ProfileInfo(3, 20, MIGProfile.MIG_3G)
    MIG_2G_10GB = ProfileInfo(2, 10, MIGProfile.MIG_2G)
    MIG_1G_10GB = ProfileInfo(1, 10, MIGProfile.MIG_1G_LARGE)

    @property
    def gpu_model(self) -> str:
        return "A100_40GB"

    @classmethod
    def unsupported_profiles(cls):
        return []  # list any MIGProfile variants not supported on this GPU

MIG_PROFILE = MIGProfileA100
The string representation of each profile (e.g. "2g.10gb") is derived automatically from the ProfileInfo compute-slice and memory values and is the key used throughout simulation_config.yaml.
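To make the key format concrete, here is a minimal sketch of how such a string could be derived from the two numeric fields. It is an illustration only: `profile_string` is a hypothetical helper, and the repo's own derivation may differ in detail.

```python
def profile_string(compute_slices: int, memory_gb: int) -> str:
    """Build a MIG profile key such as '2g.10gb' from slice count and memory size."""
    return f"{compute_slices}g.{memory_gb}gb"


# The (compute_slices, memory_GB) pairs from the A100 40 GB example.
for slices, memory in [(7, 40), (4, 20), (3, 20), (2, 10), (1, 10)]:
    print(profile_string(slices, memory))
```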
configs/simulation_config.yaml is the central configuration file. Fill in the model section and the cluster/agent structure now, before profiling. The measured parameters (kv_cache_GB, restart_time, and param.*) are filled in after running the profiling steps in chapter 3.
model:
  <model-name>: # Must match the key used in agents below
    generate_path: profiling_results/generated/<generated>.jsonl
    # Path to the JSONL file produced by profile-generate (chapter 3)
    # (used to replay realistic output lengths in simulation)
    kv_per_token_KB: <value> # KV cache consumed per token, in kilobytes (see §2.3.1)
    vllm_config: configs/<model_name>.yaml # Path to the vLLM YAML config above
simulation:
  cluster:
    <gpu_index>: <GPU_MODEL> # e.g. 0: A100_40GB
    ...
  initial_state:
    gpu_initial_agents:
      <gpu_index>:
        - <AgentName> # Which agent occupies this GPU initially (currently only one agent per GPU is supported)
    permanent_engines: # Fixed engines not managed by the RL agent
      - gpu: <gpu_index>
        mig: <mig_profile_string> # e.g. 2g.10gb
        agent: <AgentName>
  agents:
    <AgentName>: # e.g. CodingAgent, RAGAgent
      <GPU_MODEL>: # e.g. A100_40GB
        mig:
          <mig_profile_string>: # e.g. 1g.10gb, 2g.10gb, 7g.40gb
            model: <model-name> # Which LLM runs on this profile
            kv_cache_GB: <value> # Measured available KV cache (see §2.3.2, fill after profiling)
            restart_time: <seconds> # Measured restart time (see §2.3.3, fill after profiling)
            param:
              prefill: # Fill after running profile-prefill (chapter 3)
                alpha: <seconds>
                beta: <seconds_per_token>
                sigma: <seconds>
              tpot: # Fill after running profile-tpot (chapter 3)
                alpha: <seconds>
                beta: <seconds_per_request>
                sigma: <seconds>
kv_per_token_KB is the KV cache memory consumed per token for a given model, in kilobytes. It depends on the model architecture, not on the MIG profile. Use the following formula:
kv_per_token_KB = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element / 1024
Where:
- 2 accounts for both the K and V tensors
- bytes_per_element = 2 for FP16/BF16, 1 for FP8
- All values are available in the model's config.json on HuggingFace
Example — Qwen2.5-7B-Instruct (BF16):
| Parameter | Value |
|---|---|
| num_hidden_layers | 28 |
| num_key_value_heads | 4 |
| head_dim (hidden_size / num_attention_heads) | 128 |
| bytes_per_element | 2 (BF16) |
kv_per_token_KB = 2 × 28 × 4 × 128 × 2 / 1024 = 56 KB/token
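The same arithmetic can be scripted so that it is easy to repeat for other models. This is a small sketch, assuming the usual HuggingFace config.json field names (num_hidden_layers, num_key_value_heads, hidden_size, num_attention_heads, and an optional head_dim); the function names are illustrative and not part of this repo.

```python
def kv_per_token_kb(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache per token in KB: 2 (K and V) x layers x KV heads x head_dim x bytes / 1024."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element / 1024


def kv_per_token_kb_from_config(config, bytes_per_element=2):
    """Same computation, reading the values from a parsed config.json dictionary."""
    # Fall back to hidden_size / num_attention_heads when head_dim is not listed.
    head_dim = config.get("head_dim") or config["hidden_size"] // config["num_attention_heads"]
    return kv_per_token_kb(
        config["num_hidden_layers"],
        config["num_key_value_heads"],
        head_dim,
        bytes_per_element,
    )


# Values for Qwen2.5-7B-Instruct in BF16, matching the table above.
qwen_config = {
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "hidden_size": 3584,
    "num_attention_heads": 28,
}
print(kv_per_token_kb_from_config(qwen_config))
```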
kv_cache_GB is the amount of GPU memory actually available for the KV cache on a specific (model, MIG profile) combination after vLLM loads the model weights. Measure it from vLLM's startup logs on the target MIG slice:
- Start vLLM on the target MIG slice with the desired model config.
- Search the startup logs for a line like the one below, and read the value directly from it.
GPU KV cache size: 6.45 GiB
Alternatively, you can compute it from the # GPU blocks line:
# GPU blocks: 1234, # CPU blocks: 0
kv_cache_GB = gpu_blocks × block_size × kv_per_token_KB / 1024 / 1024
where block_size is vLLM's block size in tokens (default: 16).
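The block-based calculation can be sketched as follows. The regular expression is written against the sample log line shown above and is an assumption: the exact log wording may vary between vLLM versions. The helper names are hypothetical.

```python
import re


def parse_gpu_blocks(log_line):
    """Extract the GPU block count from a line like '# GPU blocks: 1234, # CPU blocks: 0'."""
    match = re.search(r"# GPU blocks: (\d+)", log_line)
    if match is None:
        raise ValueError("no GPU block count found in log line")
    return int(match.group(1))


def kv_cache_gb(gpu_blocks, kv_per_token_kb, block_size=16):
    """kv_cache_GB = gpu_blocks x block_size x kv_per_token_KB / 1024 / 1024."""
    return gpu_blocks * block_size * kv_per_token_kb / 1024 / 1024


blocks = parse_gpu_blocks("# GPU blocks: 1234, # CPU blocks: 0")
print(kv_cache_gb(blocks, kv_per_token_kb=56))
```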
restart_time is the wall-clock time (in seconds) for a vLLM container to shut down and restart on a given MIG slice. It is used by the simulator to model the downtime cost of MIG reconfigurations.
Measure it by timing a full down-then-up cycle using scripts/launch_vllm.sh:
# scripts/launch_vllm.sh <MIG_UUID> <MODEL_ID> <PORT> <up|down|logs>
time scripts/launch_vllm.sh <MIG_UUID> <MODEL_ID> <PORT> down \
&& time scripts/launch_vllm.sh <MIG_UUID> <MODEL_ID> <PORT> up
Record the total elapsed time. Typical values are in the range of 60–90 seconds depending on model size and MIG memory.
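If you prefer a single number over two separate `time` outputs, the cycle can be timed from Python. This is a sketch under assumptions: it reuses the argument order of scripts/launch_vllm.sh shown above, and it only measures what you want if the `up` command returns once the server is actually ready, which is not confirmed here. The helper names are hypothetical.

```python
import subprocess
import time


def restart_commands(mig_uuid, model_id, port):
    """Build the down-then-up command pair for scripts/launch_vllm.sh."""
    script = "scripts/launch_vllm.sh"
    return [
        [script, mig_uuid, model_id, str(port), "down"],
        [script, mig_uuid, model_id, str(port), "up"],
    ]


def time_commands(commands):
    """Run each command in order and return the total wall-clock time in seconds."""
    start = time.perf_counter()
    for argv in commands:
        subprocess.run(argv, check=True)  # raises CalledProcessError if a step fails
    return time.perf_counter() - start


# Show the commands that would be timed; nothing is executed here.
for argv in restart_commands("<MIG_UUID>", "<MODEL_ID>", 8003):
    print(" ".join(argv))
```

Calling `time_commands(restart_commands(...))` with real values returns the total elapsed seconds, which is the figure to record as restart_time.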
With the LLM-to-MIG-profile assignments decided in chapter 2, this step measures the actual latency behaviour of each (model, MIG profile) combination and fits linear models for prefill (TTFT) and decoding (TPOT). The fitted parameters are then written back into configs/simulation_config.yaml under the appropriate param fields.
Profiling must be run for every model listed under the model: section of configs/simulation_config.yaml, once per MIG profile it is assigned to.
| Script | Entry point | Purpose |
|---|---|---|
| generate.py | src.profiling.generate | Query a live vLLM server with dataset prompts and save token usage + responses |
| prefill.py | src.profiling.prefill | Fit a linear model TTFT = α + β·x + ε to prefill latency data |
| tpot.py | src.profiling.tpot | Fit a linear model ITL = α + β·N + ε to decoding latency data across concurrency sweeps |
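For readers unfamiliar with the fit, the sketch below shows an ordinary least-squares estimate of α and β, with σ taken as the residual standard deviation. It is a generic illustration, not the code in src/profiling: the function name `fit_linear` and the n − 2 degrees of freedom used for σ are assumptions.

```python
import math


def fit_linear(xs, ys):
    """Least-squares fit of y = alpha + beta * x; returns alpha, beta, sigma, r_squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    # Residuals measure the part of the latency that the line does not explain.
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    sigma = math.sqrt(ss_res / (n - 2)) if n > 2 else 0.0
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return {"alpha": alpha, "beta": beta, "sigma": sigma, "r_squared": r_squared}


# Synthetic prefill data: prompt lengths (tokens) and noise-free TTFT values (seconds).
prompt_tokens = [100, 200, 400, 800]
ttft_seconds = [0.03 + 0.00011 * x for x in prompt_tokens]
print(fit_linear(prompt_tokens, ttft_seconds))
```

For the TPOT fit, the same function applies with the concurrency level N on the x axis and the inter-token latency on the y axis.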
Time Warning: Generating offline responses for thousands of requests can take several hours to days depending on the dataset size, model size, and hardware speed.
Start a vLLM server externally (e.g. via Docker), then run:
just profile-generate <PORT> <MODEL_NAME> <DATASET_DIR> <OUTPUT_DIR>
# Example:
just profile-generate 8003 qwen2.5-3b-instruct assets/rag-dataset-sharegpt profiling_results/generated
The input is a benchmark_detailed_results.json file produced by a vLLM benchmark tool (e.g. vllm benchmark_serving with --save-detailed-results):
just profile-prefill <INPUT_JSON> <OUTPUT_DIR>
# Example:
just profile-prefill profiling_results/raw/prefill-1g.10gb.json profiling_results/prefill
Run a concurrency sweep and collect one JSON per concurrency level. Then:
just profile-tpot <INPUT_DIR> <OUTPUT_DIR>
# Example:
just profile-tpot profiling_results/raw/tpot-1g.10gb/ profiling_results/tpot
After running the profiling pipeline, the profiling_results/ directory will look like:
profiling_results/
├── generated/ # Step 1 outputs
│ └── <model>-port-<port>-<dataset>-generated.jsonl
├── prefill/ # Step 2 outputs
│ ├── <benchmark_name>-param.json # Fitted α, β, σ parameters
│ └── <benchmark_name>-plot.png # Scatter plot with regression line
└── tpot/ # Step 3 outputs
├── <benchmark_name>-param.json # Fitted α, β, σ parameters
└── <benchmark_name>-plot.png # Scatter plot with regression line
Each -param.json file contains the following fields:
{
"alpha": 0.03,
"beta": 0.00011,
"sigma": 0.029,
"r_squared": 0.97,
"unit_alpha": "seconds",
"unit_beta": "seconds_per_token",
"model_formula": "y = beta * x + alpha + N(0, sigma^2)"
}
Then copy the parameter values into configs/simulation_config.yaml under the param fields of the appropriate MIG profile entry (see §2.3).
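The model_formula field can be read as a recipe for drawing one latency sample from the fitted parameters. The sketch below shows that reading; it is not the simulator's code, the function name `sample_latency` is hypothetical, and clamping at zero is an added assumption so that a noisy draw never yields a negative latency.

```python
import json
import random

# A trimmed copy of the example parameter file shown above.
PARAM_JSON = '{"alpha": 0.03, "beta": 0.00011, "sigma": 0.029}'


def sample_latency(params, x, rng):
    """Draw y = beta * x + alpha + N(0, sigma^2), clamped at zero."""
    noise = rng.gauss(0.0, params["sigma"])
    # Clamping is an assumption of this sketch, not something the fit guarantees.
    return max(0.0, params["beta"] * x + params["alpha"] + noise)


params = json.loads(PARAM_JSON)
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [sample_latency(params, 1000, rng) for _ in range(5)]
print(samples)
```

For the prefill model, x is the number of prompt tokens; for the TPOT model, it is the number of concurrent requests.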
Time Warning: Reinforcement learning requires simulating hundreds of thousands of environment steps. A full training run from scratch may take several days. Consider running it in the background using tmux or screen.
The RL agent is trained inside a discrete-event simulator. The main training code is in src/training/ and the simulator logic is in src/simulation/.
- Algorithm: Maskable PPO (sb3-contrib). Action masking prevents the agent from selecting physically impossible MIG reconfigurations.
- Environment: TrainingMIGResourceEnv wraps the simulator and drives it step-by-step, generating synthetic LLM request workloads.
- Simulator (src/simulation/): A discrete-event engine that models vLLM engines on MIG slices, request queueing, prefill/decoding latency (using the profiled parameters), and MIG reconfiguration overheads.
- Config: Edit configs/training_config.yaml to adjust hyperparameters, episode length, reward shaping, etc.
- Cluster: The training.cluster field in configs/training_config.yaml specifies which GPU model to simulate (e.g. A100_40GB). This cluster configuration is copied to configs/simulation_config.yaml automatically when training starts (via src/training/train.py). A similar sync occurs from bench_config.yaml or a snapshot during benchmarking (via src/bench/main.py).
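To illustrate what action masking does, the sketch below applies a boolean mask to a set of action scores before sampling, so that an invalid action can never be chosen. It is a conceptual, pure-Python example: sb3-contrib handles this inside the policy, and the way this repo builds its masks is not shown here. The function names and the three-action example are hypothetical.

```python
import math
import random


def masked_probabilities(logits, mask):
    """Softmax over logits where actions with mask=False get probability 0."""
    valid = [logit for logit, allowed in zip(logits, mask) if allowed]
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(valid)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) if allowed else 0.0
               for logit, allowed in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, mask, rng):
    """Sample an action index using only the valid actions."""
    probs = masked_probabilities(logits, mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Three candidate actions; the second has the highest score but is not allowed.
logits = [1.0, 5.0, 1.0]
mask = [True, False, True]
print(masked_probabilities(logits, mask))
print(sample_action(logits, mask, random.Random(0)))
```

Without the mask, the second action would dominate; with it, the probability mass is split between the two valid actions.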
Set the GPU index in the justfile (edit gpu := "" to e.g. gpu := "0") then run:
just train
To continue training from an existing checkpoint, pass its path:
just train <PATH_TO_CHECKPOINT.zip>
# Example:
just train results/20250501-120000-000/ckpts/20250501-120000-000/ppo_mig_resource_manager_5120_steps.zip
To sweep over hyperparameter configurations defined in a YAML file:
just grid-search configs/grid_search.yaml
Each training run produces a timestamped directory under results/:
results/
└── <run_id>/ # e.g. 20250501-120000-000
├── ckpts/
│ └── <run_id>/
│ ├── ppo_mig_resource_manager_<N>_steps.zip # Periodic checkpoints
│ ├── ppo_mig_resource_manager_<N>_steps_vecnormalize.pkl
│ ├── ppo_mig_resource_manager.zip # Final model
│ └── ppo_mig_resource_manager_vecnormalize.pkl # Final VecNormalize stats
├── logs/
│ └── train/ # Per-episode step logs (JSONL)
├── snapshots/
│ └── training_config.yaml # Snapshot of the config used for this run
└── tboards/
└── <run_id>/ # TensorBoard event files
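Given the naming scheme above, the most recent periodic checkpoint can be picked by its step count. The following is a small sketch based only on the file name pattern shown in the tree; it is not the repo's scripts/get_latest_ckpts.py, and `latest_checkpoint` is a hypothetical helper.

```python
import re

# Matches periodic checkpoints such as ppo_mig_resource_manager_5120_steps.zip
CKPT_PATTERN = re.compile(r"ppo_mig_resource_manager_(\d+)_steps\.zip$")


def latest_checkpoint(filenames):
    """Return the periodic checkpoint with the highest step count, or None."""
    best_name, best_steps = None, -1
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match and int(match.group(1)) > best_steps:
            best_name, best_steps = name, int(match.group(1))
    return best_name


files = [
    "ppo_mig_resource_manager_5120_steps.zip",
    "ppo_mig_resource_manager_5120_steps_vecnormalize.pkl",
    "ppo_mig_resource_manager_10240_steps.zip",
    "ppo_mig_resource_manager.zip",
]
print(latest_checkpoint(files))
```

The final model (ppo_mig_resource_manager.zip) has no step count in its name, so it is deliberately not matched by this pattern.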
Monitor training progress:
tensorboard --logdir results
Evaluate one or more RL checkpoints (and optional baselines) in the simulator. The benchmarking code is in src/bench/.
| Policy / flag | Description |
|---|---|
| --ckpt <path> | RL checkpoint(s) to evaluate |
| --bl static_no_mig | Baseline: single 7G instance per GPU (no MIG splitting) |
| --bl static_split_extreme | Baseline: maximum MIG splitting |
| --bl heuristic | Baseline: rule-based heuristic agent |
| --bl all | Run all three baselines |
# Evaluate a specific checkpoint
just bench results/<run_id>/ckpts/<run_id>/ppo_mig_resource_manager.zip
# Evaluate multiple checkpoints
just bench <ckpt1.zip> <ckpt2.zip>
# Evaluate baselines only
just bench-bl all
# Evaluate latest checkpoints + all baselines (uses scripts/get_latest_ckpts.py)
just bench-all
Results are written under the corresponding results/<run_id>/bench/ directory:
results/
└── <run_id>/
└── bench/
├── results_<run_name>.txt # Printed metrics table (TTFT, TPOT, queue length, MIG usage, …)
└── figs/
└── <run_name>/
├── split.png # Workload timeline annotated with Split events
├── merge.png # Workload timeline annotated with Merge events
└── transfer.png # Workload timeline annotated with Transfer events
Deploy and benchmark a policy on real hardware. The deployment code is in src/deploy/. This module:
- Configures physical MIG instances on the GPUs.
- Launches one vLLM Docker container per MIG slice (via docker-compose).
- Dispatches real LLM requests according to a configurable workload pattern.
- Runs the policy agent (RL, heuristic, or static) in a control loop.
- Exposes a live web dashboard for monitoring.
Note: This requires sudo because MIG reconfiguration (via nvidia-smi) needs root privileges. The justfile recipe uses sudo .venv/bin/python3 to avoid root-owned file artefacts while still having the required permissions.
Edit configs/deployment.yaml to declare which GPUs are managed and which are reserved for permanent engines. Edit configs/bench_config.yaml to configure the workload pattern.
# Run with an RL checkpoint
just deploy-bench <PATH_TO_CHECKPOINT.zip> <DURATION_SECONDS>
# Example:
just deploy-bench results/20250501-120000-000/ckpts/20250501-120000-000/ppo_mig_resource_manager.zip $((12*60*60))
# Run with the rule-based heuristic
just deploy-bench heuristic $((12*60*60))
# Run with a static single-instance (no MIG) baseline
just deploy-bench static-7g $((12*60*60))
# Run with a static maximally-split baseline
just deploy-bench static-2g $((12*60*60))
Adjust the logging verbosity with the optional third argument (default INFO):
just deploy-bench heuristic $((12*60*60)) DEBUG
After a run (or after a crash), clean up any lingering Docker containers and MIG instances:
just clean-deploy
While a deployment benchmark is running, a live web dashboard is served (host and port configured in configs/deployment.yaml). Use the watch command to monitor the dashboard:
watch -n1 "curl -s http://localhost:9000" # Dashboard port is configurable in configs/deployment.yaml
There are a few extra just recipes included for development and debugging:
Runs a mock version of the simulator with a random action policy and no Stable Baselines overhead (python -m src.simulation.main). This is useful to rapidly debug the discrete-event logic or verify that metrics are tracked correctly before kicking off a multi-day training run.
just mock-train
Deletes all generated log files and output from mock runs.
just clean-logs
Runs ruff over the src/ directory to automatically fix and format Python code.
just lint