vLLM with Continuum Scheduling

This repository contains a modified version of vLLM with Continuum scheduling support for improved inference performance.

Prerequisites

Python 3.8+
uv package manager
Hugging Face account with access token
GPU(s) with appropriate CUDA drivers

Installation

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .

# Install mini-swe-agent
cd mini-swe-agent
uv pip install -e .
uv pip install datasets
cd ..

# Install additional dependencies
uv pip install lmcache hf_transfer

# Log in to Hugging Face (required for model access)
hf auth login
# Enter your Hugging Face access token when prompted

Additional Setup: Follow the instructions to set up sb-cli, which is required for pass rate evaluation.

Usage

Starting the Server

Original vLLM Mode

Run vLLM with standard scheduling:

# Without CPU offload
vllm serve <MODEL_NAME> \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

# With CPU offload (requires lmcache)
LMCACHE_MAX_LOCAL_CPU_SIZE=<CPU_SIZE_GB> \
vllm serve <MODEL_NAME> \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

Continuum Scheduling Mode

Run vLLM with Continuum scheduling for optimized performance:

# Without CPU offload
vllm serve <MODEL_NAME> \
  --scheduling-policy continuum \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

# With CPU offload (requires lmcache)
LMCACHE_MAX_LOCAL_CPU_SIZE=<CPU_SIZE_GB> \
vllm serve <MODEL_NAME> \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
  --scheduling-policy continuum \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

Example:

# Run Llama-3.1-70B-Instruct with Continuum on 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --scheduling-policy continuum \
  --tensor-parallel-size 4

Evaluation

Running SWE-bench Evaluation

Note: The default evaluation setup uses meta-llama/Llama-3.1-70B-Instruct on 4 H100 GPUs. Mini-swe-agent may encounter issues with smaller or simpler models.

Start the vLLM server (see Usage section above)
Run the SWE-bench evaluation:

# Clear previous output before each run
rm -rf ./swebench_output

# Run evaluation (fixed concurrency)
mini-extra swebench \
  --model-class vllm \
  --model <MODEL_NAME> \
  --port <PORT_ID> \
  --subset verified \
  --split test \
  --workers 64 \
  --output ./swebench_output

# Run evaluation (Poisson arrival rate)
mini-extra swebench \
  --model-class vllm \
  --model <MODEL_NAME> \
  --port <PORT_ID> \
  --subset verified \
  --split test \
  --use-jps --jps 1.0 \
  --output ./swebench_output

Load Control Modes

Mode	Flag	Description
Workers	`--workers N`	Fixed N concurrent jobs
JPS	`--use-jps --jps X`	Poisson process at X jobs/second

Analyzing Results

Important: Terminate the vLLM server (Ctrl+C) before running the evaluation analysis.

# Analyze latency metrics
python continuum_exp/analyze.py \
  --output-dir <OUTPUT_DIRECTORY>

# Submit pass rate evaluation (use a unique run_id for each evaluation)
sb-cli submit swe-bench_verified test \
  --predictions_path swebench_output/preds.json \
  --run_id <UNIQUE_RUN_ID>

Configuration Parameters

Parameter	Description	Example
`<MODEL_NAME>`	Hugging Face model identifier	`meta-llama/Llama-3.1-70B-Instruct`
`<NUM_GPUS>`	Number of GPUs for tensor parallelism	`4`
`<CPU_SIZE_GB>`	CPU memory size in GB for KV cache offload	`200`
`<OUTPUT_DIRECTORY>`	Directory for analysis output	`./continuum_exp/result`
`<UNIQUE_RUN_ID>`	Identifier for evaluation run	`continuum_run_001`
`--workers`	Fixed number of concurrent jobs	`64`
`--use-jps`	Enable Poisson arrival mode	-
`--jps`	Jobs per second (with `--use-jps`)	`1.0`

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.buildkite		.buildkite
.gemini		.gemini
.github		.github
benchmarks		benchmarks
cmake		cmake
continuum_exp		continuum_exp
csrc		csrc
docker		docker
docs		docs
examples		examples
mini-swe-agent		mini-swe-agent
output		output
queue_time		queue_time
requirements		requirements
step_time		step_time
test_scripts		test_scripts
tests		tests
tool_call_time		tool_call_time
tools		tools
vllm		vllm
.clang-format		.clang-format
.dockerignore		.dockerignore
.gitignore		.gitignore
.markdownlint.yaml		.markdownlint.yaml
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
.shellcheckrc		.shellcheckrc
.yapfignore		.yapfignore
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DCO		DCO
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
PULL_IMPLEMENTATION_REPORT.md		PULL_IMPLEMENTATION_REPORT.md
README.md		README.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
download.py		download.py
find_cuda_init.py		find_cuda_init.py
format.sh		format.sh
mkdocs.yaml		mkdocs.yaml
pyproject.toml		pyproject.toml
setup.py		setup.py
use_existing_torch.py		use_existing_torch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vLLM with Continuum Scheduling

Table of Contents

Prerequisites

Installation

Usage

Starting the Server

Original vLLM Mode

Continuum Scheduling Mode

Evaluation

Running SWE-bench Evaluation

Load Control Modes

Analyzing Results

Configuration Parameters

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vLLM with Continuum Scheduling

Table of Contents

Prerequisites

Installation

Usage

Starting the Server

Original vLLM Mode

Continuum Scheduling Mode

Evaluation

Running SWE-bench Evaluation

Load Control Modes

Analyzing Results

Configuration Parameters

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages