Whrlicht/vllm-continuum

vLLM with Continuum Scheduling

This repository contains a modified version of vLLM with Continuum scheduling support for improved inference performance.

Prerequisites

  • Python 3.8+
  • uv package manager
  • Hugging Face account with access token
  • GPU(s) with appropriate CUDA drivers

Installation

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .

# Install mini-swe-agent
cd mini-swe-agent
uv pip install -e .
uv pip install datasets
cd ..

# Install additional dependencies
uv pip install lmcache hf_transfer

# Log in to Hugging Face (required for model access)
hf auth login
# Enter your Hugging Face access token when prompted

Additional Setup: Set up sb-cli by following its setup instructions; it is required for the pass rate evaluation.

Usage

Starting the Server

Original vLLM Mode

Run vLLM with standard scheduling:

# Without CPU offload
vllm serve <MODEL_NAME> \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

# With CPU offload (requires lmcache)
LMCACHE_MAX_LOCAL_CPU_SIZE=<CPU_SIZE_GB> \
vllm serve <MODEL_NAME> \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

Continuum Scheduling Mode

Run vLLM with Continuum scheduling for optimized performance:

# Without CPU offload
vllm serve <MODEL_NAME> \
  --scheduling-policy continuum \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>

# With CPU offload (requires lmcache)
LMCACHE_MAX_LOCAL_CPU_SIZE=<CPU_SIZE_GB> \
vllm serve <MODEL_NAME> \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
  --scheduling-policy continuum \
  --tensor-parallel-size <NUM_GPUS> \
  --port <PORT_ID>
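When comparing the four configurations above (standard or Continuum scheduling, each with or without CPU offload), it can be convenient to generate the commands from one place. The following is a minimal sketch that uses only the flags and the environment variable shown above; the helper names `build_serve_command` and `as_shell` are hypothetical, and the port value 8000 is an assumed example rather than a setting taken from this repository.

```python
import json
import shlex


def build_serve_command(model, num_gpus, port, continuum=False, cpu_offload_gb=None):
    """Return (env, argv) for one of the four server configurations shown above."""
    env = {}
    argv = ["vllm", "serve", model]
    if cpu_offload_gb is not None:
        # CPU offload: the CPU memory budget (in GB) is passed via this variable.
        env["LMCACHE_MAX_LOCAL_CPU_SIZE"] = str(cpu_offload_gb)
        kv_config = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
        argv += ["--kv-transfer-config", json.dumps(kv_config, separators=(",", ":"))]
    if continuum:
        argv += ["--scheduling-policy", "continuum"]
    argv += ["--tensor-parallel-size", str(num_gpus), "--port", str(port)]
    return env, argv


def as_shell(env, argv):
    """Render the command as a single copy-pasteable shell line."""
    prefix = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in argv])


# Example: Continuum scheduling with a 200 GB CPU offload budget on 4 GPUs.
# The port (8000) is an assumed value for illustration.
env, argv = build_serve_command(
    "meta-llama/Llama-3.1-70B-Instruct",
    num_gpus=4,
    port=8000,
    continuum=True,
    cpu_offload_gb=200,
)
print(as_shell(env, argv))
```

Returning the environment and argument list separately keeps the result usable both for printing and for passing to `subprocess.Popen(argv, env={**os.environ, **env})`.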

Example:

# Run Llama-3.1-70B-Instruct with Continuum on 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --scheduling-policy continuum \
  --tensor-parallel-size 4
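Loading a 70B model can take several minutes, so it may help to wait until the server answers before launching an evaluation. The sketch below polls the `/health` endpoint of vLLM's OpenAI-compatible server using only the standard library. The function names, the timeout values, and the URL `http://localhost:8000` (vLLM's default port, since the example above sets no `--port`) are illustrative assumptions, not part of this repository.

```python
import time
import urllib.error
import urllib.request


def health_probe(base_url):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: server is not ready yet.
        return False


def wait_until_ready(probe, timeout_s=1800.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() every interval_s seconds until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (assumes the server runs on this host with the default port):
#   ready = wait_until_ready(lambda: health_probe("http://localhost:8000"))
#   print("server ready" if ready else "server did not become ready in time")
```

The probe is passed in as a callable so the waiting logic can be reused with a different readiness check.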

Evaluation

Running SWE-bench Evaluation

Note: The default evaluation setup uses meta-llama/Llama-3.1-70B-Instruct on 4 H100 GPUs. Mini-swe-agent may encounter issues with smaller or simpler models.

  1. Start the vLLM server (see Usage section above)

  2. Run the SWE-bench evaluation:

# Clear previous output before each run
rm -rf ./swebench_output

# Run evaluation (fixed concurrency)
mini-extra swebench \
  --model-class vllm \
  --model <MODEL_NAME> \
  --port <PORT_ID> \
  --subset verified \
  --split test \
  --workers 64 \
  --output ./swebench_output

# Run evaluation (Poisson arrival rate)
mini-extra swebench \
  --model-class vllm \
  --model <MODEL_NAME> \
  --port <PORT_ID> \
  --subset verified \
  --split test \
  --use-jps --jps 1.0 \
  --output ./swebench_output

Load Control Modes

| Mode    | Flag                | Description                      |
|---------|---------------------|----------------------------------|
| Workers | `--workers N`       | Fixed N concurrent jobs          |
| JPS     | `--use-jps --jps X` | Poisson process at X jobs/second |
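To make the JPS mode concrete: in a Poisson process with rate X jobs/second, the gaps between consecutive job submissions are independent exponential random variables with mean 1/X seconds. The sketch below generates such a schedule with Python's standard library. It illustrates the arrival model only and is not the mini-swe-agent implementation; the function name `poisson_arrival_times` is hypothetical.

```python
import random


def poisson_arrival_times(jps, num_jobs, seed=None):
    """Return num_jobs submission times (seconds from start) for a Poisson process at rate jps."""
    if jps <= 0:
        raise ValueError("jps must be positive")
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_jobs):
        t += rng.expovariate(jps)  # exponential gap with mean 1/jps seconds
        times.append(t)
    return times


# SWE-bench Verified contains 500 instances; at 1.0 jobs/second the
# submissions are spread over roughly 500 seconds.
times = poisson_arrival_times(jps=1.0, num_jobs=500, seed=0)
mean_gap = times[-1] / len(times)
print(f"{len(times)} jobs over {times[-1]:.0f} s, mean gap {mean_gap:.2f} s")
```

Unlike `--workers N`, which caps the number of jobs in flight, an arrival-rate schedule submits jobs regardless of how many are still running, so the number of concurrent jobs depends on how quickly the server completes them.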

Analyzing Results

Important: Terminate the vLLM server (Ctrl+C) before running the evaluation analysis.

# Analyze latency metrics
python continuum_exp/analyze.py \
  --output-dir <OUTPUT_DIRECTORY>

# Submit pass rate evaluation (use a unique run_id for each evaluation)
sb-cli submit swe-bench_verified test \
  --predictions_path swebench_output/preds.json \
  --run_id <UNIQUE_RUN_ID>
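Because each submission needs its own run_id, one option is to derive it from a timestamp and to check the predictions file before submitting. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the `continuum_run` prefix follows the example in this README, and the predictions file is only assumed to be valid JSON with one top-level entry per prediction.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def make_run_id(prefix="continuum_run"):
    """Build a run_id from a UTC timestamp plus a short random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}_{uuid.uuid4().hex[:6]}"


def count_predictions(path):
    """Load the predictions file and return the number of top-level entries."""
    data = json.loads(Path(path).read_text())
    return len(data)  # works for a JSON object or a JSON array


preds = Path("swebench_output/preds.json")
if preds.exists():
    print(f"{count_predictions(preds)} predictions, run_id={make_run_id()}")
else:
    print(f"{preds} not found; run the evaluation first")
```

The printed run_id can then be passed to `sb-cli submit` through `--run_id`.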

Configuration Parameters

| Parameter            | Description                               | Example                             |
|----------------------|-------------------------------------------|-------------------------------------|
| `<MODEL_NAME>`       | Hugging Face model identifier             | `meta-llama/Llama-3.1-70B-Instruct` |
| `<NUM_GPUS>`         | Number of GPUs for tensor parallelism     | `4`                                 |
| `<CPU_SIZE_GB>`      | CPU memory size in GB for KV cache offload | `200`                              |
| `<OUTPUT_DIRECTORY>` | Directory for analysis output             | `./continuum_exp/result`            |
| `<UNIQUE_RUN_ID>`    | Identifier for evaluation run             | `continuum_run_001`                 |
| `--workers`          | Fixed number of concurrent jobs           | `64`                                |
| `--use-jps`          | Enable Poisson arrival mode               | -                                   |
| `--jps`              | Jobs per second (with `--use-jps`)        | `1.0`                               |

About

Preview Code for Continuum Paper
