A flexible development environment for vLLM that provides:
- Single-machine tensor parallelism and distributed inference
- Python-based configuration management
- Dynamic model deployment system
- Comprehensive parameter customization
- RDMA-optimized networking
The environment uses a YAML-based registry for model configurations, supporting all vLLM parameters and features through a robust Python launcher.
- GPUs: 2x NVIDIA GeForce RTX 3090 Ti (24GB VRAM each)
- CUDA Version: 12.8
- Driver: 570.172.08
- Network: RDMA over Converged Ethernet (RoCE)
- Bandwidth: Up to 12 GB/s inter-node communication
- Requirements:
  - RDMA-capable NICs (e.g., Mellanox ConnectX)
  - GPUDirect RDMA support
  - High-bandwidth interconnect
For detailed distributed setup information, see multi_node_gpu_cluster_with_rdma/setup.md.
This project was initialized using create_project.sh at the root.
# Install all dependencies including dev tools
uv sync --all-extras
# Or install without dev dependencies
uv sync
Models are configured through docker/config/model_registry.yml. The configuration system supports any vLLM parameter: each YAML key is automatically converted to a CLI argument (with a -- prefix).
Common parameters include:
model: "path/to/model" # --model
dtype: "bfloat16" # --dtype
tensor-parallel-size: 2 # --tensor-parallel-size
gpu-memory-utilization: 0.35 # --gpu-memory-utilization
max-num-seqs: 8 # --max-num-seqs
max-model-len: 131072 # --max-model-len
trust-remote-code: true # --trust-remote-code (flag only if true)
enable-prefix-caching: true # --enable-prefix-caching (flag only if true)
description: "..." # (metadata, not passed to vLLM)
Example registry entry:
gpt-oss-20b:
model: openai/gpt-oss-20b
dtype: bfloat16
tensor-parallel-size: 2
gpu-memory-utilization: 0.35
max-num-seqs: 8
max-model-len: 131072
swap-space: 4 # Additional vLLM parameter
max-num-batched-tokens: 8192 # Additional vLLM parameter
enable-prefix-caching: true # Optional feature flag
description: Lightweight
Any valid vLLM parameter can be added to the configuration. The launcher will automatically convert all parameters (except description) into appropriate command-line arguments.
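The key-to-flag conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in docker/config/launch.py: the function name build_vllm_args and the METADATA_KEYS set are hypothetical, and the boolean handling follows the "flag only if true" comments in the parameter list.

```python
from typing import Any

# Assumption: only "description" is metadata, as stated in the text above.
METADATA_KEYS = {"description"}


def build_vllm_args(entry: dict[str, Any]) -> list[str]:
    """Turn one registry entry into a flat list of CLI arguments."""
    args: list[str] = []
    for key, value in entry.items():
        if key in METADATA_KEYS:
            continue  # metadata is never passed to vLLM
        flag = f"--{key}"
        if isinstance(value, bool):
            # Booleans become bare flags, emitted only when true.
            if value:
                args.append(flag)
            continue
        args.extend([flag, str(value)])
    return args


# Hypothetical entry mirroring the registry example.
entry = {
    "model": "openai/gpt-oss-20b",
    "dtype": "bfloat16",
    "tensor-parallel-size": 2,
    "enable-prefix-caching": True,
    "description": "Lightweight",
}
print(" ".join(build_vllm_args(entry)))
```

Keeping the conversion generic is what lets new vLLM options be added to the YAML without touching the launcher.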
The Python-based launch system (docker/config/launch.py) provides:
- Robust YAML configuration parsing
- Validation of model settings
- Dry-run capability for testing
- User-friendly error messages
- Available models listing
Usage:
# Normal launch
MODEL_NAME=gpt-oss-20b python launch.py
# Validate configuration
MODEL_NAME=gpt-oss-20b python launch.py --dry-run
# See error messages and available models
MODEL_NAME=invalid-model python launch.py
- Set the model in your environment:
# In .env file
MODEL_NAME=gpt-oss-20b # Must match an entry in model_registry.yml
- Launch the service:
cd docker/head # or docker/worker for distributed setup
docker compose up -d
The Python launcher will automatically load and validate the configuration before starting the model.
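The validation step can be pictured as a registry lookup that fails with a readable message listing the available models. The sketch below is an assumption about the general shape, not the launcher's actual code: resolve_model and the sample registry (including the phi3-mini entry) are hypothetical.

```python
from typing import Any


def resolve_model(registry: dict[str, dict[str, Any]], name: str) -> dict[str, Any]:
    """Return the registry entry for name, or raise with a helpful message."""
    if not name:
        raise ValueError("MODEL_NAME is not set")
    if name not in registry:
        available = ", ".join(sorted(registry))
        raise LookupError(f"Unknown model '{name}'. Available models: {available}")
    return registry[name]


# Hypothetical registry contents, standing in for a parsed model_registry.yml.
registry = {
    "gpt-oss-20b": {"model": "openai/gpt-oss-20b", "tensor-parallel-size": 2},
    "phi3-mini": {"model": "microsoft/Phi-3-mini-4k-instruct"},
}

print(resolve_model(registry, "gpt-oss-20b")["model"])
try:
    resolve_model(registry, "invalid-model")
except LookupError as exc:
    print(exc)
```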
# Run commands with uv (recommended)
uv run python script.py
# Or activate the virtual environment
source .venv/bin/activate
├── configs/                              # Configuration files
│   ├── example_model.yaml                # Template configuration
│   ├── llama-70b-tp2.yaml                # Tensor parallel config for Llama 70B
│   ├── llama-7b.yaml                     # Single GPU config for Llama 7B
│   ├── mcp_servers.yaml                  # MCP servers configuration
│   └── phi3-mini-with-terminal.yaml      # Configuration for Phi-3 model
├── docker/                               # Docker deployment configurations
│   ├── config/                           # Configuration directory
│   │   ├── launch.py                     # Python-based model launcher
│   │   └── model_registry.yml            # Model configurations
│   ├── head/                             # Ray head node setup
│   │   ├── docker-compose.yml            # Head node container config
│   │   ├── .env.example                  # Environment template for head
│   │   └── README.md                     # Head node documentation
│   ├── stand_alone/                      # Single node deployment
│   │   └── docker-compose.yml            # Standalone container config
│   └── worker/                           # Ray worker node setup
│       ├── docker-compose.yml            # Worker node container config
│       ├── .env.exmple                   # Environment template for worker
│       └── README.md                     # Worker node documentation
├── mixvllm_server/                       # Core server implementation
│   ├── src/                              # Source code
│   │   ├── cli/                          # Command-line interface
│   │   ├── config/                       # Server configuration
│   │   └── inference/                    # Inference implementation
│   └── README.md                         # Server documentation
├── mixvllm-chat/                         # Chat interface implementation
│   ├── app/                              # Application code
│   │   ├── client/                       # Chat client implementation
│   │   └── utils/                        # Utility functions
│   └── terminal/                         # Terminal server implementation
├── multi_node_gpu_cluster_with_rdma/     # RDMA cluster setup guides
│   └── setup.md                          # Detailed RDMA configuration
└── tests/                                # Test suite
    ├── tensor_parallel.py                # Tensor parallelism tests
    └── test_mcp.py                       # MCP integration tests
│   ├── docker-compose.yml
│   ├── entrypoint.sh
│   └── README.md
├── mixvllm_server/
│   ├── pyproject.toml
│   ├── README.md
│   └── src/
│       ├── cli/
│       │   ├── serve_model.py
│       │   └── README.md
│       ├── config/
│       │   ├── gpt-oss-20b.yaml
│       │   └── phi3-mini.yaml
│       └── inference/
│           ├── config.py
│           ├── server.py
│           ├── utils.py
│           ├── terminal_server.py
│           └── README.md
├── mixvllm-chat/
│   ├── Dockerfile.terminal
│   ├── pyproject.toml
│   ├── README.md
│   ├── app/
│   │   ├── chat_client.py
│   │   ├── client/
│   │   │   ├── chat_client.py
│   │   │   ├── chat_engine.py
│   │   │   ├── cli.py
│   │   │   ├── config.py
│   │   │   ├── connection_manager.py
│   │   │   ├── history_manager.py
│   │   │   ├── response_handler.py
│   │   │   ├── tool_manager.py
│   │   │   ├── ui_manager.py
│   │   │   └── utils/
│   │   │       ├── mcp_client.py
│   │   │       └── mcp_tools.py
│   │   ├── config/
│   │   │   └── mcp_servers.yaml
│   │   └── utils/
│   │       ├── mcp_client.py
│   │       └── mcp_tools.py
│   └── terminal/
│       └── terminal_server_standalone.py
└── tests/
    ├── __init__.py
    ├── tensor_parallel.py
    └── test_mcp.py
uv run python .claude/experiments/test_gpu.py
uv run python .claude/experiments/test_vllm.py
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # Use both GPUs
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
See configs/example_model.yaml for a complete configuration template.
# Run tests
uv run pytest
# Type checking
uv run mypy mixvllm/
# Linting
uv run ruff check mixvllm/
# Auto-formatting
uv run black mixvllm/
With 2x RTX 3090 Ti (24GB each = 48GB total):
- 70B models do not fit in FP16 (the weights alone need ~140GB); use quantization
- 70B models fit with 4-bit quantization (~35GB of weights), leaving limited headroom for the KV cache
- 34B models need ~68GB in FP16, so run them with 8-bit or 4-bit quantization
- Communication overhead between GPUs is minimal on PCIe 4.0
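Whether a model fits can be checked with a rough weights-only estimate: parameter count times bytes per parameter. The sketch below is a back-of-the-envelope aid under that assumption; it ignores the KV cache, activations, and framework overhead, so real usage is higher. The helper name weight_memory_gb is hypothetical.

```python
TOTAL_VRAM_GB = 48  # 2x RTX 3090 Ti at 24GB each


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for params, bits in [(70, 16), (70, 4), (34, 16), (34, 8)]:
    need = weight_memory_gb(params, bits)
    verdict = "fits" if need < TOTAL_VRAM_GB else "does not fit"
    print(f"{params}B at {bits}-bit: ~{need:.0f}GB of weights, {verdict} in {TOTAL_VRAM_GB}GB")
```

A result close to the total VRAM should be read as "does not fit in practice", since the remaining memory also has to hold the KV cache.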
Out of Memory Errors:
- Reduce gpu_memory_utilization (try 0.85 or 0.80)
- Use quantization (4-bit or 8-bit)
- Reduce max_model_len
Slow Inference:
- Check GPU utilization with nvidia-smi
- Verify both GPUs are being used
- Ensure PCIe link is running at full speed
401 Unauthorized Errors:
- Set the HF_TOKEN environment variable with your HuggingFace token
- For gated models, request access on the HuggingFace model page
- Verify token has read permissions: huggingface-cli whoami
- Some models require accepting terms/conditions on HuggingFace
Some models require authentication to access from HuggingFace. If you encounter 401 Unauthorized errors, you need to:
- Get a token: Visit https://huggingface.co/settings/tokens to create an access token
- Set environment variable: export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
- Or login via CLI: huggingface-cli login
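Before launching a gated model, it can help to confirm that the token is actually visible to the process. The sketch below is a local sanity check only: it does not contact HuggingFace or verify permissions, and the helper name hf_token_status and its return values are hypothetical. It assumes user access tokens start with hf_, as in the export example above.

```python
import os
from typing import Mapping


def hf_token_status(env: Mapping[str, str]) -> str:
    """Classify the HF_TOKEN value found in env without any network call."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "missing"
    if not token.startswith("hf_"):
        return "unexpected-format"
    return "ok"


print(f"HF_TOKEN status: {hf_token_status(os.environ)}")
```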
Some models (especially from OpenAI, Meta, etc.) are gated repositories that require:
- Valid HuggingFace account
- Explicit access approval on the model page
- Proper authentication token
Example with authentication:
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
Models that may require authentication:
- openai/gpt-oss-20b (gated)
- meta-llama/Llama-2-* (gated)
- meta-llama/Llama-3-* (gated)
Public models (no auth required):
- microsoft/Phi-3-mini-4k-instruct
- Most Microsoft and Google models
Serve vLLM models with the serve_model.py script, which provides an OpenAI-compatible API server.
# Serve Phi-3 Mini on single GPU (no auth required)
uv run mixvllm-serve --model microsoft/Phi-3-mini-4k-instruct --gpus 1
# Serve Llama 2 70B with tensor parallelism (requires HF_TOKEN)
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --model meta-llama/Llama-2-70b-hf --gpus 2 --trust-remote-code
# Use predefined configurations
uv run mixvllm-serve --config configs/phi3-mini.yaml # No auth required
uv run mixvllm-serve --config configs/llama-7b.yaml # May require HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/llama-70b-tp2.yaml # Requires HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml # Requires HF_TOKEN
# Override config with CLI options
uv run mixvllm-serve --config configs/phi3-mini.yaml --port 8080
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve \
  --model meta-llama/Llama-2-70b-hf \
  --gpus 2 \
  --gpu-memory 0.85 \
  --max-model-len 4096 \
  --port 8000 \
  --temperature 0.8 \
  --max-tokens 1024
Once running, the server provides an OpenAI-compatible API:
# Health check
curl http://localhost:8000/health
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-mini-4k-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The mixvllm-chat command provides a CLI chat interface for interactive conversations with your served models. It features rich terminal formatting and enhanced input handling similar to modern CLI applications.
# Install dependencies (if not already done)
uv sync
# Start chatting with default settings
uv run mixvllm-chat
# Connect to specific server and model
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable streaming responses
uv run mixvllm-chat --stream --temperature 0.8
- Rich Terminal UI: Beautiful formatting with colors, panels, and markdown rendering
- Conversation Context: Maintains chat history for coherent conversations
- Command Support: /help, /clear, /history, /quit
- Enhanced Input: History-based auto-completion and navigation (with prompt_toolkit)
- Streaming Support: Real-time response streaming with live updates
- Model Auto-detection: Automatically detects available models from server
- Error Handling: Clear error messages with appropriate formatting
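Model auto-detection can be pictured as reading the model list that an OpenAI-compatible server returns from /v1/models and taking the first entry. The sketch below works on an already-parsed response so it runs without a server; the helper name pick_model is hypothetical, and the real client may select models differently.

```python
from typing import Any, Optional


def pick_model(models_response: dict[str, Any]) -> Optional[str]:
    """Return the first model id from a /v1/models style response, if any."""
    for item in models_response.get("data") or []:
        model_id = item.get("id")
        if model_id:
            return model_id
    return None


# Sample payload in the OpenAI list format: {"object": "list", "data": [{"id": ...}]}
sample = {
    "object": "list",
    "data": [{"id": "microsoft/Phi-3-mini-4k-instruct", "object": "model"}],
}
print(pick_model(sample))
```

Passing --model explicitly skips this step, which is useful when a server exposes more than one model.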
The chat client uses these optional libraries for enhanced UI:
- rich: Beautiful terminal formatting and colors
- prompt_toolkit: Enhanced input with history and completion
- requests: HTTP client for API calls
If these libraries are not available, the client falls back to basic text output.
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
╭─ Welcome ───────────────────────────────────────────────────────────────────╮
│                                                                             │
│ 🤖 vLLM Chat Client                                                         │
│                                                                             │
│ Configuration:                                                              │
│ • Server: http://localhost:8000                                             │
│ • Model: microsoft/Phi-3-mini-4k-instruct                                   │
│                                                                             │
│ Commands: /help, /clear, /history, /quit                                    │
│ Type your message and press Enter to chat!                                  │
│                                                                             │
╰─────────────────────────────────────────────────────────────────────────────╯
You: Hello! How are you today?
╭─ 🤖 Assistant ──────────────────────────────────────────────────────────────╮
│ Hello! I'm doing well, thank you for asking. I'm here and ready to help     │
│ you with any questions or tasks you might have. How can I assist you        │
│ today?                                                                      │
╰─────────────────────────────────────────────────────────────────────────────╯
You: Tell me about machine learning
╭─ 🤖 Assistant ──────────────────────────────────────────────────────────────╮
│ Machine learning is a fascinating field that involves teaching computers    │
│ to learn from data and make predictions or decisions without being          │
│ explicitly programmed for each specific task. It's a subset of artificial   │
│ intelligence that focuses on algorithms and statistical models that can     │
│ improve their performance as they are exposed to more data.                 │
│                                                                             │
│ There are several main types of machine learning:                           │
│                                                                             │
│ 1. **Supervised Learning**: The algorithm learns from labeled training      │
│    data to make predictions on new, unseen data. Examples include           │
│    classification (like spam detection) and regression (like predicting     │
│    house prices).                                                           │
│                                                                             │
│ 2. **Unsupervised Learning**: The algorithm finds patterns in data          │
│    without labeled examples. This includes clustering (grouping similar     │
│    data points) and dimensionality reduction.                               │
│                                                                             │
│ 3. **Reinforcement Learning**: An agent learns through trial and error by   │
│    interacting with an environment, receiving rewards or penalties for      │
│    actions.                                                                 │
│                                                                             │
│ Machine learning has applications in many fields including computer         │
│ vision, natural language processing, recommendation systems, autonomous     │
│ vehicles, medical diagnosis, and financial trading.                         │
╰─────────────────────────────────────────────────────────────────────────────╯
You: /history
╭─ Conversation History ──────────────────────────────────────────────────────╮
│ ┏━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ Turn ┃ Role         ┃ Content                                           ┃ │
│ ┡━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ 1    │ User         │ Hello! How are you today?                         │ │
│ │ 2    │ Assistant    │ Hello! I'm doing well, thank you for asking. ...  │ │
│ │ 3    │ User         │ Tell me about machine learning                    │ │
│ │ 4    │ Assistant    │ Machine learning is a fascinating field that...   │ │
│ └──────┴──────────────┴───────────────────────────────────────────────────┘ │
╰─────────────────────────────────────────────────────────────────────────────╯
You: /quit
👋 Goodbye!
The mixvllm-chat command provides an advanced chat client with MCP (Model Context Protocol) tool integration, enabling the LLM to call external tools during conversations.
- MCP Tool Integration: Weather queries and other MCP tools
- Tool Discovery Display: Shows available MCP tools on startup
- Dual Modes: Simple chat or agent mode with tool calling
- Rich Terminal UI: Enhanced formatting with panels and colors
- Conversation Context: Maintains chat history
- Streaming Support: Real-time response streaming
- Command System: /help, /clear, /history, /mcp, /quit
Install additional dependencies for MCP support:
uv sync
To run the gpt-oss-20b model using your configuration file, use the following command from the mixvllm_server directory:
HF_TOKEN=<your_huggingface_token> uv run mixvllm-serve --config src/config/gpt-oss-20b.yaml
Notes:
- Replace <your_huggingface_token> with your actual Hugging Face access token.
- This command will use all settings from src/config/gpt-oss-20b.yaml, including model name, tensor parallelism, GPU memory, and dtype.
- Make sure your environment has access to the required GPUs and the Hugging Face repository.
- If you encounter quantization or dtype errors, ensure your config file sets dtype: bfloat16 as shown in the example config.
# Basic chat with vLLM server (auto-detects model)
uv run mixvllm-chat
# Connect to specific server (auto-detects model)
uv run mixvllm-chat --base-url http://localhost:8000
# Specify model explicitly (optional)
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable MCP tools for weather queries (auto-detects model)
uv run mixvllm-chat --enable-mcp
# Full configuration with custom MCP config
uv run mixvllm-chat \
--enable-mcp \
--mcp-config configs/mcp_servers.yaml \
--base-url http://localhost:8000 \
--stream \
--temperature 0.8
When MCP mode is enabled, the following tools are available:
- Weather Queries: Get current weather, forecasts, and historical data
- Location Support: Supports city names and coordinates
- Units: Celsius or Fahrenheit temperature units
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
✓ MCP tools enabled (2 tools available)
╭─ Welcome ───────────────────────────────────────────────────────────────────╮
│ 🤖 Enhanced vLLM Chat Client (with MCP tools)                               │
│                                                                             │
│ Configuration:                                                              │
│ • Server: http://localhost:8000                                             │
│ • Model: microsoft/Phi-3-mini-4k-instruct                                   │
│ • MCP Tools: Enabled                                                        │
│                                                                             │
│ Available MCP Tools (2):                                                    │
│ • weather_get_hourly_weather - Get hourly weather forecast for a location   │
│   using Open-Meteo API (Weather information and forecasts)                  │
│ • weather_geocode_location - Get coordinates and timezone information for   │
│   a location. (Weather information and forecasts)                           │
│                                                                             │
│ Commands: /help, /clear, /history, /mcp, /quit                              │
│ Type your message and press Enter to chat!                                  │
╰─────────────────────────────────────────────────────────────────────────────╯
You: What's the weather like in New York?
╭─ 🌤️ Assistant (with tools) ─────────────────────────────────────────────────╮
│ The user is asking about the weather in New York. I should use the          │
│ weather_get_weather tool to get current weather information.                │
│                                                                             │
│ Tool Call: weather_get_weather(location="New York", units="celsius")        │
│                                                                             │
│ Tool Result: [weather] Weather for New York: 22°C, Partly Cloudy, Wind 5    │
│ km/h                                                                        │
│                                                                             │
│ Current weather in New York: 22°C with partly cloudy conditions and light   │
│ winds at 5 km/h.                                                            │
╰─────────────────────────────────────────────────────────────────────────────╯
You: /mcp
╭─ 🔧 MCP Integration Status ─────────────────────────────────────────────────╮
│ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ │
│ ┃ Server  ┃ Status                                          ┃ Tools       ┃ │
│ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │
│ │ weather │ ✓ Connected (2 tools)                           │ get_hourly_ │ │
│ │         │                                                 │ weather,    │ │
│ │         │                                                 │ geocode_loc │ │
│ │         │                                                 │ ation       │ │
│ └─────────┴─────────────────────────────────────────────────┴─────────────┘ │
╰─────────────────────────────────────────────────────────────────────────────╯
The system supports both standalone and distributed deployment modes, leveraging Ray for cluster management and RDMA for high-performance communication.
graph TB
subgraph "MixVLLM Architecture"
Client(["Client"])
subgraph "Head Node"
HS["HTTP Server"]
RC["Ray Controller"]
MS1["Model Shard 1"]
end
subgraph "Worker Node"
RW["Ray Worker"]
MS2["Model Shard 2"]
end
Client -->|"HTTP/REST"| HS
HS --> RC
RC <-->|"RDMA/Ray"| RW
RC --> MS1
RW --> MS2
MS1 <-->|"NCCL (12GB/s)"| MS2
end
For single-node deployment:
cd docker/stand_alone
docker-compose up -d
Head Node:
cd docker/head
cp .env.example .env
# Configure environment
docker-compose up -d
Worker Node:
cd docker/worker
cp .env.exmple .env
# Configure environment
docker-compose up -d
For detailed configuration, see Docker Configuration Guide.
The web terminal provides browser-based access to CLI tools and is now designed to run as a separate process from the model server. This separation allows for more flexible deployment, improved scalability, and independent management of terminal and model services.
- Model Server: Serves the vLLM API (OpenAI-compatible) on a configurable port (default: 8000).
- Terminal Server: Runs independently, connects to any model server via HTTP, and provides a full-featured shell and chat interface in the browser (default port: 8888).
Start the model server:
uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
# or use any supported config/model options
Start the terminal server (in a separate process):
uv run mixvllm-terminal-server --model-server-url http://localhost:8000
# or use --port to change the terminal port
- Browser-Based Terminal: xterm.js frontend with full shell access
- Auto-Start Chat: Optionally launches mixvllm-chat connected to your model server
- Flexible Connection: Terminal server can connect to any OpenAI-compatible API endpoint
- Separate Ports: Terminal and model server run on independent ports for easier scaling and security
- Docker Support: Docker Compose can launch both services as separate containers
cd docker
docker-compose up -d
# Starts both model-server and terminal-server containers independently
The web terminal provides full shell access with the same permissions as the server process. Only enable it in trusted environments. For production, consider network restrictions or authentication.