A step-by-step guide to training and evaluating RL agents for LLM routing.
The run.sh script will automatically download the dataset if it's missing.
However, if you prefer to download manually or the automatic download fails:
- Download from Hugging Face:
  - Visit: https://huggingface.co/datasets/withmartian/routerbench/blob/main/routerbench_raw.pkl
  - Click the "Download" button to download routerbench_raw.pkl (~1.2 GB)
- Move the file to the data folder:
# Move the downloaded file to the data directory
mv ~/Downloads/routerbench_raw.pkl data/
# Or, if downloaded to the current directory:
mv routerbench_raw.pkl data/
- Verify the file is in place:
ls -lh data/routerbench_raw.pkl
The file should be approximately 1.2 GB in size.
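The size check above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `check_dataset` and the 1 GB threshold are assumptions chosen to flag obviously truncated downloads of the ~1.2 GB file.

```python
from pathlib import Path

# Assumption: anything well below the ~1.2 GB quoted above is a truncated download.
MIN_EXPECTED_BYTES = 1_000_000_000


def check_dataset(path="data/routerbench_raw.pkl", min_bytes=MIN_EXPECTED_BYTES):
    """Return (ok, message) describing whether the dataset file looks complete."""
    file = Path(path)
    if not file.is_file():
        return False, f"{file} not found"
    size = file.stat().st_size
    if size < min_bytes:
        return False, f"{file} is only {size / 1e9:.2f} GB; the download may be incomplete"
    return True, f"{file} looks complete ({size / 1e9:.2f} GB)"
```

Calling `check_dataset()` from the project root returns a boolean and a human-readable message, which is convenient to print before starting a long training run.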
For a fully automated setup that handles everything from environment setup to running the Streamlit dashboard:
# Make the script executable (first time only)
chmod +x run.sh
# Run everything automatically
./run.sh
This script will:
- ✅ Check Python installation
- ✅ Create and activate virtual environment
- ✅ Install all dependencies
- ✅ Automatically download dataset if missing (from Hugging Face)
- ✅ Precompute embeddings for embedding-based experiments
- ✅ Train all agents (DQN, PPO, LinUCB, PickLLM, Greedy)
- ✅ Run evaluation on all agents
- ✅ Run comparison experiments (baseline and embedding comparisons)
- ✅ Launch Streamlit dashboard
Options:
# Skip steps (comma-separated list)
./run.sh --skip=download
./run.sh --skip=embedding
./run.sh --skip=training
./run.sh --skip=evaluation
./run.sh --skip=download,embedding
./run.sh --skip=download,embedding,training
./run.sh --skip=download,embedding,training,evaluation
# Customize episode counts
./run.sh --episodes 1000 --eval-episodes 50
# Combine options
./run.sh --skip=download,embedding,training --episodes 1000
# Get help
./run.sh --help
Note: The script will automatically download the dataset (~1.2 GB) from Hugging Face if it's not found. This requires an internet connection and may take several minutes.
For manual setup and more control over each step:
# Clone and enter directory
cd LLMRouter
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# or: .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
For faster training with embeddings:
python scripts/precompute_embeddings.py --model all-MiniLM-L6-v2
This creates data/prompt_embeddings_all-MiniLM-L6-v2.pkl (~100MB, takes ~5 min).
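Precomputing pays off because each prompt is encoded once and then read from disk on every later run. The sketch below illustrates that caching pattern only; the function name and the `{prompt: vector}` pickle layout are assumptions and may differ from what `scripts/precompute_embeddings.py` actually writes. The `encode` argument is any callable that maps a list of strings to a list of vectors; with the sentence-transformers package, `SentenceTransformer("all-MiniLM-L6-v2").encode` is one such callable.

```python
import pickle
from pathlib import Path


def load_or_compute_embeddings(prompts, encode, cache_path):
    """Return {prompt: vector}, reusing a pickle cache when one exists.

    Hypothetical helper: the cache layout is an assumption, not the repo's format.
    """
    cache = Path(cache_path)
    if cache.is_file():
        # Cache hit: skip the expensive encoding step entirely.
        with cache.open("rb") as f:
            return pickle.load(f)
    # Cache miss: encode every prompt once, then persist the result.
    vectors = encode(list(prompts))
    embeddings = dict(zip(prompts, vectors))
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

Only load pickle files you created yourself or obtained from a trusted source, since unpickling can execute arbitrary code.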
# DQN without embeddings
python training/train.py --algo dqn \
    --num_episodes 1000

# DQN with precomputed embeddings
python training/train.py --algo dqn \
    --num_episodes 1000 \
    --use_embeddings \
    --embeddings_path data/prompt_embeddings_all-MiniLM-L6-v2.pkl \
    --embedding_model all-MiniLM-L6-v2 \
    --output_dir results/dqn_embeddings

# PPO without embeddings
python training/train.py --algo ppo \
    --num_episodes 1000

# PPO with precomputed embeddings
python training/train.py --algo ppo \
    --num_episodes 1000 \
    --use_embeddings \
    --embeddings_path data/prompt_embeddings_all-MiniLM-L6-v2.pkl \
    --embedding_model all-MiniLM-L6-v2 \
    --output_dir results/ppo_embeddings

# LinUCB
python training/train.py --algo linucb \
    --num_episodes 1000 \
    --linucb_alpha 1.0

# PickLLM
python training/train.py --algo pickllm \
    --num_episodes 1000 \
    --pickllm_lr 0.1 \
    --pickllm_gamma 0.0 \
    --pickllm_epsilon_start 0.1 \
    --pickllm_epsilon_end 0.01 \
    --pickllm_epsilon_decay 0.995

Training takes ~10-30 minutes depending on the episode count and hardware.
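To get a feel for the PickLLM exploration flags, it helps to compute how quickly epsilon shrinks. The sketch below assumes a multiplicative schedule applied once per episode and floored at the end value; that schedule is a common convention, not something confirmed from the training code, and both function names are hypothetical.

```python
import math


def epsilon_at(episode, start=0.1, end=0.01, decay=0.995):
    """Exploration rate after `episode` multiplicative decay steps, floored at `end`.

    Assumption: epsilon is multiplied by `decay` once per episode.
    """
    return max(end, start * decay ** episode)


def episodes_to_floor(start=0.1, end=0.01, decay=0.995):
    """Smallest number of episodes after which epsilon has reached `end`."""
    return math.ceil(math.log(end / start) / math.log(decay))
```

Under this assumed schedule, the defaults shown above (0.1, 0.01, 0.995) reach the floor after 460 episodes, so a 1000-episode run would spend roughly the second half at minimum exploration.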
Edit compare_embedding_experiments.py to point to your trained models, then:
python evaluation/compare_embedding_experiments.py --num_episodes 100
Output:
======================================================================
SUMMARY
======================================================================
Experiment          Reward    Std       Cost        AIQ
-------------------------------------------------------------------------------
Greedy              1.7015    0.2269    0.001152    1.4138
DQN (Features)      1.4828    0.3365    0.007123    1.1497
PPO (Embeddings)    1.4389    0.3075    0.007950    1.8301
python evaluation/evaluate.py \
    --agent_type dqn \
    --dqn_agent_path results/dqn_embeddings/models/dqn_agent_final.pt \
    --use_embeddings \
    --embeddings_path data/prompt_embeddings_all-MiniLM-L6-v2.pkl

python evaluation/evaluate.py \
    --agent_type all

streamlit run evaluation/streamlit_app.py
Opens at http://localhost:8501 with:
- Sidebar: File selection and upload for JSON results
- Summary Metrics: Best agent, best reward, lowest cost, best efficiency
- All Metrics Table: Comprehensive metrics for all agents (reward, performance, cost, efficiency, AIQ, budget utilization, completion rate)
- Visualizations (4 tabs):
  - Reward: Average reward and performance bar charts
  - Cost & Efficiency: Cost and efficiency comparisons
  - AIQ: AIQ scores and cost vs performance scatter plots
  - Episode Distribution: Box plots and histograms of episode rewards
- Agent Comparison: Interactive radar chart for comparing selected agents
results/
├── models/ # All trained models
│ ├── dqn_agent_final.pt # DQN model
│ ├── dqn_agent_episode_*.pt # DQN checkpoints
│ ├── ppo_agent_final.pt # PPO model
│ ├── ppo_agent_episode_*.pt # PPO checkpoints
│ ├── linucb_agent_final.npz # LinUCB model
│ └── pickllm_agent_final.npz # PickLLM model
├── dqn_training_stats.json # DQN training statistics
├── ppo_training_stats.json # PPO training statistics
├── linucb_training_stats.json # LinUCB training statistics
├── pickllm_training_stats.json # PickLLM training statistics
├── greedy_stats.json # Greedy baseline statistics
├── evaluation_results.json # Evaluation results (from evaluate.py)
├── baseline_comparison.json # Baseline comparison results
├── comparison/ # Comparison experiments
│ └── comparison.json
└── embedding_comparison/ # Embedding comparison results
    └── embedding_comparison.json
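After a full run, a quick script can confirm that the main artifacts from the tree above were written. This is a minimal sketch: the file list is copied from the tree, while the helper name `missing_outputs` is hypothetical and the list should be trimmed to the agents you actually trained.

```python
from pathlib import Path

# File names copied from the results/ tree above; adjust to the agents you trained.
EXPECTED_FILES = [
    "models/dqn_agent_final.pt",
    "models/ppo_agent_final.pt",
    "models/linucb_agent_final.npz",
    "models/pickllm_agent_final.npz",
    "evaluation_results.json",
]


def missing_outputs(results_dir="results", expected=EXPECTED_FILES):
    """Return the expected files that are absent from `results_dir`."""
    root = Path(results_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file is present; otherwise the returned names show which training or evaluation step needs to be rerun.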
# 1. Setup
source .venv/bin/activate
pip install -r requirements.txt
# 2. Pre-compute embeddings
python scripts/precompute_embeddings.py --model all-MiniLM-L6-v2
# 3. Train DQN with embeddings
python training/train.py --algo dqn \
    --num_episodes 1000 \
    --use_embeddings \
    --embeddings_path data/prompt_embeddings_all-MiniLM-L6-v2.pkl
# 4. Train PPO with embeddings
python training/train.py --algo ppo \
    --num_episodes 1000 \
    --use_embeddings \
    --embeddings_path data/prompt_embeddings_all-MiniLM-L6-v2.pkl
# 5. Evaluate all agents
python evaluation/compare_embedding_experiments.py --num_episodes 100
# 6. View dashboard
streamlit run evaluation/streamlit_app.py

# DQN hyperparameters:
#   --lr              learning rate
#   --gamma           discount factor
#   --epsilon_decay   slower exploration decay
#   --reward_lambda   cost penalty weight
#   --latency_lambda  latency penalty weight
#   --hidden_dims     larger network
python training/train.py --algo dqn \
    --lr 0.0005 \
    --gamma 0.99 \
    --epsilon_decay 0.997 \
    --reward_lambda 0.2 \
    --latency_lambda 0.1 \
    --batch_size 128 \
    --hidden_dims 256,256

# PPO hyperparameters
python training/train.py --algo ppo \
    --ppo_lr 1e-4 \
    --ppo_epochs 15 \
    --ppo_clip_epsilon 0.1

# LinUCB hyperparameters:
#   --linucb_alpha    exploration parameter (higher = more exploration)
#   --reward_lambda   cost penalty weight
python training/train.py --algo linucb \
    --linucb_alpha 1.0 \
    --reward_lambda 0.2 \
    --num_episodes 1000

# PickLLM hyperparameters:
#   --pickllm_lr             learning rate
#   --pickllm_gamma          discount factor (0.0 for immediate rewards)
#   --pickllm_epsilon_start  initial exploration rate
#   --pickllm_epsilon_end    final exploration rate
#   --pickllm_epsilon_decay  exploration decay rate
#   --reward_lambda          cost penalty weight
python training/train.py --algo pickllm \
    --pickllm_lr 0.1 \
    --pickllm_gamma 0.0 \
    --pickllm_epsilon_start 0.1 \
    --pickllm_epsilon_end 0.01 \
    --pickllm_epsilon_decay 0.995 \
    --reward_lambda 0.2 \
    --num_episodes 1000

| Agent | Avg Reward | Avg Cost | Cost Efficiency |
|---|---|---|---|
| Greedy | ~1.70 | ~0.0011 | ~1800 |
| LinUCB | ~1.27 | ~0.0004 | ~3700 |
| PickLLM | ~1.23 | ~0.0004 | ~3500 |
| DQN (Features) | ~1.48 | ~0.0071 | ~1250 |
| DQN (Embeddings) | ~1.41 | ~0.0054 | ~850 |
| PPO (Embeddings) | ~1.44 | ~0.0080 | ~1800 |
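The table can be read along two axes: which agent earns the most reward and which spends the least. A small sketch of that comparison follows, using the approximate values from the table; the helper names are hypothetical and the tie-breaking rule for equal costs is an assumption.

```python
# Approximate values copied from the table above: (avg reward, avg cost).
RESULTS = {
    "Greedy": (1.70, 0.0011),
    "LinUCB": (1.27, 0.0004),
    "PickLLM": (1.23, 0.0004),
    "DQN (Features)": (1.48, 0.0071),
    "DQN (Embeddings)": (1.41, 0.0054),
    "PPO (Embeddings)": (1.44, 0.0080),
}


def best_by_reward(results):
    """Name of the agent with the highest average reward."""
    return max(results, key=lambda name: results[name][0])


def cheapest(results):
    """Name of the agent with the lowest average cost; ties go to the higher reward."""
    return min(results, key=lambda name: (results[name][1], -results[name][0]))
```

With these numbers, `best_by_reward(RESULTS)` picks Greedy and `cheapest(RESULTS)` picks LinUCB, which ties with PickLLM on cost but has the higher reward.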
"ModuleNotFoundError": Run pip install -r requirements.txt
"FileNotFoundError: routerbench_raw.pkl": Place dataset in data/ folder
"State dimension mismatch": Ensure embeddings match the trained model's configuration
Slow training: Use pre-computed embeddings instead of live encoding
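A "State dimension mismatch" usually means the input vector at evaluation time has a different length from the one the model was trained on, for example when `--use_embeddings` is set in one step but not the other, or when a different embedding model is used. The check below is a minimal sketch with a hypothetical helper name; how the repository assembles its state vector is not shown here, so `expected_dim` is whatever size your trained model expects. For reference, all-MiniLM-L6-v2 produces 384-dimensional embeddings.

```python
def check_state_dim(vector, expected_dim):
    """Raise a descriptive error if `vector` does not have `expected_dim` entries.

    Hypothetical helper: `expected_dim` must come from your trained model's config.
    """
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"State dimension mismatch: got {actual}, expected {expected_dim}. "
            "Check that training and evaluation use the same embedding settings."
        )
    return actual
```

Running this on one sample vector before evaluation surfaces the problem early, with a message that names both sizes.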