A reinforcement learning framework for intelligent SLM/LLM orchestration that dynamically routes queries to the most appropriate model, balancing performance, cost, and latency.
The AAMC framework addresses the challenge of efficiently orchestrating heterogeneous language models by combining three components:
- Task Complexity Estimator (TCE): A multi-task encoder that analyzes incoming queries to predict complexity, category, and tool-use requirements
- Reinforcement Learning Router (RLR): A preference-conditioned PPO agent that learns dynamic routing policies optimizing for multiple objectives
- Simulation Environment: A high-fidelity gymnasium environment modeling realistic model pools with cost, latency, and performance characteristics
Key Results: AAMC achieves >90% task success rate (comparable to LLM-only) while reducing operational costs by over 70% and significantly improving inference latency.
AAMC-project/
├── README.md # This file
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment
├── Makefile # Common commands
├── Dockerfile # Docker build configuration
├── configs/ # Configuration files
│ ├── tce.yaml # TCE training config
│ ├── rlr.yaml # RLR training config
│ └── sim.yaml # Simulation environment config
├── data/ # Data directory
│ ├── generate_tce_dataset.py # Synthetic data generator
│ ├── dtce_train.csv # Training data (generated)
│ ├── dtce_val.csv # Validation data
│ └── dtce_test.csv # Test data
├── aamc_env/ # Simulation environment
│ ├── __init__.py
│ └── env.py # Gymnasium environment
├── tce/ # Task Complexity Estimator
│ ├── __init__.py
│ ├── model.py # TCE model architecture
│ ├── train_tce.py # Training script
│ └── eval_tce.py # Evaluation script
├── rler/ # Reinforcement Learning Router
│ ├── __init__.py
│ ├── model.py # Actor-critic networks
│ ├── ppo.py # PPO algorithm
│ ├── train_rlr.py # Training script
│ └── eval_rlr.py # Evaluation script
├── baselines/ # Baseline strategies
│ ├── __init__.py
│ └── strategies.py # LLM-only, SLM-only, rule-based, supervised
├── inference/ # Inference module
│ ├── __init__.py
│ ├── aamc_inference.py # Single-query routing
│ └── service_stub.py # FastAPI service (optional)
├── scripts/ # Experiment scripts
│ ├── evaluate_all.py # Comprehensive evaluation
│ └── reproduce_fig5.sh # Reproduce paper figures
├── experiments_results/ # Results directory
├── checkpoints/ # Model checkpoints
│ ├── tce/
│ └── rlr/
├── logs/ # Training logs
│ ├── tce/
│ └── rlr/
└── tests/ # Unit tests
├── __init__.py
├── test_env.py
└── test_tce.py
# Clone repository
git clone <repository-url>
cd AAMC-project
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Create conda environment
conda env create -f environment.yml
conda activate aamc
# Verify installation
python -c "import torch; print(torch.__version__)"
# Build Docker image
docker build -t aamc:latest .
# Run container
docker run -it --gpus all -v $(pwd):/workspace aamc:latest bash
make data
# Or manually:
python data/generate_tce_dataset.py --n_train 10000 --n_val 2000 --n_test 2000
This generates synthetic queries with complexity scores, categories, and tool-use labels.
make tce_train
# Or manually:
python tce/train_tce.py --config configs/tce.yaml --data_dir data
Expected training time: 1-2 hours on GPU, 4-6 hours on CPU
Output: Trained TCE checkpoint at checkpoints/tce/best_model.pt
make rlr_train
# Or manually:
python rler/train_rlr.py --config configs/rlr.yaml --sim_config configs/sim.yaml --tce_checkpoint checkpoints/tce/best_model.pt
Expected training time: 4-8 hours on GPU, 16-24 hours on CPU
Output: Trained RLR checkpoint at checkpoints/rlr/final_model.pt
For quick testing (reduced timesteps):
make rlr_train_quick
make evaluate
# Or manually:
python scripts/evaluate_all.py \
--sim_config configs/sim.yaml \
--tce_checkpoint checkpoints/tce/best_model.pt \
--rlr_checkpoint checkpoints/rlr/final_model.pt \
--n_episodes 10
Output: Comparison table and metrics in experiments_results/
python inference/aamc_inference.py \
--query "Write a Python function to sort a list" \
--tce_checkpoint checkpoints/tce/best_model.pt \
--rlr_checkpoint checkpoints/rlr/final_model.pt \
--preference 0.7,0.2,0.1
Output: Selected model with explanation
Key parameters:
- encoder: Pre-trained transformer (default: distilbert-base-uncased)
- learning_rate: 2e-5
- num_epochs: 10
- lambda_c, lambda_cat, lambda_t: Loss weights for multi-task learning
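The three loss weights suggest a weighted sum of per-task losses. The sketch below shows one plausible form for a single example. It is an assumption, not the code in tce/train_tce.py: the choice of squared error for complexity, cross-entropy for category, and binary cross-entropy for tool use, as well as the function name `multitask_loss`, are hypothetical.

```python
import math

def multitask_loss(pred_complexity, true_complexity,
                   category_probs, true_category,
                   tool_prob, true_tool,
                   lambda_c=1.0, lambda_cat=1.0, lambda_t=1.0):
    """Weighted sum of three per-task losses for one example (assumed form)."""
    eps = 1e-12  # guards against log(0)
    # Assumed: squared error on the scalar complexity score.
    loss_c = (pred_complexity - true_complexity) ** 2
    # Assumed: cross-entropy on the predicted category distribution.
    loss_cat = -math.log(category_probs[true_category] + eps)
    # Assumed: binary cross-entropy on the tool-use flag (0 or 1).
    loss_t = -(true_tool * math.log(tool_prob + eps)
               + (1 - true_tool) * math.log(1 - tool_prob + eps))
    return lambda_c * loss_c + lambda_cat * loss_cat + lambda_t * loss_t
```

Raising one weight relative to the others shifts training effort toward that head, which is the usual reason to expose these values in a config file.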
Key parameters:
- gamma: Discount factor (0.99)
- gae_lambda: GAE lambda (0.95)
- clip_epsilon: PPO clipping (0.2)
- n_steps: Steps per rollout (2048)
- total_timesteps: Total training steps (1,000,000)
- preference_sampling: Strategy for sampling preference vectors
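To show where gamma, gae_lambda, and clip_epsilon enter, here is a minimal sketch of generalized advantage estimation and the PPO clipped surrogate for scalar rewards. It is the textbook formulation, not an excerpt from rler/ppo.py; the function names are hypothetical, and it assumes a single rollout with no episode boundary inside it.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout without episode resets."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the rollout
    for t in reversed(range(len(rewards))):
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """PPO objective for one sample: min of unclipped and clipped terms."""
    clipped = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped * advantage)
```

A lower gae_lambda trades variance for bias in the advantage estimate, while clip_epsilon bounds how far one update can move the policy from the one that collected the rollout.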
Defines:
- Model pool specifications (cost, latency, performance)
- Task categories and distributions
- Performance matrix (success probability per model-task pair)
- Reward normalization ranges
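One way the normalization ranges and a preference vector could combine into a scalar reward is sketched below. This is an illustration under stated assumptions: the preference order (performance, cost, latency), min-max scaling, and the names `normalize` and `scalar_reward` are hypothetical and may differ from what aamc_env/env.py actually does.

```python
def normalize(value, low, high):
    """Min-max scale value into [0, 1], clamping values outside the range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))

def scalar_reward(success, cost, latency, preference, cost_range, latency_range):
    """Preference-weighted reward: reward success, penalize cost and latency."""
    # Assumed order of the preference vector: (performance, cost, latency).
    w_perf, w_cost, w_lat = preference
    cost_n = normalize(cost, *cost_range)
    lat_n = normalize(latency, *latency_range)
    return w_perf * float(success) - w_cost * cost_n - w_lat * lat_n

# Hypothetical values: a successful task at mid-range cost and latency.
reward = scalar_reward(True, 0.5, 100.0, (0.7, 0.2, 0.1), (0.0, 1.0), (0.0, 200.0))
```

Normalizing first keeps cost in USD and latency in milliseconds on a comparable scale, so the preference weights are not dominated by whichever quantity has the larger raw numbers.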
The evaluation script computes:
- Task Success Rate: Percentage of successfully completed tasks
- Average Cost per Task: Mean operational cost in USD
- Average Latency: Mean end-to-end latency in milliseconds
- Overall Efficiency Score: success_rate / (α·cost + β·latency)
- Per-Category Success Rates: Fairness analysis across task types
- Model Distribution: Frequency of model selections
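The efficiency score formula above can be computed directly. The sketch below is a minimal version; the function name and the default values of α and β are placeholders, since the weights used by scripts/evaluate_all.py are not stated here.

```python
def efficiency_score(success_rate, avg_cost, avg_latency, alpha=1.0, beta=1.0):
    """success_rate / (alpha * cost + beta * latency), as listed above."""
    denom = alpha * avg_cost + beta * avg_latency
    if denom <= 0:
        raise ValueError("weighted cost plus latency must be positive")
    return success_rate / denom
```

Because cost is in USD and latency in milliseconds, α and β also act as unit conversions; scores are only comparable between strategies evaluated with the same pair of weights.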
The implementation includes four baseline strategies:
- LLM-Only: Always route to LLM-XLarge (maximum performance, maximum cost)
- SLM-Only: Always route to SLM-Medium (balanced performance/cost)
- Rule-Based: Threshold-based routing using TCE complexity score
- Supervised Router: Trained classifier for model selection
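As an illustration of the rule-based baseline, threshold routing on the TCE complexity score might look like the sketch below. The two-way split, the 0.5 threshold, and the assumption that complexity lies in [0, 1] are hypothetical; baselines/strategies.py may use more tiers or different cut-offs.

```python
def rule_based_route(complexity, threshold=0.5):
    """Send queries at or above the threshold to the large model."""
    # Model names come from the baseline descriptions above;
    # the threshold value is a placeholder.
    return "LLM-XLarge" if complexity >= threshold else "SLM-Medium"
```

A fixed rule like this cannot adapt to load or to a user's cost and latency preferences, which is the gap the learned router is meant to close.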
# Generate full dataset
python data/generate_tce_dataset.py --n_train 50000 --n_val 10000 --n_test 10000
# Train TCE
make tce_train
# Train RLR
make rlr_train
# Evaluate all strategies
make evaluate
# Generate comparison plots
python scripts/plot_results.py --results_dir experiments_results
Test robustness to noisy TCE predictions:
python scripts/evaluate_robustness.py \
--noise_levels 0.0,0.1,0.2,0.3 \
--n_episodes 20
Measure decision time vs. number of models:
python scripts/evaluate_scalability.py \
--model_counts 3,5,10,20 \
--n_queries 1000
- CPU: 4 cores
- RAM: 16 GB
- Storage: 20 GB
- Time: ~30 minutes
- GPU: NVIDIA GPU with 16+ GB VRAM (V100, A100, RTX 4090)
- CPU: 16+ cores
- RAM: 64 GB
- Storage: 100 GB
- Time: ~8-12 GPU-hours
- AWS: p3.2xlarge (V100) - ~$3/hour
- Google Cloud: n1-standard-8 + T4 GPU - ~$2-4/hour
- Azure: NC6s_v3 (V100) - ~$3-6/hour
Run unit tests:
make test
# Or manually:
pytest tests/ -v --cov=. --cov-report=html
View coverage report:
open htmlcov/index.html
Decision: Use distilbert-base-uncased as default
Rationale: Balances performance and efficiency (66M parameters, ~40ms inference on CPU)
Alternatives: bert-tiny (4M params) for ultra-fast, mobileBERT for mobile deployment
Decision: Simple M/M/1 queue per model
Rationale: Captures essential dynamics while remaining computationally efficient
Deviation: Paper may use more complex queueing; this is a reasonable approximation
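For reference, the mean time a request spends in an M/M/1 system (waiting plus service) is 1 / (μ − λ) for arrival rate λ and service rate μ, valid when λ < μ. The sketch below computes it; the function name is hypothetical, and how aamc_env/env.py turns this quantity into a latency figure is not shown here.

```python
import math

def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue, in the time unit of the rates."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative, service rate positive")
    if arrival_rate >= service_rate:
        return math.inf  # unstable queue: the backlog grows without bound
    return 1.0 / (service_rate - arrival_rate)
```

Latency rises sharply as utilization approaches 1, so a router that overloads one model pays for it in this term.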
Decision: Generate synthetic queries with templates
Rationale: Enables end-to-end pipeline without external dependencies
Note: Real datasets can be plugged in via documented interface
Decision: Implement from scratch rather than using stable-baselines3
Rationale: Fine-grained control over preference conditioning and vector rewards
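A minimal sketch of preference conditioning, assuming the preference vector is drawn uniformly from the probability simplex and concatenated onto the state features before they reach the policy. Both choices and both function names are hypothetical; the configured preference_sampling strategy and the network input layout in rler/model.py may differ.

```python
import random

def sample_preference(n_objectives=3, rng=random):
    """Draw weights uniformly from the simplex (Dirichlet with all ones)."""
    # Normalized exponential draws are uniformly distributed on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]

def condition_state(state_features, preference):
    """Append the preference weights to the state features."""
    return list(state_features) + list(preference)

random.seed(0)
weights = sample_preference()
policy_input = condition_state([0.4, 1.0], weights)
```

Sampling a fresh vector per episode exposes one policy to many trade-offs during training, so a preference such as 0.7,0.2,0.1 can be supplied at inference time without retraining.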
- Reduce batch size in configs
- Use gradient accumulation
- Try mixed precision training
- Use smaller encoder (e.g., bert-tiny)
- Reduce dataset size
- Enable mixed precision
- Increase training timesteps
- Adjust PPO hyperparameters (clip_epsilon, learning rate)
- Check reward normalization ranges
If you use this code, please cite:
@article{rjoub2025aamc,
title={Adaptive Agentic Meta-Controller (AAMC): A Reinforcement Learning Framework for Intelligent SLM/LLM Orchestration},
author={Rjoub, Gaith and Bentahar, Jamal and Almolydeen, Shahed and Irjoob, Ahmad},
journal={Neurocomputing},
year={2025}
}
This implementation is provided for research and educational purposes.
For questions or issues:
- Open an issue on GitHub
- Email: grjoub@aut.edu.jo
This implementation is based on the paper "Adaptive Agentic Meta-Controller (AAMC): A Reinforcement Learning Framework for Intelligent SLM/LLM Orchestration" and uses:
- PyTorch for deep learning
- Hugging Face Transformers for pre-trained encoders
- Gymnasium for RL environments
- Stable-Baselines3 for reference implementations
Version: 1.0
Last Updated: 2025-10-09