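Tianyi Niu<sup>1</sup> | Justin Chih-Yao Chen<sup>1</sup> | Genta Indra Winata<sup>2</sup> | Shi-Xiong (Austin) Zhang<sup>2</sup> | Supriyo Chakraborty<sup>2</sup> | Sambit Sahu<sup>2</sup> | Yue Zhang<sup>1</sup> | Elias Stengel-Eskin<sup>3</sup> | Mohit Bansal<sup>1</sup>
<sup>1</sup>UNC Chapel Hill
<sup>2</sup>Capital One
<sup>3</sup>The University of Texas at Austin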
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
Overview of CASCAL. (A) Consensus Scoring: we extract model responses for each query and compute confidence-weighted consensus scores. (B) Centroid Identification: for each model and subject, we cluster the queries on which the model demonstrates proficiency to obtain skill centroids, then merge similar centroids across models. (C) Cluster Ranking: we assign queries to their nearest centroid and rank models within each cluster by average consensus score. (D) Inference: we route each test query to its nearest subject and centroid, select the top-3 (or top-1) ranked models, and aggregate their responses via consensus voting.
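Steps (A), (C), and the voting part of (D) can be illustrated with a minimal Python sketch. This is not the repository's implementation: the function names, the `{model: (answer, confidence)}` input layout, and the scoring formula (a model's score is the confidence-weighted share of the vote held by its own answer) are all assumptions made for illustration. The clustering and centroid-matching steps are omitted.

```python
from collections import defaultdict


def consensus_scores(responses):
    """Score each model on one query (assumed formula, see lead-in).

    responses: {model: (answer, confidence)}
    Returns {model: share of total confidence held by that model's answer}.
    """
    weight_by_answer = defaultdict(float)
    for answer, confidence in responses.values():
        weight_by_answer[answer] += confidence
    total = sum(weight_by_answer.values())
    if total == 0:
        return {model: 0.0 for model in responses}
    return {
        model: weight_by_answer[answer] / total
        for model, (answer, _) in responses.items()
    }


def rank_models(cluster_queries):
    """Rank models within one cluster by average consensus score, best first.

    cluster_queries: list of per-query response dicts assigned to the cluster.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for responses in cluster_queries:
        for model, score in consensus_scores(responses).items():
            totals[model] += score
            counts[model] += 1
    averages = {model: totals[model] / counts[model] for model in totals}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(averages, key=lambda model: (-averages[model], model))


def vote(responses, selected):
    """Confidence-weighted vote over the answers of the selected models."""
    weight_by_answer = defaultdict(float)
    for model in selected:
        answer, confidence = responses[model]
        weight_by_answer[answer] += confidence
    return max(sorted(weight_by_answer), key=weight_by_answer.get)
```

With a ranking from `rank_models`, inference would take the first three (or one) models for the matched cluster and pass them to `vote` together with their responses to the test query.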
The codebase makes frequent use of the following model nicknames:
| Nickname | Model Name |
|---|---|
| OSS120 | openai/gpt-oss-120b |
| Qwen3-32 | Qwen/Qwen3-32B |
| GLM4-32 | zai-org/GLM-4-32B-0414 |
| Llama70 | meta-llama/Llama-3.3-70B-Instruct |
| Gemma3-27 | google/gemma-3-27b-it |
| Exaone4-32 | LGAI-EXAONE/EXAONE-4.0-32B |
| Gemma | google/gemma-2-9b-it |
| Exaone | LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct |
| GLM4 | zai-org/glm-4-9b-chat |
| Qwen3-8 | Qwen/Qwen3-8B |
| DpskMath | deepseek-ai/deepseek-math-7b-instruct |
| Yi15 | 01-ai/Yi-1.5-9B-Chat-16K |
To run experiments with proprietary models, create a `.env` file in the project root with the following:

```
GEMINI_API_KEY=your_gemini_api_key_here
```
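Before launching a Gemini run, it can help to confirm that the key is actually readable. The sketch below uses only the standard library; `load_env_file` and `get_gemini_key` are hypothetical helpers, not functions from this repository, and the codebase may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict.

    Blank lines, comment lines, and lines without '=' are skipped.
    A missing file yields an empty dict.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_key(path=".env"):
    """Return GEMINI_API_KEY from the process environment or the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```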
```bash
# Clone repository
git clone https://github.com/tianyiniu/RoutingGenData.git
cd RoutingGenData

# Install dependencies
pip install -r requirements.txt
```

Required packages include: vllm, transformers, torch, datasets, google-genai, scikit-learn, openai, math-verify.
The codebase supports four major benchmarks. Each dataset should be formatted as a JSON list of question dictionaries with the following structure:
```json
{
    "question_id": 0,
    "category": "subject_name",
    "question": "Question text here",
    "options": ["Option A", "Option B", "Option C", "Option D"],
    "answer": "A"
}
```

| Dataset | Script | Source |
|---|---|---|
| MMLU-Pro | code/prepare_data/MMLU_Pro_format.py | TIGER-Lab/MMLU-Pro |
| SuperGPQA | code/prepare_data/SuperGPQA_format.py | m-a-p/SuperGPQA |
| MedMCQA | code/prepare_data/MedMCQA_format.py | openlifescienceai/medmcqa |
| BBEH | code/prepare_data/BigBench_ExtraHard_format.py | google-deepmind/bbeh |
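A custom dataset can be checked against the question format described above before it enters the pipeline. The following is a minimal sketch rather than part of the codebase: `validate_question` and `validate_dataset` are hypothetical names, and the check that `answer` is a letter indexing into `options` is an assumption based on the example entry.

```python
import json

# Field names and types taken from the example question dictionary.
REQUIRED_FIELDS = {
    "question_id": int,
    "category": str,
    "question": str,
    "options": list,
    "answer": str,
}


def validate_question(entry):
    """Return a list of problems found in one question dictionary."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if problems:
        return problems
    # Assumption: answers are letters indexing into options (A -> 0, B -> 1, ...).
    answer = entry["answer"]
    index = ord(answer) - ord("A") if len(answer) == 1 else -1
    if not 0 <= index < len(entry["options"]):
        problems.append(f"answer {answer!r} does not match any option")
    return problems


def validate_dataset(path):
    """Load a JSON list of questions and map list index -> problems found."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("dataset must be a JSON list")
    return {i: p for i, entry in enumerate(data) if (p := validate_question(entry))}
```

An empty dictionary from `validate_dataset` means every entry passed these checks.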
To download and format a dataset:

```bash
python -m code.prepare_data.MMLU_Pro_format
# Output: ./Data/MMLU_Pro/all.json
```

```bash
python -m code.prepare_data.train_test_split \
    --task MMLU_Pro
# Output: ./Data/MMLU_Pro/train.json, ./Data/MMLU_Pro/test.json
```

```bash
python -m code.RGD.local_gen \
    --model_nickname Qwen3-32 \
    --task MMLU_Pro \
    --split train \
    --cuda-devices 0,1
```

```bash
python -m code.RGD.gemini_gen \
    --task MMLU_Pro \
    --split train
# Output: ./Model_Responses/{TASK}-{SPLIT}/{model_nickname}.json
```

```bash
python -m code.preprocess.agent_responses \
    --model_nickname Llama70 \
    --task MMLU_Pro \
    --split train \
    --cuda-devices 0,1
```

```bash
python -m code.preprocess.get_consensus_score \
    --task MMLU_Pro \
    --split train \
    --models large
# Output: Updated model response files with consensus_score and z_score fields
```

```bash
python -m code.preprocess.get_embeddings \
    --task MMLU_Pro \
    --split train \
    --cuda-devices 7
# Output: ./artifacts/MMLU_Pro_only_question_embeddings_early.pt
```

```bash
python -m code.CASCAL.train \
    --task MMLU_Pro \
    --split train \
    --models large \
    --merge-threshold 0.15 \
    --min-consensus 0.5
# Output: ./artifacts/MMLU_Pro/{centroids_*, rankings_*, assigned_*, subject_representatives.json}
```

```bash
python -m code.CASCAL.infer \
    --train-task MMLU_Pro \
    --test-task MMLU_Pro \
    --split test \
    --models large
# Output: ./artifacts/MMLU_Pro_agent_selections.json
```

```bash
python -m code.CASCAL.rankings \
    --task MMLU_Pro \
    --split test \
    --models large
# Output: Selection accuracy metrics and scaling statistics
```

If you have pre-computed model responses and embeddings, you can directly train and evaluate the router:
```bash
# Assuming you have:
# - ./Data/MMLU_Pro/train.json (formatted questions)
# - ./Model_Responses/MMLU_Pro-train/*.json (model responses with consensus scores)
# - ./Data/MMLU_Pro/test.json (formatted questions)
# - ./Model_Responses/MMLU_Pro-test/*.json (model responses with consensus scores)
# - ./artifacts/MMLU_Pro_only_question_embeddings_early.pt (embeddings)

# Train router
python -m code.CASCAL.train --task MMLU_Pro --split train --models large

# Run inference on test set
python -m code.CASCAL.infer --train-task MMLU_Pro --test-task MMLU_Pro --split test --models large

# Evaluate
python -m code.CASCAL.rankings --task MMLU_Pro --split test --models large
```

If you find our project useful in your research, please cite the following paper:
```bibtex
@misc{niu2026routinggenerateddataannotationfree,
    title={Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection},
    author={Tianyi Niu and Justin Chih-Yao Chen and Genta Indra Winata and Shi-Xiong Zhang and Supriyo Chakraborty and Sambit Sahu and Yue Zhang and Elias Stengel-Eskin and Mohit Bansal},
    year={2026},
    eprint={2601.09692},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2601.09692},
}
```
