Self-hosted text-to-SQL distilled from GPT-4o-mini into Qwen2.5 students. The deployable 1.5B 4-bit model is 847 MB on disk and runs in 1.16 seconds per query on a laptop. The 7B variant reaches 75.0% on Spider dev, against 80.1% for the closed teacher.
Live demo: huggingface.co/spaces/zxuhan7/Distill-SQL
All eight model configurations evaluated on the same 1,034-example Spider dev set using the official test-suite-sql-eval executor. Rows sorted by overall execution accuracy.
| model | exec | easy | medium | hard | extra |
|---|---|---|---|---|---|
| Base 0.5B (no training) | 33.9% | 50.8% | 36.1% | 22.4% | 15.1% |
| Distilled 0.5B (direct-only ablation) | 59.4% | 78.6% | 64.3% | 48.9% | 28.3% |
| Distilled 0.5B (primary recipe) | 60.0% | 81.5% | 66.8% | 47.7% | 22.3% |
| Distilled 1.5B 4-bit fused (deployment) | 62.5% | 83.5% | 69.5% | 49.4% | 25.9% |
| Distilled 1.5B (bf16) | 69.2% | 85.5% | 75.6% | 53.4% | 44.6% |
| Distilled 3B (4-bit base) | 72.6% | 90.3% | 81.4% | 56.9% | 39.2% |
| Distilled 7B (cloud) | 75.0% | 86.7% | 81.4% | 64.4% | 51.8% |
| GPT-4o-mini (closed teacher) | 80.1% | 93.1% | 84.3% | 71.8% | 57.8% |
The scaling axis is monotonic across the 14× parameter range. The 0.5B → 1.5B gain (+9.2 pt) reflects fewer execution errors at the smallest scale; the 3B → 7B gain (+2.4 pt overall) is concentrated on hard (+7.5 pt) and extra (+12.6 pt) queries, where the closed teacher still has room above the student. Easy and medium queries are already within 3 points of the teacher at the 3B scale.
Fusing the 1.5B LoRA adapter into the bf16 base, then post-training-quantizing to 4 bits:
| model | size on disk | exec accuracy | warm latency | tokens/s |
|---|---|---|---|---|
| 1.5B bf16 | 2.9 GB | 69.2% | 1.59s | 14 |
| 1.5B 4-bit (fused) | 847 MB | 62.5% | 1.16s | 18 |
| change | 3.4× smaller | -6.7 pt | 27% faster | +29% |
The 4-bit 1.5B beats every 0.5B configuration trained in this repository (best 0.5B distilled: 60.0%) while taking less disk space than the 0.5B base model.
The teacher generates three samples per question at temperature 0.3. Each sample is executed against the example's SQLite database, and a sample whose result set matches the gold rows under multiset equality is kept as the training trace. Of ~3,747 Spider train questions attempted (limited by the Tier-1 OpenAI daily-request cap), 3,397 (~91%) yielded at least one passing sample and produced a kept trace. The cloud 7B run trains on a 2,125-example subset that fits the 1024-token context window after length-trimming.
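A minimal sketch of the keep filter, assuming each trace candidate carries the generated SQL, the gold SQL, and a path to the example's SQLite database; the function names and the sample loop below are illustrative, not the repository's actual API:

```python
import sqlite3
from collections import Counter

def execution_match(candidate_sql: str, gold_sql: str, db_path: str) -> bool:
    """True if the candidate's result set equals the gold result set as a
    multiset: row order ignored, duplicate rows counted."""
    conn = sqlite3.connect(db_path)
    try:
        cand_rows = conn.execute(candidate_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # any SQLite error disqualifies the sample
    finally:
        conn.close()
    return Counter(cand_rows) == Counter(gold_rows)

def pick_trace(samples: list[str], gold_sql: str, db_path: str) -> str | None:
    """Keep the first of the temperature-0.3 samples that passes the filter;
    questions where no sample passes (~9% here) produce no trace."""
    for sql in samples:
        if execution_match(sql, gold_sql, db_path):
            return sql
    return None
```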
Failure-mode counts on the 1,034-example Spider dev set (counts, not percentages):
| failure mode | base 0.5B | distilled 0.5B | distilled 1.5B | distilled 3B | distilled 7B |
|---|---|---|---|---|---|
| ok (correct) | 329 | 575 | 670 | 709 | 776 |
| wrong-result (parses, runs, wrong rows) | 283 | 308 | 281 | 266 | 213 |
| execution (SQLite error) | 404 | 144 | 83 | 56 | 44 |
| parse (sqlglot fails) | 1 | 3 | 0 | 2 | 1 |
| empty (no SQL produced) | 17 | 4 | 0 | 1 | 0 |
Execution errors fall by roughly 10× from base to 7B distilled. The model has stopped inventing column names that do not appear in the schema.
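A rough sketch of how such a breakdown can be produced, using sqlglot for the parse check and plain SQLite execution for the rest; the function name and label strings mirror the table but are illustrative, not the repository's scorer:

```python
import sqlite3
from collections import Counter

import sqlglot
from sqlglot.errors import ParseError

def classify(pred_sql: str, gold_sql: str, db_path: str) -> str:
    """Assign one of the five failure-mode labels from the table above."""
    if not pred_sql.strip():
        return "empty"          # no SQL produced
    try:
        sqlglot.parse_one(pred_sql, read="sqlite")
    except ParseError:
        return "parse"          # sqlglot fails
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return "execution"      # parses, but SQLite raises an error
    finally:
        conn.close()
    if Counter(pred_rows) == Counter(gold_rows):
        return "ok"
    return "wrong-result"       # parses, runs, returns the wrong rows
```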
Three queries from Spider dev, chosen to illustrate how additional capacity changes model behavior on increasingly hard structures.
```sql
-- Q179 (easy, db=flight_2)
-- "What is the country of the airline JetBlue Airways?"
gold: SELECT Country FROM AIRLINES WHERE Airline = "JetBlue Airways"
0.5B: SELECT country FROM airlines WHERE abbreviation = 'JetBlue Airways'  -- wrong column
3B:   SELECT country FROM airlines WHERE airline = 'JetBlue Airways'       -- ok

-- Q270 (medium, db=employee_hire_evaluation)
-- "Find the manager name and district of the shop with the most products."
gold: SELECT manager_name, district FROM shop ORDER BY number_products DESC LIMIT 1
0.5B: SELECT s.name, s.district FROM shop AS s JOIN hiring AS h ...  -- joins it doesn't need
3B:   SELECT manager_name, district FROM shop
      WHERE number_products = (SELECT MAX(number_products) FROM shop)  -- ok (semantically equivalent)

-- Q24 (hard, db=concert_singer)
-- "Find names and capacities of stadiums that held concerts in 2014 or after."
gold: SELECT T2.name, T2.capacity FROM concert AS T1 JOIN stadium AS T2
      ON T1.stadium_id = T2.stadium_id WHERE T1.year >= 2014
0.5B: ... WHERE c.year = '2014' GROUP BY ...   -- wrong operator + spurious GROUP BY
3B:   ... WHERE c.year >= '2014' GROUP BY ...  -- ok
```

The 0.5B distilled student has the vocabulary (concert, stadium, manager_name) but not enough capacity to get every operator and join key right under one prompt. Each step up the scaling axis trades raw size for failure-mode coverage.
Cold-load and warm steady-state latency, sampled on a 16 GB M1 Pro with greedy decoding and roughly 464-token schema-linked prompts.
| model | warm latency | tokens/s | cold load | size on disk |
|---|---|---|---|---|
| Base 0.5B | 0.61s | 60 | 3.3s | 1.0 GB |
| Distilled 0.5B | 0.65s | 31 | 1.9s | 1.0 GB |
| Distilled 1.5B 4-bit fused | 1.16s | 18 | 0.8s | 847 MB |
| Distilled 1.5B (bf16) | 1.59s | 14 | 3.2s | 2.9 GB |
| Distilled 3B (4-bit base) | 2.03s | 10 | 0.8s | 1.7 GB |
Self-hosted models incur no per-query API cost. GPT-4o-mini at the same prompt sizes runs roughly $0.30 per 1,000 queries at posted Tier-1 rates, in addition to network round-trip latency. Full table: reports/latency.md.
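The warm-latency column comes from repeated generations after a warm-up pass. A rough sketch of that measurement with mlx-lm, assuming the fused 4-bit checkpoint path from the quickstart below and a pre-built schema-linked prompt file (both paths are assumptions):

```python
import time
from mlx_lm import load, generate  # macOS arm only (mlx-lm)

# Assumed path: the fused + 4-bit-quantized 1.5B produced later in the quickstart.
model, tokenizer = load("artifacts/runs/scaling_1p5b/fused_q4")
prompt = open("prompt_q179.txt").read()  # ~464-token schema-linked prompt (assumed file)

# Warm-up pass so weights, caches, and Metal kernels are resident.
generate(model, tokenizer, prompt=prompt, max_tokens=64)

# Warm steady-state: median of repeated timed generations.
timings = []
for _ in range(5):
    start = time.perf_counter()
    generate(model, tokenizer, prompt=prompt, max_tokens=64)
    timings.append(time.perf_counter() - start)
print(f"warm latency ~ {sorted(timings)[len(timings) // 2]:.2f}s")
```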
The same trace JSONL feeds the Mac and cloud arms. LoRA hyperparameters (rank 16, alpha 32, all linear targets), learning rate, and schedule are identical across all five training runs. Only the parameter count and the framework differ.
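On the CUDA arm those shared LoRA hyperparameters map onto peft's LoraConfig roughly as follows; the target-module list for Qwen2.5 and the zero dropout are assumptions standing in for "all linear targets", not values read from the repository's configs:

```python
from peft import LoraConfig

# Rank 16, alpha 32, all decoder linear projections -- the same LoRA settings
# the MLX arm applies via mlx-lm.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,  # assumption: dropout is not stated in this README
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[   # assumed Qwen2.5 linear layer names
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```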
Detailed module map and design choices: docs/methodology.md. Cloud A100 reproduction recipe: docs/cloud_a100.md.
The repository is uv-driven. Cross-platform dependencies install cleanly: `mlx` and `mlx-lm` are gated behind `sys_platform == 'darwin'`, so a Linux/CUDA box skips them and pulls only the cloud-arm requirements.
```bash
git clone https://github.com/zxuhan/distill-sql
cd distill-sql
uv sync --all-extras   # creates .venv with all deps + dev tools
```

Then, in order:
```bash
make data              # fetch Spider from HF (~1 GB, ~30s on a fast pipe)
cp .env.example .env   # set OPENAI_API_KEY
make teacher           # generate self-consistency traces
make train             # primary 0.5B, ~50 min on M1
uv run python scripts/03_train_student.py --config configs/train_ablation.yaml
uv run python scripts/03_train_student.py --config configs/train_1p5b.yaml
uv run python scripts/03_train_student.py --config configs/train_3b.yaml

# fuse + 4-bit quantize the 1.5B for deployment
uv run python -m mlx_lm fuse \
    --model mlx-community/Qwen2.5-1.5B-Instruct-bf16 \
    --adapter-path artifacts/runs/scaling_1p5b/adapter \
    --save-path artifacts/runs/scaling_1p5b/fused
uv run python -m mlx_lm convert \
    --hf-path artifacts/runs/scaling_1p5b/fused \
    --mlx-path artifacts/runs/scaling_1p5b/fused_q4 -q

make eval              # all local students + GPT-4o-mini reference
make report            # rebuild reports/results.md + figures + README headline table
```

For the 7B cloud point, see docs/cloud_a100.md. The launcher `bash scripts/run_a100.sh` auto-detects GPU count and dispatches DDP. Score the cloud predictions on the home machine via:
```bash
uv run python scripts/score_jsonl.py \
    --predictions reports/predictions/distilled_7b.jsonl \
    --name distilled_7b
```

Practical notes:
- Teacher generation requires an OpenAI API key. Tier-1 accounts have a 10K-requests-per-day cap that this pipeline can hit. Cached responses live under `artifacts/cache/teacher/` and survive across runs.
- The 3B M1 run takes about five hours on a 16 GB M1 Pro at 4-bit base + LoRA + grad checkpoint, sequence length 1024.
- The 7B cloud run completes in roughly 2.5 minutes of training plus 10-15 minutes of evaluation on a single 80 GB A100. RunPod community-cloud A100 SXM session cost: roughly $5.
To skip reproduction entirely and try the model directly, visit the HF Space.
- Execution-validated self-consistency at the teacher. Three samples per question at temperature 0.3, each executed against the example's SQLite database. Multiset-equality vs gold rows is the keep filter. About 12% of teacher generations pass.
- Schema linking via BM25 with foreign-key closure when the full schema would exceed roughly 1500 tokens. Affects the long tail of larger Spider schemas (see the sketch after this list).
- Two prompt modes during trace generation. A 60/40 mix of direct-output and brief-reasoning-then-output. Direct mode is used at inference for the small students.
- Final checkpoint, not validation-loss-best. Validation loss is token-level cross-entropy on held-out teacher traces and does not measure Spider execution accuracy.
- MLX-native LoRA on the Mac (mlx-lm, rank 16, alpha 32, all decoder linears). trl + peft + bitsandbytes with identical hyperparameters on CUDA. The predictions JSONL schema is identical, so scoring and reporting work uniformly across the two framework arms.
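A minimal sketch of the schema-linking step referenced above, assuming a BM25 scorer from rank_bm25 over table and column names plus a per-database foreign-key map; the function and argument names are illustrative, not the repository's modules:

```python
from rank_bm25 import BM25Okapi

def link_schema(question: str, tables: dict[str, list[str]],
                foreign_keys: dict[str, set[str]], top_k: int = 4) -> set[str]:
    """Rank tables by BM25 over their name + column names, keep the top_k,
    then close over foreign keys so kept tables remain joinable."""
    names = list(tables)
    docs = [[name] + tables[name] for name in names]  # one "document" per table
    bm25 = BM25Okapi([[tok.lower() for tok in doc] for doc in docs])
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(names)), key=lambda i: -scores[i])
    keep = {names[i] for i in ranked[:top_k]}
    changed = True
    while changed:  # foreign-key closure: pull in tables linked to kept ones
        changed = False
        for table in list(keep):
            for linked in foreign_keys.get(table, set()):
                if linked not in keep:
                    keep.add(linked)
                    changed = True
    return keep
```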
Full notes: docs/methodology.md.
- Spider is from 2018. Modern text-to-SQL work has moved to BIRD (Li et al., 2023), which features harder schemas and more realistic queries. A BIRD-dev evaluation on the existing 1.5B, 3B, and 7B adapters is configured but not yet executed.
- The 7B distilled student does not exceed the closed teacher. Overall accuracy is 75.0% versus 80.1%, a 5.1-point gap. A 14B 4-bit variant is configured at `configs/train_14b_cuda.yaml`. Best estimate from the slope of the scaling line: the 14B lands at 77-80%, possibly tying the teacher overall.
- No reinforcement learning from execution feedback. A standard practice for closing the last gap to the teacher would be to follow the SFT stage with PPO or GRPO, using gold-vs-prediction execution match as the reward signal.
- Single-database execution match. Reported numbers use the lenient single-database executor. The full test-suite execution accuracy of Zhong et al. (2020) is supported by the vendored evaluator via `--etype all` but not surfaced in the headline table.
- Trace dataset is small. 3,397 filtered traces is the result of one partial Tier-1 daily-budget OpenAI run. A re-run at a higher API tier should approximately double the filtered set.
- Spider benchmark. Yu, T. et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. EMNLP.
- BIRD benchmark. Li, J. et al. (2023). Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. NeurIPS.
- Test-suite execution accuracy. Zhong, R. et al. (2020). Semantic Evaluation for Text-to-SQL with Distilled Test Suites. EMNLP. Vendored at `third_party/test-suite-sql-eval/`.
- Knowledge distillation. Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network.
- LoRA. Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Self-consistency. Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
```bibtex
@inproceedings{yu-etal-2018-spider,
  title     = "Spider: A Large-Scale Human-Labeled Dataset for Complex and
               Cross-Domain Semantic Parsing and Text-to-SQL Task",
  author    = "Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and
               Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and
               Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev, Dragomir",
  booktitle = "EMNLP",
  year      = "2018"
}
```

The official evaluator vendored at `third_party/test-suite-sql-eval/` is from https://github.com/taoyds/test-suite-sql-eval under Apache 2.0; the original license is preserved alongside the source.


