
distill-sql


Self-hosted text-to-SQL distilled from GPT-4o-mini into Qwen2.5 students. The deployable 1.5B 4-bit model is 847 MB on disk and runs in 1.16 seconds per query on a laptop. The 7B variant reaches 75.0% on Spider dev, against 80.1% for the closed teacher.

distill-sql streaming a SQL response on three Spider dev questions

Spider dev execution accuracy across the scaling axis


Evaluation

All eight model configurations are evaluated on the same 1,034-example Spider dev set using the official test-suite-sql-eval executor. Rows are sorted by overall execution accuracy.
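
Scoring a prediction file against the vendored executor is a single CLI call. A minimal sketch follows; the file paths are placeholders and only the evaluation.py flags come from the upstream test-suite-sql-eval project:

import subprocess

# Placeholder paths; adjust to wherever this repo writes gold and prediction files.
GOLD = "data/spider/dev_gold.sql"                # assumed: one "SQL<TAB>db_id" per line
PRED = "reports/predictions/distilled_1p5b.sql"  # assumed: one predicted SQL per line, same order
DB_DIR = "data/spider/database"
TABLES = "data/spider/tables.json"

subprocess.run(
    [
        "python", "third_party/test-suite-sql-eval/evaluation.py",
        "--gold", GOLD,
        "--pred", PRED,
        "--db", DB_DIR,
        "--table", TABLES,
        "--etype", "exec",                       # execution accuracy, as reported below
    ],
    check=True,
)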

| model | exec | easy | medium | hard | extra |
| --- | --- | --- | --- | --- | --- |
| Base 0.5B (no training) | 33.9% | 50.8% | 36.1% | 22.4% | 15.1% |
| Distilled 0.5B (direct-only ablation) | 59.4% | 78.6% | 64.3% | 48.9% | 28.3% |
| Distilled 0.5B (primary recipe) | 60.0% | 81.5% | 66.8% | 47.7% | 22.3% |
| Distilled 1.5B 4-bit fused (deployment) | 62.5% | 83.5% | 69.5% | 49.4% | 25.9% |
| Distilled 1.5B (bf16) | 69.2% | 85.5% | 75.6% | 53.4% | 44.6% |
| Distilled 3B (4-bit base) | 72.6% | 90.3% | 81.4% | 56.9% | 39.2% |
| Distilled 7B (cloud) | 75.0% | 86.7% | 81.4% | 64.4% | 51.8% |
| GPT-4o-mini (closed teacher) | 80.1% | 93.1% | 84.3% | 71.8% | 57.8% |

Per-difficulty execution accuracy across the scaling axis

The scaling axis is monotonic across the 14× parameter range. The 0.5B → 1.5B gain (+9.2 pt) reflects fewer execution errors at the smallest scale; the 3B → 7B gain (+2.4 pt overall) is concentrated on hard (+7.5 pt) and extra (+12.6 pt) queries, where the closed teacher still has headroom over the student. Easy and medium are already within 3 points of the teacher at 3B.

Quantization tradeoff

Fusing the 1.5B LoRA adapter into the bf16 base, then post-training-quantizing to 4 bits:

| | size on disk | exec accuracy | warm latency | tokens/s |
| --- | --- | --- | --- | --- |
| 1.5B bf16 | 2.9 GB | 69.2% | 1.59s | 14 |
| 1.5B 4-bit (fused) | 847 MB | 62.5% | 1.16s | 18 |
| change | 3.4× smaller | -6.7 pt | 27% faster | +29% |

The 4-bit 1.5B beats every 0.5B configuration trained in this repository (best distilled 0.5B: 60.0%) while taking less disk space than the 0.5B base model.

Failure modes

The teacher generates three samples per question at temperature 0.3. Each sample is executed against the example's SQLite database, and a sample whose result set matches the gold rows under multiset equality is kept as the training trace. Of ~3,747 Spider train questions attempted (limited by the Tier-1 OpenAI daily-request cap), 3,397 (~91%) yielded at least one passing sample and produced a kept trace. The cloud 7B run trains on a 2,125-example subset that fits the 1,024-token context window after length trimming.
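
A minimal sketch of this keep filter, assuming the OpenAI Python SDK; the prompt construction and trace format are simplified, and the real pipeline lives in the teacher script rather than in this snippet:

import sqlite3
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_sql(db_path: str, sql: str):
    """Execute sql against a SQLite file; return its rows, or None on any SQLite error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def keep_trace(prompt: str, db_path: str, gold_sql: str):
    """Sample the teacher three times and keep the first sample whose rows match gold."""
    gold_rows = run_sql(db_path, gold_sql)
    if gold_rows is None:
        return None
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        n=3,  # three samples per question
    )
    for choice in response.choices:
        pred_sql = choice.message.content.strip()  # in practice the SQL is extracted from the reply
        pred_rows = run_sql(db_path, pred_sql)
        # multiset equality: same rows with the same multiplicities, order ignored
        if pred_rows is not None and Counter(pred_rows) == Counter(gold_rows):
            return {"prompt": prompt, "sql": pred_sql}
    return None  # no passing sample: the question yields no trace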

Failure-mode counts on the 1,034-example Spider dev set (counts, not percentages):

| failure mode | base 0.5B | distilled 0.5B | distilled 1.5B | distilled 3B | distilled 7B |
| --- | --- | --- | --- | --- | --- |
| ok (correct) | 329 | 575 | 670 | 709 | 776 |
| wrong-result (parses, runs, wrong rows) | 283 | 308 | 281 | 266 | 213 |
| execution (SQLite error) | 404 | 144 | 83 | 56 | 44 |
| parse (sqlglot fails) | 1 | 3 | 0 | 2 | 1 |
| empty (no SQL produced) | 17 | 4 | 0 | 1 | 0 |

Execution errors fall by roughly 10× from the base model to the distilled 7B. The larger students have essentially stopped inventing column names that do not appear in the schema.
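
The four failure buckets can be approximated with a short classifier. A sketch assuming sqlglot and the per-example SQLite databases; the bucket names mirror the table above, while the helper itself is illustrative rather than the repo's actual scorer:

import sqlite3
from collections import Counter

import sqlglot
from sqlglot.errors import ParseError


def classify(pred_sql: str, gold_sql: str, db_path: str) -> str:
    """Bucket a prediction into empty / parse / execution / wrong-result / ok."""
    if not pred_sql.strip():
        return "empty"
    try:
        sqlglot.parse_one(pred_sql, read="sqlite")
    except ParseError:
        return "parse"
    try:
        with sqlite3.connect(db_path) as conn:
            pred_rows = conn.execute(pred_sql).fetchall()
            gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return "execution"
    # same multiset-equality test the teacher filter uses
    return "ok" if Counter(pred_rows) == Counter(gold_rows) else "wrong-result"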

Example outputs

Three queries from Spider dev, chosen to illustrate how additional capacity changes model behavior on increasingly hard structures.

-- Q179 (easy, db=flight_2)
-- "What is the country of the airline JetBlue Airways?"
gold:    SELECT Country FROM AIRLINES WHERE Airline = "JetBlue Airways"
0.5B:    SELECT country FROM airlines WHERE abbreviation = 'JetBlue Airways'  -- wrong column
3B:      SELECT country FROM airlines WHERE airline = 'JetBlue Airways'        -- ok
-- Q270 (medium, db=employee_hire_evaluation)
-- "Find the manager name and district of the shop with the most products."
gold:    SELECT manager_name, district FROM shop ORDER BY number_products DESC LIMIT 1
0.5B:    SELECT s.name, s.district FROM shop AS s JOIN hiring AS h ...        -- joins it doesn't need
3B:      SELECT manager_name, district FROM shop
         WHERE number_products = (SELECT MAX(number_products) FROM shop)        -- ok (semantically equivalent)
-- Q24 (hard, db=concert_singer)
-- "Find names and capacities of stadiums that held concerts in 2014 or after."
gold:    SELECT T2.name, T2.capacity FROM concert AS T1 JOIN stadium AS T2
         ON T1.stadium_id = T2.stadium_id WHERE T1.year >= 2014
0.5B:    ... WHERE c.year = '2014' GROUP BY ...                               -- wrong operator + spurious GROUP BY
3B:      ... WHERE c.year >= '2014' GROUP BY ...                              -- ok

The 0.5B distilled student has the vocabulary (concert, stadium, manager_name) but not enough capacity to get every operator and join key right in a single prompt. Each step up the scaling axis buys broader failure-mode coverage at the cost of raw size.

Performance

Cold-load and warm steady-state numbers, sampled on a 16 GB M1 Pro with greedy decoding and roughly 464-token schema-linked prompts.

| model | warm latency | tokens/s | cold load | size on disk |
| --- | --- | --- | --- | --- |
| Base 0.5B | 0.61s | 60 | 3.3s | 1.0 GB |
| Distilled 0.5B | 0.65s | 31 | 1.9s | 1.0 GB |
| Distilled 1.5B 4-bit fused | 1.16s | 18 | 0.8s | 847 MB |
| Distilled 1.5B (bf16) | 1.59s | 14 | 3.2s | 2.9 GB |
| Distilled 3B (4-bit base) | 2.03s | 10 | 0.8s | 1.7 GB |

Self-hosted models incur no per-query API cost. GPT-4o-mini at the same prompt sizes costs roughly $0.30 per 1,000 queries at posted Tier-1 rates and adds network round-trip latency to every call. Full table: reports/latency.md.
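
The warm-latency numbers can be reproduced with a few lines of mlx-lm. A sketch assuming the fused 4-bit checkpoint produced in the Reproduce section; the model path and prompt are placeholders, while load and generate are the standard mlx_lm entry points:

import time
from mlx_lm import load, generate

# Assumed location of the fused, 4-bit-quantized 1.5B from the Reproduce section.
model, tokenizer = load("artifacts/runs/scaling_1p5b/fused_q4")

prompt = "..."  # a ~464-token schema-linked prompt, built the same way as in evaluation

# One warm-up call, then time a single greedy generation end to end.
generate(model, tokenizer, prompt=prompt, max_tokens=8)
start = time.perf_counter()
sql = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(f"{time.perf_counter() - start:.2f}s", sql)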

Architecture

distill-sql architecture

The same trace JSONL feeds the Mac and cloud arms. LoRA hyperparameters (rank 16, alpha 32, all linear targets), learning rate, and schedule are identical across all five training runs. Only the parameter count and the framework differ.
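
On the CUDA arm (trl + peft, per the methodology notes) these shared settings map onto a standard LoraConfig. In the sketch below only the rank, alpha, and all-linear targeting come from the README; the dropout value and task type are assumptions:

from peft import LoraConfig

# Shared adapter settings: rank 16, alpha 32, every linear projection targeted.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",   # peft shorthand for targeting all decoder linears
    lora_dropout=0.0,              # assumption: the README does not state a dropout
    task_type="CAUSAL_LM",
)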

Detailed module map and design choices: docs/methodology.md. Cloud A100 reproduction recipe: docs/cloud_a100.md.

Reproduce

The repository is uv-driven. Cross-platform dependencies install cleanly: mlx and mlx-lm are gated behind sys_platform == 'darwin', so a Linux/CUDA box skips them and pulls only the cloud-arm requirements.

git clone https://github.com/zxuhan/distill-sql
cd distill-sql
uv sync --all-extras                 # creates .venv with all deps + dev tools

Then, in order:

make data                            # fetch Spider from HF (~1 GB, ~30s on a fast pipe)
cp .env.example .env                 # set OPENAI_API_KEY
make teacher                         # generate self-consistency traces
make train                           # primary 0.5B, ~50 min on M1
uv run python scripts/03_train_student.py --config configs/train_ablation.yaml
uv run python scripts/03_train_student.py --config configs/train_1p5b.yaml
uv run python scripts/03_train_student.py --config configs/train_3b.yaml

# fuse + 4-bit quantize the 1.5B for deployment
uv run python -m mlx_lm fuse \
    --model mlx-community/Qwen2.5-1.5B-Instruct-bf16 \
    --adapter-path artifacts/runs/scaling_1p5b/adapter \
    --save-path artifacts/runs/scaling_1p5b/fused
uv run python -m mlx_lm convert \
    --hf-path artifacts/runs/scaling_1p5b/fused \
    --mlx-path artifacts/runs/scaling_1p5b/fused_q4 -q

make eval                            # all local students + GPT-4o-mini reference
make report                          # rebuild reports/results.md + figures + README headline table

For the 7B cloud point, see docs/cloud_a100.md. The launcher, bash scripts/run_a100.sh, auto-detects the GPU count and dispatches DDP. Score the cloud predictions on the home machine via:

uv run python scripts/score_jsonl.py \
    --predictions reports/predictions/distilled_7b.jsonl \
    --name distilled_7b

Practical notes:

  • Teacher generation requires an OpenAI API key. Tier-1 accounts have a 10K-requests-per-day cap that this pipeline can hit. Cached responses live under artifacts/cache/teacher/ and survive across runs (a sketch of the cache keying follows this list).
  • The 3B M1 run takes about five hours on a 16 GB M1 Pro with a 4-bit base, LoRA, and gradient checkpointing at sequence length 1024.
  • The 7B cloud run completes in roughly 2.5 minutes of training plus 10-15 minutes of evaluation on a single 80 GB A100. RunPod community-cloud A100 SXM session cost: roughly $5.
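
A sketch of how such a cache can be keyed, assuming one JSON file per prompt hash; only the artifacts/cache/teacher/ directory comes from the notes above, and the layout and helper are illustrative:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("artifacts/cache/teacher")


def cached_completion(prompt: str, fetch) -> dict:
    """Return a cached teacher response if present, otherwise call fetch(prompt) and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                # survives across runs and partial daily budgets
        return json.loads(path.read_text())
    response = fetch(prompt)         # e.g. the OpenAI call from the teacher step
    path.write_text(json.dumps(response))
    return response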

To skip reproduction entirely and try the model directly, visit the HF Space.

Methodology

  • Execution-validated self-consistency at the teacher. Three samples per question at temperature 0.3, each executed against the example's SQLite database. Multiset-equality vs gold rows is the keep filter. About 12% of teacher generations pass.
  • Schema linking via BM25 with foreign-key closure when the full schema would exceed roughly 1,500 tokens. This affects the long tail of larger Spider schemas; a minimal sketch of the retrieval step follows this list.
  • Two prompt modes during trace generation. A 60/40 mix of direct-output and brief-reasoning-then-output. Direct mode is used at inference for the small students.
  • Final checkpoint, not validation-loss-best. Validation loss is token-level cross-entropy on held-out teacher traces and does not measure Spider execution accuracy.
  • MLX-native LoRA on the Mac (mlx-lm, rank 16, alpha 32, all decoder linears). trl + peft + bitsandbytes with identical hyperparameters on CUDA. The predictions JSONL schema is identical, so scoring and reporting work uniformly across the two framework arms.
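
A minimal sketch of the BM25 retrieval step, assuming the rank_bm25 package and a simple table-document representation; the foreign-key closure and the exact token budget are simplified from the description above:

from rank_bm25 import BM25Okapi


def link_schema(question: str, tables: dict[str, list[str]], top_k: int = 4) -> list[str]:
    """Rank tables against the question with BM25 and keep the top_k table names."""
    names = list(tables)
    # One "document" per table: its name plus its column names.
    docs = [[name.lower()] + [col.lower() for col in tables[name]] for name in names]
    bm25 = BM25Okapi(docs)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(scores, names), reverse=True)
    kept = [name for _, name in ranked[:top_k]]
    # In the full pipeline, tables reachable from the kept set via foreign keys are
    # added back (foreign-key closure) before the prompt is built.
    return kept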

Full notes: docs/methodology.md.

Limitations

  • Spider is from 2018. Modern text-to-SQL work has moved to BIRD (Li et al., 2023), which features harder schemas and more realistic queries. A BIRD-dev evaluation on the existing 1.5B, 3B, and 7B adapters is configured but not yet executed.
  • The 7B distilled student does not exceed the closed teacher. Overall accuracy is 75.0% versus 80.1% (a 5.1-point gap). A 14B 4-bit variant is configured at configs/train_14b_cuda.yaml. Extrapolating the slope of the scaling line puts a 14B student at roughly 77-80%, possibly tying the teacher overall.
  • No reinforcement learning from execution feedback. A standard way to close the remaining gap to the teacher would be to follow the SFT stage with PPO or GRPO, using gold-vs-prediction execution match as the reward signal.
  • Single-database execution match. Reported numbers use the lenient single-database executor. The full test-suite execution accuracy of Zhong et al. (2020) is supported by the vendored evaluator via --etype all but not surfaced in the headline table.
  • Trace dataset is small. The 3,397 filtered traces are the result of a single, partial Tier-1 daily-budget OpenAI run. A re-run at a higher API tier should approximately double the filtered set.

References

Citing Spider

@inproceedings{yu-etal-2018-spider,
  title     = "Spider: A Large-Scale Human-Labeled Dataset for Complex and
               Cross-Domain Semantic Parsing and Text-to-SQL Task",
  author    = "Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and
               Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and
               Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev,
               Dragomir",
  booktitle = "EMNLP",
  year      = "2018"
}

The official evaluator vendored at third_party/test-suite-sql-eval/ is from https://github.com/taoyds/test-suite-sql-eval under Apache 2.0; the original license is preserved alongside the source.
