Efficient LLM Reasoning via Programmatic Supervision and Cognitive Distillation

CSE 537 (Artificial Intelligence), Stony Brook University, Spring 2025

Authors: Shang-Jui (Ray) Kuo, Adebayo Braimah · Instructor: Prof. Niranjan Balasubramanian

A locally-deployable chess tutor that combines move prediction and human-like explanation in a single 1.1B-parameter LLM (TinyLlama-1.1B-Chat) via two LoRA adapters trained sequentially:
- Phase 1: Programmatic Supervision. ~60K rule-grounded samples generated
  from Lichess PGN data using `python-chess` as a verifier. Teaches chess
  fundamentals: legal moves, piece identification, attacked squares, comment
  parsing, next-move prediction.
- Phase 2: Cognitive Distillation. ~10K explanation samples distilled from
  a Mixtral-8x7B-Instruct teacher. Trains a second LoRA adapter on top of a
  frozen Phase 1 adapter, blended at inference via PEFT's
  `add_weighted_adapter` with tunable (α, β) coefficients.
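
As an illustration of what "rule-grounded" can mean in Phase 1, the sketch below derives question/answer pairs for one position directly from `python-chess`, so every label is verified by the move generator rather than written by hand. The task names `list_legal_moves`, `is_square_attacked`, and `can_piece_move` come from the results below; the function name, the sample schema, and the exact task semantics are assumptions made for this example, not the repository's actual generator.

```python
import chess  # python-chess


def rule_grounded_samples(fen: str, from_sq: str, to_sq: str) -> list[dict]:
    """Build verified Q/A samples for one position (hypothetical schema)."""
    board = chess.Board(fen)
    src = chess.parse_square(from_sq)
    dst = chess.parse_square(to_sq)
    legal = list(board.legal_moves)

    # list_legal_moves: the answer is whatever the move generator returns.
    legal_san = sorted(board.san(move) for move in legal)

    # can_piece_move: is there a legal move from src to dst?
    can_move = any(m.from_square == src and m.to_square == dst for m in legal)

    # is_square_attacked: assumed here to mean "attacked by the opponent".
    attacked = board.is_attacked_by(not board.turn, dst)

    return [
        {"task": "list_legal_moves", "input": fen,
         "output": " ".join(legal_san)},
        {"task": "can_piece_move", "input": f"{fen} | {from_sq} -> {to_sq}",
         "output": "yes" if can_move else "no"},
        {"task": "is_square_attacked", "input": f"{fen} | {to_sq}",
         "output": "yes" if attacked else "no"},
    ]


if __name__ == "__main__":
    for sample in rule_grounded_samples(chess.STARTING_FEN, "g1", "f3"):
        print(sample["task"], "->", sample["output"])
```

Because the labels come from the rules engine, a malformed position fails loudly inside `chess.Board` instead of producing a wrong training target.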
A full reproduction was performed in May 2026 on Stony Brook's NuWulf HPC cluster (NVIDIA H200 sm_90 GPUs, SLURM). 47 evaluations ran across α, β, combination_type, BERTScore backbone, and training recipe.
- ✅ Base BERTScore F1 baseline (claimed 0.4744 / measured 0.4773, Δ = 0.003)
- ✅ Phase 1 chess fundamentals. `can_piece_move` 2.9 % → 94.2 %; `is_square_attacked` 31.8 % → 49.7 %; `list_legal_moves` F1 0 → 0.23.
- ✅ α/β trade-off curve. Smooth, monotone, 17-point: F1 drops 0.480 → 0.357 as α grows 0.25 → 1.5, while SSD@1 (Stockfish Score Delta, lower = better moves) ranges 238 → 727 cp, a 3× spread. The dual-LoRA design enables a real, characterizable trade-off between move quality and explanation similarity.
- ❌ The "Phase 2 BERTScore F1 0.4744 → 0.5891" claim. Peak F1 across our full grid (any α, β, any combination_type ∈ {linear, svd, ties, dare_linear, dare_ties, cat}, both training recipes 5-ep and 10-ep, both BERTScore backbones {deberta-xlarge-mnli, roberta-large}) is 0.4801 at (α=0.25, β=1.5). That is only about 0.003 above the measured base (0.4773) and far below the claimed 0.5891, so the Phase 2 adapter does not meaningfully raise BERTScore above base.
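
To make the role of the (α, β) coefficients concrete, the sketch below computes an idealized weighted sum of two low-rank updates, ΔW = α·B₁A₁ + β·B₂A₂, in plain Python. It assumes α weights the Phase 1 adapter and β the Phase 2 adapter, and it uses made-up toy matrices. It is not PEFT's implementation: what `add_weighted_adapter` actually computes depends on the chosen `combination_type`.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_delta(B: Matrix, A: Matrix, scaling: float = 1.0) -> Matrix:
    """Low-rank weight update scaling * (B @ A) for one adapter."""
    return [[scaling * v for v in row] for row in matmul(B, A)]


def blend(delta_p1: Matrix, delta_p2: Matrix,
          alpha: float, beta: float) -> Matrix:
    """Idealized blend: alpha * delta_p1 + beta * delta_p2, elementwise."""
    return [[alpha * a + beta * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(delta_p1, delta_p2)]


# Toy rank-1 adapters acting on a 2x2 weight (illustrative values only).
B1, A1 = [[1.0], [0.0]], [[1.0, 2.0]]  # Phase 1 update: [[1, 2], [0, 0]]
B2, A2 = [[0.0], [1.0]], [[3.0, 4.0]]  # Phase 2 update: [[0, 0], [3, 4]]

merged = blend(lora_delta(B1, A1), lora_delta(B2, A2), alpha=0.25, beta=1.5)
print(merged)  # [[0.25, 0.5], [4.5, 6.0]]
```

Under this idealized view, raising α strengthens the Phase 1 update and raising β strengthens the Phase 2 update, which is the knob the sweep above varies.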
See REPORT.md for the full writeup and docs/PROJECT_SUMMARY.md for the claim-by-claim verdict matrix.
git clone git@github.com:raykuo18/2025Spring_AI_Project.git
cd 2025Spring_AI_Project
# Create a fresh conda env (~5 min) and pin the working stack.
# peft 0.13.2 keeps PeftModel.add_weighted_adapter delegation (newer PEFT drops it).
conda create -n chess-tutor python=3.11 -y
conda activate chess-tutor
pip install --index-url https://download.pytorch.org/whl/cu121 'torch>=2.1.0'
pip install -e . # installs chess_tutor + all pinned deps from pyproject.toml
# Optional: GUI demo + Llama-2 GPTQ benchmark
pip install -e '.[gui,benchmarks]'

If you don't want to install the package, the entry points also work via
PYTHONPATH=src (already set by scripts/env.sh).
Edit scripts/env.sh: point $PROJ at a large-storage
location for the HF model cache (~100 GB for Mixtral 4-bit), Lichess raw
dumps, trained adapters, and eval outputs. Put your HF token in .hf_token
(gitignored). Required models: TinyLlama (public) and Mixtral-8x7B-Instruct
(gated; accept the license on Hugging Face first).
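
If you need the token from Python rather than from the shell scripts, one option is to read `.hf_token` and export it as `HF_TOKEN`, the environment variable that `huggingface_hub` checks. This is a minimal sketch; the helper name `load_hf_token` is hypothetical, and how scripts/env.sh itself consumes the file is not shown here.

```python
import os
from pathlib import Path


def load_hf_token(token_file: str = ".hf_token") -> str:
    """Read the token from a file and expose it through HF_TOKEN."""
    token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_file} is empty")
    # huggingface_hub picks the token up from this environment variable.
    os.environ["HF_TOKEN"] = token
    return token
```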
# Stage 1 — download Stockfish, TinyLlama, Mixtral (~100 GB, ~15 min)
sbatch scripts/download_mixtral.sh
# (run TinyLlama download separately; it's small)
# Stage 2 — download + parse Lichess broadcasts, generate Phase 1/2 data
sbatch scripts/full_data_gen.sh # ~12 min
# Stage 3 — Phase 1 fine-tuning (60K samples × 3 epochs)
sbatch scripts/phase1_full.sh # ~8.5 h on 1 H200
# Stage 4 — Phase 2 distillation (Mixtral teacher on 10K prompts)
sbatch scripts/phase2_distill.sh # ~4.5 h on 1 H200
# Stage 5 — Phase 2 LoRA training (frozen P1 + trainable P2)
sbatch scripts/phase2_full.sh # ~30 min on 1 H200
# Stage 6 — full α/β eval sweep (base + P1 + P2 + 9-cell grid)
sbatch scripts/eval_full_sweep.sh # ~2 h on 1 H200

Smoke runs (~5 min each) for plumbing validation: scripts/phase1_smoke.sh, scripts/phase2_smoke.sh, scripts/eval_smoke.sh.
See docs/REPRO_LOG.md for the full reproduction log, including bugs encountered.
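
Since each stage consumes the previous stage's outputs, one way to queue all six at once is to chain them with SLURM's `--dependency=afterok`. The sketch below builds those `sbatch` commands from the script names listed above. The helper names are hypothetical, a strictly linear chain is an assumption (some stages may be able to overlap), and the repository's scripts may already handle ordering differently.

```python
import subprocess
from typing import Callable, Optional

# Stage scripts in the order given above.
STAGES = [
    "scripts/download_mixtral.sh",
    "scripts/full_data_gen.sh",
    "scripts/phase1_full.sh",
    "scripts/phase2_distill.sh",
    "scripts/phase2_full.sh",
    "scripts/eval_full_sweep.sh",
]


def sbatch_command(script: str, after_job: Optional[str] = None) -> list[str]:
    """Build an sbatch call that optionally waits for a job to succeed."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job ID
    if after_job is not None:
        cmd.append(f"--dependency=afterok:{after_job}")
    cmd.append(script)
    return cmd


def run_sbatch(cmd: list[str]) -> str:
    """Submit the job and return its ID (output is 'jobid[;cluster]')."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(stages: list[str],
                 run: Callable[[list[str]], str] = run_sbatch) -> list[str]:
    """Submit stages in order, each depending on the one before it."""
    job_ids: list[str] = []
    previous: Optional[str] = None
    for script in stages:
        previous = run(sbatch_command(script, previous))
        job_ids.append(previous)
    return job_ids
```

Calling `submit_chain(STAGES)` from a login node would submit everything in one go; with `afterok`, a failed stage leaves the later jobs unstarted rather than running them on missing inputs. The TinyLlama download still has to be run separately.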
.
├── README.md, REPORT.md ← intro + full results writeup
├── AGENT.md, CLUSTER_POLICY.md ← original mission brief + SLURM routing
├── pyproject.toml ← Python package metadata + pinned deps
├── requirements.txt ← pip requirements (subset)
├── .gitignore, .hf_token ← (.hf_token is gitignored)
│
├── docs/ ← deliverable documentation
│ ├── CODEBASE_NOTES.md ← faithful map of the code
│ ├── REPRO_LOG.md ← time-ordered reproduction log
│ ├── PROJECT_SUMMARY.md ← claim verdict matrix + honest assessment
│ └── RELATED_WORK.md ← 6-axis literature survey
│
├── src/chess_tutor/ ← Python package (importable, no pip install needed)
│ ├── training/
│ │ ├── phase1.py ← Phase 1 entry: `python -m chess_tutor.training.phase1`
│ │ ├── phase2.py ← Phase 2 entry: dual-LoRA recipe
│ │ └── phase1_continue.py
│ ├── eval/
│ │ ├── single.py ← single-adapter eval harness
│ │ ├── combined.py ← dual-adapter eval with α/β blending
│ │ └── tables.py ← post-process eval JSONs
│ ├── data/
│ │ ├── parse_broadcast.py ← Lichess PGN → processed JSON
│ │ ├── generate_phase1.py ← Phase 1 JSONL generation
│ │ ├── generate_phase2_prompts.py
│ │ ├── generate_phase2_explanations.py ← Mixtral teacher loop
│ │ ├── organize_phase2.py ← schema check + split
│ │ ├── simple_split.py ← train/val/test splitter
│ │ ├── extract_comments.py
│ │ ├── parse_broadcast_parallel.py
│ │ └── make_hf_dataset.py ← HuggingFace dataset format
│ ├── inference/
│ │ └── lora.py ← Load adapter + run inference
│ ├── benchmarks/
│ │ └── llama2_mixtral.py ← Llama-2 GPTQ + Mixtral benchmark
│ └── gui/ ← PyQt5 chess board demo
│   ├── chess_gui.py
│   └── images/pieces-basic-svg/
│
├── scripts/ ← SLURM submission + reproduction scripts
│ ├── env.sh ← env vars + conda activation (source me!)
│ ├── env-freeze.txt ← pinned pip versions
│ ├── verify_env.sh
│ ├── full_data_gen.sh ← end-to-end Phase 1+2 data generation
│ ├── download_mixtral.sh
│ ├── phase1_{smoke,full}.sh
│ ├── phase2_{smoke,distill,full,full_10ep,smoke_finish}.sh
│ └── eval_*.sh ← single, alpha grid, combination_type sweep, etc.
│
├── tests/
│ ├── test_stockfish.py ← stockfish smoke (`python -m tests.test_stockfish`)
│ ├── test_resources.py ← GPU/memory pre-flight check
│ └── test_pipeline.py ← end-to-end smoke
│
├── examples/ ← example PGN samples
│ ├── example_games.pgn
│ └── short_example.pgn
│
├── training_data/ ← data-location symlinks pointing at $PROJ
└── exp-outputs/ ← historical SLURM logs from old runs
evaluation_results/, training_output/, hf_cache/, and the
training_data/phase* symlinks all live under $PROJ (the large-storage
area). The repo itself stays small (~20 MB tracked).
This is coursework from CSE 537, Spring 2025. If you build on it, please cite the original course project plus this reproduction:
Kuo, S-J. and Braimah, A. (2025). "Adaptive Chess Tutoring: Efficient LLM Reasoning
via Programmatic Supervision and Cognitive Distillation." CSE 537 final project,
Stony Brook University.
Kuo, S-J. (2026). "Reproduction of CSE 537 Chess-Tutor project on NuWulf HPC cluster."
Internal report. https://github.com/raykuo18/2025Spring_AI_Project
