Adaptive Chess Tutoring

Efficient LLM Reasoning via Programmatic Supervision and Cognitive Distillation
CSE 537 (Artificial Intelligence), Stony Brook University, Spring 2025
Authors: Shang-Jui (Ray) Kuo, Adebayo Braimah · Instructor: Prof. Niranjan Balasubramanian

A locally-deployable chess tutor that combines move prediction and human-like explanation in a single 1.1B-parameter LLM (TinyLlama-1.1B-Chat) via two LoRA adapters trained sequentially:

Phase 1 — Programmatic Supervision. ~60K rule-grounded samples generated from Lichess PGN data using python-chess as a verifier. Teaches chess fundamentals: legal moves, piece identification, attacked squares, comment parsing, next-move prediction.
Phase 2: Cognitive Distillation. ~10K explanation samples distilled from a Mixtral-8x7B-Instruct teacher. A second LoRA adapter is trained on top of the frozen Phase 1 adapter, and the two adapters are blended at inference via PEFT's add_weighted_adapter with tunable (α, β) coefficients.
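
As a rough illustration of what blending with (α, β) means, the sketch below forms the combined weight update α·B1·A1 + β·B2·A2 from two low-rank factor pairs using plain Python lists. It does not call PEFT. The mapping of α to the Phase 1 adapter and β to the Phase 2 adapter is an assumption, and the names matmul and blend_deltas are hypothetical. PEFT's other combination_type options (for example svd or ties) combine adapters differently in detail, so this is only the idealized linear picture.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def blend_deltas(b1, a1, b2, a2, alpha, beta):
    """Return alpha * (b1 @ a1) + beta * (b2 @ a2) as a list of rows.

    Assumption: alpha weights the first (Phase 1) adapter and beta the
    second (Phase 2) adapter; the README does not state the mapping.
    """
    delta1 = matmul(b1, a1)
    delta2 = matmul(b2, a2)
    return [
        [alpha * p + beta * q for p, q in zip(row1, row2)]
        for row1, row2 in zip(delta1, delta2)
    ]


# Toy rank-1 adapters acting on a 2x2 weight matrix (made-up numbers).
b1, a1 = [[1.0], [2.0]], [[1.0, 0.0]]
b2, a2 = [[0.0], [1.0]], [[0.0, 2.0]]
blended = blend_deltas(b1, a1, b2, a2, alpha=0.25, beta=1.5)
```

In this picture, raising α scales up the Phase 1 contribution and raising β scales up the Phase 2 contribution, which is the knob the α/β sweep in the reproduction section turns.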

Reproduction status (NuWulf cluster, May 2026)

A full reproduction was performed in May 2026 on Stony Brook's NuWulf HPC cluster (NVIDIA H200 sm_90 GPUs, SLURM). In total, 47 evaluations were run, sweeping α, β, combination_type, BERTScore backbone, and training recipe.

What reproduces

✅ Base-model BERTScore F1 (claimed 0.4744, measured 0.4773, Δ = 0.003)
✅ Phase 1 chess fundamentals. can_piece_move 2.9 % → 94.2 %; is_square_attacked 31.8 % → 49.7 %; list_legal_moves F1 0 → 0.23.
✅ α/β trade-off curve. Smooth, monotone, 17-point: F1 drops 0.480 → 0.357 as α grows 0.25 → 1.5, while SSD@1 (Stockfish Score Delta, lower = better moves) ranges 238 → 727 cp — a 3× spread. The dual-LoRA design enables a real, characterizable trade-off between move quality and explanation similarity.
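
The list_legal_moves number above is an F1 score. One common way to score a list-valued answer is to compare predicted and reference moves as sets, and the sketch below does that. Whether the repository's evaluator uses exactly this definition is an assumption, and move_set_f1 is a hypothetical name.

```python
def move_set_f1(predicted, reference):
    """F1 between two collections of move strings, compared as sets."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 3 predicted moves are correct, 2 of 4 legal moves found.
score = move_set_f1(["e4", "d4", "Nf3"], ["e4", "d4", "c4", "Nc3"])
```

A set-based score of this kind penalizes both hallucinated moves (lower precision) and missed legal moves (lower recall).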

What does NOT reproduce

❌ The "Phase 2 BERTScore F1 0.4744 → 0.5891" claim. Peak F1 across our full grid (any α, β, any combination_type ∈ {linear, svd, ties, dare_linear, dare_ties, cat}, both training recipes 5-ep and 10-ep, both BERTScore backbones {deberta-xlarge-mnli, roberta-large}) is 0.4801 at (α=0.25, β=1.5). That is only about 0.003 above the measured base score of 0.4773, so the Phase 2 adapter does not push BERTScore meaningfully above base.
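
For readers who want to set up a similar sweep, the sketch below enumerates a Cartesian grid over the axes named above. The combination types and BERTScore backbones are copied from the text. The α and β values are left to the caller, and sweep_configs is a hypothetical helper, not the repository's sweep script. It makes no attempt to reproduce the exact set of 47 runs.

```python
from itertools import product

# Axis values taken from the README text; alpha/beta come from the caller.
COMBINATION_TYPES = ["linear", "svd", "ties", "dare_linear", "dare_ties", "cat"]
BERTSCORE_BACKBONES = ["deberta-xlarge-mnli", "roberta-large"]


def sweep_configs(alphas, betas, combination_types=COMBINATION_TYPES,
                  backbones=BERTSCORE_BACKBONES):
    """Yield one dict per point of the full Cartesian grid."""
    for alpha, beta, ctype, backbone in product(
        alphas, betas, combination_types, backbones
    ):
        yield {
            "alpha": alpha,
            "beta": beta,
            "combination_type": ctype,
            "bertscore_backbone": backbone,
        }


# Two alpha values x one beta value x 6 combination types x 2 backbones.
configs = list(sweep_configs(alphas=[0.25, 1.5], betas=[1.5]))
```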

See REPORT.md for the full writeup and docs/PROJECT_SUMMARY.md for the claim-by-claim verdict matrix.

Quick start

1. Clone + environment

git clone git@github.com:raykuo18/2025Spring_AI_Project.git
cd 2025Spring_AI_Project

# Create a fresh conda env (~5 min) and pin the working stack.
# peft 0.13.2 keeps PeftModel.add_weighted_adapter delegation (newer PEFT drops it).
conda create -n chess-tutor python=3.11 -y
conda activate chess-tutor
pip install --index-url https://download.pytorch.org/whl/cu121 'torch>=2.1.0'
pip install -e .                    # installs chess_tutor + all pinned deps from pyproject.toml
# Optional: GUI demo + Llama-2 GPTQ benchmark
pip install -e '.[gui,benchmarks]'

If you don't want to install the package, the entry points also work via PYTHONPATH=src (already set by scripts/env.sh).

2. Set up your data + model paths

Edit scripts/env.sh: point $PROJ at a large-storage location for the HF model cache (~100 GB for Mixtral 4-bit), Lichess raw dumps, trained adapters, and eval outputs. Put your HF token in .hf_token (gitignored). Required models: TinyLlama (public) and Mixtral-8x7B-Instruct (gated; accept the license on Hugging Face first).
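
If you need the token inside your own Python code, one option is to read .hf_token and export it as the HF_TOKEN environment variable, which recent huggingface_hub releases pick up. The sketch below assumes the file holds just the token on a single line. load_hf_token is a hypothetical helper, and how the repository's own scripts consume .hf_token may differ.

```python
import os
from pathlib import Path


def load_hf_token(token_file=".hf_token"):
    """Read the token file, strip whitespace, and export it as HF_TOKEN."""
    token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_file} is empty")
    # Assumption: downstream Hugging Face libraries read HF_TOKEN from the environment.
    os.environ["HF_TOKEN"] = token
    return token
```

Call load_hf_token() before the first model download so the gated Mixtral weights can be fetched.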

3. End-to-end pipeline

# Stage 1 — download Stockfish, TinyLlama, Mixtral (~100 GB, ~15 min)
sbatch scripts/download_mixtral.sh
# (run TinyLlama download separately; it's small)

# Stage 2 — download + parse Lichess broadcasts, generate Phase 1/2 data
sbatch scripts/full_data_gen.sh                     # ~12 min

# Stage 3 — Phase 1 fine-tuning (60K samples × 3 epochs)
sbatch scripts/phase1_full.sh                       # ~8.5 h on 1 H200

# Stage 4 — Phase 2 distillation (Mixtral teacher on 10K prompts)
sbatch scripts/phase2_distill.sh                    # ~4.5 h on 1 H200

# Stage 5 — Phase 2 LoRA training (frozen P1 + trainable P2)
sbatch scripts/phase2_full.sh                       # ~30 min on 1 H200

# Stage 6 — full α/β eval sweep (base + P1 + P2 + 9-cell grid)
sbatch scripts/eval_full_sweep.sh                   # ~2 h on 1 H200
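
Each stage consumes the previous stage's output, so the jobs have to run in order. If you prefer to queue everything at once, SLURM's --dependency=afterok option together with sbatch --parsable can chain the submissions. The sketch below only builds the shell command strings and does not submit anything. build_chain is a hypothetical helper, not part of the repository, and the separate TinyLlama download is not included.

```python
def build_chain(scripts):
    """Return shell lines that submit each script after the previous one succeeds."""
    commands = []
    for i, script in enumerate(scripts):
        if i == 0:
            # --parsable makes sbatch print just the job ID so it can be captured.
            commands.append(f"jid{i}=$(sbatch --parsable {script})")
        else:
            # afterok: start only if the previous job finished with exit code 0.
            commands.append(
                f"jid{i}=$(sbatch --parsable --dependency=afterok:$jid{i - 1} {script})"
            )
    return commands


# Script names taken from the stages above.
STAGES = [
    "scripts/download_mixtral.sh",
    "scripts/full_data_gen.sh",
    "scripts/phase1_full.sh",
    "scripts/phase2_distill.sh",
    "scripts/phase2_full.sh",
    "scripts/eval_full_sweep.sh",
]
chain = build_chain(STAGES)
```

Printing the returned lines into a shell script gives a one-shot launcher. If any stage fails, the jobs that depend on it will not start.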

Smoke runs (~5 min each) for plumbing validation: scripts/phase1_smoke.sh, scripts/phase2_smoke.sh, scripts/eval_smoke.sh.

See docs/REPRO_LOG.md for the full reproduction log, including bugs encountered.

Repository layout

.
├── README.md, REPORT.md             ← intro + full results writeup
├── AGENT.md, CLUSTER_POLICY.md      ← original mission brief + SLURM routing
├── pyproject.toml                   ← Python package metadata + pinned deps
├── requirements.txt                 ← pip requirements (subset)
├── .gitignore, .hf_token            ← (.hf_token is gitignored)
│
├── docs/                            ← deliverable documentation
│   ├── CODEBASE_NOTES.md            ← faithful map of the code
│   ├── REPRO_LOG.md                 ← time-ordered reproduction log
│   ├── PROJECT_SUMMARY.md           ← claim verdict matrix + honest assessment
│   └── RELATED_WORK.md              ← 6-axis literature survey
│
├── src/chess_tutor/                 ← Python package (importable, no pip install needed)
│   ├── training/
│   │   ├── phase1.py                ← Phase 1 entry: `python -m chess_tutor.training.phase1`
│   │   ├── phase2.py                ← Phase 2 entry: dual-LoRA recipe
│   │   └── phase1_continue.py
│   ├── eval/
│   │   ├── single.py                ← single-adapter eval harness
│   │   ├── combined.py              ← dual-adapter eval with α/β blending
│   │   └── tables.py                ← post-process eval JSONs
│   ├── data/
│   │   ├── parse_broadcast.py       ← Lichess PGN → processed JSON
│   │   ├── generate_phase1.py       ← Phase 1 JSONL generation
│   │   ├── generate_phase2_prompts.py
│   │   ├── generate_phase2_explanations.py  ← Mixtral teacher loop
│   │   ├── organize_phase2.py       ← schema check + split
│   │   ├── simple_split.py          ← train/val/test splitter
│   │   ├── extract_comments.py
│   │   ├── parse_broadcast_parallel.py
│   │   └── make_hf_dataset.py       ← HuggingFace dataset format
│   ├── inference/
│   │   └── lora.py                  ← Load adapter + run inference
│   ├── benchmarks/
│   │   └── llama2_mixtral.py        ← Llama-2 GPTQ + Mixtral benchmark
│   └── gui/                         ← PyQt5 chess board demo
│       ├── chess_gui.py
│       └── images/pieces-basic-svg/
│
├── scripts/                         ← SLURM submission + reproduction scripts
│   ├── env.sh                       ← env vars + conda activation (source me!)
│   ├── env-freeze.txt               ← pinned pip versions
│   ├── verify_env.sh
│   ├── full_data_gen.sh             ← end-to-end Phase 1+2 data generation
│   ├── download_mixtral.sh
│   ├── phase1_{smoke,full}.sh
│   ├── phase2_{smoke,distill,full,full_10ep,smoke_finish}.sh
│   └── eval_*.sh                    ← single, alpha grid, combination_type sweep, etc.
│
├── tests/
│   ├── test_stockfish.py            ← stockfish smoke (`python -m tests.test_stockfish`)
│   ├── test_resources.py            ← GPU/memory pre-flight check
│   └── test_pipeline.py             ← end-to-end smoke
│
├── examples/                        ← example PGN samples
│   ├── example_games.pgn
│   └── short_example.pgn
│
├── training_data/                   ← data-location symlinks pointing at $PROJ
└── exp-outputs/                     ← historical SLURM logs from old runs

evaluation_results/, training_output/, hf_cache/, and the training_data/phase* symlinks all live under $PROJ (the large-storage area). The repo itself stays small (~20 MB tracked).

Citation / acknowledgement

This is coursework from CSE 537, Spring 2025. If you build on it, please cite the original course project plus this reproduction:

Kuo, S-J. and Braimah, A. (2025). "Adaptive Chess Tutoring: Efficient LLM Reasoning
via Programmatic Supervision and Cognitive Distillation." CSE 537 final project,
Stony Brook University.

Kuo, S-J. (2026). "Reproduction of CSE 537 Chess-Tutor project on NuWulf HPC cluster."
Internal report. https://github.com/raykuo18/2025Spring_AI_Project
