In a country with a wide linguistic divide and low literacy levels, teachers commonly use translanguaging as a teaching strategy, particularly in rural and peri-urban primary schools. We present Sunbird-Tutor, a Gemma 4-powered, classroom-grade translanguaging assistant for Ugandan primary schools that can run fully offline on a basic smartphone in multiple languages. Sunbird-Tutor is an end-to-end speech model that ingests and outputs speech in 12 local Ugandan languages and is fine-tuned on primary education data from the National Curriculum Development Center in Uganda.
This repository contains the training and evaluation pipeline for Sunbird-Tutor — continued pretraining, SFT (text and multimodal), GRPO, and translation / ASR evaluation.
uv sync # installs everything pinned in pyproject.toml
apt-get install -y ffmpeg libnpp-12-8 # needed for torchcodec (speech data only)
hf auth login # Hugging Face token for gated models
export MLFLOW_TRACKING_USERNAME=... MLFLOW_TRACKING_PASSWORD=...

uv sync pulls torch + CUDA wheels from the PyTorch cu128 index (see
[tool.uv.sources]), so you don't pip install unsloth separately.
Every command is <script> --config <yaml>. Add --dry-run to validate
config + load model without training.
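
The `<script> --config <yaml>` convention with an optional `--dry-run` flag can be sketched as below. This is a minimal illustration, not the repository's actual entry point: `build_parser` and `main` are hypothetical names, and the YAML parsing and model loading are left as a comment.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `<script> --config <yaml>` convention."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument(
        "--config", type=Path, required=True, help="Path to a YAML config file"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Validate the config and load the model, then exit before training",
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    # Assumption: a real script would parse the YAML and load the model here,
    # so that --dry-run surfaces config and checkpoint errors early.
    if args.dry_run:
        return f"dry run: {args.config} (no training)"
    return f"training with {args.config}"


if __name__ == "__main__":
    print(main())
```

Keeping the flag handling in one small parser means every script can share the same two options and fail fast on a bad config before any GPU time is spent.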
# Continued pretraining
uv run python scripts/pretrain.py --config configs/pretrain/translategemma-12b.yml
# SFT -- text-only (translation)
uv run python scripts/sft.py --config configs/sft/translategemma-12b.yml
# SFT -- multimodal (translation + speech, Gemma 4)
uv run python scripts/sft.py --config configs/sft/gemma4-e4b-speech.yml
# GRPO
uv run python scripts/rl.py --config configs/rl/translategemma-grpo.yml
# Multi-GPU (DDP)
torchrun --nproc_per_node=2 scripts/sft.py --config configs/sft/translategemma-12b.yml

The SFT script auto-detects modality: if any entry under sft_datasets
in the YAML has modality: audio, it switches to the Unsloth Gemma-4
vision pipeline (multimodal processor, UnslothVisionDataCollator,
per-modality eval losses). Otherwise it runs the legacy text pipeline
with response-only masking. Existing TranslateGemma and Qwen3 configs
are unchanged.
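
The modality check described above can be sketched as follows. This is an illustration of the rule only, assuming the YAML has already been loaded into a dictionary; `uses_audio`, `select_pipeline`, and the `path` key in the example are hypothetical and not taken from the repository.

```python
from typing import Any, Mapping


def uses_audio(config: Mapping[str, Any]) -> bool:
    """Return True if any `sft_datasets` entry declares `modality: audio`."""
    datasets = config.get("sft_datasets") or []
    return any(
        isinstance(entry, Mapping) and entry.get("modality") == "audio"
        for entry in datasets
    )


def select_pipeline(config: Mapping[str, Any]) -> str:
    """Pick the multimodal pipeline if audio is present, else the text one."""
    return "multimodal" if uses_audio(config) else "text"


# Example config as it might look after YAML loading (keys other than
# `sft_datasets` and `modality` are assumptions for illustration).
example = {
    "sft_datasets": [
        {"path": "translation-pairs", "modality": "text"},
        {"path": "speech-transcripts", "modality": "audio"},
    ]
}
print(select_pipeline(example))
```

Because a single audio entry is enough to switch pipelines, a config that mixes text and audio datasets is treated as multimodal, while configs with no `modality` field fall through to the text path.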
# Translation eval (vLLM by default; --engine transformers for the HF fallback)
uv run python scripts/eval.py --config configs/eval/translation.yml
uv run python scripts/eval.py --config configs/eval/translation.yml --model-id Sunbird/other-model
uv run python scripts/eval.py --config configs/eval/translation.yml --engine transformers \
--save-predictions outputs/eval/preds.jsonl
# ASR eval (multimodal Gemma 4 — loads adapters or merged checkpoints via Unsloth)
uv run python scripts/eval_asr.py --config configs/eval/asr.yml
uv run python scripts/eval_asr.py --config configs/eval/asr.yml --model-id Sunbird/gemma4-lora-adapter
uv run python scripts/eval_asr.py --config configs/eval/asr.yml --limit 20 # debug
# Analyse + plot results (auto-detects translation vs ASR from *_results.json)
uv run python scripts/analysis/plot_metrics.py results/
uv run python scripts/analysis/plot_metrics.py results/ --mode asr

eval.py writes {model}_results.json (BLEU/chrF per language) and logs
metrics + a per-language chrF chart to MLflow. eval_asr.py writes
{model}_asr_results.json plus {model}_asr_predictions.jsonl with per-row
WER/CER for qualitative inspection. plot_metrics.py reads any
*_results.json in the given directory and emits aggregate + per-language
bar charts for whichever mode it detects.
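
For readers inspecting the per-row WER/CER values, the standard definitions can be sketched as below: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words (WER) or characters (CER). This is a minimal sketch under those textbook definitions; `eval_asr.py` may rely on a library and may normalise text (casing, punctuation) before scoring, which is not shown here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the dog sat"), cer("abc", "abd"))
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind when reading individual rows in the predictions file.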
Aggregate across 11 Ugandan languages:
Per-language chrF:
Aggregate WER and CER across evaluated languages:
Per-language WER:
Per-language CER:




