SunbirdAI/sunbird-tutor-modelling

Sunbird Tutor: a translanguage spoken educational assistant for Uganda

In a country with deep linguistic divides and low literacy levels, teachers commonly rely on translanguaging as a teaching strategy, particularly in rural and peri-urban primary schools. We present Sunbird-Tutor, a Gemma4-powered, classroom-grade translanguage assistant for Ugandan primary schools that can operate fully offline on a basic smartphone in multiple languages. Sunbird-Tutor is an end-to-end speech model that ingests and outputs speech in 12 local Ugandan languages, fine-tuned on primary education data from the National Curriculum Development Center in Uganda.

This repository contains the training and evaluation pipeline for Sunbird-Tutor: continued pretraining, SFT (text and multimodal), GRPO, and translation / ASR evaluation.

Quick start

Setup (once per machine)

uv sync                          # installs everything pinned in pyproject.toml
apt-get install -y ffmpeg libnpp-12-8     # needed for torchcodec (speech data only)
hf auth login                    # Hugging Face token for gated models
export MLFLOW_TRACKING_USERNAME=... MLFLOW_TRACKING_PASSWORD=...

uv sync pulls the torch + CUDA wheels from the PyTorch cu128 index (see [tool.uv.sources] in pyproject.toml), so there is no need to pip install unsloth separately.

Train

Every command has the form <script> --config <yaml>. Add --dry-run to validate the config and load the model without training.
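The shared command-line convention could be wired up roughly as follows. This is a minimal sketch and not the repository's actual code: `build_parser` and `main` are hypothetical names, and the YAML parsing and model loading that the real scripts perform are replaced by placeholder return strings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `<script> --config <yaml>` convention."""
    parser = argparse.ArgumentParser(description="Train from a YAML config.")
    parser.add_argument("--config", required=True, help="Path to the YAML config file.")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Validate the config and load the model, then stop before training.",
    )
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # Placeholder: a real script would parse the YAML and load the model here.
    if args.dry_run:
        return f"dry run OK: {args.config}"
    return f"training with {args.config}"


# Example invocation with an explicit argument list (no real training happens).
print(main(["--config", "configs/sft/translategemma-12b.yml", "--dry-run"]))
```

Keeping `--dry-run` as a plain boolean flag means the same config file serves both validation and the real run.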

# Continued pretraining
uv run python scripts/pretrain.py --config configs/pretrain/translategemma-12b.yml

# SFT -- text-only (translation)
uv run python scripts/sft.py --config configs/sft/translategemma-12b.yml

# SFT -- multimodal (translation + speech, Gemma 4)
uv run python scripts/sft.py --config configs/sft/gemma4-e4b-speech.yml

# GRPO
uv run python scripts/rl.py --config configs/rl/translategemma-grpo.yml

# Multi-GPU (DDP)
torchrun --nproc_per_node=2 scripts/sft.py --config configs/sft/translategemma-12b.yml

The SFT script auto-detects modality: if any entry under sft_datasets in the YAML has modality: audio, it switches to the Unsloth Gemma-4 vision pipeline (multimodal processor, UnslothVisionDataCollator, per-modality eval losses). Otherwise it runs the legacy text pipeline with response-only masking. Existing TranslateGemma and Qwen3 configs are unchanged.
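The modality check described above can be illustrated with a few lines of Python. This is a sketch under assumptions: only the `sft_datasets` and `modality` keys come from this README, while `detect_pipeline`, the `path` field and the dataset paths are hypothetical. The config is written as a plain dict to stand in for the parsed YAML.

```python
def detect_pipeline(config: dict) -> str:
    """Return "multimodal" if any SFT dataset is tagged `modality: audio`, else "text"."""
    datasets = config.get("sft_datasets") or []
    if any(entry.get("modality") == "audio" for entry in datasets):
        return "multimodal"
    return "text"


# Hypothetical configs standing in for the parsed YAML files.
text_config = {"sft_datasets": [{"path": "data/translation"}]}
speech_config = {
    "sft_datasets": [
        {"path": "data/translation"},
        {"path": "data/speech", "modality": "audio"},
    ]
}

print(detect_pipeline(text_config))    # text
print(detect_pipeline(speech_config))  # multimodal
```

Because the check only looks for the presence of an audio entry, configs that never mention `modality` fall through to the text pipeline without any changes.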

Eval

# Translation eval (vLLM by default; --engine transformers for the HF fallback)
uv run python scripts/eval.py --config configs/eval/translation.yml
uv run python scripts/eval.py --config configs/eval/translation.yml --model-id Sunbird/other-model
uv run python scripts/eval.py --config configs/eval/translation.yml --engine transformers \
    --save-predictions outputs/eval/preds.jsonl

# ASR eval (multimodal Gemma 4 — loads adapters or merged checkpoints via Unsloth)
uv run python scripts/eval_asr.py --config configs/eval/asr.yml
uv run python scripts/eval_asr.py --config configs/eval/asr.yml --model-id Sunbird/gemma4-lora-adapter
uv run python scripts/eval_asr.py --config configs/eval/asr.yml --limit 20   # debug

# Analyse + plot results (auto-detects translation vs ASR from *_results.json)
uv run python scripts/analysis/plot_metrics.py results/
uv run python scripts/analysis/plot_metrics.py results/ --mode asr

eval.py writes {model}_results.json (BLEU/chrF per language) and logs metrics + a per-language chrF chart to MLflow. eval_asr.py writes {model}_asr_results.json plus {model}_asr_predictions.jsonl with per-row WER/CER for qualitative inspection. plot_metrics.py reads any *_results.json in the given directory and emits aggregate + per-language bar charts for whichever mode it detects.
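One plausible way to tell the two result types apart is by file name, given the naming scheme above. The sketch below is an assumption about how such detection could work, not a description of what plot_metrics.py actually does (it may inspect the JSON contents instead); `guess_mode` and `group_results` are hypothetical helpers.

```python
from pathlib import Path


def guess_mode(path) -> str:
    """Classify a results file as "asr" or "translation" from its file name."""
    name = Path(path).name
    # Check the more specific suffix first: ASR files also end in _results.json.
    if name.endswith("_asr_results.json"):
        return "asr"
    if name.endswith("_results.json"):
        return "translation"
    raise ValueError(f"not a results file: {name}")


def group_results(paths) -> dict:
    """Group result file names by detected mode."""
    groups = {"translation": [], "asr": []}
    for path in paths:
        groups[guess_mode(path)].append(Path(path).name)
    return groups


# Hypothetical file names following the {model}_results.json convention.
files = ["results/model-a_results.json", "results/model-a_asr_results.json"]
print(group_results(files))
```

A name-based rule like this would misclassify a translation run for a model whose name itself ends in `_asr`, which is one reason a real implementation might prefer to look at the metric keys inside the file.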

Results

Translation (chrF, higher is better)

Aggregate across 11 Ugandan languages:

[Figure: translation aggregate chrF]

Per-language chrF:

[Figure: translation per-language chrF]

ASR (WER / CER, lower is better)

Aggregate WER and CER across evaluated languages:

[Figure: ASR aggregate WER/CER]

Per-language WER:

[Figure: ASR per-language WER]

Per-language CER:

[Figure: ASR per-language CER]
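For readers unfamiliar with the ASR metrics, WER and CER are both edit distances normalised by the reference length, computed over words and characters respectively. The sketch below uses the textbook Levenshtein definition. It is not the evaluation code from this repository, which may use a library and apply text normalisation (casing, punctuation) not shown here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
print(cer("abcd", "abxd"))                # one substitution out of four characters
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so they are error rates rather than percentages bounded at 100.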

Documentation
