Intelligent Chip-based Lecture Recording System with SpEx+ Speaker Extraction and RAG-Enhanced Automated Summarization for Medical Physics Education
A survey shows that 83.63% of dual-degree students need automated lecture assistance because of their intensive schedules. SpExNote integrates IoT hardware with a multi-stage AI pipeline (Silero-VAD, SpEx+ speaker extraction, Thonburian-Whisper Thai ASR, and Qwen-RAG), enabling efficient lecture tracking for students with scheduling conflicts.
The full pipeline consists of three major stages:
Audio Input → [Silero-VAD] → [Thonburian-Whisper ASR] → [Qwen + RAG] → Lecture Summary
(deployed pipeline)
Audio Data → [Silero-VAD + ClearerVoice] → [Pairing & Mixing] → SpEx+ Fine-tune Dataset
(dataset creation only)
Note: ClearerVoice-Studio is used only during dataset creation for SpEx+ fine-tuning. In the actual deployment pipeline, it is intentionally omitted because Thonburian-Whisper was trained on noisy audio and performs better with natural noise present than with pre-enhanced audio.
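The hand-off between the first two deployed stages (VAD, then ASR) can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the VAD output is a list of `{"start", "end"}` dictionaries in samples (the format Silero-VAD's `get_speech_timestamps` returns by default), and the 30-second chunk limit, 1-second gap threshold, and function name `group_speech_segments` are hypothetical choices made for this example.

```python
from typing import Dict, List

SAMPLE_RATE = 16_000  # assumed sampling rate for both VAD and ASR


def group_speech_segments(
    segments: List[Dict[str, int]],
    max_chunk_s: float = 30.0,   # hypothetical limit per ASR chunk
    max_gap_s: float = 1.0,      # hypothetical pause length that still joins segments
    sample_rate: int = SAMPLE_RATE,
) -> List[Dict[str, int]]:
    """Merge neighbouring speech segments into chunks of at most max_chunk_s.

    A single segment that is already longer than max_chunk_s is kept as-is;
    splitting it further is left out of this sketch.
    """
    max_len = int(max_chunk_s * sample_rate)
    max_gap = int(max_gap_s * sample_rate)
    chunks: List[Dict[str, int]] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if chunks:
            last = chunks[-1]
            gap = seg["start"] - last["end"]
            merged_len = seg["end"] - last["start"]
            if gap <= max_gap and merged_len <= max_len:
                # Close enough and still short enough: extend the current chunk.
                last["end"] = max(last["end"], seg["end"])
                continue
        chunks.append({"start": seg["start"], "end": seg["end"]})
    return chunks
```

Each resulting chunk can then be sliced out of the waveform (`audio[c["start"]:c["end"]]`) and passed to the ASR stage, so silence is skipped while short pauses within a sentence stay in the same chunk.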
SpExNote/
├── assets/
│   └── pipeline.png # System architecture diagram
├── poster/
│   └── SpExNote_poster.pdf # Research poster
├── notebooks/
│   ├── Fine_tune_SpEx_Plus.ipynb # Dataset preparation & SpEx+ fine-tuning (Google Colab)
│   └── Deploy_SpExNote_RAG_LLM.ipynb # Deployment pipeline with Gradio UI (Google Colab)
├── data/
│   └── youtube_urls.txt # YouTube URLs used for dataset creation
├── LICENSE
└── README.md
Note: Raw audio files are not included. The dataset was derived from publicly available YouTube content (see data/youtube_urls.txt). Trained model weights are not included due to file size constraints.
Purpose: Creates the training dataset and fine-tunes the SpEx+ speaker extraction model.
Pipeline:
- Download audio from YouTube using yt-dlp
- Apply Silero-VAD to segment speech regions
- Apply ClearerVoice-Studio (FRCRN_SE_16K) to enhance audio segments
- Pair clean (single-speaker) + interfere (multi-speaker) segments to create mix / clean / ref triplets
- Fine-tune SpEx+ model on the resulting dataset
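The pairing and mixing step can be illustrated with a short NumPy sketch. It is not taken from the notebook: mixing at a fixed signal-to-noise ratio, the function names `mix_at_snr` and `make_triplet`, and the idea that `ref` is a separate enrollment utterance from the target speaker are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, interference: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the interference so the clean-to-interference power ratio is snr_db, then add."""
    n = min(len(clean), len(interference))          # truncate to the shorter signal
    clean, interference = clean[:n], interference[:n]
    clean_power = np.mean(clean ** 2) + 1e-8
    interf_power = np.mean(interference ** 2) + 1e-8
    scale = np.sqrt(clean_power / (interf_power * 10 ** (snr_db / 10)))
    return clean + scale * interference


def make_triplet(clean: np.ndarray, interference: np.ndarray,
                 reference: np.ndarray, snr_db: float = 0.0) -> dict:
    """Build one (mix, clean, ref) training example.

    `reference` is assumed to be a different utterance by the same target
    speaker, used by the extraction model as a voice cue.
    """
    mix = mix_at_snr(clean, interference, snr_db)
    return {"mix": mix, "clean": clean[: len(mix)], "ref": reference}
```

In practice the SNR would typically be drawn at random per example so the model sees a range of interference levels; the value used for this dataset is not stated here.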
Dataset sources: See data/youtube_urls.txt for the Thai educational YouTube videos (single-speaker and multi-speaker), totaling ~15.28 hours of audio.
Base model: SpEx+ by gemengtju/SpEx_Plus
Fine-tuning result: Best Val Loss: -8.6007. The model extracts the teacher's voice, but with residual noise from the unfrozen decoder. This stage is experimental and not integrated into the current deployment.
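The loss function is not stated here. A negative best validation loss would be consistent with a negative SI-SDR-style objective (scale-invariant signal-to-distortion ratio, in dB), which is a common choice for speaker extraction models. Under that assumption, the quantity being minimized can be sketched as below; the function names are illustrative, and any additional loss terms or weights used in the actual fine-tuning are not covered.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to separate signal from distortion.
    projection = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return float(10 * np.log10(ratio))


def neg_si_sdr_loss(estimate: np.ndarray, target: np.ndarray) -> float:
    """Training-style loss: minimizing it maximizes SI-SDR."""
    return -si_sdr(estimate, target)
```

If this assumption holds, a more negative loss means cleaner extraction. The exact mapping from -8.6007 to decibels depends on loss terms that are not specified in this README.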
Purpose: Full inference pipeline deployed as a Gradio web app.
Pipeline:
- Silero-VAD: detect speech segments
- Thonburian-Whisper: Thai ASR transcription
- Qwen Embedding + FAISS: embed lecture PDF documents into vector DB
- Qwen LLM + RAG: generate context-aware lecture summary from transcript + PDF documents
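The retrieval and prompting steps above can be sketched without the real models. In this illustration a hashed bag-of-words vector stands in for the Qwen embedding, and a brute-force inner-product search stands in for the FAISS index; `toy_embed`, `retrieve`, `build_prompt`, and the prompt wording are hypothetical and only show the data flow, not the deployed notebook's code.

```python
import hashlib
from typing import List

import numpy as np

DIM = 1024  # arbitrary size for the stand-in embedding


def toy_embed(text: str, dim: int = DIM) -> np.ndarray:
    """Stand-in embedding: hashed bag of words, L2-normalized (NOT the Qwen model)."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks with the highest inner product to the query embedding."""
    matrix = np.stack([toy_embed(c) for c in chunks])
    scores = matrix @ toy_embed(query)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]


def build_prompt(transcript: str, retrieved: List[str]) -> str:
    """Combine retrieved document chunks and the transcript into one LLM prompt."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        "Summarize the lecture transcript, using the course material as grounding.\n"
        f"Course material:\n{context}\n"
        f"Transcript:\n{transcript}\n"
    )
```

In the deployed pipeline the transcript would come from the ASR stage and the chunks from the lecture PDFs; the prompt is then passed to the LLM, which keeps the summary tied to the course documents instead of relying on the transcript alone.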
Audio data for SpEx+ fine-tuning was sourced from Thai educational YouTube content. See data/youtube_urls.txt for the full list.
| Category | Description |
|---|---|
| Single-speaker | Thai lecture recordings (used as target/clean speaker) |
| Multi-speaker | Thai discussion/panel videos (used as interference) |
| Noise sample | Classroom ambient noise |
These videos are used solely for academic, non-commercial research purposes. No audio files are redistributed in this repository.
# Core
pip install torch torchaudio
pip install yt-dlp
pip install gradio
pip install faiss-cpu
# ASR
# Thonburian-Whisper: https://github.com/biodatlab/thonburian-whisper
# VAD
# Silero-VAD: loaded via torch.hub from snakers4/silero-vad
# Speech Enhancement (dataset creation only)
# ClearerVoice-Studio: https://github.com/modelscope/ClearerVoice-Studio
# LLM / Embedding
# Qwen: https://github.com/QwenLM/Qwen
This project builds upon the following open-source works:
| Component | Role | Repository | License |
|---|---|---|---|
| SpEx+ | Speaker extraction model (base) | gemengtju/SpEx_Plus | MIT |
| Silero-VAD | Voice activity detection | snakers4/silero-vad | CC BY-NC-SA 4.0 (models) / MIT (code) |
| Thonburian-Whisper | Thai ASR | biodatlab/thonburian-whisper | MIT |
| Qwen | LLM + Embedding for RAG | QwenLM/Qwen | Apache 2.0 |
| ClearerVoice-Studio | Speech enhancement (dataset only) | modelscope/ClearerVoice-Studio | Apache 2.0 |
- Silero-VAD models are licensed under CC BY-NC-SA 4.0 (non-commercial use only). This project is academic/non-commercial research.
- SpEx+ code is MIT licensed (Copyright © 2021 Meng Ge). Our fine-tuned weights are derived from this codebase.
- Thonburian-Whisper is MIT licensed (Copyright © 2022 OpenAI, fine-tuned by Biodatlab).
- Qwen and ClearerVoice-Studio are Apache 2.0 licensed, which is permissive for modification and redistribution.
SpExNote successfully integrates IoT hardware with a multi-stage AI pipeline:
- SpEx+ fine-tuned with 15.28 hours of Thai audio effectively extracts the teacher's voice
- Despite residual noise from the unfrozen decoder, it surprisingly improves Thonburian-Whisper's ASR performance
- RAG grounds the summaries in the lecture documents, addressing the need for automated lecture assistance reported by 83.63% of surveyed dual-degree students
Future work: Extended fine-tuning of TSE, ASR, and LLM components; expansion to other courses.
This project's own code is released under the MIT License; see LICENSE.
Note that third-party model weights and components retain their original licenses as listed above. Users must comply with each component's license terms, particularly the non-commercial restriction of Silero-VAD models.
If you use this work, please cite:
Tamprasert, N., & Panchalal, P. (2025). SpExNote: Intelligent Chip-based Lecture Recording
System with SpEx+ Speaker Extraction and RAG-Enhanced Automated Summarization for Medical
Physics Education. KMITL.
