natthawin0614/SpExNote


SpExNote 🎙️

Intelligent Chip-based Lecture Recording System with SpEx+ Speaker Extraction and RAG-Enhanced Automated Summarization for Medical Physics Education


Overview

A survey found that 83.63% of dual-degree students need automated lecture assistance because of their intensive schedules. SpExNote integrates IoT hardware with a multi-stage AI pipeline (Silero-VAD, SpEx+ speaker extraction, Thonburian-Whisper Thai ASR, and Qwen-RAG) so that students with scheduling conflicts can keep up with lectures efficiently.


System Architecture

Figure: SpExNote methodologies, software pipeline (assets/pipeline.png)

The deployed pipeline consists of three major stages. A separate pipeline is used only to build the SpEx+ fine-tuning dataset:

Audio Input → [Silero-VAD] → [Thonburian-Whisper ASR] → [Qwen + RAG] → Lecture Summary
                              (deployed pipeline)

Audio Data → [Silero-VAD + ClearerVoice] → [Pairing & Mixing] → SpEx+ Fine-tune Dataset
                                          (dataset creation only)

Note: ClearerVoice-Studio is used only during dataset creation for SpEx+ fine-tuning. It is intentionally omitted from the deployment pipeline because Thonburian-Whisper was trained on noisy audio and performs better with natural noise present than with pre-enhanced audio.
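The hand-off from Silero-VAD to Thonburian-Whisper in the deployed pipeline can be sketched as follows. This is a minimal illustration, not the notebook's code. It assumes 16 kHz audio and VAD output as a list of dicts with `start` and `end` sample indices (the shape produced by Silero-VAD's `get_speech_timestamps`). The helper name `group_segments` and the 30-second limit are hypothetical choices, the latter motivated by the 30-second window of Whisper-family models.

```python
from typing import Dict, List

SAMPLE_RATE = 16_000      # assumed sampling rate in Hz
MAX_CHUNK_SECONDS = 30.0  # assumed limit, matching Whisper's 30-second window


def group_segments(segments: List[Dict[str, int]],
                   max_seconds: float = MAX_CHUNK_SECONDS,
                   sample_rate: int = SAMPLE_RATE) -> List[Dict[str, int]]:
    """Merge consecutive VAD speech segments into chunks for ASR.

    A segment is appended to the current chunk while the chunk stays within
    max_seconds. A single segment longer than the limit is kept as its own
    chunk rather than being split.
    """
    max_samples = int(max_seconds * sample_rate)
    chunks: List[Dict[str, int]] = []
    for seg in segments:
        if chunks and seg["end"] - chunks[-1]["start"] <= max_samples:
            chunks[-1]["end"] = seg["end"]  # extend the current chunk
        else:
            # Start a new chunk (copy, so the input list is not mutated)
            chunks.append({"start": seg["start"], "end": seg["end"]})
    return chunks


if __name__ == "__main__":
    vad_output = [
        {"start": 0, "end": 10 * SAMPLE_RATE},
        {"start": 12 * SAMPLE_RATE, "end": 25 * SAMPLE_RATE},
        {"start": 28 * SAMPLE_RATE, "end": 40 * SAMPLE_RATE},
    ]
    for chunk in group_segments(vad_output):
        print(chunk["start"] / SAMPLE_RATE, chunk["end"] / SAMPLE_RATE)
```

Each resulting chunk would then be sliced from the waveform and passed to the ASR model. Short pauses inside a chunk stay in the audio, consistent with the note above that the ASR stage receives natural, unenhanced audio.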


Repository Structure

SpExNote/
├── assets/
│   └── pipeline.png              # System architecture diagram
├── poster/
│   └── SpExNote_poster.pdf       # Research poster
├── notebooks/
│   ├── Fine_tune_SpEx_Plus.ipynb # Dataset preparation & SpEx+ fine-tuning (Google Colab)
│   └── Deploy_SpExNote_RAG_LLM.ipynb  # Deployment pipeline with Gradio UI (Google Colab)
├── data/
│   └── youtube_urls.txt          # YouTube URLs used for dataset creation
├── LICENSE
└── README.md

Note: Raw audio files are not included. The dataset was derived from publicly available YouTube content (see data/youtube_urls.txt). Trained model weights are not included due to file size constraints.


Notebooks

1. Fine_tune_SpEx_Plus.ipynb: Dataset Preparation & SpEx+ Fine-tuning

Purpose: Creates the training dataset and fine-tunes the SpEx+ speaker extraction model.

Pipeline:

  1. Download audio from YouTube using yt-dlp
  2. Apply Silero-VAD to segment speech regions
  3. Apply ClearerVoice-Studio (FRCRN_SE_16K) to enhance audio segments
  4. Pair clean (single-speaker) segments with interfering (multi-speaker) segments to create mix / clean / ref triplets
  5. Fine-tune SpEx+ model on the resulting dataset
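The pairing and mixing step (step 4) can be sketched as below. This is an illustration under stated assumptions, not the notebook's implementation. Plain Python lists stand in for audio tensors. `ref` is assumed to be a different utterance from the same target speaker, as is usual for target speaker extraction. The 0 dB default signal-to-noise ratio and the helper names `mix_at_snr` and `make_triplet` are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional, Sequence

Signal = List[float]


def rms(x: Sequence[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(clean: Sequence[float], interfere: Sequence[float],
               snr_db: float) -> Signal:
    """Add interference to clean speech at the requested SNR (in dB)."""
    n = min(len(clean), len(interfere))  # truncate to the shorter signal
    clean, interfere = clean[:n], interfere[:n]
    # Choose gain so that 20*log10(rms(clean) / rms(gain*interfere)) == snr_db
    gain = rms(clean) / (rms(interfere) * 10 ** (snr_db / 20) + 1e-12)
    return [c + gain * i for c, i in zip(clean, interfere)]


def make_triplet(target_segments: Sequence[Sequence[float]],
                 interfere_segments: Sequence[Sequence[float]],
                 snr_db: float = 0.0,
                 rng: Optional[random.Random] = None) -> Dict[str, Signal]:
    """Build one mix / clean / ref training example.

    target_segments must hold at least two segments of the target speaker:
    one becomes the clean target, another the reference utterance.
    """
    rng = rng or random.Random(0)
    clean_idx, ref_idx = rng.sample(range(len(target_segments)), 2)
    interfere = rng.choice(interfere_segments)
    mix = mix_at_snr(target_segments[clean_idx], interfere, snr_db)
    clean = list(target_segments[clean_idx][:len(mix)])
    return {"mix": mix, "clean": clean, "ref": list(target_segments[ref_idx])}


if __name__ == "__main__":
    teacher = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
    panel = [[2.0, 2.0, -2.0, -2.0]]
    print(make_triplet(teacher, panel, snr_db=0.0))
```

Scaling the interference rather than the clean signal keeps the `clean` target identical to the source segment, so the model is always asked to reproduce the original speech level.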

Dataset sources: See data/youtube_urls.txt for the Thai educational YouTube videos (single-speaker and multi-speaker), totaling ~15.28 hours of audio.

Base model: SpEx+ by gemengtju/SpEx_Plus

Fine-tuning result: Best Val Loss: -8.6007. The model extracts the teacher's voice, but with residual noise from the unfrozen decoder. This stage is experimental and not integrated into the current deployment.
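A negative best validation loss is what one would expect if the loss is the negative of a scale-invariant signal-to-distortion ratio (SI-SDR) in dB, the family of objective used to train SpEx+. This README does not state the exact loss definition, so the sketch below is an assumption offered only to make the sign convention concrete. The reported value would read as roughly 8.6 dB only if the loss were a single plain SI-SDR term.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], target: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]  # remove DC offset
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target to get the scaled target component
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    scale = dot / (tgt_energy + eps)
    projection = [scale * t for t in tgt]

    # Everything left over after the projection counts as distortion
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(v * v for v in noise)
    return 10 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_sdr_loss(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Training loss: minimizing it maximizes SI-SDR, hence negative values."""
    return -si_sdr(estimate, target)


if __name__ == "__main__":
    target = [1.0, 0.0, -1.0, 0.0]
    estimate = [1.0, 0.1, -1.0, -0.1]  # target plus a small orthogonal error
    print(round(si_sdr_loss(estimate, target), 2))
```

With this convention a more negative loss means cleaner extraction, so residual noise from the decoder would show up as a loss that stays closer to zero.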


2. Deploy_SpExNote_RAG_LLM.ipynb: Deployment Pipeline

Purpose: Full inference pipeline deployed as a Gradio web app.

Pipeline:

  1. Silero-VAD: detect speech segments
  2. Thonburian-Whisper: Thai ASR transcription
  3. Qwen Embedding + FAISS: embed lecture PDF documents into a vector DB
  4. Qwen LLM + RAG: generate a context-aware lecture summary from the transcript + PDF documents
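The retrieval and prompting logic of steps 3 and 4 can be sketched without any model downloads. In this toy version a bag-of-words vector stands in for the Qwen embedding model and a sorted list stands in for the FAISS index. Both are substitutions made for illustration, and the function names and prompt wording are hypothetical rather than taken from the notebook. Whitespace tokenization would also not suit Thai text, so the example uses English strings.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy unit-length bag-of-words vector (stand-in for a Qwen embedding)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (stand-in for FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(transcript: str, chunks: List[str], k: int = 2) -> str:
    """Combine the transcript with retrieved PDF chunks into an LLM prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(transcript, chunks, k))
    return (
        "Summarize the lecture transcript using the reference material.\n"
        f"Reference material:\n{context}\n"
        f"Transcript:\n{transcript}\n"
        "Summary:"
    )


if __name__ == "__main__":
    pdf_chunks = [
        "linear attenuation coefficient of photons in tissue",
        "radioactive decay half life and activity",
        "course grading policy and exam dates",
    ]
    transcript = "today we discuss photons and the attenuation coefficient"
    print(build_prompt(transcript, pdf_chunks, k=1))
```

The resulting prompt string is what would be sent to the LLM. Grounding the summary in retrieved document chunks is what lets the model correct or enrich terms that the ASR stage transcribed imperfectly.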

Dataset Sources

Audio data for SpEx+ fine-tuning was sourced from Thai educational YouTube content. See data/youtube_urls.txt for the full list.

| Category | Description |
|---|---|
| Single-speaker | Thai lecture recordings (used as target/clean speaker) |
| Multi-speaker | Thai discussion/panel videos (used as interference) |
| Noise sample | Classroom ambient noise |

These videos are used solely for academic, non-commercial research purposes. No audio files are redistributed in this repository.


Dependencies

# Core
pip install torch torchaudio
pip install yt-dlp
pip install gradio
pip install faiss-cpu

# ASR
# Thonburian-Whisper: https://github.com/biodatlab/thonburian-whisper

# VAD
# Silero-VAD: loaded via torch.hub from snakers4/silero-vad

# Speech Enhancement (dataset creation only)
# ClearerVoice-Studio: https://github.com/modelscope/ClearerVoice-Studio

# LLM / Embedding
# Qwen: https://github.com/QwenLM/Qwen

Acknowledgements & Third-Party Licenses

This project builds upon the following open-source works:

| Component | Role | Repository | License |
|---|---|---|---|
| SpEx+ | Speaker extraction model (base) | gemengtju/SpEx_Plus | MIT |
| Silero-VAD | Voice activity detection | snakers4/silero-vad | CC BY-NC-SA 4.0 (models) / MIT (code) |
| Thonburian-Whisper | Thai ASR | biodatlab/thonburian-whisper | MIT |
| Qwen | LLM + Embedding for RAG | QwenLM/Qwen | Apache 2.0 |
| ClearerVoice-Studio | Speech enhancement (dataset only) | modelscope/ClearerVoice-Studio | Apache 2.0 |

Important License Notes

  • Silero-VAD models are licensed under CC BY-NC-SA 4.0 (non-commercial use only). This project is academic, non-commercial research.
  • SpEx+ code is MIT licensed (Copyright © 2021 Meng Ge). Our fine-tuned weights are derived from this codebase.
  • Thonburian-Whisper is MIT licensed (Copyright © 2022 OpenAI, fine-tuned by Biodatlab).
  • Qwen and ClearerVoice-Studio are Apache 2.0 licensed, which permits modification and redistribution.

Results

SpExNote successfully integrates IoT hardware with a multi-stage AI pipeline:

  • SpEx+, fine-tuned on 15.28 hours of Thai audio, extracts the teacher's voice
  • Despite residual noise from the unfrozen decoder, the extracted audio unexpectedly improves Thonburian-Whisper's ASR performance
  • RAG keeps summaries accurate and grounded in the lecture documents, addressing the need for lecture assistance reported by 83.63% of surveyed students

Future work: Extended fine-tuning of TSE, ASR, and LLM components; expansion to other courses.


License

This project's own code is released under the MIT License; see LICENSE.

Note that third-party model weights and components retain their original licenses as listed above. Users must comply with each component's license terms, particularly the non-commercial restriction of Silero-VAD models.


Citation

If you use this work, please cite:

Tamprasert, N., & Panchalal, P. (2025). SpExNote: Intelligent Chip-based Lecture Recording
System with SpEx+ Speaker Extraction and RAG-Enhanced Automated Summarization for Medical
Physics Education. KMITL.
