Medical ASR Training Data Preparation Pipeline
by IntelMedica LLC -- Physician-Led Open-Source Medical AI
A complete pipeline for generating high-quality medical speech training data from public medical terminology APIs. Designed for fine-tuning Whisper-based ASR models on clinical vocabulary (drugs, diagnoses, procedures, lab tests, nursing terminology).
This pipeline was used to produce the IntelMedica medical speech datasets on HuggingFace -- 460K+ sentences across three audience-specific datasets.
This is a data preparation tool, not a clinical decision support system.
```
Medical APIs         Term Collection        Sentence Generation      Quality Filtering
(UMLS, RxNorm,   ->  (collect_*)        ->  (generate_*)         ->  (clean_terms_*)
 FDA, LOINC...)
                                                 |                        |
                                                 v                        v
TTS Synthesis        Audience Split         Train/Val/Test Split     HuggingFace Upload
(synthesize_*)   <-  (split_audience_*) <-  (split_train_*)      ->  (upload_*)
```
| Stage | Scripts | Description |
|---|---|---|
| 1. Term Collection | collect_terms_*.py | Pull medical terms from 8 public APIs (UMLS/SNOMED CT, RxNorm, DailyMed, MeSH, NCI Thesaurus, HCPCS, LOINC, openFDA) |
| 2. Term Cleaning | clean_terms_v3.py | Apply 12 quality rules to remove chemical formulas, NCI experimental codes, molecular biology terms, abbreviation soup, etc. |
| 3. Sentence Generation | generate_sentences_v2.py, generate_sentences_v3.py | Generate clinical sentences from templates + optional LLM (Qwen 0.5B) |
| 4. v2/v3 Merge | merge_v2_into_v3.py | Merge drug-heavy v2 sentences into the broader v3 set with deduplication |
| 5. Audience Split | split_audience_v3.py | Route sentences to nursing, physician, or general medical files |
| 6. Train/Val/Test | split_train_val_test.py | 70/15/15 stratified split per audience |
| 7. TTS Synthesis | synthesize_audio_v2.py, synthesize_audio_v3.py | GPU-accelerated Kokoro TTS, 19 voices, 3 accents (US/UK/Indian), 16kHz WAV output |
| 8. HF Upload | upload_dataset_v2.py, upload_dataset_16khz.py | Upload as Parquet shards to HuggingFace Hub |
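To make stage 6 concrete, here is a minimal sketch of a 70/15/15 stratified split. It is not the code in split_train_val_test.py. The function name `stratified_split`, the choice of the `category` field as the stratification key, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, key="category", percents=(70, 15, 15), seed=42):
    """Split records into train/val/test while keeping the mix of `key` values.

    Hypothetical helper: the stratification key and seed are assumptions,
    not values taken from the pipeline's scripts.
    """
    rng = random.Random(seed)

    # Group records by the stratification key.
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(key, "unknown")].append(rec)

    train, val, test = [], [], []
    for name in sorted(groups):
        items = groups[name]
        rng.shuffle(items)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(items) * percents[0] // 100
        n_val = len(items) * percents[1] // 100
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        # Whatever is left over goes to the test set.
        test.extend(items[n_train + n_val:])
    return train, val, test
```

Splitting each group separately means a rare category still appears in all three splits whenever it has enough examples, which a single global shuffle would not guarantee.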
| Dataset | Sentences | Audience | Link |
|---|---|---|---|
| nursing-sentences-1 | ~40K | Nurses (SBAR, vitals, med admin, wound care) | HuggingFace |
| physician-sentences-1 | ~108K | Physicians (SOAP, HPI, ROS, discharge) | HuggingFace |
| general-medical-sentences-1 | ~313K | General medical (drugs, labs, diagnoses, procedures) | HuggingFace |
See also: jfmdai/medical-speech-data-collections -- a curated directory of all publicly available medical speech datasets.
```bash
# Clone
git clone https://github.com/intelmedica/med-speech-data-prep.git
cd med-speech-data-prep
# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Configure API keys
cp .env.example .env.local
# Edit .env.local with your API keys (see docs/api-access-guide.md)
source .env.local
```

Each collector targets a different medical terminology source. Run them in any order.
```bash
# UMLS/SNOMED CT (requires UMLS API key)
python3 scripts/collect_terms_snomed.py
# RxNorm, openFDA, ICD-10, LOINC, abbreviations (no auth for most)
python3 scripts/collect_terms_v2.py
# DailyMed drug labels (no auth)
python3 scripts/collect_terms_dailymed.py
# MeSH via SPARQL (no auth)
python3 scripts/collect_terms_mesh.py
# NCI Thesaurus (no auth)
python3 scripts/collect_terms_nci.py
# HCPCS procedure codes (no auth)
python3 scripts/collect_terms_hcpcs.py
```

All collectors write JSONL output to prep/datasets/terms/<source>/terms.jsonl. Each line is:

```json
{"term": "atrial fibrillation", "category": "condition", "source": "snomed_ct", "metadata": {...}}
```

Next, clean the collected terms:

```bash
python3 scripts/clean_terms_v3.py
```

This applies the 12 cleaning rules (see docs/data-quality.md) and writes terms_clean.jsonl alongside each source.
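For inspecting collector output before or after cleaning, a small reader like the following may help. It is a sketch that assumes only the one-record-per-line layout with `term`, `category`, and `source` fields shown above. The helper names `load_terms` and `category_counts` are hypothetical and not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path

# Fields taken from the example record; "metadata" is treated as optional here.
REQUIRED_FIELDS = ("term", "category", "source")


def load_terms(path):
    """Yield term records from a JSONL file, skipping blank or malformed lines."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore lines that are not valid JSON
            if isinstance(rec, dict) and all(f in rec for f in REQUIRED_FIELDS):
                yield rec


def category_counts(path):
    """Count how many valid records fall into each category."""
    return Counter(rec["category"] for rec in load_terms(path))
```

Comparing `category_counts` for terms.jsonl and terms_clean.jsonl gives a quick view of how much each category shrank during cleaning.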
```bash
# v2: Template-based generation (fast, drug-heavy)
python3 scripts/generate_sentences_v2.py
# v3: Broader generation with optional LLM support
CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_sentences_v3.py
```

```bash
# Merge v2 drug sentences into v3
python3 scripts/merge_v2_into_v3.py
# Split by audience (nursing / physician / general)
python3 scripts/split_audience_v3.py
# 70/15/15 train/val/test split
python3 scripts/split_train_val_test.py
```

TTS synthesis requires a GPU with CUDA. It uses Kokoro TTS (82M params, fits in 4GB VRAM).
```bash
# Install TTS dependencies
pip install kokoro soundfile scipy
# Synthesize one audience at a time
python3 scripts/synthesize_audio_v3.py --split nursing --resume
python3 scripts/synthesize_audio_v3.py --split physician --resume
python3 scripts/synthesize_audio_v3.py --split general --resume
# Check progress
bash scripts/check_synthesis_progress.sh
```

```bash
# Requires HF_TOKEN with write access
python3 scripts/upload_dataset_v2.py
```

The cleaning pipeline removes ~15-20% of raw terms using 12 rules:
| Rule | What It Removes | Example |
|---|---|---|
| Chemical formulas | IUPAC names, nested parens | ((1R)-1-...boronic acid) |
| NCI experimental codes | Drug candidate codes | fac00109 |
| Molecular biology | Gene therapy, CRISPR terms | allogeneic anti-IL13RA2 |
| HCPCS abbreviation soup | Unpronounceable codes | insrt atril pm w/l vent lead |
| DailyMed fragments | Section refs, short snippets | Warning: see full prescribing |
| MeSH non-medical | Info science, humanities | deep learning, data mining |
| Product codes | NDC codes, identifiers | 12345678 |
| Too short | Under 3 chars, pure numbers | AB, 42 |
| Veterinary | Animal-specific terms | canine distemper |
| LOINC surveys | Survey instruments (not labs) | CMS assessment tool |
| Deduplication | Case-insensitive dedup | -- |
| TNM versions | Staging metadata | TNM finding v8 |
False-positive protections: heparin (kept despite "bovine" triggers), COVID vaccines (kept despite chemical formula patterns), clinical lab panels (kept despite "panel" keyword).
See docs/data-quality.md for the full rule reference with before/after examples.
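As an illustration of how rules and false-positive protections can interact, the sketch below implements three of the rules from the table (too short, veterinary, case-insensitive deduplication) plus an allowlist. It is not the logic in clean_terms_v3.py. The names `PROTECTED`, `VETERINARY`, and `clean_terms`, and the specific keyword lists, are assumptions chosen to match the examples above.

```python
import re

# Hypothetical allowlist: terms containing these substrings bypass the filters.
PROTECTED = {"heparin"}

# Hypothetical keyword list for the veterinary rule.
VETERINARY = re.compile(r"\b(canine|feline|bovine|equine)\b", re.IGNORECASE)


def is_too_short(term):
    """Rule: under 3 characters, or made up only of digits."""
    return len(term) < 3 or term.isdigit()


def clean_terms(terms):
    """Return terms that pass the filters, deduplicated case-insensitively."""
    seen = set()
    kept = []
    for term in terms:
        text = term.strip()
        key = text.casefold()
        if key in seen:
            continue  # case-insensitive duplicate of a term already kept
        protected = any(p in key for p in PROTECTED)
        if not protected and (is_too_short(text) or VETERINARY.search(text)):
            continue
        seen.add(key)
        kept.append(text)
    return kept
```

Checking the allowlist before the filters is what lets a term such as "bovine heparin" survive even though "bovine" would otherwise trigger the veterinary rule.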
| API | Auth Required | Free | Guide |
|---|---|---|---|
| UMLS / SNOMED CT | API key | Yes (with license) | docs/api-access-guide.md |
| RxNorm | None | Yes | Direct API calls |
| DailyMed | None | Yes | Direct API calls |
| openFDA | Optional key | Yes | Higher rate limits with key |
| MeSH | None | Yes | SPARQL endpoint |
| NCI Thesaurus | None | Yes | REST API |
| HCPCS | None | Yes | NLM Clinical Tables API |
| LOINC | Download | Yes (with license) | docs/api-access-guide.md |
| HuggingFace | Write token | Yes | For dataset upload only |
See docs/api-access-guide.md for step-by-step instructions on getting each API key.
- Fork the repo
- Create a feature branch: `git checkout -b feature/my-change`
- Make your changes
- Run the security check, which parses every script in scripts/ to confirm it is valid Python: `python3 -c "import ast; [ast.parse(open(f).read()) for f in __import__('glob').glob('scripts/*.py')]"`
- Submit a PR to `next`
Branch strategy: feature/* -> next (integration) -> main (stable releases).
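The one-line check in the contributing steps stops at the first failure and prints a raw traceback. A more readable variant, sketched here under the assumption that the goal is simply to confirm every script parses, reports each failing file. The function names `find_syntax_errors` and `main` are hypothetical.

```python
import ast
import glob


def find_syntax_errors(pattern="scripts/*.py"):
    """Parse every file matching `pattern`; return (path, message) for failures."""
    failures = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            source = fh.read()
        try:
            ast.parse(source, filename=path)
        except SyntaxError as exc:
            failures.append((path, f"line {exc.lineno}: {exc.msg}"))
    return failures


def main():
    """Print each failure and return a shell-style exit code (0 means clean)."""
    problems = find_syntax_errors()
    for path, message in problems:
        print(f"{path}: {message}")
    return 1 if problems else 0
```

Like the one-liner, this only verifies that the files parse; it does not execute them or inspect what they do.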
This project is licensed under CC BY-NC 4.0.
You are free to share and adapt the material for non-commercial purposes, with attribution.
Junaid Farooq, MD, IntelMedica LLC -- Physician-Led Open-Source Medical AI
- jfmdai/medical-speech-data-collections -- Curated directory of all public medical speech datasets
- IntelMedica on HuggingFace -- Our medical speech datasets
- UMLS Terminology Services -- NLM's unified medical terminology API
- Kokoro TTS -- The TTS model used for synthesis