med-speech-data-prep

Medical ASR Training Data Preparation Pipeline

by IntelMedica LLC -- Physician-Led Open-Source Medical AI

What This Is

A complete pipeline for generating high-quality medical speech training data from public medical terminology APIs. Designed for fine-tuning Whisper-based ASR models on clinical vocabulary (drugs, diagnoses, procedures, lab tests, nursing terminology).

This pipeline was used to produce the IntelMedica medical speech datasets on HuggingFace -- 460K+ sentences across three audience-specific datasets.

This is a data preparation tool, not a clinical decision support system.

Pipeline Overview

Medical APIs        Term Collection      Quality Filtering     Sentence Generation
(UMLS, RxNorm,  ->  (collect_*)      ->  (clean_terms_*)   ->  (generate_*)
 FDA, LOINC...)                                                      |
                                                                     v
HuggingFace Upload  TTS Synthesis        Train/Val/Test Split  Audience Split
(upload_*)      <-  (synthesize_*)   <-  (split_train_*)   <-  (split_audience_*)

Pipeline Stages

Stage	Scripts	Description
1. Term Collection	`collect_terms_*.py`	Pull medical terms from 8 public APIs (UMLS/SNOMED CT, RxNorm, DailyMed, MeSH, NCI Thesaurus, HCPCS, LOINC, openFDA)
2. Term Cleaning	`clean_terms_v3.py`	Apply 12 quality rules to remove chemical formulas, NCI experimental codes, molecular biology terms, abbreviation soup, etc.
3. Sentence Generation	`generate_sentences_v2.py`, `generate_sentences_v3.py`	Generate clinical sentences from templates + optional LLM (Qwen 0.5B)
4. v2/v3 Merge	`merge_v2_into_v3.py`	Merge drug-heavy v2 sentences into the broader v3 set with deduplication
5. Audience Split	`split_audience_v3.py`	Route sentences to nursing, physician, or general medical files
6. Train/Val/Test	`split_train_val_test.py`	70/15/15 stratified split per audience
7. TTS Synthesis	`synthesize_audio_v2.py`, `synthesize_audio_v3.py`	GPU-accelerated Kokoro TTS, 19 voices, 3 accents (US/UK/Indian), 16 kHz WAV output
8. HF Upload	`upload_dataset_v2.py`, `upload_dataset_16khz.py`	Upload as Parquet shards to HuggingFace Hub

Output Datasets

Dataset	Sentences	Audience	Link
nursing-sentences-1	~40K	Nurses (SBAR, vitals, med admin, wound care)	HuggingFace
physician-sentences-1	~108K	Physicians (SOAP, HPI, ROS, discharge)	HuggingFace
general-medical-sentences-1	~313K	General medical (drugs, labs, diagnoses, procedures)	HuggingFace

See also: jfmdai/medical-speech-data-collections -- a curated directory of all publicly available medical speech datasets.

Quick Start

# Clone
git clone https://github.com/intelmedica/med-speech-data-prep.git
cd med-speech-data-prep

# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure API keys
cp .env.example .env.local
# Edit .env.local with your API keys (see docs/api-access-guide.md)
# Export the variables while sourcing so the Python scripts can read them
set -a; source .env.local; set +a
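After loading .env.local, a quick check from Python can confirm that the keys are actually visible to the scripts (a child process only inherits variables the shell has exported). The sketch below is illustrative: HF_TOKEN is mentioned later in this README, but UMLS_API_KEY is an assumed name, so check .env.example for the real variable names.

```python
import os

# UMLS_API_KEY is a hypothetical name; HF_TOKEN is the upload token named in this README.
REQUIRED_KEYS = ["UMLS_API_KEY", "HF_TOKEN"]


def missing_keys(required, environ=None):
    """Return the names in `required` that are unset or blank in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All required keys are set.")
```

If a key is reported missing even though it is in .env.local, the file was probably sourced without exporting its variables.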

Running the Pipeline

Stage 1: Collect Terms

Each collector targets a different medical terminology source. Run them in any order.

# UMLS/SNOMED CT (requires UMLS API key)
python3 scripts/collect_terms_snomed.py

# RxNorm, openFDA, ICD-10, LOINC, abbreviations (no auth for most)
python3 scripts/collect_terms_v2.py

# DailyMed drug labels (no auth)
python3 scripts/collect_terms_dailymed.py

# MeSH via SPARQL (no auth)
python3 scripts/collect_terms_mesh.py

# NCI Thesaurus (no auth)
python3 scripts/collect_terms_nci.py

# HCPCS procedure codes (no auth)
python3 scripts/collect_terms_hcpcs.py

All collectors write JSONL output to prep/datasets/terms/<source>/terms.jsonl. Each line is:

{"term": "atrial fibrillation", "category": "condition", "source": "snomed_ct", "metadata": {...}}
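To inspect a collector's output before cleaning, the file can be read line by line. This is a minimal sketch, assuming only the record shape shown above (term, category, source, metadata); the helper names load_terms and count_by_category are illustrative and not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path

# Fields every record is expected to carry, based on the example line above.
REQUIRED_FIELDS = ("term", "category", "source")


def load_terms(path):
    """Yield valid records from a JSONL file, skipping blank or malformed lines."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"{path}:{line_no}: invalid JSON, skipped")
                continue
            if not isinstance(record, dict) or any(not record.get(k) for k in REQUIRED_FIELDS):
                print(f"{path}:{line_no}: missing required field, skipped")
                continue
            yield record


def count_by_category(path):
    """Count valid records per category."""
    return Counter(record["category"] for record in load_terms(path))
```

For example, count_by_category("prep/datasets/terms/snomed_ct/terms.jsonl") would summarize one source, assuming the directory is named after the source value.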

Stage 2: Clean Terms

python3 scripts/clean_terms_v3.py

Applies the 12 cleaning rules (see docs/data-quality.md) and writes terms_clean.jsonl next to each source's terms.jsonl.

Stage 3: Generate Sentences

# v2: Template-based generation (fast, drug-heavy)
python3 scripts/generate_sentences_v2.py

# v3: Broader generation with optional LLM support
CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_sentences_v3.py
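The idea behind template-based generation can be sketched in a few lines. This is not the code in generate_sentences_v2.py: the templates, the TEMPLATES mapping and the generate_sentences function are hypothetical, and only the term record fields come from this README.

```python
import random

# Hypothetical templates keyed by term category; the real ones live in the scripts.
TEMPLATES = {
    "drug": [
        "The patient was started on {term} this morning.",
        "Please hold the next dose of {term}.",
    ],
    "condition": [
        "The patient has a history of {term}.",
        "Findings are consistent with {term}.",
    ],
}


def generate_sentences(terms, per_term=1, seed=0):
    """Fill up to `per_term` distinct templates for each term record."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    sentences = []
    for record in terms:
        templates = TEMPLATES.get(record["category"])
        if not templates:
            continue  # no template for this category
        for template in rng.sample(templates, k=min(per_term, len(templates))):
            sentences.append({
                "text": template.format(term=record["term"]),
                "term": record["term"],
                "category": record["category"],
            })
    return sentences


print(generate_sentences([{"term": "atrial fibrillation", "category": "condition"}]))
```

Sampling without replacement avoids emitting the same sentence twice for one term.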

Stage 4: Merge and Split

# Merge v2 drug sentences into v3
python3 scripts/merge_v2_into_v3.py

# Split by audience (nursing / physician / general)
python3 scripts/split_audience_v3.py

# 70/15/15 train/val/test split
python3 scripts/split_train_val_test.py
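A 70/15/15 stratified split can be sketched as follows. This README does not say which field the split is stratified on, so the sketch assumes the term category; stratified_split and its parameters are illustrative names, not the script's interface.

```python
import random
from collections import defaultdict


def stratified_split(records, key="category", train_pct=70, val_pct=15, seed=42):
    """Split records into train/val/test so each `key` group keeps the same ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    splits = {"train": [], "val": [], "test": []}
    for name in sorted(groups):  # sorted for a deterministic result
        items = groups[name]
        rng.shuffle(items)
        # Integer arithmetic avoids float rounding; the remainder goes to test.
        n_train = len(items) * train_pct // 100
        n_val = len(items) * val_pct // 100
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

Running this once per audience file would give the per-audience split described in the stage table. Very small groups end up entirely in the test split, which may or may not be what you want.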

Stage 5: TTS Synthesis

Requires a CUDA-capable GPU. Uses Kokoro TTS (82M parameters, fits in 4 GB of VRAM).

# Install TTS dependencies
pip install kokoro soundfile scipy

# Synthesize one audience at a time
python3 scripts/synthesize_audio_v3.py --split nursing --resume
python3 scripts/synthesize_audio_v3.py --split physician --resume
python3 scripts/synthesize_audio_v3.py --split general --resume

# Check progress
bash scripts/check_synthesis_progress.sh
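One plausible way for a --resume flag to work is to skip every sentence whose audio file already exists. The sketch below shows that idea only; the id field, the <id>.wav naming scheme and the pending_sentences helper are assumptions, not the behavior of synthesize_audio_v3.py.

```python
from pathlib import Path


def pending_sentences(sentences, out_dir):
    """Return the sentences that still need audio, judged by which WAV files exist."""
    out_dir = Path(out_dir)
    pending = []
    for item in sentences:
        wav_path = out_dir / f"{item['id']}.wav"  # hypothetical naming scheme
        # A missing or zero-byte file is treated as not yet synthesized.
        if not wav_path.exists() or wav_path.stat().st_size == 0:
            pending.append(item)
    return pending
```

Checking for zero-byte files means a run interrupted while opening an output file does not leave a silent gap in the dataset.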

Stage 6: Upload to HuggingFace

# Requires HF_TOKEN with write access
python3 scripts/upload_dataset_v2.py

Data Quality

The cleaning pipeline removes ~15-20% of raw terms using 12 rules:

Rule	What It Removes	Example
Chemical formulas	IUPAC names, nested parens	`((1R)-1-...boronic acid)`
NCI experimental codes	Drug candidate codes	`fac00109`
Molecular biology	Gene therapy, CRISPR terms	`allogeneic anti-IL13RA2`
HCPCS abbreviation soup	Unpronounceable codes	`insrt atril pm w/l vent lead`
DailyMed fragments	Section refs, short snippets	`Warning: see full prescribing`
MeSH non-medical	Info science, humanities	`deep learning`, `data mining`
Product codes	NDC codes, identifiers	`12345678`
Too short	Under 3 chars, pure numbers	`AB`, `42`
Veterinary	Animal-specific terms	`canine distemper`
LOINC surveys	Survey instruments (not labs)	`CMS assessment tool`
Deduplication	Case-insensitive dedup	--
TNM versions	Staging metadata	`TNM finding v8`

False-positive protections: heparin (kept despite "bovine" triggers), COVID vaccines (kept despite chemical formula patterns), clinical lab panels (kept despite "panel" keyword).

See docs/data-quality.md for the full rule reference with before/after examples.
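To make the rule-plus-protection pattern concrete, here is a minimal sketch of three of the rules (too short, veterinary, deduplication) with a heparin allowlist. The regular expressions and the clean_terms function are hypothetical simplifications; the actual rules are in clean_terms_v3.py and docs/data-quality.md.

```python
import re

# Hypothetical patterns for illustration only.
VETERINARY = re.compile(r"\b(canine|feline|bovine|equine)\b", re.IGNORECASE)
ALLOWLIST = re.compile(r"\bheparin\b", re.IGNORECASE)


def is_too_short(term):
    """True for terms under 3 characters or made of digits only."""
    stripped = term.strip()
    return len(stripped) < 3 or stripped.isdigit()


def clean_terms(terms):
    """Apply the three sketch rules and return the surviving terms in order."""
    seen = set()
    kept = []
    for term in terms:
        if is_too_short(term):
            continue
        # The allowlist is checked first so protected terms survive the veterinary rule.
        if VETERINARY.search(term) and not ALLOWLIST.search(term):
            continue
        key = term.strip().casefold()  # case-insensitive dedup key
        if key in seen:
            continue
        seen.add(key)
        kept.append(term.strip())
    return kept


print(clean_terms(["AB", "42", "canine distemper", "heparin sodium (bovine)",
                   "Atrial fibrillation", "atrial fibrillation"]))
```

Keeping the allowlist separate from the removal patterns makes each false-positive protection easy to audit on its own.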

API Access

API	Auth Required	Free	Guide / Notes
UMLS / SNOMED CT	API key	Yes (with license)	docs/api-access-guide.md
RxNorm	None	Yes	Direct API calls
DailyMed	None	Yes	Direct API calls
openFDA	Optional key	Yes	Higher rate limits with key
MeSH	None	Yes	SPARQL endpoint
NCI Thesaurus	None	Yes	REST API
HCPCS	None	Yes	NLM Clinical Tables API
LOINC	Download	Yes (with license)	docs/api-access-guide.md
HuggingFace	Write token	Yes	For dataset upload only

See docs/api-access-guide.md for step-by-step instructions on getting each API key.

Contributing

1. Fork the repo
2. Create a feature branch: git checkout -b feature/my-change
3. Make your changes
4. Run the syntax check (it parses every script without executing it, so it catches syntax errors only): python3 -c "import ast; [ast.parse(open(f).read()) for f in __import__('glob').glob('scripts/*.py')]"
5. Submit a PR to next

Branch strategy: feature/* -> next (integration) -> main (stable releases).
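The one-line check in the contributing steps stops at the first broken file and does not say which one it was. A slightly longer version, sketched below, reports every failing script. check_syntax is an illustrative helper, not a file in the repository.

```python
import ast
import glob


def check_syntax(pattern="scripts/*.py"):
    """Parse every matching file and return a list of (path, error message) pairs."""
    failures = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            source = fh.read()
        try:
            ast.parse(source, filename=path)
        except SyntaxError as exc:
            failures.append((path, f"line {exc.lineno}: {exc.msg}"))
    return failures


for path, message in check_syntax():
    print(f"{path}: {message}")
```

Like the one-liner, this only parses the files. It does not import or run them, so it will not catch missing dependencies or runtime errors.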

License

This project is licensed under CC BY-NC 4.0.

You are free to share and adapt the material for non-commercial purposes, with attribution.

Author

Junaid Farooq, MD
IntelMedica LLC -- Physician-Led Open-Source Medical AI

Related Resources

jfmdai/medical-speech-data-collections -- Curated directory of all public medical speech datasets
IntelMedica on HuggingFace -- Our medical speech datasets
UMLS Terminology Services -- NLM's unified medical terminology API
Kokoro TTS -- The TTS model used for synthesis
