med-speech-data-prep

Medical ASR Training Data Preparation Pipeline

by IntelMedica LLC -- Physician-Led Open-Source Medical AI


What This Is

A complete pipeline for generating high-quality medical speech training data from public medical terminology APIs. Designed for fine-tuning Whisper-based ASR models on clinical vocabulary (drugs, diagnoses, procedures, lab tests, nursing terminology).

This pipeline was used to produce the IntelMedica medical speech datasets on HuggingFace -- 460K+ sentences across three audience-specific datasets.

This is a data preparation tool, not a clinical decision support system.


Pipeline Overview

Medical APIs       Term Collection      Term Cleaning         Sentence Generation
(UMLS, RxNorm,  ->  (collect_*)      ->  (clean_terms_*)   ->  (generate_*)
 FDA, LOINC...)
                                                                     |
                                                                     v

HuggingFace Upload   TTS Synthesis       Train/Val/Test Split   Merge + Audience Split
(upload_*)        <-  (synthesize_*)  <-  (split_train_*)    <-  (merge_*, split_audience_*)

Pipeline Stages

| Stage | Scripts | Description |
| --- | --- | --- |
| 1. Term Collection | collect_terms_*.py | Pull medical terms from 8 public APIs (UMLS/SNOMED CT, RxNorm, DailyMed, MeSH, NCI Thesaurus, HCPCS, LOINC, openFDA) |
| 2. Term Cleaning | clean_terms_v3.py | Apply 12 quality rules to remove chemical formulas, NCI experimental codes, molecular biology terms, abbreviation soup, etc. |
| 3. Sentence Generation | generate_sentences_v2.py, generate_sentences_v3.py | Generate clinical sentences from templates + optional LLM (Qwen 0.5B) |
| 4. v2/v3 Merge | merge_v2_into_v3.py | Merge drug-heavy v2 sentences into the broader v3 set with deduplication |
| 5. Audience Split | split_audience_v3.py | Route sentences to nursing, physician, or general medical files |
| 6. Train/Val/Test | split_train_val_test.py | 70/15/15 stratified split per audience |
| 7. TTS Synthesis | synthesize_audio_v2.py, synthesize_audio_v3.py | GPU-accelerated Kokoro TTS, 19 voices, 3 accents (US/UK/Indian), 16kHz WAV output |
| 8. HF Upload | upload_dataset_v2.py, upload_dataset_16khz.py | Upload as Parquet shards to HuggingFace Hub |

Output Datasets

| Dataset | Sentences | Audience | Link |
| --- | --- | --- | --- |
| nursing-sentences-1 | ~40K | Nurses (SBAR, vitals, med admin, wound care) | HuggingFace |
| physician-sentences-1 | ~108K | Physicians (SOAP, HPI, ROS, discharge) | HuggingFace |
| general-medical-sentences-1 | ~313K | General medical (drugs, labs, diagnoses, procedures) | HuggingFace |

See also: jfmdai/medical-speech-data-collections -- a curated directory of all publicly available medical speech datasets.


Quick Start

# Clone
git clone https://github.com/intelmedica/med-speech-data-prep.git
cd med-speech-data-prep

# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

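# Configure API keys
cp .env.example .env.local
# Edit .env.local with your API keys (see docs/api-access-guide.md)
# Export every variable in the file so the Python scripts can read it
# (set -a works whether or not the file uses "export")
set -a; source .env.local; set +a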

Running the Pipeline

Stage 1: Collect Terms

Each collector targets a different medical terminology source. Run them in any order.

# UMLS/SNOMED CT (requires UMLS API key)
python3 scripts/collect_terms_snomed.py

# RxNorm, openFDA, ICD-10, LOINC, abbreviations (no auth for most)
python3 scripts/collect_terms_v2.py

# DailyMed drug labels (no auth)
python3 scripts/collect_terms_dailymed.py

# MeSH via SPARQL (no auth)
python3 scripts/collect_terms_mesh.py

# NCI Thesaurus (no auth)
python3 scripts/collect_terms_nci.py

# HCPCS procedure codes (no auth)
python3 scripts/collect_terms_hcpcs.py

All collectors write JSONL output to prep/datasets/terms/<source>/terms.jsonl. Each line is:

{"term": "atrial fibrillation", "category": "condition", "source": "snomed_ct", "metadata": {...}}
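As a rough illustration of the record format above, the following sketch writes and reads a terms.jsonl file. The helper names (`write_terms`, `read_terms`) and the required-key check are assumptions made for this example, not functions from the repository's scripts.

```python
import json
from pathlib import Path

# Keys shown in the example record above; treating all four as required is an assumption.
REQUIRED_KEYS = {"term", "category", "source", "metadata"}


def write_terms(path, records):
    """Write one JSON object per line (JSONL), creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_terms(path):
    """Yield records that carry every required key, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if REQUIRED_KEYS.issubset(rec):
                yield rec
```

Reading line by line keeps memory use flat even when a source produces a very large term list.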

Stage 2: Clean Terms

python3 scripts/clean_terms_v3.py

Applies 12 cleaning rules (see docs/data-quality.md). Writes terms_clean.jsonl alongside each source.
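To give a feel for what a cleaning rule looks like, here is a minimal sketch of two of the twelve rules: "too short" (under 3 characters or pure numbers) and case-insensitive deduplication. The function names are hypothetical and the real clean_terms_v3.py may implement these rules differently.

```python
def is_too_short(term):
    """'Too short' rule: under 3 characters, or made up only of digits."""
    stripped = term.strip()
    return len(stripped) < 3 or stripped.isdigit()


def clean_terms(records):
    """Drop too-short terms, then deduplicate case-insensitively (first occurrence wins)."""
    seen = set()
    kept = []
    for rec in records:
        term = rec["term"].strip()
        key = term.casefold()  # case-insensitive comparison key
        if is_too_short(term) or key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

For example, given the terms AB, 42, Aspirin, aspirin, and atrial fibrillation, this sketch keeps only Aspirin and atrial fibrillation.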

Stage 3: Generate Sentences

# v2: Template-based generation (fast, drug-heavy)
python3 scripts/generate_sentences_v2.py

# v3: Broader generation with optional LLM support
CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_sentences_v3.py
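Template-based generation can be pictured roughly as follows. This is a sketch under stated assumptions: the templates, the `TEMPLATES` mapping, and the `generate_sentences` function are invented for illustration and are not taken from generate_sentences_v2.py.

```python
import random

# Hypothetical templates keyed by term category; the real template sets are not shown in this README.
TEMPLATES = {
    "drug": [
        "The patient was started on {term} this morning.",
        "Please hold the {term} until the repeat labs are back.",
    ],
    "condition": [
        "The patient has a history of {term}.",
        "Findings are consistent with {term}.",
    ],
}


def generate_sentences(records, per_term=1, seed=0):
    """Fill up to per_term randomly chosen templates for each term with a known category."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    sentences = []
    for rec in records:
        templates = TEMPLATES.get(rec["category"])
        if not templates:
            continue  # no templates for this category
        for template in rng.sample(templates, k=min(per_term, len(templates))):
            sentences.append({
                "text": template.format(term=rec["term"]),
                "term": rec["term"],
                "category": rec["category"],
            })
    return sentences
```

Keeping the source term and category next to each sentence makes later steps such as audience routing and stratified splitting easier.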

Stages 4-6: Merge and Split

# Merge v2 drug sentences into v3
python3 scripts/merge_v2_into_v3.py

# Split by audience (nursing / physician / general)
python3 scripts/split_audience_v3.py

# 70/15/15 train/val/test split
python3 scripts/split_train_val_test.py
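A 70/15/15 stratified split can be sketched as below. The README does not say which field the split is stratified on; this example assumes the term category, and `stratified_split` is a hypothetical helper, not the code in split_train_val_test.py.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="category", ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test so each value of `key` keeps roughly the same ratios."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    splits = {"train": [], "val": [], "test": []}
    for _, group in sorted(groups.items()):  # sorted for a deterministic order
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits
```

Running this once per audience file gives each audience its own train, validation, and test sets with a similar category mix in all three.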

Stage 7: TTS Synthesis

Requires GPU with CUDA. Uses Kokoro TTS (82M params, fits in 4GB VRAM).

# Install TTS dependencies
pip install kokoro soundfile scipy

# Synthesize one audience at a time
python3 scripts/synthesize_audio_v3.py --split nursing --resume
python3 scripts/synthesize_audio_v3.py --split physician --resume
python3 scripts/synthesize_audio_v3.py --split general --resume

# Check progress
bash scripts/check_synthesis_progress.sh
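The idea behind a resumable synthesis run can be sketched as follows: skip every sentence that already has a non-empty WAV file. The file naming scheme (`<id>.wav`) and the `pending_items` helper are assumptions for illustration; the actual behavior of --resume in synthesize_audio_v3.py is not documented here.

```python
from pathlib import Path


def pending_items(sentences, out_dir):
    """Return the (sentence_id, text) pairs that still need audio.

    A sentence counts as done when <out_dir>/<sentence_id>.wav exists and is non-empty.
    """
    out_dir = Path(out_dir)
    todo = []
    for sent_id, text in sentences:
        wav_path = out_dir / f"{sent_id}.wav"
        if wav_path.exists() and wav_path.stat().st_size > 0:
            continue  # already synthesized in an earlier run
        todo.append((sent_id, text))
    return todo
```

Checking for a non-empty file, rather than mere existence, avoids treating a file left behind by an interrupted write as finished.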

Stage 8: Upload to HuggingFace

# Requires HF_TOKEN with write access
python3 scripts/upload_dataset_v2.py

Data Quality

The cleaning pipeline removes ~15-20% of raw terms using 12 rules:

| Rule | What It Removes | Example |
| --- | --- | --- |
| Chemical formulas | IUPAC names, nested parens | ((1R)-1-...boronic acid) |
| NCI experimental codes | Drug candidate codes | fac00109 |
| Molecular biology | Gene therapy, CRISPR terms | allogeneic anti-IL13RA2 |
| HCPCS abbreviation soup | Unpronounceable codes | insrt atril pm w/l vent lead |
| DailyMed fragments | Section refs, short snippets | Warning: see full prescribing |
| MeSH non-medical | Info science, humanities | deep learning, data mining |
| Product codes | NDC codes, identifiers | 12345678 |
| Too short | Under 3 chars, pure numbers | AB, 42 |
| Veterinary | Animal-specific terms | canine distemper |
| LOINC surveys | Survey instruments (not labs) | CMS assessment tool |
| Deduplication | Case-insensitive dedup | -- |
| TNM versions | Staging metadata | TNM finding v8 |

False-positive protections: heparin (kept despite "bovine" triggers), COVID vaccines (kept despite chemical formula patterns), clinical lab panels (kept despite "panel" keyword).
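One common way to implement this kind of protection is an allowlist that is checked before the removal rule fires. The sketch below applies that idea to the veterinary rule and the heparin case; the trigger words, the allowlist, and the `is_veterinary` function are hypothetical and may not match how clean_terms_v3.py does it.

```python
# Hypothetical trigger words for the veterinary rule.
VETERINARY_TRIGGERS = ("bovine", "canine", "equine", "feline")

# Hypothetical allowlist: terms containing these substrings are never removed by this rule.
PROTECTED_SUBSTRINGS = ("heparin",)


def is_veterinary(term):
    """Return True if the term should be removed as animal-specific."""
    lowered = term.casefold()
    if any(protected in lowered for protected in PROTECTED_SUBSTRINGS):
        return False  # allowlist wins over the trigger words
    return any(trigger in lowered for trigger in VETERINARY_TRIGGERS)
```

With these lists, "canine distemper" is removed while "heparin sodium (bovine)" is kept.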

See docs/data-quality.md for the full rule reference with before/after examples.


API Access

| API | Auth Required | Free | Guide / Notes |
| --- | --- | --- | --- |
| UMLS / SNOMED CT | API key | Yes (with license) | docs/api-access-guide.md |
| RxNorm | None | Yes | Direct API calls |
| DailyMed | None | Yes | Direct API calls |
| openFDA | Optional key | Yes | Higher rate limits with key |
| MeSH | None | Yes | SPARQL endpoint |
| NCI Thesaurus | None | Yes | REST API |
| HCPCS | None | Yes | NLM Clinical Tables API |
| LOINC | Download | Yes (with license) | docs/api-access-guide.md |
| HuggingFace | Write token | Yes | For dataset upload only |

See docs/api-access-guide.md for step-by-step instructions on getting each API key.


Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-change
  3. Make your changes
  4. Run the syntax check (confirms that every script parses): python3 -c "import ast; [ast.parse(open(f).read()) for f in __import__('glob').glob('scripts/*.py')]"
  5. Submit a PR to next

Branch strategy: feature/* -> next (integration) -> main (stable releases).
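The one-liner in step 4 parses every file under scripts/ with Python's ast module and stops at the first file that fails. A more readable variant that reports every failing file might look like this; `find_syntax_errors` is a hypothetical helper, not part of the repository.

```python
import ast
import glob


def find_syntax_errors(pattern="scripts/*.py"):
    """Parse every file matching pattern and return (path, line, message) for each failure."""
    errors = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            source = fh.read()
        try:
            ast.parse(source, filename=path)
        except SyntaxError as exc:
            errors.append((path, exc.lineno, exc.msg))
    return errors


if __name__ == "__main__":
    for path, lineno, msg in find_syntax_errors():
        print(f"{path}:{lineno}: {msg}")
```

Note that parsing only confirms the files are syntactically valid Python; it does not run them or inspect what they do.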


License

This project is licensed under CC BY-NC 4.0.

You are free to share and adapt the material for non-commercial purposes, with attribution.


Author

Junaid Farooq, MD
IntelMedica LLC -- Physician-Led Open-Source Medical AI

