Code accompanying:
Sartipi S, Godfrey DS, Tauțan A-M, Fernandes MP, Ghanta M, Gupta A, Nearing B, Kim J, Struck A, Westover MB, Zafar SF. Automated Extraction of Seizures and Ictal-Interictal Continuum Patterns from EEG Reports to Enable Large-Scale Neurocritical Care Research. Manuscript under review, 2026.
Data + code (citable): bdsp.io/content/s9mc6d6c3dkczko0o4v0/ (DOI 10.60508/j6w0-v279)
A five-stage pipeline for extracting structured seizure and ictal-interictal continuum (IIC) pattern attributes from free-text EEG reports using a large language model:
- Preprocessing — whitespace/casing normalization, removal of non-informative lines and decorative symbols.
- Rule-based pattern filtering — keyword lists identify candidate report segments mentioning seizures, GPD, GRDA, LPD, LRDA.
- Segment-focused prompt generation — assembles focused prompts per candidate segment.
- LLM inference — extracts structured attributes (presence, count, burden, frequency, prevalence) from each segment.
- Post-processing — schema validation, deduplication, and provenance tracking.
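The non-LLM stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the `KEYWORDS` entries, the function names, the prompt wording, and the JSON layout of the model response are all hypothetical. Only the attribute names come from the stage list. Stage 4 (the LLM call) is left out because it depends on your client.

```python
import json
import re

# Hypothetical keyword lists for illustration; the repository keeps its own
# in the list_*Patterns.txt files.
KEYWORDS = {
    "seizure": ["seizure", "status epilepticus"],
    "LPD": ["lateralized periodic discharge", "lpd"],
    "GRDA": ["generalized rhythmic delta activity", "grda"],
}
# Attribute names taken from the stage list above; the JSON layout is assumed.
ATTRIBUTES = ("presence", "count", "burden", "frequency", "prevalence")


def preprocess(report):
    """Stage 1: strip decorative symbol runs, collapse whitespace, lowercase."""
    lines = []
    for raw in report.splitlines():
        line = re.sub(r"[=*_#~-]{3,}", " ", raw)
        line = re.sub(r"\s+", " ", line).strip().lower()
        if re.search(r"[a-z0-9]", line):  # skip lines with no informative text
            lines.append(line)
    return lines


def find_candidates(lines):
    """Stage 2: return (pattern, line_index, text) for each keyword match."""
    hits = []
    for idx, line in enumerate(lines):
        for pattern, words in KEYWORDS.items():
            if any(re.search(rf"\b{re.escape(w)}s?\b", line) for w in words):
                hits.append((pattern, idx, line))
    return hits


def build_prompt(pattern, segment):
    """Stage 3: one focused prompt per candidate segment."""
    keys = ", ".join(ATTRIBUTES)
    return (
        f"Report segment: {segment!r}\n"
        f"Extract attributes of the pattern '{pattern}' as a JSON object "
        f"with exactly these keys: {keys}. Use null when not stated."
    )


def postprocess(raw_responses):
    """Stage 5: validate keys, drop duplicates, keep the source line index.

    raw_responses is a list of (pattern, line_index, json_text) tuples.
    """
    seen, records = set(), []
    for pattern, idx, text in raw_responses:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed model output is skipped, not repaired
        if not isinstance(data, dict) or set(data) != set(ATTRIBUTES):
            continue
        key = (pattern, json.dumps(data, sort_keys=True))
        if key not in seen:
            seen.add(key)
            records.append({"pattern": pattern, "source_line": idx, **data})
    return records
```

Note that the keyword filter in this sketch deliberately keeps negated mentions such as "no seizures were recorded": it only selects candidate segments, and deciding presence is left to the LLM stage.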
code/
├── code_eeg_annotation.ipynb main pipeline notebook
├── help_functions.py shared helpers (text normalization, etc.)
├── list_szPatterns0x.txt seizure-keyword lists used in rule-based filtering
├── list_szPatterns1.txt
├── list_szPatternsR.txt
├── list_gpdPatterns.txt IIC-pattern keyword lists
├── list_grdaPatterns.txt
├── list_lpdPatterns.txt
└── list_lrdaPatterns.txt
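The keyword lists can be loaded with a few lines of Python. The sketch below assumes each `list_*.txt` file holds one keyword or phrase per line; the file format is not documented here, so check the files before relying on this. The function names are hypothetical and are not taken from `help_functions.py`.

```python
from pathlib import Path


def load_keyword_list(path):
    """Read one keyword or phrase per line (assumed format), skipping blanks."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def load_all_lists(code_dir):
    """Map each list_*Patterns*.txt file stem to its list of keywords."""
    files = sorted(Path(code_dir).glob("list_*Patterns*.txt"))
    return {p.stem: load_keyword_list(p) for p in files}
```

Calling `load_all_lists("code")` from the repository root would then return a dictionary keyed by names such as `list_lpdPatterns`.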
The extracted findings used in the paper (deidentified attributes extracted from EEG reports at three institutions: BCH, BIDMC, and MGB) are hosted on BDSP's credentialed-access S3 bucket. To request access:
- Register at bdsp.io and apply for credentialed access (you'll need to sign the BDSP Credentialed Health Data Use Agreement).
- Once approved, sync the data:
aws s3 sync s3://bdsp-opendata-credentialed/eeg-report-extraction/data/ ./data/
Or use the BDSP S3 access-point alias if your team has one provisioned:
aws s3 sync s3://bdsp-credentialed-pr-fymwc8rqh9fzdisq7om7eiq9wutqhuse1b-s3alias/eeg-report-extraction/data/ ./data/
# Python 3.10+
python -m venv .venv && source .venv/bin/activate
pip install pandas numpy jupyterlab openai # or your LLM client of choice
jupyter lab code/code_eeg_annotation.ipynb
The notebook expects EEG-report text in a CSV or TSV file with one report per row. Configure your LLM credentials and model name in the notebook; the original work used GPT-4-class models, and smaller models will degrade extraction quality on rare pattern types.
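As a rough sketch of the expected input shape, the function below reads one report per row from a CSV or TSV file using only the standard library. The column name `report_text` and the function name `read_reports` are assumptions for illustration; use whatever column your export actually contains. pandas' `read_csv` would serve equally well.

```python
import csv


def read_reports(path, text_column="report_text"):
    """Return the report text from each row, skipping empty reports.

    The delimiter is chosen from the file extension: tab for .tsv, comma otherwise.
    """
    delimiter = "\t" if str(path).lower().endswith(".tsv") else ","
    # newline="" lets the csv module handle line breaks inside quoted reports.
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter=delimiter)
        if text_column not in (reader.fieldnames or []):
            raise KeyError(f"column {text_column!r} not found in {path}")
        return [
            row[text_column]
            for row in reader
            if row[text_column] and row[text_column].strip()
        ]
```

Free-text reports usually span several lines, so the file needs to quote the report field for each report to stay on a single logical row.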
If you use this code or data, please cite the published project:
Sartipi S, Godfrey DS, Tauțan A-M, Fernandes MP, Ghanta M, Gupta A, Nearing B,
Kim J, Struck A, Westover MB, Zafar SF. Automated Extraction of Seizures and
Ictal-Interictal Continuum Patterns from EEG Reports: Data and Code.
BDSP, 2026. https://doi.org/10.60508/j6w0-v279
Code and data are released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You're welcome to use, modify, and redistribute everything here for academic and non-commercial research; for commercial use please contact the corresponding author.