bdsp-core/eeg-report-extraction

Automated Extraction of Seizures and IIC Patterns from EEG Reports

Code accompanying:

Sartipi S, Godfrey DS, Tauțan A-M, Fernandes MP, Ghanta M, Gupta A, Nearing B, Kim J, Struck A, Westover MB, Zafar SF. Automated Extraction of Seizures and Ictal-Interictal Continuum Patterns from EEG Reports to Enable Large-Scale Neurocritical Care Research. Manuscript under review, 2026.

Data + code (citable): bdsp.io/content/s9mc6d6c3dkczko0o4v0/ (DOI 10.60508/j6w0-v279)

What's in this repo

A five-stage pipeline for extracting structured seizure and ictal-interictal continuum (IIC) pattern attributes from free-text EEG reports using a large language model:

  1. Preprocessing — whitespace/casing normalization, removal of non-informative lines and decorative symbols.
  2. Rule-based pattern filtering — keyword lists identify candidate report segments mentioning seizures, GPD, GRDA, LPD, LRDA.
  3. Segment-focused prompt generation — assembles focused prompts per candidate segment.
  4. LLM inference — extracts structured attributes (presence, count, burden, frequency, prevalence) from each segment.
  5. Post-processing — schema validation, deduplication, and provenance tracking.
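As a rough illustration of stages 1 and 5, the sketch below normalizes a report and then validates, deduplicates, and tags model outputs. It is not the notebook's implementation: the function names, the decorative-symbol rule, the JSON output format, and the `REQUIRED_FIELDS` schema are all assumptions made for this example.

```python
import json
import re

# Hypothetical schema: field names follow the attribute list above, but the
# schema actually used in the notebook may differ.
REQUIRED_FIELDS = ("pattern", "presence", "count", "burden", "frequency", "prevalence")

# Assumed rule: runs of three or more decorative symbols, e.g. "-----" or "=====".
DECORATIVE = re.compile(r"[*_=#~\-]{3,}")


def preprocess(report: str) -> str:
    """Stage 1 sketch: normalize casing/whitespace and drop empty lines."""
    cleaned = []
    for line in report.splitlines():
        line = DECORATIVE.sub(" ", line)
        line = re.sub(r"\s+", " ", line).strip().lower()
        if line:  # skip lines left empty after cleaning
            cleaned.append(line)
    return "\n".join(cleaned)


def postprocess(raw_outputs, report_id):
    """Stage 5 sketch: validate, deduplicate, and attach provenance."""
    seen, records = set(), []
    for segment_index, raw in enumerate(raw_outputs):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # discard outputs that are not valid JSON
        if not isinstance(record, dict):
            continue
        if any(field not in record for field in REQUIRED_FIELDS):
            continue  # discard outputs missing a required field
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        record["provenance"] = {"report_id": report_id, "segment_index": segment_index}
        records.append(record)
    return records
```

Keeping the segment index alongside each record is one simple way to trace an extracted attribute back to the text that produced it.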
Repository layout:

code/
├── code_eeg_annotation.ipynb     main pipeline notebook
├── help_functions.py             shared helpers (text normalization, etc.)
├── list_szPatterns0x.txt         seizure-keyword lists used in rule-based filtering
├── list_szPatterns1.txt
├── list_szPatternsR.txt
├── list_gpdPatterns.txt          IIC-pattern keyword lists
├── list_grdaPatterns.txt
├── list_lpdPatterns.txt
└── list_lrdaPatterns.txt
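The rule-based filtering stage could look roughly like the sketch below, which loads the keyword lists and keeps sentences that mention at least one phrase. The file names come from the listing above. The one-phrase-per-line file format, the grouping of files into pattern types, the sentence-level segmentation, and the function names are assumptions, so check the notebook and `help_functions.py` for the actual logic.

```python
import re
from pathlib import Path

# File names are from the listing above; the grouping is an assumption.
KEYWORD_FILES = {
    "seizure": ["list_szPatterns0x.txt", "list_szPatterns1.txt", "list_szPatternsR.txt"],
    "gpd": ["list_gpdPatterns.txt"],
    "grda": ["list_grdaPatterns.txt"],
    "lpd": ["list_lpdPatterns.txt"],
    "lrda": ["list_lrdaPatterns.txt"],
}


def load_keywords(code_dir):
    """Read each list file, assuming one keyword or phrase per line."""
    keywords = {}
    for pattern, names in KEYWORD_FILES.items():
        phrases = []
        for name in names:
            path = Path(code_dir) / name
            if path.exists():
                lines = path.read_text(encoding="utf-8").splitlines()
                phrases += [ln.strip().lower() for ln in lines if ln.strip()]
        keywords[pattern] = phrases
    return keywords


def candidate_segments(report, keywords):
    """Return (pattern, sentence) pairs for sentences that mention a keyword."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for pattern, phrases in keywords.items():
            # Whole-word match, so "lpd" does not fire inside a longer token.
            if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
                hits.append((pattern, sentence))
    return hits
```

For example, `candidate_segments(report_text, load_keywords("code"))` would return only the sentences worth sending to the LLM, each labeled with the pattern type that matched.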

Data

The extracted findings used in the paper (deidentified attributes extracted from EEG reports at three institutions: BCH, BIDMC, and MGB) are hosted on BDSP's credentialed-access S3 bucket. To request access:

  1. Register at bdsp.io and apply for credentialed access (you'll need to sign the BDSP Credentialed Health Data Use Agreement).

  2. Once approved, sync the data:

    aws s3 sync s3://bdsp-opendata-credentialed/eeg-report-extraction/data/ ./data/

    Or use the BDSP S3 access-point alias if your team has one provisioned:

    aws s3 sync \
      s3://bdsp-credentialed-pr-fymwc8rqh9fzdisq7om7eiq9wutqhuse1b-s3alias/eeg-report-extraction/data/ \
      ./data/

Running the pipeline

# Python 3.10+
python -m venv .venv && source .venv/bin/activate
pip install pandas numpy jupyterlab openai  # or your LLM client of choice

jupyter lab code/code_eeg_annotation.ipynb

The notebook expects EEG-report text in a tabular CSV/TSV file (one report per row). Configure your LLM credentials and model name in the notebook. The original work used GPT-4-class models; smaller models will degrade extraction quality on rare pattern types.
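To make the input format and the prompt/inference steps concrete, here is a minimal sketch that reads one report per row, builds a segment-focused prompt, and parses the model reply. The column name `report_text`, the prompt wording, and the JSON reply format are hypothetical and not taken from the notebook. The model call is passed in as a plain function, so any LLM client can be plugged in.

```python
import csv
import json

# Hypothetical prompt wording; the notebook's actual prompts may differ.
PROMPT_TEMPLATE = (
    "You are reading one segment of an EEG report.\n"
    "Pattern of interest: {pattern}\n"
    "Segment: {segment}\n"
    "Return a JSON object with the keys presence, count, burden, "
    "frequency, prevalence."
)


def read_reports(path, text_column="report_text"):
    """Read one report per row; 'report_text' is an assumed column name."""
    delimiter = "\t" if str(path).lower().endswith(".tsv") else ","
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return [row[text_column] for row in reader]


def build_prompt(pattern, segment):
    """Assemble a prompt focused on a single candidate segment."""
    return PROMPT_TEMPLATE.format(pattern=pattern, segment=segment)


def extract(pattern, segment, call_llm):
    """Run one segment through the model.

    call_llm is any function that maps a prompt string to the model's
    text reply, for example a thin wrapper around your LLM client.
    """
    reply = call_llm(build_prompt(pattern, segment))
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None  # leave malformed replies for post-processing to handle
```

Injecting `call_llm` keeps credentials and model selection in one place and lets the rest of the code be tested with a stub that returns a fixed reply.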

Citation

If you use this code or data, please cite the published project:

Sartipi S, Godfrey DS, Tauțan A-M, Fernandes MP, Ghanta M, Gupta A, Nearing B,
Kim J, Struck A, Westover MB, Zafar SF. Automated Extraction of Seizures and
Ictal-Interictal Continuum Patterns from EEG Reports: Data and Code.
BDSP, 2026. https://doi.org/10.60508/j6w0-v279

License

Code and data are released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You're welcome to use, modify, and redistribute everything here for academic and non-commercial research; for commercial use please contact the corresponding author.
