Skip to content

sociocom/prism-annotator

Repository files navigation

prism-annotator

A CLI tool for automatic PRISM annotation of medical/clinical texts using LLMs.

PRISM (Problem-oriented, Real-time, Informatics-based, Structured, Medical record) defines a schema for annotating medical entities (diseases, symptoms, anatomical parts, tests, medications, etc.) and their relations (temporal, spatial, causal) in clinical text.

prism-annotator uses the TANL (Translation between Augmented Natural Languages; Paolini+, ICLR 2021) inline annotation format with LLMs to extract structured PRISM annotations from free-text medical documents.

Installation

pip install prism-annotator

Or with uv:

uv add prism-annotator

Quick Start

1. Scaffold a new project

prism init my-project --language en
cd my-project

This creates:

  • config.yaml — extraction configuration
  • prompts/ — system prompts and few-shot examples (customise these)
  • data/ — place your input texts here
  • .env.example — API key template

2. Add your input data

Place .txt files in data/, or point config.yaml at a CSV file:

data:
  input_path: "data/"          # directory of .txt files
  # input_path: "data/notes.csv"  # or a CSV file
  # text_column: "text"           # CSV column name

3. Add few-shot examples

Edit prompts/entity_examples.yaml with domain-specific examples:

- input: "Chest CT showed ground-glass opacity in the right lung."
  output: "[Chest CT | t-test(+)] showed [ground-glass opacity | d(+)] in the [right lung | a]."

4. Set your API key and run

export OPENROUTER_API_KEY=sk-...
prism extract --config config.yaml

CLI Commands

prism extract       Run entity or relation extraction
prism merge         Merge entity + relation results
prism visualise     Generate interactive HTML viewer
prism to-xml        Convert to PRISM inline XML
prism validate      Validate results against PRISM schema
prism init          Scaffold a new project

Multi-phase pipeline

PRISM annotation runs in three phases:

# Phase 1: Entity extraction
prism extract --config entity.yaml

# Phase 2a: Medical relation extraction
prism extract --config medical_rel.yaml --entity-results output/entity/results.json

# Phase 2b: Temporal relation extraction
prism extract --config time_rel.yaml --entity-results output/entity/results.json

# Phase 3: Merge
prism merge --entity output/entity/results.json \
            --medical-relation output/medical/results.json \
            --time-relation output/time/results.json \
            -o output/merged

# Generate viewer
prism visualise output/merged

Configuration

See config.yaml generated by prism init for all options. Key settings:

Section Field Description
data.input_path Path Directory of .txt files or a .csv file
data.text_column String CSV column containing document text
model.model_id String LLM model ID (OpenRouter format)
model.base_url URL API endpoint (OpenRouter, OpenAI, etc.)
prompts.language ja/en Language for built-in system prompts
prompts.prompts_dir Path Custom prompts directory

Custom Prompts

The prompt fallback chain:

  1. prompts_dir in config (if set)
  2. prompts/ in working directory
  3. Built-in defaults (Japanese or English)

Each phase has a system prompt (.md) and few-shot examples (.yaml):

Phase System prompt Examples
Entity entity_system.md entity_examples.yaml
Medical relation medical_relation_system.md medical_relation_examples.yaml
Time relation time_relation_system.md time_relation_examples.yaml

Output Formats

  • JSON (results.json) — primary structured output
  • XML (results.xml) — PRISM inline-annotated XML
  • HTML (viewer.html) — interactive browser-based viewer
  • Statistics (stats.json) — entity/relation distribution

PRISM Schema (v8)

13 entity types, 10 relation types. See the PRISM Annotation Guidelines v8 for full specification:

The PRISM annotation scheme was originally proposed in the following works:

Shuntaro Yada, Ayami Joh, Ribeka Tanaka, Fei Cheng, Eiji Aramaki, and Sadao Kurohashi. 2020. Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 4565–4572, Marseille, France. European Language Resources Association. [ACL Anthology]

矢田 竣太郎, 田中 リベカ, Fei Cheng, 荒牧 英治, 黒橋 禎夫. 2022. 汎用的な臨床医学テキストアノテーション仕様およびガイドラインの策定:重篤肺疾患ドメインに着目して. 自然言語処理, 29(4), pp. 1165–1197. [J-STAGE]

Note that our "PRISM" acronym (i.e. Problem-oriented, Real-time, Informatics-based, Structured, Medical record) dereived from a research funding scheme, called PRISM (Public/Private R&D Investment Strategic Expansion PrograM), by which our research project above was originally supported.

The TANL inline annotation format used by this tool is adapted from:

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured Prediction as Translation between Augmented Natural Languages. In Proceedings of the Ninth International Conference on Learning Representations (ICLR). [OpenReview]

Supported LLM Providers

Any OpenAI-compatible API endpoint works. Configure in config.yaml:

model:
  model_id: "anthropic/claude-sonnet-4"   # OpenRouter
  base_url: "https://openrouter.ai/api/v1"
  api_key_env: "OPENROUTER_API_KEY"
model:
  model_id: "gpt-4o"                        # OpenAI direct
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"

Citation

If you use this tool, please cite the original PRISM annotation works:

@inproceedings{yada-etal-2020-towards,
    title = "Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases",
    author = "Yada, Shuntaro and Joh, Ayami and Tanaka, Ribeka and Cheng, Fei and Aramaki, Eiji and Kurohashi, Sadao",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.561/",
    pages = "4565--4572",
    isbn = "979-10-95546-34-4",
}

@article{yada-etal-2022-prism,
    title = "汎用的な臨床医学テキストアノテーション仕様およびガイドラインの策定:重篤肺疾患ドメインに着目して",
    author = "矢田, 竣太郎 and 田中, リベカ and Cheng, Fei and 荒牧, 英治 and 黒橋, 禎夫",
    journal = "自然言語処理",
    volume = "29",
    number = "4",
    pages = "1165--1197",
    year = "2022",
    url = "https://www.jstage.jst.go.jp/article/jnlp/29/4/29_1165/_article/-char/ja/",
}

See also CITATION.cff for machine-readable citation metadata.

Changelog

0.2.1

  • Fix: viewer entity highlights now align to the original source text instead of using TANL-stripped offsets, which caused the first character(s) of entities to be excluded or misaligned.

0.2.0

  • Initial release.

Licence

MIT

About

A CLI tool for automatic PRISM annotation of medical/clinical texts using LLMs

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages