SPO Extractor using Stanza

A dependency parsing-based tool for extracting Subject-Predicate-Object (SPO) triples from biomedical text using Stanford CoreNLP and Stanza.

Overview

This tool extracts structured SPO (Subject-Predicate-Object) relationships from natural language sentences and is particularly useful for biomedical text mining. It combines dependency parsing patterns with noun phrase chunking to identify semantic relationships between entities.

Features

Trigger-based extraction driven by dependency parsing
Stanza's neural pipeline for dependency parsing
Support for complex grammatical structures (passive voice, coordinating conjunctions, relative clauses)
Integration with Stanford CoreNLP's Tregex for noun phrase chunking

Requirements

Python 3.x
Stanford CoreNLP 4.0.0 or later
Stanza library with English models

Installation

1. Download Stanford CoreNLP

Download and extract the Stanford CoreNLP distribution (4.0.0 or later) and the accompanying model JAR files.

Place the model JAR files in the CoreNLP distribution folder.

2. Set Environment Variables

export CORENLP_HOME=/path/to/stanford-corenlp-full-2020-04-20
export DATA_DIR=/path/to/data/
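
How the tool itself reads these variables is not shown in this README. As a minimal sketch, a script could validate both variables before starting any parsing work; the helper name `resolve_paths` is hypothetical and not part of this project.

```python
import os
from pathlib import Path


def resolve_paths(env=None):
    """Return (corenlp_home, data_dir) as Path objects, or raise if either is unset.

    `resolve_paths` is an illustrative helper, not part of this project.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("CORENLP_HOME", "DATA_DIR") if not env.get(name)]
    if missing:
        raise RuntimeError("Unset environment variables: " + ", ".join(missing))
    return Path(env["CORENLP_HOME"]), Path(env["DATA_DIR"])
```

Failing early with the names of the missing variables tends to be easier to debug than a later error from the CoreNLP client or a file reader.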

3. Install Python Dependencies

pip install -r requirements.txt

4. Download Stanza English Model

python -c 'import stanza; stanza.download("en")'

Usage

Extract SPO from Input Directory

PYTHONPATH=. python bin/run_spo.py -i input_directory -o output_file

Parameters:

-i, --input: Directory containing input text files
-o, --output: Output file for extracted SPO triples
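
For orientation, the two options above could be declared with `argparse` roughly as follows. This is a sketch only: the actual contents of bin/run_spo.py are not shown here, and marking both options as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented -i/--input and -o/--output options."""
    parser = argparse.ArgumentParser(description="Extract SPO triples from text files.")
    # required=True is an assumption; the real script may provide defaults instead.
    parser.add_argument("-i", "--input", required=True,
                        help="Directory containing input text files")
    parser.add_argument("-o", "--output", required=True,
                        help="Output file for extracted SPO triples")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["-i", "input_directory", "-o", "output_file"])
print(args.input, args.output)
```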

Test with Example Sentences

PYTHONPATH=. python tests/test_SPOs.py

Example Sentences and Universal Dependencies

The encapsulation of rifampicin leads to a reduction of the Mycobacterium smegmatis inside macrophages.
nsubj (nominal subject) <= VERB => obl (oblique nominal)

The Norwalk virus is the prototype virus that causes epidemic gastroenteritis infecting predominantly older children and adults.
acl:relcl (adjectival clause) => VERB => obj (object)

It is widely agreed that the exposure to ambient air pollution may cause serious respiratory illnesses and that weather conditions may also contribute to the seriousness.
nsubj <= VERB => obj

In this report, ribavirin was shown to inhibit SARS coronavirus replication in five different cell types of animal or human origin at therapeutically achievable concentrations.
nsubj:pass <= xcomp => VERB => obj

Chronic hepatitis virus infection is a major cause of chronic hepatitis, cirrhosis, and hepatocellular carcinoma worldwide.
nsubj <= NOUN => nmod => conj (coordinating conjunctions)

Supported Dependency Patterns

The extractor handles various dependency patterns including:

nsubj <= VERB => obj: Basic subject-verb-object
nsubj <= VERB => obl: Verb with oblique nominal
acl:relcl => VERB => obj: Relative clauses
nsubj:pass <= xcomp => VERB => obj: Passive voice constructions
nsubj <= NOUN => nmod => conj: Coordinating conjunctions
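
To make the first two patterns in the list above concrete, the sketch below matches a verb that has both an `nsubj` dependent and an `obj` or `obl` dependent. It is not the project's matcher: the `Token` tuple, the function name, and the toy parse are all hypothetical, and the parse is written by hand rather than produced by a parser.

```python
from collections import namedtuple

# Hypothetical token record: 1-based id, surface text, UPOS tag, head id (0 = root), relation.
Token = namedtuple("Token", "id text upos head deprel")


def match_subject_verb_object(tokens, object_relations=("obj", "obl")):
    """Return (subject, verb, object) triples for verbs with nsubj and obj/obl dependents."""
    triples = []
    for verb in tokens:
        if verb.upos != "VERB":
            continue
        children = [t for t in tokens if t.head == verb.id]
        subjects = [t for t in children if t.deprel == "nsubj"]
        objects = [t for t in children if t.deprel in object_relations]
        triples.extend((s.text, verb.text, o.text) for s in subjects for o in objects)
    return triples


# Hand-written parse of "Ribavirin inhibits replication." (an assumption, not parser output).
TOY_PARSE = [
    Token(1, "Ribavirin", "NOUN", 2, "nsubj"),
    Token(2, "inhibits", "VERB", 0, "root"),
    Token(3, "replication", "NOUN", 2, "obj"),
    Token(4, ".", "PUNCT", 2, "punct"),
]

print(match_subject_verb_object(TOY_PARSE))
# → [('Ribavirin', 'inhibits', 'replication')]
```

This returns only head words; expanding each head word into a full noun phrase is a separate chunking step.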

What is Stanza?

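Stanza is a Python NLP library from the Stanford NLP Group. It provides a PyTorch-based neural pipeline and a Python client for the Java-based Stanford CoreNLP server. This tool uses both:

CoreNLP's Tregex, accessed through the client, for noun phrase chunking
Neural pipeline for dependency parsing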
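
As a rough illustration of the neural pipeline side, the sketch below builds a Stanza pipeline and flattens a parsed document into (head, relation, dependent) tuples. The function names are hypothetical and the processor list is an assumption about a reasonable configuration, not necessarily the one this project uses; the CoreNLP/Tregex side is not shown.

```python
def build_pipeline(lang="en"):
    """Create a Stanza pipeline that runs through dependency parsing.

    stanza is imported lazily so this module can be loaded without it installed.
    The processor list is an assumption, not this project's configuration.
    """
    import stanza
    return stanza.Pipeline(lang=lang, processors="tokenize,mwt,pos,lemma,depparse")


def dependency_edges(doc):
    """Flatten a parsed document into (head_text, deprel, dependent_text) tuples."""
    edges = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is a 1-based index into sentence.words; 0 marks the root.
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            edges.append((head, word.deprel, word.text))
    return edges
```

With the English model downloaded, `dependency_edges(build_pipeline()("Ribavirin inhibits replication."))` would list one edge per word; the exact relations depend on the model version.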

Algorithm

The extraction process follows these steps:

1. Input Processing: Accept a sentence and a list of trigger words
2. Trigger Detection: Check whether any trigger word appears in the sentence
3. Parsing: If a trigger is found, run the dependency parser and the chunker on the sentence
4. Head Word Identification: Follow dependency relations from the trigger to identify the syntactic head words
5. NP Extraction: Extract noun phrases by merging dependency relations and chunks based on the head words
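
The steps above can be sketched end to end on toy data. Everything in this block is an assumption made for illustration: the function names, the substring-based trigger test, and the `toy_parse` stand-in, which returns a hand-written parse and chunk list instead of calling Stanza or CoreNLP.

```python
from collections import namedtuple

# Hypothetical token record: 1-based id, surface text, head id (0 = root), relation.
Token = namedtuple("Token", "id text head deprel")


def detect_trigger(sentence, triggers):
    """Trigger detection: first trigger found in the sentence (crude substring test)."""
    lowered = sentence.lower()
    return next((t for t in triggers if t.lower() in lowered), None)


def head_words(tokens, trigger_token):
    """Head word identification: the trigger's first subject and object dependents."""
    subject = obj = None
    for tok in tokens:
        if tok.head != trigger_token.id:
            continue
        if tok.deprel in ("nsubj", "nsubj:pass") and subject is None:
            subject = tok
        elif tok.deprel in ("obj", "obl") and obj is None:
            obj = tok
    return subject, obj


def noun_phrase(tokens, chunks, head):
    """NP extraction: text of the chunk (inclusive id span) that covers the head word."""
    for start, end in chunks:
        if start <= head.id <= end:
            return " ".join(t.text for t in tokens if start <= t.id <= end)
    return head.text  # fall back to the bare head word


def extract_spo(sentence, triggers, parse):
    """Run the steps in order; `parse` returns (tokens, chunks) for a sentence."""
    trigger = detect_trigger(sentence, triggers)
    if trigger is None:
        return None  # no trigger, so the sentence is never parsed
    tokens, chunks = parse(sentence)
    trigger_token = next(
        (t for t in tokens if t.text.lower().startswith(trigger.lower())), None)
    if trigger_token is None:
        return None
    subject, obj = head_words(tokens, trigger_token)
    if subject is None or obj is None:
        return None
    return (noun_phrase(tokens, chunks, subject), trigger_token.text,
            noun_phrase(tokens, chunks, obj))


def toy_parse(sentence):
    """Stand-in for parser and chunker: hand-written output for one sentence."""
    tokens = [
        Token(1, "Air", 2, "compound"),
        Token(2, "pollution", 3, "nsubj"),
        Token(3, "causes", 0, "root"),
        Token(4, "respiratory", 5, "amod"),
        Token(5, "illnesses", 3, "obj"),
        Token(6, ".", 3, "punct"),
    ]
    chunks = [(1, 2), (4, 5)]  # noun phrase spans as inclusive token ids
    return tokens, chunks


print(extract_spo("Air pollution causes respiratory illnesses.", ["cause"], toy_parse))
# → ('Air pollution', 'causes', 'respiratory illnesses')
```

Checking for a trigger before parsing keeps the comparatively expensive parser and chunker off sentences that cannot yield a triple.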

Testing

Run All Tests

cd /path/to/project-directory
pytest tests/test_SPOs.py

Run Data Reader Tests

export DATA_DIR="$(pwd)/data/tests"
pytest tests/test_data_reader.py

Test with Example Sentences

PYTHONPATH=. python tests/test_SPOs.py

Future Enhancements

Biomedical NER Integration: Biomedical Named Entity Recognizers can improve NP chunking and identify semantic roles of noun phrases
Entity Type Classification: Distinguish between different types of biomedical entities (diseases, drugs, proteins, etc.)
Relation Classification: Categorize extracted relationships by type (causation, association, treatment, etc.)
