A dependency parsing-based tool for extracting Subject-Predicate-Object (SPO) triples from biomedical text using Stanford CoreNLP and Stanza.
- Overview
- Features
- Requirements
- Installation
- Usage
- Example Sentences and Universal Dependencies
- Algorithm
- Testing
- Future Enhancements
This tool extracts structured SPO (Subject-Predicate-Object) relationships from natural language sentences and is aimed primarily at biomedical text mining. It combines dependency parsing patterns with noun phrase chunking to identify semantic relationships between entities.
- Trigger-based extraction with dependency parsing
- Neural pipeline for accurate dependency analysis
- Support for complex grammatical structures (passive voice, coordinating conjunctions, relative clauses)
- Integration with Stanford CoreNLP's Tregex for chunking
- Stanza's neural models for dependency parsing
- Python 3.x
- Stanford CoreNLP 4.0.0 or later
- Stanza library with English models
Download and extract the following:
Place the model JAR files in the CoreNLP distribution folder.
export CORENLP_HOME=/path/to/stanford-corenlp-full-2020-04-20
export DATA_DIR=/path/to/data/
pip install -r requirements.txt
python -c 'import stanza; stanza.download("en")'
PYTHONPATH=. python bin/run_spo.py -i input_directory -o output_file
Parameters:
- -i, --input: Directory containing input text files
- -o, --output: Output file for extracted SPO triples
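For readers who want to see how the two documented options might be wired up, here is a minimal sketch using `argparse`. It is not the actual contents of `bin/run_spo.py`; the helper names `build_parser` and `list_input_files`, and the assumption that the input directory is flat, are illustrative only.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract SPO triples from text files.")
    parser.add_argument("-i", "--input", required=True, type=Path,
                        help="Directory containing input text files")
    parser.add_argument("-o", "--output", required=True, type=Path,
                        help="Output file for extracted SPO triples")
    return parser


def list_input_files(input_dir: Path) -> List[Path]:
    """Return the regular files directly inside input_dir, sorted by name.

    Assumption: input files are not nested in subdirectories.
    """
    return sorted(p for p in input_dir.iterdir() if p.is_file())


# Parse an explicit argument list instead of sys.argv to show the result.
args = build_parser().parse_args(["-i", "input_directory", "-o", "output_file"])
print(args.input, args.output)
```

Both options are marked as required here because the usage line above always passes them; the real script may define defaults instead.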
PYTHONPATH=. python tests/test_SPOs.py
- The encapsulation of rifampicin leads to a reduction of the Mycobacterium smegmatis inside macrophages.
- nsubj (nominal subject) <= VERB => obl (oblique nominal)
- The Norwalk virus is the prototype virus that causes epidemic gastroenteritis infecting predominantly older children and adults.
- acl:relcl (relative clause modifier) => VERB => obj (object)
- It is widely agreed that the exposure to ambient air pollution may cause serious respiratory illnesses and that weather conditions may also contribute to the seriousness.
- nsubj <= VERB => obj
- In this report, ribavirin was shown to inhibit SARS coronavirus replication in five different cell types of animal or human origin at therapeutically achievable concentrations.
- nsubj:pass <= xcomp => VERB => obj
- Chronic hepatitis virus infection is a major cause of chronic hepatitis, cirrhosis, and hepatocellular carcinoma worldwide.
- nsubj <= NOUN => nmod => conj
- coordinating conjunctions
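The last example, where a noun predicate ("cause") has an `nmod` dependent with coordinated `conj` siblings, can be illustrated with a small sketch. The parse below is hand-written and simplified for illustration, not parser output, and the `Word` tuple and function names are hypothetical rather than taken from this project.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    id: int       # 1-based token index
    text: str
    upos: str     # universal part-of-speech tag
    head: int     # id of the governing word, 0 for the root
    deprel: str   # dependency relation to the head


# Hand-written, simplified parse of:
# "infection is a major cause of hepatitis, cirrhosis, and carcinoma"
WORDS = [
    Word(1, "infection", "NOUN", 4, "nsubj"),
    Word(2, "is", "AUX", 4, "cop"),
    Word(3, "major", "ADJ", 4, "amod"),
    Word(4, "cause", "NOUN", 0, "root"),
    Word(5, "of", "ADP", 6, "case"),
    Word(6, "hepatitis", "NOUN", 4, "nmod"),
    Word(7, "cirrhosis", "NOUN", 6, "conj"),
    Word(8, "carcinoma", "NOUN", 6, "conj"),
]


def children(words: List[Word], head_id: int, deprel: str) -> List[Word]:
    """Return the words attached to head_id with the given relation."""
    return [w for w in words if w.head == head_id and w.deprel == deprel]


def noun_predicate_triples(words: List[Word]) -> List[Tuple[str, str, str]]:
    """Match nsubj <= NOUN => nmod => conj and emit one triple per conjunct."""
    triples = []
    for pred in words:
        if pred.upos != "NOUN":
            continue
        subjects = children(words, pred.id, "nsubj")
        for first in children(words, pred.id, "nmod"):
            # The first conjunct plus every word coordinated with it.
            objects = [first] + children(words, first.id, "conj")
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, pred.text, obj.text))
    return triples


print(noun_predicate_triples(WORDS))
```

Only head words are returned here; expanding them to full noun phrases is a separate step.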
The extractor handles various dependency patterns including:
- nsubj => VERB => obj: Basic subject-verb-object
- nsubj => VERB => obl: Verb with oblique nominal
- acl:relcl => VERB => obj: Relative clauses
- nsubj:pass => xcomp => VERB => obj: Passive voice constructions
- nsubj => NOUN => nmod => conj: Coordinating conjunctions
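As a rough illustration of how the first two patterns could be matched, the sketch below walks a list of dependency edges and pairs each verb's `nsubj` with its `obj`, falling back to `obl` when there is no direct object. The fallback order, the `Word` tuple, and the function names are assumptions made for this example; the parses are hand-written and simplified.

```python
from typing import List, NamedTuple, Set, Tuple


class Word(NamedTuple):
    id: int       # 1-based token index
    text: str
    upos: str     # universal part-of-speech tag
    head: int     # id of the governing word, 0 for the root
    deprel: str   # dependency relation to the head


def dependents(words: List[Word], head_id: int, deprels: Set[str]) -> List[Word]:
    """Return the words attached to head_id by any relation in deprels."""
    return [w for w in words if w.head == head_id and w.deprel in deprels]


def verb_triples(words: List[Word]) -> List[Tuple[str, str, str]]:
    """Match nsubj/VERB/obj and nsubj/VERB/obl on head words only."""
    triples = []
    for verb in words:
        if verb.upos != "VERB":
            continue
        subjects = dependents(words, verb.id, {"nsubj"})
        # Assumption: prefer a direct object, otherwise use an oblique nominal.
        objects = dependents(words, verb.id, {"obj"}) or dependents(words, verb.id, {"obl"})
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, verb.text, obj.text))
    return triples


# "encapsulation leads to reduction" (oblique nominal)
LEADS = [
    Word(1, "encapsulation", "NOUN", 2, "nsubj"),
    Word(2, "leads", "VERB", 0, "root"),
    Word(3, "to", "ADP", 4, "case"),
    Word(4, "reduction", "NOUN", 2, "obl"),
]

# "pollution may cause illnesses" (direct object)
CAUSE = [
    Word(1, "pollution", "NOUN", 3, "nsubj"),
    Word(2, "may", "AUX", 3, "aux"),
    Word(3, "cause", "VERB", 0, "root"),
    Word(4, "illnesses", "NOUN", 3, "obj"),
]

print(verb_triples(LEADS))
print(verb_triples(CAUSE))
```

The relative clause and passive patterns need extra traversal steps (from the modified noun, or through `xcomp`) and are not covered by this sketch.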
Stanza is a Python NLP library that provides both a PyTorch-based neural pipeline and a client interface to the Java Stanford CoreNLP server. This tool uses each for a different task:
- Tregex for noun phrase chunking
- Neural pipeline for dependency parsing
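The two components might be called roughly as follows. This is a sketch, not the project's code: the function names are hypothetical, the processor list and the `NP` pattern are assumptions, and the Tregex call needs a local CoreNLP installation referenced by `CORENLP_HOME`. Stanza is imported lazily so that the row-conversion helper can be tried without it installed.

```python
from types import SimpleNamespace


def build_pipeline():
    """Create a Stanza dependency-parsing pipeline (needs stanza and English models)."""
    import stanza  # lazy import: only required when a real pipeline is built
    # Assumed processor list; adjust to the installed Stanza version and models.
    return stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")


def dependency_rows(doc):
    """Flatten a parsed document into (id, text, upos, head, deprel) tuples."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.id, word.text, word.upos, word.head, word.deprel))
    return rows


def tregex_np_matches(text, pattern="NP"):
    """Return the raw Tregex matches for pattern from a CoreNLP server."""
    from stanza.server import CoreNLPClient  # lazy import, needs CORENLP_HOME
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse"]) as client:
        return client.tregex(text, pattern)


# Stand-in document with the same attribute names, so the helper can be
# exercised without downloading any models.
fake_word = SimpleNamespace(id=1, text="Ribavirin", upos="PROPN", head=2, deprel="nsubj")
fake_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[fake_word])])
print(dependency_rows(fake_doc))
```

With Stanza and its models installed, `dependency_rows(build_pipeline()("some sentence"))` would produce the same kind of rows from a real parse.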
The extraction process follows these steps:
- Input Processing: Accept a sentence and a list of trigger words
- Trigger Detection: Check if any trigger word appears in the sentence
- Parsing: If triggered, run dependency parser and chunker on the sentence
- Head Word Identification: Use dependency relations from the trigger to identify syntactic head words
- NP Extraction: Extract noun phrases by merging dependency relations and chunks based on head words
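The steps above can be sketched end to end on a toy example. Everything here is illustrative: the parse and the chunk spans are hand-written rather than produced by the parser and chunker, triggers are matched on lowercase surface forms (the real tool may match differently), and all names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Word(NamedTuple):
    id: int       # 1-based token index
    text: str
    head: int     # id of the governing word, 0 for the root
    deprel: str   # dependency relation to the head


def find_trigger(words: List[Word], triggers: Set[str]) -> Optional[Word]:
    """Trigger detection: return the first word whose lowercase form is a trigger."""
    for word in words:
        if word.text.lower() in triggers:
            return word
    return None


def head_word(words: List[Word], trigger: Word, deprels: Set[str]) -> Optional[Word]:
    """Head word identification: first dependent of the trigger with a given relation."""
    for word in words:
        if word.head == trigger.id and word.deprel in deprels:
            return word
    return None


def chunk_text(words: List[Word], chunks: List[Tuple[int, int]], head: Word) -> str:
    """NP extraction: return the chunk covering the head, or the head alone."""
    for start, end in chunks:  # inclusive token id spans
        if start <= head.id <= end:
            return " ".join(w.text for w in words if start <= w.id <= end)
    return head.text


def extract_spo(words, chunks, triggers) -> Optional[Tuple[str, str, str]]:
    """Run the steps in order; a real system would parse only after a trigger is found."""
    trigger = find_trigger(words, triggers)
    if trigger is None:
        return None
    subj = head_word(words, trigger, {"nsubj", "nsubj:pass"})
    obj = head_word(words, trigger, {"obj", "obl"})
    if subj is None or obj is None:
        return None
    return (chunk_text(words, chunks, subj), trigger.text, chunk_text(words, chunks, obj))


# "Air pollution may cause serious respiratory illnesses"
WORDS = [
    Word(1, "Air", 2, "compound"),
    Word(2, "pollution", 4, "nsubj"),
    Word(3, "may", 4, "aux"),
    Word(4, "cause", 0, "root"),
    Word(5, "serious", 7, "amod"),
    Word(6, "respiratory", 7, "amod"),
    Word(7, "illnesses", 4, "obj"),
]
CHUNKS = [(1, 2), (5, 7)]  # noun phrase spans, as a chunker might return them

print(extract_spo(WORDS, CHUNKS, {"cause", "inhibit"}))
```

Checking for a trigger first keeps the expensive parsing and chunking calls off sentences that cannot yield a triple.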
cd /path/to/project-directory
pytest tests/test_SPOs.py
export DATA_DIR="$(pwd)/data/tests"
pytest tests/test_data_reader.py
PYTHONPATH=. python tests/test_SPOs.py
- Biomedical NER Integration: Biomedical Named Entity Recognizers can improve NP chunking and identify semantic roles of noun phrases
- Entity Type Classification: Distinguish between different types of biomedical entities (diseases, drugs, proteins, etc.)
- Relation Classification: Categorize extracted relationships by type (causation, association, treatment, etc.)




