This project implements a Knowledge Graph Generation system that can extract semantic relationships and create RDF-based knowledge graphs from unstructured text documents.
- Competency Question Generation (https://arxiv.org/pdf/2412.20942)
- Relation Extraction
- Ontology Matching
- Knowledge Graph Construction
- RDF Serialization
- Dynamic Entity Configuration
- Python 3.8+
- Poetry (Python dependency management)
- Clone the repository
- Install Poetry (if not already installed):
    pip install poetry
- Install project dependencies:
    poetry install
- Download spaCy language model:
    poetry run python -m spacy download en_core_web_sm
Generate a knowledge graph from a text file:
poetry run python -m src.main input_document.txt

- input_file: Path to the input text file (required)
- --max-questions: Maximum number of competency questions to generate (default: 3)
- --output-format: Output format for the knowledge graph (choices: turtle, xml, json-ld; default: turtle)
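The arguments listed above map naturally onto `argparse`. The following is a minimal sketch of how such a command line could be declared, not the project's actual `src/main.py`; the function name `build_parser` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="src.main",
        description="Generate an RDF knowledge graph from a text file.",
    )
    # Required positional argument: the document to process.
    parser.add_argument("input_file", help="Path to the input text file")
    # Optional cap on the number of competency questions.
    parser.add_argument(
        "--max-questions",
        type=int,
        default=3,
        help="Maximum number of competency questions to generate",
    )
    # Restrict the serialization format to the documented choices.
    parser.add_argument(
        "--output-format",
        choices=["turtle", "xml", "json-ld"],
        default="turtle",
        help="Output format for the knowledge graph",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(
    ["document.txt", "--max-questions", "5", "--output-format", "json-ld"]
)
print(args.input_file, args.max_questions, args.output_format)
```

With `choices` set, `argparse` rejects any unsupported format before the pipeline starts, which keeps the validation in one place.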
Example:
poetry run python -m src.main document.txt --max-questions 5 --output-format json-ld

The project now supports dynamic entity configuration through a JSON file:
- Create or modify entity_config.json:
{
  "research_area": {
    "description": "The primary field of research or study",
    "domain": "Researcher",
    "range": "Academic Field"
  },
  "academic_institution": {
    "description": "The primary academic institution associated with an individual",
    "domain": "Researcher",
    "range": "Institution"
  }
}
- Use in code:
from src.relation_extractor import RelationExtractor
# Load relations from a JSON file
extractor = RelationExtractor('entity_config.json')
# Add a new custom relation dynamically
extractor.add_custom_relation({
'relation': 'research_project',
'description': 'A significant research project',
'domain': 'Researcher',
'range': 'Research Project'
})
# Save custom relations to a file
extractor.save_custom_relations('updated_entity_config.json')

- src/
  - main.py: Main orchestration script
  - cq_generator.py: Competency Question Generation
  - relation_extractor.py: Relation Extraction
  - ontology_matcher.py: Ontology Matching
  - kg_builder.py: Knowledge Graph Construction
- entity_config.json: Custom entity configuration file
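Because entity_config.json is edited by hand, it can help to check its shape before passing it to the extractor. The sketch below is an assumption-laden illustration, not part of the project: `load_relations` and `REQUIRED_KEYS` are hypothetical names, and the required keys are inferred from the sample configuration shown earlier (`description`, `domain`, `range`).

```python
import json

# Keys every relation entry is assumed to carry, based on the sample config.
REQUIRED_KEYS = {"description", "domain", "range"}


def load_relations(text: str) -> dict:
    """Parse a relation config and verify each entry has the expected keys."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object keyed by relation name")
    for name, spec in config.items():
        if not isinstance(spec, dict):
            raise ValueError(f"relation {name!r} must be a JSON object")
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            raise ValueError(f"relation {name!r} is missing keys: {sorted(missing)}")
    return config


sample = """
{
  "research_area": {
    "description": "The primary field of research or study",
    "domain": "Researcher",
    "range": "Academic Field"
  }
}
"""

relations = load_relations(sample)
for name, spec in relations.items():
    # Show each relation as a domain -> range pair.
    print(f"{name}: {spec['domain']} -> {spec['range']}")
```

Failing early with a named relation and the missing keys is usually easier to debug than an error raised deep inside extraction.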
poetry run pytest
poetry run black .
poetry run isort .
poetry run mypy .

- spaCy: NLP processing
- RDFLib: RDF graph manipulation
- Poetry: Dependency management
- Add dynamic entity configuration
- Improve Named Entity Recognition
- Add more sophisticated relation extraction
- Support multiple input document formats
- Enhance ontology matching capabilities
[Specify your license here]
Contributions are welcome! Please feel free to submit a Pull Request.
- spaCy provides a number of pre-trained models for NLP use (https://spacy.io/models/en); there is also a list of third-party models. Disable all components in the CNN/CPU pipeline except NER for better performance.
- For a list of NER entities to extract - click here.
- The only competency question (CQ) template currently used is "What is the {entity_type} of {entity}?". Feel free to update the program files to add variations specific to your domain.
- scispaCy (https://allenai.github.io/scispacy/) provides third-party NLP models for extracting biological and clinical entities from documents.
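The single question template mentioned in the notes above can be filled in with plain string formatting. This is a rough sketch rather than the code in cq_generator.py: the function `generate_questions`, the use of relation names as `entity_type` values, the underscore-to-space conversion, and the example entity are all assumptions made for illustration.

```python
from typing import Iterable, List

# The one template quoted in the notes.
CQ_TEMPLATE = "What is the {entity_type} of {entity}?"


def generate_questions(
    entity: str, entity_types: Iterable[str], max_questions: int = 3
) -> List[str]:
    """Fill the template once per entity type, stopping at max_questions."""
    questions: List[str] = []
    for entity_type in entity_types:
        if len(questions) >= max_questions:
            break
        # Assumption: config-style names such as "research_area" read better with spaces.
        readable = entity_type.replace("_", " ")
        questions.append(CQ_TEMPLATE.format(entity_type=readable, entity=entity))
    return questions


for question in generate_questions(
    "Ada Lovelace", ["research_area", "academic_institution"]
):
    print(question)
```

Domain-specific variations could be added by turning `CQ_TEMPLATE` into a list of templates and iterating over both templates and entity types.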