Neural Dependency Parser for Old Church Slavonic and Old East Slavic
Modern neural dependency parser for Old Slavic texts, combining state-of-the-art neural NLP with symbolic grammar validation.
OldSlavNet-Modernized is a modernization of the original OldSlavNet parser by Nilo Pedrazzini, updated to use:
- Stanza framework (Stanford NLP)
- DiaParser (biaffine attention)
- PyTorch (modern deep learning)
- Prolog validation (neural-symbolic hybrid)
Adapted from the Coptic dependency parser architecture.
✨ Modern NLP Stack
- Python 3.9+ compatible (no compilation needed!)
- PyTorch backend
- Easy installation with pip
🧠 Neural + Symbolic
- Neural parsing with Stanza/DiaParser
- Prolog-based grammatical validation
- Handles Old Slavic complexity:
  - 7 cases (nom, gen, dat, acc, inst, loc, voc)
  - Dual number
  - Rich past-tense system (aorist, imperfect, perfect)
📚 Trained on TOROT
- Uses the Tromsø Old Russian and OCS Treebank (TOROT)
- Supports multiple Old Slavic varieties
- Python 3.9 or higher
- SWI-Prolog (for grammatical validation, optional)
- Clone the repository:
git clone https://github.com/YOUR-USERNAME/oldslavnet-modernized.git
cd oldslavnet-modernized
- Create virtual environment:
python3.9 -m venv .venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows
- Install dependencies:
pip install -r requirements.txt
- Install SWI-Prolog (optional, for validation):
# Ubuntu/Debian:
sudo apt install swi-prolog
# macOS:
brew install swi-prolog
# Then install Python bindings:
pip install pyswip
python oldslavic_parser.py \
--input path/to/input.conllu \
--output path/to/output.conllu \
--model-dir path/to/models # optional
from oldslavic_parser import OldSlavicParser
# Initialize parser
parser = OldSlavicParser(use_prolog=True)
# Parse text
text = "Въ начѧлѣ бѣ слово" # "In the beginning was the Word"
result = parser.parse(text)
# Access parsed data
for sentence in result['sentences']:
    for word in sentence:
        print(f"{word['text']}\t{word['lemma']}\t{word['upos']}\t{word['deprel']}")
A trained DiaParser model is available in data/models/oldslavic_parser/. Its performance on TOROT:
- UAS (Unlabeled Attachment Score): 86.47%
- LAS (Labeled Attachment Score): 81.48%
- Training data: ~30K sentences from TOROT treebank
- Vocabulary: 22,191 words, 43 dependency relations
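For readers unfamiliar with the two metrics, here is a minimal sketch of how UAS and LAS can be computed from a gold and a predicted CoNLL-U file. It is an illustration only, not the code in scripts/evaluate_parser.py: the helper names (`to_conllu`, `read_conllu`, `attachment_scores`) are hypothetical, the example parse is constructed by hand, and details such as punctuation handling are ignored.

```python
def to_conllu(rows):
    """Build one CoNLL-U sentence from (form, head, deprel) triples.

    Columns that the metrics do not need are filled with '_'.
    """
    lines = []
    for i, (form, head, deprel) in enumerate(rows, start=1):
        cols = [str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"


def read_conllu(text):
    """Return a flat list of (HEAD, DEPREL) pairs, one per syntactic word."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append((cols[6], cols[7]))
    return tokens


def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS) as percentages over all tokens."""
    gold, pred = read_conllu(gold_text), read_conllu(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted data must contain the same tokens")
    # UAS: the head is correct. LAS: head and relation label are both correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / len(gold), 100.0 * las_hits / len(gold)


# Hand-built example: the prediction mislabels token 2 and misattaches token 4.
GOLD = to_conllu([("Въ", 2, "case"), ("начѧлѣ", 3, "obl"),
                  ("бѣ", 0, "root"), ("слово", 3, "nsubj")])
PRED = to_conllu([("Въ", 2, "case"), ("начѧлѣ", 3, "nmod"),
                  ("бѣ", 0, "root"), ("слово", 2, "nsubj")])

uas, las = attachment_scores(GOLD, PRED)
print(f"UAS {uas:.2f}%  LAS {las:.2f}%")
```

In this toy case three of four heads are correct (UAS 75%) but only two tokens have both the correct head and the correct label (LAS 50%), which is why LAS is never higher than UAS.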
python scripts/evaluate_parser.py \
--model data/models/oldslavic_parser/model \
--test data/training/test.conllu
To retrain or train on your own data:
# Prepare TOROT data (or your own CoNLL-U files)
python scripts/prepare_torot_data.py \
--torot-dir /path/to/torot \
--output-dir data/training
python scripts/train_diaparser.py \
--train data/training/train.conllu \
--dev data/training/dev.conllu \
--test data/training/test.conllu \
--output data/models/oldslavic_parser
Training takes ~5-6 hours on CPU for 100 epochs.
- Tokenizer: Stanza
- POS Tagger: BiLSTM with biaffine scoring (Stanza)
- Lemmatizer: Sequence-to-sequence (Stanza)
- Parser: Biaffine attention (DiaParser)
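To give a feel for what "biaffine attention" means in the parser component, the sketch below scores every possible head for every token with a biaffine form, `dep[i]ᵀ · U · head[j] + bias · head[j]`, and then picks the best head per token. This is a toy in pure Python under stated assumptions: the vectors, the weight matrix `U`, and the function names are made up for illustration, and it is not DiaParser's implementation, which learns these representations with a neural encoder and decodes them into a well-formed tree.

```python
def biaffine_arc_scores(dep, head, U, head_bias):
    """Return scores[i][j]: how well token j fits as the head of token i.

    dep[i]   : vector for token i in its role as a dependent
    head[j]  : vector for token j in its role as a head
    U        : weight matrix (len(dep[i]) x len(head[j]))
    head_bias: vector giving each head a prior score independent of the dependent
    """
    scores = []
    for d in dep:
        # Row vector d^T U, then add the bias: (d^T U + b^T).
        d_u = [sum(d[k] * U[k][m] for k in range(len(d))) + head_bias[m]
               for m in range(len(U[0]))]
        # Dot product with every candidate head vector.
        scores.append([sum(d_u[m] * h[m] for m in range(len(h))) for h in head])
    return scores


def greedy_heads(scores):
    """Pick the best-scoring head for tokens 1..n (index 0 is the artificial ROOT).

    Greedy selection does not guarantee a tree; it is used here only to keep
    the example short.
    """
    heads = []
    for i in range(1, len(scores)):
        candidates = [(s, j) for j, s in enumerate(scores[i]) if j != i]
        heads.append(max(candidates)[1])
    return heads


# Made-up 2-dimensional vectors for ROOT, word 1, word 2.
dep = [[0, 0], [1, 0], [0, 1]]
head = [[0, 1], [1, 0], [1, 1]]
U = [[1, 0], [0, 2]]
head_bias = [0.5, 0.5]

scores = biaffine_arc_scores(dep, head, U, head_bias)
print(greedy_heads(scores))  # word 1 attaches to word 2, word 2 attaches to ROOT
```

A production parser would replace the greedy step with a decoding algorithm that enforces a single root and no cycles, and would score relation labels with a second biaffine classifier.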
A Prolog-based validation layer is under development to detect and correct neural parser errors. The framework includes rules for:
- Case agreement rules (7-case system)
- Number agreement (singular/dual/plural)
- Genitive of negation (objects → genitive with negated verbs)
- Participle agreement (case/number/gender)
- Dependency relation constraints
Status: rule definitions are complete (oldslavic_prolog_rules.py); integration with the parser pipeline is planned for v2.0.
This follows the proven neural-symbolic architecture of the Coptic dependency parser, which uses Janus Prolog for error detection and hallucination prevention.
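As a rough illustration of the kind of check such a validation layer performs, the sketch below applies two of the rules listed above (modifier agreement and genitive of negation) to a parsed sentence. It is written in plain Python rather than Prolog and is not taken from oldslavic_prolog_rules.py. The token format (dicts with `id`, `head`, `deprel`, `lemma`, `feats`), the function name `check_sentence`, and the convention that negation is an `advmod` dependent with lemma "не" are all assumptions made for the example, and the sample analysis is constructed by hand with a deliberate tagging error.

```python
AGREEMENT_FEATURES = ("Case", "Number", "Gender")


def check_sentence(tokens):
    """Return a list of (token_id, message) warnings for one parsed sentence."""
    by_id = {t["id"]: t for t in tokens}
    # Assumption: a verb counts as negated if the particle "не" depends on it.
    negated_verbs = {t["head"] for t in tokens
                     if t["lemma"] == "не" and t["deprel"] == "advmod"}
    warnings = []
    for t in tokens:
        head = by_id.get(t["head"])
        if head is None:  # the root has no head token to compare against
            continue
        # Rule 1: an adjectival modifier agrees with its head noun.
        if t["deprel"] == "amod":
            for feat in AGREEMENT_FEATURES:
                own, other = t["feats"].get(feat), head["feats"].get(feat)
                if own and other and own != other:
                    warnings.append(
                        (t["id"], f"{feat} mismatch with head {head['id']}: {own} vs {other}"))
        # Rule 2: the object of a negated verb is expected in the genitive.
        # This is a tendency rather than a strict law, hence a warning.
        if (t["deprel"] == "obj" and t["head"] in negated_verbs
                and t["feats"].get("Case") != "Gen"):
            warnings.append((t["id"], "object of negated verb is not genitive"))
    return warnings


# Constructed analysis of "не видѣ женѫ добрѫ" with a wrong case on token 4.
SENTENCE = [
    {"id": 1, "head": 2, "deprel": "advmod", "lemma": "не", "feats": {}},
    {"id": 2, "head": 0, "deprel": "root", "lemma": "видѣти", "feats": {}},
    {"id": 3, "head": 2, "deprel": "obj", "lemma": "жена",
     "feats": {"Case": "Acc", "Number": "Sing", "Gender": "Fem"}},
    {"id": 4, "head": 3, "deprel": "amod", "lemma": "добръ",
     "feats": {"Case": "Nom", "Number": "Sing", "Gender": "Fem"}},
]

for token_id, message in check_sentence(SENTENCE):
    print(token_id, message)
```

Reporting warnings instead of silently rewriting the parse keeps the neural output intact and leaves the decision about correction to a later step.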
| Feature | Original (2021) | Modernized (2025) |
|---|---|---|
| Framework | dynet | Stanza/PyTorch |
| Python | 3.4-3.9 | 3.9-3.12 |
| Installation | Complex compilation | Simple pip install |
| Grammar validation | None | Prolog rules |
| GUI | No | Planned |
| Maintenance | Archived | Active |
Current (v1.0):
- Core parser architecture
- DiaParser model trained (86.47% UAS, 81.48% LAS on TOROT)
- Command-line interface
- Prolog rule framework (foundation laid)
Planned Development:
- Neural-symbolic integration (connect DiaParser → Prolog validator)
  - Implement error detection for parser hallucinations
  - Automatic correction based on Old Slavonic grammar rules
  - Following proven Coptic parser architecture
- Extended Prolog rules (aspect, word order, clitics)
- Stanza tokenizer/tagger models (optional enhancement)
- Web demo (Hugging Face Spaces)
- GUI application
- Unified ancient language framework
Original OldSlavNet:
- Nilo Pedrazzini (2020-2021)
- Based on jPTDP architecture
- Trained on TOROT treebank
Modernization:
- Architecture adapted from Coptic dependency parser
- Neural-symbolic integration inspired by Coptic parser's Prolog validation
Data:
- TOROT - Tromsø Old Russian and Old Church Slavonic Treebank
- Universal Dependencies framework
Contributions welcome! Areas needing work:
- Model training - Train Stanza models on TOROT
- Prolog rules - Expand Old Slavic grammar coverage
- Testing - Validation on historical texts
- Documentation - Usage examples, tutorials
- GUI - Desktop application with visualization
CC BY-NC-SA 4.0 - See LICENSE file
If you use this parser, please cite:
Original OldSlavNet:
@inproceedings{pedrazzini2020oldslavnet,
title={Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages},
author={Pedrazzini, Nilo},
booktitle={CHR 2020},
year={2020}
}
Modernization (paper forthcoming)
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: relanir@bluewin.ch
- Original OldSlavNet - dynet-based version
- Coptic Parser - Sister project for Coptic
- TOROT - Training data
- Stanza - NLP framework