Rogaton/oldslavnet-modernized

OldSlavNet-Modernized

Neural Dependency Parser for Old Church Slavonic and Old East Slavic

License: CC BY-NC-SA 4.0 | Python | Stanza

A neural dependency parser for Old Slavic texts that combines a biaffine neural parser with symbolic, Prolog-based grammar validation (the validation layer is still in development).


Overview

OldSlavNet-Modernized is a modernization of the original OldSlavNet parser by Nilo Pedrazzini, updated to use:

  • Stanza framework (Stanford NLP)
  • DiaParser (biaffine attention)
  • PyTorch (modern deep learning)
  • Prolog validation (neural-symbolic hybrid)

The architecture is adapted from the Coptic dependency parser.

Features

Modern NLP Stack

  • Python 3.9+ compatible (no compilation needed!)
  • PyTorch backend
  • Easy installation with pip

🧠 Neural + Symbolic

  • Neural parsing with Stanza/DiaParser
  • Prolog-based grammatical validation
  • Handles Old Slavic complexity:
    • 7 cases (nom, gen, dat, acc, inst, loc, voc)
    • Dual number
    • Complex tense-aspect system (aorist, imperfect, perfect)

📚 Trained on TOROT

  • Uses Tromsø Old Russian and OCS Treebank
  • Supports multiple Old Slavic varieties

Installation

Prerequisites

  • Python 3.9 or higher
  • SWI-Prolog (for grammatical validation, optional)

Quick Start

  1. Clone the repository:
git clone https://github.com/YOUR-USERNAME/oldslavnet-modernized.git
cd oldslavnet-modernized
  2. Create a virtual environment:
python3.9 -m venv .venv
source .venv/bin/activate  # On Linux/macOS
# .venv\Scripts\activate  # On Windows
  3. Install dependencies:
pip install -r requirements.txt
  4. Install SWI-Prolog (optional, for validation):
# Ubuntu/Debian:
sudo apt install swi-prolog

# macOS:
brew install swi-prolog

# Then install Python bindings:
pip install pyswip

Usage

Command Line

python oldslavic_parser.py \
  --input path/to/input.conllu \
  --output path/to/output.conllu \
  --model-dir path/to/models  # optional

Python API

from oldslavic_parser import OldSlavicParser

# Initialize parser (use_prolog=True requires the optional SWI-Prolog and pyswip install)
parser = OldSlavicParser(use_prolog=True)

# Parse text
text = "Въ начѧлѣ бѣ слово"  # "In the beginning was the Word"
result = parser.parse(text)

# Access parsed data
for sentence in result['sentences']:
    for word in sentence:
        print(f"{word['text']}\t{word['lemma']}\t{word['upos']}\t{word['deprel']}")
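
To save or inspect the output outside Python, the same loop can emit tab-separated rows. The sketch below uses a hand-built stand-in for `result` so that it runs without the model. It assumes only the keys used in the example above (`text`, `lemma`, `upos`, `deprel`); the helper name `result_to_tsv` and the sample values are illustrative, not actual parser output.

```python
def result_to_tsv(result):
    """Render a parse result as tab-separated text, one word per row.

    Sentences are separated by a blank line, as in CoNLL-U.
    """
    lines = []
    for sentence in result["sentences"]:
        for word in sentence:
            lines.append("\t".join(
                [word["text"], word["lemma"], word["upos"], word["deprel"]]
            ))
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines) + "\n"


# Hand-built stand-in for parser output (illustrative values only)
sample_result = {
    "sentences": [
        [
            {"text": "бѣ", "lemma": "быти", "upos": "VERB", "deprel": "root"},
            {"text": "слово", "lemma": "слово", "upos": "NOUN", "deprel": "nsubj"},
        ]
    ]
}

print(result_to_tsv(sample_result), end="")
```

Building the string in memory keeps the helper independent of where the rows end up (a file, a spreadsheet import, or a log).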

Pre-trained Model

A trained DiaParser model is available in data/models/oldslavic_parser/. Its scores on TOROT:

  • UAS (Unlabeled Attachment Score): 86.47%
  • LAS (Labeled Attachment Score): 81.48%
  • Training data: ~30K sentences from TOROT treebank
  • Vocabulary: 22,191 words, 43 dependency relations
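
For readers new to the two metrics: UAS is the percentage of words whose predicted head matches the gold head, and LAS additionally requires the dependency label to match. The following is a minimal sketch of that computation over CoNLL-U text, not the project's `evaluate_parser.py`, which may differ in details such as punctuation handling. The helper names and the toy annotations are made up for illustration.

```python
def read_arcs(conllu_text):
    """Return one (head, deprel) pair per syntactic word in CoNLL-U text."""
    arcs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return arcs


def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS) as percentages."""
    gold, pred = read_arcs(gold_text), read_arcs(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted data must contain the same words")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / len(gold), 100.0 * las_hits / len(gold)


def make_conllu(rows):
    """Build minimal CoNLL-U text from (id, form, head, deprel) tuples."""
    return "\n".join(
        "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])
        for i, form, head, deprel in rows
    ) + "\n"


# Toy annotations for illustration only (not taken from TOROT)
gold = make_conllu([(1, "въ", 2, "case"), (2, "начѧлѣ", 3, "obl"),
                    (3, "бѣ", 0, "root"), (4, "слово", 3, "nsubj")])
pred = make_conllu([(1, "въ", 2, "case"), (2, "начѧлѣ", 3, "nmod"),
                    (3, "бѣ", 0, "root"), (4, "слово", 2, "nsubj")])

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.2f}%  LAS {las:.2f}%")  # UAS 75.00%  LAS 50.00%
```

In the toy data, word 2 has the right head but the wrong label, so it counts toward UAS only; word 4 has the wrong head and counts toward neither.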

Evaluate the Model

python scripts/evaluate_parser.py \
  --model data/models/oldslavic_parser/model \
  --test data/training/test.conllu

Training Your Own Model

To retrain or train on your own data:

1. Data Preparation

# Prepare TOROT data (or your own CoNLL-U files)
python scripts/prepare_torot_data.py \
  --torot-dir /path/to/torot \
  --output-dir data/training
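
When preparing your own CoNLL-U files, the training step needs separate train, dev, and test files. What `prepare_torot_data.py` does internally is not described here, so the following is only a sketch of one common approach: shuffle whole sentences with a fixed seed and cut them 80/10/10. The function names and ratios are assumptions.

```python
import random
import re


def split_sentences(conllu_text):
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    blocks = re.split(r"\n\s*\n", conllu_text.strip())
    return [block for block in blocks if block.strip()]


def train_dev_test_split(sentences, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle sentences reproducibly and return (train, dev, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Ten toy one-word sentences stand in for a real treebank file
toy = "".join(
    f"# sent_id = {i}\n1\tслово\t_\t_\t_\t_\t0\troot\t_\t_\n\n" for i in range(10)
)
sentences = split_sentences(toy)
train, dev, test = train_dev_test_split(sentences)
print(len(train), len(dev), len(test))  # 8 1 1

# Each split can then be written back out as CoNLL-U text:
train_text = "\n\n".join(train) + "\n\n"
```

A sentence-level shuffle is the simplest option. For a treebank drawn from a small number of source texts, splitting by document instead may give a more honest test set, since sentences from the same text tend to resemble each other.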

2. Train DiaParser Model

python scripts/train_diaparser.py \
  --train data/training/train.conllu \
  --dev data/training/dev.conllu \
  --test data/training/test.conllu \
  --output data/models/oldslavic_parser

Training takes ~5-6 hours on CPU for 100 epochs.


Architecture

Neural Component

  • Tokenizer: Stanza
  • POS Tagger: BiLSTM-CRF (Stanza)
  • Lemmatizer: Sequence-to-sequence (Stanza)
  • Parser: Biaffine attention (DiaParser)

Symbolic Component (Prolog) - In Development

A Prolog-based validation layer is under development to detect and correct neural parser errors. The framework includes rules for:

  • Case agreement rules (7-case system)
  • Number agreement (singular/dual/plural)
  • Genitive of negation (objects → genitive with negated verbs)
  • Participle agreement (case/number/gender)
  • Dependency relation constraints

Status: rule definitions are complete (oldslavic_prolog_rules.py); integration with the parser pipeline is planned for v2.0.

This follows the proven neural-symbolic architecture of the Coptic dependency parser, which uses Janus Prolog for error detection and hallucination prevention.
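
To make the kind of check described above concrete, here is a small Python stand-in for an agreement rule. It is not the project's Prolog code from oldslavic_prolog_rules.py. It assumes UD-style FEATS strings (`Case=...|Number=...`) and checks only `amod` dependents against their heads. The function names, the choice of relation, and the hand-written feature values are all illustrative.

```python
AGREEMENT_FEATURES = ("Case", "Number", "Gender")


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Gen|Number=Dual' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def agreement_violations(words):
    """Return (dependent_id, feature, dependent_value, head_value) tuples.

    `words` is a list of dicts with the keys id, head, deprel, and feats.
    A feature is compared only when both dependent and head specify it.
    """
    by_id = {word["id"]: word for word in words}
    violations = []
    for word in words:
        if word["deprel"] != "amod":
            continue
        head = by_id.get(word["head"])
        if head is None:  # head is the root (0) or missing
            continue
        dep_feats = parse_feats(word["feats"])
        head_feats = parse_feats(head["feats"])
        for feat in AGREEMENT_FEATURES:
            if feat in dep_feats and feat in head_feats:
                if dep_feats[feat] != head_feats[feat]:
                    violations.append(
                        (word["id"], feat, dep_feats[feat], head_feats[feat])
                    )
    return violations


# Hand-written analysis with a deliberate Number mismatch (illustration only)
words = [
    {"id": 1, "form": "новъ", "head": 2, "deprel": "amod",
     "feats": "Case=Nom|Gender=Masc|Number=Dual"},
    {"id": 2, "form": "завѣтъ", "head": 0, "deprel": "root",
     "feats": "Case=Nom|Gender=Masc|Number=Sing"},
]
print(agreement_violations(words))  # [(1, 'Number', 'Dual', 'Sing')]
```

A validator built along these lines could flag the arc for review or re-ranking rather than silently rewriting it. How the planned v2.0 integration will act on violations is not specified here.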


Comparison with Original OldSlavNet

| Feature | Original (2021) | Modernized (2025) |
|---|---|---|
| Framework | DyNet | Stanza/PyTorch |
| Python | 3.4-3.9 | 3.9-3.12 |
| Installation | Complex compilation | Simple pip install |
| Grammar validation | None | Prolog rules (in development) |
| GUI | No | Planned |
| Maintenance | Archived | Active |

Project Status

Current (v1.0):

  • Core parser architecture
  • DiaParser model trained (86.47% UAS, 81.48% LAS on TOROT)
  • Command-line interface
  • Prolog rule framework (foundation laid)

Planned Development:

  • Neural-symbolic integration (connect DiaParser → Prolog validator)
    • Implement error detection for parser hallucinations
    • Automatic correction based on Old Slavic grammar rules
    • Following proven Coptic parser architecture
  • Extended Prolog rules (aspect, word order, clitics)
  • Stanza tokenizer/tagger models (optional enhancement)
  • Web demo (Hugging Face Spaces)
  • GUI application
  • Unified ancient language framework

Credits

Original OldSlavNet:

  • Nilo Pedrazzini (2020-2021)
  • Based on jPTDP architecture
  • Trained on TOROT treebank

Modernization:

  • Architecture adapted from Coptic dependency parser
  • Neural-symbolic integration inspired by Coptic parser's Prolog validation

Data:

  • TOROT - Tromsø Old Russian and Old Church Slavonic Treebank
  • Universal Dependencies framework

Contributing

Contributions welcome! Areas needing work:

  1. Model training - Train Stanza models on TOROT
  2. Prolog rules - Expand Old Slavic grammar coverage
  3. Testing - Validation on historical texts
  4. Documentation - Usage examples, tutorials
  5. GUI - Desktop application with visualization

License

CC BY-NC-SA 4.0 - See LICENSE file


Citation

If you use this parser, please cite:

Original OldSlavNet:

@inproceedings{pedrazzini2020oldslavnet,
  title={Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages},
  author={Pedrazzini, Nilo},
  booktitle={CHR 2020},
  year={2020}
}

Modernization (paper forthcoming)


Contact


Related Projects
