Lexicon Platform

Lexicon Platform is a reusable lexical content repository. It contains the canonical source material, language plugins, runtime pack builders, local lexicon storage, and the file-based distribution contract needed to ship language data into an application or service.

It powers Vokabell, a German vocabulary-learning app built with Flutter.

Why This Exists

Lexical knowledge for language learning is scattered and inconsistent: definitions vary by source, examples are uneven, CEFR levels are rarely assigned coherently, and the relations between words (synonyms, antonyms, sense families) are mostly missing. Assembling that live, per request, would be slow, costly, non-deterministic, and unusable offline, and a learner cannot tell a good entry from a bad one.

Lexicon solves this by consolidating the knowledge once into a single curated source of truth, then materialising it into static, versioned, downloadable packs. The payoff:

  • Deterministic and reviewable. A word always returns the same vetted entry. Quality is fixed once and only improves; the source is content-as-code, so every change is a reviewable diff and nothing ships unreviewed.
  • Leveled and interconnected, consistently. CEFR levels (A1→C2) and concept relations are global judgments made once and anchored to authoritative references, not re-derived each time, which is exactly where ad-hoc generation is least reliable.
  • Offline, instant, free at the point of use. No network dependency, no per-use cost; it works with no signal and scales to any number of users without scaling cost.
  • An owned asset, not a runtime dependency on an external service.

In short: pay the curation cost once, serve quality forever.

Engineering highlights

  • Content-as-code, deterministic packs. A single curated source is built into versioned runtime packs plus a file-based distribution contract. The same word always returns the same vetted entry; every change is a reviewable diff.
  • Capability-driven language plugins. Language behaviour (noun declension, separable-verb decomposition, ...) lives in per-language plugins that advertise capabilities. The core stays language-neutral, so adding a language is a plugin, not a core change.
  • Deterministic morphology + curated irregulars. German noun declension and separable-verb decomposition (auf|stehen) are rule-derived; irregulars come from curated overrides, never guessed. Form ids are stable and keyed on the grammatical slot, so surface edits never orphan a learner's progress.
  • Self-optimising model selection, nothing hardcoded. When several local models are installed, a dueling bandit learns from an LLM judge's per-field preferences and routes work to whichever model is best on this content, while continuing to explore across runs.
  • Human-gated generation. Nothing AI-drafted ships unreviewed: every drafted record is marked needs_review, a guardrail gate auto-promotes only the clean, corroborated ones, and the build excludes the rest.
  • Offline-first, owned asset. No network dependency, no per-use cost; packs ship into the app and work with no signal, scaling to any number of users.
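
The morphology bullet above can be pictured with a small Dart sketch. It is an illustration under stated assumptions, not code from the Lexicon packages: the prefix list is deliberately incomplete, and the names separablePrefixes, curatedNonSeparable, decompose, and formId, as well as the slot notation, are hypothetical.

```dart
// Hypothetical sketch; none of these names come from the Lexicon packages.

/// A small, deliberately incomplete list of separable German prefixes.
const separablePrefixes = ['auf', 'aus', 'ein', 'mit', 'vor', 'ab', 'an'];

/// Curated exceptions: lemmas that merely start with a prefix string and
/// must not be split. In this sketch they are listed by hand, never guessed.
const curatedNonSeparable = {'antworten', 'angeln'};

/// Returns "prefix|stem" for a separable verb, or the lemma unchanged.
String decompose(String lemma) {
  if (curatedNonSeparable.contains(lemma)) return lemma;
  for (final prefix in separablePrefixes) {
    // Leave at least three characters for the remaining stem.
    if (lemma.startsWith(prefix) && lemma.length > prefix.length + 2) {
      return '$prefix|${lemma.substring(prefix.length)}';
    }
  }
  return lemma;
}

/// Builds a form id from the lemma id and the grammatical slot only, so that
/// correcting the surface spelling of a form leaves its id unchanged.
String formId(String lemmaId, String slot) => '$lemmaId#$slot';
```

With these definitions, decompose('aufstehen') yields auf|stehen, while decompose('antworten') is left untouched because of the curated exception. Since formId never looks at the surface form, progress stored against an id such as aufstehen#prs.ind.3sg would survive a spelling fix.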

Authoring & quality

Contributors grow the source pack by proposing entries (see authoring/). To make that fast, the repo ships an optional local toolchain for anyone running a local language model. It can draft entries and fill gaps, and when several models are installed, a self-optimising selector (a dueling bandit that learns from a judge's per-field preferences) routes work to whichever model is actually best on this content, with no hardcoded model choices.
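
To make the idea of a dueling bandit concrete, here is a minimal Dart sketch of a selector that learns only from pairwise preferences. It is not the repository's implementation and the actual algorithm may differ: the class DuelingSelector, the smoothed win rate, the Borda-style score, and the epsilon exploration rate are all assumptions chosen for brevity. It expects at least one model, and at least two before nextDuel is called.

```dart
import 'dart:math';

/// Hypothetical sketch of a dueling-bandit selector; not the repository's code.
class DuelingSelector {
  DuelingSelector(this.models, {Random? random, this.epsilon = 0.2})
      : _random = random ?? Random();

  final List<String> models;
  final double epsilon; // probability of picking a random challenger
  final Random _random;

  // _wins[a][b] counts how often the judge preferred a over b.
  final Map<String, Map<String, int>> _wins = {};

  /// Records one judged comparison between two models.
  void record({required String winner, required String loser}) {
    final row = _wins.putIfAbsent(winner, () => {});
    row[loser] = (row[loser] ?? 0) + 1;
  }

  int _count(String a, String b) => _wins[a]?[b] ?? 0;

  /// Smoothed estimate that [a] beats [b]; 0.5 when nothing is known yet.
  double winRate(String a, String b) {
    final w = _count(a, b), l = _count(b, a);
    return (w + 1) / (w + l + 2);
  }

  /// Borda-style score: mean win rate against every other model.
  double score(String m) {
    final others = models.where((o) => o != m).toList();
    if (others.isEmpty) return 0.5;
    final total = others.map((o) => winRate(m, o)).reduce((x, y) => x + y);
    return total / others.length;
  }

  /// Current best model by score; the first listed model wins ties.
  String best() => models.reduce((a, b) => score(b) > score(a) ? b : a);

  /// Next pair to compare: the current best against either a random rival
  /// (exploration) or the strongest remaining rival (exploitation).
  List<String> nextDuel() {
    final champion = best();
    final rivals = models.where((m) => m != champion).toList();
    if (rivals.isEmpty) throw StateError('need at least two models');
    final explore = _random.nextDouble() < epsilon;
    final challenger = explore
        ? rivals[_random.nextInt(rivals.length)]
        : rivals.reduce((a, b) => score(b) > score(a) ? b : a);
    return [champion, challenger];
  }
}
```

One way to honour per-field preferences, again as an assumption, would be to keep one selector per field (definitions, synonyms, examples), call record after each judged pair, and route real work to best().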

Nothing generated ships automatically. Every drafted record is marked needs_review; a guardrail gate auto-promotes only the clean ones and holds the rest for a human, and the pack build excludes anything still under review. The reviewer's final check is the git diff. Details in authoring/README.md.
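
The review flow described above can be sketched in a few lines of Dart. The record shape is hypothetical: DraftRecord, its issues and corroborated fields, and the functions runGate and buildable are illustrative names, and the real gate criteria are the ones documented in authoring/README.md.

```dart
// Hypothetical record shape; the field names are assumptions for illustration.
enum ReviewStatus { needsReview, approved }

class DraftRecord {
  DraftRecord(this.id, {required this.issues, required this.corroborated});

  final String id;
  final List<String> issues; // guardrail findings; empty means clean
  final bool corroborated; // confirmed by an independent reference

  // Every generated draft starts under review.
  ReviewStatus status = ReviewStatus.needsReview;
}

/// Promotes only clean, corroborated drafts; all others wait for a human.
void runGate(Iterable<DraftRecord> drafts) {
  for (final draft in drafts) {
    if (draft.issues.isEmpty && draft.corroborated) {
      draft.status = ReviewStatus.approved;
    }
  }
}

/// The build step keeps only records that are no longer under review.
List<DraftRecord> buildable(Iterable<DraftRecord> records) =>
    records.where((r) => r.status == ReviewStatus.approved).toList();
```

In this sketch a draft with any guardrail finding, or without corroboration, stays at needsReview and is therefore filtered out by buildable until a reviewer approves it.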

The toolchain is driven by a single console, npm run lexicon, with an interactive menu and an unattended autopilot. How to use it and how to read a run (the definitions / synonyms / examples output for DE, IT, and EN) is documented in docs/CONSOLE.md.

What It Contains

  • canonical source packs
  • editorial templates and authoring tools
  • generated runtime packs
  • file-based distribution artifacts
  • Dart packages for contracts, parsing, storage, import, and optional language-plugin add-ons

Start Here

  • docs/README.md: documentation map and reading order
  • docs/guides/CONSUMER_GUIDE.md: integration model for applications and services
  • docs/guides/PACK_AUTHORING.md: source pack authoring workflow
  • docs/reference/LEXICON_FILE_CONTRACT_0_1_0.md: distribution contract
  • docs/reference/TOOLS.md: tool catalog and workflow roles
  • docs/reference/WORKFLOW_COMMANDS.md: canonical command reference
  • CHANGELOG.md: platform, tooling, package, and release history
  • docs/reference/CONTENT_CHANGELOG.md: concept-first lexical-content history
  • docs/guides/RELEASING.md: release checklist and verification flow
  • CONTRIBUTING.md: contribution guide

Repository Layout

  • docs/README.md
  • docs/guides/
  • docs/reference/
  • docs/policies/
  • packs/templates/
  • packs/lexicon_source/: current canonical source pack
  • packs/lexicon_*_{a1,a2,b1,b2}/
  • tools/pipeline/
  • tools/reports/
  • tools/maintenance/
  • tools/lib/
  • packages/lexicon_platform/
  • packages/lexicon_content/
  • packages/lexicon_content_db/
  • packages/lexicon_core/
  • packages/lexicon_german/: current German language-plugin package (optional add-on)
  • packages/lexicon_italian/: current Italian language-plugin package (optional add-on)

Public Package

Use lexicon_platform as the main dependency.

Git dependency example:

dependencies:
  lexicon_platform:
    git:
      url: https://github.com/massimomazzariol/Lexicon.git
      path: packages/lexicon_platform
      ref: v0.5.1

Local path example:

dependencies:
  lexicon_platform:
    path: ../Lexicon/packages/lexicon_platform

Granular exports are available from the umbrella package:

import 'package:lexicon_platform/core.dart';
import 'package:lexicon_platform/content.dart';
import 'package:lexicon_platform/content_db.dart';

Optional language-plugin imports should be added only when an app needs them. The current repository ships German and Italian as concrete add-ons:

dependencies:
  lexicon_platform:
    git:
      url: https://github.com/massimomazzariol/Lexicon.git
      path: packages/lexicon_platform
      ref: v0.5.1
  lexicon_german:
    git:
      url: https://github.com/massimomazzariol/Lexicon.git
      path: packages/lexicon_german
      ref: v0.5.1
  lexicon_italian:
    git:
      url: https://github.com/massimomazzariol/Lexicon.git
      path: packages/lexicon_italian
      ref: v0.5.1

import 'package:lexicon_german/lexicon_german.dart'; // optional add-on
import 'package:lexicon_italian/lexicon_italian.dart'; // optional add-on

These add-ons are not required for consumers that only need the generic contracts, runtime-pack parsing, and storage/import layers. lexicon_platform stays neutral and does not re-export language plugins.

Consumer-side language selection and pack resolution can now be expressed through one shared contract:

final consumerResolution = LexiconConsumerSelectionResolver.resolve(
  catalog: availablePackManifests,
  availablePlugins: const [GermanLexiconPlugin(), ItalianLexiconPlugin()],
  request: LexiconConsumerSelectionRequest(
    targetLanguage: 'de',
    baseLanguage: 'it',
    hintLanguage: 'en',
    level: 'A1',
  ),
);

final selectedPack = consumerResolution.selectedManifest;
final activePluginLanguages = consumerResolution.activePluginLanguages;
final missingPluginLanguages = consumerResolution.missingPluginLanguages;

Lower-level APIs such as LexiconStaticPluginRegistry and LexiconPackResolver still exist when an integration needs finer control.
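
The split between active and missing plugin languages, and the capability-driven design behind it, can be pictured with a stand-alone sketch. Everything below is hypothetical and is not the lexicon_platform API: SketchPlugin, SketchGermanPlugin, SketchResolution, resolvePlugins, supports, and the capability strings are invented for illustration.

```dart
// Illustration only: these types are hypothetical and are not the
// lexicon_platform API.
abstract class SketchPlugin {
  String get language; // e.g. 'de'
  Set<String> get capabilities; // e.g. {'noun_declension'}
}

class SketchGermanPlugin implements SketchPlugin {
  const SketchGermanPlugin();

  @override
  String get language => 'de';

  @override
  Set<String> get capabilities => const {'noun_declension', 'separable_verbs'};
}

class SketchResolution {
  SketchResolution(this.active, this.missing);

  final Set<String> active; // requested languages that have a plugin
  final Set<String> missing; // requested languages without one
}

/// Splits the requested languages by whether a plugin is available.
SketchResolution resolvePlugins(
  Set<String> requestedLanguages,
  List<SketchPlugin> available,
) {
  final installed = available.map((p) => p.language).toSet();
  return SketchResolution(
    requestedLanguages.intersection(installed),
    requestedLanguages.difference(installed),
  );
}

/// Core code asks for a capability instead of hardcoding per-language logic.
bool supports(
  List<SketchPlugin> available,
  String language,
  String capability,
) =>
    available.any(
      (p) => p.language == language && p.capabilities.contains(capability),
    );
```

Under these assumptions, requesting de and it with only the German sketch plugin installed reports de as active and it as missing, and the caller can fall back to language-neutral behaviour for the missing language.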

Development Commands

Use docs/reference/WORKFLOW_COMMANDS.md as the command source of truth.

Typical source-pack entrypoint:

pnpm node tools/pipeline/run_pack_pipeline.mjs \
  --pack-dir packs/lexicon_source \
  --with-forms

Detailed workflow:

  • docs/guides/PACK_AUTHORING.md
  • docs/reference/TOOLS.md

Licensing

Lexicon Platform uses a permissive split:

  • code in packages/, tools/, and local demo code is licensed under Apache-2.0
  • lexical content, packs, and documentation are licensed under CC BY 4.0 unless noted otherwise

This is intended to keep the repository open, reusable, and easy to extend, including in proprietary software, while preserving attribution to the source project.

See:

  • LICENSE
  • LICENSE-CONTENT.md
  • NOTICE
  • ATTRIBUTION.md

For the end-to-end integration flow, including local DB import, distribution, and hosting options, see docs/guides/CONSUMER_GUIDE.md.