feat(pipeline): Voikko-based kirjakieli filter #2
Merged
A morphological judge (typically libvoikko) is abstracted behind a trait so the pipeline stays pure and tests can drive a static mock without needing the Voikko C library installed. The filter drops tokens the lexicon accepts (those are kirjakieli, already covered by HeliBoard's main_fi) and keeps tokens it rejects (the puhekieli/slang signal). The allowlist overrides this: listed tokens are kept even when accepted, which covers established words used in new colloquial senses. Two unit tests with a StaticLexicon mock cover the basic drop/keep behaviour and the allowlist override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
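The drop/keep rule described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the names `Lexicon`, `StaticLexicon` and `apply_kirjakieli_filter` come from the description, while the `accepts` method, the `FrequencyTable` alias and all signatures are assumptions made for the example.

```rust
use std::collections::{BTreeMap, HashSet};

/// Morphological judge. Returns true when the lexicon accepts the token,
/// i.e. the token counts as standard kirjakieli.
/// (The method name `accepts` is an assumption for this sketch.)
pub trait Lexicon {
    fn accepts(&self, token: &str) -> bool;
}

/// Test double backed by a fixed word set, so no libvoikko is needed.
pub struct StaticLexicon {
    accepted: HashSet<String>,
}

impl StaticLexicon {
    pub fn new(words: &[&str]) -> Self {
        Self {
            accepted: words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl Lexicon for StaticLexicon {
    fn accepts(&self, token: &str) -> bool {
        self.accepted.contains(token)
    }
}

/// Hypothetical table shape: token -> corpus count.
pub type FrequencyTable = BTreeMap<String, u64>;

/// Keeps a token when it is allowlisted or when the lexicon rejects it;
/// drops everything else (accepted, non-allowlisted kirjakieli).
pub fn apply_kirjakieli_filter(
    table: FrequencyTable,
    lexicon: &dyn Lexicon,
    allowlist: &HashSet<String>,
) -> FrequencyTable {
    table
        .into_iter()
        .filter(|(token, _)| allowlist.contains(token) || !lexicon.accepts(token))
        .collect()
}
```

With a `StaticLexicon` that accepts "talo" and "siisti" and an allowlist containing only "siisti", a table holding "talo", "mä" and "siisti" would come out with "mä" (rejected by the lexicon) and "siisti" (accepted but allowlisted), while "talo" is dropped.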
Wraps voikko-rs 1.1 and exposes it through niinku_pipeline::Lexicon by mapping `SpellReturn::SpellOk` to acceptance. Internal errors and charset failures map to non-acceptance so the pipeline does not drop a word it couldn't analyse. A small pkg-config-based build.rs ensures the linker finds libvoikko on macOS Homebrew (/opt/homebrew/lib) and other non-default locations; falls back silently to the platform default search path when pkg-config or libvoikko.pc isn't present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
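The conservative mapping can be illustrated without linking libvoikko. The enum below is a local stand-in written for this sketch, not an import from voikko-rs: only `SpellOk` is named in the commit message, and the other variant names are assumptions chosen to match the "internal errors and charset failures" wording. The helper names `is_accepted` and `survives_filter` are likewise hypothetical.

```rust
/// Local stand-in for the spell-check outcomes described above.
/// Only `SpellOk` is confirmed by the commit message; the other
/// variant names are assumptions for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpellReturn {
    SpellOk,
    SpellFailed,
    InternalError,
    CharsetConversionFailed,
}

/// Only a clean `SpellOk` counts as acceptance by the lexicon.
pub fn is_accepted(result: SpellReturn) -> bool {
    matches!(result, SpellReturn::SpellOk)
}

/// The kirjakieli filter drops accepted tokens, so every outcome other
/// than `SpellOk` (including analysis errors) leaves the token in place.
pub fn survives_filter(result: SpellReturn) -> bool {
    !is_accepted(result)
}
```

Treating errors as non-acceptance means a failed analysis can only let an extra token through, never remove one, which matches the stated goal of not dropping a word the wrapper couldn't analyse.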
assemble now loads data/allowlist.txt alongside the denylist and applies the kirjakieli filter between min-count and scoring. --no-voikko skips the filter (useful when libvoikko isn't installed, or for debugging the unfiltered pipeline). --voikko-dict-path points to a non-default dictionary location. Local run against fi_50k OpenSubtitles: drops 42565 of 50000 tokens (85% kirjakieli), leaving 7435 puhekieli/slang/proper-noun entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
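The stage ordering (min-count, then the kirjakieli filter, then scoring) and the effect of `--no-voikko` can be sketched as below. Everything here is hypothetical scaffolding: the function names, the `Table` alias, the closure standing in for the lexicon, and the count-based "scoring" are assumptions for the example, and the allowlist is left out for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical table shape: token -> corpus count.
type Table = BTreeMap<String, u64>;

/// Stage 1: drop tokens seen fewer than `min_count` times.
fn apply_min_count(table: Table, min_count: u64) -> Table {
    table.into_iter().filter(|(_, n)| *n >= min_count).collect()
}

/// Stage 2: drop tokens the lexicon accepts. `accepts` is `None` when the
/// filter is skipped, which is how this sketch models `--no-voikko`.
fn filter_stage(table: Table, accepts: Option<&dyn Fn(&str) -> bool>) -> Table {
    match accepts {
        Some(accepts) => table
            .into_iter()
            .filter(|(token, _)| !accepts(token.as_str()))
            .collect(),
        None => table,
    }
}

/// Runs the stages in order. Scoring is a placeholder here: entries are
/// sorted by count (descending), with ties broken alphabetically.
fn assemble(
    table: Table,
    min_count: u64,
    accepts: Option<&dyn Fn(&str) -> bool>,
) -> Vec<(String, u64)> {
    let table = apply_min_count(table, min_count);
    let table = filter_stage(table, accepts);
    let mut scored: Vec<(String, u64)> = table.into_iter().collect();
    scored.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored
}
```

Running the filter before scoring means only the surviving tokens need to be scored; in the local run reported above that is 7435 entries instead of 50000.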
Required by niinku-sources after the VoikkoLexicon addition. The checkout-then-toolchain-then-cache ordering is preserved; libvoikko install lands right after checkout because it does not depend on Rust state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the brew/apt install hints, lists the new `just download` and `just generate` recipes, and updates the Status section to reflect that Voikko-based filtering and the OpenSubtitles + Urbaani adapters are now landed (only Mastodon/Suomi24 + the .combined header remain deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)

### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>
Summary
Implements the core discriminator from the README: frequent in colloquial corpus + rejected by libvoikko = puhekieli/slang.
- `Lexicon` trait in `niinku-pipeline` and a pure `apply_kirjakieli_filter(table, lexicon, allowlist)` function. Trait-typed so pipeline tests stay hermetic with a `StaticLexicon` mock.
- `VoikkoLexicon` wrapper in `niinku-sources` using `voikko-rs` 1.1. `SpellOk` → accepted; internal/charset errors → not accepted (conservative: never drop a word we couldn't analyse).
- `niinku-sources/build.rs` uses `pkg-config` to find libvoikko on macOS Homebrew (`/opt/homebrew/lib`) and other non-default locations; falls back silently when pkg-config / `libvoikko.pc` isn't present.
- `assemble` loads `data/allowlist.txt`, adds `--no-voikko` and `--voikko-dict-path` flags.
- CI installs `libvoikko-dev` + `voikko-fi` before cargo runs.

Local end-to-end run
Spot-check — the puhekieli vocabulary now lands cleanly:
Known noise at the head of the list: English proper nouns from translated subtitles (john, jack, michael, fbi). These are valid OOV tokens that Voikko correctly rejects, but they aren't useful as puhekieli. A future commit can either add common English names to the denylist or pre-filter by character-set heuristics; neither is in scope here.
Test plan
- `just lint` (fmt + clippy -D warnings): green
- `just test` (17 unit tests across pipeline + sources, including 2 new filter tests): green
- `just generate` end-to-end: produces 7435-entry filtered `.combined`
- `--no-voikko` flag bypasses the filter (back to original 50000 entries)

Remaining open items (not in this PR)
- `.combined` header + `.dict` compile via `dicttool_aosp.jar`

🤖 Generated with Claude Code