feat(pipeline): Voikko-based kirjakieli filter #2

Merged
laurigates merged 5 commits into main from feat/voikko-filter
May 27, 2026
@laurigates

Owner

Summary

Implements the core discriminator from the README: a token that is frequent in the colloquial corpus and rejected by libvoikko is treated as puhekieli/slang.

  • New Lexicon trait in niinku-pipeline and a pure apply_kirjakieli_filter(table, lexicon, allowlist) function. Trait-typed so pipeline tests stay hermetic with a StaticLexicon mock.
  • New VoikkoLexicon wrapper in niinku-sources using voikko-rs 1.1. SpellOk → accepted; internal/charset errors → not accepted. Because the filter drops only accepted words, this is the conservative choice: a word we couldn't analyse is never dropped.
  • niinku-sources/build.rs uses pkg-config to find libvoikko on macOS Homebrew (/opt/homebrew/lib) and other non-default locations; falls back silently to the platform default search path when pkg-config or libvoikko.pc isn't present.
  • CLI wires the filter between min-count and scoring, loads data/allowlist.txt, adds --no-voikko and --voikko-dict-path flags.
  • CI installs libvoikko-dev + voikko-fi before cargo runs.
  • README documents the new prerequisite.
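
The trait-plus-pure-function shape described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names Lexicon, StaticLexicon and apply_kirjakieli_filter(table, lexicon, allowlist) come from the summary, while the `accepts` method name, the `Vec<(String, u64)>` table type and the `HashSet<String>` allowlist type are assumptions made for the example.

```rust
use std::collections::HashSet;

/// A judge of standard written Finnish (kirjakieli).
/// The method name `accepts` is an assumption for this sketch.
pub trait Lexicon {
    fn accepts(&self, word: &str) -> bool;
}

/// Hermetic test double: accepts exactly the words it was built with,
/// so tests do not need the Voikko C library.
pub struct StaticLexicon(pub HashSet<String>);

impl Lexicon for StaticLexicon {
    fn accepts(&self, word: &str) -> bool {
        self.0.contains(word)
    }
}

/// Drops tokens the lexicon accepts (kirjakieli) and keeps tokens it
/// rejects (the puhekieli/slang signal). Allowlisted tokens are kept
/// even when the lexicon accepts them.
/// The (word, count) table representation is an assumption.
pub fn apply_kirjakieli_filter(
    table: Vec<(String, u64)>,
    lexicon: &dyn Lexicon,
    allowlist: &HashSet<String>,
) -> Vec<(String, u64)> {
    table
        .into_iter()
        .filter(|(word, _)| allowlist.contains(word) || !lexicon.accepts(word))
        .collect()
}
```

With a StaticLexicon that accepts "minä" and "siisti" and an allowlist containing "siisti" (an illustrative choice), a table of minä, mä and siisti would come out as mä and siisti: the first kept because it is rejected, the second because the allowlist overrides acceptance.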

Local end-to-end run

loading opensubtitles-fi from data/cached/opensubtitles-fi.txt
  50000 tokens
denylist: 0 entries
allowlist: 0 entries
merged: 50000 tokens
after denylist + min-count>=5: 50000 tokens
after voikko kirjakieli filter: 7435 tokens (dropped 42565)
wrote 7435 entries to data/out/niinku.combined

Spot-check — the puhekieli vocabulary now lands cleanly:

word=mä,f=219     word=sä,f=215     word=sun,f=207    word=mun,f=209
word=tää,f=201    word=oon,f=198    word=mua,f=200    word=sua,f=200
word=niinku,f=186 word=sit,f=182    word=ois,f=182    word=mee,f=174
word=onks,f=181   word=nää,f=181    word=noi,f=170    word=moro,f=169
word=emmä,f=165   word=just,f=194   word=kaa,f=181    word=tuu,f=182
word=törppö,f=172 word=sulta,f=170

Known noise at the head of the list: English proper nouns from translated subtitles (john, jack, michael, fbi). These are valid OOV tokens that Voikko correctly rejects, but they aren't useful as puhekieli. A future commit can either add common English names to the denylist or pre-filter by character-set heuristics; neither is in scope here.

Test plan

  • just lint (fmt + clippy -D warnings) — green
  • just test (17 unit tests across pipeline + sources, including 2 new filter tests) — green
  • just generate end-to-end — produces 7435-entry filtered .combined
  • Spot-check: puhekieli vocabulary present at expected freq band
  • --no-voikko flag bypasses the filter (back to original 50000 entries)

Remaining open items (not in this PR)

  • .combined header + .dict compile via dicttool_aosp.jar
  • Mastodon adapter
  • Suomi24 adapter
  • Filtering English proper-noun noise from subtitle corpora

🤖 Generated with Claude Code

laurigates and others added 5 commits May 27, 2026 15:20
A morphological judge (typically libvoikko) is abstracted behind a
trait so the pipeline stays pure and tests can drive a static mock
without needing the Voikko C library installed. The filter drops
tokens the lexicon accepts (those are kirjakieli, already covered by
HeliBoard's main_fi) and keeps tokens it rejects (the puhekieli/
slang signal). The allowlist overrides: listed tokens are kept even
when accepted, for established words used in colloquial new senses.

Two unit tests with a StaticLexicon mock cover the basic
drop/keep behaviour and the allowlist override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps voikko-rs 1.1 and exposes it through niinku_pipeline::Lexicon
by mapping `SpellReturn::SpellOk` to acceptance. Internal errors
and charset failures map to non-acceptance so the pipeline does not
drop a word it couldn't analyse.
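
The mapping described above might look roughly like this. To keep the sketch self-contained it uses a local stand-in enum, `SpellOutcome`, instead of the real voikko-rs type; the enum, its variant names and `is_accepted` are hypothetical names for illustration only.

```rust
/// Local stand-in for the spell-check result reported by the Voikko
/// binding. Hypothetical type; the real code maps
/// `SpellReturn::SpellOk` from voikko-rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpellOutcome {
    Ok,
    Failed,
    InternalError,
    CharsetConversionFailed,
}

/// Only a positive spell check counts as acceptance. Errors fall
/// through to "not accepted", and since the filter drops only accepted
/// words, a word that could not be analysed stays in the output.
pub fn is_accepted(outcome: SpellOutcome) -> bool {
    matches!(outcome, SpellOutcome::Ok)
}
```

Treating errors as rejection trades a little extra noise in the output for never silently losing a candidate word.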

A small pkg-config-based build.rs ensures the linker finds libvoikko
on macOS Homebrew (/opt/homebrew/lib) and other non-default
locations; falls back silently to the platform default search path
when pkg-config or libvoikko.pc isn't present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
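
A build script with the silent-fallback behaviour described in this commit could be sketched as below. It is an assumption-laden illustration rather than the PR's build.rs: it shells out to the `pkg-config` binary using only the standard library, and the helper names `link_search_dirs`, `probe_libvoikko` and `emit_link_search` are hypothetical. Splitting on whitespace would mishandle library paths that contain spaces.

```rust
use std::process::Command;

/// Extracts directories from the `-L<dir>` flags in pkg-config output.
fn link_search_dirs(pkg_config_stdout: &str) -> Vec<String> {
    pkg_config_stdout
        .split_whitespace()
        .filter_map(|flag| flag.strip_prefix("-L"))
        .filter(|dir| !dir.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Asks pkg-config where libvoikko lives. Returns an empty list when
/// pkg-config is missing, fails, or has no libvoikko.pc, so the linker
/// falls back to the platform default search path.
fn probe_libvoikko() -> Vec<String> {
    Command::new("pkg-config")
        .args(["--libs-only-L", "libvoikko"])
        .output()
        .ok()
        .filter(|out| out.status.success())
        .map(|out| link_search_dirs(&String::from_utf8_lossy(&out.stdout)))
        .unwrap_or_default()
}

/// Intended to be called from a build.rs `main`: emits one Cargo
/// link-search directive per discovered directory.
fn emit_link_search() {
    for dir in probe_libvoikko() {
        println!("cargo:rustc-link-search=native={dir}");
    }
}
```

On a Homebrew install, pkg-config output such as `-L/opt/homebrew/lib` would become a single `cargo:rustc-link-search=native=/opt/homebrew/lib` directive; on a machine without pkg-config nothing is printed and the build proceeds unchanged.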
assemble now loads data/allowlist.txt alongside the denylist and
applies the kirjakieli filter between min-count and scoring.
--no-voikko skips the filter (useful when libvoikko isn't installed,
or for debugging the unfiltered pipeline). --voikko-dict-path points
to a non-default dictionary location.

Local run against fi_50k OpenSubtitles: drops 42565 of 50000 tokens
(85% kirjakieli), leaving 7435 puhekieli/slang/proper-noun entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
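
The stage ordering in this commit (min-count, then the kirjakieli filter, then scoring) can be illustrated with the sketch below. Everything here is hypothetical: `assemble_stages` is not the CLI's real function, the lexicon is passed as an optional closure to keep the example self-contained, and sorting by descending count merely stands in for the scoring step, which this PR does not describe. Passing `None` corresponds to running with --no-voikko.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the assemble pipeline's stage ordering.
pub fn assemble_stages(
    table: Vec<(String, u64)>,
    min_count: u64,
    lexicon_accepts: Option<&dyn Fn(&str) -> bool>,
    allowlist: &HashSet<String>,
) -> Vec<(String, u64)> {
    // Stage 1: min-count threshold.
    let mut kept: Vec<(String, u64)> = table
        .into_iter()
        .filter(|(_, count)| *count >= min_count)
        .collect();

    // Stage 2: kirjakieli filter, skipped entirely when no lexicon is
    // supplied (the --no-voikko case).
    if let Some(accepts) = lexicon_accepts {
        kept.retain(|(word, _)| allowlist.contains(word) || !accepts(word.as_str()));
    }

    // Stage 3: placeholder for scoring (assumption: order by count).
    kept.sort_by(|a, b| b.1.cmp(&a.1));
    kept
}
```

Running the filter after min-count means the lexicon is only consulted for tokens that already passed the frequency threshold.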
Required by niinku-sources after the VoikkoLexicon addition. The
checkout-then-toolchain-then-cache ordering is preserved; libvoikko
install lands right after checkout because it does not depend on
Rust state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the brew/apt install hints, lists the new `just download` and
`just generate` recipes, and updates the Status section to reflect
that Voikko-based filtering and the OpenSubtitles + Urbaani adapters
are now landed (only Mastodon/Suomi24 + the .combined header remain
deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@laurigates added the enhancement (New feature or improvement) label May 27, 2026
@laurigates self-assigned this May 27, 2026
@laurigates merged commit ae29987 into main May 27, 2026
1 check passed
@laurigates deleted the feat/voikko-filter branch May 27, 2026 12:22
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.1...v0.1.0)
(2026-06-03)


### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles +
urbaani ([#1](#1))
([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test
([#8](#8))
([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool
([#3](#3))
([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter
([#2](#2))
([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion
([#4](#4))
([f773873](f773873))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>