feat(curated): expand puhekieli seed list (+138) and add dedup test #8
Merged
Grows the curated seed list from the MVP's 169 to 307 entries, focusing on the highest-coverage / lowest-ambiguity categories from issue #6 (item 1):
- more reduced -ks question forms (ostaaks, lähteeks, pitääks, …)
- second-person -tsä agglutinations (ootsä, näätsä, tiiätsä, …)
- -s clitic interrogatives (mikäs, missäs, koskas, …)
- conditional -skö / emphatic -pa clitic chains (tekiskö, kävisköhän, olispa, tossapa, …)
- interrogative -kohan/-köhän paradigm completion (montakohan, paljonkohan, kauankohan, kenetköhän): deliberate clitic chains force-kept past Voikko, matching the seeded kukakohan/koskakohan
- pronoun case-form completion (mul, sil, tos, niihi, teijät, …)
- olla / common-verb spoken passive conjugations (mentii, tehtii, ollaa, mentäis, …)
- numeral final-vowel droppers (yks, kaks, kol, nel, kuus, …)

Every candidate was screened with voikkospell: forms Voikko rejects are genuine puhekieli absent from main_fi; the four accepted clitic chains are kept on purpose (the allowlist force-keeps every curated token). Standard kirjakieli that Voikko accepts (montako, viis, mitäs, …) is left out to avoid duplicating main_fi.

Adds committed_seed_list_has_no_duplicates over data/curated-fi.txt, closing the silent-summing footgun (item 4): duplicate tokens sum without error, so the test asserts uniqueness and also enforces lowercase / well-formed lines.

Documents the weight tiers in the file header (item 5).
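The screening rule described above can be sketched as a three-way decision. This is only an illustration under stated assumptions: `Decision`, `screen`, and the `force_keep` set are hypothetical names, not code from this repository, and the `voikko_accepts` flag is assumed to come from a separate voikkospell run.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of screening one candidate token.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Voikko rejects the form: treated as puhekieli missing from main_fi.
    Add,
    /// Voikko accepts the form, but it is kept on purpose (e.g. a clitic chain).
    ForceKeep,
    /// Voikko accepts the form: standard kirjakieli, left out.
    LeaveOut,
}

/// Applies the rule from the description. `voikko_accepts` is supplied by the
/// caller (assumption: the result of checking the word with voikkospell).
fn screen(word: &str, voikko_accepts: bool, force_keep: &HashSet<&str>) -> Decision {
    if !voikko_accepts {
        Decision::Add
    } else if force_keep.contains(word) {
        Decision::ForceKeep
    } else {
        Decision::LeaveOut
    }
}
```

Keeping the Voikko result as a plain boolean keeps the rule itself independent of how voikkospell is invoked.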
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)

### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>
Implements the suggested first PR from #6: item 1 (systematic morphological expansion) + item 4's dedup test, plus the item 5 weight-tier documentation.
What changed
data/curated-fi.txt: 169 → 307 entries (+138)
Grouped under new `#` section headers, covering the highest-coverage / lowest-ambiguity categories:
- `-ks` question forms: `ostaaks`, `lähteeks`, `pitääks`, `auttaaks`, `käydääks`, …
- `-tsä` agglutinations: `ootsä`, `näätsä`, `tiiätsä`, `muistatsä`, …
- `-s` clitic interrogatives: `mikäs`, `missäs`, `mistäs`, `koskas`, `minnes`, …
- `-skö` / emphatic `-pa` clitic chains: `tekiskö`, `kävisköhän`, `viittiskö`, `olispa`, `tossapa`, …
- `-kohan`/`-köhän` paradigm completion: `montakohan`, `paljonkohan`, `kauankohan`, `kenetköhän`, `kenelköhän`
- Pronoun case-form completion: `mul`, `sul`, `sil`, `tos`, `tohon`, `tähä`, `niihi`, `teijät`, `meijät`, …
- olla / common-verb spoken passive conjugations: `ollaa`, `mentii`, `tehtii`, `nähtii`, `käytii`, `mentäis`, …
- Numeral final-vowel droppers: `yks`, `kaks`, `kol`, `nel`, `kuus`, `seittemä`, `kaheksa`, `yheksä`, `kymmene`

Every candidate was screened with `voikkospell` (the codified version of #5's manual check):
- Forms Voikko rejects are genuine puhekieli absent from `main_fi` → added.
- The four `-kohan`/`-köhän` chains Voikko accepts are kept on purpose: the allowlist force-keeps every curated token past the Voikko filter, exactly like the seeded `kukakohan` / `koskakohan` / `milloinkohan`.
- Standard kirjakieli Voikko accepts (`montako`, `viis`, `mitäs`, `tässäpä`, …) is left out to avoid duplicating `main_fi` (a project non-goal).

Homographs were avoided (e.g. dropped `must`, which collides with the loanword / `musta`).

crates/sources/src/curated.rs: dedup guard (closes the silent-summing footgun)
`read_curated` sums duplicate tokens with no error (see the existing `duplicate_words_sum` test), so a token accidentally listed twice would silently inflate its weight. The new `committed_seed_list_has_no_duplicates` test reads the real `data/curated-fi.txt` and asserts:
- no token is listed twice,
- every token is lowercase,
- every line is well-formed: `word [<weight>]`.

Verified it fails loudly on an injected duplicate, then passes once removed.
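For readers who want the shape of such a guard, here is a minimal sketch of the three checks. It is not the test added in this PR: `parse_line` and `check_seed_list` are hypothetical helpers, and treating the weight as an unsigned integer and `#` lines as section headers are assumptions drawn from the description above.

```rust
use std::collections::HashSet;

/// Hypothetical parser for one `word [<weight>]` line.
/// Assumption: the optional weight is an unsigned integer.
fn parse_line(line: &str) -> Result<(String, Option<u32>), String> {
    let mut parts = line.split_whitespace();
    let word = parts.next().ok_or_else(|| "empty line".to_string())?;
    if word != word.to_lowercase() {
        return Err(format!("not lowercase: {word}"));
    }
    let weight = match parts.next() {
        Some(raw) => Some(
            raw.parse::<u32>()
                .map_err(|e| format!("bad weight {raw:?}: {e}"))?,
        ),
        None => None,
    };
    if parts.next().is_some() {
        return Err(format!("trailing fields: {line}"));
    }
    Ok((word.to_string(), weight))
}

/// Returns the number of unique entries, or the first problem found.
fn check_seed_list(text: &str) -> Result<usize, String> {
    let mut seen = HashSet::new();
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        // Assumption: blank lines and `#` section headers carry no entry.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let (word, _weight) =
            parse_line(line).map_err(|e| format!("line {}: {e}", idx + 1))?;
        // `insert` returns false when the token was already present.
        if !seen.insert(word.clone()) {
            return Err(format!("line {}: duplicate token {word}", idx + 1));
        }
    }
    Ok(seen.len())
}
```

A test in this style would pass the contents of `data/curated-fi.txt` to `check_seed_list` and assert the result is `Ok`, so an accidental second listing fails the build instead of silently doubling a weight.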
Weight tiers documented (item 5)
Added a tier reference to the file header (function ≈5000, very common ≈2500–3500, common ≈1500–2000, rare ≈1000–1200) so the long tail doesn't crowd suggestions as the list grows.
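As an illustration of the tier reference, the ranges can be written down as a lookup. This is a sketch only: the `Tier` enum and `tier_for` function are hypothetical and do not exist in the repository, and treating "≈5000" as exactly 5000 is a simplifying assumption.

```rust
/// Hypothetical tier labels mirroring the ranges quoted in the file header.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Tier {
    Function,
    VeryCommon,
    Common,
    Rare,
}

/// Maps a weight to its documented tier, or None if it falls outside
/// the quoted ranges. Assumption: "≈5000" is matched as exactly 5000.
fn tier_for(weight: u32) -> Option<Tier> {
    match weight {
        5000 => Some(Tier::Function),
        2500..=3500 => Some(Tier::VeryCommon),
        1500..=2000 => Some(Tier::Common),
        1000..=1200 => Some(Tier::Rare),
        _ => None,
    }
}
```

Returning `None` for weights between tiers makes off-tier entries easy to spot during review.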
Validation
- `cargo fmt --all --check` ✓
- `cargo clippy --workspace --all-targets -- -D warnings` ✓
- `cargo test --workspace` ✓ (25 tests; dedup guard included)

Not in this PR (tracked in #6)
Dialect batches (item 2), IM-slang/Urbaani coordination (item 3), the Voikko-acceptance lint recipe and coverage-diff tooling (rest of item 4), and the paradigm generator.
Closes part of #6.
Generated by Claude Code