Add curated puhekieli seed source to fill dictionary gaps #5
Merged
The dictionary previously came only from fetched corpora (OpenSubtitles, Urbaani, Mastodon), none committed, and the allowlist could only *keep* tokens already present in a corpus: there was no way to guarantee a specific spoken form landed in the output. Common puhekieli forms like oliks, saaks, and miksköhän were therefore frequently missing.

Add a committed, hand-vetted curated seed list (data/curated-fi.txt) as a new always-on Source. It covers the systematic gaps: pronoun/verb contractions, reduced -ks question forms, and clitic chains. assemble folds every curated token into the allowlist so hand-vetted forms survive the Voikko kirjakieli filter even when libvoikko accepts the form (e.g. clitic chains like miksiköhän/saakohan that main_fi's frequency list omits). Plain standard conjugations already in main_fi are deliberately excluded to avoid duplicating the kirjakieli dictionary.

- crates/sources/src/curated.rs: Curated source, 'word [weight]' format
- crates/cli: wire in as default-on source (--no-curated), fold tokens into allowlist
- data/curated-fi.txt: 169 vetted puhekieli forms, weight-tiered
- README: document the new source
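A minimal sketch of how the `word [weight]` format described above could be parsed. It is not the code in `crates/sources/src/curated.rs`: the function name `parse_curated`, the `DEFAULT_WEIGHT` value, and the error strings are assumptions made for illustration. The `#` header lines and the summing of duplicate words follow what the follow-up commit message says about the file and about `read_curated`.

```rust
use std::collections::HashMap;

/// Hypothetical weight for lines that omit one (an assumption, not from the PR).
const DEFAULT_WEIGHT: u64 = 1000;

/// Parse `word` or `word <weight>` lines into a word -> weight map.
/// Blank lines and lines starting with `#` are skipped.
/// Duplicate words have their weights summed.
fn parse_curated(text: &str) -> Result<HashMap<String, u64>, String> {
    let mut out = HashMap::new();
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut parts = line.split_whitespace();
        // A non-empty trimmed line always has at least one field.
        let word = parts.next().expect("non-empty line has a first field");
        let weight = match parts.next() {
            Some(w) => w
                .parse::<u64>()
                .map_err(|e| format!("line {}: bad weight {:?}: {}", idx + 1, w, e))?,
            None => DEFAULT_WEIGHT,
        };
        if parts.next().is_some() {
            return Err(format!("line {}: too many fields", idx + 1));
        }
        *out.entry(word.to_string()).or_insert(0) += weight;
    }
    Ok(out)
}
```

Under these assumptions, a file containing `oliks` once without a weight and once as `oliks 1500` yields a single entry whose weight is the sum of the two.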
This was referenced May 30, 2026
laurigates added a commit that referenced this pull request May 30, 2026
Implements the **suggested first PR** from #6: item **1** (systematic morphological expansion) + item **4**'s dedup test, plus the item **5** weight-tier documentation.

## What changed

### `data/curated-fi.txt`: 169 → 307 entries (+138)

Grouped under new `#` section headers, covering the highest-coverage / lowest-ambiguity categories:

- **More reduced `-ks` question forms**: `ostaaks`, `lähteeks`, `pitääks`, `auttaaks`, `käydääks`, …
- **Second-person `-tsä` agglutinations**: `ootsä`, `näätsä`, `tiiätsä`, `muistatsä`, …
- **`-s` clitic interrogatives**: `mikäs`, `missäs`, `mistäs`, `koskas`, `minnes`, …
- **Conditional `-skö` / emphatic `-pa` clitic chains**: `tekiskö`, `kävisköhän`, `viittiskö`, `olispa`, `tossapa`, …
- **Interrogative `-kohan`/`-köhän` paradigm completion**: `montakohan`, `paljonkohan`, `kauankohan`, `kenetköhän`, `kenelköhän`
- **Pronoun case-form completion**: `mul`, `sul`, `sil`, `tos`, `tohon`, `tähä`, `niihi`, `teijät`, `meijät`, …
- **olla / common-verb spoken passive conjugations**: `ollaa`, `mentii`, `tehtii`, `nähtii`, `käytii`, `mentäis`, …
- **Numeral final-vowel droppers**: `yks`, `kaks`, `kol`, `nel`, `kuus`, `seittemä`, `kaheksa`, `yheksä`, `kymmene`

**Every candidate was screened with `voikkospell`** (the codified version of #5's manual check):

- Forms Voikko **rejects** are genuine puhekieli absent from `main_fi` → added.
- The four `-kohan`/`-köhän` chains Voikko **accepts** are kept **on purpose**: the allowlist force-keeps every curated token past the Voikko filter, exactly like the seeded `kukakohan`/`koskakohan`/`milloinkohan`.
- Standard kirjakieli Voikko accepts (`montako`, `viis`, `mitäs`, `tässäpä`, …) is **left out** to avoid duplicating `main_fi` (a project non-goal).

Homographs were avoided (e.g. dropped `must`, which collides with the loanword / `musta`).

### `crates/sources/src/curated.rs`: dedup guard (closes the silent-summing footgun)

`read_curated` **sums** duplicate tokens with no error (see the existing `duplicate_words_sum` test), so a token accidentally listed twice would silently inflate its weight. The new `committed_seed_list_has_no_duplicates` test reads the real `data/curated-fi.txt` and asserts:

- no duplicate tokens (reports the offending lines),
- all entries lowercase,
- every line is a well-formed `word [<weight>]`.

Verified it fails loudly on an injected duplicate, then passes once removed.

### Weight tiers documented (item 5)

Added a tier reference to the file header (function ≈5000, very common ≈2500–3500, common ≈1500–2000, rare ≈1000–1200) so the long tail doesn't crowd suggestions as the list grows.

## Validation

- `cargo fmt --all --check` ✓
- `cargo clippy --workspace --all-targets -- -D warnings` ✓
- `cargo test --workspace` ✓ (25 tests; dedup guard included)

## Not in this PR (tracked in #6)

Dialect batches (item 2), IM-slang/Urbaani coordination (item 3), the Voikko-acceptance *lint recipe* and coverage-diff tooling (rest of item 4), and the paradigm generator.

Closes part of #6.

---

_Generated by [Claude Code](https://claude.ai/code/session_01GiGLDXQH93CdAZLvAUy3HE)_

Co-authored-by: Claude <noreply@anthropic.com>
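The three assertions of the dedup guard can be sketched as a standalone lint. This is an illustration only, not the `committed_seed_list_has_no_duplicates` test itself: the function name `lint_seed_list` and the message wording are hypothetical, and treating `#` lines as comments is taken from the section-header description in the commit message.

```rust
use std::collections::HashMap;

/// Return one message per problem found in a curated seed list:
/// malformed lines, non-lowercase tokens, and duplicate tokens.
/// (Hypothetical helper; the real check lives in the crate's test module.)
fn lint_seed_list(text: &str) -> Vec<String> {
    // Maps each token to the line number where it first appeared.
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    let mut problems = Vec::new();
    for (idx, raw) in text.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        // Well-formed means `word` or `word <weight>` with a numeric weight.
        let well_formed = match fields.as_slice() {
            [_word] => true,
            [_word, weight] => weight.parse::<u64>().is_ok(),
            _ => false,
        };
        if !well_formed {
            problems.push(format!("line {}: malformed entry {:?}", line_no, line));
            continue;
        }
        let word = fields[0];
        if word.chars().any(char::is_uppercase) {
            problems.push(format!("line {}: {:?} is not lowercase", line_no, word));
        }
        match first_seen.get(word).copied() {
            Some(prev) => problems.push(format!(
                "line {}: {:?} duplicates line {}",
                line_no, word, prev
            )),
            None => {
                first_seen.insert(word, line_no);
            }
        }
    }
    problems
}
```

A test built on this sketch would read the committed file and assert that the returned list is empty, printing the messages on failure so the offending lines are visible.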
## Problem

The dictionary is missing a lot of common spoken Finnish, e.g. `oliks` (vs `oliko`), `saaks` (vs `saakohan`), `miksköhän` (vs `miksiköhän`).

Root cause: the dictionary was built only from fetched corpora (OpenSubtitles, Urbaani, Mastodon), none of which are committed, and the `allowlist` can only keep tokens that already appear in a corpus; it can't inject a word. So there was no deterministic, reproducible way to guarantee a specific spoken form lands in the output. A form showed up only if some fetched corpus happened to contain it often enough to clear the min-count floor.

The examples also reveal the systematic gap: puhekieli grammar, namely reduced `-ks` question forms, clitic chains, and pronoun/verb contractions.

## Change

Adds a committed, hand-vetted curated seed source (the deterministic gap-filler that was missing), following the existing `UrbaaniSanakirja` adapter pattern, in line with the project goal that "adding a source should be a small, local change."

- `crates/sources/src/curated.rs`: new `Curated` source reading `data/curated-fi.txt`. Format is `word` or `word <weight>`; weights let high-frequency function words (`mä`, `oon`) outrank rarer forms after log-scoring. Fully unit-tested.
- `crates/cli`: wired into `assemble` as a default-on source (disable with `--no-curated`). Unlike the corpora, it's committed, so it's always present. Every curated token is folded into the allowlist, so hand-vetted forms survive the Voikko kirjakieli filter even when libvoikko accepts the form; this is what lets useful clitic chains like `miksiköhän`/`saakohan` (morphologically valid but absent from `main_fi`'s frequency list) through.
- `data/curated-fi.txt`: 169 vetted forms, weight-tiered across pronouns, `olla`/verb contractions, reduced `-ks` questions, clitic-chain pondering forms, fillers, and a few slang greetings. Plain standard conjugations already in `main_fi` (`voin`, `teen`, `sanon`, `montako`, …) are deliberately excluded to avoid duplicating the kirjakieli dictionary.
- `README.md`: documents the new source.

## Verification

- `cargo fmt --check`, `cargo clippy -D warnings`, and `cargo test --workspace` all pass (36 tests, 6 new).
- `assemble` with only the curated source, Voikko enabled: all 169 tokens survive the filter, and every example form lands: `oliks`, `saaks`, `saakohan`, `miksköhän`, `miksiköhän`, `onks`, `eiks`, …

## Notes

`oliko` itself is intentionally not added: it's standard kirjakieli already shipped by `main_fi`. We add the spoken variant `oliks`.

https://claude.ai/code/session_018xYxT7WXd64cm6C2wSbkFq

Generated by Claude Code
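The interaction between the curated source, the allowlist, and the Voikko filter can be sketched as follows. This is a hedged illustration, not the code in `crates/cli`: the names `fold_curated` and `filter_tokens` are hypothetical, the `is_kirjakieli` closure stands in for the real libvoikko lookup, and the rule "drop a token Voikko accepts unless it is allowlisted" is an assumption inferred from the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Fold every curated token into the allowlist so it is force-kept.
/// (Hypothetical helper name.)
fn fold_curated(allowlist: &mut HashSet<String>, curated: &HashMap<String, u64>) {
    allowlist.extend(curated.keys().cloned());
}

/// Keep a token if it is allowlisted, or if the kirjakieli check rejects it.
/// `is_kirjakieli` is a stand-in for the libvoikko lookup (an assumption).
fn filter_tokens<F>(
    counts: HashMap<String, u64>,
    allowlist: &HashSet<String>,
    is_kirjakieli: F,
) -> HashMap<String, u64>
where
    F: Fn(&str) -> bool,
{
    counts
        .into_iter()
        .filter(|(word, _)| allowlist.contains(word) || !is_kirjakieli(word.as_str()))
        .collect()
}
```

With this shape, a curated clitic chain such as `saakohan` is kept even though the stand-in check accepts it, while an uncurated standard form such as `oliko` is filtered out.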