Add curated puhekieli seed source to fill dictionary gaps by laurigates · Pull Request #5 · laurigates/niinku

laurigates · 2026-05-30T05:47:34Z

Problem

The dictionary is missing a lot of common spoken Finnish — e.g. oliks (vs oliko), saaks (vs saakohan), miksköhän (vs miksiköhän).

Root cause: the dictionary was built only from fetched corpora (OpenSubtitles, Urbaani, Mastodon), none of which are committed, and the allowlist can only keep tokens that already appear in a corpus — it can't inject a word. So there was no deterministic, reproducible way to guarantee a specific spoken form lands in the output. A form showed up only if some fetched corpus happened to contain it often enough to clear the min-count floor.

The examples also reveal the systematic gap: puhekieli grammar — reduced -ks question forms, clitic chains, and pronoun/verb contractions.

Change

Adds a committed, hand-vetted curated seed source — the deterministic gap-filler that was missing — following the existing UrbaaniSanakirja adapter pattern, in line with the project goal that "adding a source should be a small, local change."

crates/sources/src/curated.rs — new Curated source reading data/curated-fi.txt. Format is word or word <weight>; weights let high-frequency function words (mä, oon) outrank rarer forms after log-scoring. Fully unit-tested.
crates/cli — wired into assemble as a default-on source (disable with --no-curated). Unlike the corpora, it's committed, so it's always present. Every curated token is folded into the allowlist, so hand-vetted forms survive the Voikko kirjakieli filter even when libvoikko accepts the form — this is what lets useful clitic chains like miksiköhän / saakohan (morphologically valid but absent from main_fi's frequency list) through.
data/curated-fi.txt — 169 vetted forms, weight-tiered across pronouns, olla/verb contractions, reduced -ks questions, clitic-chain pondering forms, fillers, and a few slang greetings. Plain standard conjugations already in main_fi (voin, teen, sanon, montako, …) are deliberately excluded to avoid duplicating the kirjakieli dictionary.
README.md — documents the new source.
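The `word` / `word <weight>` format described above can be sketched roughly as follows. This is a minimal illustration, not the code in crates/sources/src/curated.rs: the function name `parse_curated`, the `DEFAULT_WEIGHT` value, and the skipping of `#` comment lines are all assumptions. Summing repeated tokens mirrors the behaviour this page attributes to `read_curated`.

```rust
use std::collections::HashMap;

/// Hypothetical weight for entries that omit one (an assumption, not from the PR).
const DEFAULT_WEIGHT: u64 = 1;

/// Parse `word` or `word <weight>` lines into a token -> weight map.
/// Blank lines and `#` comment lines are skipped; repeated tokens are summed.
fn parse_curated(text: &str) -> Result<HashMap<String, u64>, String> {
    let mut weights: HashMap<String, u64> = HashMap::new();
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut parts = line.split_whitespace();
        let word = parts.next().expect("non-empty line has a first field");
        let weight = match parts.next() {
            Some(w) => w
                .parse::<u64>()
                .map_err(|e| format!("line {}: bad weight {:?}: {}", idx + 1, w, e))?,
            None => DEFAULT_WEIGHT,
        };
        // Anything after the optional weight makes the line malformed.
        if parts.next().is_some() {
            return Err(format!("line {}: expected `word [weight]`", idx + 1));
        }
        *weights.entry(word.to_string()).or_insert(0) += weight;
    }
    Ok(weights)
}

fn main() {
    let seed = "# pronouns\nmä 5000\noon 5000\noliks\n";
    let parsed = parse_curated(seed).expect("seed parses");
    println!("{} tokens parsed", parsed.len());
}
```

Returning a `Result` with the line number keeps a malformed seed line from being silently dropped, which matters for a hand-edited file.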

Verification

cargo fmt --check, cargo clippy -D warnings, and cargo test --workspace all pass (36 tests, 6 new).
End-to-end smoke test (assemble with only the curated source, Voikko enabled): all 169 tokens survive the filter, and every example form lands — oliks, saaks, saakohan, miksköhän, miksiköhän, onks, eiks, …
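The filter behaviour that this smoke test exercises can be sketched as below. It is an illustration under stated assumptions: `filter_kirjakieli` is a hypothetical name, and the speller is stubbed with a closure, whereas the real pipeline would ask libvoikko. The point is only the rule itself: a token is dropped when the speller accepts it as standard kirjakieli, unless the allowlist force-keeps it.

```rust
use std::collections::HashSet;

/// Keep a token if it is allowlisted, or if the speller does NOT accept it
/// (that is, it is not standard kirjakieli already covered elsewhere).
fn filter_kirjakieli<'a, F>(
    tokens: &[&'a str],
    allowlist: &HashSet<String>,
    is_standard: F,
) -> Vec<&'a str>
where
    F: Fn(&str) -> bool,
{
    tokens
        .iter()
        .copied()
        .filter(|t| allowlist.contains(*t) || !is_standard(*t))
        .collect()
}

fn main() {
    let tokens = ["oliko", "oliks", "saakohan"];
    // Curated tokens are folded into the allowlist.
    let allowlist: HashSet<String> =
        ["oliks", "saakohan"].iter().map(|s| s.to_string()).collect();
    // Stub speller (assumption): pretends these two forms are accepted.
    let is_standard = |t: &str| matches!(t, "oliko" | "saakohan");

    let kept = filter_kirjakieli(&tokens, &allowlist, is_standard);
    println!("{:?}", kept); // prints ["oliks", "saakohan"]
}
```

Here `saakohan` survives even though the stub speller accepts it, because it is allowlisted, while `oliko` is dropped as plain kirjakieli.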

Notes

oliko itself is intentionally not added — it's standard kirjakieli already shipped by main_fi. We add the spoken variant oliks.
The list is meant to grow via PR; the mechanism is the durable part.

https://claude.ai/code/session_018xYxT7WXd64cm6C2wSbkFq

Generated by Claude Code

The dictionary previously came only from fetched corpora (OpenSubtitles, Urbaani, Mastodon), none committed, and the allowlist could only *keep* tokens already present in a corpus, so there was no way to guarantee a specific spoken form landed in the output. Common puhekieli forms like oliks, saaks, and miksköhän were therefore frequently missing.

Add a committed, hand-vetted curated seed list (data/curated-fi.txt) as a new always-on Source. It covers the systematic gaps: pronoun/verb contractions, reduced -ks question forms, and clitic chains.

assemble folds every curated token into the allowlist so hand-vetted forms survive the Voikko kirjakieli filter even when libvoikko accepts the form (e.g. clitic chains like miksiköhän/saakohan that main_fi's frequency list omits). Plain standard conjugations already in main_fi are deliberately excluded to avoid duplicating the kirjakieli dictionary.

- crates/sources/src/curated.rs: Curated source, 'word [weight]' format
- crates/cli: wire in as default-on source (--no-curated), fold tokens into allowlist
- data/curated-fi.txt: 169 vetted puhekieli forms, weight-tiered
- README: document the new source

Implements the **suggested first PR** from #6: item **1** (systematic morphological expansion) + item **4**'s dedup test, plus the item **5** weight-tier documentation.

## What changed

### `data/curated-fi.txt`: 169 → 307 entries (+138)

Grouped under new `#` section headers, covering the highest-coverage / lowest-ambiguity categories:

- **More reduced `-ks` question forms**: `ostaaks`, `lähteeks`, `pitääks`, `auttaaks`, `käydääks`, …
- **Second-person `-tsä` agglutinations**: `ootsä`, `näätsä`, `tiiätsä`, `muistatsä`, …
- **`-s` clitic interrogatives**: `mikäs`, `missäs`, `mistäs`, `koskas`, `minnes`, …
- **Conditional `-skö` / emphatic `-pa` clitic chains**: `tekiskö`, `kävisköhän`, `viittiskö`, `olispa`, `tossapa`, …
- **Interrogative `-kohan`/`-köhän` paradigm completion**: `montakohan`, `paljonkohan`, `kauankohan`, `kenetköhän`, `kenelköhän`
- **Pronoun case-form completion**: `mul`, `sul`, `sil`, `tos`, `tohon`, `tähä`, `niihi`, `teijät`, `meijät`, …
- **olla / common-verb spoken passive conjugations**: `ollaa`, `mentii`, `tehtii`, `nähtii`, `käytii`, `mentäis`, …
- **Numeral final-vowel droppers**: `yks`, `kaks`, `kol`, `nel`, `kuus`, `seittemä`, `kaheksa`, `yheksä`, `kymmene`

**Every candidate was screened with `voikkospell`** (the codified version of #5's manual check):

- Forms Voikko **rejects** are genuine puhekieli absent from `main_fi` → added.
- The four `-kohan`/`-köhän` chains Voikko **accepts** are kept **on purpose**: the allowlist force-keeps every curated token past the Voikko filter, exactly like the seeded `kukakohan`/`koskakohan`/`milloinkohan`.
- Standard kirjakieli Voikko accepts (`montako`, `viis`, `mitäs`, `tässäpä`, …) is **left out** to avoid duplicating `main_fi` (a project non-goal).

Homographs were avoided (e.g. dropped `must`, which collides with the loanword / `musta`).

### `crates/sources/src/curated.rs`: dedup guard (closes the silent-summing footgun)

`read_curated` **sums** duplicate tokens with no error (see the existing `duplicate_words_sum` test), so a token accidentally listed twice would silently inflate its weight. The new `committed_seed_list_has_no_duplicates` test reads the real `data/curated-fi.txt` and asserts:

- no duplicate tokens (reports the offending lines),
- all entries lowercase,
- every line is a well-formed `word [<weight>]`.

Verified it fails loudly on an injected duplicate, then passes once removed.

### Weight tiers documented (item 5)

Added a tier reference to the file header (function ≈5000, very common ≈2500–3500, common ≈1500–2000, rare ≈1000–1200) so the long tail doesn't crowd suggestions as the list grows.

## Validation

- `cargo fmt --all --check` ✓
- `cargo clippy --workspace --all-targets -- -D warnings` ✓
- `cargo test --workspace` ✓ (25 tests; dedup guard included)

## Not in this PR (tracked in #6)

Dialect batches (item 2), IM-slang/Urbaani coordination (item 3), the Voikko-acceptance *lint recipe* and coverage-diff tooling (rest of item 4), and the paradigm generator.

Closes part of #6.

---

_Generated by [Claude Code](https://claude.ai/code/session_01GiGLDXQH93CdAZLvAUy3HE)_

Co-authored-by: Claude <noreply@anthropic.com>
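The three checks listed for the dedup guard in the #8 description above (no duplicates, all lowercase, well-formed `word [<weight>]`) could look roughly like the sketch below. It is not the actual `committed_seed_list_has_no_duplicates` test: `lint_seed` is a hypothetical helper that works on an in-memory string rather than reading `data/curated-fi.txt`, and treating `#` lines as section headers to skip is an assumption.

```rust
use std::collections::HashMap;

/// Return one message per problem: malformed line, non-lowercase token,
/// or a token that already appeared on an earlier line.
fn lint_seed(text: &str) -> Vec<String> {
    let mut last_seen: HashMap<&str, usize> = HashMap::new();
    let mut problems = Vec::new();
    for (idx, raw) in text.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // blank lines and `#` section headers carry no token
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        let well_formed = match fields.as_slice() {
            [_word] => true,
            [_word, weight] => weight.parse::<u64>().is_ok(),
            _ => false,
        };
        if !well_formed {
            problems.push(format!("line {}: not `word [<weight>]`", line_no));
            continue;
        }
        let word = fields[0];
        if word.to_lowercase() != word {
            problems.push(format!("line {}: not lowercase: {}", line_no, word));
        }
        // `insert` returns the previous line number if the token was seen before.
        if let Some(prev) = last_seen.insert(word, line_no) {
            problems.push(format!(
                "line {}: duplicate of line {}: {}",
                line_no, prev, word
            ));
        }
    }
    problems
}

fn main() {
    let seed = "# pronouns\nmä 5000\noliks\nmä 10\n";
    for problem in lint_seed(seed) {
        println!("{}", problem);
    }
}
```

Collecting every problem instead of stopping at the first one lets a single test run report all offending lines, which is how the description says the real test behaves for duplicates.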

laurigates marked this pull request as ready for review May 30, 2026 06:34

laurigates merged commit cc853f0 into main May 30, 2026
1 check passed

laurigates deleted the claude/finnish-dictionary-gaps-wD9Zl branch May 30, 2026 06:52

This was referenced May 30, 2026

Expand the curated puhekieli seed list (data/curated-fi.txt) #6

Open

feat(curated): expand puhekieli seed list (+138) and add dedup test #8

Merged

laurigates merged 1 commit into main from claude/finnish-dictionary-gaps-wD9Zl
