Add curated puhekieli seed source to fill dictionary gaps#5

Merged: laurigates merged 1 commit into main from claude/finnish-dictionary-gaps-wD9Zl on May 30, 2026

@laurigates

Owner

Problem

The dictionary is missing a lot of common spoken Finnish — e.g. oliks (vs oliko), saaks (vs saakohan), miksköhän (vs miksiköhän).

Root cause: the dictionary was built only from fetched corpora (OpenSubtitles, Urbaani, Mastodon), none of which are committed, and the allowlist can only keep tokens that already appear in a corpus — it can't inject a word. So there was no deterministic, reproducible way to guarantee a specific spoken form lands in the output. A form showed up only if some fetched corpus happened to contain it often enough to clear the min-count floor.

The examples also reveal the systematic gap: puhekieli grammar — reduced -ks question forms, clitic chains, and pronoun/verb contractions.
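
The root cause can be illustrated with a minimal sketch of a corpus-driven selection step. This is not the project's actual code: `select_tokens`, its parameters, and the `is_kirjakieli` callback (standing in for the Voikko check) are hypothetical names chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the selection step described above.
/// `counts` holds token frequencies gathered from the fetched corpora.
fn select_tokens(
    counts: &HashMap<String, u64>,
    allowlist: &HashSet<String>,
    min_count: u64,
    is_kirjakieli: impl Fn(&str) -> bool,
) -> Vec<String> {
    let mut kept = Vec::new();
    // Only tokens that occur in some corpus are ever iterated over.
    for (tok, &n) in counts {
        // Tokens below the min-count floor are dropped.
        if n < min_count {
            continue;
        }
        // The allowlist can rescue a token from the kirjakieli filter,
        // but it cannot add a token that has no corpus count at all.
        if is_kirjakieli(tok.as_str()) && !allowlist.contains(tok.as_str()) {
            continue;
        }
        kept.push(tok.clone());
    }
    kept.sort();
    kept
}
```

In this sketch an allowlisted token that never occurs in `counts` is simply never visited, which is why an allowlist alone cannot guarantee that a specific form such as oliks reaches the output.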

Change

Adds a committed, hand-vetted curated seed source — the deterministic gap-filler that was missing — following the existing UrbaaniSanakirja adapter pattern, in line with the project goal that "adding a source should be a small, local change."

  • crates/sources/src/curated.rs: new Curated source reading data/curated-fi.txt. Each line is either word or word <weight>; weights let high-frequency function words (e.g. oon) outrank rarer forms after log-scoring. Fully unit-tested.
  • crates/cli: wired into assemble as a default-on source (disable with --no-curated). Unlike the corpora, it's committed, so it's always present. Every curated token is folded into the allowlist, so hand-vetted forms survive the Voikko kirjakieli filter even when libvoikko accepts them. This is what lets useful clitic chains like miksiköhän / saakohan (morphologically valid but absent from main_fi's frequency list) through.
  • data/curated-fi.txt: 169 vetted forms, weight-tiered across pronouns, olla/verb contractions, reduced -ks questions, clitic-chain pondering forms, fillers, and a few slang greetings. Plain standard conjugations already in main_fi (voin, teen, sanon, montako, …) are deliberately excluded to avoid duplicating the kirjakieli dictionary.
  • README.md: documents the new source.
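
As a rough illustration of the seed format, here is a minimal sketch of a parser for `word` / `word <weight>` lines. It is not the code in curated.rs: `parse_seed`, the `DEFAULT_WEIGHT` value, and the skipping of blank and `#` lines are assumptions made for the example. Duplicate entries are summed, matching the behaviour the follow-up commit describes for `read_curated`.

```rust
use std::collections::HashMap;

/// Assumed weight for a bare `word` line; the real default is not stated here.
const DEFAULT_WEIGHT: u64 = 1000;

/// Hypothetical parser for the `word` / `word <weight>` seed format.
fn parse_seed(text: &str) -> Result<HashMap<String, u64>, String> {
    let mut weights: HashMap<String, u64> = HashMap::new();
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        // Skipping blank lines and `#` comment lines is an assumption.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut parts = line.split_whitespace();
        let word = match parts.next() {
            Some(w) => w,
            None => continue,
        };
        // An optional second field is the weight; otherwise use the default.
        let weight = match parts.next() {
            Some(w) => w
                .parse::<u64>()
                .map_err(|e| format!("line {}: bad weight {:?}: {}", idx + 1, w, e))?,
            None => DEFAULT_WEIGHT,
        };
        if parts.next().is_some() {
            return Err(format!("line {}: too many fields", idx + 1));
        }
        // Duplicate tokens add up rather than raising an error.
        *weights.entry(word.to_string()).or_insert(0) += weight;
    }
    Ok(weights)
}
```

The resulting map could then feed both the source's weighted tokens and the allowlist, since every key is by construction a curated token.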

Verification

  • cargo fmt --check, cargo clippy -D warnings, and cargo test --workspace all pass (36 tests, 6 new).
  • End-to-end smoke test (assemble with only the curated source, Voikko enabled): all 169 tokens survive the filter, and every example form lands — oliks, saaks, saakohan, miksköhän, miksiköhän, onks, eiks, …

Notes

  • oliko itself is intentionally not added — it's standard kirjakieli already shipped by main_fi. We add the spoken variant oliks.
  • The list is meant to grow via PR; the mechanism is the durable part.

https://claude.ai/code/session_018xYxT7WXd64cm6C2wSbkFq


Generated by Claude Code

The dictionary previously came only from fetched corpora (OpenSubtitles,
Urbaani, Mastodon), none committed, and the allowlist could only *keep*
tokens already present in a corpus — there was no way to guarantee a
specific spoken form landed in the output. Common puhekieli forms like
oliks, saaks, and miksköhän were therefore frequently missing.

Add a committed, hand-vetted curated seed list (data/curated-fi.txt) as a
new always-on Source. It covers the systematic gaps: pronoun/verb
contractions, reduced -ks question forms, and clitic chains. assemble
folds every curated token into the allowlist so hand-vetted forms survive
the Voikko kirjakieli filter even when libvoikko accepts the form (e.g.
clitic chains like miksiköhän/saakohan that main_fi's frequency list
omits). Plain standard conjugations already in main_fi are deliberately
excluded to avoid duplicating the kirjakieli dictionary.

- crates/sources/src/curated.rs: Curated source, 'word [weight]' format
- crates/cli: wire in as default-on source (--no-curated), fold tokens
  into allowlist
- data/curated-fi.txt: 169 vetted puhekieli forms, weight-tiered
- README: document the new source
@laurigates laurigates marked this pull request as ready for review May 30, 2026 06:34
@laurigates laurigates merged commit cc853f0 into main May 30, 2026
1 check passed
@laurigates laurigates deleted the claude/finnish-dictionary-gaps-wD9Zl branch May 30, 2026 06:52
laurigates added a commit that referenced this pull request May 30, 2026
Implements the **suggested first PR** from #6: item **1** (systematic
morphological expansion) + item **4**'s dedup test, plus the item **5**
weight-tier documentation.

## What changed

### `data/curated-fi.txt`: 169 → 307 entries (+138)
Grouped under new `#` section headers, covering the highest-coverage /
lowest-ambiguity categories:

- **More reduced `-ks` question forms** — `ostaaks`, `lähteeks`,
`pitääks`, `auttaaks`, `käydääks`, …
- **Second-person `-tsä` agglutinations** — `ootsä`, `näätsä`,
`tiiätsä`, `muistatsä`, …
- **`-s` clitic interrogatives** — `mikäs`, `missäs`, `mistäs`,
`koskas`, `minnes`, …
- **Conditional `-skö` / emphatic `-pa` clitic chains** — `tekiskö`,
`kävisköhän`, `viittiskö`, `olispa`, `tossapa`, …
- **Interrogative `-kohan`/`-köhän` paradigm completion** —
`montakohan`, `paljonkohan`, `kauankohan`, `kenetköhän`, `kenelköhän`
- **Pronoun case-form completion** — `mul`, `sul`, `sil`, `tos`,
`tohon`, `tähä`, `niihi`, `teijät`, `meijät`, …
- **olla / common-verb spoken passive conjugations** — `ollaa`,
`mentii`, `tehtii`, `nähtii`, `käytii`, `mentäis`, …
- **Numeral final-vowel droppers** — `yks`, `kaks`, `kol`, `nel`,
`kuus`, `seittemä`, `kaheksa`, `yheksä`, `kymmene`

**Every candidate was screened with `voikkospell`** (the codified
version of #5's manual check):
- Forms Voikko **rejects** are genuine puhekieli absent from `main_fi` →
added.
- The four `-kohan`/`-köhän` chains Voikko **accepts** are kept **on
purpose** — the allowlist force-keeps every curated token past the
Voikko filter, exactly like the seeded
`kukakohan`/`koskakohan`/`milloinkohan`.
- Standard kirjakieli that Voikko accepts (`montako`, `viis`, `mitäs`,
`tässäpä`, …) is **left out**, since duplicating `main_fi` is a project
non-goal.

Homographs were avoided (e.g. dropped `must`, which collides with the
loanword / `musta`).

### `crates/sources/src/curated.rs`: dedup guard (closes the silent-summing footgun)
`read_curated` **sums** duplicate tokens with no error (see the existing
`duplicate_words_sum` test), so a token accidentally listed twice would
silently inflate its weight. The new
`committed_seed_list_has_no_duplicates` test reads the real
`data/curated-fi.txt` and asserts:
- no duplicate tokens (reports the offending lines),
- all entries lowercase,
- every line is a well-formed `word [<weight>]`.

Verified it fails loudly on an injected duplicate, then passes once
removed.
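
The three assertions can be sketched as a small lint function. This is an illustrative approximation, not the actual `committed_seed_list_has_no_duplicates` test: `lint_seed` is a hypothetical helper, and skipping blank and `#` lines is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical lint over seed-file text; returns one message per problem.
fn lint_seed(text: &str) -> Vec<String> {
    // Maps each token to the line number where it was last seen.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut problems = Vec::new();
    for (idx, raw) in text.lines().enumerate() {
        let lineno = idx + 1;
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        // Well-formed means `word` or `word <integer weight>`.
        let well_formed = match fields.as_slice() {
            [_] => true,
            [_, w] => w.parse::<u64>().is_ok(),
            _ => false,
        };
        if !well_formed {
            problems.push(format!("line {lineno}: expected `word [<weight>]`"));
            continue;
        }
        let word = fields[0];
        if word.to_lowercase() != word {
            problems.push(format!("line {lineno}: not lowercase: {word}"));
        }
        if let Some(prev) = seen.insert(word, lineno) {
            problems.push(format!("line {lineno}: duplicate of line {prev}: {word}"));
        }
    }
    problems
}
```

A test in this style might read the committed file, for example with `include_str!`, and assert that the returned list is empty, printing the messages on failure so the offending lines are visible.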

### Weight tiers documented (item 5)
Added a tier reference to the file header (function ≈5000, very common
≈2500–3500, common ≈1500–2000, rare ≈1000–1200) so the long tail doesn't
crowd suggestions as the list grows.
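
To make the tiers concrete, here is a small sketch that maps a weight to a tier. The `Tier` enum, the `tier_for` function, and the exact cut-offs are assumptions: the documented bands are approximate and leave gaps, so the boundaries below are simply placed between them. `log_score` uses a natural log as a guess at what "log-scoring" means; the real scoring function is not shown in this PR.

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Function,   // ≈5000
    VeryCommon, // ≈2500–3500
    Common,     // ≈1500–2000
    Rare,       // ≈1000–1200
}

/// Hypothetical classifier; the cut-offs are assumed midpoints between bands.
fn tier_for(weight: u64) -> Tier {
    if weight >= 4000 {
        Tier::Function
    } else if weight >= 2250 {
        Tier::VeryCommon
    } else if weight >= 1350 {
        Tier::Common
    } else {
        Tier::Rare
    }
}

/// Assumed natural-log scoring: compresses weights but preserves their order.
fn log_score(weight: u64) -> f64 {
    (weight as f64).ln()
}
```

Because the logarithm is monotonic, a function-tier entry still outranks a rare-tier one after scoring, while the gap between them narrows.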

## Validation
- `cargo fmt --all --check` ✓
- `cargo clippy --workspace --all-targets -- -D warnings` ✓
- `cargo test --workspace` ✓ (25 tests; dedup guard included)

## Not in this PR (tracked in #6)
Dialect batches (item 2), IM-slang/Urbaani coordination (item 3), the
Voikko-acceptance *lint recipe* and coverage-diff tooling (rest of item
4), and the paradigm generator.

Closes part of #6.

---
_Generated by [Claude
Code](https://claude.ai/code/session_01GiGLDXQH93CdAZLvAUy3HE)_

Co-authored-by: Claude <noreply@anthropic.com>