
feat(curated): expand puhekieli seed list (+138) and add dedup test#8

Merged
laurigates merged 1 commit into main from claude/github-issue-6-pZuBc
May 30, 2026

Conversation

@laurigates

Owner

Implements the suggested first PR from #6: item 1 (systematic morphological expansion) + item 4's dedup test, plus the item 5 weight-tier documentation.

What changed

data/curated-fi.txt: 169 → 307 entries (+138)

Grouped under new # section headers, covering the highest-coverage / lowest-ambiguity categories:

  • More reduced -ks question forms: ostaaks, lähteeks, pitääks, auttaaks, käydääks, …
  • Second-person -tsä agglutinations: ootsä, näätsä, tiiätsä, muistatsä, …
  • -s clitic interrogatives: mikäs, missäs, mistäs, koskas, minnes, …
  • Conditional -skö / emphatic -pa clitic chains: tekiskö, kävisköhän, viittiskö, olispa, tossapa, …
  • Interrogative -kohan/-köhän paradigm completion: montakohan, paljonkohan, kauankohan, kenetköhän, kenelköhän
  • Pronoun case-form completion: mul, sul, sil, tos, tohon, tähä, niihi, teijät, meijät, …
  • olla / common-verb spoken passive conjugations: ollaa, mentii, tehtii, nähtii, käytii, mentäis, …
  • Numeral final-vowel droppers: yks, kaks, kol, nel, kuus, seittemä, kaheksa, yheksä, kymmene

Every candidate was screened with voikkospell (the codified version of #5's manual check):

  • Forms Voikko rejects are genuine puhekieli absent from main_fi → added.
  • The four -kohan/-köhän chains Voikko accepts are kept on purpose — the allowlist force-keeps every curated token past the Voikko filter, exactly like the seeded kukakohan/koskakohan/milloinkohan.
  • Standard kirjakieli Voikko accepts (montako, viis, mitäs, tässäpä, …) is left out to avoid duplicating main_fi (a project non-goal).
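The three screening rules above amount to a small decision table. The sketch below is illustrative only: the `Screening` enum and the `screen` function are hypothetical names that do not exist in the repository, and the two booleans stand in for the voikkospell result and a manual "deliberate clitic chain" judgement.

```rust
/// Outcome of screening one candidate token (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Screening {
    /// Voikko rejects the form: genuine puhekieli absent from main_fi, so add it.
    Add,
    /// Voikko accepts the form, but it is a deliberate clitic chain: keep it anyway.
    KeepDeliberate,
    /// Voikko accepts the form and it is ordinary kirjakieli: leave it out.
    Skip,
}

/// Hypothetical helper. `voikko_accepts` is assumed to come from running the
/// token through voikkospell; `deliberate_chain` marks -kohan/-köhän chains
/// that are kept on purpose.
pub fn screen(voikko_accepts: bool, deliberate_chain: bool) -> Screening {
    match (voikko_accepts, deliberate_chain) {
        // Rejected by Voikko: the form is not covered by main_fi.
        (false, _) => Screening::Add,
        // Accepted, but force-kept by the allowlist on purpose.
        (true, true) => Screening::KeepDeliberate,
        // Accepted standard form: adding it would duplicate main_fi.
        (true, false) => Screening::Skip,
    }
}
```

Under these assumptions, a rejected form such as ostaaks maps to `Add`, while an accepted standard form such as montako maps to `Skip`.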

Homographs were avoided (e.g. dropped must, which collides with the loanword / musta).

crates/sources/src/curated.rs: dedup guard (closes the silent-summing footgun)

read_curated sums duplicate tokens with no error (see the existing duplicate_words_sum test), so a token accidentally listed twice would silently inflate its weight. The new committed_seed_list_has_no_duplicates test reads the real data/curated-fi.txt and asserts:

  • no duplicate tokens (reports the offending lines),
  • all entries lowercase,
  • every line is a well-formed word [<weight>].

Verified it fails loudly on an injected duplicate, then passes once removed.
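For readers without the diff open, the three assertions can be approximated as follows. This is a minimal sketch, not the code in crates/sources/src/curated.rs: the function name `seed_list_problems` is hypothetical, and it assumes that blank lines and `#` lines are skipped and that the optional weight is an unsigned integer.

```rust
use std::collections::HashMap;

/// Hypothetical validator: returns one human-readable message per problem found.
pub fn seed_list_problems(text: &str) -> Vec<String> {
    let mut problems = Vec::new();
    // Maps each token to the line number where it first appeared.
    let mut first_seen: HashMap<String, usize> = HashMap::new();

    for (idx, raw) in text.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim();
        // Assumption: blank lines and `#` section headers are not entries.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        let word = fields[0];
        // Well-formed means `word` or `word <weight>` (weight assumed to be u32).
        let well_formed = match fields.len() {
            1 => true,
            2 => fields[1].parse::<u32>().is_ok(),
            _ => false,
        };
        if !well_formed {
            problems.push(format!("line {}: malformed entry {:?}", line_no, line));
            continue;
        }
        if word != word.to_lowercase() {
            problems.push(format!("line {}: {:?} is not lowercase", line_no, word));
        }
        match first_seen.get(word) {
            Some(prev) => problems.push(format!(
                "line {}: duplicate of {:?} (first on line {})",
                line_no, word, prev
            )),
            None => {
                first_seen.insert(word.to_string(), line_no);
            }
        }
    }
    problems
}
```

A test in the style described would read the committed file, call a check like this, and assert that the returned list is empty, printing the messages on failure so the offending lines are visible.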

Weight tiers documented (item 5)

Added a tier reference to the file header (function ≈5000, very common ≈2500–3500, common ≈1500–2000, rare ≈1000–1200) so the long tail doesn't crowd suggestions as the list grows.
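The tiers are documentation only, but they can be pictured as approximate ranges. The sketch below is an assumption-laden illustration: the `Tier` enum, `approx_range`, and `nearest_tier` are hypothetical names that do not exist in the repository, and the "closest range wins" rule for weights that fall between tiers is invented here for the example.

```rust
/// Hypothetical tier names mirroring the header reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Function,
    VeryCommon,
    Common,
    Rare,
}

impl Tier {
    /// Approximate inclusive weight range for the tier, taken from the PR text.
    pub fn approx_range(self) -> (u32, u32) {
        match self {
            Tier::Function => (5000, 5000),
            Tier::VeryCommon => (2500, 3500),
            Tier::Common => (1500, 2000),
            Tier::Rare => (1000, 1200),
        }
    }
}

/// Returns the tier whose range is closest to `weight`.
/// Assumption: gaps between tiers resolve to the nearest range; on a tie the
/// higher tier (earlier in the list) is chosen.
pub fn nearest_tier(weight: u32) -> Tier {
    const TIERS: [Tier; 4] = [Tier::Function, Tier::VeryCommon, Tier::Common, Tier::Rare];
    let distance = |t: Tier| -> u32 {
        let (lo, hi) = t.approx_range();
        if weight < lo {
            lo - weight
        } else if weight > hi {
            weight - hi
        } else {
            0
        }
    };
    TIERS
        .iter()
        .copied()
        .min_by_key(|t| distance(*t))
        .expect("TIERS is non-empty")
}
```

A helper like this could back a lint that flags weights far from every tier, though no such lint is part of this PR.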

Validation

  • cargo fmt --all --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace ✓ (25 tests; dedup guard included)

Not in this PR (tracked in #6)

Dialect batches (item 2), IM-slang/Urbaani coordination (item 3), the Voikko-acceptance lint recipe and coverage-diff tooling (rest of item 4), and the paradigm generator.

Closes part of #6.


Generated by Claude Code

Grows the curated seed list from the MVP's 169 to 307 entries, focusing
on the highest-coverage / lowest-ambiguity categories from issue #6 (1):

- more reduced -ks question forms (ostaaks, lähteeks, pitääks, …)
- second-person -tsä agglutinations (ootsä, näätsä, tiiätsä, …)
- -s clitic interrogatives (mikäs, missäs, koskas, …)
- conditional -skö / emphatic -pa clitic chains (tekiskö, kävisköhän,
  olispa, tossapa, …)
- interrogative -kohan/-köhän paradigm completion (montakohan,
  paljonkohan, kauankohan, kenetköhän) — deliberate clitic chains
  force-kept past Voikko, matching the seeded kukakohan/koskakohan
- pronoun case-form completion (mul, sil, tos, niihi, teijät, …)
- olla / common-verb spoken passive conjugations (mentii, tehtii,
  ollaa, mentäis, …)
- numeral final-vowel droppers (yks, kaks, kol, nel, kuus, …)

Every candidate was screened with voikkospell: forms Voikko rejects are
genuine puhekieli absent from main_fi; the four accepted clitic chains
are kept on purpose (the allowlist force-keeps every curated token).
Standard kirjakieli Voikko accepts (montako, viis, mitäs, …) is left out
to avoid duplicating main_fi.

Adds committed_seed_list_has_no_duplicates over data/curated-fi.txt,
closing the silent-summing footgun (4): duplicate tokens sum without
error, so the test asserts uniqueness and also enforces lowercase /
well-formed lines. Documents the weight tiers in the file header (5).
@laurigates laurigates marked this pull request as ready for review May 30, 2026 17:31
@laurigates laurigates merged commit d53c5ab into main May 30, 2026
1 check passed
@laurigates laurigates deleted the claude/github-issue-6-pZuBc branch May 30, 2026 17:31
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)


### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>