A living, regenerated word dictionary for colloquial and slang Finnish, plus a HeliBoard mechanism to consume it without manual re-imports.
niinku — the puhekieli filler-and-discourse marker. Used here as a stand-in for everything
main_fidoesn't cover.
MVP. The pipeline crate's pure functions (merge, denylist, min-count,
Voikko-based kirjakieli filter, log-scoring, .combined header + body
emit) are implemented and unit-tested. The assemble CLI is wired
end-to-end with the curated seed list and the OpenSubtitles, Urbaani,
and Mastodon adapters;
ingest mastodon pulls fresh Finnish posts from public hashtag
streams; compile shells out to dicttool_aosp.jar to produce a
HeliBoard-loadable .dict. Only the Suomi24 adapter remains stubbed.
Prerequisites:
- A stable Rust toolchain (the workspace targets
edition = 2021). just.libvoikkoand the Finnish morphology dictionary:- macOS:
brew install libvoikko - Debian/Ubuntu:
sudo apt-get install libvoikko-dev voikko-fi
- macOS:
- Java (any JDK 11+) for the
.dictcompile step.
just test # cargo test --workspace
just lint # fmt --check + clippy -D warnings
just ci # lint + test
just build # cargo build --release --workspace
just download # fetch OpenSubtitles Finnish frequency list into data/cached/
just download-jar # fetch dicttool_aosp.jar into tools/
just generate # download corpus + jar, assemble, compile → data/out/puhekieli_fi.dict
just assemble # run Stage B with custom flags (see `niinku assemble --help`)
just compile # run dicttool on data/out/niinku.combined
just ingest-mastodon # pull Finnish hashtag posts → data/cached/mastodon-fi.txt
just ingest # pass through to the CLI (e.g. `just ingest mastodon --tags suomi`)
Pre-built .dict files are attached to each
GitHub Release as
puhekieli_fi-<version>.dict (built automatically — see CI/CD below).
You can also build one locally with just generate, which writes
data/out/puhekieli_fi.dict. To use it:
- Transfer to an Android device that has HeliBoard installed.
- Settings → Languages → Finnish → "Add dictionary from file" → select the
.dict.
The dictionary=puhekieli:fi header field declares this as an
additional (non-main) Finnish dictionary, so HeliBoard loads it
alongside its built-in main_fi rather than replacing it.
HeliBoard's Finnish dictionaries (main_fi, AOSP-format) are kirjakieli only —
derived from written-Finnish frequency corpora. Finnish IM and social writing
is puhekieli (mä, oon, tää, niinku, sun) and fast-moving internet
slang. These forms are absent from any loaded dictionary, so word suggestion
and — more importantly — glide typing fail for them. HeliBoard supports
multiple dictionaries per locale, so the fix is additive: ship a separate
puhekieli dictionary alongside the kirjakieli one.
Slang evolves quickly, so the dictionary must be regenerated on a cadence rather than authored once.
- An automated, iterable pipeline that produces a HeliBoard-compatible
.dictfile for colloquial Finnish from one or more sources. - Iteration without re-engineering: adding a source or tuning the filter should be a small, local change.
- A HeliBoard path (feature request, fork as fallback) to load dictionaries from a URL, cache them, and optionally auto-update.
- Replacing the kirjakieli dictionary. The output supplements it.
- Shipping on F-Droid as part of HeliBoard proper (separate concern; not blocking).
- Perfect coverage. A curated, well-ranked subset beats an exhaustive noisy one.
Core implementation in Rust: fast, single static binary, TDD-friendly, matches
the maintainer's stack. dicttool_aosp.jar (Java) is invoked as a subprocess
for the final compile step only.
Heavy corpus crunching is separated from light scheduled regeneration so CI runs stay cheap.
Stage A — ingest (heavy, cached or manual). Each source is processed by
an adapter into a normalized per-source frequency table (token, count).
Large corpora (Suomi24, OpenSubtitles) are crunched rarely; their tables are
cached via actions/cache keyed on corpus version, or committed compressed.
Not re-run per build.
Stage B — assemble (light, scheduled). Live sources are fetched fresh, merged with cached tables, filtered, scored, and emitted. This is the cron-driven job.
- Curated seed list (
data/curated-fi.txt) — hand-vetted puhekieli forms committed to the repo: pronoun/verb contractions (mä,oon), reduced-ksquestion forms (oliks,saaks), and clitic chains (miksköhän,saakohan). Unlike the fetched corpora it is always present, so these forms are guaranteed in the output; assemble folds every curated token into the allowlist so they survive the Voikko filter even when libvoikko accepts the form. Grow it via PR. - Urbaani Sanakirja — curated slang headwords. Extract headwords only, not definitions.
- Mastodon public timeline — Finnish-language posts via API, no auth. Freshest slang signal.
- OpenSubtitles Finnish frequency — conversational register, openly available.
- Suomi24 corpus (Kielipankki) — strongest puhekieli mass; access has academic licensing. A derived frequency table is redistributable even where the source text is not — consume it, publish only counts.
Each source is an adapter behind a common trait, so sources can be added or dropped independently.
The discriminator is frequency × morphological rejection:
- A token that is frequent in a colloquial corpus but rejected by libvoikko is almost certainly puhekieli or slang — keep it.
- A token Voikko accepts is kirjakieli, already covered by
main_fi— drop it. - Use
voikko-rsto keep the whole pipeline in one language.
The frequency floor removes typos and one-off noise. A repo-managed
denylist.txt / allowlist.txt handles the residual, applied at assembly
and evolved via PR.
Map log-frequency to HeliBoard's f value (0–255). Tier the output:
- puhekieli function words (pronouns, verb contractions) — high,
f ≈ 200+ - slang content words — lower,
f ≈ 100–170, so they surface without drowning real words
bigram= next-word data is a later quality improvement; the format supports
it.
- Emit a
.combinedwordlist (plain text, oneword=X,f=Nline per entry). - Compile to
.dictviadicttool_aosp.jar makedict. - Verify the exact
.combinedheader fields and the dictionary-type prefix expected for an additional (non-main) dictionary against the HeliBoard wiki before finalizing — this is an open detail (see Open questions).
- Scheduled
cron(monthly suggested) runs Stage B. - The job does not auto-publish. It opens a PR containing the regenerated wordlist plus a diff of new-vs-removed entries. A human merges.
- Releases are cut by Release Please
(
release.yml): it maintains a release PR from Conventional Commits onmain; merging it tags the commit and creates a GitHub Release. - On release,
release.ymlcompiles the.dictand attaches it (plus the.combinedsource) to the release as a stable, versioned download (consumed by Workstream B). Version bumps are driven by commit type (feat:→ minor,fix:→ patch) — see Conventional Commits.
/crates
/pipeline # merge, filter, score, emit — core, fully tested
/sources # one adapter module per source
/data
/cached # committed/cached frequency tables (Stage A output)
curated-fi.txt # hand-vetted puhekieli seed list (always-on source)
allowlist.txt
denylist.txt
/.github/workflows
ingest.yml # Stage A, manual / version-triggered
assemble.yml # Stage B, cron + PR
release.yml # Release Please + compile/publish .dict on release
ci.yml # fmt, clippy, test
/tools
dicttool_aosp.jar # not yet vendored
TDD throughout: the filter and scoring logic are pure functions over frequency tables and must be unit-tested.
Today a user-added dictionary is a one-off file import; HeliBoard does not refresh it. With a monthly-regenerated dictionary this means a manual re-import every cycle.
Open an issue on HeliBoard proposing: load a dictionary from a URL, cache it locally, and optionally auto-update — either on a cadence or when a version change is detected. Frame it generically (any locale benefits), with this project as the motivating use case. Cheapest path if accepted.
If upstream declines, fork. Scope:
- Settings entry: add a dictionary by URL per locale.
- Cache the fetched
.dictin app storage. - Update trigger: cadence (configurable) or version detection — compare a
version field (HTTP
ETag/Last-Modified, or a version in the.combinedheader) before downloading. - Offline-safe: keep the last good cached copy on fetch failure.
Keep the fork minimal and rebaseable on upstream to ease an eventual merge.
The Release artifact URL from Workstream A is the input here, so the two workstreams meet at a stable, versioned URL.
Prove the pipeline end to end before breadth:
- One corpus (OpenSubtitles Finnish) + Urbaani Sanakirja headwords.
- Voikko-rejection filter + manual
denylist.txt. .combined→.dictcompile.- Monthly cron → PR → Release artifact.
- Manual import into HeliBoard to validate suggestions and glide typing.
Then: add Mastodon, the new-vs-removed diff in the PR, bigram data, and begin Workstream B.
- Discard raw post/corpus text immediately after counting. Persist only aggregate frequencies. This sidesteps most licensing friction and is the correct default.
- Headword lists and derived frequency tables are generally redistributable even where source text is not; definitions and full corpus text are not. Keep the distinction explicit per source.
- Exact
.combinedheader and dictionary-type prefix for an additional Finnish dictionary loaded alongsidemain_fi— verify against the HeliBoard wiki. - Suomi24 / Kielipankki access terms and what specifically may be redistributed.
- Mastodon Finnish-language detection: rely on the post
languagefield, or run detection locally. - Whether puhekieli and internet slang should be one dictionary or two (lets the user enable them independently).
- Update cadence — monthly is a starting assumption; slang half-life may argue for shorter.
MIT.