feat: emit .combined header and compile to .dict via dicttool#3

Merged
laurigates merged 4 commits into main from feat/combined-header-and-compile
May 27, 2026

@laurigates

Owner

Summary

Closes the MVP loop: niinku now produces a HeliBoard-loadable .dict file end-to-end, not just a wordlist body.

  • niinku-pipeline: new CombinedHeader struct + emit_combined_header function, with the five-field format the AOSP dicttool loader requires.
  • niinku-cli/assemble: prepends the header to file output (stdout stays body-only for piping). Defaults are tuned for the puhekieli use case (dictionary=puhekieli:fi, locale=fi_FI, description and version configurable via flags).
  • niinku-cli/compile: new subcommand. Thin wrapper around java -jar tools/dicttool_aosp.jar makedict -s <combined> -d <dict> with up-front validation and pass-through stdout/stderr.
  • just download-jar: fetches dicttool_aosp.jar from the remi0s/aosp-dictionary-tools community mirror (AOSP doesn't publish the jar standalone).
  • just generate is now end-to-end: download corpus + jar, assemble .combined, compile to .dict.
  • README updated with Java prerequisite and HeliBoard import instructions.

Local end-to-end

$ just generate
... [assemble logs] ...
after voikko kirjakieli filter: 7435 tokens (dropped 42565)
wrote header + 7435 entries to data/out/niinku.combined

java -jar tools/dicttool_aosp.jar makedict -s data/out/niinku.combined -d data/out/puhekieli_fi.dict
Flattening the tree...
Counted nodes : 8910
Computing addresses...
Compressing the array addresses. Original size : 68448
After address compression : 45662
Statistics:
  Total file size 45662
  3505 node arrays
  8910 PtNodes (2.5420828 PtNodes per node)
Done
wrote data/out/puhekieli_fi.dict

Final .combined header line:

dictionary=puhekieli:fi,locale=fi_FI,description=niinku Finnish puhekieli + slang,date=1779886506,version=1
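
As a quick sanity check on the line above, the five keys can be pulled out with a few lines of Rust. This is only a sketch: `header_keys` is a hypothetical helper, not part of niinku, and the naive comma split assumes that no field value (the description in particular) contains a comma.

```rust
/// Hypothetical helper: returns the keys of a `.combined` header line, in order.
/// Assumes no field value contains a comma.
fn header_keys(line: &str) -> Vec<&str> {
    line.split(',')
        .filter_map(|field| field.split_once('='))
        .map(|(key, _value)| key)
        .collect()
}

fn main() {
    let line = "dictionary=puhekieli:fi,locale=fi_FI,description=niinku Finnish puhekieli + slang,date=1779886506,version=1";
    let keys = header_keys(line);
    // The five fields named in this PR, in the order they are emitted.
    assert_eq!(keys, ["dictionary", "locale", "description", "date", "version"]);
    println!("{keys:?}");
}
```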

(Running file on data/out/puhekieli_fi.dict reports "OpenPGP Public Key". This is a known false positive: the AOSP .dict magic bytes overlap with PGP.)
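
For a check that does not depend on the file utility, the leading bytes can be compared directly. The sketch below assumes the magic number 0x9BC13AFE that AOSP LatinIME's format code uses for version 2 and later binary dictionaries. That value is not stated in this PR, so verify it against your dicttool build. `looks_like_aosp_dict` is a hypothetical helper.

```rust
use std::fs;

/// ASSUMPTION: magic number of AOSP binary dictionaries (format version 2+),
/// stored big-endian at the start of the file. Not taken from this PR.
const ASSUMED_DICT_MAGIC: [u8; 4] = [0x9B, 0xC1, 0x3A, 0xFE];

/// Hypothetical helper: true if the buffer starts with the assumed magic bytes.
fn looks_like_aosp_dict(bytes: &[u8]) -> bool {
    bytes.starts_with(&ASSUMED_DICT_MAGIC)
}

fn main() {
    // Default to the output path used in this PR; another path may be given as an argument.
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "data/out/puhekieli_fi.dict".to_string());
    match fs::read(&path) {
        Ok(bytes) => println!("{path}: AOSP magic present = {}", looks_like_aosp_dict(&bytes)),
        Err(err) => eprintln!("could not read {path}: {err}"),
    }
}
```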

Test plan

  • just lint (fmt + clippy -D warnings) — green
  • just test (18 unit tests, including new emit_combined_header_matches_heliboard_format) — green
  • just generate end-to-end — produces 45K puhekieli_fi.dict from 7435 entries
  • Header line format matches the AOSP sample.combined reference
  • (Manual) Transfer to Android, import into HeliBoard via Settings → Languages → Finnish → Add dictionary from file — not run; needs a physical device

Format references

Remaining deferred

  • Mastodon adapter (live freshest-slang signal)
  • Suomi24 adapter (Kielipankki, requires academic access)
  • Cleanup of English proper-noun noise in subtitle corpora

🤖 Generated with Claude Code

laurigates and others added 4 commits May 27, 2026 15:56
HeliBoard's `.combined` loader requires a single-line header with five
fields: `dictionary=<type>:<dict_locale>,locale=<locale>,description=
<desc>,date=<unix_ts>,version=<v>`. `dicttool_aosp.jar makedict` errors
without it.

`dict_type=main` replaces the locale's primary dictionary; any other
type string (e.g. `puhekieli`, `slang`, `emoji`) loads as an additional
dictionary alongside `main_fi`. Confirmed via HeliBoard discussion #701.

The struct keeps the two locale fields separate because they format
differently: `dict_locale=fi` (lowercase language code, for the
dictionary= tag) and `locale=fi_FI` (language plus region, in the
underscore-separated Java-style form rather than a hyphenated BCP 47
tag).
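
A minimal sketch of what such a struct and emitter could look like. The field names follow this commit message, but the field types and the exact function signature are assumptions, not the code merged in this PR.

```rust
/// Sketch of the header fields described above. Field types are assumed.
struct CombinedHeader {
    dict_type: String,   // e.g. "puhekieli"; "main" would replace the primary dictionary
    dict_locale: String, // e.g. "fi", used inside the dictionary= tag
    locale: String,      // e.g. "fi_FI"
    description: String,
    date: u64,           // Unix timestamp in seconds
    version: u32,
}

/// Formats the single header line, without a trailing newline.
fn emit_combined_header(h: &CombinedHeader) -> String {
    format!(
        "dictionary={}:{},locale={},description={},date={},version={}",
        h.dict_type, h.dict_locale, h.locale, h.description, h.date, h.version
    )
}

fn main() {
    let header = CombinedHeader {
        dict_type: "puhekieli".to_string(),
        dict_locale: "fi".to_string(),
        locale: "fi_FI".to_string(),
        description: "niinku Finnish puhekieli + slang".to_string(),
        date: 1779886506,
        version: 1,
    };
    println!("{}", emit_combined_header(&header));
}
```

With the values shown, the output matches the header line quoted in the PR description.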

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
assemble now prepends a CombinedHeader to file output. Defaults match
HeliBoard's additional-dictionary convention for Finnish puhekieli:
- dict_type=puhekieli, dict_locale=fi
- locale=fi_FI
- description="niinku Finnish puhekieli + slang"
- date=now (Unix timestamp at generation time)
- version=1

Override via --dict-type/--dict-locale/--locale/--description/--version.
Stdout output stays body-only for piping/grep.
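
The file-versus-stdout split can be sketched as below. `render_output` and its `to_file` flag are hypothetical names, and the body lines are placeholders rather than real `.combined` entries.

```rust
/// Hypothetical helper: file output gets the header line first,
/// stdout output stays body-only so it can be piped or grepped.
fn render_output(header_line: &str, body: &str, to_file: bool) -> String {
    if to_file {
        format!("{header_line}\n{body}")
    } else {
        body.to_string()
    }
}

fn main() {
    let header = "dictionary=puhekieli:fi,locale=fi_FI,description=demo,date=0,version=1";
    let body = "placeholder-entry-1\nplaceholder-entry-2\n";
    print!("{}", render_output(header, body, true));
    print!("{}", render_output(header, body, false));
}
```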

compile is a thin wrapper around `java -jar tools/dicttool_aosp.jar
makedict -s <combined> -d <dict>`. --jar overrides the default path.
Validates jar + input existence up front, propagates dicttool's stdout/
stderr verbatim, and returns its exit code.
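
A rough sketch of such a wrapper using only the standard library. The function names, the error handling, and the fallback exit code are assumptions; only the `java -jar ... makedict -s ... -d ...` invocation comes from this commit.

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;
use std::process::Command;

/// Builds the argument list for `java`, mirroring the invocation quoted above.
fn makedict_args(jar: &Path, combined: &Path, dict: &Path) -> Vec<OsString> {
    vec![
        OsString::from("-jar"),
        jar.as_os_str().to_os_string(),
        OsString::from("makedict"),
        OsString::from("-s"),
        combined.as_os_str().to_os_string(),
        OsString::from("-d"),
        dict.as_os_str().to_os_string(),
    ]
}

/// Hypothetical wrapper: checks inputs first, then runs dicttool.
fn compile(jar: &Path, combined: &Path, dict: &Path) -> io::Result<i32> {
    for (label, path) in [("jar", jar), ("input", combined)] {
        if !path.is_file() {
            let msg = format!("{label} not found: {}", path.display());
            return Err(io::Error::new(io::ErrorKind::NotFound, msg));
        }
    }
    // `status()` inherits stdout/stderr, so dicttool's output passes through unchanged.
    let status = Command::new("java")
        .args(makedict_args(jar, combined, dict))
        .status()?;
    // `code()` is None if the process was killed by a signal; 1 is an assumed fallback.
    Ok(status.code().unwrap_or(1))
}

fn main() {
    let jar = Path::new("tools/dicttool_aosp.jar");
    let combined = Path::new("data/out/niinku.combined");
    let dict = Path::new("data/out/puhekieli_fi.dict");
    match compile(jar, combined, dict) {
        Ok(code) => std::process::exit(code),
        Err(err) => {
            eprintln!("compile failed: {err}");
            std::process::exit(2);
        }
    }
}
```

Validating before spawning keeps a missing jar from surfacing as an opaque Java error.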

Local run produces a 45K .dict from 7435 puhekieli/slang entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…*.jar

just download-jar fetches dicttool_aosp.jar from the remi0s/
aosp-dictionary-tools community mirror (the most-cited prebuilt source,
since AOSP doesn't publish the jar standalone). just compile shells out
to the CLI's compile subcommand with the conventional output name
data/out/puhekieli_fi.dict.

just generate is now end-to-end: download corpus + jar, assemble the
.combined, compile to .dict — no manual steps between a fresh clone and
a HeliBoard-importable file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates status to MVP (pipeline now produces a loadable .dict, not just
a wordlist body), adds Java prerequisite, lists the new recipes, and
spells out the on-device import path:

  Settings → Languages → Finnish → "Add dictionary from file"

Notes the dictionary=puhekieli:fi header field as the mechanism that
makes HeliBoard load this alongside main_fi instead of replacing it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@laurigates laurigates added the enhancement New feature or improvement label May 27, 2026
@laurigates laurigates self-assigned this May 27, 2026
@laurigates laurigates merged commit f2c283d into main May 27, 2026
1 check passed
@laurigates laurigates deleted the feat/combined-header-and-compile branch May 27, 2026 12:59
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.1...v0.1.0)
(2026-06-03)


### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles +
urbaani ([#1](#1))
([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test
([#8](#8))
([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool
([#3](#3))
([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter
([#2](#2))
([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion
([#4](#4))
([f773873](f773873))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>