feat: emit .combined header and compile to .dict via dicttool #3
Merged
HeliBoard's `.combined` loader requires a single-line header with five fields:
`dictionary=<type>:<dict_locale>,locale=<locale>,description=<desc>,date=<unix_ts>,version=<v>`.
`dicttool_aosp.jar makedict` errors without it.

`dict_type=main` replaces the locale's primary dictionary; any other type string (e.g. `puhekieli`, `slang`, `emoji`) loads as an additional dictionary alongside `main_fi`. Confirmed via HeliBoard discussion #701.

The struct keeps the two locale fields separate because they format differently: `dict_locale=fi` (lowercase language code, for the `dictionary=` tag) and `locale=fi_FI` (language plus region, underscore-separated; a strict BCP 47 tag would be `fi-FI`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
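The header format described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the names `CombinedHeader` and `emit_combined_header` come from the PR description, but the field names, field types, and the function signature are assumptions.

```rust
/// Illustrative shape of the header. Field names and types are assumptions;
/// only the struct name appears in the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct CombinedHeader {
    /// `main` replaces the primary dictionary; any other value is additional.
    pub dict_type: String,
    /// Lowercase language code used inside the `dictionary=` tag, e.g. `fi`.
    pub dict_locale: String,
    /// Full locale, e.g. `fi_FI`.
    pub locale: String,
    pub description: String,
    /// Unix timestamp in seconds.
    pub date: u64,
    pub version: u32,
}

/// Renders the single-line header in the five-field order quoted above.
/// No escaping is attempted, so a comma in `description` would presumably
/// confuse a comma-splitting reader (assumption).
pub fn emit_combined_header(h: &CombinedHeader) -> String {
    format!(
        "dictionary={}:{},locale={},description={},date={},version={}",
        h.dict_type, h.dict_locale, h.locale, h.description, h.date, h.version
    )
}
```

Keeping `dict_locale` and `locale` as separate fields mirrors the reasoning in the commit message: the two values are spelled differently, so deriving one from the other would need locale-specific rules.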
assemble now prepends a CombinedHeader to file output. Defaults match HeliBoard's additional-dictionary convention for Finnish puhekieli:

- dict_type=puhekieli, dict_locale=fi
- locale=fi_FI
- description="niinku Finnish puhekieli + slang"
- date=now (Unix timestamp at generation time)
- version=1

Override via --dict-type/--dict-locale/--locale/--description/--version. Stdout output stays body-only for piping/grep.

compile is a thin wrapper around `java -jar tools/dicttool_aosp.jar makedict -s <combined> -d <dict>`. --jar overrides the default path. It validates jar and input existence up front, propagates dicttool's stdout/stderr verbatim, and returns its exit code.

Local run produces a 45K .dict from 7435 puhekieli/slang entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
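The wrapper behaviour described above (validate first, pass output through, return the exit code) could look roughly like this. It is a sketch under assumptions: `makedict_args` and `run_compile` are hypothetical names, the error type is simplified to `String`, and only the `java -jar ... makedict -s ... -d ...` invocation is taken from the commit message.

```rust
use std::path::Path;
use std::process::Command;

/// Builds the argument list for `java`, in the order quoted in the commit message.
/// (Hypothetical helper; the real CLI may assemble its arguments differently.)
pub fn makedict_args(jar: &Path, combined: &Path, dict: &Path) -> Vec<String> {
    vec![
        "-jar".to_string(),
        jar.display().to_string(),
        "makedict".to_string(),
        "-s".to_string(),
        combined.display().to_string(),
        "-d".to_string(),
        dict.display().to_string(),
    ]
}

/// Runs dicttool and returns its exit code (hypothetical helper).
pub fn run_compile(jar: &Path, combined: &Path, dict: &Path) -> Result<i32, String> {
    // Validate up front so a missing file yields a clear message
    // instead of a Java error.
    if !jar.is_file() {
        return Err(format!("dicttool jar not found: {}", jar.display()));
    }
    if !combined.is_file() {
        return Err(format!("input not found: {}", combined.display()));
    }
    // `status()` lets the child inherit stdout/stderr, so dicttool's
    // output reaches the terminal unchanged.
    let status = Command::new("java")
        .args(makedict_args(jar, combined, dict))
        .status()
        .map_err(|e| format!("failed to launch java: {}", e))?;
    // `code()` is None if the process was terminated by a signal; report 1 then.
    Ok(status.code().unwrap_or(1))
}
```

Using `status()` rather than `output()` is what gives the verbatim pass-through: nothing is captured or re-printed by the wrapper.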
…*.jar

just download-jar fetches dicttool_aosp.jar from the remi0s/aosp-dictionary-tools community mirror (the most-cited prebuilt source, since AOSP doesn't publish the jar standalone).

just compile shells out to the CLI's compile subcommand with the conventional output name data/out/puhekieli_fi.dict.

just generate is now end-to-end: download corpus + jar, assemble the .combined, compile to .dict, with no manual steps between a fresh clone and a HeliBoard-importable file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates status to MVP (pipeline now produces a loadable .dict, not just a wordlist body), adds Java prerequisite, lists the new recipes, and spells out the on-device import path:

Settings → Languages → Finnish → "Add dictionary from file"

Notes the dictionary=puhekieli:fi header field as the mechanism that makes HeliBoard load this alongside main_fi instead of replacing it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
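To make the `dictionary=<type>:<locale>` mechanism concrete, here is a small sketch that reads the type out of a header line and classifies it the way this PR describes HeliBoard's behaviour. It does not reproduce HeliBoard's own code; `LoadMode`, `dict_type`, and `load_mode` are hypothetical names introduced for illustration.

```rust
/// How a dictionary is expected to load, per the behaviour described in this PR.
#[derive(Debug, PartialEq)]
pub enum LoadMode {
    /// `dictionary=main:<locale>` replaces the locale's primary dictionary.
    ReplacesMain,
    /// Any other type loads alongside the main dictionary.
    Additional,
}

/// Extracts `<type>` from the `dictionary=<type>:<locale>` field of a header line.
/// Returns None if the field is absent or has no `:` separator.
pub fn dict_type(header: &str) -> Option<&str> {
    header
        .split(',')
        .find_map(|field| field.trim().strip_prefix("dictionary="))
        .and_then(|value| value.split_once(':'))
        .map(|(ty, _locale)| ty)
}

/// Classifies a header line by its dictionary type.
pub fn load_mode(header: &str) -> Option<LoadMode> {
    dict_type(header).map(|ty| {
        if ty == "main" {
            LoadMode::ReplacesMain
        } else {
            LoadMode::Additional
        }
    })
}
```

With the defaults from this PR, a header beginning `dictionary=puhekieli:fi` would classify as `Additional`, which matches the stated goal of loading next to `main_fi`.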
laurigates pushed a commit that referenced this pull request on Jun 3, 2026
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)

### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>
Summary
Closes the MVP loop: niinku now produces a HeliBoard-loadable `.dict` file end-to-end, not just a wordlist body.

- `niinku-pipeline`: new `CombinedHeader` struct + `emit_combined_header` function, with the five-field format the AOSP `dicttool` loader requires.
- `niinku-cli/assemble`: prepends the header to file output (stdout stays body-only for piping). Defaults are tuned for the puhekieli use case (`dictionary=puhekieli:fi`, `locale=fi_FI`, description and version configurable via flags).
- `niinku-cli/compile`: new subcommand. Thin wrapper around `java -jar tools/dicttool_aosp.jar makedict -s <combined> -d <dict>` with up-front validation and pass-through stdout/stderr.
- `just download-jar`: fetches `dicttool_aosp.jar` from the remi0s/aosp-dictionary-tools community mirror (AOSP doesn't publish the jar standalone).
- `just generate` is now end-to-end: download corpus + jar, assemble `.combined`, compile to `.dict`.

Local end-to-end
Final `.combined` header line:

(`file data/out/puhekieli_fi.dict` reports "OpenPGP Public Key", a known false positive; AOSP `.dict` magic bytes overlap with PGP.)

Test plan
- `just lint` (fmt + clippy -D warnings): green
- `just test` (18 unit tests, including new `emit_combined_header_matches_heliboard_format`): green
- `just generate` end-to-end: produces 45K `puhekieli_fi.dict` from 7435 entries

Format references
- `<type>:<locale>` semantics: Discussion #701
- `sample.combined` (canonical header reference): aosp-dictionaries sample

Remaining deferred
🤖 Generated with Claude Code