feat(sources): Mastodon hashtag-timeline ingestion #4
Merged
Three new modules in niinku-sources:

- tokenize: pure functions used by live-source ingestion. strip_html(s) handles Mastodon's server-sanitised HTML (tag stripping + the entity set Mastodon actually emits). tokenize_for_freq(s) pre-strips @mentions / #hashtags / URLs (which unicode_words would otherwise turn into bare words after dropping the prefix), then lowercases and word-segments with unicode-segmentation. Drops single-char and pure-numeric tokens (including decimals like 3.14 that unicode_words preserves intact). Does not try to detect language; Voikko handles that downstream.
- mastodon (cache adapter): mirrors the OpenSubtitles adapter and reads `word count` lines from data/cached/mastodon-fi.txt. Lets Stage B (assemble) stay offline + testable.
- mastodon_ingest (Stage A network code): pulls posts from hashtag timelines on a target instance (one or more tags in a single run), filters by Status.language, tokenises, accumulates counts, and discards raw text per README's privacy default. Uses hashtag streams in preference to /timelines/public because modern Mastodon (4.x+) requires authentication for the public timeline while /timelines/tag/:hashtag remains open on most instances. Pagination follows the Link header with a max_id fallback. Sync via ureq; no async runtime added.

New workspace deps: serde, serde_json, ureq (rustls TLS only), unicode-segmentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
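To make the tokenize rules above concrete, here is a minimal std-only sketch of the strip, pre-filter, lowercase, and drop sequence followed by count accumulation. It is not the code in this PR. The function bodies are assumptions, the entity list is a guess at a plausible subset, `count_tokens` is a hypothetical helper name, and splitting on non-alphanumeric characters stands in for `unicode_words` from unicode-segmentation.

```rust
use std::collections::HashMap;

/// Rough stand-in for `strip_html`: drops everything between `<` and `>`
/// and decodes a few common entities (this entity list is an assumption).
fn strip_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' '); // treat tags such as <br> or </p> as word breaks
            }
            _ if !in_tag => out.push(ch),
            _ => {}
        }
    }
    // Decode `&amp;` last so that `&amp;lt;` does not turn into `<`.
    out.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}

/// Approximation of `tokenize_for_freq` using only std string methods.
fn tokenize_for_freq(text: &str) -> Vec<String> {
    text.split_whitespace()
        // Pre-strip mentions, hashtags, and scheme-prefixed URLs.
        .filter(|chunk| {
            !(chunk.starts_with('@')
                || chunk.starts_with('#')
                || chunk.starts_with("http://")
                || chunk.starts_with("https://"))
        })
        // Crude word segmentation: split on anything non-alphanumeric.
        .flat_map(|chunk| chunk.split(|c: char| !c.is_alphanumeric()))
        .map(|word| word.to_lowercase())
        // Drop empty and single-character tokens.
        .filter(|word| word.chars().count() > 1)
        // Drop pure-numeric tokens ("3.14" was already split into "3" and "14").
        .filter(|word| !word.chars().all(|c| c.is_numeric()))
        .collect()
}

/// Hypothetical helper: accumulates counts and keeps no raw post text.
fn count_tokens(posts: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for html in posts {
        let text = strip_html(html);
        for token in tokenize_for_freq(&text) {
            *counts.entry(token).or_insert(0) += 1;
        }
        // `text` is dropped here; only the counts outlive the loop.
    }
    counts
}
```

Under these assumptions, a post such as `<p>Moi @kaveri! Tää on #suomi testi 3.14 https://yle.fi/a</p>` reduces to the tokens `moi`, `tää`, `on`, `testi`. A bare domain like `yle.fi` would still pass the prefix filter.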
- `niinku ingest mastodon` is the user-facing wrapper around mastodon_ingest::fetch_and_count. Flags: --instance (default mastodon.social), --language (default fi), --tags (comma-separated, default suomi,finland,helsinki), --count (default 1000 posts), --output (default data/cached/mastodon-fi.txt). Logs progress to stderr.
- `niinku assemble` now picks up data/cached/mastodon-fi.txt automatically when present; --no-mastodon to skip.
- `just ingest-mastodon` recipe with COUNT/INSTANCE env overrides.

Local smoke test against mastodon.social: 200 posts across the three default hashtags produced 1585 unique tokens (raw post text discarded). End-to-end run including Mastodon: 50011 merged tokens → 7441 entries after Voikko filter → 45K puhekieli_fi.dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
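As a rough illustration of how `fetch_and_count` might walk a hashtag timeline from the `--instance` and `--tags` values until `--count` posts are collected, the sketch below builds the first-page URL and picks the next page from a `Link` header, falling back to `max_id`. It is std-only and is not the PR's implementation. The helper names (`first_page_url`, `parse_next_link`, `next_page_url`) and the `limit` value are assumptions, and the real code issues the requests with ureq.

```rust
/// Builds the first-page URL for one hashtag timeline on an instance.
fn first_page_url(instance: &str, tag: &str, limit: u32) -> String {
    format!(
        "https://{}/api/v1/timelines/tag/{}?limit={}",
        instance, tag, limit
    )
}

/// Extracts the `rel="next"` target from a `Link` header value, if any.
/// Simplification: assumes the URLs themselves contain no `,` or `;`.
fn parse_next_link(header: &str) -> Option<String> {
    header.split(',').find_map(|part| {
        // Each part looks like `<url>; rel="next"`.
        let (target, params) = part.split_once(';')?;
        let is_next = params.split(';').any(|p| p.trim() == "rel=\"next\"");
        if !is_next {
            return None;
        }
        let url = target.trim().strip_prefix('<')?.strip_suffix('>')?;
        Some(url.to_string())
    })
}

/// Chooses the next page: the Link header wins; otherwise fall back to
/// `max_id` set to the oldest status id seen on the current page.
fn next_page_url(
    base: &str,
    link_header: Option<&str>,
    oldest_id: Option<&str>,
) -> Option<String> {
    if let Some(url) = link_header.and_then(parse_next_link) {
        return Some(url);
    }
    // `base` already carries `?limit=...`, so append with `&`.
    oldest_id.map(|id| format!("{}&max_id={}", base, id))
}
```

When neither a `rel="next"` link nor an oldest id is available, `next_page_url` returns `None`, which a caller could treat as the end of that tag's timeline before moving on to the next tag.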
Reflects that the OpenSubtitles, Urbaani, and Mastodon adapters are now live and that only Suomi24 remains stubbed. Adds the `just ingest-mastodon` recipe to the quick-start list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
laurigates pushed a commit that referenced this pull request on Jun 3, 2026
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)

### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>
Summary
Adds the third corpus source: live Finnish posts from Mastodon hashtag timelines. With this PR, only Suomi24 remains stubbed.
- `niinku-sources::tokenize`: pure `strip_html` + `tokenize_for_freq` helpers (HTML stripping, mention/hashtag/URL pre-strip, lowercased Unicode word segmentation, drop single-chars + pure-numeric tokens including decimals).
- `niinku-sources::mastodon`: cache-file source adapter (mirrors OpenSubtitles).
- `niinku-sources::mastodon_ingest`: network code. Pulls posts from one or more hashtag timelines (`/api/v1/timelines/tag/:hashtag`), filters by `Status.language`, accumulates counts, discards raw text. Uses hashtag streams in preference to `/timelines/public` because modern Mastodon (4.x+) requires auth for the public timeline while hashtag streams remain open. Pagination via `Link: rel="next"` header with `max_id` fallback. Sync via `ureq`; no async runtime added.
- `niinku assemble` picks up `data/cached/mastodon-fi.txt` automatically (`--no-mastodon` to skip).
- `niinku ingest mastodon` is the user-facing wrapper. Defaults: `--instance mastodon.social --language fi --tags suomi,finland,helsinki --count 1000`.
- `just ingest-mastodon` with `COUNT=` / `INSTANCE=` env overrides.

Local smoke test
End-to-end (`just generate` including Mastodon): `puhekieli_fi.dict` produced

Test plan
- `just lint` (fmt + clippy -D warnings): green
- `just test` (12 pipeline + 18 sources tests; new tests cover tokenize edge cases, Link header parsing, FreqTable serialisation, mastodon cache adapter): green
- `just ingest-mastodon` against `mastodon.social`: produces the cache file
- `just generate` end-to-end: `.dict` produced

Known noise (filed as followup)
The smoke-test cache file shows three classes of noise that survive tokenisation:
- `yle.fi`, `dawn.fi`: my URL filter only catches `http://` / `https://` prefixed tokens.
- `https`: appears when a URL splits across whitespace.
- `digitaalinenitsenäisyys`: Mastodon's hashtag HTML sometimes renders without `#` in the visible text (e.g. `<a class="hashtag">DigitaalinenItsenäisyys</a>`), so my `#`-prefix filter misses them.

Fixable by a follow-up PR: either parsing HTML structurally (skip text inside `<a class="hashtag">` / `<a rel="tag">`), or a domain-pattern post-filter. Not in scope here.

Remaining deferred
🤖 Generated with Claude Code