feat(sources): Mastodon hashtag-timeline ingestion by laurigates · Pull Request #4 · laurigates/niinku

laurigates · 2026-05-27T13:19:19Z

Summary

Adds the third corpus source: live Finnish posts from Mastodon hashtag timelines. With this PR, only Suomi24 remains stubbed.

niinku-sources::tokenize: pure strip_html + tokenize_for_freq helpers (HTML stripping, mention/hashtag/URL pre-strip, lowercased Unicode word segmentation, drop single-chars + pure-numeric tokens including decimals).
niinku-sources::mastodon: cache-file source adapter (mirrors OpenSubtitles).
niinku-sources::mastodon_ingest: network code. Pulls posts from one or more hashtag timelines (/api/v1/timelines/tag/:hashtag), filters by Status.language, accumulates counts, and discards raw text. Uses hashtag streams rather than /timelines/public because modern Mastodon (4.x+) requires auth for the public timeline while hashtag streams remain open. Pagination follows the Link: rel="next" header, with a max_id fallback. Synchronous via ureq; no async runtime added.
niinku assemble picks up data/cached/mastodon-fi.txt automatically (--no-mastodon to skip).
niinku ingest mastodon is the user-facing wrapper. Defaults: --instance mastodon.social --language fi --tags suomi,finland,helsinki --count 1000.
just ingest-mastodon with COUNT= / INSTANCE= env overrides.
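
For readers who want a feel for the tokenisation rules listed above, here is a minimal std-only sketch. It is not the code in niinku-sources::tokenize: the real helper is described as using unicode-segmentation's unicode_words, while this approximation splits on non-alphanumeric characters instead, so hyphenated and apostrophised words come out differently. The function bodies and the count_tokens helper are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical approximation of the described tokenizer (std only).
/// Skips @mentions, #hashtags and http(s) URLs before segmenting, lowercases,
/// then drops single-character and purely numeric tokens.
pub fn tokenize_for_freq(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for chunk in text.split_whitespace() {
        // Pre-strip: otherwise "@joku" or "#suomi" would survive as bare words.
        if chunk.starts_with('@')
            || chunk.starts_with('#')
            || chunk.starts_with("http://")
            || chunk.starts_with("https://")
        {
            continue;
        }
        let lower = chunk.to_lowercase();
        // Assumption: split on anything non-alphanumeric. This also breaks
        // "3.14" into "3" and "14", both of which are dropped below.
        for word in lower.split(|c: char| !c.is_alphanumeric()) {
            if word.chars().count() < 2 {
                continue; // empty or single-character token
            }
            if word.chars().all(|c| c.is_numeric()) {
                continue; // purely numeric token
            }
            tokens.push(word.to_string());
        }
    }
    tokens
}

/// Accumulates token counts over posts; the post text itself is not stored.
pub fn count_tokens<'a>(posts: impl IntoIterator<Item = &'a str>) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for post in posts {
        for token in tokenize_for_freq(post) {
            *counts.entry(token).or_insert(0) += 1;
        }
    }
    counts
}
```

Only the counts map leaves count_tokens, which matches the stated intent of discarding raw post text after tokenising.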

Local smoke test

$ COUNT=200 just ingest-mastodon
ingesting up to 200 posts from mastodon.social hashtag streams [suomi, finland, helsinki] (language=fi)
  fetched 40/200
  fetched 80/200
  fetched 120/200
  fetched 160/200
  fetched 200/200
tokenised 1585 unique tokens (raw post text discarded)
wrote data/cached/mastodon-fi.txt

End-to-end (just generate including Mastodon):

50011 merged tokens (OpenSubtitles + Mastodon)
7441 entries after Voikko (+6 net new vs OpenSubtitles alone — small contribution because the smoke-test sample was only 200 posts)
45K puhekieli_fi.dict produced

Test plan

just lint (fmt + clippy -D warnings) — green
just test (12 pipeline + 18 sources tests; new tests cover tokenize edge cases, Link header parsing, FreqTable serialisation, mastodon cache adapter) — green
Live just ingest-mastodon against mastodon.social — produces the cache file
Full just generate end-to-end — .dict produced

Known noise (filed as followup)

The smoke-test cache file shows three classes of noise that survive tokenisation:

Bare domain names like yle.fi, dawn.fi: my URL filter only catches tokens prefixed with http:// or https://.
The literal token https, which appears when a URL splits across whitespace.
Hashtag-concatenated words like digitaalinenitsenäisyys — Mastodon's hashtag HTML sometimes renders without # in the visible text (e.g. <a class="hashtag">DigitaalinenItsenäisyys</a>), so my #-prefix filter misses them.

Fixable by a follow-up PR — either parsing HTML structurally (skip text inside <a class="hashtag"> / <a rel="tag">), or a domain-pattern post-filter. Not in scope here.
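
The domain-pattern post-filter option could look roughly like the sketch below. This is an illustration of the idea only, not code from this PR or the planned follow-up; the function names and the exact heuristic (dot-separated labels with an alphabetic last label of two or more characters) are assumptions. A heuristic like this can also catch two words joined by a stray full stop, so it would need checking against real cache output.

```rust
/// Hypothetical heuristic: does this token look like a bare domain name
/// such as "yle.fi"? Requires at least two dot-separated labels made of
/// alphanumerics or '-', and an alphabetic final label of length >= 2.
pub fn looks_like_bare_domain(token: &str) -> bool {
    let labels: Vec<&str> = token.split('.').collect();
    if labels.len() < 2 {
        return false;
    }
    let labels_ok = labels
        .iter()
        .all(|l| !l.is_empty() && l.chars().all(|c| c.is_alphanumeric() || c == '-'));
    let tld = labels[labels.len() - 1];
    let tld_ok = tld.chars().count() >= 2 && tld.chars().all(|c| c.is_alphabetic());
    labels_ok && tld_ok
}

/// Hypothetical post-filter covering two of the noise classes listed above:
/// bare domains and the stray URL scheme tokens.
pub fn is_noise(token: &str) -> bool {
    matches!(token, "http" | "https") || looks_like_bare_domain(token)
}
```

The hashtag-concatenation class (for example digitaalinenitsenäisyys) is not addressed by a filter like this; that case is the argument for the structural HTML parsing option.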

Remaining deferred

Suomi24 adapter (Kielipankki, requires academic access)
The noise-cleanup followup above
English proper-noun cleanup in OpenSubtitles output (from the caveats in PR #2, feat(pipeline): Voikko-based kirjakieli filter)

🤖 Generated with Claude Code

Three new modules in niinku-sources:

- tokenize: pure functions used by live-source ingestion. strip_html(s) handles Mastodon's server-sanitised HTML (tag stripping + the entity set Mastodon actually emits). tokenize_for_freq(s) pre-strips @mentions / #hashtags / URLs (which unicode_words would otherwise turn into bare words after dropping the prefix), then lowercases and word-segments with unicode-segmentation. Drops single-char and pure-numeric tokens (including decimals like 3.14 that unicode_words preserves intact). Does not try to detect language; Voikko handles that downstream.
- mastodon (cache adapter): mirrors the OpenSubtitles adapter and reads `word count` lines from data/cached/mastodon-fi.txt. Lets Stage B (assemble) stay offline + testable.
- mastodon_ingest (Stage A network code): pulls posts from hashtag timelines on a target instance (one or more tags in a single run), filters by Status.language, tokenises, accumulates counts, discards raw text per README's privacy default. Uses hashtag streams in preference to /timelines/public because modern Mastodon (4.x+) requires authentication for the public timeline while /timelines/tag/:hashtag remains open on most instances. Pagination follows the Link header with a max_id fallback. Synchronous via ureq; no async runtime added.

New workspace deps: serde, serde_json, ureq (rustls TLS only), unicode-segmentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
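
The pagination rule in this commit message (follow the Link header, fall back to max_id) can be sketched as below. This is a std-only illustration, not the mastodon_ingest implementation: the function names are hypothetical, the parser assumes URLs contain no commas, and the fallback assumes the base URL carries no query string yet. max_id itself is a documented Mastodon timeline parameter.

```rust
/// Hypothetical helper: extracts the URL tagged rel="next" from a Link
/// header value of the form `<url>; rel="next", <url>; rel="prev"`.
/// Assumption: target URLs contain no commas.
pub fn next_link(header: &str) -> Option<String> {
    for part in header.split(',') {
        let mut pieces = part.split(';');
        let target = pieces.next().unwrap_or("").trim();
        let is_next = pieces.any(|p| {
            let p = p.trim();
            p == "rel=\"next\"" || p == "rel=next"
        });
        if !is_next {
            continue;
        }
        // Strip the angle brackets around the URL.
        if let Some(url) = target.strip_prefix('<').and_then(|t| t.strip_suffix('>')) {
            return Some(url.to_string());
        }
    }
    None
}

/// Hypothetical helper: picks the URL of the next page. Prefers the Link
/// header; otherwise builds `?max_id=<id of the oldest post seen so far>`.
/// Assumption: `base_url` has no query string of its own.
pub fn next_page_url(
    link_header: Option<&str>,
    base_url: &str,
    last_id: Option<&str>,
) -> Option<String> {
    link_header
        .and_then(next_link)
        .or_else(|| last_id.map(|id| format!("{base_url}?max_id={id}")))
}
```

Returning None when neither source yields a URL gives the fetch loop a natural stopping condition once a timeline is exhausted.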

- `niinku ingest mastodon` is the user-facing wrapper around mastodon_ingest::fetch_and_count. Flags: --instance (default mastodon.social), --language (default fi), --tags (comma-separated, default suomi,finland,helsinki), --count (default 1000 posts), --output (default data/cached/mastodon-fi.txt). Logs progress to stderr.
- `niinku assemble` now picks up data/cached/mastodon-fi.txt automatically when present; --no-mastodon to skip.
- `just ingest-mastodon` recipe with COUNT/INSTANCE env overrides.

Local smoke test against mastodon.social: 200 posts across the three default hashtags produced 1585 unique tokens (raw post text discarded). End-to-end run including Mastodon: 50011 merged tokens → 7441 entries after Voikko filter → 45K puhekieli_fi.dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
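
As a rough picture of what picking up the cache file involves, the sketch below parses `word count` lines and merges two sources by summing counts. Several things here are assumptions rather than facts from this PR: that the two fields are whitespace-separated, that malformed lines are skipped silently, and that merging is a plain sum. A HashMap stands in for the project's own frequency-table type, and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical parser for a cache file of `word count` lines.
/// Assumptions: whitespace-separated fields; malformed lines are skipped.
pub fn parse_freq_lines(input: &str) -> HashMap<String, u64> {
    let mut table = HashMap::new();
    for line in input.lines() {
        let mut fields = line.split_whitespace();
        let (Some(word), Some(count)) = (fields.next(), fields.next()) else {
            continue; // blank line or missing count
        };
        let Ok(count) = count.parse::<u64>() else {
            continue; // count is not a non-negative integer
        };
        *table.entry(word.to_string()).or_insert(0) += count;
    }
    table
}

/// Hypothetical merge step: adds the counts from `other` into `into`.
pub fn merge_into(into: &mut HashMap<String, u64>, other: &HashMap<String, u64>) {
    for (word, count) in other {
        *into.entry(word.clone()).or_insert(0) += count;
    }
}
```

Because the parser only needs the file contents, a step like this can run offline, which is the property the cache-adapter design is meant to preserve for Stage B.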

Reflects that the OpenSubtitles, Urbaani, and Mastodon adapters are now live and that only Suomi24 remains stubbed. Adds the `just ingest-mastodon` recipe to the quick-start list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)

### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>

laurigates and others added 3 commits May 27, 2026 16:18

laurigates added the enhancement New feature or improvement label May 27, 2026

laurigates self-assigned this May 27, 2026

laurigates merged commit f773873 into main May 27, 2026
1 check passed

laurigates deleted the feat/mastodon-ingest branch May 27, 2026 13:20

laurigates-release-please Bot mentioned this pull request Jun 3, 2026

chore(main): release 0.1.0 #10

Merged

laurigates merged 3 commits into main from feat/mastodon-ingest
