feat(sources): Mastodon hashtag-timeline ingestion#4

Merged
laurigates merged 3 commits into main from feat/mastodon-ingest
May 27, 2026

@laurigates

Owner

Summary

Adds the third corpus source: live Finnish posts from Mastodon hashtag timelines. With this PR, only Suomi24 remains stubbed.

  • niinku-sources::tokenize: pure strip_html + tokenize_for_freq helpers (HTML stripping, mention/hashtag/URL pre-strip, lowercased Unicode word segmentation, drop single-chars + pure-numeric tokens including decimals).
  • niinku-sources::mastodon: cache-file source adapter (mirrors OpenSubtitles).
  • niinku-sources::mastodon_ingest: network code. Pulls posts from one or more hashtag timelines (/api/v1/timelines/tag/:hashtag), filters by Status.language, accumulates counts, discards raw text. Uses hashtag streams in preference to /timelines/public because modern Mastodon (4.x+) requires auth for the public timeline while hashtag streams remain open. Pagination via Link: rel="next" header with max_id fallback. Sync via ureq — no async runtime added.
  • niinku assemble picks up data/cached/mastodon-fi.txt automatically (--no-mastodon to skip).
  • niinku ingest mastodon is the user-facing wrapper. Defaults: --instance mastodon.social --language fi --tags suomi,finland,helsinki --count 1000.
  • just ingest-mastodon with COUNT= / INSTANCE= env overrides.
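
The tokenisation rules in the first bullet can be sketched roughly as below. This is a std-only approximation, not the PR's code: the real helper segments with unicode-segmentation, whereas this sketch splits on any non-alphanumeric character, so it breaks a token like yle.fi into two words where the real segmenter appears to keep it whole. The names `tokenize_sketch` and `count_tokens` are hypothetical.

```rust
use std::collections::HashMap;

/// True for whitespace-delimited chunks dropped before word splitting:
/// @mentions, #hashtags, and http(s) URLs.
fn is_prestrip_chunk(chunk: &str) -> bool {
    chunk.starts_with('@')
        || chunk.starts_with('#')
        || chunk.starts_with("http://")
        || chunk.starts_with("https://")
}

/// Hypothetical stand-in for tokenize_for_freq (assumption: the real
/// function uses unicode-segmentation rather than this simple split).
pub fn tokenize_sketch(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for chunk in text.split_whitespace().filter(|c| !is_prestrip_chunk(c)) {
        let lowered = chunk.to_lowercase();
        // Split on every character that is neither a letter nor a digit.
        for word in lowered.split(|c: char| !c.is_alphanumeric()) {
            let too_short = word.chars().count() < 2;
            let numeric = word.chars().all(|c| c.is_numeric());
            if !too_short && !numeric {
                tokens.push(word.to_string());
            }
        }
    }
    tokens
}

/// Accumulates token counts across posts; only the counts are kept,
/// the post text itself is not stored.
pub fn count_tokens(posts: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for post in posts {
        for token in tokenize_sketch(post) {
            *counts.entry(token).or_insert(0) += 1;
        }
    }
    counts
}
```

Because this sketch splits decimals at the dot, 3.14 becomes a single-character token plus a pure-numeric one, and both are dropped. The real helper reaches the same result by rejecting the intact decimal.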

Local smoke test

$ COUNT=200 just ingest-mastodon
ingesting up to 200 posts from mastodon.social hashtag streams [suomi, finland, helsinki] (language=fi)
  fetched 40/200
  fetched 80/200
  fetched 120/200
  fetched 160/200
  fetched 200/200
tokenised 1585 unique tokens (raw post text discarded)
wrote data/cached/mastodon-fi.txt

End-to-end (just generate including Mastodon):

  • 50011 merged tokens (OpenSubtitles + Mastodon)
  • 7441 entries after Voikko (+6 net new vs OpenSubtitles alone — small contribution because the smoke-test sample was only 200 posts)
  • 45K puhekieli_fi.dict produced

Test plan

  • just lint (fmt + clippy -D warnings) — green
  • just test (12 pipeline + 18 sources tests; new tests cover tokenize edge cases, Link header parsing, FreqTable serialisation, mastodon cache adapter) — green
  • Live just ingest-mastodon against mastodon.social — produces the cache file
  • Full just generate end-to-end — .dict produced

Known noise (filed as followup)

The smoke-test cache file shows three classes of noise that survive tokenisation:

  1. Bare domain names like yle.fi, dawn.fi (my URL filter only catches tokens prefixed with http:// or https://).
  2. The literal token https — appears when a URL splits across whitespace.
  3. Hashtag-concatenated words like digitaalinenitsenäisyys — Mastodon's hashtag HTML sometimes renders without # in the visible text (e.g. <a class="hashtag">DigitaalinenItsenäisyys</a>), so my #-prefix filter misses them.

Fixable by a follow-up PR — either parsing HTML structurally (skip text inside <a class="hashtag"> / <a rel="tag">), or a domain-pattern post-filter. Not in scope here.
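
As a rough idea of what the domain-pattern post-filter option could look like, here is a minimal sketch. It only addresses noise classes 1 and 2; hashtag-concatenated words would still need the structural HTML approach. The function names and the short TLD list are assumptions made for illustration, not part of this PR.

```rust
/// Hypothetical, deliberately small TLD list; a real filter would need a
/// fuller list or a different heuristic.
const TLDS: &[&str] = &["fi", "com", "org", "net", "io", "social"];

/// True when the token looks like a bare domain such as "yle.fi":
/// two or more non-empty ASCII labels separated by dots, ending in a
/// TLD from the list above.
pub fn looks_like_domain(token: &str) -> bool {
    let labels: Vec<&str> = token.split('.').collect();
    if labels.len() < 2 {
        return false;
    }
    let labels_ok = labels.iter().all(|label| {
        !label.is_empty()
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    });
    let tld = labels[labels.len() - 1];
    labels_ok && TLDS.contains(&tld)
}

/// True for tokens the post-filter would drop: stray URL schemes and
/// bare domain names.
pub fn is_noise(token: &str) -> bool {
    token == "http" || token == "https" || looks_like_domain(token)
}
```

A filter like this would run over the token list after tokenisation, for example `tokens.retain(|t| !is_noise(t))`.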

Remaining deferred

🤖 Generated with Claude Code

laurigates and others added 3 commits May 27, 2026 16:18
Three new modules in niinku-sources:

- tokenize: pure functions used by live-source ingestion.
  strip_html(s) handles Mastodon's server-sanitised HTML (tag stripping
  + the entity set Mastodon actually emits). tokenize_for_freq(s)
  pre-strips @mentions / #hashtags / URLs (which unicode_words would
  otherwise turn into bare words after dropping the prefix), then
  lowercases and word-segments with unicode-segmentation. Drops
  single-char and pure-numeric tokens (including decimals like 3.14
  that unicode_words preserves intact). Does not try to detect
  language — Voikko handles that downstream.

- mastodon (cache adapter): Mirrors the OpenSubtitles adapter — reads
  `word count` lines from data/cached/mastodon-fi.txt. Lets Stage B
  (assemble) stay offline + testable.

- mastodon_ingest (Stage A network code): pulls posts from hashtag
  timelines on a target instance (one or more tags in a single run),
  filters by Status.language, tokenises, accumulates counts, discards
  raw text per README's privacy default. Uses hashtag streams in
  preference to /timelines/public because modern Mastodon (4.x+)
  requires authentication for the public timeline while
  /timelines/tag/:hashtag remains open on most instances. Pagination
  follows the Link header with a max_id fallback. Sync via ureq —
  no async runtime added.
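
For illustration, the pagination step (follow the Link header, otherwise fall back to max_id) might look something like the sketch below. The function names and the exact fallback behaviour are assumptions rather than the code in this PR. The parser splits naively on commas, so it would mishandle a URL that itself contains a comma.

```rust
/// Extracts the URL tagged rel="next" from a Link header value such as
/// `<https://host/path?max_id=1>; rel="next", <https://host/path?min_id=2>; rel="prev"`.
pub fn next_link(header: &str) -> Option<String> {
    for part in header.split(',') {
        let mut pieces = part.split(';');
        // The first piece is the <url> target; the rest are parameters.
        let target = pieces.next().unwrap_or("").trim();
        let is_next = pieces.any(|p| {
            let p = p.trim();
            p == "rel=\"next\"" || p == "rel=next"
        });
        if is_next {
            let url = target.strip_prefix('<').and_then(|t| t.strip_suffix('>'));
            if let Some(url) = url {
                return Some(url.to_string());
            }
        }
    }
    None
}

/// Chooses the next page URL: the Link header wins; otherwise fall back
/// to `max_id` built from the id of the oldest status seen so far.
/// `base` is assumed to be the timeline URL without a query string.
pub fn next_page_url(
    base: &str,
    link_header: Option<&str>,
    last_status_id: Option<&str>,
) -> Option<String> {
    if let Some(url) = link_header.and_then(|h| next_link(h)) {
        return Some(url);
    }
    last_status_id.map(|id| format!("{base}?max_id={id}"))
}
```

Returning `None` when neither source is available gives the fetch loop a natural stopping point.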

New workspace deps: serde, serde_json, ureq (rustls TLS only),
unicode-segmentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- `niinku ingest mastodon` is the user-facing wrapper around
  mastodon_ingest::fetch_and_count. Flags: --instance (default
  mastodon.social), --language (default fi), --tags (comma-separated,
  default suomi,finland,helsinki), --count (default 1000 posts),
  --output (default data/cached/mastodon-fi.txt). Logs progress to
  stderr.
- `niinku assemble` now picks up data/cached/mastodon-fi.txt
  automatically when present; --no-mastodon to skip.
- `just ingest-mastodon` recipe with COUNT/INSTANCE env overrides.
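
The cache file that `niinku assemble` picks up is described in this PR as `word count` lines. A reader and writer for that format could be sketched as follows, assuming a whitespace separator, one entry per line, and that duplicate words are summed. The function names are hypothetical and the actual FreqTable serialisation may differ.

```rust
use std::collections::BTreeMap;

/// Parses `word count` lines into a frequency table.
/// Assumptions: whitespace-separated fields, blank lines ignored,
/// repeated words summed.
pub fn parse_cache(contents: &str) -> Result<BTreeMap<String, u64>, String> {
    let mut table = BTreeMap::new();
    for (idx, line) in contents.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let mut fields = line.split_whitespace();
        // Require exactly two fields: the word and its count.
        let (word, count) = match (fields.next(), fields.next(), fields.next()) {
            (Some(w), Some(c), None) => (w, c),
            _ => return Err(format!("line {}: expected `word count`", idx + 1)),
        };
        let count: u64 = count
            .parse()
            .map_err(|e| format!("line {}: bad count: {e}", idx + 1))?;
        *table.entry(word.to_string()).or_insert(0) += count;
    }
    Ok(table)
}

/// Writes the table back out, one `word count` line per entry, in the
/// BTreeMap's sorted key order.
pub fn serialize_cache(table: &BTreeMap<String, u64>) -> String {
    table.iter().map(|(w, c)| format!("{w} {c}\n")).collect()
}
```

Keeping the parser free of I/O means it can be tested offline on string fixtures.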

Local smoke test against mastodon.social: 200 posts across the three
default hashtags produced 1585 unique tokens (raw post text discarded).
End-to-end run including Mastodon: 50011 merged tokens → 7441 entries
after Voikko filter → 45K puhekieli_fi.dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reflects that the OpenSubtitles, Urbaani, and Mastodon adapters are
now live and that only Suomi24 remains stubbed. Adds the
`just ingest-mastodon` recipe to the quick-start list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@laurigates laurigates added the enhancement New feature or improvement label May 27, 2026
@laurigates laurigates self-assigned this May 27, 2026
@laurigates laurigates merged commit f773873 into main May 27, 2026
1 check passed
@laurigates laurigates deleted the feat/mastodon-ingest branch May 27, 2026 13:20
laurigates pushed a commit that referenced this pull request Jun 3, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.1...v0.1.0) (2026-06-03)


### Features

* **cli:** wire `niinku assemble` end-to-end with opensubtitles + urbaani ([#1](#1)) ([72ab50d](72ab50d))
* **curated:** expand puhekieli seed list (+138) and add dedup test ([#8](#8)) ([d53c5ab](d53c5ab))
* emit .combined header and compile to .dict via dicttool ([#3](#3)) ([f2c283d](f2c283d))
* **pipeline:** Voikko-based kirjakieli filter ([#2](#2)) ([ae29987](ae29987))
* **sources:** Mastodon hashtag-timeline ingestion ([#4](#4)) ([f773873](f773873))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: laurigates-release-please[bot] <272124289+laurigates-release-please[bot]@users.noreply.github.com>