A toy archiver and read-only web UI for public-inbox v2 mailing list archives. Out of the box it indexes the Linux Kernel Mailing List and linux-fsdevel, but any list published by public-inbox works. The displayed site name is configurable; "mimir" appears only as the page generator.
mimir targets a personal or small-team archive deployment. The defaults assume:
- Single host, single SQLite file. The web side scales fine behind a CDN / reverse proxy; writes (ingest) need to be serialized to one process at a time. No Postgres path; SQLite handles the lkml-scale corpus comfortably.
- Multi-million-message scale. Tested on the full lkml corpus (~6 M articles, ~3.6 GB DB on disk). Comfortable on a laptop; growing past ~50 M would warrant revisiting SQLite.
- Single-user ingest at a time. mimir update and mimir ingest are not safe to run concurrently against the same DB. Multiple readers (web server + warm-cache cron) are fine; WAL mode handles that.
- Append-only upstreams. public-inbox v2 commits are append-only by design; mimir's "no updates ever" rule for existing rows assumes that. If an upstream rewrites history, you'll need to wipe and re-ingest.
- Mirrors stay on disk. The git mirror is the source of truth; re-ingesting is cheap, re-cloning isn't (~20 GB and hours for the full lkml archive).
- Walks one or more public-inbox v2 epoch repositories (0.git, 1.git, …), where each commit's tree contains a single m blob holding the raw RFC 5322 bytes of one message.
- Parses each message with the stdlib email API under policy.default: proper handling of MIME multiparts, RFC 2231 filenames, RFC 2047 encoded headers, and the like.
- Treats the public-inbox mirror as the source of truth. SQLite is a lean index that records, per message, only what's needed to find and display it: message_id + threading hints + a few display fields (subject, author, date). Body, full headers, and attachment bytes are not duplicated into SQLite; they're re-parsed from the git blob on demand.
- Cross-posted messages dedupe. A message that appears in both lkml and linux-fsdevel produces one articles row plus one article_lists row per inbox.
- Re-runs are incremental: only new commits since the last recorded HEAD SHA per (inbox, epoch) are visited.
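As a rough illustration of the parse step described above, the sketch below feeds raw message bytes through the stdlib parser under policy.default and derives the few index fields. The function name summarize, the returned dict shape, and the sample message are assumptions made for this example; mimir's actual parse_message may differ.

```python
from email import policy
from email.parser import BytesParser


def summarize(raw: bytes) -> dict:
    """Parse raw RFC 5322 bytes and return a few index fields (illustrative)."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    def ids(header: str) -> list[str]:
        # Message-ID-style headers hold whitespace-separated <id> tokens.
        return [tok.strip("<>") for tok in str(msg.get(header, "")).split()]

    in_reply_to = ids("In-Reply-To")
    references = ids("References")
    own = ids("Message-ID")
    # Best-guess parent: In-Reply-To if present, else the last References entry.
    if in_reply_to:
        parent = in_reply_to[0]
    elif references:
        parent = references[-1]
    else:
        parent = None
    return {
        "message_id": own[0] if own else None,
        # policy.default decodes RFC 2047 encoded words when the header is read.
        "subject": str(msg.get("Subject", "")),
        "author": str(msg.get("From", "")),
        "thread_parent": parent,
    }


# Hypothetical sample message, not taken from any real archive.
SAMPLE = (
    b"Message-ID: <child@example.com>\r\n"
    b"References: <root@example.com> <mid@example.com>\r\n"
    b"From: Jane Doe <jane@example.com>\r\n"
    b"Subject: =?utf-8?q?caf=C3=A9?= patch\r\n"
    b"\r\n"
    b"body\r\n"
)

info = summarize(SAMPLE)
print(info["message_id"], info["thread_parent"])
```

Only these lightweight fields would need to live in the index; body and attachments can be re-derived from the same bytes whenever a message is displayed.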
The read path costs roughly 2 ms per message (SQL lookup + dulwich blob fetch + parse). The mirror must be present on disk at read time, not just at ingest time.
- Python 3.14 (declared in .python-version)
- uv for dependency management
uv sync
uv run alembic upgrade head

Then create a .env in the project root. Minimum:
SECRET_KEY=<run: python -c "import secrets; print(secrets.token_hex(32))">
Defaults baked into mimir/config.py:
- DATABASE_URL: sqlite:///<project_root>/mimir.db. Override per-deployment, e.g. DATABASE_URL=sqlite:////data/mimir.db for a container with a persistent volume.
- SITE_NAME: mimir. The displayed brand in titles, the nav, and the / heading; set this to whatever you want the public site to be called.
- INBOXES: a JSON map of {name: {mirror_path, upstream_url}}. Defaults cover lkml and linux-fsdevel under Inboxes/<name>/git. See mimir/config.py for the exact shape; override via env, e.g. INBOXES='{"lkml": {"mirror_path": "...", "upstream_url": "..."}}'.
- EMAIL_ALLOWLIST: a list of substrings; email addresses containing one of them display in full, and everyone else gets <hidden>.
- Per-inbox author trackers (each renders as a tile on the inbox dashboard) are mutated via mimir admin inbox trackers {set,add,remove,clear} <inbox> [<label>=<email-substring>]. State lives on Inbox.tracked_authors (JSON column), not in env, so each inbox can carry its own list and admin edits survive restarts.
- SECURITY_CONTACT: enables /security.txt and /.well-known/security.txt (RFC 9116). Typical value: mailto:security@example.com. Without it, both routes 404, which is better than serving a contact-less file. Optional companions: SECURITY_POLICY_URL, SECURITY_ENCRYPTION_URL, SECURITY_PREFERRED_LANGUAGES (default en). The Expires: field is computed at request time as now + 1 year, so there's no annual rotation chore.
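To make the request-time Expires computation concrete, here is a minimal sketch of how a security.txt body could be assembled from those settings. The function name, its parameters, the field order, and the 365-day approximation of "one year" are assumptions of this example, not mimir's actual code; the field names themselves come from RFC 9116.

```python
from datetime import datetime, timedelta, timezone


def render_security_txt(contact, now, policy_url=None, encryption_url=None,
                        preferred_languages="en"):
    """Return a security.txt body, or None when no contact is configured."""
    if not contact:
        return None  # the caller would answer 404 instead of a contact-less file
    # "now + 1 year" is approximated as 365 days here (assumption of this sketch).
    expires = (now + timedelta(days=365)).astimezone(timezone.utc)
    lines = [
        f"Contact: {contact}",
        f"Expires: {expires.strftime('%Y-%m-%dT%H:%M:%SZ')}",
    ]
    if encryption_url:
        lines.append(f"Encryption: {encryption_url}")
    if policy_url:
        lines.append(f"Policy: {policy_url}")
    if preferred_languages:
        lines.append(f"Preferred-Languages: {preferred_languages}")
    return "\n".join(lines) + "\n"


body = render_security_txt(
    "mailto:security@example.com",
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(body)
```

Because the expiry is derived from the current time on each request, the served file never goes stale while the process is running.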
Settings.inboxes (env) is the bootstrap source: each entry
guarantees an inboxes row exists in the DB on first start, but env
never overwrites existing rows on subsequent boots, so admin edits to
mirror_path / upstream_url survive restarts.
Each list on lore.kernel.org is published as one git repo per epoch. The easiest way to set things up is to let mimir do it for you:
uv run mimir update # all configured inboxes
uv run mimir update --inbox lkml # one specific inbox
uv run mimir update --skip-clone # only fetch updates on existing epochs
uv run mimir update --skip-fetch # only discover/clone new epochs
uv run mimir update --skip-ingest # download but don't index

(All mimir <cmd> invocations are also reachable as
flask --app mimir <cmd>; both share the same Click commands.
Pick whichever reads better in your scripts.)
For each inbox update fetches the upstream manifest.js.gz, runs
git clone --mirror -- <url> on any epoch missing locally, runs
git fetch --prune on the ones already present, and then ingests
new commits, all in one shot.
Each epoch is roughly 1 GB and holds several hundred thousand messages, so a fresh clone of all of lkml (currently 19 epochs, ~6M messages) takes a while and needs ~20 GB of disk. linux-fsdevel is about an order of magnitude smaller.
If you'd rather drive it manually:
mkdir -p Inboxes/lkml/git && cd Inboxes/lkml/git
git clone --mirror -- https://lore.kernel.org/lkml/git/0.git 0.git
git clone --mirror -- https://lore.kernel.org/lkml/git/1.git 1.git
# … and so on

uv run mimir ingest # walk every configured inbox (parallel by default)
uv run mimir ingest --inbox lkml # only one inbox
uv run mimir ingest --limit 500 # cap for testing
uv run mimir ingest --workers 1 # force sequential (debug)
uv run mimir ingest -v # progress every 100 msgs
uv run mimir ingest -vv # one log line per message

Parsing runs in a ProcessPoolExecutor (defaults to
os.cpu_count()), with the main process collecting results in
commit order and doing the SQL writes. The walker, dedup, batched
commits, and per-(inbox, epoch) IngestState checkpoints are
unaffected; parallelism is confined to the CPU-bound
parse_message stage.
To inspect a single message (smoke test for the git-backed read path):
uv run mimir show '<message-id-without-angle-brackets>'
uv run mimir show '...' --inbox lkml # read the blob from this inbox's mirror
uv run mimir show '...' --body-chars -1 # full body, no truncation

By default the ingest is quiet apart from the per-epoch summary line. Parse failures are surfaced as warnings at any verbosity level.
To re-walk a single epoch, e.g. to backfill messages that failed under an older parser version:
uv run mimir reindex lkml 0.git # rewind state, re-walk; dedup skips existing
uv run mimir reindex lkml 0.git --from-scratch # also DELETE this inbox's links to that epoch first

Output is one line per epoch, e.g.:
lkml/0.git: new=500 linked=0 dup_batch=0 dup_db=0 failed=0 head=8f282234b668f51b884f3140adf1947d95e32ce7
Every commit lands in exactly one bucket: new (Article inserted),
linked (Article already existed in another inbox, added a new
article_lists row, i.e. a cross-post), dup_batch (same Message-ID
seen earlier in the current uncommitted batch), dup_db (Article
already in DB and already linked to this inbox, re-walks land
here), or failed (parse_message raised).
The default form is non-destructive: existing rows are left alone
and only previously-failed (or genuinely new) messages get inserted.
--from-scratch deletes the per-inbox article_lists rows
pointing at this epoch first; the articles themselves stay (a
cross-post may still be linked from another inbox).
| Situation | What happens |
|---|---|
| Same Message-ID seen across epochs | Counted in dup_db (DB-level check) |
| Same Message-ID twice within one walk | Counted in dup_batch (in-batch set) |
| Cross-post: Message-ID seen in another inbox | Article reused; one new article_lists row added (counts as linked) |
| Existing article with the same Message-ID | Left untouched, no updates, ever |
| parse_message raises | Counted in failed; row recorded in parse_failures; SHA still advances |
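The bucketing rules in the table above can be sketched as a small pure function. This is a simplified model under stated assumptions: the in-memory dict db_links stands in for the articles / article_lists tables, the order of the checks is a guess, and the failed bucket (a parse error raised before classification) is left out. The name classify is hypothetical.

```python
def classify(message_id, inbox, batch_seen, db_links):
    """Return the bucket for one successfully parsed commit.

    batch_seen: set of Message-IDs seen in the current uncommitted batch.
    db_links:   dict mapping Message-ID -> set of inbox names already linked
                (a stand-in for the articles + article_lists tables).
    """
    if message_id in batch_seen:
        return "dup_batch"          # same Message-ID earlier in this batch
    batch_seen.add(message_id)
    linked = db_links.get(message_id)
    if linked is None:
        db_links[message_id] = {inbox}
        return "new"                # Article inserted
    if inbox in linked:
        return "dup_db"             # already in DB and linked to this inbox
    linked.add(inbox)
    return "linked"                 # cross-post: add one article_lists row


links = {}
print(classify("a@example.com", "lkml", set(), links))           # first sighting
print(classify("a@example.com", "linux-fsdevel", set(), links))  # cross-post
print(classify("a@example.com", "lkml", set(), links))           # re-walk
```

Note that the existing article is never modified in any branch, which mirrors the "left untouched" row of the table.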
The "no updates, ever" stance assumes the underlying archive is
immutable (public-inbox commits are append-only). If you want to
retry a previously failed parse, for example after fixing a parser
bug, see "Replaying parse failures" below, or wipe / rewind
ingest_state.last_commit_sha for that (inbox, epoch).
Every commit whose m blob raises in parse_message is recorded in
parse_failures keyed by (inbox, epoch, commit_sha) with the
exception class, message, attempt count, and timestamps. A re-walk
that parses the same commit cleanly clears the row automatically.
To enumerate or replay without re-walking the whole epoch:
mimir admin failures list # all
mimir admin failures list --inbox lkml --error-class ValueError
mimir admin failures replay lkml # re-parse all of lkml's failures
mimir admin failures replay lkml --epoch 0.git --limit 100

replay re-fetches each failure's blob from the mirror, re-runs the
parser, and either inserts the article (success → row deleted) or
bumps attempts + last_attempt (still failing → row kept). Skipped
rows mean the commit or m blob is no longer in the mirror.
Settings.inboxes (env) seeds the inboxes table on first
startup, but you can also create / modify / delete inboxes
directly. The CLI is the front-end to a service layer in
mimir.inboxes; the future Flask admin UI will call the same
functions.
mimir admin inbox list
mimir admin inbox show <name>
mimir admin inbox add <name> [--mirror-path PATH] [--upstream-url URL]
mimir admin inbox update <name> [--mirror-path P] [--upstream-url U] [--rename NEW]
mimir admin inbox remove <name> [--keep-orphan-articles] [--remove-inbox-data] [--yes]

Validation is enforced at the service layer:
- <name> must match ^[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?$. The name flows into URL paths and cache-key fragments, so it has to be lowercase alphanumeric with hyphens, no leading/trailing hyphens, ≤64 chars.
- <upstream_url> must be https:// with a non-empty host.
- <mirror_path> must be a non-empty string. The directory is allowed to not exist yet; mimir update --inbox <name> will create it on first clone.
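The three validation rules can be expressed in a few lines of stdlib Python. The regular expression is the one quoted above; the function name validate_inbox and the error strings are hypothetical and only illustrate the checks, not the actual service-layer API in mimir.inboxes.

```python
import re
from urllib.parse import urlsplit

# Pattern quoted from the validation rule above.
NAME_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?$")


def validate_inbox(name, mirror_path, upstream_url):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # fullmatch avoids '$' accepting a trailing newline.
    if not NAME_RE.fullmatch(name):
        errors.append("name must be lowercase alphanumeric/hyphen, <=64 chars")
    parts = urlsplit(upstream_url)
    if parts.scheme != "https" or not parts.hostname:
        errors.append("upstream_url must be https:// with a non-empty host")
    if not isinstance(mirror_path, str) or not mirror_path.strip():
        errors.append("mirror_path must be a non-empty string")
    return errors


print(validate_inbox("linux-arm-kernel", "Inboxes/linux-arm-kernel/git",
                     "https://lore.kernel.org/linux-arm-kernel"))
print(validate_inbox("-bad-", "", "http://lore.kernel.org/x"))
```

The path check deliberately does not touch the filesystem, since the directory may not exist until the first clone.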
add only inserts the row. With just a name, it defaults to
Inboxes/<name>/git on disk and https://lore.kernel.org/<name>
upstream, matching the conventional lore.kernel.org public-inbox
layout. Pass --mirror-path and/or --upstream-url to override
either independently. To actually populate the inbox:
mimir admin inbox add linux-arm-kernel # defaults to lore.kernel.org
mimir update --inbox linux-arm-kernel # clone the mirror + ingest

remove cascade-deletes via FK ON DELETE CASCADE: the inbox's
article_lists and ingest_state rows go with it. By default it
also drops articles left without remaining links; cross-posts to
other inboxes survive untouched. --keep-orphan-articles opts out.
--remove-inbox-data additionally rm -rf's the on-disk
public-inbox mirror at <mirror_path>. Permanent, re-cloning all
of lkml takes hours and ~20 GB. The command prompts for explicit
confirmation; use --yes to skip both the DB-removal prompt and
the on-disk-removal prompt in a script.
update --rename and remove invalidate the cache rows that
reference the affected name (mimir.cache.delete_for_inbox) so
subsequent reads don't return stale entries pointing at a
now-defunct inbox.
/robots.txt is rendered from the robots_rules table on every
request. The migration seeds a * stanza with the previous
hardcoded values plus Cloudflare-style Content-Signal defaults
(Crawl-delay: 5, Disallow: /*/attachment/,
Content-Signal: search=yes, ai-input=no, ai-train=no), so a
fresh deploy serves the structurally-same body it always did
with an additional Content-Signal line. Operators mutate the
table via:
mimir admin robots list
mimir admin robots show <ua>
mimir admin robots add <ua> [--disallow PATH ...] [--crawl-delay SECS] \
[--content-signal KEY=VALUE ...]
mimir admin robots update <ua> [--add-disallow PATH ...] [--remove-disallow PATH ...] \
[--crawl-delay SECS | --clear-crawl-delay] \
[--set-content-signal KEY=VALUE ...] \
[--clear-content-signal KEY ...] \
[--clear-all-content-signals]
mimir admin robots remove <ua>
mimir admin robots reset --yes

Common shapes:
- Block a bot entirely: mimir admin robots add GPTBot --disallow /
- Add an extra Disallow path to the default stanza: mimir admin robots update '*' --add-disallow '/private/'
- Tune the global crawl delay: mimir admin robots update '*' --crawl-delay 10
- Flip the AI-training consent on the default stanza: mimir admin robots update '*' --set-content-signal ai-train=yes
Each stanza can carry the
Cloudflare-proposed Content-Signal directive
expressing search / AI-input / AI-training consent. Valid keys
are search, ai-input, ai-train; values are yes or no;
omitting a key expresses no preference. The migration seeds the
* stanza with search=yes, ai-train=no, ai-input=no, matching
the "redaction is a friction layer" posture documented in
CONTEXT.md.
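For illustration, a single stanza with a Content-Signal line could be rendered roughly as follows. The function name render_stanza, the order of the emitted fields, and the error messages are assumptions of this sketch; only the valid keys and values come from the description above.

```python
VALID_KEYS = ("search", "ai-input", "ai-train")
VALID_VALUES = ("yes", "no")


def render_stanza(user_agent, disallow=(), crawl_delay=None, content_signal=None):
    """Render one robots.txt stanza as text (field order is an assumption)."""
    lines = [f"User-agent: {user_agent}"]
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    lines.extend(f"Disallow: {path}" for path in disallow)
    if content_signal:
        for key, value in content_signal.items():
            if key not in VALID_KEYS:
                raise ValueError(f"unknown Content-Signal key: {key}")
            if value not in VALID_VALUES:
                raise ValueError(f"Content-Signal value must be yes or no: {value}")
        pairs = ", ".join(f"{k}={v}" for k, v in content_signal.items())
        lines.append(f"Content-Signal: {pairs}")
    return "\n".join(lines) + "\n"


print(render_stanza(
    "*",
    disallow=["/*/attachment/"],
    crawl_delay=5,
    content_signal={"search": "yes", "ai-input": "no", "ai-train": "no"},
))
```

Omitting a key from content_signal simply leaves it out of the line, which matches the "no preference" semantics.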
When any rule carries Content-Signal directives, the rendered
/robots.txt is prefixed with an explanatory preamble: a glossary
of what each signal means and a reservation of the operator's
rights in the compilation (index, deduplication, threading,
rendering) under EU Directive 96/9/EC on the legal protection of
databases. The preamble is suppressed when no rule has signals.
Copyright in individual messages is unaffected and belongs to
their authors.
The * stanza is structural; remove '*' is refused. Use
reset to restore the seeded defaults if a * mutation has
wandered.
inboxes
id, name (UNIQUE), -- name is the URL slug
mirror_path, upstream_url
articles
id, message_id (UNIQUE),
subject, author, date, -- for listings; date indexed
thread_parent, -- best-guess parent (in_reply_to OR refs[-1]); indexed
subject_normalized -- prefixes stripped, lowercased; indexed for JWZ subject grouping
article_lists -- per-inbox presence; cross-posts get N rows
article_id (FK → articles.id, ON DELETE CASCADE),
inbox_id (FK → inboxes.id, ON DELETE CASCADE),
epoch, commit_sha, -- pointer back to the public-inbox blob in *this* inbox's mirror
PRIMARY KEY (article_id, inbox_id)
ingest_state
inbox_id (FK → inboxes.id, ON DELETE CASCADE),
epoch,
last_commit_sha,
PRIMARY KEY (inbox_id, epoch)
cache -- DB-backed cache for slow dashboard queries
key (PK), value (JSON), expires_at (indexed)
parse_failures -- one row per (inbox, epoch, commit_sha) whose blob couldn't be parsed
inbox_id (FK → inboxes.id, ON DELETE CASCADE),
epoch, commit_sha,
error_class (indexed), error_message,
first_seen, last_attempt, attempts,
PRIMARY KEY (inbox_id, epoch, commit_sha)
mimir.store.read_message(session, inbox, message_id) is the
canonical read path: looks up (epoch, commit_sha) for the message
in the given inbox, opens the dulwich repo, fetches the blob, runs
parse_message to return a ParsedArticle with body, full headers,
and attachment bytes.
SQLite runs in WAL mode with synchronous=NORMAL and
foreign_keys=ON, set on every connection from
mimir/extensions.py.
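The same three pragmas can be tried out with nothing but the stdlib sqlite3 module. This sketch only shows the pragma sequence on a plain connection; the helper name connect is made up for the example, and how mimir/extensions.py hooks the pragmas into SQLAlchemy is not shown here.

```python
import sqlite3


def connect(path):
    """Open a SQLite connection and apply the per-connection pragmas."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL syncs less often than FULL; commonly paired with WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    # Foreign keys are off by default in SQLite and must be enabled per connection.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn
```

journal_mode=WAL is persisted in the database file, while synchronous and foreign_keys are connection-scoped, which is why they need to be set on every new connection.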
Models are defined as SQLAlchemy 2.0 typed Mapped[] classes in
mimir/models.py. Migrations live under alembic/versions/.
mimir/
__init__.py Flask app factory; read-only at the SQLite layer
(the broker bootstraps inboxes on its own startup
since 2.0.0).
cli/ Click commands, one submodule per concern group:
initdb, ingest, mainline, backfill, show, cache,
maintenance, devseed, bootstrap, broker,
admin/{inbox,failures,canonicals}.
register_cli(app) wires them onto Flask's cli.
config.py pydantic-settings Settings class + PROJECT_ROOT
extensions.py SQLAlchemy engine + WAL pragmas, sessionmaker, Base
inboxes.py Inbox lifecycle: bootstrap from env, mutate via admin,
expose via nav-name cache; shared validators.
ingest/ Ingest pipeline split by flow: epoch (hot per-epoch
walk + shared helpers), replay (parse-failures
re-walk), backfill (canonical-inbox resolution),
orchestrate (per-inbox + cross-inbox drivers).
models.py All ORM tables (Inbox, Article, ArticleList,
IngestState, Subsystem, MainlineCommit, etc.).
parser.py pydantic DTOs + BytesParser-based MIME extraction
rendering/ body→HTML pipeline split by concern: blocks
(segmentation), diff (per-hunk anchors +
per-language Pygments overlay), linkify
(URL / Message-ID + DCO trailer redaction),
body (orchestrator + render_body entry).
store.py read_message(): SQL lookup + dulwich fetch + parse
sync.py public-inbox manifest discovery + git clone/fetch
threading.py recursive CTEs for thread reconstruction + active threads
dashboard.py landing-page aggregations (trackers, pulls, stats, sparkline)
cache.py DB-backed cache with JSON encode/decode + a type registry
canonical.py canonical-inbox resolution from To/Cc headers
patches.py, patch path / trailer / patch-series extractors;
trailers.py, shared backfill walker shell in _backfill.py.
patch_series.py,
_backfill.py
subsystems.py article-level path-matching primitives over MAINTAINERS
subsystems_dashboard/ per-subsystem dashboard surfaces split by concern:
reads (per-subsystem fan-outs), reviewers
(attestation surfaces), activity (cross-inbox
"most active subsystems"), triage (needs-
attention + quiet-for-N+-days queues).
maintainers.py MAINTAINERS file parser (no DB)
maintainer_allowlist.py dynamic email allowlist sourced from MAINTAINERS
mainline.py mainline-tree end-to-end: MAINTAINERS reload +
Link-trailer commit walker + update_mainline
orchestrator
maintenance.py SQLite hygiene operations: run_analyze, run_vacuum
patch_revisions.py v1/v2/v3 series grouping logic for patch pages
patch_state.py lifecycle classifier (applied / under review / ...)
datetime_utils.py tz-aware UTC normalisation for Date headers
broker/ write-broker daemon + JSONL protocol + client:
server (queue + worker pool), handlers
(cache / longops / maintenance / warm),
protocol (pydantic request types), client
(process-singleton, thread-safe RPC)
indexnow.py IndexNow push-notification client
seo/ SEO output split by format: sitemaps (XML),
json_ld (schema.org payloads), atom (Atom 1.0 feeds).
web/ Flask blueprint package: routes (one submodule per
URL family), filters (template filters), hooks
(request/response hooks + context processor),
urls (URL composition + site-base memo).
templates/ Jinja2 (base, index, inbox, daily, since, year,
month, search, author, reviewer, subsystem,
message, attachment_preview, _recent_items)
alembic/ migrations
tests/ pytest
Inboxes/ default mirror root (per-inbox subdirs; gitignored)
A read-only browser for the archive. Lightweight stack: Flask + Jinja2, Pico CSS and HTMX from CDN with SRI pins (no build step), and Pygments for server-side syntax highlighting.
uv run mimir run # http://127.0.0.1:5000/

Routes:
- GET /, meta-index: list of configured inboxes with per-inbox row counts, epoch counts, and date spans.
- GET /<inbox>/, per-inbox dashboard: most active threads (last 7 days, top 10 by decay-weighted score); side-by-side latest [GIT PULL] requests and Linux N.N.N release announcements; side-by-side per-author trackers driven by Inbox.tracked_authors (manage via mimir admin inbox trackers …; the section is hidden when the inbox has no trackers configured); a "this day, 5 years ago" sample; the last 10 messages in the inbox; a 30-day daily-volume sparkline + archive stats footer.
- GET /<inbox>/today and GET /<inbox>/yesterday, daily views showing every thread with at least one message on that calendar day (UTC), plus the day's total message count.
- GET /<inbox>/<YYYY>/, year archive: 12-month grid with per-month message counts; cells link to the month view, missing months dimmed. Prev/next year nav bounded by the plausible-archive range (1995..now+1).
- GET /<inbox>/<YYYY>/<MM>/, month archive: every thread with at least one message that month, ordered by last activity desc, capped at 100 with a count notice when truncated. Prev/next month nav with year wraparound; breadcrumb up to the year view.
- GET /<inbox>/search?q=<query>, substring search over subject and author (case-insensitive, OR-combined). 100-result cap, cached per (inbox, query). Form lives on the inbox dashboard. Caveats: queries with no matches can take seconds to scan on a cold cache; the date-index short-circuit only helps when some rows match. See mimir.dashboard.search_articles.
- GET /m/<message-id>, Message-ID lookup. 301-redirects to the article's canonical /<inbox>/<YYYY>/<MM>/<article-id> URL. For cross-posts, the destination is the article's pinned canonical_inbox_id (or the alphabetically-first link with firehose inboxes demoted, when canonical isn't pinned yet); the message page's "Also in:" line surfaces the rest. Useful for linking from outside the archive (commit trailers, IRC, lore.kernel.org refs). An inbox-scoped variant, GET /<inbox>/m/<message-id>, 404s when the message exists but isn't in the named inbox.
- GET /<inbox>/<YYYY>/<MM>/<article-id>, single message: headers, full thread tree with the current message highlighted, body, attachment list. Patch articles also render a per-patch state card above the body summarising trailers (with maintainer attestation chips), mainline-landing record, cross-revision series timeline (with [diff vs current] links), and thread activity. On patch threads whose root has an off-list parent, also shows a "Possibly related" surface of other archived messages with the same normalized subject (JWZ subject-based grouping). Non-patch threads (excluding those rooted by automated senders like syzbot or the kernel test robot) also show a Related discussions panel: up to 5 prior threads from the same inbox ranked by subject-token overlap, shared participants, and recency. The year/month must match the article's archived date; mismatches return 404. ETag-based conditional revalidation: responses carry Cache-Control: public, no-cache and a strong ETag, so repeat requests resolve as 304 with no body. The HTMX intra-thread swap (click a tree link) returns just the _message_body.html partial, leaving the tree + chrome on the client.
- GET /<inbox>/series/<patch_series_key>/diff?from=<vN>&to=<vM>&pos=<pos>, inter-revision diff for a single patch-series position. pos=cover (or pos=0) diffs the cover-letter bodies between two revisions; pos=N diffs the N-th in-series patch's body across revisions. The body diff covers both commit message and patch hunks. 24h cached; source emails are immutable in the mirror. Linked from each non-current entry in the patch page's Revisions fold (the [diff vs current] chip).
- GET /<inbox>/<YYYY>/<MM>/<article-id>/attachment/<n>, binary download of the n-th attachment, served from the dulwich-fetched blob.
- GET /<inbox>/<YYYY>/<MM>/<article-id>/attachment/<n>/preview, Pygments-highlighted inline preview for text-like attachments (patches, .c, .py, etc.); falls back to a "binary, can't preview" page otherwise.
- GET /api/<inbox>/recent?offset=N, HTMX partial: next page of "Recent messages" entries plus a fresh "Load more" trigger. Hard-capped at offset 1000 (100 pages back); past it the route 404s to bound worst-case server work on adversarial pagination.
- GET /healthz and GET /readyz, cheap probes for orchestrators. /healthz does no DB work; /readyz runs a SELECT 1. Both bypass the route cache via Cache-Control: no-store.
- GET /robots.txt, disallows /*/attachment/* and points at the sitemap.
- GET /sitemap.xml, sitemap index pointing at /meta-sitemap.xml plus one /<inbox>/sitemap.xml per configured inbox. Each response carries Last-Modified derived from the most-recent content date and honours If-Modified-Since (304 Not Modified on a conditional GET that already covers the latest content), so crawlers like Google can re-fetch on a real change rather than on full-body diffs. Cached for 1 h.
- GET /security.txt and GET /.well-known/security.txt, RFC 9116 contact info. 404 unless SECURITY_CONTACT is set.
- GET /privacy, GDPR Art. 13 transparency notice: controller identity, browser storage (Cloudflare cookies, mimir.fold.* localStorage), server-side log retention, third parties in the request path, the From-line / DCO-trailer redaction posture, data-subject rights, and the Dutch supervisory-authority complaint route. Linked from the footer on every page.
The body rendering pipeline (mimir/rendering/) walks the body
line by line, segments it into runs of text, quote, and diff,
and emits HTML accordingly:
- Quoted blocks (>-prefixed lines) become <blockquote> and recurse for nested levels; >>>> ends up four <blockquote> deep. Levels at or beyond depth 2 collapse into <details> so the reader can expand on demand.
- Inline unified diffs (recognized by diff --git, ---, +++, @@ starts) get the standard green/red/cyan add/remove/header rendering, with two extras: each hunk wraps in <div id="h-N" class="hunk"> and each line within carries id="h-N-LM" so URL fragments like #h-2-L15 jump to a specific line of a patch; context, add, and remove lines also receive a per-language Pygments overlay (the lexer is detected from each +++ b/<path>), so a C patch's hunk content reads as C, a Python patch's reads as Python, and unknown / binary targets fall through to plain monospace.
- Plain text runs are escaped, preserve newlines via <pre>, and have URLs and <Message-ID>s linkified; clicking a referenced Message-ID inside one message takes you to that message's per-inbox URL when it's in the archive (and renders as a neutral [ref] placeholder so the address part isn't re-leaked); refs not in the archive render as [off-list ref].
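The segmentation step can be pictured with a deliberately simplified sketch. It classifies lines into text, quote, and diff runs using only the line prefixes named above; the function names and the rule that a diff run continues while lines start with +, -, or a space are assumptions of this example, and the real pipeline in mimir/rendering/ is certainly more careful (for instance around diffstat blocks).

```python
DIFF_STARTS = ("diff --git", "---", "+++", "@@")


def quote_depth(line):
    """Count leading '>' markers, tolerating spaces between them."""
    depth = 0
    for ch in line:
        if ch == ">":
            depth += 1
        elif ch != " ":
            break
    return depth


def segment(body):
    """Group consecutive lines into (kind, lines) runs."""
    runs = []
    in_diff = False
    for line in body.splitlines():
        if line.startswith(">"):
            kind, in_diff = "quote", False
        elif line.startswith(DIFF_STARTS):
            kind, in_diff = "diff", True
        elif in_diff and line[:1] in ("+", "-", " "):
            kind = "diff"  # hunk content following a diff header
        else:
            kind, in_diff = "text", False
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(line)
        else:
            runs.append((kind, [line]))
    return runs


BODY = (
    "Hi\n> old\n>> older\nreply\n"
    "diff --git a/f.c b/f.c\n--- a/f.c\n+++ b/f.c\n@@ -1 +1 @@\n-a\n+b\n"
)
for kind, lines in segment(BODY):
    print(kind, len(lines))
```

A renderer would then map each run to its HTML form, using quote_depth to decide how many blockquote levels to open.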
The dashboard helpers run through a DB-backed cache (the cache
table; values JSON-encoded with a small dataclass registry in
mimir/cache.py). TTLs are sized to the cost of recomputation:
| Helper | TTL |
|---|---|
| archive_stats | 24 h |
| daily_volume | 1 h |
| active_threads | 5 min |
| threads_for_day | 5 min |
| author_recent (each) | 5 min |
| latest_pull_requests | 5 min |
| latest_stable_releases | 5 min |
| this_day_in_history | 5 min |
To eliminate user-facing cold-start latency, run:
uv run mimir warm-cache # all tiers (operator one-off)
uv run mimir warm-cache --tier fast # sitemaps + cheap helpers
uv run mimir warm-cache --tier slow # subsystem dashboards + rest

from cron or a systemd timer. The work splits into a fast tier
(sitemaps, archive_stats, latest pulls, latest stable releases,
recent articles) on a per-minute cadence and a slow tier
(subsystem dashboards, per-tracker queries, the rest) on a per-hour
cadence. The container scheduler fires them on the
WARM_CACHE_EVERY / WARM_CACHE_SLOW_EVERY cadences (see
deploy/README.md); broker-side, the warm-worker queue is a
priority queue so a fast-tier RPC queued behind a slow-tier RPC
jumps ahead (in-flight slow ops are never preempted). Sample
crontab for a non-broker deploy:
* * * * * cd ~/Projects/mimir && uv run mimir warm-cache --tier fast >/dev/null
0 * * * * cd ~/Projects/mimir && uv run mimir warm-cache --tier slow >/dev/null

A warm-cache run refreshes every targeted helper for every configured inbox. With this in place, dashboard loads come back in the single-digit-millisecond range regardless of how big the archive gets.
Per-key work fans out across min(cpu_count, 8) worker threads by
default; pass --workers N to override (e.g. --workers 1 when
debugging a slow target). Keys are skipped on a warm tick if their
remaining TTL is comfortably above the helper's refresh window
(deterministic skip); inside the window the warm tick refreshes
with a probability that ramps from 0 to 1 as the row approaches
expiry, so siblings sharing a TTL don't all refresh on the same
tick. Below the window the refresh fires every tick (the insurance
zone, so the row is always rewritten before it expires).
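The three zones described above (deterministic skip, probabilistic window, insurance zone) can be modelled with a short function. The parameter names window and floor, the linear shape of the ramp, and the injectable rng are assumptions of this sketch; the source only states that the probability ramps from 0 to 1 as the row approaches expiry.

```python
import random


def should_refresh(remaining, window, floor, rng=random.random):
    """Decide whether a warm tick refreshes a cache row.

    remaining: seconds until the row expires.
    window:    upper edge of the refresh window (seconds before expiry).
    floor:     lower edge; at or below it the refresh always fires.
    """
    if remaining > window:
        return False              # comfortably fresh: deterministic skip
    if remaining <= floor:
        return True               # insurance zone: always refresh
    # Inside the window: probability ramps linearly from 0 (at window) to 1 (at floor).
    probability = (window - remaining) / (window - floor)
    return rng() < probability


# A row with a 300 s window and a 60 s floor, sampled at three ages.
print(should_refresh(600, 300, 60))
print(should_refresh(30, 300, 60))
print(should_refresh(180, 300, 60, rng=lambda: 0.25))
```

Spreading refreshes probabilistically means rows that were written together tend not to be recomputed on the same tick.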
SQLite never reclaims freed pages on its own; the .db file grows
past its actual content over time, and the WAL grows during long
ingests until something checkpoints it. To compact both:
uv run mimir vacuum

Reports before/after sizes for mimir.db, mimir.db-wal, and
mimir.db-shm. VACUUM holds an exclusive lock for the duration and
needs ~2× the on-disk size of free space (the rebuild lives in the
WAL until checkpoint). Other processes with the DB open (web
server, warm-cache cron) prevent the post-VACUUM WAL truncate, so
run it during a quiet window.
Sample crontab (daily at 04:00, only if no ingest is running):
0 4 * * * cd ~/Projects/mimir && uv run mimir vacuum >/dev/null

On lkml-scale (~6 M articles, ~3.6 GB DB) a full VACUUM takes 80 to 120 s.
SQLite's planner reads sqlite_stat1 to pick query plans. The
migration runs ANALYZE once on an empty schema, which doesn't
help; as ingest fills the tables the stats stay zero and the
planner can flip to bad plans (e.g. scanning all of article_lists
instead of walking the date index).
The broker container owns this: a bounded ANALYZE runs on first
start (gated by /data/.broker_initial_analyze), and the in-loop
daily + weekly-full ticks RPC to the broker. Operator-transparent;
just keep the scheduler tasks container's cadence env vars (see
deploy/README.md) at their defaults.
For ad-hoc operator runs:
uv run mimir analyze
uv run mimir analyze --full

The bounded form runs in 1 to 3 s on the lkml-scale corpus and is
accurate enough for the common join shapes. The --full pass
re-samples every row of every index, holds the writer lock 25 to
30 s, and is the safety net for distribution drift in long-tail
indexes the bounded sample might miss.
Example crontab pair:
30 4 * * * cd ~/Projects/mimir && uv run mimir analyze
0 5 * * 0 cd ~/Projects/mimir && uv run mimir analyze --full

For real-host deployment (everything above runs as flask run for
dev), see deploy/README.md. Three shapes are covered:
- Container: Dockerfile and compose.yaml at the repo root. Multi-stage build, non-root runtime, gunicorn behind a ${WORKERS} knob, broker container self-bootstraps alembic upgrade head on first start, /healthz container healthcheck, /data (with /data/db/ and /data/Inboxes/ subpaths) as the single bind mount. Run docker compose up --build once SECRET_KEY is set.
- systemd: deploy/systemd/ carries the web-server unit plus three timer/oneshot pairs replacing the cron lines for warm-cache (every minute), analyze (daily), and vacuum (weekly). For the weekly vacuum, the WAL truncate only lands fully when no other process holds the DB; do systemctl stop mimir.service before triggering it manually if that matters, or let the timer fire and accept best-effort.
- Reverse proxy: deploy/caddy/Caddyfile.example (5 lines, automatic HTTPS) and deploy/nginx/mimir.conf.example (full TLS site block with the X-Forwarded-Proto and X-Request-Id headers mimir reads).
mimir runs three containers (since 2.0.0; see compose.yaml):
- mimir-broker owns the sole SQLite writer connection. Serves cache + admin RPCs over a UNIX socket at /data/.broker.sock. Self-bootstraps on startup (alembic upgrade head → bootstrap_inboxes → bounded post-migrate ANALYZE), each gated by a sentinel file so subsequent restarts skip. Internal periodic purge thread drops expired cache rows.
- mimir is the web tier. Opens every SQLite connection with PRAGMA query_only=1; cache writes route through the broker socket. Depends on the broker's healthcheck so cold requests after deploy don't hit an un-migrated schema.
- mimir-tasks is the scheduler. Also query_only=1. Fires RPCs at the broker on a timer (warm-cache, update, mainline refresh, ANALYZE, VACUUM). No direct SQLite writes.
The invariant: broker is the sole SQLite writer; everything else
is query_only=1. MIMIR_IS_BROKER=true flags the broker
container; MIMIR_DEPLOY=true flags the web + tasks containers
(triggers query_only=1 and refuses FLASK_DEBUG=true).
See deploy/README.md for the step-by-step, healthcheck shape,
and bootstrap-sentinel layout.
mimir mirrors Linus's linux.git locally so it can read
MAINTAINERS and surface subsystem ownership across the UI:
per-subsystem dashboards (/<inbox>/subsystem/<name>/), reviewer
pages, attestation chips on patch views, the most-active-subsystems
aggregation on /, and the MAINTAINERS-derived half of the email
allowlist that drives From-line / DCO-trailer redaction.
mimir walks one or more git trees for Link: trailers feeding
the patch-lifecycle surfaces. Defaults to Linus + linux-next +
five *-next subsystem trees (net-next, tip, pci, mm,
bpf-next). Operators extend via:
export TREES__bcachefs__URL=https://example.com/bcachefs.git
export TREES__bcachefs__PATH=Mainline/bcachefs.git
export TREES__bcachefs__WALK_EVERY_SECONDS=3600
Or replace defaults entirely by setting one or more TREES__* env
keys (any operator-curated set wins outright; defaults are not merged
in).
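One way to picture the TREES__<name>__<FIELD> convention and the "operator set wins outright" rule is the sketch below. It is an assumption about the shape of the logic, not mimir's config loader; parse_tree_overrides and effective_trees are hypothetical names, and the field names are simply lowercased from the env keys.

```python
def parse_tree_overrides(environ: dict[str, str]) -> dict[str, dict[str, str]]:
    """Group TREES__<name>__<FIELD> variables into {name: {field: value}}."""
    trees: dict[str, dict[str, str]] = {}
    for key, value in environ.items():
        parts = key.split("__")
        # Only keys of the exact form TREES__<name>__<FIELD> are considered.
        if len(parts) != 3 or parts[0] != "TREES":
            continue
        _, name, field = parts
        trees.setdefault(name, {})[field.lower()] = value
    return trees


def effective_trees(
    defaults: dict[str, dict[str, str]], environ: dict[str, str]
) -> dict[str, dict[str, str]]:
    """Any operator-supplied set replaces the defaults; nothing is merged."""
    overrides = parse_tree_overrides(environ)
    return overrides if overrides else dict(defaults)
```

Passing os.environ as environ would apply this to the process environment; with no TREES__* keys present the defaults come back unchanged.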
Deprecated: MAINLINE_TREE_URL / MAINLINE_TREE_PATH. These continue
to work (seeding only the linus entry) but will be removed in the
next major release.
# Clone (first run) or fetch + load:
uv run mimir update-mainline
# Re-parse the local HEAD without fetching:
uv run mimir update-mainline --skip-fetch
# Force re-parse even when HEAD hasn't moved (after a parser fix):
uv run mimir update-mainline --force
Steady-state ticks (HEAD unchanged) are cheap: fetch, compare, no-op. Operators can run this on a cron / systemd timer; the schema is replaced transactionally on every change, so consumers never see a half-loaded subsystems table.
update-mainline runs two passes against the tree:
- MAINTAINERS load: replaces the subsystems schema as above. Skipped when HEAD is unchanged. --skip-maintainers disables this pass for the tick.
- Link:-trailer walk: scans every new commit for Link: https://lore.kernel.org/.../<msgid> trailers and inserts mainline_commits rows. Resumable; the first run walks the full history (~1.5M commits on Linus's tree, a few minutes). The patch page's state card surfaces these as a "Landed: <sha> in <tree> on YYYY-MM-DD" row whenever an article's Message-ID matches one or more recorded commits. --skip-commits disables this pass.
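The trailer scan in the second pass can be approximated as follows. This is a rough sketch under stated assumptions rather than mimir's walker: lore_message_ids is a hypothetical name, and it assumes the Message-ID is the last path segment of the lore URL and contains an "@".

```python
import re
from urllib.parse import unquote, urlsplit

# One "Link:" trailer per line, pointing at lore.kernel.org.
LINK_RE = re.compile(
    r"^Link:[ \t]*(https://lore\.kernel\.org/\S+)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def lore_message_ids(commit_message: str) -> list[str]:
    """Return the Message-IDs referenced by lore Link: trailers, in order."""
    ids: list[str] = []
    for match in LINK_RE.finditer(commit_message):
        path = urlsplit(match.group(1)).path
        segments = [segment for segment in path.split("/") if segment]
        # Assumption: the Message-ID is the final segment and contains "@".
        if segments and "@" in segments[-1]:
            ids.append(unquote(segments[-1]))
    return ids
```

Each returned Message-ID would then be paired with the commit's SHA and tree name to form one mainline_commits row.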
New articles get their diff-touched paths extracted automatically
at ingest time (parsing diff --git a/<old> b/<new> headers out
of patch bodies). For articles ingested before that landed, run:
uv run mimir backfill-article-files [-v]
Idempotent: articles that already have rows are skipped. Pass
--limit N to bound a session for huge archives; the walker is
newest-first, so a --limit run covers the most-visible articles
first. --reprocess re-extracts for articles whose rows already
exist (use after an extractor change).
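As an illustration of the header parsing described above, the sketch below pulls paths out of diff --git lines. It is not mimir's extractor: touched_paths is a hypothetical name, and this simplified pattern ignores quoted paths and paths containing spaces.

```python
import re

# Matches "diff --git a/<old> b/<new>" at the start of a line.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def touched_paths(body: str) -> list[str]:
    """Return the unique paths named in diff headers, in first-seen order."""
    seen: dict[str, None] = {}
    for old, new in DIFF_HEADER.findall(body):
        # A rename touches both sides, so both paths are recorded.
        seen.setdefault(old, None)
        seen.setdefault(new, None)
    return list(seen)
```

A dict is used as an ordered set so that repeated headers for the same file collapse to one entry.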
Cover-letter subjects ([PATCH ... 0/N] <title>) and in-series
patches ([PATCH ... M/T] <title>) are tagged with
patch_series_key + patch_series_version + patch_series_position
at ingest (in-series patches inherit key + version from the cover
via their thread parent). For articles ingested before that
landed:
uv run mimir backfill-patch-series [-v]
Cheaper than the article-files backfill: it only reads
subject + author + thread_parent, with no body re-parse via the git mirror.
Idempotent; --limit N and --reprocess work the same way.
Output buckets: indexed (covers), in_series_indexed
(in-series patches linked to a cover), in_series_orphan
(in-series patch with position set but cover not yet known,
re-attempted on the next run), not_cover, skipped.
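The subject tagging can be sketched with a regular expression. This is a hedged illustration only: parse_series_subject and SeriesTag are hypothetical names, treating a missing vN tag as version 1 is an assumption, and how mimir derives patch_series_key is not shown.

```python
import re
from typing import NamedTuple, Optional

# "[... PATCH ... M/T] title"; the bracket may carry RFC, a tree name, vN, etc.
SUBJECT_RE = re.compile(
    r"\[(?P<tag>[^\]]*\bPATCH\b[^\]]*?)\s+(?P<pos>\d+)/(?P<total>\d+)\]\s*(?P<title>.*)",
    re.IGNORECASE,
)
VERSION_RE = re.compile(r"\bv(\d+)\b", re.IGNORECASE)


class SeriesTag(NamedTuple):
    version: int
    position: int
    total: int
    title: str
    is_cover: bool


def parse_series_subject(subject: str) -> Optional[SeriesTag]:
    """Parse a series subject; return None for anything without M/T numbering."""
    match = SUBJECT_RE.search(subject)
    if match is None:
        return None
    version_match = VERSION_RE.search(match.group("tag"))
    # Assumption: an unversioned series counts as v1.
    version = int(version_match.group(1)) if version_match else 1
    position = int(match.group("pos"))
    return SeriesTag(
        version=version,
        position=position,
        total=int(match.group("total")),
        title=match.group("title").strip(),
        is_cover=position == 0,
    )
```

Position 0 marks a cover letter; any other position is an in-series patch that would still need its cover resolved through the thread parent.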
Off by default. Set INDEXNOW_KEY to enable: the update
scheduler tick will push the canonical URL of every newly-ingested
article to https://api.indexnow.org/indexnow, which fans out to
Bing, Yandex, Naver, Seznam, and Yep. Google does not consume
IndexNow, so this won't help Google discovery; it only accelerates
Bing-family crawlers.
# Generate a key once (32 hex chars; the spec is loose on length).
python -c "import secrets; print(secrets.token_hex(16))"
# Set in the deployment env. SITE_BASE_URL must also be set so the
# protocol can build the keyLocation URL and the host field.
export INDEXNOW_KEY=8c3aef... # the generated key
export SITE_BASE_URL=https://example.test
Once enabled, mimir serves the key file at
https://<host>/<key>.txt (registered only when INDEXNOW_KEY is
set; an unconfigured deploy doesn't expose the endpoint).
Pre-existing articles in the DB are not backfilled to IndexNow;
only articles newly created on each update tick are pushed.
INDEXNOW_MAX_PER_TICK (default 1000) is a backfill guard:
when a single update produces more new URLs than this (fresh
deploy, post-outage catch-up, etc.), the push is skipped entirely
and a warning is logged; the sitemap remains the discovery path for
the backlog. Raise the cap if your steady-state
new-articles-per-tick count legitimately exceeds it.
All notification calls are best-effort: network errors and non-2xx responses log a warning and don't break the ingest tick.
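The per-tick guard and the request body can be sketched as below. This is an illustration under assumptions, not mimir's notifier: build_indexnow_payload is a hypothetical name, and the actual HTTP POST and warning logging are left out. The host, key, keyLocation, and urlList fields follow the public IndexNow JSON format.

```python
from typing import Optional
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_payload(
    site_base_url: str,
    key: str,
    urls: list[str],
    max_per_tick: int = 1000,
) -> Optional[dict]:
    """Return the JSON body for one push, or None when the push is skipped."""
    if not key or not urls:
        # Feature disabled, or nothing new this tick.
        return None
    if len(urls) > max_per_tick:
        # Backfill guard: skip the whole push; a real caller would log a warning.
        return None
    base = site_base_url.rstrip("/")
    return {
        "host": urlsplit(base).netloc,
        "key": key,
        "keyLocation": f"{base}/{key}.txt",
        "urlList": list(urls),
    }
```

A caller would POST the returned dict as JSON to INDEXNOW_ENDPOINT and treat any network error or non-2xx status as a logged warning rather than a failure.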
The broker carries a latent tracemalloc-based diagnostic for
investigating Python-side memory retention (RSS staying high
between warm cycles, VmData drifting up over a multi-day
window, that sort of thing). Off by default, zero cost when
disabled.
Enable on the broker container only, by setting
TRACEMALLOC_INTERVAL_SECONDS to a positive integer:
# compose.yaml
services:
  mimir-broker:
    environment:
      TRACEMALLOC_INTERVAL_SECONDS: "1800"  # snapshot every 30 min
Restart the broker. From that point on, a daemon thread takes a
tracemalloc.Snapshot on the interval, writes it as a pickle to
/data/diagnostics/tracemalloc-<UTC-ISO>.pkl (atomic rename),
and logs a top-25-by-current-bytes summary to stderr per
snapshot, so a podman logs -f mimir-broker tail shows the leak
shape forming without pulling files.
To rank growers between two snapshots offline:
podman exec mimir-broker mimir tracemalloc-diff \
/data/diagnostics/tracemalloc-<early>.pkl \
/data/diagnostics/tracemalloc-<late>.pkl \
--top 25 \
--filter-prefix /app/mimir
--filter-prefix narrows the output to allocations whose source
file starts with the given path (typical pick: /app/mimir for
the mimir code itself, ignoring stdlib / third-party noise).
When done, drop the env var, restart, and run rm -rf /data/diagnostics/*.pkl to reclaim the disk. Snapshot files run
~10 to 50 MB each depending on heap size and frames (default
25); a 4 to 8 hour observation window at a 30-minute interval lands
around 200 to 400 MB.
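For readers who want to inspect the pickled snapshots by hand, the comparison can be reproduced with the stdlib tracemalloc API. This is a minimal sketch, not the implementation behind mimir tracemalloc-diff; top_growers is a hypothetical name, and treating the prefix as a glob pattern is an assumption.

```python
import tracemalloc
from typing import Optional


def top_growers(
    early_path: str,
    late_path: str,
    prefix: Optional[str] = None,
    top: int = 25,
) -> list[tracemalloc.StatisticDiff]:
    """Rank source lines by memory growth between two dumped snapshots."""
    early = tracemalloc.Snapshot.load(early_path)
    late = tracemalloc.Snapshot.load(late_path)
    if prefix:
        # Keep only traces whose file lives under the prefix, e.g. /app/mimir.
        keep = [tracemalloc.Filter(True, prefix.rstrip("/") + "/*")]
        early = early.filter_traces(keep)
        late = late.filter_traces(keep)
    diffs = late.compare_to(early, "lineno")
    # compare_to orders by absolute change; keep only lines that actually grew.
    growers = [stat for stat in diffs if stat.size_diff > 0]
    growers.sort(key=lambda stat: stat.size_diff, reverse=True)
    return growers[:top]
```

Each StatisticDiff carries size_diff, count_diff, and a traceback, so printing the results gives a per-line view of what grew between the two files.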
uv run ruff check mimir/ tests/
uv run pytest
Both mimir/ and tests/ are linted because CI runs ruff over
the same set; a pre-push sweep that omits tests/ can pass
locally and fail in CI. The suite focuses on the cache
encoder/decoder round-trip, the parser contract, threading CTEs,
canonical-inbox resolution, the rendering XSS contract, and the
broker dispatch / handler / client roundtrips; that's where silent
corruption would be most expensive.
MIT, see LICENSE.