feat(freshness): corpus manifest + provenance MCP tools + TRUST.md #3

Merged
mikelninh merged 6 commits into main from feat/freshness-provenance
May 28, 2026

@mikelninh
Owner

Answers the question that sits under every "verified" claim: how do you know the underlying corpus is correct, current, and the same corpus the next agent will see?

What's in this PR

| File | What it does |
| --- | --- |
| freshness/build_manifest.py | Walks /laws/, hashes every file, produces aggregate hash that changes iff any law changes |
| freshness/manifest.json | 5,942 laws with per-file SHA-256, byte size, git timestamp, source URL; committed as the public corpus snapshot |
| freshness/TRUST.md | Explicit promise document: what we guarantee today, what we don't yet, the gap-closing roadmap |
| freshness/README.md | How to use the scaffolding |
| server.py | Two new MCP tools: get_corpus_status() + verify_law_provenance(abbr) |
| tests/test_provenance.py | 7 new tests, includes CI canary that catches uncommitted corpus drift |
| .github/workflows/freshness-check.yml | Same canary as a GitHub Action |

Headline numbers

  • 5,942 laws in the corpus, each with a public hash
  • aggregate_sha256: b93152a9…b48fdb81 — the one-number proof of corpus state
  • Test count: 130 → 137 (all green)
  • Hash check latency: ~4 minutes locally (the slow CI canary is by design)
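
A minimal sketch of how an aggregate hash with the "changes iff any law changes" property could be computed. The PR description does not show the scheme used in build_manifest.py, so the function names, the `*.md` glob, and the byte layout fed into the digest are all illustrative assumptions.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_entries(laws_dir: Path) -> dict[str, str]:
    """Map each markdown file under laws_dir to its SHA-256 (assumed layout)."""
    return {
        p.relative_to(laws_dir).as_posix(): file_sha256(p)
        for p in laws_dir.rglob("*.md")
        if p.is_file()
    }


def aggregate_sha256(per_file: dict[str, str]) -> str:
    """Combine per-file hashes into one digest (assumed scheme).

    Sorting by path makes the result independent of directory walk order,
    so two consumers on the same commit compute the same value.
    """
    digest = hashlib.sha256()
    for rel_path in sorted(per_file):
        digest.update(f"{rel_path}\x00{per_file[rel_path]}\n".encode("utf-8"))
    return digest.hexdigest()
```

Hashing the path together with each per-file hash means a rename also changes the aggregate, which may or may not match what the real script does.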

What this unlocks for the user

Anyone — citizen, lawyer, or LLM agent — can now answer "where does this answer come from?" by calling two tools or opening one JSON file:

verify_law_provenance("BGB")
→ {
    "source_url": "https://www.gesetze-im-internet.de/bgb/",
    "corpus_path": "laws/bgb.md",
    "corpus_sha256": "5a6fc44acf93bf722d8721bf7948d484f370ce4754e29e35d57de82c4a09d2da",
    "corpus_bytes": 1623932,
    "git_last_modified_iso": "2026-04-07T21:07:06+07:00"
  }
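
A consumer with a local checkout could cross-check a record like the one above against the file on disk. This is a sketch only: the field names are taken from the sample response, while `matches_provenance` and the `repo_root` argument are hypothetical and not part of the PR.

```python
import hashlib
from pathlib import Path


def matches_provenance(record: dict, repo_root: Path) -> bool:
    """Check a local corpus file against a provenance record.

    Returns True only if both the byte size and the SHA-256 agree.
    """
    data = (repo_root / record["corpus_path"]).read_bytes()
    if len(data) != record["corpus_bytes"]:
        return False  # cheap check first: a size mismatch already proves drift
    return hashlib.sha256(data).hexdigest() == record["corpus_sha256"]
```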

Honest scope

This PR is the scaffold — data structures, MCP tools, tests, CI guard rail.

What's NOT in this PR (intentional, on the roadmap in TRUST.md):

  • Live daily re-sync from gesetze-im-internet.de (needs upstream XML → markdown parser, multi-day work)
  • Per-paragraph "in-force since" dates
  • Landesrecht + EU corpus expansion
  • Notarised weekly snapshots on Hugging Face Datasets

The scaffold is what makes those measurable next milestones.

🤖 Generated with Claude Code

hallochupi-sketch and others added 2 commits May 28, 2026 18:57
…s. without MCP

The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the
right answer when asked correctly). They don't prove *impact* — does giving an
LLM these tools actually change how it answers a citizen's legal question?

This eval harness answers exactly that. 25 hand-labelled real Lebenslagen
questions, run twice through gpt-4o-mini:

  BASELINE   — no tools, answers from training-only knowledge
  TREATMENT  — same prompt, GitLaw tools available via OpenAI function-calling
               (functionally equivalent to how an MCP client exposes them)

Headline result on the first committed run:
  hallucination rate:  5.9% → 0.0%   (every cited § now verified against corpus)
  expected hit rate:   62.5% → 62.5%   (no change — see below for honest read)
  mean tool calls per question (treatment): 1.25

The hallucination story is real and reproducible. The hit-rate stability is
the honest part: gpt-4o-mini already knows the well-known statutes in our
question set, and the treatment becomes more conservative (cites 1.46 § vs 2.12
in baseline) because it only emits verified citations. The per-question table
in eval_summary.md shows which questions need better prompting in treatment
and which need harder long-tail entries to widen the gap.
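
For readers who want to see what the two headline metrics could mean in code, here is one plausible reading. The definitions are assumptions, not taken from run.py: hallucination rate as the share of cited paragraphs missing from the corpus, and expected hit rate as the share of hand-labelled expected paragraphs that were actually cited. The citation strings are toy data.

```python
def hallucination_rate(cited: list[str], corpus: set[str]) -> float:
    """Share of cited paragraphs that do not exist in the corpus."""
    if not cited:
        return 0.0
    missing = [c for c in cited if c not in corpus]
    return len(missing) / len(cited)


def expected_hit_rate(cited: list[str], expected: list[str]) -> float:
    """Share of hand-labelled expected paragraphs that were cited."""
    if not expected:
        return 0.0
    cited_set = set(cited)
    return sum(1 for e in expected if e in cited_set) / len(expected)


# Toy data: "BGB § 9999" is a deliberately non-existent citation.
corpus = {"BGB § 535", "BGB § 536"}
cited = ["BGB § 535", "BGB § 9999"]
expected = ["BGB § 535", "BGB § 536"]

print(hallucination_rate(cited, corpus))   # 0.5
print(expected_hit_rate(cited, expected))  # 0.5
```

Under these definitions a model that cites fewer paragraphs can lower its hallucination rate without raising its hit rate, which is consistent with the pattern described above.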

Files:
  questions.json  — 25 hand-labelled questions w/ expected_paragraphs
  run.py          — eval harness with --model / --limit flags
  README.md       — how to run, how to read, honest limits
  eval_summary.md — latest run committed as public record (regenerated each run)
  .gitignore      — keeps timestamped per-run JSON dumps out of git history

Roadmap on the README: harder long-tail questions, multi-model comparison
(gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment

Answers the question that sits under every claim of "this answer is verified":
how do you know the underlying corpus is correct, current, and the same
corpus the next agent will see?

What's new:

- gitlaw_mcp/freshness/build_manifest.py
    Walks /laws/, computes SHA-256 + byte size + git-last-modified per file,
    aggregates them into a single hash that changes iff any law in the
    corpus changes. Idempotent. --check flag exits non-zero on drift.

- gitlaw_mcp/freshness/manifest.json
    Committed snapshot of all 5,942 laws with their hashes, source URLs,
    git timestamps. The aggregate_sha256 is the one-number proof of corpus
    state — two consumers on the same commit see the same number.

- gitlaw_mcp/freshness/TRUST.md
    The explicit promise: what we guarantee today (public source URL per
    law, single integrity hash, git audit log, 0% hallucination on every
    citation via verify_citation), what we don't yet guarantee (no daily
    sync, no per-paragraph in-force dates, federal-only), and the
    roadmap. Reads like a guarantee document, not marketing.

- gitlaw_mcp/server.py
    Two new MCP tools: get_corpus_status() and verify_law_provenance(abbr).
    Either tool answers "where does this answer come from" — callable by
    any MCP client. Returns source URL + corpus hash + git timestamp +
    file path.

- gitlaw_mcp/tests/test_provenance.py
    7 new tests pinning the tool contracts. Includes a CI canary that
    runs `build_manifest --check` — if a law file changes without the
    manifest being regenerated, the test goes red. (Slow because it hashes
    5,942 files, but worth it as a guard rail.)

- .github/workflows/freshness-check.yml
    Same canary as a GitHub Action — runs on PRs that touch /laws/ or the
    manifest. Drift becomes impossible to merge without noticing.
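
The drift check described in the list above can be pictured as a comparison of two path-to-hash mappings. This is a sketch under assumptions: `diff_manifest` and `check_exit_code` are hypothetical names, and the commit message only says the real `--check` flag exits non-zero on drift, so the exit code 1 is illustrative.

```python
def diff_manifest(
    committed: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Compare committed per-file hashes with freshly computed ones."""
    return {
        "added": sorted(current.keys() - committed.keys()),
        "removed": sorted(committed.keys() - current.keys()),
        "changed": sorted(
            path
            for path in committed.keys() & current.keys()
            if committed[path] != current[path]
        ),
    }


def check_exit_code(diff: dict[str, list[str]]) -> int:
    """Return 0 when the manifest is in sync, 1 on any drift (assumed code)."""
    return 1 if any(diff.values()) else 0
```

Reporting added, removed, and changed paths separately would let a red CI run say which law file triggered it instead of only that something differs.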

Honest scope:

This PR delivers the *scaffold* — the data structures, MCP tools, tests,
and CI guard rail. The actual daily re-sync from gesetze-im-internet.de
(Phase 1 in TRUST.md) is the next milestone and is multi-day work because
the upstream XML → markdown parser needs to be wired against our existing
normaliser. The scaffold is what makes that next step measurable.

Tests: 130 → 137 (all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gitlaw | Ready | Preview, Comment | May 28, 2026 7:56pm |

hallochupi-sketch and others added 2 commits May 28, 2026 20:17
Two ruff lint cleanups, no behaviour change:
- eval/run.py: noqa E402 on imports that must follow sys.path setup
- freshness/build_manifest.py: drop unused `os` import

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nternet.de

Builds on the manifest scaffold. Adds the daily HEAD-check that makes upstream
drift *visible* — without yet rewriting our markdown corpus when drift is
detected (Phase 1b).

What this adds:

- gitlaw_mcp/freshness/upstream_sources.json
    Registry of 36 monitored laws, each mapping our abbreviation to the
    upstream gesetze-im-internet.de slug (which often differs — e.g.
    AufenthG → aufenthg_2004 because German law has year-versioned URLs).

- gitlaw_mcp/freshness/sync.py
    Reads the registry, HEAD-requests every entry, compares Last-Modified +
    ETag to the committed upstream_snapshots.json. On drift: updates the
    snapshot and appends a timestamped row to sync_log.md. Modes:
      --dry-run   (don't write)
      --offline   (skip network, summarise cache)

- gitlaw_mcp/freshness/upstream_snapshots.json
    Committed record of what we last saw upstream. Per-law ETag,
    Last-Modified, first_seen and last_checked timestamps.

- gitlaw_mcp/server.py
    Two new MCP tools:
      check_upstream_currency(abbreviation)  — compares our corpus
        git-timestamp against upstream Last-Modified, returns drift_status +
        days_behind
      list_drifted_laws()  — every monitored law where upstream is newer
        than our corpus, sorted by staleness descending

- .github/workflows/upstream-sync.yml
    Daily cron at 05:17 UTC. Runs sync, commits snapshot + log if anything
    changed. Touches only freshness/ files — never the corpus itself.

- gitlaw_mcp/tests/test_upstream_sync.py
    9 hermetic tests (no network) covering: first-sync baseline, drift
    detection, network-failure preservation of prior snapshots, dry-run,
    offline mode, both new MCP tools, drift-list sort order.

- gitlaw_mcp/freshness/TRUST.md
    Updated to reflect Phase 1a as shipped, Phase 1b as the next milestone
    (auto-resync of stale markdown — needs XML → markdown parser).
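
The HEAD-check described in the list above could look roughly like this. It is a sketch, not the code in sync.py: `fetch_validators` and `has_drifted` are hypothetical names, the snapshot keys `etag` and `last_modified` are assumed, and treating a missing snapshot as a baseline rather than drift is one possible reading of the "first-sync baseline" test.

```python
import urllib.request
from typing import Optional


def fetch_validators(url: str, timeout: float = 10.0) -> dict[str, Optional[str]]:
    """HEAD-request a URL and return its ETag and Last-Modified headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }


def has_drifted(snapshot: Optional[dict], observed: dict) -> bool:
    """Report drift when either validator differs from the stored snapshot.

    A missing snapshot is treated as a first-sync baseline, not as drift.
    """
    if snapshot is None:
        return False
    return (
        snapshot.get("etag") != observed.get("etag")
        or snapshot.get("last_modified") != observed.get("last_modified")
    )
```

Keeping the comparison in a pure function separate from the network call is what makes hermetic tests like the ones listed above straightforward.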

The first live run found that 6 of the 36 monitored laws had already changed
upstream since our corpus copy was last updated, with BGB 50 days behind.
That's exactly the kind of fact a citizen or lawyer should know before relying
on our markdown. Now they can.
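
A days-behind figure like the one above could be derived by comparing the corpus git timestamp (ISO 8601) with the upstream Last-Modified header (HTTP date format). The sketch below is an assumption about how that might work, not the code in server.py; the function names are hypothetical and the dates are made up for illustration.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def days_behind(corpus_iso: str, upstream_last_modified: str) -> int:
    """Whole days by which upstream is newer than the corpus (0 if not newer)."""
    corpus_time = datetime.fromisoformat(corpus_iso)
    upstream_time = parsedate_to_datetime(upstream_last_modified)
    return max((upstream_time - corpus_time).days, 0)


def sort_by_staleness(drifted: dict[str, int]) -> list[tuple[str, int]]:
    """Order drifted laws by days_behind, most stale first."""
    return sorted(drifted.items(), key=lambda item: item[1], reverse=True)


# Made-up timestamps, not real corpus or upstream values.
print(days_behind("2026-01-01T00:00:00+00:00", "Sun, 11 Jan 2026 12:00:00 GMT"))  # 10
```

Both timestamps carry timezone information, so the subtraction is done on aware datetimes and a `+07:00` git offset does not skew the result.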

Test count: 137 → 146 (all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polish for the public release. Restructures the README so a casual visitor
sees the killer features above the fold:

- Headline + badges updated (146 tests, 0% measured hallucinations, TRUST.md link)
- "What you ask Claude" use-case table at the top — concrete, not abstract
- "Why this exists" leads with the 5.9% → 0% eval result instead of generic claims
- NEW "How do you know it's correct?" section — five questions, five tools, each
  answer is a one-call demonstration. Plus the embedded live drift status block
  (6 of 36 monitored laws stale upstream, BGB 50 days behind)
- Tools table split into "core six" + "trust four" so the new provenance and
  freshness tools are surfaced as features, not buried
- Cross-link block to safevoice-mcp and grailsense — declares the MCP-server
  portfolio strategy explicitly
- Roadmap updated: ✅ eval / ✅ manifest / ✅ drift detection / Phase 1b next
- Contact + community section so visitors know where to ask questions
- New .env.example at repo root — copy to .env.local, fill in OPENAI_API_KEY

The single most important add is the "How do you know it's correct?" section.
That's the differentiator against every other legal-tech tool, AI or otherwise:
we don't ask you to trust — we hand you the four tools to verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikelninh merged commit daef586 into main on May 28, 2026
6 checks passed
@mikelninh deleted the feat/freshness-provenance branch on May 28, 2026 20:11