Skip to content

feat(mycelium): add distill subcommand + harden pr-retro capture quality#173

Merged
howie merged 3 commits into
mainfrom
feat/mycelium-distill
Jun 29, 2026
Merged

feat(mycelium): add distill subcommand + harden pr-retro capture quality#173
howie merged 3 commits into
mainfrom
feat/mycelium-distill

Conversation

@howie

@howie howie commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

What & Why

Adds a Hermes-style knowledge-distillation loop on top of the existing mycelium
lessons table, modeled on NousResearch/hermes-agent's Skills System (procedural memory +
periodic nudges + autonomous skill-creation prep). /pr-retro already captures per-PR
lessons, but nothing periodically reviews accumulated lessons to surface recurring patterns —
this fills that gap (the synthesis/skill-drafting half lands in ainization-skill as the
knowledge-distill skill).

Part 0 — pr-retro capture hardening (upstream signal quality)

The distill scoring relies on confidence and recurrence signals that /pr-retro was
flattening:

  • --confidence no longer hardcoded to 7 — differentiate by source (user-stated 8-9,
    cross-model 8, inferred 5-6; +1 on recurrence)
  • Q4 lessons unlock from a fixed "3 points" to 0-5 (quality over padding)
  • lessons add --skill records the subject skill, not the producer (pr-retrospective)
  • --key slug gains a domain-prefix convention so recurrence/dedup converge across PRs

Part A — mycelium distill subcommand

  • distill_service.py: harvest (read-only time window) → cluster (deterministic
    token-Jaccard + type + key-prefix; pluggable _similarity for when embeddings land) →
    score (≥3 members, ≥2 distinct retro_pr, avg_conf ≥7, procedural type, new-evidence
    since watermark
    ) → DigestReport JSON
  • models.py: DistilledCluster / SkillCandidate / DigestReport
  • cli.py: distill run (read-only + watermark) and distill promote-tiers (wires the
    previously unscheduled tier promotion)
  • config.py: DISTILL_DIR / DISTILL_STATE_PATH
  • watermark state.json makes runs idempotent — periodic nudge only fires on new accrual

Tests / verification

  • 20 new tests (clustering, score thresholds DT, watermark idempotency); full mycelium suite
    green (389)
  • ruff / mypy / pylint / bandit / lint-skill-bash / markdownlint all green
  • Real-data dry-run over 233 lessons surfaced 4 coherent cross-PR candidates

🤖 Generated with Claude Code

Hermes-style knowledge distillation loop (periodic nudge + skill-candidate
synthesis prep) on top of the existing lessons table.

Part 0 — pr-retrospective capture hardening (upstream signal quality):
- confidence no longer hardcoded to 7; differentiate by source (user-stated
  8-9, cross-model 8, inferred 5-6; +1 on recurrence)
- Q4 lessons unlock from fixed "3 points" to 0-5 (quality over padding)
- lessons add --skill now records the subject skill, not the producer
- key slug gains a domain prefix convention to converge recurrence/dedup

Part A — mycelium distill subcommand:
- distill_service.py: harvest (read-only window) -> cluster (deterministic
  token-Jaccard + type + key-prefix; pluggable for future embeddings) ->
  score (>=3 members, >=2 distinct retro_pr, avg_conf>=7, procedural type,
  new-evidence-since-watermark) -> DigestReport JSON
- models.py: DistilledCluster / SkillCandidate / DigestReport
- cli.py: `distill run` (read-only + watermark) and `distill promote-tiers`
  (wires the previously unscheduled tier promotion)
- config.py: DISTILL_DIR / DISTILL_STATE_PATH
- watermark state.json makes it idempotent (periodic nudge only on new accrual)

Tests: 20 new (clustering, thresholds, watermark idempotency). Full mycelium
suite green (389). Real-data dry-run over 233 lessons surfaced 4 coherent
cross-PR candidates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XprgLv8LiF2zM3kPhNtyRU
@howie

howie commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator Author

Final Aggregated Review — PR #173

Mode

group-review (3/3 voices active: Claude 4-subagent / agy / Codex)

Voice verdicts

  • Claude (code-reviewer + silent-failure-hunter + pr-test-analyzer + comment-analyzer): NEEDS_CHANGES
  • agy / Gemini: NEEDS_CHANGES (2 Critical refuted on verification, 1 consensus Important)
  • Codex: LGTM (diff inspected, no substantive findings; test exec blocked by sandbox temp dir)

Consensus Important (must fix)

  1. Prefix union-find over-clusters (Claude code-reviewer + agy) — distill_service.py cluster():
    same_prefix unconditionally unions any two same-type lessons sharing a key prefix, regardless
    of token overlap. With union-find transitivity, all bash-* pattern lessons collapse into one
    grab-bag mega-cluster, diluting candidates (contradicts the PR's own "湊數稀釋 cluster" goal).
    Fix: make prefix lower the similarity bar, not override it — union iff
    _similarity >= threshold OR (same_prefix AND _similarity >= PREFIX_FLOOR) (PREFIX_FLOOR=0.15).
    Update DT-002 to assert corrected behavior + add a zero-overlap-same-prefix → no-merge test.

Single-voice, verified real (must fix)

  1. [test gap, raised Critical] Positive watermark resurface path untested (Claude pr-test-analyzer)
    — verified by inspection: every "candidate present" test uses wm=None; the periodic-nudge
    positive branch (wm != None AND a member newer than wm → candidate surfaces) has zero coverage.
    An inverted > would pass the whole suite. Fix: add a resurface test (watermark older than
    members → 1 candidate; advance past → 0; insert fresh post-watermark member → resurfaces).

  2. Silent data loss on parse failures (Claude silent-failure-hunter; agy raised the Z-suffix
    variant) — _parse_ts returns None and harvest folds unparseable-ts rows into the same
    "skip" branch as out-of-window rows; load_watermark swallows corruption → silent re-flood +
    overwrite. Fix: count + surface dropped-unparseable-ts (CLI warn + DigestReport field);
    distinguish corrupt watermark from missing (warn before treating as first run).

  3. Silent harvest truncation at 2000 (Claude code-reviewer + silent-failure-hunter) — warn when
    len(rows) == _HARVEST_SCAN_LIMIT so window truncation is observable.

  4. SKILL.md confidence rubric contradicts hardcoded --source inferred (Claude comment-analyzer)
    — the new by-source confidence guidance is impossible to honor while --source stays hardcoded.
    Fix: parametrize --source in the lessons-add template alongside --confidence.

  5. /knowledge-distill forward reference (Claude comment-analyzer) — referenced as the downstream
    consumer but lives in the separate ainization-skill repo. Fix: clarify cross-repo / planned.

Disputed → refuted with evidence (drop)

  • agy [Critical] from datetime import UTC breaks Python < 3.11 → REFUTED: pyproject.toml
    requires-python = ">=3.11"; existing modules already import UTC; CI runs 3.12. The Z-suffix
    parse concern is likewise moot on 3.11+ (fromisoformat handles Z) — but the underlying
    silent-drop hardening (refactor: reorganize skills into 7 plugin packs for selective marketplace install #3) is kept regardless.
  • agy [Critical] TypeError on null confidence → REFUTED: schema is confidence INTEGER NOT NULL CHECK(confidence BETWEEN 1 AND 10); Pydantic confidence: int = Field(ge=1, le=10). Never None.
    (Claude code-reviewer independently confirmed.)

Actionable NIT (must fix — convention cleans all NITs)

  • Dead now param in score() — remove (implies time logic that isn't there).
  • _slug_title double-prefixes (bash-bash-cd-git) — strip redundant prefix.
  • SKILL.md LESSON_CONFIDENCE placeholder {{5-9}}{{5-10}} (recurrence +1 ceilings at 10).
  • "唯讀 lessons table" overstates — tighten to "只讀 lessons 資料列;schema init 由 init_db 寫入".
  • Add high-value test assertions: union-find transitivity (AB,BC ⇒ one cluster), _jaccard
    partial-overlap ratio, boundary avg_confidence == 7.0 passes.

Voices unavailable

  • None (all 3 active; Codex test-exec blocked by sandbox but review completed).

howie and others added 2 commits June 28, 2026 22:05
… tests)

Resolves the consensus + verified findings from the 3-voice mob review on PR #173.

Clustering (consensus Claude + agy):
- prefix no longer unconditionally unions same-prefix lessons; it lowers the
  similarity bar to PREFIX_SIMILARITY_FLOOR (0.15) instead of overriding it, so
  zero-overlap same-prefix lessons no longer collapse into a grab-bag mega-cluster

Observability / silent-failure hardening (Claude silent-failure-hunter):
- harvest returns HarvestResult with dropped_unparseable_ts + truncated; CLI warns
  on both so "0 candidates" is diagnosable
- load_watermark distinguishes corrupt (warn) from missing state.json
- DigestReport gains dropped_unparseable_ts / truncated fields

Docs (Claude comment-analyzer):
- pr-retro SKILL.md: parametrize --source (was hardcoded inferred while the rubric
  scores confidence by source — source also drives decay); LESSON_CONFIDENCE 5->10
- tighten "read-only" claims to "lesson rows only" (init_db does schema writes)
- clarify /knowledge-distill consumer lives in the ainization-skill repo

NITs:
- remove dead `now` param from score(); fix _slug_title double-prefix

Tests (Claude pr-test-analyzer — the [Critical] blind spot):
- add the positive watermark resurface path (DT-017/018) — previously only the
  suppression direction was tested
- add union-find transitivity (DT-005), zero-overlap-same-prefix no-merge (DT-002b),
  jaccard ratios (EG-001), boundary avg_conf==7.0 (BVA-001), unparseable-ts counted
  (EG-003), corrupt watermark (EG-004)

Refuted (agy single-voice criticals, evidence in PR comment): UTC import needs 3.11
(repo requires-python>=3.11) and null-confidence TypeError (schema is NOT NULL).

27 distill tests + full mycelium suite (396) green; ruff/mypy/pylint/bandit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XprgLv8LiF2zM3kPhNtyRU
…ert)

Round-2 re-review: Claude (4 subagents) and Codex both LGTM with all round-1
findings verified resolved (incl. mutation-checked watermark resurface tests).
Remaining items addressed:

- load_watermark: guard non-dict-but-valid JSON (e.g. `[1,2]`) so .get() can't
  raise AttributeError; move `import sys` to function top (also clears the
  local-import nit)
- run_distill: document the >scan-limit watermark-skip limitation (surfaced via
  HarvestResult.truncated + CLI warn; no SQL-paging redesign at current scale)
- test DT-018: also assert has_new_evidence is True (symmetry with DT-017)

agy round-2 criticals refuted again with evidence (same as round-1): null
confidence / null insight are impossible (schema NOT NULL + Pydantic constraints);
the agy voice reviews the diff without full-repo schema context.

27 distill tests + ruff/mypy/pylint/bandit green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XprgLv8LiF2zM3kPhNtyRU
@howie

howie commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator Author

Round-2 Mob Review — PR #173 (convergence)

Voice verdicts (round 2, on the fix commit)

  • Claude (code-reviewer + silent-failure-hunter + pr-test-analyzer + comment-analyzer): LGTM
    — all round-1 findings verified resolved; pr-test-analyzer mutation-checked the new watermark
    resurface tests (an inverted > in score() now fails DT-017/DT-015). One new NIT (non-dict
    watermark JSON → AttributeError) — fixed in cc83f55.
  • Codex: LGTM — no substantive findings (both rounds).
  • agy / Gemini: NEEDS_CHANGES — all items refuted or documented (see below).

agy round-2 items → disposition

  • [Critical] null confidenceint(None) crash → REFUTED (3rd time): schema
    confidence INTEGER NOT NULL CHECK(BETWEEN 1 AND 10) + Pydantic Field(ge=1, le=10). Never None.
  • [Important] str(None) → "None" in clustering → REFUTED: insight/key are NOT NULL;
    nullable skill is already guarded by if m.get("skill").
  • [Critical] scan-limit truncation + watermark skip → documented limitation: only triggers at

    2000 in-window lessons (current total ~233); surfaced via HarvestResult.truncated + CLI WARN;
    docstring on run_distill now records it. No SQL-paging redesign at current scale.

  • [NIT] local import sys → addressed by moving it to load_watermark top.

agy reviews the diff without full-repo schema context, so it re-raises the same nullability
false-positives each round. Per this skill's "single-voice Critical → empirically refute", these
are dropped with evidence rather than cargo-culting a wrong fix.

Convergence

Functional convergence: Claude LGTM + Codex LGTM; agy's remaining objections are
evidence-refuted false-positives or a documented (warned) limitation. All real findings from both
rounds are resolved. 27 distill tests + 396 full mycelium suite + ruff/mypy/pylint/bandit green.

@howie howie merged commit 9151a6e into main Jun 29, 2026
1 check passed
@howie howie deleted the feat/mycelium-distill branch June 29, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant