Recall quality audit (2026-05-22): observer-session crowding, stale-draft hallucination, broken citations #77

@ashu17706

Audit date: 2026-05-22
Corpus size at audit: thousands of sessions accumulated over months of multi-agent use
Trigger: First end-to-end "use Smriti as a user would" eval since the daemon work began. Goal was to find out what's actually shippable, not just what passes unit tests.

This is a single tracking issue covering five queries' worth of findings, the methodology used to evaluate them, and the proposed fixes ranked by complexity. Split into sub-issues if/when work starts on individual findings.

Why this matters now

The v0.8.0 daemon (#71) will make Smriti capture much more data automatically. If retrieval quality is shaky today on a manually-curated corpus, it gets worse, not better, when the daemon is silently filling the DB. We need to know what's wrong before we 5x the data volume.

Methodology

For each query, we tracked five dimensions:

| Dimension | Question |
| --- | --- |
| Precision | Are the retrieved hits actually about what was asked? |
| Recall | Did obvious-relevant sessions get missed? |
| Synthesis fidelity | Does the LLM output reflect what's in the sources, or does it hallucinate? |
| Latency | How long does the call take end-to-end? |
| Ground-truth check | Pick a specific claim per output, verify it against data we can independently confirm |

Ground truths used (so hallucinations could be detected):

  • The 42-process pile-up: 13,449 CPU-minutes total, oldest from Wednesday, fix was lockf -t 0 /tmp/smriti-ingest.lock
  • Daemon design rejected three options during pre-impl smoke tests: chokidar (fires 0 events under Bun), socket-bind single-instance (silently steals connections), long-lived DB connection (Bun segfault, 6.8 GB RSS peak)
  • QMD upstream: fork is 49 commits behind, hits include 004714a / 3b7e065 / d045a8b / e36ab96
  • Current Claude session ID: 4a283f66-575d-47db-864e-9c77f9e0f07b

Test suite

| # | Query | Command |
| --- | --- | --- |
| 1 | "42 stuck processes that consumed 9 CPU days" with synthesis | smriti recall "..." --limit 5 --synthesize |
| 2 | Temporal drift on QMD | smriti drift "qmd" |
| 3 | RAG ask on a specific decision | smriti ask "how does the daemon enforce single-instance, and why" |
| 4 | Project-scoped list | smriti list --project smriti --limit 10 |
| 5 | BM25 exact match | smriti search "lockf" --limit 8 |

Picked to cover: BM25-only retrieval (test 5), semantic retrieval (test 1's recall layer), LLM synthesis (tests 1 + 3), temporal aggregation (test 2), metadata-only listing (test 4). Each test stresses a different layer.

Findings

🔴 P0 — Observer-session crowding in retrieval

Severity: Affects every search/recall query. The single biggest product issue.

Evidence: In test 5 (BM25 for "lockf"), 6 of 8 hits were "Hello memory agent, you are continuing to observe..." sessions, all scoring 0.137–0.152. The primary session containing the actual lockf-and-daemon-design story scored 0.136 at hit #6, below the observer noise. Same pattern in tests 1, 2, 3.

These observer sessions are created by claude-mem's plugin: they record summaries of other sessions' work. They're dense, well-formatted, and contain heavy term overlap with the primary sessions they observe. BM25 and vector retrieval both reward this density. The user almost always wants the primary source, not the observer's summary of it.

Impact: Users searching for their own work get someone else's notes about their own work. Confusing in good cases, misleading in bad cases (observer summaries can lag the primary or contain interpretation drift).

Possible fixes (ranked by complexity):

  1. Quick: Filter observer sessions by default; expose --include-observers for explicit opt-in. Detect via agent name (claude-mem) or session title prefix. ~20 LOC in src/search/index.ts.
  2. Medium: Add a session_type column (primary | observer | derived) on smriti_session_meta, populate at ingest time. Ranker applies a configurable boost/penalty based on type. ~100 LOC + a migration.
  3. Architectural: Treat observer sessions as annotations on primary sessions, not as first-class searchables. They'd surface only when expanding the result for a primary session. ~weeks of work.

Recommended: ship quick fix in v0.8.1 (filter-by-default), revisit medium fix in v0.9.x.
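
A minimal sketch of the quick fix, assuming search hits carry an agent name and a title. The `SearchHit` shape, the `filterObservers` helper, and the `claude-code` agent name in the sample data are hypothetical and only illustrate the idea; the real hook point in src/search/index.ts may look different.

```typescript
// Hypothetical shape of a search hit; field names are assumptions, not Smriti's actual schema.
interface SearchHit {
  sessionId: string;
  agent: string;
  title: string;
  score: number;
}

// Title prefix seen on observer sessions in this audit.
const OBSERVER_TITLE_PREFIX = "Hello memory agent";
// Agent name assumed to identify claude-mem observer sessions.
const OBSERVER_AGENT = "claude-mem";

function isObserverSession(hit: SearchHit): boolean {
  return hit.agent === OBSERVER_AGENT || hit.title.startsWith(OBSERVER_TITLE_PREFIX);
}

// Drop observer sessions unless the caller opts in (the proposed --include-observers flag).
function filterObservers(hits: SearchHit[], includeObservers = false): SearchHit[] {
  return includeObservers ? hits : hits.filter((hit) => !isObserverSession(hit));
}

// Illustrative data only.
const hits: SearchHit[] = [
  { sessionId: "a1", agent: "claude-mem", title: "Hello memory agent, you are continuing to observe...", score: 0.152 },
  { sessionId: "b2", agent: "claude-code", title: "Daemon single-instance design", score: 0.136 },
];
console.log(filterObservers(hits).map((h) => h.sessionId)); // only the primary session remains
```

Checking both the agent name and the title prefix is deliberately redundant: either signal alone could miss sessions if claude-mem changes its prompt or is registered under another name.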

🔴 P0 — Synthesis hallucinates from stale-draft data

Severity: Confidence-of-wrong-answer is the worst failure mode. Has shipped already (synthesis is a daily-use feature).

Evidence: Test 1's synthesis output included this in <next_steps>:

"Address cross-platform file system monitoring with chokidar abstraction."

We explicitly rejected chokidar in the daemon PRD's "Three pre-impl smoke-test findings" section — it fires zero events under Bun 1.3.6 in our test. But earlier drafts of the PRD (now superseded) recommended chokidar. Synthesis pulled from the older draft and confidently presented it as a next step.

For a tool whose explicit pitch is "team learning from each other's coding sessions," this is the exact opposite of the intended UX: rather than surfacing the lesson ("we tried chokidar, it didn't work, here's why"), it surfaces the rejected suggestion as if it were current.

Impact: Decisions can be silently reverted via synthesis. Users acting on a synthesized "next step" can re-introduce a bug we already learned to avoid.

Possible fixes:

  1. Recency weighting in the ranker. When two sources discuss the same topic, prefer the newer one. Today's RRF fusion ignores dates entirely. ~50 LOC in searchVec / searchFTS to add a recency boost.
  2. Contradiction detection during synthesis. smriti recall --check-conflicts exists (#67, "Contradiction detection in recall results") but isn't run by default for --synthesize. Wire conflict detection into the synthesis prompt: if conflicts exist among sources, the prompt should surface "decisions evolved" rather than averaging them.
  3. Explicit decisions ledger. When a decision is marked as superseding an earlier one (via category decision/superseded or explicit linking), synthesis treats the newer as canonical. Big design change; out of scope for v0.8.x.
  4. Synthesis source-citation discipline. Make the synthesis prompt require each claim to cite a specific source ID, and have the post-processor verify the cited source actually contains the claim. Catches the worst form of hallucination at output time. ~moderate work.

Recommended: ship #1 (recency weighting) in v0.8.1, ship #2 (auto-run check-conflicts under synthesize) in v0.9.0.
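
One way the recency weighting could look, as a sketch rather than the actual ranker: standard reciprocal rank fusion, with each fused score multiplied by an exponentially decaying recency bonus. The `RankedDoc` type, the half-life, and the boost weight are all assumptions chosen for illustration and would need tuning on the real corpus.

```typescript
// Illustrative type; Smriti's real result shape may differ.
interface RankedDoc {
  id: string;
  createdAt: Date;
}

const RRF_K = 60; // conventional RRF constant
const HALF_LIFE_DAYS = 30; // assumed; tune empirically
const RECENCY_WEIGHT = 0.5; // assumed: a brand-new doc gets at most a 50% boost
const MS_PER_DAY = 86400000;

// Multiplier between 1 and 1 + RECENCY_WEIGHT; the bonus halves every HALF_LIFE_DAYS.
function recencyBoost(createdAt: Date, now: Date): number {
  const ageDays = Math.max(0, (now.getTime() - createdAt.getTime()) / MS_PER_DAY);
  return 1 + RECENCY_WEIGHT * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// Reciprocal rank fusion over several ranked lists, then a per-doc recency multiplier.
function fuseWithRecency(lists: RankedDoc[][], now: Date): string[] {
  const fused = new Map<string, { doc: RankedDoc; score: number }>();
  for (const list of lists) {
    list.forEach((doc, index) => {
      const entry = fused.get(doc.id) ?? { doc, score: 0 };
      entry.score += 1 / (RRF_K + index + 1); // ranks are 1-based
      fused.set(doc.id, entry);
    });
  }
  return Array.from(fused.values())
    .map(({ doc, score }) => ({ id: doc.id, score: score * recencyBoost(doc.createdAt, now) }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.id);
}

// Illustrative data: a superseded draft outranks the current PRD in both lists.
const now = new Date("2026-05-22T00:00:00Z");
const oldDraft: RankedDoc = { id: "prd-draft-old", createdAt: new Date("2026-02-21T00:00:00Z") };
const currentPrd: RankedDoc = { id: "prd-current", createdAt: new Date("2026-05-22T00:00:00Z") };
console.log(fuseWithRecency([[oldDraft, currentPrd], [oldDraft, currentPrd]], now));
```

With these assumed constants the newer document wins even though the older draft ranks first in both input lists. A multiplicative boost keeps the RRF ordering intact among documents of similar age, which is why it is used here instead of adding a date term to the score.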

🟡 P1 — Citation UX is broken in two specific ways

Severity: Affects every smriti ask output. Hurts trust in answers that are otherwise accurate.

Evidence (test 3):

Sources:
  [1] 362db08e... — Hello memory agent... (Invalid Date)
  [2] 536689d3... — Hello memory agent... (Invalid Date)
  [3] aba53827... — Hello memory agent... (Invalid Date)
  [4] e491a144... — Hello memory agent... (Invalid Date)
  [5] 7930d328... — Hello memory agent... (Invalid Date)

Two distinct bugs:

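  • Date parsing failure: every citation shows Invalid Date. Likely a new Date(undefined) being formatted with toString() or toLocaleDateString() somewhere in the format layer, both of which return the string "Invalid Date" (toISOString() would throw a RangeError instead of printing anything).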
  • Indistinguishable titles: all five citations share the same title prefix from the observer-session prompt. Without dereferencing each session_id, the user can't tell what they're looking at.

Possible fixes:

  1. Date bug: locate the date-formatting call (likely in src/format.ts or wherever smriti ask builds its citations), guard against undefined/null, fall back to "unknown date" or the message's actual timestamp from created_at. ~5 LOC.
  2. Title bug: when a session title would collide with N other sessions, append a distinguishing suffix (date + first-line snippet, or session-id-short). Or: stop using the prompt as the title for observer sessions; derive a title from the observed work instead.

The title bug is partially the same problem as P0 (observer crowding) — fixing P0 may fix this implicitly.
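
For the date bug, a guard along these lines would be enough. This is a sketch: `formatCitationDate` is a hypothetical helper name, and the input type is an assumption about what the citation builder receives.

```typescript
// Format a citation date defensively. The input type is an assumption:
// the stored timestamp may be missing, null, or an unparseable string.
function formatCitationDate(raw: string | number | null | undefined): string {
  if (raw === null || raw === undefined || raw === "") return "unknown date";
  const parsed = new Date(raw);
  // An invalid Date has a NaN time value; toISOString() would throw on it.
  if (Number.isNaN(parsed.getTime())) return "unknown date";
  return parsed.toISOString().slice(0, 10); // YYYY-MM-DD
}

console.log(formatCitationDate(undefined)); // "unknown date"
console.log(formatCitationDate("2026-05-22T10:15:00Z")); // "2026-05-22"
```

The caller can still prefer the message's created_at before falling back, for example by passing the first defined timestamp it has.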

🟡 P1 — smriti drift doesn't show evolution of thinking

Severity: A whole command provides little value beyond what smriti list --project ... --limit N already does.

Evidence (test 2): Asked smriti drift "qmd". Got back a narrative that says "they started doing X, then turned to Y, now Z" — a generic temporal arc. The narrative talks about files (server.ts, queue.ts) but not decisions or turning points. A useful drift for "qmd" would surface: "Mar 12: QMD treated as black-box dependency. Apr 4: discovered fork was 49 commits behind. May 19: decided to track upstream rather than diverge, started Smriti daemon design." Today's output returns dates and filenames where it should return narrative beats.

Topic-vs-keyword confusion also hit hard: drift on "qmd" returned 10 sessions that mention qmd but where qmd wasn't the topic.

Possible fixes:

  1. Better synthesis prompt for drift. Today's prompt presumably says "summarize the evolution"; should say "identify 3-5 turning points and what changed at each." Free improvement.
  2. Topic vs keyword filter: only include sessions where the keyword appears in the title or in a session-level category tag, not anywhere in the content. Cuts the corpus for drift dramatically.
  3. Add a decision-marker hint: surface sessions categorized decision/* more heavily in drift output. We already have the category system; drift just doesn't use it.
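
Fixes 2 and 3 can be sketched together as a pre-filter that runs before the drift prompt. The `SessionMeta` shape and the sample category names other than the decision/ prefix are assumptions for illustration.

```typescript
// Illustrative session shape; field names are assumptions.
interface SessionMeta {
  id: string;
  title: string;
  categories: string[];
}

// Keep a session only if the keyword is in its title or one of its category tags,
// not merely somewhere in the content.
function isAboutTopic(session: SessionMeta, keyword: string): boolean {
  const needle = keyword.toLowerCase();
  return (
    session.title.toLowerCase().includes(needle) ||
    session.categories.some((tag) => tag.toLowerCase().includes(needle))
  );
}

// Filter to on-topic sessions, then move decision-tagged ones to the front.
// Array.prototype.sort is stable, so the original order is kept within each group.
function selectDriftSessions(sessions: SessionMeta[], keyword: string): SessionMeta[] {
  const isDecision = (s: SessionMeta) => s.categories.some((tag) => tag.startsWith("decision/"));
  return sessions
    .filter((s) => isAboutTopic(s, keyword))
    .sort((a, b) => Number(isDecision(b)) - Number(isDecision(a)));
}

// Illustrative data: s1 mentions qmd only in its content, so it is dropped.
const sessions: SessionMeta[] = [
  { id: "s1", title: "Refactor queue.ts", categories: ["code/refactor"] },
  { id: "s2", title: "QMD upstream sync", categories: ["topic/qmd"] },
  { id: "s3", title: "Track QMD upstream instead of diverging", categories: ["decision/dependency"] },
];
console.log(selectDriftSessions(sessions, "qmd").map((s) => s.id));
```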

🟡 P1 — BM25 dynamic range is squashed

Severity: Ranking is barely meaningful for short specific queries.

Evidence (test 5): Searching for "lockf" returned 8 hits scoring 0.128–0.152, a 0.024 spread. For a term that appears once in some sessions and 15+ times in others, the BM25 score differences should be on the order of 10x larger.

Likely the score normalization (the 1 / (1 + |bm25|) step in QMD's searchFTS) is compressing the natural range too aggressively. Could also be a chunking effect: if BM25 runs per-chunk and chunks have uniform size, term-frequency normalization within chunks washes out per-document signal.

Possible fixes:

  1. Investigate first: log raw BM25 scores from FTS5 before normalization to confirm the spread is bigger than the visible one. ~10 minutes of instrumentation.
  2. Adjust normalization: a less-aggressive transform like 1 - 1/(1 + 0.1*|bm25|) widens the range. Tune empirically.
  3. Expose raw scores via --raw-scores flag for diagnostics — never user-facing but useful when debugging.

Probably an upstream QMD concern more than a Smriti one. File against tobi/qmd if confirmed.
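
A quick arithmetic check of fix 2, taking both formulas exactly as written in this issue (neither has been verified against QMD's source). The sketch inverts the current normalization to recover |bm25| from the displayed scores, then compares the spread under each transform.

```typescript
// Normalization as described in this issue (not verified against QMD's source).
const currentNorm = (bm25: number): number => 1 / (1 + Math.abs(bm25));
// The less aggressive transform proposed in fix 2.
const proposedNorm = (bm25: number): number => 1 - 1 / (1 + 0.1 * Math.abs(bm25));

// Invert the current normalization to recover |bm25| from a displayed score.
const rawFromDisplayed = (score: number): number => 1 / score - 1;

const rawA = rawFromDisplayed(0.152); // about 5.58
const rawB = rawFromDisplayed(0.128); // about 6.81

const currentSpread = Math.abs(currentNorm(rawA) - currentNorm(rawB));
const proposedSpread = Math.abs(proposedNorm(rawA) - proposedNorm(rawB));
console.log(currentSpread.toFixed(3), proposedSpread.toFixed(3)); // 0.024 0.047
```

On these two data points the proposed transform roughly doubles the visible spread, which helps but is well short of 10x. Note also that, read literally, the two formulas run in opposite directions: 1 / (1 + |bm25|) shrinks as |bm25| grows, while the proposed one grows with it, so swapping them would flip the sort order unless that is handled elsewhere. Worth confirming during the instrumentation step in fix 1.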

🟡 P2 — "Ingested but not searchable" UX gap

Severity: Confusing edge case. Affects sessions during the gap between ingest and embed.

Evidence: The current Claude session (4a283f66-...) appears in smriti list --project smriti --limit 10 (test 4) but doesn't appear in any of the semantic recall tests (1, 3). It's been ingested as messages but hasn't been chunked + embedded yet. The user has no way to know which state a session is in.

Possible fixes:

  1. Status column in smriti list: add a vector_state column (embedded | pending | failed). User can see at a glance which sessions are full-search-ready.
  2. Auto-embed after ingest in the daemon: when a flush completes, kick off a small qmd embed --batch <project> for the newly-written chunks. Adds latency to the post-ingest path but closes the UX gap.

🟡 P2 — Silent failure when Ollama is unreachable

Severity: User runs --synthesize, gets raw search output, doesn't realize synthesis silently no-op'd.

Evidence: Before starting Ollama, smriti recall "..." --synthesize returned only the raw recall hits, no synthesis section, no error message. The synthesis call presumably timed out or refused-connection'd, and the catch handler just swallowed the failure.

Possible fixes:

  1. Health-check Ollama before synthesis: probe http://127.0.0.1:11434/api/tags (or whatever) and if it fails, print a clear "Ollama unreachable at <host>, returning raw recall hits. Run ollama serve to enable synthesis." Continue without throwing. ~10 LOC.
  2. Cache the last-known-good model: if synthesis works once, remember the model+host. On next failure, surface "synthesis failed (was working at <time>), Ollama may have crashed."
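
A sketch of the health check in fix 1. The function names and the `Probe` type are hypothetical; the probe is injected so the check can be exercised without a running Ollama, and in real use the global fetch could be passed in its place.

```typescript
// Minimal fetch-like signature so the probe can be swapped out in tests.
type Probe = (url: string) => Promise<{ ok: boolean }>;

// Resolve to true only if the probe answers OK within timeoutMs.
async function isOllamaReachable(host: string, probe: Probe, timeoutMs = 2000): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<{ ok: boolean }>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false }), timeoutMs);
  });
  try {
    const response = await Promise.race([probe(`${host}/api/tags`), timeout]);
    return response.ok;
  } catch {
    return false; // connection refused, DNS failure, etc.
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Returns a warning to print, or null when synthesis can proceed.
async function synthesisWarning(host: string, probe: Probe): Promise<string | null> {
  if (await isOllamaReachable(host, probe)) return null;
  return `Ollama unreachable at ${host}, returning raw recall hits. Run "ollama serve" to enable synthesis.`;
}

// Real use (not run here): const warning = await synthesisWarning("http://127.0.0.1:11434", fetch);
```

The check never throws, so the recall path can print the warning and carry on with raw hits, matching the "continue without throwing" requirement.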

Repro recipe (regression check for future releases)

These five tests should be runnable on the corpus before each release. Recommend they be added to docs/internal/eval-suite.md along with this issue's ground truths.

# Before each release, on a real corpus:
smriti recall "the 42 stuck smriti ingest processes that consumed 9 CPU days" --limit 5 --synthesize
smriti drift "qmd"
smriti ask "what did we decide about how the daemon enforces single-instance, and why"
smriti list --project smriti --limit 10
smriti search "lockf" --limit 8

Score each on the five-dimension rubric above. If precision drops or hallucinations appear, block the release.

What I'd actually do, in order

  1. P0 fixes first. Observer-session filter (fix 1 under observer crowding) + recency weighting in synthesis (fix 1 under stale-draft hallucination). These two together would meaningfully improve daily use and don't require schema changes.
  2. P1 citation fixes. Date bug is 5 LOC; title bug is partially solved by P0.
  3. P2 silent-Ollama fix. 10 LOC, prevents user confusion forever.
  4. Drift prompt improvement. Free, doesn't change architecture.
  5. BM25 range investigation. Time-boxed: spend an hour on it; if it's an upstream-QMD issue, file there.
  6. Drift + "ingested-but-not-searchable" + auto-embed-after-ingest. Larger features; queue for v0.9.x.

Do not block v0.8.0 on any of these. The daemon is a write-side feature; nothing here is a regression vs. the current state. But every line of this issue gets more important as v0.8.0 starts capturing more data without user effort.

Acceptance: how we'd know quality is fixed

Run the same five queries on the same corpus three releases later. The bar:

  • Test 5 (BM25 "lockf"): primary sessions occupy ≥3 of the top 5 hits
  • Test 1 (synthesize): no claims contradict known ground truths
  • Test 3 (ask): citations have real dates and distinguishable titles
  • Test 2 (drift): output surfaces ≥3 named decisions with dates
  • All tests: synthesis latency under 30s (today's worst was 87s)

When all five pass on a fresh corpus, this issue closes.
