Recall quality audit (2026-05-22): observer-session crowding, stale-draft hallucination, broken citations #77

**Audit date:** 2026-05-22
**Corpus size at audit:** thousands of sessions accumulated over months of multi-agent use
**Trigger:** First end-to-end "use Smriti as a user would" eval since the daemon work began. Goal was to find out what's actually shippable, not just what passes unit tests.

This is a single tracking issue covering five queries' worth of findings, the methodology used to evaluate them, and the proposed fixes ranked by complexity. Split into sub-issues if/when work starts on individual findings.

## Why this matters now

The v0.8.0 daemon (#71) will make Smriti capture *much more* data automatically. If retrieval quality is shaky today on a manually-curated corpus, it gets worse, not better, when the daemon is silently filling the DB. We need to know what's wrong before we 5x the data volume.

## Methodology

For each query, we tracked five dimensions:

| Dimension | Question |
|---|---|
| **Precision** | Are the retrieved hits actually about what was asked? |
| **Recall** | Did obvious-relevant sessions get missed? |
| **Synthesis fidelity** | Does the LLM output reflect what's in the sources, or does it hallucinate? |
| **Latency** | How long does the call take end-to-end? |
| **Ground-truth check** | Pick a specific claim per output, verify it against data we can independently confirm |

Ground truths used (so hallucinations could be detected):

- The 42-process pile-up: 13,449 CPU-minutes total, oldest from Wednesday, fix was `lockf -t 0 /tmp/smriti-ingest.lock`
- Daemon design rejected three options during pre-impl smoke tests: chokidar (fires 0 events under Bun), socket-bind single-instance (silently steals connections), long-lived DB connection (Bun segfault, 6.8 GB RSS peak)
- QMD upstream: fork is 49 commits behind, hits include 004714a / 3b7e065 / d045a8b / e36ab96
- Current Claude session ID: `4a283f66-575d-47db-864e-9c77f9e0f07b`

## Test suite

| # | Query | Command |
|---|---|---|
| 1 | "42 stuck processes that consumed 9 CPU days" with synthesis | `smriti recall "..." --limit 5 --synthesize` |
| 2 | Temporal drift on QMD | `smriti drift "qmd"` |
| 3 | RAG ask on a specific decision | `smriti ask "how does the daemon enforce single-instance, and why"` |
| 4 | Project-scoped list | `smriti list --project smriti --limit 10` |
| 5 | BM25 exact match | `smriti search "lockf" --limit 8` |

Picked to cover: BM25-only retrieval (test 5), semantic retrieval (test 1's recall layer), LLM synthesis (tests 1 + 3), temporal aggregation (test 2), metadata-only listing (test 4). Each test stresses a different layer.

## Findings

### 🔴 P0 — Observer-session crowding in retrieval

**Severity:** Affects every search/recall query. The single biggest product issue.

**Evidence:** In test 5 (BM25 for "lockf"), 6 of 8 hits were "Hello memory agent, you are continuing to observe..." sessions, all scoring 0.137–0.152. The primary session containing the actual lockf-and-daemon-design story scored 0.136, *below* every one of the observer hits. Same pattern in tests 1, 2, 3.

These observer sessions are created by claude-mem's plugin: they record summaries of other sessions' work. They're dense, well-formatted, and contain heavy term overlap with the primary sessions they observe. BM25 and vector retrieval both reward this density. The user almost always wants the primary source, not the observer's summary of it.

**Impact:** Users searching for their own work get someone else's notes about their own work. Confusing in good cases, misleading in bad cases (observer summaries can lag the primary or contain interpretation drift).

**Possible fixes (ranked by complexity):**

1. **Quick:** Filter observer sessions by default; expose `--include-observers` for explicit opt-in. Detect via agent name (`claude-mem`) or session title prefix. ~20 LOC in `src/search/index.ts`.
2. **Medium:** Add a `session_type` column (`primary | observer | derived`) on `smriti_session_meta`, populate at ingest time. Ranker applies a configurable boost/penalty based on type. ~100 LOC + a migration.
3. **Architectural:** Treat observer sessions as *annotations* on primary sessions, not as first-class searchables. They'd surface only when expanding the result for a primary session. ~weeks of work.

Recommended: ship quick fix in v0.8.1 (filter-by-default), revisit medium fix in v0.9.x.
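
A minimal sketch of the quick fix, assuming a search hit carries the ingesting agent's name and the session title. The `SearchHit` shape, the `agent` field, and both function names are hypothetical; only the `claude-mem` agent name and the "Hello memory agent" title prefix come from the evidence above.

```typescript
// Hypothetical hit shape; the real types in src/search/index.ts may differ.
interface SearchHit {
  sessionId: string;
  title: string;
  agent: string; // assumed: name of the agent that produced the session
  score: number;
}

// Markers taken from this issue: the plugin's agent name and its prompt prefix.
const OBSERVER_AGENT = "claude-mem";
const OBSERVER_TITLE_PREFIX = "Hello memory agent";

// A session counts as an observer if either marker matches.
function isObserverSession(hit: SearchHit): boolean {
  return hit.agent === OBSERVER_AGENT || hit.title.startsWith(OBSERVER_TITLE_PREFIX);
}

// Drop observer sessions unless the caller opted in (the --include-observers case).
function filterObservers(hits: SearchHit[], includeObservers = false): SearchHit[] {
  return includeObservers ? hits : hits.filter((h) => !isObserverSession(h));
}
```

Filtering after retrieval keeps the change small, but it can return fewer than `--limit` hits; over-fetching before the filter would be one way around that.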

### 🔴 P0 — Synthesis hallucinates from stale-draft data

**Severity:** Confidence-of-wrong-answer is the worst failure mode. Has shipped already (synthesis is a daily-use feature).

**Evidence:** Test 1's synthesis output included this in `<next_steps>`:

> "Address cross-platform file system monitoring with chokidar abstraction."

We *explicitly rejected* chokidar in the daemon PRD's "Three pre-impl smoke-test findings" section — it fires zero events under Bun 1.3.6 in our test. But earlier drafts of the PRD (now superseded) recommended chokidar. Synthesis pulled from the older draft and confidently presented it as a next step.

For a tool whose explicit pitch is "team learning from each other's coding sessions," this is the exact opposite of the intended UX: rather than surfacing the lesson ("we tried chokidar, it didn't work, here's why"), it surfaces the rejected suggestion as if it were current.

**Impact:** Decisions can be silently reverted via synthesis. Users acting on a synthesized "next step" can re-introduce a bug we already learned to avoid.

**Possible fixes:**

1. **Recency weighting in the ranker.** When two sources discuss the same topic, prefer the newer one. Today's RRF fusion ignores session dates entirely. ~50 LOC in `searchVec` / `searchFTS` to add a recency boost.
2. **Contradiction detection during synthesis.** `smriti recall --check-conflicts` exists (#67) but isn't run by default for `--synthesize`. Wire conflict detection into the synthesis prompt: if conflicts exist among sources, the prompt should surface "decisions evolved" rather than averaging them.
3. **Explicit decisions ledger.** When a decision is marked as superseding an earlier one (via category `decision/superseded` or explicit linking), synthesis treats the newer as canonical. Big design change; out of scope for v0.8.x.
4. **Synthesis source-citation discipline.** Make the synthesis prompt require each claim to cite a specific source ID, and have the post-processor verify the cited source actually contains the claim. Catches the worst form of hallucination at output time. ~moderate work.

Recommended: ship fix 1 (recency weighting) in v0.8.1, ship fix 2 (auto-run check-conflicts under synthesize) in v0.9.0.
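
To make the recency-weighting idea concrete, here is a sketch of reciprocal rank fusion with an exponential-decay boost applied afterwards. It is not how `searchVec` / `searchFTS` work today; the function names, the 30-day half-life, and the 0.5 maximum boost are assumptions to be tuned, and `k = 60` is just the conventional RRF constant.

```typescript
const RRF_K = 60; // conventional RRF constant, assumed here

// Plain RRF: each list contributes 1 / (k + rank), with rank starting at 1.
function rrf(lists: string[][], k = RRF_K): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return scores;
}

// Multiplier in [1, 1 + maxBoost]: full boost at age 0, halving every halfLifeDays.
function recencyMultiplier(ageDays: number, halfLifeDays = 30, maxBoost = 0.5): number {
  return 1 + maxBoost * Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// Fuse ranked lists, then scale each score by how recent the document is.
// Documents with no known age get no boost (multiplier of exactly 1).
function fuseWithRecency(
  lists: string[][],
  ageDaysById: Map<string, number>,
): [string, number][] {
  return Array.from(rrf(lists).entries())
    .map(([id, s]): [string, number] => [
      id,
      s * recencyMultiplier(ageDaysById.get(id) ?? Infinity),
    ])
    .sort((a, b) => b[1] - a[1]);
}
```

A multiplicative boost only reorders documents whose fused scores are already close, which fits the stated goal: prefer the newer of two sources on the same topic without letting a fresh but irrelevant session outrank a strong older match.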

### 🟡 P1 — Citation UX is broken in two specific ways

**Severity:** Affects every `smriti ask` output. Hurts trust in answers that are otherwise accurate.

**Evidence (test 3):**

```
Sources:
  [1] 362db08e... — Hello memory agent... (Invalid Date)
  [2] 536689d3... — Hello memory agent... (Invalid Date)
  [3] aba53827... — Hello memory agent... (Invalid Date)
  [4] e491a144... — Hello memory agent... (Invalid Date)
  [5] 7930d328... — Hello memory agent... (Invalid Date)
```

Two distinct bugs:

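- **Date parsing failure**: every citation shows `Invalid Date`. Likely a `new Date(undefined)` reaching `toString()` or `toLocaleDateString()` somewhere in the format layer; both return the literal string "Invalid Date" for an invalid date (`toISOString()` would throw a `RangeError` instead).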
- **Indistinguishable titles**: all five citations share the same title prefix from the observer-session prompt. Without dereferencing each session_id, the user can't tell what they're looking at.

**Possible fixes:**

1. **Date bug**: locate the date-formatting call (likely in `src/format.ts` or wherever `smriti ask` builds its citations), guard against undefined/null, fall back to "unknown date" or the message's actual timestamp from `created_at`. ~5 LOC.
2. **Title bug**: when a session title would collide with N other sessions, append a distinguishing suffix (date + first-line snippet, or session-id-short). Or: stop using the prompt as the title for observer sessions; derive a title from the *observed* work instead.

The title bug is partially the same problem as P0 (observer crowding) — fixing P0 may fix this implicitly.
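
The date guard could look roughly like this. `formatCitationDate` is a hypothetical helper name, and the accepted input types are an assumption about what `created_at` holds.

```typescript
// Return YYYY-MM-DD, or "unknown date" when the value is missing or unparseable.
function formatCitationDate(value: string | number | null | undefined): string {
  if (value === null || value === undefined || value === "") return "unknown date";
  const d = new Date(value);
  // An invalid Date has a NaN timestamp; check before calling toISOString(),
  // which throws a RangeError on invalid dates.
  return Number.isNaN(d.getTime()) ? "unknown date" : d.toISOString().slice(0, 10);
}
```

Checking `getTime()` rather than comparing against the string "Invalid Date" keeps the guard independent of locale and runtime.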

### 🟡 P1 — `smriti drift` doesn't show evolution of thinking

**Severity:** A whole command provides little value beyond what `smriti list --project ... --limit N` already does.

**Evidence (test 2):** Asked `smriti drift "qmd"`. Got back a narrative that says "they started doing X, then turned to Y, now Z" — a generic temporal arc. The narrative talks about *files* (server.ts, queue.ts) but not *decisions* or *turning points*. A useful drift for "qmd" would surface: "Mar 12: QMD treated as black-box dependency. Apr 4: discovered fork was 49 commits behind. May 19: decided to track upstream rather than diverge, started Smriti daemon design." Today's output returns dates and filenames where it should return narrative beats.

Topic-vs-keyword confusion also hit hard: drift on "qmd" returned 10 sessions that *mention* qmd but where qmd wasn't the topic.

**Possible fixes:**

1. **Better synthesis prompt** for drift. Today's prompt presumably says "summarize the evolution"; should say "identify 3-5 turning points and what changed at each." Free improvement.
2. **Topic vs keyword filter**: only include sessions where the keyword appears in the *title* or in a session-level category tag, not anywhere in the content. Cuts the corpus for drift dramatically.
3. **Add a decision-marker hint**: surface sessions categorized `decision/*` more heavily in drift output. We already have the category system; drift just doesn't use it.
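
Fixes 2 and 3 can be sketched together as a pre-filter in front of the drift prompt. The `SessionMeta` shape and `selectDriftSessions` are hypothetical; the only detail taken from this issue is the `decision/*` category convention.

```typescript
// Hypothetical session metadata; real field names may differ.
interface SessionMeta {
  id: string;
  title: string;
  categories: string[]; // session-level category tags, e.g. "decision/..."
}

// Keep sessions whose title or category tags mention the keyword (not the body),
// then move decision-tagged sessions to the front, preserving order otherwise.
function selectDriftSessions(sessions: SessionMeta[], keyword: string): SessionMeta[] {
  const needle = keyword.toLowerCase();
  const onTopic = sessions.filter(
    (s) =>
      s.title.toLowerCase().includes(needle) ||
      s.categories.some((c) => c.toLowerCase().includes(needle)),
  );
  const isDecision = (s: SessionMeta): boolean =>
    s.categories.some((c) => c.startsWith("decision/"));
  // Array.prototype.sort is stable, so ties keep their input order.
  return onTopic.sort((a, b) => Number(isDecision(b)) - Number(isDecision(a)));
}
```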

### 🟡 P1 — BM25 dynamic range is squashed

**Severity:** Ranking is barely meaningful for short specific queries.

**Evidence (test 5):** Searching for "lockf" returned 8 hits scoring 0.128–0.152 — a 0.024 spread. For a term that appears 1 time in some sessions and 15+ times in others, the BM25 score differences should be 10x larger.

Likely the score normalization (the `1 / (1 + |bm25|)` step in QMD's `searchFTS`) is mapping the natural range too aggressively. Could also be a chunking effect: if BM25 runs per-chunk and chunks have uniform size, term-frequency normalization within chunks washes out per-document signal.

**Possible fixes:**

1. **Investigate first**: log raw BM25 scores from FTS5 before normalization to confirm the spread is bigger than the visible one. ~10 minutes of instrumentation.
2. **Adjust normalization**: a less-aggressive transform like `1 - 1/(1 + 0.1*|bm25|)` widens the range. Tune empirically.
3. **Expose raw scores via `--raw-scores` flag** for diagnostics — never user-facing but useful when debugging.

Probably an upstream QMD concern more than a Smriti one. File against `tobi/qmd` if confirmed.
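
A quick back-of-envelope check of the two transforms, assuming the displayed scores come straight out of the `1 / (1 + |bm25|)` step with nothing applied afterwards (unconfirmed until fix 1's logging is done). Inverting the displayed range gives the implied raw magnitudes, which can then be pushed through the proposed transform.

```typescript
// Normalization as described in this issue (assumed to be exactly this).
const currentNorm = (bm25: number): number => 1 / (1 + Math.abs(bm25));
// Proposed alternative from fix 2.
const proposedNorm = (bm25: number): number => 1 - 1 / (1 + 0.1 * Math.abs(bm25));
// Inverse of currentNorm: recover |bm25| from a displayed score.
const impliedRaw = (score: number): number => 1 / score - 1;

const rawAtTopScore = impliedRaw(0.152);    // ≈ 5.58
const rawAtBottomScore = impliedRaw(0.128); // ≈ 6.81

const currentSpread = currentNorm(rawAtTopScore) - currentNorm(rawAtBottomScore);
const proposedSpread = proposedNorm(rawAtBottomScore) - proposedNorm(rawAtTopScore);

console.log(currentSpread.toFixed(3));  // the visible 0.024 spread
console.log(proposedSpread.toFixed(3)); // roughly twice as wide under the proposal
```

Under that assumption the raw magnitudes span only about 5.6 to 6.8, so the proposed transform roughly doubles the spread rather than fixing it outright; most of the squashing would then be in the raw scores themselves, which supports the chunking hypothesis. The arithmetic also shows that `1 / (1 + |bm25|)` shrinks as `|bm25|` grows while the proposed transform grows with it. SQLite's FTS5 `bm25()` returns more-negative values for better matches, so it is worth confirming during the investigation step which direction QMD's actual code uses.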

### 🟡 P2 — "Ingested but not searchable" UX gap

**Severity:** Confusing edge case. Affects sessions during the gap between ingest and embed.

**Evidence:** The current Claude session (`4a283f66-...`) appears in `smriti list --project smriti --limit 10` (test 4) but doesn't appear in any of the semantic recall tests (1, 3). It's been ingested as messages but hasn't been chunked + embedded yet. The user has no way to know which state a session is in.

**Possible fixes:**

1. **Status column in `smriti list`**: add a `vector_state` column (`embedded | pending | failed`). User can see at a glance which sessions are full-search-ready.
2. **Auto-embed after ingest** in the daemon: when a flush completes, kick off a small `qmd embed --batch <project>` for the newly-written chunks. Adds latency to the post-ingest path but closes the UX gap.

### 🟡 P2 — Silent failure when Ollama is unreachable

**Severity:** User runs `--synthesize`, gets raw search output, doesn't realize synthesis silently no-op'd.

**Evidence:** Before starting Ollama, `smriti recall "..." --synthesize` returned only the raw recall hits, with no synthesis section and no error message. The synthesis call presumably timed out or had its connection refused, and the catch handler swallowed the failure.

**Possible fixes:**

1. **Health-check Ollama before synthesis**: probe `http://127.0.0.1:11434/api/tags` (Ollama's model-listing endpoint on its default port, or the configured host) and if it fails, print a clear "Ollama unreachable at <host>, returning raw recall hits. Run `ollama serve` to enable synthesis." Continue without throwing. ~10 LOC.
2. **Cache the last-known-good model**: if synthesis works once, remember the model+host. On next failure, surface "synthesis failed (was working at <last-success-time>) — Ollama may have crashed."
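
A sketch of the health check from fix 1. The function names, the 2-second timeout, and the injectable `fetchImpl` parameter are assumptions for illustration; only the `/api/tags` endpoint and the message wording come from the fix above.

```typescript
// Minimal fetch signature so the probe can be stubbed in tests.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

// Resolve true if Ollama answers /api/tags in time; false on error or timeout.
async function ollamaReachable(
  host = "http://127.0.0.1:11434",
  timeoutMs = 2000,
  fetchImpl: FetchLike = (url) => fetch(url),
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    const probe = fetchImpl(`${host}/api/tags`).then((r) => r.ok, () => false);
    return await Promise.race([probe, timeout]);
  } catch {
    return false; // fetchImpl threw synchronously
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after the probe settles
  }
}

// Message to print instead of silently returning raw hits.
function unreachableMessage(host: string): string {
  return `Ollama unreachable at ${host}, returning raw recall hits. Run \`ollama serve\` to enable synthesis.`;
}
```

The caller would print `unreachableMessage(host)` to stderr and continue with the raw recall output, so a missing Ollama degrades the result visibly instead of silently.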

## Repro recipe (regression check for future releases)

These five tests should be runnable on the corpus before each release. Recommend they be added to `docs/internal/eval-suite.md` along with this issue's ground truths.

```bash
# Before each release, on a real corpus:
smriti recall "the 42 stuck smriti ingest processes that consumed 9 CPU days" --limit 5 --synthesize
smriti drift "qmd"
smriti ask "what did we decide about how the daemon enforces single-instance, and why"
smriti list --project smriti --limit 10
smriti search "lockf" --limit 8
```

Score each on the five-dimension rubric above. If precision drops or hallucinations appear, block the release.

## What I'd actually do, in order

1. **P0 fixes first.** Observer-session filter (crowding fix 1) + recency weighting in the ranker (stale-draft fix 1). These two together would meaningfully improve daily use and don't require schema changes.
2. **P1 citation fixes.** Date bug is 5 LOC; title bug is partially solved by P0.
3. **P2 silent-Ollama fix.** 10 LOC, prevents user confusion forever.
4. **Drift prompt improvement.** Free, doesn't change architecture.
5. **BM25 range investigation.** Time-boxed: spend an hour on it; if it's an upstream-QMD issue, file there.
6. **Remaining drift fixes (topic filter, decision markers) + "ingested-but-not-searchable" status + auto-embed-after-ingest.** Larger features; queue for v0.9.x.

Do not block v0.8.0 on any of these. The daemon is a write-side feature; nothing here is a regression vs. the current state. But every line of this issue gets more important as v0.8.0 starts capturing more data without user effort.

## Acceptance: how we'd know quality is fixed

Run the same five queries on the same corpus three releases later. The bar:

- Test 5 (BM25 "lockf"): primary sessions occupy ≥3 of the top 5 hits
- Test 1 (synthesize): no claims contradict known ground truths
- Test 3 (ask): citations have real dates and distinguishable titles
- Test 2 (drift): output surfaces ≥3 named decisions with dates
- All tests: synthesis latency under 30s (today's worst was 87s)

When all five criteria pass on that run, this issue closes.
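
The bar above could be encoded as a small check so the eval suite can report exactly which criteria failed. The `EvalRun` shape and its field names are hypothetical; the thresholds are the ones listed in this section.

```typescript
// Hypothetical summary of one eval run; a human or script fills this in.
interface EvalRun {
  primaryHitsInTop5: number;            // test 5: primary sessions among the top 5 "lockf" hits
  groundTruthContradictions: number;    // test 1: synthesized claims contradicting ground truths
  citationsHaveDatesAndTitles: boolean; // test 3: real dates and distinguishable titles
  namedDecisionsWithDates: number;      // test 2: named decisions with dates in drift output
  worstSynthesisLatencySec: number;     // all tests: slowest synthesis call
}

// Return the criteria that failed; an empty array means the bar is met.
function failedCriteria(run: EvalRun): string[] {
  const failures: string[] = [];
  if (run.primaryHitsInTop5 < 3) failures.push("test 5: fewer than 3 primary sessions in top 5");
  if (run.groundTruthContradictions > 0) failures.push("test 1: synthesis contradicts ground truth");
  if (!run.citationsHaveDatesAndTitles) failures.push("test 3: citations lack real dates or distinct titles");
  if (run.namedDecisionsWithDates < 3) failures.push("test 2: fewer than 3 named decisions with dates");
  if (run.worstSynthesisLatencySec >= 30) failures.push("latency: synthesis took 30s or longer");
  return failures;
}
```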

