diff --git a/README.md b/README.md index ad5d337..82e2a99 100644 --- a/README.md +++ b/README.md @@ -6,37 +6,34 @@ [![Release](https://img.shields.io/github/v/release/lukefwalton/answer-engine)](https://github.com/lukefwalton/answer-engine/releases) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/lukefwalton/answer-engine) -A small answer engine that keeps the authorial frame outside the model: -sources are bounded, private text cannot leak into the prompt, citations are -grounded, and refusals are tested. - -This is site-level search that uses an LLM **without being a chatbot**. You -point it at a body of work — essays, lyrics, letters, philosophy, -documentation — and it answers one question at a time, with no conversation -state, no memory, no persona improvising on your behalf. Each answer is a -one-shot transaction: question in, cited answer or honest refusal out. A -chatbot that's right most of the time speaks *for* you; an answer engine that -cites or declines speaks *from* you. +A small answer engine for a body of work you own. It answers only from your +published sources, keeps your private text out of the prompt, cites what it +uses, and says "I don't know" when it should — and each of those promises is +tested. + +It uses an LLM **without being a chatbot**. Point it at essays, lyrics, +letters, documentation, and it answers one question at a time: no +conversation state, no memory, no persona improvising on your behalf. +Question in, cited answer or honest refusal out. A chatbot that's right most +of the time speaks *for* you; an answer engine that cites or declines speaks +*from* you. This repo is the teaching-sized version of the engine behind "Ask the Archive" on [lukefwalton.com](https://lukefwalton.com). It runs out of the box on a bundled example corpus (by "Person A" — a placeholder, not a person), it's small enough to read in one sitting, and the whole design is -five ideas. Here they are, in the order the data flows. +five ideas, laid out below in the order the data flows. -**What this is:** a **GitHub example repo** you clone and run locally (`npm -install`, `npm run …`). It is not published to npm — there is no `bin`, -`main`, or `exports`; you read the source and invoke the CLI scripts, not -`npm install answer-engine` as a dependency. +**What this is:** an example repo you clone and run locally (`npm install`, +`npm run …`). It is not published to npm, and it is deliberately not a +framework, hosted app, chatbot UI, or vector-database starter. It is the +smallest useful version of the answer contract: what evidence may enter the +prompt, what must stay out, how citations are grounded, and when the system +must decline. -It is deliberately not a framework, hosted app, chatbot UI, or vector-database -starter. It is the smallest useful version of the answer contract: what -evidence may enter the prompt, what must stay out, how citations are grounded, -and when the system must decline. - -**Example content:** everything under `example-content/` is synthetic fiction -for the demo, including the first-person notebook entries — written to show -the private-layer boundary, not real notes. +**Example content:** everything under `example-content/` is synthetic +fiction, including the first-person notebook entries — written to show the +private-layer boundary, not real notes. ## 1. Public records are quotable; private text is not @@ -52,25 +49,23 @@ The corpus has two layers, and the distinction drives everything downstream (`locator`). Its text gets embedded, so retrieval can find it. It is never shown to the model. -> In production ([Ask the Archive](https://lukefwalton.com/ask/)), podcast -> transcripts are part of the public archive: published passages are **records** -> (retrieved and cited). Unpublished transcript text may be embedded for search -> but reaches the model only as **routing hints** — where to listen, never what -> was said. The system must not turn transcripts into uncited private knowledge -> or persona-voice. -> This repo uses hand-written notebook entries to show the same boundary -> without the transcription pipeline. +> In production ([Ask the Archive](https://lukefwalton.com/ask/)), published +> podcast passages are **records** — retrieved and cited — while unpublished +> transcript text is embedded for search but reaches the model only as a +> **routing hint**: where to listen, never what was said. This repo shows the +> same boundary with hand-written notebook entries instead of a transcription +> pipeline. ## 2. Retrieval returns both; assembly strips prose Both layers share one embedding space in one versioned index file (`artifacts/index.json` — gitignored, because vectors derived from private text are private). Retrieval (`src/retrieve.ts`) scores everything with -brute-force cosine plus two conservative boosts — naming a work's title -(0.30) and using a curated theme verbatim (0.15), because metadata you -maintain should outrank raw similarity — and drops anything under a score -floor. Weak matches don't get to masquerade as evidence; an empty result is -where "I don't know" begins, before any model is involved. +brute-force cosine plus two conservative boosts: naming a work's title +(0.30) and using a curated theme verbatim (0.15) — metadata you maintain +should outrank raw similarity. Anything under a score floor is dropped. Weak +matches don't get to masquerade as evidence; an empty result is where "I +don't know" begins, before any model is involved. The result keeps records and notes in **two separate lists**, because what happens next is different for each: @@ -90,25 +85,25 @@ corpus ─► index ─┤ (body travels AnswerEvidence = { records, hints } ──► the model ``` -`src/no-leak.ts` is small enough to audit by eye — the only thing -`toRoutingHint` does is drop the note's text — and it is the whole point: -`RoutingHint` has **no field for the note's text**, so there is nothing through -which private prose could reach the model: the boundary is the type's *shape*, -not a guard somebody remembers to write. +`src/no-leak.ts` is small enough to audit by eye: the only thing +`toRoutingHint` does is drop the note's text. `RoutingHint` has **no field +for that text**, so there is no path by which private prose can reach the +model. The boundary is the type's *shape*, not a guard somebody has to +remember to write. ## 3. The model only sees AnswerEvidence One Responses API call (`src/answer.ts`), with the policy versioned in code -(`src/prompt.ts`): records render with their bodies; hints render as label, -locator, and URL — `buildUserPrompt` couldn't leak a hint's text if it wanted -to, because the field doesn't exist. **What does travel is the label and the -locator: any frontmatter field that becomes a hint's label or locator reaches -the model, so keep titles and locators public-safe.** The body is stripped; -those two are not. (Making that boundary structural rather than advisory is -[`NEXT-STEPS.md`](./NEXT-STEPS.md) A1.) The model is told what a hint *is*: the -location of a relevant private moment, to be routed to, never restated. If no -evidence cleared the floor at all, the engine returns `not-found` without -making the call — refusal costs nothing. +(`src/prompt.ts`). Records render with their full bodies. Hints render as +label, locator, and URL — `buildUserPrompt` couldn't leak a hint's text if it +wanted to, because the field doesn't exist. **What does travel is the label +and the locator: any frontmatter field that becomes either one reaches the +model, so keep titles and locators public-safe.** (Making that boundary +structural rather than advisory is [`NEXT-STEPS.md`](./NEXT-STEPS.md) A1.) +The model is told what a hint *is*: the location of a relevant private +moment, to be routed to, never restated. And if nothing cleared the score +floor, the engine returns `not-found` without making the call at all — +refusal costs nothing. ## 4. Modes are enforced in schema + validator, not vibes @@ -123,22 +118,22 @@ citation mix — which makes honesty checkable: | `not-found` | none, empty answer | "I don't know," plainly | Three layers enforce this, because the first two are requests and only the -third is a guarantee: the JSON schema constrains the shape; `validateAnswer` -rejects contract violations (a `not-found` with prose, a sourced mode without -it); then `repairCitationsToEvidence` snaps almost-right citations onto the -exact retrieved pairs (models mangle URLs more often than they invent -sources), dedupes, and **re-derives the mode from the final mix** — the model -can't claim `supported` while citing nothing but hints. Finally -`assertCitationsGroundedInEvidence` verifies every citation is the exact -(id, url) pair of something actually retrieved. An invented source is an -error, not a footnote. - -One UI lesson: **retrieved is not cited**. -Retrieved neighbors are candidates; final citations are evidence. If you build -a web UI around this, render source cards from the final citation list, not -from raw retrieval hits — and render none for `not-found`, even if retrieval -found nearby material. Otherwise a refusal can look like it is backed by the -very sources the engine declined to use. +third is a guarantee. The JSON schema constrains the shape. `validateAnswer` +rejects contract violations — a `not-found` with prose, a sourced mode +without sources. Then `repairCitationsToEvidence` snaps almost-right +citations onto the exact retrieved pairs (models mangle URLs more often than +they invent sources), dedupes, and **re-derives the mode from the final +mix** — the model can't claim `supported` while citing nothing but hints. +Finally, `assertCitationsGroundedInEvidence` verifies every citation is the +exact (id, url) pair of something actually retrieved. An invented source is +an error, not a footnote. + +One UI lesson: **retrieved is not cited**. Retrieved neighbors are +candidates; final citations are evidence. If you build a web UI around this, +render source cards from the final citation list, not from raw retrieval +hits — and render none for `not-found`, even if retrieval found nearby +material. Otherwise a refusal can look like it's backed by the very sources +the engine declined to use. ## 5. Gold queries are regression tests for answerability @@ -155,46 +150,38 @@ story, including a real failing-then-passing walkthrough. ## What this shows, and where it stops -The strongest objection to this approach is that it works only because the -frame is easy to own: one archive, one named author, a delimited corpus. -The mechanisms here do not depend on that smallness — none of them refers to -corpus size. What a bounded demo cannot do, on its own, is prove that holding -these surfaces at public, plural, or contested scale is affordable, or that -systems where it is genuinely unsettled *whose* frame holds can be made -answerable the same way. That is a real limit, and this repo is the bounded -case on purpose, not a proof about the unbounded one. The public-scale cost -question is real, but it belongs to the builders of public-scale systems, not -to a teaching repo note. - -It is worth being exact about *which* limit, because it is narrower than it -looks. The gate owns **soundness**: nothing enters an answer that isn't -grounded in retrieved evidence or honestly refused. It does not own -**completeness** — it cannot certify that what was retrieved is what *should* -have been. A source that falls below the score floor is simply absent, and a -gate sees only what reaches it. But absence isn't therefore unowned: the -scoring, the floor, and the corpus boundary that decide what becomes a -candidate are authored constants someone maintains (`src/retrieve.ts`, -`archive.config.ts`), and the gold set tests recall for the cases it names -(`eval/gold.yaml`). What stays irreducible is the relevant source no one -thought to test for — and that is irreducible for any system, since -anticipating it in full would mean knowing the answer in advance. - -What the repo does try to show is concrete: that whether a frame is *held* or -just *inherited* can be settled at control surfaces in running code, not -promissory labels. The privacy boundary is structural — a type with no field for private prose, not a guard -someone has to remember (`src/no-leak.ts`); modes are re-derived from the -evidence, not taken from the model's word for it (`src/answer.ts`); refusals -are regression-tested like any other behavior (`eval/gold.yaml`). - -The [Answerability papers](#related-writing) take up the harder cases — plural -authorship, contested frames, systems where *whose* gate applies is itself -unsettled. This repo is the bounded reference implementation; discussion, issues, -and PRs that extend, test, or push against those limits are welcome. The bar -for new code is the bar the repo sets for itself: least lines that keep the -promises, boundaries enforced by types or runtime checks, loud failures, and -no change that makes the eval pass by special-casing a question. Before opening -one, see [`CONTRIBUTING.md`](./CONTRIBUTING.md): it names what is in scope — a -failing gold case is the best PR — and what isn't. +The fair objection: this works because the frame is easy to own — one +archive, one named author, a bounded corpus. The mechanisms don't depend on +that smallness (none of them refers to corpus size), but a small demo can't +prove that holding these boundaries stays affordable at public, plural, or +contested scale. This repo is the bounded case on purpose, not a proof about +the unbounded one. + +The limit is narrower than it looks, though. What the engine guarantees is +**soundness**: nothing enters an answer that isn't grounded in retrieved +evidence or honestly refused. What it can't guarantee is **completeness**: a +source that falls below the score floor is simply absent, and a gate only +sees what reaches it. That absence still has owners — the scoring, the +floor, and the corpus boundary are constants someone maintains +(`src/retrieve.ts`, `archive.config.ts`), and the gold set tests recall for +the cases it names (`eval/gold.yaml`). What remains out of reach, for any +system, is the relevant source no one thought to test for. + +What the repo does show is concrete: whether a frame is *held* or merely +*inherited* can be settled in running code, not in promissory labels. The +privacy boundary is structural (`src/no-leak.ts`); modes are re-derived from +the evidence, not taken on the model's word (`src/answer.ts`); refusals are +regression-tested like any other behavior (`eval/gold.yaml`). + +The [Answerability papers](#related-writing) take up the harder cases — +plural authorship, contested frames, systems where *whose* gate applies is +itself unsettled. This repo is the bounded reference implementation, and +issues and PRs that extend, test, or push against those limits are welcome: +see [`CONTRIBUTING.md`](./CONTRIBUTING.md) for what's in scope (a failing +gold case is the best PR). The bar for new code is the bar the repo sets for +itself: the fewest lines that keep the promises, boundaries enforced by types +or runtime checks, loud failures, and no eval pass by special-casing a +question. --- @@ -250,53 +237,57 @@ npm run typecheck # tsc --noEmit ## Where to take it -In the order we'd add them: chunk long documents into overlapping windows so -retrieval points at passages; more retrieval signals (recency for "what do -you think *now*", author aliases, per-collection weights); a -document-frequency cap on the theme boost — at four records a verbatim theme -match is signal, but on a large corpus a theme that appears on half the -records boosts nothing and should be discounted; an evidence-selection prune -before synthesis (keep one record per cluster, then the clear winner plus a -single corroborator when it leads the rest by a margin) for when a large -corpus makes wide top-k surface correlated neighbors instead of distinct -sources, which shapes what synthesis *sees*, not what the gate certifies -(retrieved is still not cited); an HTTP handler -around `retrieve` + `answerQuestion` with a rate limit, query cap, and cache; -SQLite or pgvector when the archive outgrows in-memory cosine — the shapes -don't change. In production we also keep the wire contract's `not-found` -empty and let the UI roll plain decline copy at display time, so refusals -stay honest *and* human. +In the order we'd add them: + +- **Chunking** — split long documents into overlapping windows so retrieval + points at passages, not whole files. +- **More retrieval signals** — recency (for "what do you think *now*"), + author aliases, per-collection weights. +- **A document-frequency cap on the theme boost** — at four records a + verbatim theme match is signal; on a large corpus, a theme that appears on + half the records boosts nothing and should be discounted. +- **Evidence pruning before synthesis** — on a large corpus, wide top-k + surfaces correlated neighbors instead of distinct sources; keep one record + per cluster, plus a single corroborator when the winner leads by a margin. + This shapes what synthesis *sees*, not what the gate certifies — retrieved + is still not cited. +- **An HTTP handler** around `retrieve` + `answerQuestion`, with a rate + limit, query cap, and cache. +- **SQLite or pgvector** when the archive outgrows in-memory cosine — the + shapes don't change. + +In production we also keep the wire contract's `not-found` empty and let the +UI supply plain decline copy at display time, so refusals stay honest *and* +human. Code the invariant. Document the scaling pattern. Comment the footgun. -The empirical companion to this list — the two levers it doesn't name (vector -dimension and wire format), which only appear once the index crosses a network -boundary, each gated by the eval rather than by vibes — is in +The empirical companion to this list — plus two levers it doesn't name +(vector dimension and wire format), which only matter once the index crosses +a network boundary — is in [`docs/production-scaling.md`](./docs/production-scaling.md). ## Next steps / open problems -[`NEXT-STEPS.md`](./NEXT-STEPS.md) is the standing record of the **seams we can -see** — where the design leaves something to be *owned* rather than structurally -guaranteed — and the **levers an adopter might pull** that trade quality for -cost. It is not a roadmap: nothing in it has to be fixed for the engine to keep -its promises. Each entry is written to be pulled as a ticket, and the -performance section is a starter for anyone adapting this to their own system. -Naming these edges is the program doing what it claims, in the open. +[`NEXT-STEPS.md`](./NEXT-STEPS.md) is the standing record of the seams we +can see — places where the design leaves something to be *owned* rather than +structurally guaranteed — and the levers an adopter might pull to trade +quality for cost. It is not a roadmap: nothing in it has to be fixed for the +engine to keep its promises. Each entry is written to be pulled as a ticket. ## What stays out -A running deployment grows layers this engine deliberately omits: deterministic -product routes (help, usage, or corpus-count answers that never call a model), -a domain-specific eval guard taxonomy, an ingestion or transcription pipeline, -and the site's own config. Those are consumer-adapter concerns. They live in -the site layer (for "Ask the Archive," the `ask-the-archive/` adapter), not the -engine, because the value this repo carries is the boundary and the answer -contract, not feature parity (`.github/STANDARDS.md` §3, "What Matters Less"). -One line worth holding if you add a deterministic route downstream: it may -shortcut *delivery*, but it must never be how a gold query passes. A route that -flips an eval outcome is special-casing the question wearing a hat: the same -thing §5 forbids, one layer up. +A running deployment grows layers this engine deliberately omits: +deterministic product routes (help, usage, or corpus-count answers that never +call a model), a domain-specific eval guard taxonomy, an ingestion or +transcription pipeline, and the site's own config. Those belong to the site +layer (for "Ask the Archive," the `ask-the-archive/` adapter), not the +engine — what this repo carries is the boundary and the answer contract, not +feature parity (`.github/STANDARDS.md` §3, "What Matters Less"). One line +worth holding if you add a deterministic route downstream: it may shortcut +*delivery*, but it must never be how a gold query passes. A route that flips +an eval outcome is special-casing the question wearing a hat — the same thing +§5 forbids, one layer up. ## Citing this software