diff --git a/demo/README.md b/demo/README.md new file mode 100644 index 0000000..c10ed21 --- /dev/null +++ b/demo/README.md @@ -0,0 +1,141 @@ +# The int8 scaling demo + +The result this module is built to produce is a **caught failure**: the same +gold suite that owns grounding and refusal rejecting a cheaper encoding. The +*mechanism* is proven offline, in `quantize.test.ts` (run by `npm test`): on +fixture vectors searched to exhibit a near-tie, int8 preserves both the route +and the disambiguation winner, int4 flips the top slot, and the gate catches it. + +Whether the **real Smith corpus** produces that flip at the int8/int4 boundary +is a separate, empirical question, settled by the build run, not asserted here. +"int8 held" on a small corpus is expected and proves little on its own; the gate +saying *no* when pushed is what shows the gold suite, not the encoding, is the +adjudicator. So: the mechanism is demonstrated; the real-corpus demonstration is +pending. + +The committed vectors are not built yet (this module was written with no network +and no key), so `npm run demo:run` errors with a build pointer until then; +see **Build status**. Once built: + +``` +npm run demo:run # int8, real corpus: the headline, keyless +npm run demo:run -- --natural+synthetic # add the spire and its gold +npm run demo:run -- --natural+synthetic --bits 4 # int4: the gate rejects the spire's route flip +npm run demo:run -- --full # also run the answer-mode pass (needs a key) +``` + +## What it is + +The paper (§6) claims that the same gold suite which owns grounding and refusal +also adjudicated every cost reduction made to run the system at scale. The +production figures behind that are private and non-reproducible. This demo makes +the *mechanism* runnable on a public-domain corpus: it quantizes the embedding +index to int8, re-ranks, and runs the full gold suite including the must-refuse +and must-route cases, so the gate either certifies or rejects the cheaper +encoding. 
The claim is **relative, not absolute**: not "this corpus is +realistic," but "int8 preserves the verdicts full-precision produces, and where +it does not, the gate catches it." Realism is never asserted. + +Public domain is the *absence* of copyright, not a license: this corpus is +public-domain, not "permissively licensed." The two name-colliding authors and +their provenance live in [`corpus/README.md`](./corpus/README.md). + +## How it works (a wrapper plus a re-rank, not a second system) + +The int8 path is an encode/decode wrapper plus a re-rank. It reuses the core +retrieval (`src/retrieve.ts`), the gold judge (`src/evaluate.ts`), the store +(`src/store.ts`), and the no-leak boundary (`src/no-leak.ts`) untouched; nothing +in the core was forked or changed. `quantize.ts` is the public twin of the +production site adapter's `vector-quant.ts` (named in +[`docs/production-scaling.md`](../docs/production-scaling.md) §2). The harness +quantizes the committed full-precision vectors in process, dequantizes, and +hands the result to the same `retrieve()` the engine uses. + +Two facts make int8 admissible, and they differ in kind (the §6 split): + +- **Exact, by algebra.** Cosine normalizes by vector norm, so a positive + per-vector scale cancels from the score entirely. The ranking is invariant to + it; you can score against the quantized bytes without restoring the scale. +- **Measured, by the suite.** Integer rounding perturbs direction and can + reorder near-ties, so its harmlessness is not proven; it is verified. The + harness reports rank correlation against the full-precision ranking, then runs + the gold suite. Rank correlation is *necessary, not sufficient*: a demo that + reports it and stops has shown a retrieval benchmark, not answerability + governing tuning. 
The gold suite is the actual adjudicator, and it checks not + just that the expected source is *retrieved* but that it *wins the top slot*: + so a quantization flip that swaps which Smith ranks first (disambiguation) or + lets a public record overtake the private note (route) is caught keyless, not + only by the keyed answer pass. Past int8 (int4, PQ, binary) the exact part + stops applying and the whole lever is measured; the wire format is versioned + so a code/data mismatch fails loudly. + +The headline run is **keyless**: it reads committed full-precision vectors and +committed gold-query vectors, so no embedding call is made. A key is needed only +to regenerate the vectors (`demo:build`) or to run the `--full` answer pass. +That answer pass exercises route *selection*, which is what quantization moves; +it does not touch A2, the answer model's confabulation residue, which the +encoding never exercises. + +## Disclosures (the three that are non-negotiable) + +1. **Layer designation, not secrecy.** "Private" means the type cannot carry the + text to the model, regardless of what the text is. George Adam Smith's minor + works are public-domain; assigning some of them to the private layer is an + authored research decision, the same move the core's notebook entries make. +2. **The synthetic spire is fabricated and flagged.** A small set of fabricated + George-private notes lives quarantined in `corpus/synthetic/`, loaded only + under `--natural+synthetic`, each marked `synthetic: true` and naming the + edge it tests. It is additive and never enters the headline metrics; the + spire's effect is reported on its own line. No fabricated words are ever + passed off as either real Smith's writing: the spire is George-framed but + flagged, and nothing fabricated is presented as the actual work of either man. +3. **The claim is relative.** int8 preserves the verdicts full-precision + produces; the corpus is not offered as realistic and nothing turns on its + realism. 
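The relative claim in item 3 rests on the exact/measured split described under "How it works". The sketch below illustrates both halves under stated assumptions: symmetric per-vector quantization, and toy vectors invented for illustration. `cosineSim`, `quantizeSketch`, and `rankSketch` are hypothetical names, not the API of `quantize.ts`.

```typescript
type Doc = { id: string; vector: number[] };

/** Cosine similarity between two equal-length vectors. */
function cosineSim(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}

/** Symmetric per-vector quantization to signed `bits`-bit integer codes (an assumed scheme). */
function quantizeSketch(v: number[], bits: number): { codes: number[]; scale: number } {
  const qmax = 2 ** (bits - 1) - 1; // 127 for int8, 7 for int4
  const maxAbs = Math.max(...v.map((x) => Math.abs(x)));
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  return { codes: v.map((x) => Math.round(x / scale)), scale };
}

/** Document ids ordered by descending cosine score against the query. */
function rankSketch(query: number[], docs: Doc[]): string[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSim(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .map((s) => s.id);
}

// Toy data, invented for illustration only.
const toyQuery = [1, 0.2, 0];
const toyDocs: Doc[] = [
  { id: 'a', vector: [0.9, 0.1, 0.1] },
  { id: 'b', vector: [0.5, 0.5, 0.5] },
  { id: 'c', vector: [0, 0, 1] },
];

// Exact half: scoring against the raw integer codes or against the rescaled
// values gives the same cosine, because the positive scale cancels.
const quantA = quantizeSketch(toyDocs[0].vector, 8);
const onCodes = cosineSim(toyQuery, quantA.codes);
const onRescaled = cosineSim(toyQuery, quantA.codes.map((c) => c * quantA.scale));

// Measured half: rounding can move direction, so the ranking is compared, not assumed.
const fullRanking = rankSketch(toyQuery, toyDocs);
const int8Ranking = rankSketch(
  toyQuery,
  toyDocs.map((d) => ({ id: d.id, vector: quantizeSketch(d.vector, 8).codes })),
);

console.log(Math.abs(onCodes - onRescaled) < 1e-12, fullRanking.join(','), int8Ranking.join(','));
```

Agreement between the two rankings is only the necessary check; whether the expected source still wins the top slot on the route and refuse cases is what the gold suite decides.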
+ +One disclosure carries a warning. The core gitignores `artifacts/index.json` +because vectors derived from private text are private; this demo does the +opposite and commits its vectors, so the headline reproduces with no key. That +is safe *here* because the "private" layer is public-domain George text, whose +embeddings reveal nothing that is not already public. Do not copy "commit your vectors" as a +general pattern: embeddings of genuinely private text can be inverted to recover +approximate content, which is the exposure the core's gitignored index avoids. + +## Build status + +The code, the gold set, the provenance manifest, and the deterministic harness +tests (`quantize.test.ts`, run by `npm test`) are committed. The real text +bodies and the committed vectors (`corpus/index.json`, +`corpus/index.synthetic.json`, `corpus/query-vectors.json`) are produced by +`demo:build`, which needs network access to the public-domain sources and an +`OPENAI_API_KEY`; the session that wrote the module had neither. See +[`docs/scaling-demo/build-handoff.md`](../docs/scaling-demo/build-handoff.md) +for the exact steps, and the delta log for what is confirmed versus pending. + +## The spec and the log are kept in the open + +The planning docs live beside the module in +[`docs/scaling-demo/`](../docs/scaling-demo/), kept on purpose rather than +discarded once the code landed: + +- `SCALING-DEMO-spec.md`: what the demo set out to do, and why; the ticket it was + built from. +- `scaling-demo-delta-log.md`: every place the build diverged from that spec, + what is settled versus pending the keyed build run, and the prepared + reconciliations (NEXT-STEPS, STANDARDS, the paper) to apply at merge. +- `build-handoff.md`: the brief for the build run that fetches the public-domain + texts and generates the committed vectors. + +This is the same move the corpus manifest makes: the reasoning behind the +artifact is part of the artifact. 
A reader can see what was intended, where +reality differed, and which decisions are still owed. + +## Relation to production + +This is the runnable counterpart to the prose in `docs/production-scaling.md` +§2: the prose makes the case, the demo runs it. The George/Adam disambiguation +mirrors the real two-tier citation surface on the production site (Ask the +Archive), where a public-record citation carries an id and a URL and a +routing-hint citation carries only where the moment lives, never the text. The +**architecture** is what reproduces here, not the scale: the scale stays +reported in §6, the mechanism runs in this folder. diff --git a/demo/build.ts b/demo/build.ts new file mode 100644 index 0000000..1c9b590 --- /dev/null +++ b/demo/build.ts @@ -0,0 +1,154 @@ +// npm run demo:build — embed the scaling corpus and the gold queries, then +// commit the vectors. KEYED and run once (or after corpus edits): needs network +// to the embedding API and an OPENAI_API_KEY. The session that wrote this code +// had neither; see docs/scaling-demo/build-handoff.md. +// +// Reuses the core corpus loaders, embedding, and store writers untouched. The +// only thing new is pointing them at demo/corpus/ and splitting the output +// into the natural index (the headline source of truth), the synthetic spire +// (a strictly baseline-plus-delta file, unioned only under --natural+synthetic), +// and the committed gold-query vectors (what makes demo:run keyless). 
+ +import { createHash } from 'node:crypto'; +import { existsSync } from 'node:fs'; +import { resolve } from 'node:path'; +import OpenAI from 'openai'; + +import { buildCorpus, buildPrivateNotes, embedText, noteEmbedText } from '../src/corpus.js'; +import { batchInputs, embedBatch, truncateForEmbedding } from '../src/embedding.js'; +import { assertHomogeneousIndex, writeIndexFile } from '../src/store.js'; +import type { ArchiveConfig, IndexEntry, PrivateNote } from '../src/types.js'; +import { loadGold } from '../src/evaluate.js'; +import { config, SYNTHETIC_NOTES_DIR } from './config.js'; +import { writeQueryVectors } from './query-vectors.js'; + +const NATURAL_INDEX = resolve('demo/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('demo/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('demo/gold.yaml'); +const SYNTHETIC_GOLD = resolve('demo/gold.synthetic.yaml'); + +function contentHash(text: string): string { + return createHash('sha1').update(truncateForEmbedding(text)).digest('hex').slice(0, 16); +} + +type EmbedJob = { id: string; text: string }; + +async function embedAll(client: OpenAI, jobs: EmbedJob[]): Promise<Map<string, number[]>> { + const byId = new Map<string, number[]>(); + let done = 0; + for (const batch of batchInputs(jobs)) { + const results = await embedBatch(client, batch, { model: config.embeddingModel }); + for (const r of results) byId.set(r.id, r.vector); + done += batch.length; + console.log(` embedded ${done}/${jobs.length}`); + } + return byId; +} + +function recordEntries(config: ArchiveConfig, vectors: Map<string, number[]>): IndexEntry[] { + const entries: IndexEntry[] = []; + for (const record of buildCorpus(config)) { + const text = embedText(record); + const vector = vectors.get(record.id); + if (!vector) throw new Error(`no embedding returned for record '${record.id}'; refusing to write a partial index.`); + entries.push({ + model: config.embeddingModel, + dimensions: vector.length, + vector, + contentHash: contentHash(text), + sourceType: 'record', + record, + }); + } 
+ return entries; +} + +function noteEntries(notes: PrivateNote[], vectors: Map<string, number[]>): IndexEntry[] { + const entries: IndexEntry[] = []; + for (const note of notes) { + const vector = vectors.get(note.id); + if (!vector) throw new Error(`no embedding returned for note '${note.id}'; refusing to write a partial index.`); + entries.push({ + model: config.embeddingModel, + dimensions: vector.length, + vector, + contentHash: contentHash(noteEmbedText(note)), + sourceType: 'note', + note, + }); + } + return entries; +} + +async function main(): Promise<void> { + if (!process.env.OPENAI_API_KEY) { + throw new Error('OPENAI_API_KEY is not set. demo:build needs it to embed (see build-handoff.md).'); + } + const client = new OpenAI(); + + const records = buildCorpus(config); + const naturalNotes = buildPrivateNotes(config); + const syntheticNotes = buildPrivateNotes({ ...config, privateNotesDir: SYNTHETIC_NOTES_DIR }); + console.log( + `Corpus: ${records.length} records, ${naturalNotes.length} private notes, ` + + `${syntheticNotes.length} synthetic notes`, + ); + if (records.length === 0) { + throw new Error('No records found under demo/corpus/public — populate it first (build-handoff.md §1).'); + } + + // Gold queries: natural always, synthetic if authored. + const gold = loadGold(NATURAL_GOLD, config.authorName); + const goldQueries = [...gold]; + if (existsSync(SYNTHETIC_GOLD)) { + goldQueries.push(...loadGold(SYNTHETIC_GOLD, config.authorName)); + } + + // One embedding pass over every source and query, distinguished by id. 
+ const sourceJobs: EmbedJob[] = [ + ...records.map((r) => ({ id: r.id, text: embedText(r) })), + ...naturalNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })), + ...syntheticNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })), + ]; + const queryJobs: EmbedJob[] = goldQueries.map((g) => ({ id: `query:${g.id}`, text: g.query })); + + console.log(`Embedding ${sourceJobs.length} sources and ${queryJobs.length} gold queries...`); + const vectors = await embedAll(client, [...sourceJobs, ...queryJobs]); + + // Natural index: records + real private notes. + const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)].sort((a, b) => + (a.sourceType === 'record' ? a.record.id : a.note.id).localeCompare( + b.sourceType === 'record' ? b.record.id : b.note.id, + ), + ); + assertHomogeneousIndex(naturalEntries); + writeIndexFile(naturalEntries, NATURAL_INDEX); + console.log(`Wrote ${naturalEntries.length} natural entries to ${NATURAL_INDEX}`); + + // Synthetic spire: written only when authored, so the headline never depends on it. + if (syntheticNotes.length > 0) { + const spireEntries = noteEntries(syntheticNotes, vectors); + assertHomogeneousIndex([...naturalEntries, ...spireEntries]); // spire must share the space + writeIndexFile(spireEntries, SYNTHETIC_INDEX); + console.log(`Wrote ${spireEntries.length} synthetic spire entries to ${SYNTHETIC_INDEX}`); + } else { + console.log('No synthetic notes authored yet; skipping the spire index.'); + } + + // Committed gold-query vectors (what makes demo:run keyless). Every gold + // query must embed, or the keyless runner would later fail on a missing id. + const queryVectors = goldQueries.map((g) => { + const vector = vectors.get(`query:${g.id}`); + if (!vector) throw new Error(`no embedding returned for gold query '${g.id}'; refusing to write partial query vectors.`); + return { id: g.id, vector }; + }); + const dims = queryVectors[0]?.vector.length ?? naturalEntries[0]?.dimensions ?? 
0; + writeQueryVectors(config.embeddingModel, dims, queryVectors); + console.log(`Wrote ${queryVectors.length} gold-query vectors`); + console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.'); +} + +main().catch((err) => { + console.error(`demo:build failed: ${err instanceof Error ? err.message : err}`); + process.exitCode = 1; +}); diff --git a/demo/config.ts b/demo/config.ts new file mode 100644 index 0000000..29b6ffe --- /dev/null +++ b/demo/config.ts @@ -0,0 +1,52 @@ +// config.ts — points the engine at the int8 scaling-demo corpus. +// +// This is the same ArchiveConfig shape the core uses (src/types.ts), pointed at +// demo/corpus/ instead of example-content/. The demo reuses the core +// retrieval, the no-leak boundary, and the eval judges untouched; only the +// corpus, the gold set, and a thin int8 pass are new (see demo/README.md). +// +// Two authors share one colliding name on purpose: Adam Smith the economist +// (1723-1790) and George Adam Smith the theologian (1856-1942). Both write +// dense moral prose about justice and society, so their records sit close in +// embedding space; that proximity is what packs the near-ties int8 rounding can +// reorder. authorName names the collection rather than one person because the +// demo's whole subject is disambiguation; the gold queries name each Smith +// explicitly rather than relying on {{author}} substitution. +// +// On URLs: a record's citation URL is built by the reused corpus path +// (baseUrl + urlPrefix + slug), so it is a demo-canonical surface under the +// reserved .example TLD (RFC 2606), not a live page. The real public-domain +// sources live in demo/corpus/README.md's provenance table, per work. A +// private note's `about` is taken verbatim from frontmatter, so those route +// targets ARE real public George pages. See the delta log for this divergence +// from the spec's "records carry real public URLs" assumption and why it keeps +// src/corpus.ts untouched. 
+ +import type { ArchiveConfig } from '../src/types.js'; + +export const config: ArchiveConfig = { + archiveName: 'Smith Collection (int8 scaling demo)', + authorName: 'Adam Smith and George Adam Smith', + baseUrl: 'https://smith-collection.example', + contentRoot: './demo/corpus', + collections: [ + { dir: 'public/adam-smith', urlPrefix: '/adam-smith/', type: 'adam-smith' }, + { dir: 'public/george-adam-smith', urlPrefix: '/george/', type: 'george-adam-smith' }, + ], + // The private layer: George's minor works (sermons, addresses), searchable + // but never quotable. Designating published work "private" is a layer + // assignment enforced by the type, not a claim of secrecy (README §2). + privateNotesDir: './demo/corpus/private', + // Matches archive.config.ts. The int8 demo depends on this: the committed + // vectors must be text-embedding-3-large at native dimensionality or the + // homogeneity invariant (src/store.ts) rejects them. + embeddingModel: 'text-embedding-3-large', + answerModel: 'gpt-4o-mini', +}; + +// The quarantined synthetic spire (demo/corpus/synthetic/) is loaded as an +// ADDITIONAL private-notes dir only under --natural+synthetic, never here. Its +// location is one flag and each file also carries `synthetic: true` in +// frontmatter (a second flag the PrivateNote type ignores): nothing in +// demo/corpus/synthetic/ is real George text. See demo/run.ts and README §3. +export const SYNTHETIC_NOTES_DIR = './demo/corpus/synthetic'; diff --git a/demo/corpus/README.md b/demo/corpus/README.md new file mode 100644 index 0000000..8a1b779 --- /dev/null +++ b/demo/corpus/README.md @@ -0,0 +1,63 @@ +# The scaling-demo corpus + +This folder holds the corpus for the int8 scaling demo. This README is the corpus's **answerable half**: the mechanism makes the unauthored move inexpressible; this document owns, in the open, every authored choice behind the data. Each entry names the choice and the reason it was made. 
None of it is hidden, so none of it is a concession; it is the record of decisions a maintainer signs for. + +If you are reading this to attack the corpus, the choices you would reach for are below, named first. + +## What this corpus is + +A name-collision corpus over two real, public-domain authors who share a name: + +- **Adam Smith**, the economist and moral philosopher (1723-1790). +- **George Adam Smith**, the theologian and historical geographer (1856-1942). ("Adam" is a middle name; the partial-name match is deliberate, see the boost edge case in the gold suite.) + +Both write dense moral prose about justice, society, and ethics, so the two bodies of work sit close in embedding space. That proximity, not corpus size, is the point: it packs the near-ties where int8 rounding can reorder candidates, which is the only condition under which the demo tests anything. + +## Build status + +The text bodies and the embedding vectors are produced by `demo/build.ts`, which needs network access to the public-domain sources and an `OPENAI_API_KEY`. The code, the structure, the gold set, the provenance table below, and the deterministic harness tests are authored and committed; the real bodies and the committed `index.json` / `query-vectors.json` are populated by a build run with those two things. See `docs/scaling-demo/build-handoff.md` for the exact build steps. **Every ID and date below is a claim to verify against the live source during that run, not a confirmation made here.** + +## Provenance and public-domain status + +Every source, with the basis for its public-domain status. Public domain is the *absence* of copyright, not a license: this corpus is not "permissively licensed," it is public-domain. State the basis in both jurisdictions cleanly, since they rest on different facts: +- **US:** published before 1931, so public domain in the USA. (As of 1 Jan 2026, works published in 1930 and earlier are PD in the US.) 
+- **Life-plus-70 jurisdictions:** public domain once the author has been dead 70 years. In 2026 that covers authors who died in 1955 or earlier; George Adam Smith died 1942 and Adam Smith in 1790, so both are clear. + +Verify each ID and date against the source before relying on it; fill OCR-quality notes from the actual file. + +| Work (unit) | Author | Pub. | Layer | Source (ID) | PD basis | Notes | +|---|---|---|---|---|---|---| +| _Theory of Moral Sentiments_, §\ | Adam Smith | 1759 | public | Gutenberg \ | US: pre-1931 / PD in USA. Life+70: author d. 1790; term expired | _verify; fill: clean / OCR-noisy_ | +| _Wealth of Nations_, bk\ ch\ | Adam Smith | 1776 | public | Gutenberg \ | US: pre-1931 / PD in USA. Life+70: author d. 1790; term expired | _verify_ | +| _The Book of the Twelve Prophets_, \ | George Adam Smith | 1896-98 | public | Gutenberg 43847 / 50747 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | +| _The Book of Isaiah_, ch\ | George Adam Smith | 1888-90 | public | Gutenberg 39767 / 43672 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | +| _The Forgiveness of Sins, and Other Sermons_, \ | George Adam Smith | 1905 (A. C. Armstrong & Son) | **private** | Internet Archive `forgivenessofsin00smitrich` (ARK `ark:/13960/t0gt5jk4g`); HathiTrust full-view backup record 100136688 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify NOT_IN_COPYRIGHT; OCR-noisy expected, which is fine_ | +| \ | n/a (fabricated) | n/a | **synthetic** | authored here | n/a (no copyright in fabricated demo text) | quarantined in `synthetic/`; tests \ | + +**Sourcing (resolved, pending verification).** George's *major* commentaries are listed on Project Gutenberg. 
The private layer rests on *The Forgiveness of Sins, and other Sermons* (1905), a single volume yielding several short, windy sermon units, which is exactly what the private layer needs: short whole units that route without restating. *Jeremiah: Being the Baird Lecture for 1922* (1923) is a further minor source if wanted. The fallback (designating a *section* of a major work private) is therefore **not** required; if a future rebuild loses these sources, that fallback keeps the private layer real rather than padding it with synthetic. + +**OPEN: the one sourcing check that can block the build.** Confirm George's minor/windy material (the sermons) actually downloads as clean-enough public-domain text. If only the big commentaries are digitized, the private ledger is thin: use the fallback (a short *section* of a major work, designated private) rather than padding with synthetic, which would turn the spire into a column. Record the outcome here. + +## URLs: demo-canonical citations, real route targets + +A record's citation URL is constructed by the reused `src/corpus.ts` path (`baseUrl + urlPrefix + slug`) under the reserved `.example` TLD, so it is a stable demo surface rather than a live page; the real sources are the provenance table above. This keeps `src/corpus.ts` untouched (the budget rule) and is symmetric across both authors, so neither Smith reads as the decoy. A private note's `about` is taken verbatim from frontmatter, so the routing targets ARE real public George pages. The delta log records this as a divergence from the spec's "records carry real public URLs," with the reason (per-unit real URLs do not exist; Gutenberg is work-level). + +## The authored choices (named first, owned in the open) + +**1. The corpus is partly fabricated, and the claim does not depend on its realism.** The public layer (both Smiths) is real public-domain text the maintainer did not write. The private layer is real George minor works. 
The synthetic notes are a small, flagged set (below). The demo's claim is *relative*: int8 preserves the verdicts full-precision produces, and where it does not, the gate catches it. Realism is never asserted; the baseline runs on text the maintainer does not control. + +**2. "Private" is a layer assignment, not a claim of secrecy.** George was a public figure and all his work is published; designating some of it private means only that *the type cannot carry its text to the model*, regardless of what the text is. The whole repo works this way (the default example corpus is synthetic "Person A"). Everything here is exposed in the repo on purpose: seeing the full private text, then watching the type admit only its routing hint, is the demonstration, not a contradiction of it. **This is also why this demo can commit its embedding vectors when the main repo gitignores its index: these vectors derive from public-domain text, so they expose nothing that is not already public. Do not copy "commit your vectors" as a general pattern: embeddings of genuinely private text can be inverted to recover approximate content, which is the exposure the main repo's gitignored index avoids.** + +**3. No fabricated words are passed off as either real Smith's writing, and synthetic notes are flagged in the data.** Every fabricated note lives in the quarantined `synthetic/` directory, carries `synthetic: true`, and names the edge case it tests, so nothing can be mistaken for either Smith's actual writing even when lifted out of context. Real George material is handled as George's; synthetic is never confusable with it. + +**4. The corpus is not tuned so int8 passes.** Headline numbers come from the real-only (`--natural`) run. The demo deliberately *includes a failure*: a tightened encoding (int4, or a lowered floor) breaking a route case, caught by the gold suite. Shipping a caught failure is the opposite of tuning to pass; it is how the demo shows the gate can say no. + +**5. 
Themes are authored honestly, including where they collide.** Both Smiths carry shared themes (e.g. "justice"), so a verbatim theme match can hand the boost to the wrong Smith. That mis-fire is a near-tie the gold suite *exposes*, not one smoothed away by curation. Themes are not shaped to make disambiguation easy; doing so would special-case the corpus, which the eval forbids. + +**6. The public/private split is a research decision, stated as one.** Major, legible George works go to the public records layer; minor, windy works go to the private routing layer. This is an authored, answerable choice made to exercise both the disambiguation path (public) and the routing path (private) without confounding them, not a natural fact about the texts. The private layer is George-only, so the disambiguation problem (which Smith?) stays entirely in the public layer and never contaminates the boundary demonstration. + +**7. The synthetic layer is a spire, not a column.** It is deliberately small. A large synthetic layer would invert the honesty of the demo (headline numbers riding on authored text) and would mean a large body of fabricated words attributed to a real person, in a project about provenance and backing. If more near-ties are ever needed, the lever is a tighter floor and boosts (a calibration question, gold-gated), not more fabricated text. + +## Scope of this README + +This file documents the **data and the choices behind it** only. The mechanism (the type boundary, retrieval, the modes), the eval, and the int8 harness are documented where they live; this is not the place to restate them. Provenance and authored choices here; everything else by reference. 
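The two public-domain bases stated under "Provenance and public-domain status" reduce to two year comparisons, sketched below as a sanity check for new table rows. `pdBasisSketch` is a hypothetical helper, not part of the demo. The `asOfYear - 96` and `asOfYear - 71` offsets are an assumption generalized from the two 2026 cutoffs stated above (published 1930 or earlier; author died 1955 or earlier), and the sketch is no substitute for verifying each source.

```typescript
interface WorkDates {
  publishedYear: number;
  authorDeathYear: number;
}

interface PdBasis {
  usPublicDomain: boolean;
  lifePlus70PublicDomain: boolean;
}

/**
 * Hypothetical check of the two bases as of 1 January of `asOfYear`.
 * Assumption: the offsets reproduce the 2026 cutoffs given in this README
 * (US: published 1930 or earlier; life+70: author died 1955 or earlier).
 */
function pdBasisSketch(work: WorkDates, asOfYear: number): PdBasis {
  return {
    usPublicDomain: work.publishedYear <= asOfYear - 96,
    lifePlus70PublicDomain: work.authorDeathYear <= asOfYear - 71,
  };
}

// The private-layer source from the provenance table: published 1905, author d. 1942.
const forgivenessOfSins = pdBasisSketch({ publishedYear: 1905, authorDeathYear: 1942 }, 2026);
console.log(forgivenessOfSins.usPublicDomain, forgivenessOfSins.lifePlus70PublicDomain);
```

The two flags are kept separate because, as noted above, the jurisdictions rest on different facts: one on the publication date, the other on the author's death.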
diff --git a/demo/corpus/private/.gitkeep b/demo/corpus/private/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/demo/corpus/public/adam-smith/.gitkeep b/demo/corpus/public/adam-smith/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/demo/corpus/public/george-adam-smith/.gitkeep b/demo/corpus/public/george-adam-smith/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/demo/corpus/synthetic/syn-amos-justice-margin.md b/demo/corpus/synthetic/syn-amos-justice-margin.md new file mode 100644 index 0000000..343808f --- /dev/null +++ b/demo/corpus/synthetic/syn-amos-justice-margin.md @@ -0,0 +1,18 @@ +--- +title: "Synthetic demo note (fabricated, not George Adam Smith): Amos and the justice of God" +about: https://en.wikipedia.org/wiki/George_Adam_Smith +locator: "study marginalia, Amos" +synthetic: true +targets: "syn-route-margin / route case at the floor / related-material — a tightened encoding (int4 or a lowered floor) should flip this to the public Amos exposition, caught by the gold suite" +--- + +Fabricated for the int8 demo, not George Adam Smith's words (see the frontmatter). + +On Amos the herdsman of Tekoa: justice is not a ledger the strong keep against +the weak, but the weight of heaven set on the side of the wronged. The plumb-line +is held to the wall of the nation and the wall is found to lean. Where the courts +sell the righteous for silver and the needy for a pair of shoes, the worship that +continues above that wrong is itself the offence; the feast is noise until the +judgement runs down like waters. The mercy is in the warning: that the line is +shown at all, and shown in time, is the patience of God toward a people still +able to return. diff --git a/demo/gold.synthetic.yaml b/demo/gold.synthetic.yaml new file mode 100644 index 0000000..25ed388 --- /dev/null +++ b/demo/gold.synthetic.yaml @@ -0,0 +1,27 @@ +# Expanded gold for --natural+synthetic. 
Loaded ONLY alongside the quarantined +# synthetic spire (demo/corpus/synthetic/). Because the spire is fabricated, +# these cases never touch the headline (--natural) numbers; the runner reports +# the spire's effect on its own line, broken out, so a reader can tell whether +# int8 held because the encoding is sound or because notes were hand-placed. +# +# Skewed toward the route-flip the demo is built to catch, never an extra +# must-answer win. The deliberate failure is the payload: a caught failure is +# worth more than any clean pass. + +queries: + - id: syn-route-margin + query: What did George Adam Smith note privately about Amos and the justice of God? + expectAnswerMode: related-material + expectSources: [note:syn-amos-justice-margin] + forbidRecordCitations: true + forbidAnswerPatterns: ['https?://'] + note: > + THE DELIBERATE FAILURE. The synthetic note is tuned to sit at the floor, + just above the public Amos exposition (george-adam-smith:twelve-prophets-amos) + on the same theme. int8 must hold the route — the private note wins the top + slot and the answer routes without restating. A tightened encoding (run with + --bits 4) or a lowered floor flips the top slot to the public record, or + drops the note below the floor entirely; either way the gold suite catches + it. If it does NOT fire, the near-tie is too loose: tighten the margin in the + synthetic note (build-handoff §4), do NOT add corpus. This caught failure, + not any clean pass, is the result the demo rests on. diff --git a/demo/gold.yaml b/demo/gold.yaml new file mode 100644 index 0000000..53ec89d --- /dev/null +++ b/demo/gold.yaml @@ -0,0 +1,96 @@ +# The scaling-demo gold set (real-only, --natural). +# +# Same three-mode shape as the core gold set (eval/gold.yaml): questions the +# archive must answer, one it must route to a private note without restating, +# and one it must refuse. Tuned so the cases live where int8 rounding bites: +# near the score floor and near each other. 
A refuse case comfortably below the +# floor or a route case comfortably clear proves nothing about quantization; +# the marginal cases are the whole point. +# +# These run against demo/corpus/index.json (committed FP vectors), quantized +# in process. The harness (demo/run.ts) checks each case keylessly at the +# retrieval tier and, with --full and a key, the answer-mode tier too. Source +# ids are `${type}:${slug}` for records and `note:${slug}` for private notes, +# matching the corpus files in demo/corpus/ (see docs/scaling-demo/build-handoff.md). +# +# Queries name each Smith explicitly rather than using {{author}}, because the +# demo's whole subject is which Smith a question means. + +queries: + # ── Disambiguation: the economist ──────────────────────────────────────── + - id: econ-justice + query: What did Adam Smith argue about justice and beneficence? + expectAnswerMode: partial + expectSources: [adam-smith:theory-of-moral-sentiments-justice] + note: > + Must resolve to the economist. George's Amos and Micah expositions carry + the "justice" theme too, and his record titles contain "George Adam + Smith" so the query's "Adam Smith" phrase-matches them for the exact-match + boost. That mis-fire is the near-tie: int8 reordering is exactly what + could tip this toward the wrong Smith. + + - id: econ-labour + query: What did Adam Smith say about the division of labour? + expectAnswerMode: partial + expectSources: [adam-smith:wealth-of-nations-division-of-labour] + note: > + A cleaner economist hit — "division of labour" is unambiguously Wealth of + Nations. Anchors the disambiguation against a case with little collision. + + # ── Disambiguation: the theologian ─────────────────────────────────────── + - id: george-amos + query: What did George Adam Smith say about the prophet Amos and justice? + expectAnswerMode: partial + expectSources: [george-adam-smith:twelve-prophets-amos] + note: > + The parallel disambiguation, the other way. 
"justice" appears for both + Smiths, but "the prophet Amos" plus George's full name should carry this + to George. The symmetric twin of econ-justice. + + - id: george-isaiah + query: Where does George Adam Smith write about faith in the book of Isaiah? + expectAnswerMode: partial + expectSources: [george-adam-smith:isaiah-prophet-of-faith] + note: > + Theme-and-subject query carried mostly by semantic similarity; "Isaiah" + and "faith" are George's, with no economist competitor. + + # ── The partial-name boost edge ────────────────────────────────────────── + - id: boost-edge-micah + query: What did Adam Smith say about the prophet Micah? + expectAnswerMode: partial + expectSources: [george-adam-smith:twelve-prophets-micah] + note: > + The boost edge case. The query names "Adam Smith" (the economist) but asks + about Micah (George's subject). EXACT_MATCH_BOOST (0.30) fires on the + partial name match against George's title, AND Micah is George's alone, so + the intended answer is George's Micah exposition despite the economist's + name in the query. Pin the observed behavior here; int8 reordering near + this collision is precisely the kind of thing that could tip it. + + # ── Route: answered by the boundary ────────────────────────────────────── + - id: route-forgiveness + query: How did George Adam Smith preach on the forgiveness of sins? + expectAnswerMode: related-material + expectSources: [note:forgiveness-of-sins] + forbidRecordCitations: true + forbidAnswerPatterns: ['https?://'] + note: > + Only the private sermon bears on this. The note must win the top slot over + the public George records on adjacent themes, and the answer must route to + the page-and-locator WITHOUT restating what the sermon says — the mode is + related-material, never a paraphrase of the private text. This tests route + SELECTION (which note wins), not A2: int8 never touches the answer model's + confabulation residue. 
+ + # ── Refuse: nothing clears the floor ───────────────────────────────────── + - id: refuse-quantum + query: What did Adam Smith think about quantum computing? + expectAnswerMode: not-found + forbidSources: + [adam-smith:theory-of-moral-sentiments-justice, adam-smith:wealth-of-nations-division-of-labour, george-adam-smith:twelve-prophets-amos, george-adam-smith:isaiah-prophet-of-faith] + note: > + A subject no Smith addressed, three centuries out of reach. The score + floor must keep every record out of the evidence and the answer must be a + plain not-found. A refuse case is only worth having if it sits where a + lowered floor or a coarser encoding could let a weak hit cross. diff --git a/demo/harness.ts b/demo/harness.ts new file mode 100644 index 0000000..2de4c9d --- /dev/null +++ b/demo/harness.ts @@ -0,0 +1,190 @@ +// demo/harness.ts — the int8 gate, as pure logic the CLI drives. +// +// Reuses the core retrieval (src/retrieve.ts) and the gold judge +// (src/evaluate.ts) untouched: the int8 path is an encode/decode wrapper plus a +// re-rank, never a second pipeline. Given full-precision index entries and a +// quantization bit width, it builds the lossy index, re-ranks each gold query +// against it, and reports the two things the paper distinguishes: rank +// correlation against the full-precision ranking (necessary), and the gold +// suite's verdicts including refuse and route (sufficient). Rank correlation +// alone is a retrieval benchmark; the suite is the actual adjudicator. + +import { cosine, retrieve } from '../src/retrieve.js'; +import type { RetrievalResult } from '../src/retrieve.js'; +import { judgeRetrieval } from '../src/evaluate.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import type { IndexEntry } from '../src/types.js'; +import { requantizeVector } from './quantize.js'; + +/** The lossy index the demo re-ranks against: every vector round-tripped + * through `bits`-bit quantization, every other field untouched. 
The + * full-precision index stays the source of truth. */ +export function requantizeIndex(index: readonly IndexEntry[], bits: number): IndexEntry[] { + return index.map((e) => ({ ...e, vector: requantizeVector(e.vector, bits) })); +} + +/** The single highest-scoring source across both streams, or null if nothing + * cleared the floor. Route selection lives here: in related-material mode the + * winner must be the private note, or the answer would resolve to a record + * instead and the verdict has flipped. */ +export function topSource( + result: RetrievalResult, +): { id: string; kind: 'record' | 'note'; score: number } | null { + let best: { id: string; kind: 'record' | 'note'; score: number } | null = null; + for (const r of result.records) { + if (!best || r.score > best.score) best = { id: r.record.id, kind: 'record', score: r.score }; + } + for (const n of result.notes) { + if (!best || n.score > best.score) best = { id: n.note.id, kind: 'note', score: n.score }; + } + return best; +} + +function averageRanks(xs: readonly number[]): number[] { + const order = xs.map((x, i) => ({ x, i })).sort((a, b) => a.x - b.x); + const ranks = new Array(xs.length); + let i = 0; + while (i < order.length) { + let j = i; + while (j + 1 < order.length && order[j + 1]!.x === order[i]!.x) j += 1; + const avg = (i + j) / 2 + 1; // 1-based average rank across the tie block i..j + for (let k = i; k <= j; k += 1) ranks[order[k]!.i] = avg; + i = j + 1; + } + return ranks; +} + +/** Spearman's rho: Pearson correlation of the rank vectors, with average ranks + * for ties. Returns 1 for degenerate inputs (length < 2 or all-tied), which is + * the harmless reading — no reordering to detect. 
*/ +export function spearmanRho(a: readonly number[], b: readonly number[]): number { + if (a.length !== b.length) throw new Error('spearmanRho: length mismatch'); + const n = a.length; + if (n < 2) return 1; + const ra = averageRanks(a); + const rb = averageRanks(b); + let ma = 0; + let mb = 0; + for (let i = 0; i < n; i += 1) { + ma += ra[i]!; + mb += rb[i]!; + } + ma /= n; + mb /= n; + let num = 0; + let da = 0; + let db = 0; + for (let i = 0; i < n; i += 1) { + const x = ra[i]! - ma; + const y = rb[i]! - mb; + num += x * y; + da += x * x; + db += y * y; + } + if (da === 0 || db === 0) return 1; + return num / Math.sqrt(da * db); +} + +/** Rank correlation between the full-precision and quantized cosine orderings + * for one query, over the whole index. The boosts (src/retrieve.ts) are + * identical in both rankings, so the only thing that can reorder is the vector + * part: cosine. That is what this measures. */ +export function rankCorrelation( + index: readonly IndexEntry[], + quantIndex: readonly IndexEntry[], + queryVector: readonly number[], +): number { + const fp = index.map((e) => cosine(queryVector, e.vector)); + const q = quantIndex.map((e) => cosine(queryVector, e.vector)); + return spearmanRho(fp, q); +} + +export interface QueryGateResult { + id: string; + /** Rank correlation FP vs quantized for this query. */ + rho: number; + /** judgeRetrieval on the quantized index: expected sources in, forbidden out. */ + retrievalPass: boolean; + retrievalIssues: string[]; + /** For any case that names an expected source and is not a refusal: did that + * source win the top slot on the quantized index? This is what protects the + * *verdict*, not just presence. judgeRetrieval only checks top-K membership, + * so a quantization flip that keeps both Smiths retrieved but swaps which one + * ranks first would pass it silently. 
The top-slot check catches that: the + * expected record must OUTRANK the competing Smith (disambiguation), and the + * private note must win over the public records (route). */ + topSlot?: { expected: string; winner: string | null; won: boolean }; + /** retrievalPass AND (topSlot ? topSlot.won : true). */ + pass: boolean; +} + +/** Re-rank one gold query against the quantized index and judge it. */ +export function evaluateQuery( + gold: GoldQuery, + index: readonly IndexEntry[], + quantIndex: readonly IndexEntry[], + queryVector: readonly number[], +): QueryGateResult { + const hits = retrieve(queryVector, gold.query, quantIndex); + const judged = judgeRetrieval(gold, hits); + const rho = rankCorrelation(index, quantIndex, queryVector); + + // Any non-refusal case with a named expected source must see that source win + // the top slot, not merely appear. Refusals (not-found) carry no expected + // source; the floor and forbidSources adjudicate them via judgeRetrieval. + let topSlot: QueryGateResult['topSlot']; + if (gold.expectAnswerMode !== 'not-found' && gold.expectSources && gold.expectSources[0]) { + const expected = gold.expectSources[0]; + const winner = topSource(hits); + topSlot = { expected, winner: winner?.id ?? null, won: winner?.id === expected }; + } + + const pass = judged.pass && (topSlot ? topSlot.won : true); + return { + id: gold.id, + rho, + retrievalPass: judged.pass, + retrievalIssues: judged.issues, + ...(topSlot ? { topSlot } : {}), + pass, + }; +} + +export interface GateReport { + bits: number; + total: number; + passed: number; + failed: number; + meanRho: number; + minRho: number; + results: QueryGateResult[]; +} + +/** Run the whole gold suite against the index at `bits` precision. 
*/ +export function runGate( + gold: readonly GoldQuery[], + index: readonly IndexEntry[], + queryVectorById: ReadonlyMap<string, readonly number[]>, + bits: number, +): GateReport { + const quantIndex = requantizeIndex(index, bits); + const results: QueryGateResult[] = []; + for (const g of gold) { + const qv = queryVectorById.get(g.id); + if (!qv) throw new Error(`no query vector for gold id '${g.id}' (rebuild demo:build?)`); + results.push(evaluateQuery(g, index, quantIndex, qv)); + } + const passed = results.filter((r) => r.pass).length; + const rhos = results.map((r) => r.rho); + const meanRho = rhos.length ? rhos.reduce((s, x) => s + x, 0) / rhos.length : 1; + const minRho = rhos.length ? Math.min(...rhos) : 1; + return { + bits, + total: results.length, + passed, + failed: results.length - passed, + meanRho, + minRho, + results, + }; +} diff --git a/demo/quantize.test.ts b/demo/quantize.test.ts new file mode 100644 index 0000000..d5eaa6d --- /dev/null +++ b/demo/quantize.test.ts @@ -0,0 +1,198 @@ +// Offline, deterministic tests for the int8 demo's mechanism. No corpus, no +// key: the quantizer and the gate are exercised on fixture vectors, so the +// whole int8 path — including the int4 route flip the demo is built to catch — +// is provable here. The real corpus instantiates this same mechanism; these +// tests prove the mechanism itself. + +import assert from 'node:assert/strict'; +import { test } from 'node:test'; + +import { cosine } from '../src/retrieve.js'; +import type { ArchiveRecord, IndexEntry, PrivateNote } from '../src/types.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import { dequantize, levelFor, quantize, requantizeVector } from './quantize.js'; +import { evaluateQuery, rankCorrelation, requantizeIndex, runGate, spearmanRho, topSource } from './harness.js'; + +// A near-tie found by deterministic search (scaling: seed 421, 24-dim): the +// query Q ranks note VN just above record VR at full precision; int8 preserves +// that order, int4 reorders it.
This is the route flip in miniature. +const Q = [-0.201545, -0.070296, -0.836567, -0.496486, 0.932744, -0.183835, 0.620633, -0.319135, 0.353699, 0.535227, 0.630447, -0.913022, 0.74482, 0.20067, -0.735437, 0.48168, -0.628687, 0.422013, -0.824056, 0.95873, -0.055049, -0.014708, 0.136552, -0.126328]; +const VN = [-0.209326, -0.367113, -0.781625, -0.22665, 0.421356, -0.779461, 0.686374, -0.431379, 0.807734, 0.556436, 0.078187, -1.104108, 0.064971, -0.250693, -0.829483, -0.06284, -0.225568, 0.419642, -0.941748, 0.05885, -0.260352, 0.396049, -0.299235, 0.33248]; +const VR = [0.153577, 0.081729, -1.05474, -0.793276, 0.049555, -0.0844, 0.769011, 0.098334, 0.570278, -0.166597, 0.599978, -1.115543, 0.517046, -0.496545, 0.207507, 0.785012, -0.899066, 0.109867, -0.881006, 0.360131, 0.467909, 0.04772, 0.550953, 0.232781]; + +function makeRecord(id: string, extra: Partial<ArchiveRecord> = {}): ArchiveRecord { + return { + id, + type: 'work', + slug: id.split(':')[1] ?? id, + title: extra.title ?? id, + url: `https://smith-collection.example/${id}/`, + summary: extra.summary ?? '', + body: extra.body ?? '', + themes: extra.themes ?? [], + }; +} + +function makeNote(id: string): PrivateNote { + return { id, label: id, url: 'https://en.wikipedia.org/wiki/George_Adam_Smith', locator: 'sermon', text: 'private' }; +} + +function recordEntry(id: string, vector: number[], extra: Partial<ArchiveRecord> = {}): IndexEntry { + return { model: 'text-embedding-3-large', dimensions: vector.length, vector, contentHash: 'h', sourceType: 'record', record: makeRecord(id, extra) }; +} + +function noteEntry(id: string, vector: number[]): IndexEntry { + return { model: 'text-embedding-3-large', dimensions: vector.length, vector, contentHash: 'h', sourceType: 'note', note: makeNote(id) }; +} + +// Two fillers near-orthogonal to Q, so they stay below the floor in every +// precision and never enter the route contest. +const filler1 = Array.from({ length: 24 }, (_, i) => (i === 1 ?
1 : 0)); +const filler2 = Array.from({ length: 24 }, (_, i) => (i === 21 ? 1 : 0)); + +test('quantize: level widths and rejection of bad bit counts', () => { + assert.equal(levelFor(8), 127); + assert.equal(levelFor(4), 7); + assert.throws(() => levelFor(1)); + assert.throws(() => levelFor(9)); + assert.throws(() => levelFor(3.5)); +}); + +test('quantize: round-trips within the per-vector scale, zero vector is safe', () => { + const v = [0.5, -0.25, 0.9, -0.9, 0.1]; + const q = quantize(v, 8); + const back = dequantize(q); + for (let i = 0; i < v.length; i += 1) { + assert.ok(Math.abs(back[i]! - v[i]!) <= q.scale, `component ${i} within one scale step`); + } + // scale derives from the max magnitude (0.9), one signed byte (127 levels). + assert.ok(Math.abs(q.scale - 0.9 / 127) < 1e-9); + + const zero = quantize([0, 0, 0], 8); + assert.equal(zero.scale, 1); + assert.deepEqual([...zero.codes], [0, 0, 0]); +}); + +test('quantize: int4 is coarser than int8 (larger reconstruction error)', () => { + const v = Q; + const err = (bits: number) => v.reduce((s, x, i) => s + Math.abs(requantizeVector(v, bits)[i]! - x), 0); + assert.ok(err(4) > err(8), 'int4 reconstruction error exceeds int8'); +}); + +test('quantize: per-vector scale cancels under cosine (exact, by algebra)', () => { + // Scaling a vector by any positive constant leaves cosine unchanged, which is + // why the per-vector scale need not be restored to rank. The demo leans on this. 
+ const scaled = VN.map((x) => x * 7.5); + assert.ok(Math.abs(cosine(Q, VN) - cosine(Q, scaled)) < 1e-12); +}); + +test('harness: spearmanRho on known orderings, with ties', () => { + assert.equal(spearmanRho([1, 2, 3, 4], [10, 20, 30, 40]), 1); + assert.equal(spearmanRho([1, 2, 3, 4], [40, 30, 20, 10]), -1); + assert.ok(Math.abs(spearmanRho([1, 2, 2, 3], [1, 2, 2, 3]) - 1) < 1e-12); // ties -> average ranks + assert.equal(spearmanRho([5], [9]), 1); // degenerate length < 2 +}); + +test('harness: requantizeIndex keeps every field but the vector', () => { + const index = [recordEntry('work:a', VR), noteEntry('note:b', VN)]; + const q = requantizeIndex(index, 8); + assert.equal(q.length, 2); + assert.equal(q[0]!.sourceType, 'record'); + assert.equal(q[0]!.dimensions, 24); + assert.notDeepEqual(q[0]!.vector, index[0]!.vector); // lossy + assert.equal(q[0]!.contentHash, index[0]!.contentHash); // untouched +}); + +test('harness: topSource picks the highest score across both streams', () => { + const result = { + records: [{ record: makeRecord('work:r'), score: 0.71, semantic: 0.71 }], + notes: [{ note: makeNote('note:n'), score: 0.73, semantic: 0.73 }], + }; + assert.equal(topSource(result)?.id, 'note:n'); + assert.equal(topSource(result)?.kind, 'note'); + assert.equal(topSource({ records: [], notes: [] }), null); +}); + +test('harness: int8 preserves the FP ranking better than int4 (rank correlation)', () => { + const index = [noteEntry('note:n', VN), recordEntry('work:r', VR), recordEntry('work:f1', filler1), recordEntry('work:f2', filler2)]; + const rho8 = rankCorrelation(index, requantizeIndex(index, 8), Q); + const rho4 = rankCorrelation(index, requantizeIndex(index, 4), Q); + assert.ok(rho8 >= rho4, `int8 rho (${rho8}) >= int4 rho (${rho4})`); + assert.ok(rho8 >= rho4 && rho8 > 0.9, 'int8 holds the ordering tightly'); +}); + +test('the payload: the gate certifies int8 and rejects int4 on the route case', () => { + // The note (VN) must win the top slot; that is 
the route. A query with no + // title/theme overlap, so the contest is pure cosine, not boosts. + const index: IndexEntry[] = [ + noteEntry('note:syn-amos-justice-margin', VN), + recordEntry('george-adam-smith:twelve-prophets-amos', VR, { title: 'unrelated phrasing' }), + recordEntry('work:f1', filler1), + ]; + const gold: GoldQuery = { + id: 'route-margin', + query: 'zzz qqq no token overlap with any title or theme', + expectAnswerMode: 'related-material', + expectSources: ['note:syn-amos-justice-margin'], + }; + const qById = new Map([[gold.id, Q]]); + + // int8: the note wins the top slot, the gate passes. + const int8 = runGate([gold], index, qById, 8); + assert.equal(int8.passed, 1, 'int8 certifies the route'); + assert.equal(int8.results[0]!.topSlot?.won, true); + assert.ok(int8.results[0]!.rho >= 0.9); + + // int4: the record overtakes the note for the top slot. The note is still + // retrieved (so judgeRetrieval alone would miss it), but the route flipped, + // and the gate catches it. + const int4 = runGate([gold], index, qById, 4); + assert.equal(int4.failed, 1, 'int4 is rejected'); + const r = int4.results[0]!; + assert.equal(r.retrievalPass, true, 'the note is still in the candidate set'); + assert.equal(r.topSlot?.won, false, 'but it lost the top slot'); + assert.equal(r.topSlot?.winner, 'george-adam-smith:twelve-prophets-amos'); +}); + +test('disambiguation: the keyless gate catches a partial-mode flip (right Smith vs wrong Smith)', () => { + // Same near-tie geometry, but both candidates are public records: the right + // Smith (VN) outranks the wrong Smith (VR) at full precision and int8, and int4 + // swaps them. A partial case is presence-checked by judgeRetrieval, so without + // the top-slot check the flipped disambiguation verdict would pass keyless. 
+ const index: IndexEntry[] = [ + recordEntry('adam-smith:theory-of-moral-sentiments-justice', VN, { title: 'unrelated phrasing' }), + recordEntry('george-adam-smith:twelve-prophets-amos', VR, { title: 'unrelated phrasing' }), + ]; + const gold: GoldQuery = { + id: 'econ-justice', + query: 'zzz qqq no token overlap with any title or theme', + expectAnswerMode: 'partial', + expectSources: ['adam-smith:theory-of-moral-sentiments-justice'], + }; + const qById = new Map([[gold.id, Q]]); + + const int8 = runGate([gold], index, qById, 8); + assert.equal(int8.passed, 1, 'int8 keeps the right Smith on top'); + assert.equal(int8.results[0]!.topSlot?.won, true); + + const int4 = runGate([gold], index, qById, 4); + assert.equal(int4.failed, 1, 'int4 flips to the wrong Smith and the gate catches it'); + const r = int4.results[0]!; + assert.equal(r.retrievalPass, true, 'both Smiths still retrieved (presence alone passes)'); + assert.equal(r.topSlot?.won, false, 'but the wrong Smith won the top slot'); + assert.equal(r.topSlot?.winner, 'george-adam-smith:twelve-prophets-amos'); +}); + +test('the payload, directly: cosine ordering flips between int8 and int4', () => { + const c = (v: number[], bits: number) => cosine(Q, requantizeVector(v, bits)); + assert.ok(cosine(Q, VN) > cosine(Q, VR), 'FP: note outranks record'); + assert.ok(c(VN, 8) > c(VR, 8), 'int8: note still outranks record'); + assert.ok(c(VR, 4) > c(VN, 4), 'int4: record overtakes the note (the flip)'); +}); + +test('evaluateQuery: a refuse case with nothing above the floor stays not-found', () => { + const index = [recordEntry('work:f1', filler1), recordEntry('work:f2', filler2)]; + const gold: GoldQuery = { id: 'refuse', query: 'zzz qqq', expectAnswerMode: 'not-found', forbidSources: ['work:f1', 'work:f2'] }; + const res = evaluateQuery(gold, index, requantizeIndex(index, 8), Q); + assert.equal(res.pass, true, 'fillers stay below the floor, so nothing is forbidden-surfaced'); +}); diff --git a/demo/quantize.ts 
b/demo/quantize.ts new file mode 100644 index 0000000..a3dca71 --- /dev/null +++ b/demo/quantize.ts @@ -0,0 +1,70 @@ +// demo/quantize.ts — scalar quantization for the int8 demo. +// +// The public, runnable twin of the production site adapter's vector-quant.ts +// (named in docs/production-scaling.md §2; that adapter is not a public repo). +// Same scheme: per-vector symmetric scalar quantization. The full-precision +// vectors stay the source of truth (demo/corpus/index.json); the demo +// quantizes them in process, re-ranks, and lets the gold suite judge the result. +// +// Why it is admissible, in two parts of different kinds (the paper's §6 split): +// cosine (src/retrieve.ts) recomputes norms per call, so a positive per-vector +// scale cancels from the score entirely; the ranking is invariant to it as a +// matter of algebra (exact). Integer rounding perturbs direction and can +// reorder near-ties, so its harmlessness is not proven but measured against the +// gold suite. int8 is expected to hold on the real corpus (settled by the build +// run, not asserted here); int4 is the scalpel that makes the gate say no. + +export interface QuantizedVector { + /** Signed integer codes, one per dimension, each in [-level, level]. */ + codes: Int8Array; + /** Dequantization scale: vector[i] ≈ codes[i] * scale. */ + scale: number; +} + +/** The signed range for a bit width: int8 -> 127, int4 -> 7. One function + * serves both, so the headline (int8) and the deliberate failure (int4) run + * the identical path at different precisions. */ +export function levelFor(bits: number): number { + if (!Number.isInteger(bits) || bits < 2 || bits > 8) { + throw new Error(`quantize: unsupported bit width ${bits} (expected 2..8)`); + } + return (1 << (bits - 1)) - 1; // 2^(bits-1) - 1 +} + +/** Per-vector symmetric quantization to `bits` signed bits. scale carries the + * per-vector step size (max magnitude / level) so the reader can rebuild the approximate float.
An + * all-zero vector (no signal) quantizes to all-zero with scale 1; it never + * divides by zero. */ +export function quantize(vector: readonly number[], bits = 8): QuantizedVector { + const level = levelFor(bits); + const n = vector.length; + const codes = new Int8Array(n); + let max = 0; + for (let i = 0; i < n; i += 1) { + const a = Math.abs(vector[i]!); + if (a > max) max = a; + } + if (max === 0) return { codes, scale: 1 }; + const inv = level / max; + for (let i = 0; i < n; i += 1) { + let q = Math.round(vector[i]! * inv); + if (q > level) q = level; + else if (q < -level) q = -level; + codes[i] = q; + } + return { codes, scale: max / level }; +} + +/** Reconstruct the approximate float vector from codes + scale. */ +export function dequantize(q: QuantizedVector): number[] { + const { codes, scale } = q; + const out = new Array(codes.length); + for (let i = 0; i < codes.length; i += 1) out[i] = codes[i]! * scale; + return out; +} + +/** Round-trip a vector through `bits`-bit quantization: the lossy vector the + * demo re-ranks against. quantize then dequantize, nothing else. */ +export function requantizeVector(vector: readonly number[], bits = 8): number[] { + return dequantize(quantize(vector, bits)); +} diff --git a/demo/query-vectors.test.ts b/demo/query-vectors.test.ts new file mode 100644 index 0000000..1f6fdc6 --- /dev/null +++ b/demo/query-vectors.test.ts @@ -0,0 +1,47 @@ +// Offline tests for the committed gold-query vector store: a clean round-trip, +// a missing file reading as "not built yet" (null), and malformed artifacts +// failing loudly at read with the rebuild hint rather than later as bad cosine. 
+ +import assert from 'node:assert/strict'; +import { mkdtempSync, writeFileSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { test } from 'node:test'; + +import { QUERY_VECTORS_VERSION, readQueryVectors, writeQueryVectors } from './query-vectors.js'; + +const tmp = mkdtempSync(join(tmpdir(), 'scaling-qv-')); + +test('query-vectors: write/read round-trips with model and dimensions', () => { + const path = join(tmp, 'ok.json'); + writeQueryVectors('text-embedding-3-large', 3, [{ id: 'a', vector: [0.1, 0.2, 0.3] }], path); + const loaded = readQueryVectors(path); + assert.ok(loaded); + assert.equal(loaded.model, 'text-embedding-3-large'); + assert.equal(loaded.dimensions, 3); + assert.deepEqual(loaded.byId.get('a'), [0.1, 0.2, 0.3]); +}); + +test('query-vectors: a missing file reads as null (not built yet), not an error', () => { + assert.equal(readQueryVectors(join(tmp, 'absent.json')), null); +}); + +test('query-vectors: malformed entries fail loudly at read', () => { + const wrongDims = join(tmp, 'dims.json'); + writeFileSync( + wrongDims, + JSON.stringify({ version: QUERY_VECTORS_VERSION, model: 'm', dimensions: 3, queries: [{ id: 'a', vector: [0.1, 0.2] }] }), + ); + assert.throws(() => readQueryVectors(wrongDims), /malformed entry for 'a'/); + + const nonNumeric = join(tmp, 'nan.json'); + writeFileSync( + nonNumeric, + JSON.stringify({ version: QUERY_VECTORS_VERSION, model: 'm', dimensions: 2, queries: [{ id: 'b', vector: [0.1, 'x'] }] }), + ); + assert.throws(() => readQueryVectors(nonNumeric), /malformed entry for 'b'/); + + const badVersion = join(tmp, 'ver.json'); + writeFileSync(badVersion, JSON.stringify({ version: 999, model: 'm', dimensions: 2, queries: [] })); + assert.throws(() => readQueryVectors(badVersion), /schema version/); +}); diff --git a/demo/query-vectors.ts b/demo/query-vectors.ts new file mode 100644 index 0000000..75679d7 --- /dev/null +++ b/demo/query-vectors.ts @@ -0,0 +1,82 @@ +// 
demo/query-vectors.ts — the committed gold-query embeddings. +// +// The core eval CLI (src/cli/eval.ts) embeds every gold query at run time, so +// it always needs a key. The demo's headline must reproduce WITHOUT one, so the +// gold-query vectors are precomputed by demo:build and committed here beside +// the index. The runner reads them instead of calling the embedding API; a key +// is only ever needed to regenerate them or to run the --full answer pass. +// +// Same homogeneity discipline as the index (src/store.ts): a query embedded in +// a different model or width than the index is a meaningless cosine, so the +// file carries its (model, dimensions) and the runner checks them. + +import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs'; +import { dirname, resolve } from 'node:path'; + +export const QUERY_VECTORS_PATH = resolve('demo/corpus/query-vectors.json'); +export const QUERY_VECTORS_VERSION = 1; + +export interface QueryVectorsFile { + version: number; + model: string; + dimensions: number; + queries: { id: string; vector: number[] }[]; +} + +export interface LoadedQueryVectors { + model: string; + dimensions: number; + byId: Map<string, number[]>; +} + +const REBUILD = 'Run `npm run demo:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).'; + +/** Read the committed query vectors, or null if not built yet. Throws on a + * present-but-malformed file so a corrupt artifact fails loudly with a remedy. */ +export function readQueryVectors(path: string = QUERY_VECTORS_PATH): LoadedQueryVectors | null { + if (!existsSync(path)) return null; + let parsed: unknown; + try { + parsed = JSON.parse(readFileSync(path, 'utf8')); + } catch { + throw new Error(`query vectors at ${path} are not valid JSON.
${REBUILD}`); + } + const file = parsed as Partial<QueryVectorsFile>; + if ( + typeof parsed !== 'object' || + parsed === null || + file.version !== QUERY_VECTORS_VERSION || + typeof file.model !== 'string' || + typeof file.dimensions !== 'number' || + !Array.isArray(file.queries) + ) { + throw new Error(`query vectors at ${path} are not schema version ${QUERY_VECTORS_VERSION}. ${REBUILD}`); + } + const byId = new Map<string, number[]>(); + for (const q of file.queries) { + // Validate to the same depth the store does for the index: a corrupt vector + // must fail loudly at read with the rebuild hint, not later as bad cosine. + if ( + typeof q?.id !== 'string' || + !Array.isArray(q.vector) || + q.vector.length !== file.dimensions || + !q.vector.every((x) => typeof x === 'number' && Number.isFinite(x)) + ) { + const which = typeof q?.id === 'string' ? ` for '${q.id}'` : ''; + throw new Error(`query vectors at ${path} have a malformed entry${which}. ${REBUILD}`); + } + byId.set(q.id, q.vector); + } + return { model: file.model, dimensions: file.dimensions, byId }; +} + +export function writeQueryVectors( + model: string, + dimensions: number, + queries: { id: string; vector: number[] }[], + path: string = QUERY_VECTORS_PATH, +): void { + mkdirSync(dirname(path), { recursive: true }); + const file: QueryVectorsFile = { version: QUERY_VECTORS_VERSION, model, dimensions, queries }; + writeFileSync(path, `${JSON.stringify(file)}\n`, 'utf8'); +} diff --git a/demo/run.ts b/demo/run.ts new file mode 100644 index 0000000..a59d0ae --- /dev/null +++ b/demo/run.ts @@ -0,0 +1,241 @@ +// npm run demo:run — quantize the committed index in process, re-rank, and +// run the full gold suite against the quantized index. +// +// --natural (default) real corpus only; owns the headline numbers. +// --natural+synthetic adds the quarantined synthetic spire + its gold. +// --bits quantization width (default 8; 4 is the int4 scalpel). +// --full also run the answer-mode pass (needs OPENAI_API_KEY).
+// +// The headline run is keyless: it reads committed FP vectors and committed +// gold-query vectors, quantizes in process, and judges with the reused gold +// logic. --full adds the answer model, which is the only part that needs a key. +// See demo/README.md and docs/scaling-demo/build-handoff.md. + +import { resolve } from 'node:path'; + +import { loadGold } from '../src/evaluate.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import { assertHomogeneousIndex, readIndexFile } from '../src/store.js'; +import type { IndexEntry } from '../src/types.js'; +import { requantizeIndex, runGate } from './harness.js'; +import { readQueryVectors } from './query-vectors.js'; + +const NATURAL_INDEX = resolve('demo/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('demo/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('demo/gold.yaml'); +const SYNTHETIC_GOLD = resolve('demo/gold.synthetic.yaml'); + +interface RunArgs { + synthetic: boolean; + bits: number; + full: boolean; +} + +function parseArgs(argv: string[]): RunArgs { + const args: RunArgs = { synthetic: false, bits: 8, full: false }; + for (let i = 0; i < argv.length; i += 1) { + const arg = argv[i]; + switch (arg) { + case '--natural': + args.synthetic = false; + break; + case '--natural+synthetic': + case '--synthetic': + args.synthetic = true; + break; + case '--full': + args.full = true; + break; + case '--bits': { + const value = argv[++i]; + if (!value) throw new Error('--bits requires a number (e.g. 
8 or 4)'); + args.bits = Number(value); + if (!Number.isInteger(args.bits)) throw new Error(`--bits must be an integer, got '${value}'`); + break; + } + case '--help': + case '-h': + console.log( + 'demo:run [--natural | --natural+synthetic] [--bits ] [--full]\n' + + ' --natural real corpus only (default); owns the headline numbers\n' + + ' --natural+synthetic add the quarantined synthetic spire + its gold\n' + + ' --bits quantization width (default 8; 4 is the int4 scalpel)\n' + + ' --full also run the answer-mode pass (needs OPENAI_API_KEY)', + ); + process.exit(0); + break; + default: + throw new Error(`unknown argument '${arg ?? ''}'`); + } + } + return args; +} + +function loadIndex(synthetic: boolean): IndexEntry[] { + const natural = readIndexFile(NATURAL_INDEX); + if (natural.length === 0) { + throw new Error( + `no committed vectors at ${NATURAL_INDEX}. ` + + 'Run `npm run demo:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).', + ); + } + if (!synthetic) { + assertHomogeneousIndex(natural); + return natural; + } + const spire = readIndexFile(SYNTHETIC_INDEX); + if (spire.length === 0) { + throw new Error( + `--natural+synthetic needs the spire at ${SYNTHETIC_INDEX}, which is not built yet ` + + '(author the synthetic notes, then `npm run demo:build`).', + ); + } + const union = [...natural, ...spire]; + // The spire is strictly baseline-plus-delta: same model, same dimensionality. 
+ assertHomogeneousIndex(union); + return union; +} + +function loadGoldSet(synthetic: boolean, author: string): GoldQuery[] { + const gold = loadGold(NATURAL_GOLD, author); + if (!synthetic) return gold; + const expanded = loadGold(SYNTHETIC_GOLD, author); + return [...gold, ...expanded]; +} + +async function main(): Promise<void> { + const args = parseArgs(process.argv.slice(2)); + const { config } = await import('./config.js'); + + const index = loadIndex(args.synthetic); + const gold = loadGoldSet(args.synthetic, config.authorName); + + const qv = readQueryVectors(); + if (!qv) { + throw new Error( + 'no committed query vectors. Run `npm run demo:build` with an OPENAI_API_KEY ' + + '(see docs/scaling-demo/build-handoff.md).', + ); + } + const spec = index[0]!; + if (qv.model !== spec.model || qv.dimensions !== spec.dimensions) { + throw new Error( + `query vectors (${qv.model}/${qv.dimensions}) do not match the index ` + + `(${spec.model}/${spec.dimensions}); rebuild both with demo:build.`, + ); + } + + // Say plainly what this run IS, so a reader knows what they are looking at. + const label = args.synthetic ? '--natural+synthetic' : '--natural'; + const shipped = args.bits === 8; + console.log('demo:run — int8 quantization gate (Smith collection)'); + console.log( + ` encoding: int${args.bits} ` + + (shipped + ? '(the shipped wire format; expected to HOLD the suite)' + : '(tightened below int8; a near-tie may flip and be REJECTED)'), + ); + console.log( + ` corpus: ${label} ` + + (args.synthetic + ? '(real corpus + the fabricated spire; headline still comes from --natural)' + : '(real corpus only; owns the headline numbers)'), + ); + console.log(` ${gold.length} gold queries, ${index.length} index entries, keyless (committed vectors)\n`); + + const report = runGate(gold, index, qv.byId, args.bits); + + for (const r of report.results) { + const status = r.pass ? 'ok ' : 'FAIL'; + const slot = r.topSlot ? ` top:${r.topSlot.won ? 'won' : `LOST->${r.topSlot.winner ?? 
'none'}`}` : ''; + console.log(` ${status} ${r.id.padEnd(18)} rho=${r.rho.toFixed(4)}${slot}`); + if (!r.pass) { + for (const issue of r.retrievalIssues) console.log(` - ${issue}`); + if (r.topSlot && !r.topSlot.won) { + console.log(` - top slot flipped: expected ${r.topSlot.expected} to win, ${r.topSlot.winner ?? 'nothing'} did`); + } + } + } + + console.log( + `\nint${args.bits}: ${report.passed}/${report.total} gold verdicts held; ` + + `rank correlation mean ${report.meanRho.toFixed(4)}, min ${report.minRho.toFixed(4)}`, + ); + if (report.failed === 0) { + console.log(` VERDICT: the gold suite CERTIFIED int${args.bits} — every verdict full precision produces held.`); + } else { + const flips = report.results.filter((r) => r.topSlot && !r.topSlot.won).length; + console.log( + ` VERDICT: the gold suite REJECTED int${args.bits} — ${report.failed} verdict(s) did not hold` + + (flips ? `, including ${flips} top-slot flip(s)` : '') + + '.', + ); + console.log(' The same suite that owns grounding and refusal caught it; that caught failure is the payload.'); + } + + if (args.full) { + // The answer pass must see evidence selected from the SAME quantized index + // the retrieval gate judged, or a route flip on the lossy index would be + // masked by full-precision retrieval. Quantize once, here, and hand it down. + await runAnswerPass(gold, requantizeIndex(index, args.bits), qv.byId, config); + } else { + console.log('(retrieval + route tier only; add --full to run the answer-mode pass with a key)'); + } + + if (report.failed > 0) process.exitCode = 1; +} + +/** The keyed bonus: run the answer model on evidence retrieved from the + * QUANTIZED index, and check the declared mode. Same lossy surface the + * retrieval gate judged, so a route flip is not masked by full-precision + * retrieval. Exercises route SELECTION through the reused no-leak boundary; it + * does not touch A2 (the answer model's confabulation residue), which the + * encoding never moves. 
*/ +async function runAnswerPass( + gold: readonly GoldQuery[], + quantIndex: readonly IndexEntry[], + queryVectorById: ReadonlyMap, + config: import('../src/types.js').ArchiveConfig, +): Promise<void> { + if (!process.env.OPENAI_API_KEY) { + throw new Error('--full runs the answer model, which needs OPENAI_API_KEY.'); + } + const [{ default: OpenAI }, { retrieve }, { assembleEvidence }, { answerQuestion }, { judgeAnswer }] = + await Promise.all([ + import('openai'), + import('../src/retrieve.js'), + import('../src/no-leak.js'), + import('../src/answer.js'), + import('../src/evaluate.js'), + ]); + const client = new OpenAI(); + console.log('\n--full answer-mode pass (keyed, on the quantized index):'); + let answerFails = 0; + for (const g of gold) { + const qv = queryVectorById.get(g.id); + if (!qv) continue; + const hits = retrieve(qv, g.query, quantIndex); + const evidence = assembleEvidence( + hits.records.map((h) => h.record), + hits.notes.map((h) => h.note), + ); + try { + const answer = await answerQuestion(client, g.query, evidence, config); + const judged = judgeAnswer(g, answer); + console.log(` ${judged.pass ? 'ok ' : 'FAIL'} ${g.id.padEnd(18)} mode=${answer.mode}`); + if (!judged.pass) { + answerFails += 1; + for (const issue of judged.issues) console.log(` - ${issue}`); + } + } catch (err) { + answerFails += 1; + console.log(` FAIL ${g.id.padEnd(18)} answer engine threw: ${err instanceof Error ? 
err.message : err}`); + process.exitCode = 1; +}); diff --git a/docs/scaling-demo/build-handoff.md b/docs/scaling-demo/build-handoff.md new file mode 100644 index 0000000..cf794cc --- /dev/null +++ b/docs/scaling-demo/build-handoff.md @@ -0,0 +1,85 @@ +# Build handoff — populate the scaling corpus and generate the vectors + +This is an executable brief for an agent (or person) running in an environment **with network access to the public-domain sources and an `OPENAI_API_KEY`**. The session that built `demo/` had neither: this repo's egress allowed only GitHub, and `api.openai.com` plus Gutenberg / archive.org were all blocked, so the code, structure, gold set, provenance manifest, and deterministic harness tests are authored and committed, but the real text bodies and the committed embedding vectors are not. This brief produces them. + +Read the spec (`docs/scaling-demo/SCALING-DEMO-spec.md`), the corpus manifest (`demo/corpus/README.md`), and the delta log (`docs/scaling-demo/scaling-demo-delta-log.md`) first. The frame governs: verify against the live source not against this doc, prefer the smaller change, and **never fabricate words for the real Adam Smith or the real George Adam Smith** — the only authored text is the quarantined synthetic spire. + +## 0. Prerequisites + +- `OPENAI_API_KEY` set (in `.env` or the environment). The build embeds with `text-embedding-3-large` at native dimensionality; nothing else will satisfy the homogeneity invariant (`src/store.ts`). +- Network egress to Project Gutenberg and the Internet Archive (and Wikipedia/Wikisource for the real route-target URLs). +- A clean offline baseline first: `npm test` and `npm run typecheck` green (they are, with the fixture tests; do not regress them). + +## 1. Create the corpus files + +One markdown file per **short whole unit** (a single prophet exposition, one chapter, one sermon). 
**Never a whole volume as one file** — a whole volume as one embedding dilutes its topical center (`NEXT-STEPS.md` B3) and washes out the near-ties the demo needs. Watch sermon length specifically: if a sermon is long enough that it would have to be split into windows to retrieve well, that is the **highest-stakes delta** (the demo would then chunk, and "in-memory and unchunked" breaks — log it in delta-log row 4 before doing it). + +Slugs are the filename stems and must match `demo/gold.yaml` exactly. Titles **carry the author's full name on purpose**: that is what makes the partial-name boost edge live (a query naming "Adam Smith" phrase-matches a title containing "George Adam Smith"). Author `themes` honestly from the actual text, **including where they collide** (both Smiths on "justice"); do not curate themes to make disambiguation easy. + +### Public ledger — Adam Smith (economist), dir `demo/corpus/public/adam-smith/` + +Record frontmatter: `title` (required, lead with "Adam Smith — "), `summary` (or `description`/`meaning`), `themes`. Body: the real unit text, lightly cleaned. + +| slug (filename) | unit to extract | suggested themes (verify against text) | +|---|---|---| +| `theory-of-moral-sentiments-justice` | _Theory of Moral Sentiments_, the section on justice and beneficence | justice, morality, society | +| `theory-of-moral-sentiments-sympathy` | _Theory of Moral Sentiments_, the opening on sympathy | sympathy, morality, the passions | +| `wealth-of-nations-division-of-labour` | _Wealth of Nations_, Bk I ch. 1 (division of labour) | labour, economy, society | +| `wealth-of-nations-value` | _Wealth of Nations_, Bk I on value / price | value, money, economy | + +### Public ledger — George Adam Smith (theologian), dir `demo/corpus/public/george-adam-smith/` + +Same frontmatter shape; lead titles with "George Adam Smith — ". 
+ +| slug (filename) | unit to extract | suggested themes (verify against text) | +|---|---|---| +| `twelve-prophets-amos` | _The Book of the Twelve Prophets_, the Amos exposition | justice, prophecy, righteousness | +| `twelve-prophets-hosea` | _The Book of the Twelve Prophets_, the Hosea exposition | love, mercy, prophecy | +| `twelve-prophets-micah` | _The Book of the Twelve Prophets_, the Micah exposition | justice, judgment, prophecy | +| `isaiah-prophet-of-faith` | _The Book of Isaiah_, one chapter exposition | faith, prophecy, judgment | + +Note the deliberate theme collision: Amos and Micah carry "justice," which Adam Smith's _Theory of Moral Sentiments_ also carries. That collision is wanted; the gold suite exposes where the theme boost mis-fires. + +### Private ledger — George sermons, dir `demo/corpus/private/` + +These are **real George minor works**, designated private (a layer assignment, not secrecy). Note frontmatter: `title` (the label that travels — keep it public-safe), `about` (a **real** public George page to route to, e.g. the work's Wikisource/IA page or `https://en.wikipedia.org/wiki/George_Adam_Smith`), `locator` (where the moment lives, e.g. "Forgiveness of Sins (1905), sermon II"). Body: the real sermon text. The id is `note:`. + +| slug (filename) | unit to extract | +|---|---| +| `forgiveness-of-sins` | _The Forgiveness of Sins, and Other Sermons_ (1905), the title sermon | +| `sermon-the-eternal-in-man` | the same volume, a second short sermon (use the actual title) | +| `sermon-faith-and-the-unseen` | the same volume, a third short sermon (use the actual title) | + +Confirm the actual sermon titles from the volume and rename slugs to match if needed (update `gold.yaml` in lockstep). **No economist material and no name-collision in the private ledger** — every private note is unambiguously George. + +## 2. 
Author the synthetic spire (only if the deliberate failure needs it) + +The spire is the scalpel for the deliberate failure (step 4), not a corpus filler. Author it **only if** the real route-margin tie does not flip under a tightened encoding on its own. Each synthetic note: +- lives in `demo/corpus/synthetic/` (the quarantine is one flag) and carries a `synthetic: true` frontmatter marker (a second, in-file flag; the `PrivateNote` type does not read it, so it changes nothing the engine sees), +- is a fabricated **George-private** note (never a third Smith, never words for the real Adam Smith), +- carries a one-line comment at the top of the body naming the gold case, the margin, and the mode it targets, +- is skewed toward must-refuse / route-flip, never an extra must-answer win. + +Suggested first spire note: `syn-amos-justice-margin` — a fabricated George note on Amos and justice, tuned to sit at the floor against `george-adam-smith:twelve-prophets-amos` so int8 holds the route but int4 flips it. + +## 3. Generate the committed vectors + +`npm run demo:build` (added in `package.json`) reads the corpus through the reused `buildCorpus` / `buildPrivateNotes`, embeds with the configured model, embeds the gold queries, and writes: +- `demo/corpus/index.json` — natural FP vectors (records + real private notes). The headline source of truth; committed. +- `demo/corpus/index.synthetic.json` — the spire delta (synthetic notes only), unioned under `--natural+synthetic`. +- `demo/corpus/query-vectors.json` — the gold-query vectors that make `demo:run` keyless. + +Commit all three. They derive from public-domain text, so committing them exposes nothing private (manifest §2); do not generalize that to private corpora. + +## 4. Run the gate, then calibrate the deliberate failure + +1. `npm run demo:run` (the `--natural` headline, no key needed once vectors are committed). Confirm: rank correlation FP-vs-int8 above the bar, and the full gold suite passes. 
Record the headline numbers in delta-log row 2 / 7. +2. Find the break: re-run at `--bits 4` (int4) or a lowered floor and confirm a **route** case flips and the gold suite **catches it**. Report the spire's effect on its own line, never folded into the headline. **If it does not fire, the near-ties are too loose: tighten the margin (the spire), do NOT add corpus** (delta-log row 3). This caught failure is the result the demo rests on; lead the README with it. +3. Optional keyed bonus: `npm run demo:run -- --full` runs the answer-mode adjudication (related-material routes without restating). This exercises selection, not A2 — int8 never touches the answer model's confabulation residue. + +## 5. Verify and reconcile (do these last, from real facts) + +- Fill the provenance table OCR-quality notes in `demo/corpus/README.md` from the actual files; verify every Gutenberg ID and the IA ARK against the live source. +- Fill the delta log rows with what the build actually did. Flag any `paper §5-§6` row immediately (especially row 4 if any unit had to be split). +- **Only once `demo:run` confirms the headline**, apply the deferred `NEXT-STEPS.md` reconciliation (prepared text in the delta log): distinguish the deliberately-simple **core** (full-precision, pulls no levers, indexes documents whole) from the **`demo/` miniature** (pulls exactly one lever, int8, on a short-whole-unit corpus; explicitly marked), and add the §C1 link to `demo/`. Do not claim "a runnable miniature ships" until it runs — that honesty is the whole point. +- Re-run `npm test` and `npm run typecheck`; both stay green. diff --git a/docs/scaling-demo/scaling-demo-delta-log.md b/docs/scaling-demo/scaling-demo-delta-log.md index 53dbffb..89ee8b0 100644 --- a/docs/scaling-demo/scaling-demo-delta-log.md +++ b/docs/scaling-demo/scaling-demo-delta-log.md @@ -1,6 +1,6 @@ # Delta log — scaling demo build -The lab notebook for building `scaling/`. 
The spec states assumptions; the build establishes facts; this log records every place they diverge. Fill it **during** testing, not after — the point is to write the downstream docs once, from ground truth, instead of authoring them under time pressure on merge day. +The lab notebook for building `demo/`. The spec states assumptions; the build establishes facts; this log records every place they diverge. Fill it **during** testing, not after — the point is to write the downstream docs once, from ground truth, instead of authoring them under time pressure on merge day. **Why this exists:** the reconciliation edits (NEXT-STEPS C-intro/C1, the paper §5/§6 line) are *descriptions of what the built demo actually does*. They can't be written accurately before the build, and "verify against the live repo, never against the brief" applies one level up here too. Defer the prose; don't defer the obligation — every row tagged `paper` or `NEXT-STEPS` is a downstream edit that comes due at merge. @@ -9,37 +9,98 @@ The lab notebook for building `scaling/`. The spec states assumptions; the build For each assumption the spec makes, record what the build actually did and what that touches. A row only matters if reality diverged or confirmed-under-doubt. The **Touches** column is the early-warning system: most deltas are `spec` (fix the spec so it stays true) or `nothing`; the ones tagged `paper` are the ones that change a published claim and must not be discovered by a referee. **Touches** values: -- `spec` — correct `SCALING-DEMO-spec.md` / `scaling/corpus/README.md` so they describe the real build. +- `spec` — correct `SCALING-DEMO-spec.md` / `demo/corpus/README.md` so they describe the real build. - `NEXT-STEPS` — the C-intro/C1 core-vs-miniature reconciliation depends on this fact. - `paper §5–§6` — changes a claim in the paper (in-memory, unchunked, pulls no levers). Highest stakes. Flag immediately. - `nothing` — confirmed as assumed; log it so you know it was checked. 
## Pre-seeded rows (the deltas most likely to surface) +**Build context (read before the rows).** The session that built `demo/` +had egress to GitHub only: `api.openai.com`, Gutenberg, and archive.org all +returned `host_not_allowed`, and no `OPENAI_API_KEY` was set. So the code, the +gold set, the provenance manifest, and the deterministic harness tests are +committed and green, but the real text bodies and the committed vectors are +**pending a build run** (a local agent with network + key; see +`build-handoff.md`). Rows about what the *real run* produced are marked PENDING; +rows about the *mechanism and structure* are settled now. + | # | Spec assumption | What the build actually did | Touches | Downstream action | |---|---|---|---|---| -| 1 | Score floor as shipped (`SCORE_FLOOR`) puts marginal cases where int8 can flip them | _fill: kept / tightened to \_ | `spec`, maybe `NEXT-STEPS` (B1) | If moved, document the new floor and that it's model-dependent (B1) | -| 2 | int8 holds the full gold suite on the real corpus (headline pass) | _fill: held / didn't_ | `nothing` if held; investigate if not | Headline number for §6/C1 | -| 3 | A tightened encoding (int4 / lowered floor) flips a **route** case and the gold suite catches it — the deliberate failure | _fill: fired at \ / did NOT fire_ | `spec` if settings changed | **If it doesn't fire, near-ties are too loose — tighten margin, do NOT add corpus.** This is the result the demo rests on | -| 4 | George sermons index as short **whole** units without diluting their topical center (so "indexes documents whole" stays true) | _fill: whole units worked / had to split a sermon_ | **`paper §5–§6`** if split | **Highest stakes.** If any unit is split into windows, the demo now chunks; "in-memory and unchunked" breaks and the §5 reconciliation grows. 
Watch sermon length specifically | -| 5 | `EXACT_MATCH_BOOST = 0.30` fires (or not) on "Adam Smith" vs "George Adam Smith" partial match as the gold case predicts | _fill: actual behavior_ | `spec` | Pin the observed behavior in the gold case | -| 6 | Both-Smith shared theme (e.g. "justice") mis-fires the theme boost, and the gold suite exposes it | _fill: observed / didn't occur_ | `spec` | Keep as exposed near-tie; do not curate themes to suppress it | -| 7 | FP vectors commit cleanly and the default run reproduces with no key | _fill: yes / issue_ | `spec` | Confirms the no-key headline claim | -| 8 | Demo is a thin module: reuses `src/retrieve.ts` + `src/no-leak.ts` untouched, no second pipeline | _fill: stayed thin / needed more_ | `spec`; **halt if it needs its own pipeline** | If it can't stay thin, propose a sibling repo per the budget rule — do not bloat | +| 1 | Score floor as shipped (`SCORE_FLOOR`) puts marginal cases where int8 can flip them | Confirmed `SCORE_FLOOR = 0.2` in `src/retrieve.ts` (model-dependent, B1); gold cases authored near it. Real-run margin PENDING build | `spec`, maybe `NEXT-STEPS` (B1) | If the build moves it, document the new floor and that it's model-dependent | +| 2 | int8 holds the full gold suite on the real corpus (headline pass) | PENDING build. Mechanism proven offline (`quantize.test.ts`: int8 certifies the route case) | `nothing` if held; investigate if not | Headline number for §6/C1, recorded at build | +| 3 | A tightened encoding (int4 / lowered floor) flips a **route** case and the gold suite catches it | Mechanism PROVEN offline (the payload test: int8 holds, int4 flips the top slot, the gate catches it). 
Real spire authored (`syn-amos-justice-margin`); calibration to fire on real vectors PENDING build | `spec` if settings changed | **If it doesn't fire on real vectors, tighten the margin, do NOT add corpus.** Handoff §4 | +| 4 | George sermons index as short **whole** units (so "indexes documents whole" stays true) | Authored as short whole units, one sermon per file, not split. Real sermon length not verifiable here (no source access) | **`paper §5–§6`** if split | **Highest stakes.** Build agent watches sermon length; if any unit is split into windows, "in-memory and unchunked" breaks and the §5 reconciliation grows | +| 5 | `EXACT_MATCH_BOOST = 0.30` fires (or not) on the "Adam Smith" vs "George Adam Smith" partial match | Designed live: record **titles carry the full author name** so a query's "Adam Smith" phrase-matches a "George Adam Smith" title; `boost-edge-micah` gold pins it. Observed behavior PENDING build | `spec` | Pin the observed behavior in the gold case at build | +| 6 | Both-Smith shared theme (e.g. "justice") mis-fires the theme boost, and the gold suite exposes it | Collision authored on purpose (Amos/Micah carry "justice", as does TMS); not curated away. Observed behavior PENDING build | `spec` | Keep as an exposed near-tie; do not curate themes to suppress it | +| 7 | FP vectors commit cleanly and the default run reproduces with no key | Keyless runner built and degrades cleanly. **DIVERGENCE:** the core eval CLI requires a key (it embeds gold queries at run time), so the keyless headline needs committed **gold-query** vectors, not just FP vectors. 
Added `query-vectors.json`; `build.ts` writes it | `spec` | Spec §5 should say "FP **and gold-query** vectors committed" for the no-key claim | +| 8 | Demo is a thin module: reuses `src/retrieve.ts` + `src/no-leak.ts` untouched, no second pipeline | **CONFIRMED.** Reuses `retrieve()`, `cosine()`, the no-leak boundary, the gold judges, `store`, the corpus loaders, and embedding untouched; the int8 path is `quantize.ts` plus a re-rank. No core types changed. Budget held, **no halt** | `spec` | None; the budget claim holds | -## Open-ended rows (add as testing surfaces them) +## Open-ended rows (surfaced during the build) | # | Spec assumption | What the build actually did | Touches | Downstream action | |---|---|---|---|---| -| 9 | | | | | -| 10 | | | | | +| 9 | Spec §2: "records carry real public URLs via the normal record path" | **DIVERGENCE.** Per-unit real URLs do not exist (Gutenberg is work-level) and `src/corpus.ts` is reused untouched, so record citation URLs are constructed demo-canonical (`.example` TLD), symmetric across both authors; the provenance table holds the real sources, and private-note `about` targets ARE real | `spec` | Soften §2 to "demo-canonical citations, real provenance + real route targets" (already stated in `corpus/README.md`) | +| 10 | Spec quotes `NEXT-STEPS.md` §C1 as already saying "a runnable miniature ships at `demo/`" | The live §C1 has no such line. The §C1 link and the C-intro carve-out are **deferred reconciliation** (prepared text below), applied by the build agent **after** `demo:run` confirms the headline, so "runnable" is verified not asserted | `NEXT-STEPS` | Apply the prepared edit at build, not before | +| 11 | Spec §7: `production-scaling.md` location "unconfirmed (subdir or pending)" | RESOLVED at `docs/production-scaling.md`, em-dashes already thinned (fix 2.4 landed). 
`demo/README.md` cross-links it | `spec` | None; resolved | +| 12 | The keyless gate catches a quantization flip | `judgeRetrieval` checks presence in top-K only, so it misses a flip where both candidates stay retrieved but swap rank. **Added a keyless top-slot check** (`topSource`): for any non-refusal case with an expected source, that source must WIN the top slot, not merely appear. Covers **route** (the private note must outrank the records) and, extended per review, **disambiguation** (the right Smith must outrank the wrong one — otherwise the headline's marquee verdict was protected only by the keyed `--full` pass) | `spec` | Note the top-slot check in the spec's §5 harness description; it is what makes the disambiguation verdict a keyless one | +| 13 | The answer-mode pass governs the route/refuse verdicts | The keyless headline covers retrieval + route selection + refuse-by-floor; the answer-mode adjudication (related-material routes without restating) is the `--full` keyed pass. `answerQuestion` short-circuits to not-found on empty evidence, so refuse-by-empty-floor is keyless even under `--full`. Route tests selection, not A2 | `spec` | Clarify the two tiers (keyless retrieval gate vs keyed answer gate) in §5 | +| 14 | (build) the corpus and vectors are produced in this session | Blocked by egress (GitHub-only) and a missing key; deferred to a local agent per `build-handoff.md`. Code, structure, gold, and tests committed and green | `nothing` (process) | Run the handoff to complete the demo | +| 15 | `.github/STANDARDS.md` line 51: "Don't leak private embeddings/text into committed artifacts" | The demo commits `demo/corpus/index.json` with the public-domain George "private"-layer vectors, **on purpose** (spec §5): the layer is public-domain (a layer assignment, not secrecy), the file is deliberately not gitignored, and README §2 + manifest §2 explain it with the inversion warning. The automated review flagged the standard. 
No design change; the standard is about genuinely-private data | `STANDARDS` reconciliation | Add a one-line carve-out at merge (prepared below) so the demo's public-domain exception is named, not re-flagged | +| 16 | Spec §7 proposes the module at a top-level `scaling/` | **Renamed to `demo/`** (npm scripts `demo:build/run/test`) per the author: `scaling/` read like a subsystem; the artifact is a demo. The historical `SCALING-DEMO-spec.md` and `scaling-corpus-README.md` draft keep `scaling/` as the original proposal | `spec` | The spec's `scaling/` references are the pre-rename proposal; this log is the bridge. Update the spec's path words if it is ever revised | ## Merge-day assembly (do this the day the demo lands, while it's hot) Walk the log top to bottom: - Every `spec` row → correct the spec and corpus README so they're true. -- Every `NEXT-STEPS` row → write the C-intro/C1 edit distinguishing core (pulls no levers) from `scaling/` miniature (pulls one, marked), using the actual facts logged. +- Every `NEXT-STEPS` row → write the C-intro/C1 edit distinguishing core (pulls no levers) from `demo/` miniature (pulls one, marked), using the actual facts logged. - Every `paper §5–§6` row → write the one-line bridge so §5's "in-memory and unchunked … pulls none of these levers" reads as describing the core. **If row 4 fired (a unit was split), this is no longer one line — the unchunked claim itself needs revisiting.** - Confirm the anonymization checklist still covers any new identifying surface the demo added. The reconciliation is then assembly from recorded facts, not authorship under pressure. That was the point of keeping the log. + +## Prepared reconciliation text (apply at build, once `demo:run` confirms the headline) + +These edits describe what the demo *does*. They are held here, not applied, +because the demo is not runnable until the vectors are built (rows 2, 14). 
Apply +them only after `demo:run --natural` confirms the headline, so "a runnable +miniature ships" is verified, not asserted. + +**`NEXT-STEPS.md` §C-intro** (row 10). It currently reads: "This repository is +full-precision and indexes documents whole; it pulls none of these levers." +Once `demo/` lands, the repo contains int8 code, so distinguish core from +miniature, for example: + +> This repository's **core** is full-precision and indexes documents whole; it +> pulls none of these levers. The one exception is the marked illustration at +> `demo/`: a runnable int8 miniature on a short-whole-unit public-domain +> corpus, which pulls exactly one lever (int8 quantization) to show the gold +> suite gating it. The core's claims stay true of the core; `demo/` is named +> as the exception. (It still indexes short units *whole*, so "indexes documents +> whole" holds; only the lever claim needs the carve-out.) + +**`NEXT-STEPS.md` §C1** (row 10). Add a pointer in the int8 lever, for example: +"A runnable miniature of this lever ships at `demo/` (see +`demo/README.md`); it is the public, gated counterpart to the private +production figures above." + +**`.github/STANDARDS.md` line 51** (row 15, raised by the automated review). +"Don't leak private embeddings/text into committed artifacts. (The index is +gitignored for a reason.)" stays true of the core. Name the demo's exception so +it is not re-flagged, for example: + +> The one exception is the `demo/` demo. Its "private" layer is public-domain +> text by design (a layer assignment, not secrecy), so it commits +> `demo/corpus/index.json` on purpose, to reproduce the headline with no key. +> See `demo/README.md` §2 for why that is safe there and must not be +> generalized to a genuinely-private corpus. + +**Paper §5/§6 (author's call, conditional).** The published note's §5 says +retrieval is "in-memory and unchunked … indexed whole." That stays true of the +core and of the demo's short whole units. 
**Only if `demo/` is in an +anonymized submission snapshot** does §6 want a one-line bridge so §5 reads as +describing the core, not the `demo/` exception. This is a paper edit, the +author's not the agent's, and it is moot if `demo/` is deferred past review. +Note in the build summary whether `demo/` is present in any snapshot built. +**If row 4 fired (a sermon had to be split), the unchunked claim itself needs +revisiting, not just a bridge.** diff --git a/package.json b/package.json index f2ddc08..57cd8d3 100644 --- a/package.json +++ b/package.json @@ -12,7 +12,10 @@ "index": "node --env-file-if-exists=.env --import tsx src/cli/build-index.ts", "ask": "node --env-file-if-exists=.env --import tsx src/cli/ask.ts", "eval": "node --env-file-if-exists=.env --import tsx src/cli/eval.ts", - "test": "node --import tsx --test test/*.test.ts", + "demo:build": "node --env-file-if-exists=.env --import tsx demo/build.ts", + "demo:run": "node --env-file-if-exists=.env --import tsx demo/run.ts", + "demo:test": "node --import tsx --test demo/*.test.ts", + "test": "node --import tsx --test test/*.test.ts demo/*.test.ts", "typecheck": "tsc --noEmit" }, "dependencies": { diff --git a/tsconfig.json b/tsconfig.json index ffaed3c..ecdf61d 100644 --- a/tsconfig.json +++ b/tsconfig.json @@ -12,6 +12,6 @@ "skipLibCheck": true, "noEmit": true }, - "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts"], + "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts", "demo/**/*.ts"], "exclude": ["node_modules", "artifacts"] }