From f372ec1d7614fa07321aa4b5ffb002e3b2d7abe4 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:21:57 +0000 Subject: [PATCH 01/10] =?UTF-8?q?feat(scaling):=20scaffold=20int8=20demo?= =?UTF-8?q?=20=E2=80=94=20config,=20corpus=20manifest,=20build=20handoff?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Thin module beside the core (the budget rule): scaling.config.ts points the reused engine at scaling/corpus/ with two name-colliding public-domain authors (Adam Smith the economist, George Adam Smith the theologian). Adds the corpus provenance + authored-choices manifest (scaling/corpus/README.md) and a build handoff for an environment with network + an OpenAI key, since this session's egress allowed only GitHub and api.openai.com was blocked. Brings scaling/ under typecheck. No core behavior changed. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- docs/scaling-demo/build-handoff.md | 85 +++++++++++++++++++ scaling/corpus/README.md | 63 ++++++++++++++ scaling/corpus/private/.gitkeep | 0 scaling/corpus/public/adam-smith/.gitkeep | 0 .../corpus/public/george-adam-smith/.gitkeep | 0 scaling/corpus/synthetic/.gitkeep | 0 scaling/scaling.config.ts | 51 +++++++++++ tsconfig.json | 2 +- 8 files changed, 200 insertions(+), 1 deletion(-) create mode 100644 docs/scaling-demo/build-handoff.md create mode 100644 scaling/corpus/README.md create mode 100644 scaling/corpus/private/.gitkeep create mode 100644 scaling/corpus/public/adam-smith/.gitkeep create mode 100644 scaling/corpus/public/george-adam-smith/.gitkeep create mode 100644 scaling/corpus/synthetic/.gitkeep create mode 100644 scaling/scaling.config.ts diff --git a/docs/scaling-demo/build-handoff.md b/docs/scaling-demo/build-handoff.md new file mode 100644 index 0000000..bbfd3d2 --- /dev/null +++ b/docs/scaling-demo/build-handoff.md @@ -0,0 +1,85 @@ +# Build handoff — populate the scaling corpus and generate the vectors + +This is an executable brief for 
an agent (or person) running in an environment **with network access to the public-domain sources and an `OPENAI_API_KEY`**. The session that built `scaling/` had neither: this repo's egress allowed only GitHub, and `api.openai.com` plus Gutenberg / archive.org were all blocked, so the code, structure, gold set, provenance manifest, and deterministic harness tests are authored and committed, but the real text bodies and the committed embedding vectors are not. This brief produces them. + +Read the spec (`docs/scaling-demo/SCALING-DEMO-spec.md`), the corpus manifest (`scaling/corpus/README.md`), and the delta log (`docs/scaling-demo/scaling-demo-delta-log.md`) first. The frame governs: verify against the live source not against this doc, prefer the smaller change, and **never fabricate words for the real Adam Smith or the real George Adam Smith** — the only authored text is the quarantined synthetic spire. + +## 0. Prerequisites + +- `OPENAI_API_KEY` set (in `.env` or the environment). The build embeds with `text-embedding-3-large` at native dimensionality; nothing else will satisfy the homogeneity invariant (`src/store.ts`). +- Network egress to Project Gutenberg and the Internet Archive (and Wikipedia/Wikisource for the real route-target URLs). +- A clean offline baseline first: `npm test` and `npm run typecheck` green (they are, with the fixture tests; do not regress them). + +## 1. Create the corpus files + +One markdown file per **short whole unit** (a single prophet exposition, one chapter, one sermon). **Never a whole volume as one file** — a whole volume as one embedding dilutes its topical center (`NEXT-STEPS.md` B3) and washes out the near-ties the demo needs. Watch sermon length specifically: if a sermon is long enough that it would have to be split into windows to retrieve well, that is the **highest-stakes delta** (the demo would then chunk, and "in-memory and unchunked" breaks — log it in delta-log row 4 before doing it). 
+ +Slugs are the filename stems and must match `scaling/gold.yaml` exactly. Titles **carry the author's full name on purpose**: that is what makes the partial-name boost edge live (a query naming "Adam Smith" phrase-matches a title containing "George Adam Smith"). Author `themes` honestly from the actual text, **including where they collide** (both Smiths on "justice"); do not curate themes to make disambiguation easy. + +### Public ledger — Adam Smith (economist), dir `scaling/corpus/public/adam-smith/` + +Record frontmatter: `title` (required, lead with "Adam Smith — "), `summary` (or `description`/`meaning`), `themes`. Body: the real unit text, lightly cleaned. + +| slug (filename) | unit to extract | suggested themes (verify against text) | +|---|---|---| +| `theory-of-moral-sentiments-justice` | _Theory of Moral Sentiments_, the section on justice and beneficence | justice, morality, society | +| `theory-of-moral-sentiments-sympathy` | _Theory of Moral Sentiments_, the opening on sympathy | sympathy, morality, the passions | +| `wealth-of-nations-division-of-labour` | _Wealth of Nations_, Bk I ch. 1 (division of labour) | labour, economy, society | +| `wealth-of-nations-value` | _Wealth of Nations_, Bk I on value / price | value, money, economy | + +### Public ledger — George Adam Smith (theologian), dir `scaling/corpus/public/george-adam-smith/` + +Same frontmatter shape; lead titles with "George Adam Smith — ". 
+ +| slug (filename) | unit to extract | suggested themes (verify against text) | +|---|---|---| +| `twelve-prophets-amos` | _The Book of the Twelve Prophets_, the Amos exposition | justice, prophecy, righteousness | +| `twelve-prophets-hosea` | _The Book of the Twelve Prophets_, the Hosea exposition | love, mercy, prophecy | +| `twelve-prophets-micah` | _The Book of the Twelve Prophets_, the Micah exposition | justice, judgment, prophecy | +| `isaiah-prophet-of-faith` | _The Book of Isaiah_, one chapter exposition | faith, prophecy, judgment | + +Note the deliberate theme collision: Amos and Micah carry "justice," which Adam Smith's _Theory of Moral Sentiments_ also carries. That collision is wanted; the gold suite exposes where the theme boost mis-fires. + +### Private ledger — George sermons, dir `scaling/corpus/private/` + +These are **real George minor works**, designated private (a layer assignment, not secrecy). Note frontmatter: `title` (the label that travels — keep it public-safe), `about` (a **real** public George page to route to, e.g. the work's Wikisource/IA page or `https://en.wikipedia.org/wiki/George_Adam_Smith`), `locator` (where the moment lives, e.g. "Forgiveness of Sins (1905), sermon II"). Body: the real sermon text. The id is `note:<slug>`. + +| slug (filename) | unit to extract | +|---|---| +| `forgiveness-of-sins` | _The Forgiveness of Sins, and Other Sermons_ (1905), the title sermon | +| `sermon-the-eternal-in-man` | the same volume, a second short sermon (use the actual title) | +| `sermon-faith-and-the-unseen` | the same volume, a third short sermon (use the actual title) | + +Confirm the actual sermon titles from the volume and rename slugs to match if needed (update `gold.yaml` in lockstep). **No economist material and no name-collision in the private ledger** — every private note is unambiguously George. + +## 2. 
Author the synthetic spire (only if the deliberate failure needs it) + +The spire is the scalpel for the deliberate failure (step 4), not a corpus filler. Author it **only if** the real route-margin tie does not flip under a tightened encoding on its own. Each synthetic note: +- lives in `scaling/corpus/synthetic/` (the quarantine **is** the flag; there is no `synthetic` type field), +- is a fabricated **George-private** note (never a third Smith, never words for the real Adam Smith), +- carries a one-line comment at the top of the body naming the gold case, the margin, and the mode it targets, +- is skewed toward must-refuse / route-flip, never an extra must-answer win. + +Suggested first spire note: `syn-amos-justice-margin` — a fabricated George note on Amos and justice, tuned to sit at the floor against `george-adam-smith:twelve-prophets-amos` so int8 holds the route but int4 flips it. + +## 3. Generate the committed vectors + +`npm run scaling:build` (added in `package.json`) reads the corpus through the reused `buildCorpus` / `buildPrivateNotes`, embeds with the configured model, embeds the gold queries, and writes: +- `scaling/corpus/index.json` — natural FP vectors (records + real private notes). The headline source of truth; committed. +- `scaling/corpus/index.synthetic.json` — the spire delta (synthetic notes only), unioned under `--natural+synthetic`. +- `scaling/corpus/query-vectors.json` — the gold-query vectors that make `scaling:run` keyless. + +Commit all three. They derive from public-domain text, so committing them exposes nothing private (manifest §2); do not generalize that to private corpora. + +## 4. Run the gate, then calibrate the deliberate failure + +1. `npm run scaling:run` (the `--natural` headline, no key needed once vectors are committed). Confirm: rank correlation FP-vs-int8 above the bar, and the full gold suite passes. Record the headline numbers in delta-log row 2 / 7. +2. 
Find the break: re-run at `--bits 4` (int4) or a lowered floor and confirm a **route** case flips and the gold suite **catches it**. Report the spire's effect on its own line, never folded into the headline. **If it does not fire, the near-ties are too loose: tighten the margin (the spire), do NOT add corpus** (delta-log row 3). This caught failure is the result the demo rests on; lead the README with it. +3. Optional keyed bonus: `npm run scaling:run -- --full` runs the answer-mode adjudication (related-material routes without restating). This exercises selection, not A2 — int8 never touches the answer model's confabulation residue. + +## 5. Verify and reconcile (do these last, from real facts) + +- Fill the provenance table OCR-quality notes in `scaling/corpus/README.md` from the actual files; verify every Gutenberg ID and the IA ARK against the live source. +- Fill the delta log rows with what the build actually did. Flag any `paper §5-§6` row immediately (especially row 4 if any unit had to be split). +- **Only once `scaling:run` confirms the headline**, apply the deferred `NEXT-STEPS.md` reconciliation (prepared text in the delta log): distinguish the deliberately-simple **core** (full-precision, pulls no levers, indexes documents whole) from the **`scaling/` miniature** (pulls exactly one lever, int8, on a short-whole-unit corpus; explicitly marked), and add the §C1 link to `scaling/`. Do not claim "a runnable miniature ships" until it runs — that honesty is the whole point. +- Re-run `npm test` and `npm run typecheck`; both stay green. diff --git a/scaling/corpus/README.md b/scaling/corpus/README.md new file mode 100644 index 0000000..f4e1f0a --- /dev/null +++ b/scaling/corpus/README.md @@ -0,0 +1,63 @@ +# The scaling-demo corpus + +This folder holds the corpus for the int8 scaling demo. 
This README is the corpus's **answerable half**: the mechanism makes the unauthored move inexpressible; this document owns, in the open, every authored choice behind the data. Each entry names the choice and the reason it was made. None of it is hidden, so none of it is a concession; it is the record of decisions a maintainer signs for. + +If you are reading this to attack the corpus, the choices you would reach for are below, named first. + +## What this corpus is + +A name-collision corpus over two real, public-domain authors who share a name: + +- **Adam Smith**, the economist and moral philosopher (1723-1790). +- **George Adam Smith**, the theologian and historical geographer (1856-1942). ("Adam" is a middle name; the partial-name match is deliberate, see the boost edge case in the gold suite.) + +Both write dense moral prose about justice, society, and ethics, so the two bodies of work sit close in embedding space. That proximity, not corpus size, is the point: it packs the near-ties where int8 rounding can reorder candidates, which is the only condition under which the demo tests anything. + +## Build status + +The text bodies and the embedding vectors are produced by `scaling/build.ts`, which needs network access to the public-domain sources and an `OPENAI_API_KEY`. The code, the structure, the gold set, the provenance table below, and the deterministic harness tests are authored and committed; the real bodies and the committed `index.json` / `query-vectors.json` are populated by a build run with those two things. See `docs/scaling-demo/build-handoff.md` for the exact build steps. **Every ID and date below is a claim to verify against the live source during that run, not a confirmation made here.** + +## Provenance and public-domain status + +Every source, with the basis for its public-domain status. Public domain is the *absence* of copyright, not a license: this corpus is not "permissively licensed," it is public-domain. 
State the basis in both jurisdictions cleanly, since they rest on different facts: +- **US:** published before 1931, so public domain in the USA. (As of 1 Jan 2026, works published in 1930 and earlier are PD in the US.) +- **Life-plus-70 jurisdictions:** public domain once the author has been dead 70 years. In 2026 that covers authors who died in 1955 or earlier; George Adam Smith died 1942 and Adam Smith in 1790, so both are clear. + +Verify each ID and date against the source before relying on it; fill OCR-quality notes from the actual file. + +| Work (unit) | Author | Pub. | Layer | Source (ID) | PD basis | Notes | +|---|---|---|---|---|---|---| +| _Theory of Moral Sentiments_, §\ | Adam Smith | 1759 | public | Gutenberg \ | US: pre-1931 / PD in USA. Life+70: author d. 1790; term expired | _verify; fill: clean / OCR-noisy_ | +| _Wealth of Nations_, bk\ ch\ | Adam Smith | 1776 | public | Gutenberg \ | US: pre-1931 / PD in USA. Life+70: author d. 1790; term expired | _verify_ | +| _The Book of the Twelve Prophets_, \ | George Adam Smith | 1896-98 | public | Gutenberg 43847 / 50747 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | +| _The Book of Isaiah_, ch\ | George Adam Smith | 1888-90 | public | Gutenberg 39767 / 43672 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | +| _The Forgiveness of Sins, and Other Sermons_, \ | George Adam Smith | 1905 (A. C. Armstrong & Son) | **private** | Internet Archive `forgivenessofsin00smitrich` (ARK `ark:/13960/t0gt5jk4g`); HathiTrust full-view backup record 100136688 | US: pre-1931 / PD in USA. Life+70: author d. 
1942; term expired | _verify NOT_IN_COPYRIGHT; OCR-noisy expected, which is fine_ | +| \ | — (fabricated) | — | **synthetic** | authored here | n/a (no copyright in fabricated demo text) | quarantined in `synthetic/`; tests \ | + +**Sourcing (resolved, pending verification).** George's *major* commentaries are listed on Project Gutenberg. The private layer rests on *The Forgiveness of Sins, and other Sermons* (1905), a single volume yielding several short, windy sermon units, which is exactly what the private layer needs: short whole units that route without restating. *Jeremiah: Being the Baird Lecture for 1922* (1923) is a further minor source if wanted. The fallback (designating a *section* of a major work private) is therefore **not** required; if a future rebuild loses these sources, that fallback keeps the private layer real rather than padding it with synthetic. + +**OPEN — the one sourcing check that can block the build.** Confirm George's minor/windy material (the sermons) actually downloads as clean-enough public-domain text. If only the big commentaries are digitized, the private ledger is thin: use the fallback (a short *section* of a major work, designated private) rather than padding with synthetic, which would turn the spire into a column. Record the outcome here. + +## URLs: demo-canonical citations, real route targets + +A record's citation URL is constructed by the reused `src/corpus.ts` path (`baseUrl + urlPrefix + slug`) under the reserved `.example` TLD, so it is a stable demo surface rather than a live page; the real sources are the provenance table above. This keeps `src/corpus.ts` untouched (the budget rule) and is symmetric across both authors, so neither Smith reads as the decoy. A private note's `about` is taken verbatim from frontmatter, so the routing targets ARE real public George pages. 
The delta log records this as a divergence from the spec's "records carry real public URLs," with the reason (per-unit real URLs do not exist; Gutenberg is work-level). + +## The authored choices (named first, owned in the open) + +**1. The corpus is partly fabricated, and the claim does not depend on its realism.** The public layer (both Smiths) is real public-domain text the maintainer did not write. The private layer is real George minor works. The synthetic notes are a small, flagged set (below). The demo's claim is *relative*: int8 preserves the verdicts full-precision produces, and where it does not, the gate catches it. Realism is never asserted; the baseline runs on text the maintainer does not control. + +**2. "Private" is a layer assignment, not a claim of secrecy.** George was a public figure and all his work is published; designating some of it private means only that *the type cannot carry its text to the model*, regardless of what the text is. The whole repo works this way (the default example corpus is synthetic "Person A"). Everything here is exposed in the repo on purpose: seeing the full private text, then watching the type admit only its routing hint, is the demonstration, not a contradiction of it. **This is also why this demo can commit its embedding vectors when the main repo gitignores its index: these vectors derive from public-domain text, so they expose nothing already private. Do not copy "commit your vectors" as a general pattern: embeddings of genuinely private text can be inverted to recover approximate content, which is the exposure the main repo's gitignored index avoids.** + +**3. No fabricated words are attributed to the real Adam Smith, and synthetic notes are flagged in the data.** Every fabricated note lives in the quarantined `synthetic/` directory and names the edge case it tests, so nothing can be mistaken for either Smith's actual writing even lifted out of context. 
Real George material is handled as George's; synthetic is never confusable with it. + +**4. The corpus is not tuned so int8 passes.** Headline numbers come from the real-only (`--natural`) run. The demo deliberately *includes a failure*: a tightened encoding (int4, or a lowered floor) breaking a route case, caught by the gold suite. Shipping a caught failure is the opposite of tuning to pass; it is how the demo shows the gate can say no. + +**5. Themes are authored honestly, including where they collide.** Both Smiths carry shared themes (e.g. "justice"), so a verbatim theme match can hand the boost to the wrong Smith. That mis-fire is a near-tie the gold suite *exposes*, not one smoothed away by curation. Themes are not shaped to make disambiguation easy; doing so would special-case the corpus, which the eval forbids. + +**6. The public/private split is a research decision, stated as one.** Major, legible George works go to the public records layer; minor, windy works go to the private routing layer. This is an authored, answerable choice made to exercise both the disambiguation path (public) and the routing path (private) without confounding them, not a natural fact about the texts. The private layer is George-only, so the disambiguation problem (which Smith?) stays entirely in the public layer and never contaminates the boundary demonstration. + +**7. The synthetic layer is a spire, not a column.** It is deliberately small. A large synthetic layer would invert the honesty of the demo (headline numbers riding on authored text) and would mean a large body of fabricated words attributed to a real person, in a project about provenance and backing. If more near-ties are ever needed, the lever is a tighter floor and boosts (a calibration question, gold-gated), not more fabricated text. + +## Scope of this README + +This file documents the **data and the choices behind it** only. 
The mechanism (the type boundary, retrieval, the modes), the eval, and the int8 harness are documented where they live; this is not the place to restate them. Provenance and authored choices here; everything else by reference. diff --git a/scaling/corpus/private/.gitkeep b/scaling/corpus/private/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/scaling/corpus/public/adam-smith/.gitkeep b/scaling/corpus/public/adam-smith/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/scaling/corpus/public/george-adam-smith/.gitkeep b/scaling/corpus/public/george-adam-smith/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/scaling/corpus/synthetic/.gitkeep b/scaling/corpus/synthetic/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/scaling/scaling.config.ts b/scaling/scaling.config.ts new file mode 100644 index 0000000..df08739 --- /dev/null +++ b/scaling/scaling.config.ts @@ -0,0 +1,51 @@ +// scaling.config.ts — points the engine at the int8 scaling-demo corpus. +// +// This is the same ArchiveConfig shape the core uses (src/types.ts), pointed at +// scaling/corpus/ instead of example-content/. The demo reuses the core +// retrieval, the no-leak boundary, and the eval judges untouched; only the +// corpus, the gold set, and a thin int8 pass are new (see scaling/README.md). +// +// Two authors share one colliding name on purpose: Adam Smith the economist +// (1723-1790) and George Adam Smith the theologian (1856-1942). Both write +// dense moral prose about justice and society, so their records sit close in +// embedding space; that proximity is what packs the near-ties int8 rounding can +// reorder. authorName names the collection rather than one person because the +// demo's whole subject is disambiguation; the gold queries name each Smith +// explicitly rather than relying on {{author}} substitution. 
+// +// On URLs: a record's citation URL is built by the reused corpus path +// (baseUrl + urlPrefix + slug), so it is a demo-canonical surface under the +// reserved .example TLD (RFC 2606), not a live page. The real public-domain +// sources live in scaling/corpus/README.md's provenance table, per work. A +// private note's `about` is taken verbatim from frontmatter, so those route +// targets ARE real public George pages. See the delta log for this divergence +// from the spec's "records carry real public URLs" assumption and why it keeps +// src/corpus.ts untouched. + +import type { ArchiveConfig } from '../src/types.js'; + +export const config: ArchiveConfig = { + archiveName: 'Smith Collection (int8 scaling demo)', + authorName: 'Adam Smith and George Adam Smith', + baseUrl: 'https://smith-collection.example', + contentRoot: './scaling/corpus', + collections: [ + { dir: 'public/adam-smith', urlPrefix: '/adam-smith/', type: 'adam-smith' }, + { dir: 'public/george-adam-smith', urlPrefix: '/george/', type: 'george-adam-smith' }, + ], + // The private layer: George's minor works (sermons, addresses), searchable + // but never quotable. Designating published work "private" is a layer + // assignment enforced by the type, not a claim of secrecy (README §2). + privateNotesDir: './scaling/corpus/private', + // Matches archive.config.ts. The int8 demo depends on this: the committed + // vectors must be text-embedding-3-large at native dimensionality or the + // homogeneity invariant (src/store.ts) rejects them. + embeddingModel: 'text-embedding-3-large', + answerModel: 'gpt-4o-mini', +}; + +// The quarantined synthetic spire (scaling/corpus/synthetic/) is loaded as an +// ADDITIONAL private-notes dir only under --natural+synthetic, never here. Its +// location is the flag: nothing in scaling/corpus/synthetic/ is real George +// text. See scaling/run.ts and README §3. 
+export const SYNTHETIC_NOTES_DIR = './scaling/corpus/synthetic'; diff --git a/tsconfig.json b/tsconfig.json index ffaed3c..1e30c72 100644 --- a/tsconfig.json +++ b/tsconfig.json @@ -12,6 +12,6 @@ "skipLibCheck": true, "noEmit": true }, - "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts"], + "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts", "scaling/**/*.ts"], "exclude": ["node_modules", "artifacts"] } From bba5da23653dc7e91c3514cf8bcb36ee0a78079b Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:24:26 +0000 Subject: [PATCH 02/10] feat(scaling): gold suite tuned to the floor and the name collision scaling/gold.yaml, real-only (--natural). Same three-mode shape as the core set: disambiguation both ways (economist vs theologian over shared "justice" themes and the name-boost mis-fire), the partial-name boost edge ("Adam Smith" phrase-matching a "George Adam Smith" title), a route case the private sermon must win without restating, and a refuse case. Cases sit near the floor and near each other, where int8 rounding can reorder them. Source ids match the corpus slugs the build handoff defines. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- scaling/gold.yaml | 96 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) create mode 100644 scaling/gold.yaml diff --git a/scaling/gold.yaml b/scaling/gold.yaml new file mode 100644 index 0000000..d7e415a --- /dev/null +++ b/scaling/gold.yaml @@ -0,0 +1,96 @@ +# The scaling-demo gold set (real-only, --natural). +# +# Same three-mode shape as the core gold set (eval/gold.yaml): questions the +# archive must answer, one it must route to a private note without restating, +# and one it must refuse. Tuned so the cases live where int8 rounding bites: +# near the score floor and near each other. A refuse case comfortably below the +# floor or a route case comfortably clear proves nothing about quantization; +# the marginal cases are the whole point. 
+# +# These run against scaling/corpus/index.json (committed FP vectors), quantized +# in process. The harness (scaling/run.ts) checks each case keylessly at the +# retrieval tier and, with --full and a key, the answer-mode tier too. Source +# ids are `${type}:${slug}` for records and `note:${slug}` for private notes, +# matching the corpus files in scaling/corpus/ (see docs/scaling-demo/build-handoff.md). +# +# Queries name each Smith explicitly rather than using {{author}}, because the +# demo's whole subject is which Smith a question means. + +queries: + # ── Disambiguation: the economist ──────────────────────────────────────── + - id: econ-justice + query: What did Adam Smith argue about justice and beneficence? + expectAnswerMode: partial + expectSources: [adam-smith:theory-of-moral-sentiments-justice] + note: > + Must resolve to the economist. George's Amos and Micah expositions carry + the "justice" theme too, and his record titles contain "George Adam + Smith" so the query's "Adam Smith" phrase-matches them for the exact-match + boost. That mis-fire is the near-tie: int8 reordering is exactly what + could tip this toward the wrong Smith. + + - id: econ-labour + query: What did Adam Smith say about the division of labour? + expectAnswerMode: partial + expectSources: [adam-smith:wealth-of-nations-division-of-labour] + note: > + A cleaner economist hit — "division of labour" is unambiguously Wealth of + Nations. Anchors the disambiguation against a case with little collision. + + # ── Disambiguation: the theologian ─────────────────────────────────────── + - id: george-amos + query: What did George Adam Smith say about the prophet Amos and justice? + expectAnswerMode: partial + expectSources: [george-adam-smith:twelve-prophets-amos] + note: > + The parallel disambiguation, the other way. "justice" appears for both + Smiths, but "the prophet Amos" plus George's full name should carry this + to George. The symmetric twin of econ-justice. 
+ + - id: george-isaiah + query: Where does George Adam Smith write about faith in the book of Isaiah? + expectAnswerMode: partial + expectSources: [george-adam-smith:isaiah-prophet-of-faith] + note: > + Theme-and-subject query carried mostly by semantic similarity; "Isaiah" + and "faith" are George's, with no economist competitor. + + # ── The partial-name boost edge ────────────────────────────────────────── + - id: boost-edge-micah + query: What did Adam Smith say about the prophet Micah? + expectAnswerMode: partial + expectSources: [george-adam-smith:twelve-prophets-micah] + note: > + The boost edge case. The query names "Adam Smith" (the economist) but asks + about Micah (George's subject). EXACT_MATCH_BOOST (0.30) fires on the + partial name match against George's title, AND Micah is George's alone, so + the intended answer is George's Micah exposition despite the economist's + name in the query. Pin the observed behavior here; int8 reordering near + this collision is precisely the kind of thing that could tip it. + + # ── Route: answered by the boundary ────────────────────────────────────── + - id: route-forgiveness + query: How did George Adam Smith preach on the forgiveness of sins? + expectAnswerMode: related-material + expectSources: [note:forgiveness-of-sins] + forbidRecordCitations: true + forbidAnswerPatterns: ['https?://'] + note: > + Only the private sermon bears on this. The note must win the top slot over + the public George records on adjacent themes, and the answer must route to + the page-and-locator WITHOUT restating what the sermon says — the mode is + related-material, never a paraphrase of the private text. This tests route + SELECTION (which note wins), not A2: int8 never touches the answer model's + confabulation residue. + + # ── Refuse: nothing clears the floor ───────────────────────────────────── + - id: refuse-quantum + query: What did Adam Smith think about quantum computing? 
+ expectAnswerMode: not-found + forbidSources: + [adam-smith:theory-of-moral-sentiments-justice, adam-smith:wealth-of-nations-division-of-labour, george-adam-smith:twelve-prophets-amos, george-adam-smith:isaiah-prophet-of-faith] + note: > + A subject no Smith addressed, three centuries out of reach. The score + floor must keep every record out of the evidence and the answer must be a + plain not-found. A refuse case is only worth having if it sits where a + lowered floor or a coarser encoding could let a weak hit cross. From d32fa50c24dd10d8cfb5032cb94b855dd2357edb Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:33:06 +0000 Subject: [PATCH 03/10] =?UTF-8?q?feat(scaling):=20int8=20harness=20?= =?UTF-8?q?=E2=80=94=20quantizer,=20gate,=20keyless=20runner,=20build?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The int8 path as a thin wrapper plus a re-rank, reusing src/retrieve.ts cosine, the gold judge, and the no-leak boundary untouched. quantize.ts is the public twin of the production vector-quant.ts (per-vector symmetric, int8 and int4 from one path). harness.ts re-ranks the quantized index, reports rank correlation (necessary) and the gold verdicts including the route top-slot check (sufficient). run.ts is keyless: it reads committed FP + gold-query vectors and quantizes in process; --full adds the keyed answer pass. build.ts (keyed, for the local agent) embeds the corpus and gold queries. quantize.test.ts proves the mechanism offline on fixture geometry, including the payload: the gate certifies int8 and rejects an int4 route flip in which the note stays retrieved but loses the top slot. npm test now covers scaling/ too: 36 pass (25 core + 11 scaling), typecheck clean. 
https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- package.json | 5 +- scaling/build.ts | 151 ++++++++++++++++++++++++++++ scaling/harness.ts | 182 +++++++++++++++++++++++++++++++++ scaling/quantize.test.ts | 169 +++++++++++++++++++++++++++++++ scaling/quantize.ts | 70 +++++++++++++ scaling/query-vectors.ts | 74 ++++++++++++++ scaling/run.ts | 210 +++++++++++++++++++++++++++++++++++++++ 7 files changed, 860 insertions(+), 1 deletion(-) create mode 100644 scaling/build.ts create mode 100644 scaling/harness.ts create mode 100644 scaling/quantize.test.ts create mode 100644 scaling/quantize.ts create mode 100644 scaling/query-vectors.ts create mode 100644 scaling/run.ts diff --git a/package.json b/package.json index f2ddc08..621f279 100644 --- a/package.json +++ b/package.json @@ -12,7 +12,10 @@ "index": "node --env-file-if-exists=.env --import tsx src/cli/build-index.ts", "ask": "node --env-file-if-exists=.env --import tsx src/cli/ask.ts", "eval": "node --env-file-if-exists=.env --import tsx src/cli/eval.ts", - "test": "node --import tsx --test test/*.test.ts", + "scaling:build": "node --env-file-if-exists=.env --import tsx scaling/build.ts", + "scaling:run": "node --env-file-if-exists=.env --import tsx scaling/run.ts", + "scaling:test": "node --import tsx --test scaling/*.test.ts", + "test": "node --import tsx --test test/*.test.ts scaling/*.test.ts", "typecheck": "tsc --noEmit" }, "dependencies": { diff --git a/scaling/build.ts b/scaling/build.ts new file mode 100644 index 0000000..ba41c0b --- /dev/null +++ b/scaling/build.ts @@ -0,0 +1,151 @@ +// npm run scaling:build — embed the scaling corpus and the gold queries, then +// commit the vectors. KEYED and run once (or after corpus edits): needs network +// to the embedding API and an OPENAI_API_KEY. The session that wrote this code +// had neither; see docs/scaling-demo/build-handoff.md. +// +// Reuses the core corpus loaders, embedding, and store writers untouched. 
The +// only thing new is pointing them at scaling/corpus/ and splitting the output +// into the natural index (the headline source of truth), the synthetic spire +// (a strictly baseline-plus-delta file, unioned only under --natural+synthetic), +// and the committed gold-query vectors (what makes scaling:run keyless). + +import { createHash } from 'node:crypto'; +import { existsSync } from 'node:fs'; +import { resolve } from 'node:path'; +import OpenAI from 'openai'; + +import { buildCorpus, buildPrivateNotes, embedText, noteEmbedText } from '../src/corpus.js'; +import { batchInputs, embedBatch, truncateForEmbedding } from '../src/embedding.js'; +import { assertHomogeneousIndex, writeIndexFile } from '../src/store.js'; +import type { ArchiveConfig, IndexEntry, PrivateNote } from '../src/types.js'; +import { loadGold } from '../src/evaluate.js'; +import { config, SYNTHETIC_NOTES_DIR } from './scaling.config.js'; +import { writeQueryVectors } from './query-vectors.js'; + +const NATURAL_INDEX = resolve('scaling/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('scaling/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('scaling/gold.yaml'); +const SYNTHETIC_GOLD = resolve('scaling/gold.synthetic.yaml'); + +function contentHash(text: string): string { + return createHash('sha1').update(truncateForEmbedding(text)).digest('hex').slice(0, 16); +} + +type EmbedJob = { id: string; text: string }; + +async function embedAll(client: OpenAI, jobs: EmbedJob[]): Promise<Map<string, number[]>> { + const byId = new Map<string, number[]>(); + let done = 0; + for (const batch of batchInputs(jobs)) { + const results = await embedBatch(client, batch, { model: config.embeddingModel }); + for (const r of results) byId.set(r.id, r.vector); + done += batch.length; + console.log(` embedded ${done}/${jobs.length}`); + } + return byId; +} + +function recordEntries(config: ArchiveConfig, vectors: Map<string, number[]>): IndexEntry[] { + const entries: IndexEntry[] = []; + for (const record of buildCorpus(config)) { + const text = 
embedText(record); + const vector = vectors.get(record.id); + if (!vector) continue; + entries.push({ + model: config.embeddingModel, + dimensions: vector.length, + vector, + contentHash: contentHash(text), + sourceType: 'record', + record, + }); + } + return entries; +} + +function noteEntries(notes: PrivateNote[], vectors: Map<string, number[]>): IndexEntry[] { + const entries: IndexEntry[] = []; + for (const note of notes) { + const vector = vectors.get(note.id); + if (!vector) continue; + entries.push({ + model: config.embeddingModel, + dimensions: vector.length, + vector, + contentHash: contentHash(noteEmbedText(note)), + sourceType: 'note', + note, + }); + } + return entries; +} + +async function main(): Promise<void> { + if (!process.env.OPENAI_API_KEY) { + throw new Error('OPENAI_API_KEY is not set. scaling:build needs it to embed (see build-handoff.md).'); + } + const client = new OpenAI(); + + const records = buildCorpus(config); + const naturalNotes = buildPrivateNotes(config); + const syntheticNotes = buildPrivateNotes({ ...config, privateNotesDir: SYNTHETIC_NOTES_DIR }); + console.log( + `Corpus: ${records.length} records, ${naturalNotes.length} private notes, ` + + `${syntheticNotes.length} synthetic notes`, + ); + if (records.length === 0) { + throw new Error('No records found under scaling/corpus/public — populate it first (build-handoff.md §1).'); + } + + // Gold queries: natural always, synthetic if authored. + const gold = loadGold(NATURAL_GOLD, config.authorName); + const goldQueries = [...gold]; + if (existsSync(SYNTHETIC_GOLD)) { + goldQueries.push(...loadGold(SYNTHETIC_GOLD, config.authorName)); + } + + // One embedding pass over every source and query, distinguished by id. 
+ const sourceJobs: EmbedJob[] = [ + ...records.map((r) => ({ id: r.id, text: embedText(r) })), + ...naturalNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })), + ...syntheticNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })), + ]; + const queryJobs: EmbedJob[] = goldQueries.map((g) => ({ id: `query:${g.id}`, text: g.query })); + + console.log(`Embedding ${sourceJobs.length} sources and ${queryJobs.length} gold queries...`); + const vectors = await embedAll(client, [...sourceJobs, ...queryJobs]); + + // Natural index: records + real private notes. + const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)].sort((a, b) => + (a.sourceType === 'record' ? a.record.id : a.note.id).localeCompare( + b.sourceType === 'record' ? b.record.id : b.note.id, + ), + ); + assertHomogeneousIndex(naturalEntries); + writeIndexFile(naturalEntries, NATURAL_INDEX); + console.log(`Wrote ${naturalEntries.length} natural entries to ${NATURAL_INDEX}`); + + // Synthetic spire: written only when authored, so the headline never depends on it. + if (syntheticNotes.length > 0) { + const spireEntries = noteEntries(syntheticNotes, vectors); + assertHomogeneousIndex([...naturalEntries, ...spireEntries]); // spire must share the space + writeIndexFile(spireEntries, SYNTHETIC_INDEX); + console.log(`Wrote ${spireEntries.length} synthetic spire entries to ${SYNTHETIC_INDEX}`); + } else { + console.log('No synthetic notes authored yet; skipping the spire index.'); + } + + // Committed gold-query vectors (what makes scaling:run keyless). + const queryVectors = goldQueries + .map((g) => ({ id: g.id, vector: vectors.get(`query:${g.id}`) })) + .filter((q): q is { id: string; vector: number[] } => Array.isArray(q.vector)); + const dims = queryVectors[0]?.vector.length ?? naturalEntries[0]?.dimensions ?? 0; + writeQueryVectors(config.embeddingModel, dims, queryVectors); + console.log(`Wrote ${queryVectors.length} gold-query vectors`); + console.log('Done. 
Commit the *.json artifacts, then `npm run scaling:run`.'); +} + +main().catch((err) => { + console.error(`scaling:build failed: ${err instanceof Error ? err.message : err}`); + process.exitCode = 1; +}); diff --git a/scaling/harness.ts b/scaling/harness.ts new file mode 100644 index 0000000..842fcae --- /dev/null +++ b/scaling/harness.ts @@ -0,0 +1,182 @@ +// scaling/harness.ts — the int8 gate, as pure logic the CLI drives. +// +// Reuses the core retrieval (src/retrieve.ts) and the gold judge +// (src/evaluate.ts) untouched: the int8 path is an encode/decode wrapper plus a +// re-rank, never a second pipeline. Given full-precision index entries and a +// quantization bit width, it builds the lossy index, re-ranks each gold query +// against it, and reports the two things the paper distinguishes: rank +// correlation against the full-precision ranking (necessary), and the gold +// suite's verdicts including refuse and route (sufficient). Rank correlation +// alone is a retrieval benchmark; the suite is the actual adjudicator. + +import { cosine, retrieve } from '../src/retrieve.js'; +import type { RetrievalResult } from '../src/retrieve.js'; +import { judgeRetrieval } from '../src/evaluate.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import type { IndexEntry } from '../src/types.js'; +import { requantizeVector } from './quantize.js'; + +/** The lossy index the demo re-ranks against: every vector round-tripped + * through `bits`-bit quantization, every other field untouched. The + * full-precision index stays the source of truth. */ +export function requantizeIndex(index: readonly IndexEntry[], bits: number): IndexEntry[] { + return index.map((e) => ({ ...e, vector: requantizeVector(e.vector, bits) })); +} + +/** The single highest-scoring source across both streams, or null if nothing + * cleared the floor. 
Route selection lives here: in related-material mode the + * winner must be the private note, or the answer would resolve to a record + * instead and the verdict has flipped. */ +export function topSource( + result: RetrievalResult, +): { id: string; kind: 'record' | 'note'; score: number } | null { + let best: { id: string; kind: 'record' | 'note'; score: number } | null = null; + for (const r of result.records) { + if (!best || r.score > best.score) best = { id: r.record.id, kind: 'record', score: r.score }; + } + for (const n of result.notes) { + if (!best || n.score > best.score) best = { id: n.note.id, kind: 'note', score: n.score }; + } + return best; +} + +function averageRanks(xs: readonly number[]): number[] { + const order = xs.map((x, i) => ({ x, i })).sort((a, b) => a.x - b.x); + const ranks = new Array(xs.length); + let i = 0; + while (i < order.length) { + let j = i; + while (j + 1 < order.length && order[j + 1]!.x === order[i]!.x) j += 1; + const avg = (i + j) / 2 + 1; // 1-based average rank across the tie block i..j + for (let k = i; k <= j; k += 1) ranks[order[k]!.i] = avg; + i = j + 1; + } + return ranks; +} + +/** Spearman's rho: Pearson correlation of the rank vectors, with average ranks + * for ties. Returns 1 for degenerate inputs (length < 2 or all-tied), which is + * the harmless reading — no reordering to detect. */ +export function spearmanRho(a: readonly number[], b: readonly number[]): number { + if (a.length !== b.length) throw new Error('spearmanRho: length mismatch'); + const n = a.length; + if (n < 2) return 1; + const ra = averageRanks(a); + const rb = averageRanks(b); + let ma = 0; + let mb = 0; + for (let i = 0; i < n; i += 1) { + ma += ra[i]!; + mb += rb[i]!; + } + ma /= n; + mb /= n; + let num = 0; + let da = 0; + let db = 0; + for (let i = 0; i < n; i += 1) { + const x = ra[i]! - ma; + const y = rb[i]! 
- mb; + num += x * y; + da += x * x; + db += y * y; + } + if (da === 0 || db === 0) return 1; + return num / Math.sqrt(da * db); +} + +/** Rank correlation between the full-precision and quantized cosine orderings + * for one query, over the whole index. The boosts (src/retrieve.ts) are + * identical in both rankings, so the only thing that can reorder is the vector + * part: cosine. That is what this measures. */ +export function rankCorrelation( + index: readonly IndexEntry[], + quantIndex: readonly IndexEntry[], + queryVector: readonly number[], +): number { + const fp = index.map((e) => cosine(queryVector, e.vector)); + const q = quantIndex.map((e) => cosine(queryVector, e.vector)); + return spearmanRho(fp, q); +} + +export interface QueryGateResult { + id: string; + /** Rank correlation FP vs quantized for this query. */ + rho: number; + /** judgeRetrieval on the quantized index: expected sources in, forbidden out. */ + retrievalPass: boolean; + retrievalIssues: string[]; + /** Present only for route (related-material) cases: did the expected note win + * the top slot on the quantized index? */ + route?: { expectedNote: string; winner: string | null; won: boolean }; + /** retrievalPass AND (route ? route.won : true). */ + pass: boolean; +} + +/** Re-rank one gold query against the quantized index and judge it. */ +export function evaluateQuery( + gold: GoldQuery, + index: readonly IndexEntry[], + quantIndex: readonly IndexEntry[], + queryVector: readonly number[], +): QueryGateResult { + const hits = retrieve(queryVector, gold.query, quantIndex); + const judged = judgeRetrieval(gold, hits); + const rho = rankCorrelation(index, quantIndex, queryVector); + + let route: QueryGateResult['route']; + if (gold.expectAnswerMode === 'related-material' && gold.expectSources && gold.expectSources[0]) { + const expectedNote = gold.expectSources[0]; + const winner = topSource(hits); + route = { expectedNote, winner: winner?.id ?? 
null, won: winner?.id === expectedNote }; + } + + const pass = judged.pass && (route ? route.won : true); + return { + id: gold.id, + rho, + retrievalPass: judged.pass, + retrievalIssues: judged.issues, + ...(route ? { route } : {}), + pass, + }; +} + +export interface GateReport { + bits: number; + total: number; + passed: number; + failed: number; + meanRho: number; + minRho: number; + results: QueryGateResult[]; +} + +/** Run the whole gold suite against the index at `bits` precision. */ +export function runGate( + gold: readonly GoldQuery[], + index: readonly IndexEntry[], + queryVectorById: ReadonlyMap<string, number[]>, + bits: number, +): GateReport { + const quantIndex = requantizeIndex(index, bits); + const results: QueryGateResult[] = []; + for (const g of gold) { + const qv = queryVectorById.get(g.id); + if (!qv) throw new Error(`no query vector for gold id '${g.id}' (rebuild scaling:build?)`); + results.push(evaluateQuery(g, index, quantIndex, qv)); + } + const passed = results.filter((r) => r.pass).length; + const rhos = results.map((r) => r.rho); + const meanRho = rhos.length ? rhos.reduce((s, x) => s + x, 0) / rhos.length : 1; + const minRho = rhos.length ? Math.min(...rhos) : 1; + return { + bits, + total: results.length, + passed, + failed: results.length - passed, + meanRho, + minRho, + results, + }; +} diff --git a/scaling/quantize.test.ts b/scaling/quantize.test.ts new file mode 100644 index 0000000..fa37821 --- /dev/null +++ b/scaling/quantize.test.ts @@ -0,0 +1,169 @@ +// Offline, deterministic tests for the int8 demo's mechanism. No corpus, no +// key: the quantizer and the gate are exercised on fixture vectors, so the +// whole int8 path — including the int4 route flip the demo is built to catch — +// is provable here. The real corpus instantiates this same mechanism; these +// tests prove the mechanism itself. 
+ +import assert from 'node:assert/strict'; +import test from 'node:test'; + +import { cosine } from '../src/retrieve.js'; +import type { ArchiveRecord, IndexEntry, PrivateNote } from '../src/types.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import { dequantize, levelFor, quantize, requantizeVector } from './quantize.js'; +import { evaluateQuery, rankCorrelation, requantizeIndex, runGate, spearmanRho, topSource } from './harness.js'; + +// A near-tie found by deterministic search (scaling: seed 421, 24-dim): the +// query Q ranks note VN just above record VR at full precision; int8 preserves +// that order, int4 reorders it. This is the route flip in miniature. +const Q = [-0.201545, -0.070296, -0.836567, -0.496486, 0.932744, -0.183835, 0.620633, -0.319135, 0.353699, 0.535227, 0.630447, -0.913022, 0.74482, 0.20067, -0.735437, 0.48168, -0.628687, 0.422013, -0.824056, 0.95873, -0.055049, -0.014708, 0.136552, -0.126328]; +const VN = [-0.209326, -0.367113, -0.781625, -0.22665, 0.421356, -0.779461, 0.686374, -0.431379, 0.807734, 0.556436, 0.078187, -1.104108, 0.064971, -0.250693, -0.829483, -0.06284, -0.225568, 0.419642, -0.941748, 0.05885, -0.260352, 0.396049, -0.299235, 0.33248]; +const VR = [0.153577, 0.081729, -1.05474, -0.793276, 0.049555, -0.0844, 0.769011, 0.098334, 0.570278, -0.166597, 0.599978, -1.115543, 0.517046, -0.496545, 0.207507, 0.785012, -0.899066, 0.109867, -0.881006, 0.360131, 0.467909, 0.04772, 0.550953, 0.232781]; + +function makeRecord(id: string, extra: Partial = {}): ArchiveRecord { + return { + id, + type: 'work', + slug: id.split(':')[1] ?? id, + title: extra.title ?? id, + url: `https://smith-collection.example/${id}/`, + summary: extra.summary ?? '', + body: extra.body ?? '', + themes: extra.themes ?? 
[], + }; +} + +function makeNote(id: string): PrivateNote { + return { id, label: id, url: 'https://en.wikipedia.org/wiki/George_Adam_Smith', locator: 'sermon', text: 'private' }; +} + +function recordEntry(id: string, vector: number[], extra: Partial<ArchiveRecord> = {}): IndexEntry { + return { model: 'text-embedding-3-large', dimensions: vector.length, vector, contentHash: 'h', sourceType: 'record', record: makeRecord(id, extra) }; +} + +function noteEntry(id: string, vector: number[]): IndexEntry { + return { model: 'text-embedding-3-large', dimensions: vector.length, vector, contentHash: 'h', sourceType: 'note', note: makeNote(id) }; +} + +// Two fillers near-orthogonal to Q, so they stay below the floor in every +// precision and never enter the route contest. +const filler1 = Array.from({ length: 24 }, (_, i) => (i === 1 ? 1 : 0)); +const filler2 = Array.from({ length: 24 }, (_, i) => (i === 21 ? 1 : 0)); + +test('quantize: level widths and rejection of bad bit counts', () => { + assert.equal(levelFor(8), 127); + assert.equal(levelFor(4), 7); + assert.throws(() => levelFor(1)); + assert.throws(() => levelFor(9)); + assert.throws(() => levelFor(3.5)); +}); + +test('quantize: round-trips within the per-vector scale, zero vector is safe', () => { + const v = [0.5, -0.25, 0.9, -0.9, 0.1]; + const q = quantize(v, 8); + const back = dequantize(q); + for (let i = 0; i < v.length; i += 1) { + assert.ok(Math.abs(back[i]! - v[i]!) <= q.scale, `component ${i} within one scale step`); + } + // scale derives from the max magnitude (0.9), one signed byte (127 levels). + assert.ok(Math.abs(q.scale - 0.9 / 127) < 1e-9); + + const zero = quantize([0, 0, 0], 8); + assert.equal(zero.scale, 1); + assert.deepEqual([...zero.codes], [0, 0, 0]); +}); + +test('quantize: int4 is coarser than int8 (larger reconstruction error)', () => { + const v = Q; + const err = (bits: number) => v.reduce((s, x, i) => s + Math.abs(requantizeVector(v, bits)[i]! 
- x), 0); + assert.ok(err(4) > err(8), 'int4 reconstruction error exceeds int8'); +}); + +test('quantize: per-vector scale cancels under cosine (exact, by algebra)', () => { + // Scaling a vector by any positive constant leaves cosine unchanged, which is + // why the per-vector scale need not be restored to rank. The demo leans on this. + const scaled = VN.map((x) => x * 7.5); + assert.ok(Math.abs(cosine(Q, VN) - cosine(Q, scaled)) < 1e-12); +}); + +test('harness: spearmanRho on known orderings, with ties', () => { + assert.equal(spearmanRho([1, 2, 3, 4], [10, 20, 30, 40]), 1); + assert.equal(spearmanRho([1, 2, 3, 4], [40, 30, 20, 10]), -1); + assert.ok(Math.abs(spearmanRho([1, 2, 2, 3], [1, 2, 2, 3]) - 1) < 1e-12); // ties -> average ranks + assert.equal(spearmanRho([5], [9]), 1); // degenerate length < 2 +}); + +test('harness: requantizeIndex keeps every field but the vector', () => { + const index = [recordEntry('work:a', VR), noteEntry('note:b', VN)]; + const q = requantizeIndex(index, 8); + assert.equal(q.length, 2); + assert.equal(q[0]!.sourceType, 'record'); + assert.equal(q[0]!.dimensions, 24); + assert.notDeepEqual(q[0]!.vector, index[0]!.vector); // lossy + assert.equal(q[0]!.contentHash, index[0]!.contentHash); // untouched +}); + +test('harness: topSource picks the highest score across both streams', () => { + const result = { + records: [{ record: makeRecord('work:r'), score: 0.71, semantic: 0.71 }], + notes: [{ note: makeNote('note:n'), score: 0.73, semantic: 0.73 }], + }; + assert.equal(topSource(result)?.id, 'note:n'); + assert.equal(topSource(result)?.kind, 'note'); + assert.equal(topSource({ records: [], notes: [] }), null); +}); + +test('harness: int8 preserves the FP ranking better than int4 (rank correlation)', () => { + const index = [noteEntry('note:n', VN), recordEntry('work:r', VR), recordEntry('work:f1', filler1), recordEntry('work:f2', filler2)]; + const rho8 = rankCorrelation(index, requantizeIndex(index, 8), Q); + const rho4 = 
rankCorrelation(index, requantizeIndex(index, 4), Q); + assert.ok(rho8 >= rho4, `int8 rho (${rho8}) >= int4 rho (${rho4})`); + assert.ok(rho8 >= rho4 && rho8 > 0.9, 'int8 holds the ordering tightly'); +}); + +test('the payload: the gate certifies int8 and rejects int4 on the route case', () => { + // The note (VN) must win the top slot; that is the route. A query with no + // title/theme overlap, so the contest is pure cosine, not boosts. + const index: IndexEntry[] = [ + noteEntry('note:syn-amos-justice-margin', VN), + recordEntry('george-adam-smith:twelve-prophets-amos', VR, { title: 'unrelated phrasing' }), + recordEntry('work:f1', filler1), + ]; + const gold: GoldQuery = { + id: 'route-margin', + query: 'zzz qqq no token overlap with any title or theme', + expectAnswerMode: 'related-material', + expectSources: ['note:syn-amos-justice-margin'], + }; + const qById = new Map([[gold.id, Q]]); + + // int8: the note wins the top slot, the gate passes. + const int8 = runGate([gold], index, qById, 8); + assert.equal(int8.passed, 1, 'int8 certifies the route'); + assert.equal(int8.results[0]!.route?.won, true); + assert.ok(int8.results[0]!.rho >= 0.9); + + // int4: the record overtakes the note for the top slot. The note is still + // retrieved (so judgeRetrieval alone would miss it), but the route flipped, + // and the gate catches it. 
+ const int4 = runGate([gold], index, qById, 4); + assert.equal(int4.failed, 1, 'int4 is rejected'); + const r = int4.results[0]!; + assert.equal(r.retrievalPass, true, 'the note is still in the candidate set'); + assert.equal(r.route?.won, false, 'but it lost the top slot'); + assert.equal(r.route?.winner, 'george-adam-smith:twelve-prophets-amos'); +}); + +test('the payload, directly: cosine ordering flips between int8 and int4', () => { + const c = (v: number[], bits: number) => cosine(Q, requantizeVector(v, bits)); + assert.ok(cosine(Q, VN) > cosine(Q, VR), 'FP: note outranks record'); + assert.ok(c(VN, 8) > c(VR, 8), 'int8: note still outranks record'); + assert.ok(c(VR, 4) > c(VN, 4), 'int4: record overtakes the note (the flip)'); +}); + +test('evaluateQuery: a refuse case with nothing above the floor stays not-found', () => { + const index = [recordEntry('work:f1', filler1), recordEntry('work:f2', filler2)]; + const gold: GoldQuery = { id: 'refuse', query: 'zzz qqq', expectAnswerMode: 'not-found', forbidSources: ['work:f1', 'work:f2'] }; + const res = evaluateQuery(gold, index, requantizeIndex(index, 8), Q); + assert.equal(res.pass, true, 'fillers stay below the floor, so nothing is forbidden-surfaced'); +}); diff --git a/scaling/quantize.ts b/scaling/quantize.ts new file mode 100644 index 0000000..cc7a04c --- /dev/null +++ b/scaling/quantize.ts @@ -0,0 +1,70 @@ +// scaling/quantize.ts — scalar quantization for the int8 demo. +// +// The public, runnable twin of the production site adapter's vector-quant.ts +// (named in docs/production-scaling.md §2; that adapter is not a public repo). +// Same scheme: per-vector symmetric scalar quantization. The full-precision +// vectors stay the source of truth (scaling/corpus/index.json); the demo +// quantizes them in process, re-ranks, and lets the gold suite judge the result. 
+// +// Why it is admissible, in two parts of different kinds (the paper's §6 split): +// cosine (src/retrieve.ts) recomputes norms per call, so a positive per-vector +// scale cancels from the score entirely; the ranking is invariant to it as a +// matter of algebra (exact). Integer rounding perturbs direction and can +// reorder near-ties, so its harmlessness is not proven but measured against the +// gold suite. int8 holds on the real corpus; int4 is the scalpel that makes the +// gate say no. + +export interface QuantizedVector { + /** Signed integer codes, one per dimension, each in [-level, level]. */ + codes: Int8Array; + /** Dequantization scale: vector[i] ≈ codes[i] * scale. */ + scale: number; +} + +/** The signed range for a bit width: int8 -> 127, int4 -> 7. One function + * serves both, so the headline (int8) and the deliberate failure (int4) run + * the identical path at different precisions. */ +export function levelFor(bits: number): number { + if (!Number.isInteger(bits) || bits < 2 || bits > 8) { + throw new Error(`quantize: unsupported bit width ${bits} (expected 2..8)`); + } + return (1 << (bits - 1)) - 1; // 2^(bits-1) - 1 +} + +/** Per-vector symmetric quantization to `bits` signed bits. scale carries the + * per-vector max magnitude so the reader can rebuild the approximate float. An + * all-zero vector (no signal) quantizes to all-zero with scale 1; it never + * divides by zero. */ +export function quantize(vector: readonly number[], bits = 8): QuantizedVector { + const level = levelFor(bits); + const n = vector.length; + const codes = new Int8Array(n); + let max = 0; + for (let i = 0; i < n; i += 1) { + const a = Math.abs(vector[i]!); + if (a > max) max = a; + } + if (max === 0) return { codes, scale: 1 }; + const inv = level / max; + for (let i = 0; i < n; i += 1) { + let q = Math.round(vector[i]! 
* inv); + if (q > level) q = level; + else if (q < -level) q = -level; + codes[i] = q; + } + return { codes, scale: max / level }; +} + +/** Reconstruct the approximate float vector from codes + scale. */ +export function dequantize(q: QuantizedVector): number[] { + const { codes, scale } = q; + const out = new Array(codes.length); + for (let i = 0; i < codes.length; i += 1) out[i] = codes[i]! * scale; + return out; +} + +/** Round-trip a vector through `bits`-bit quantization: the lossy vector the + * demo re-ranks against. quantize then dequantize, nothing else. */ +export function requantizeVector(vector: readonly number[], bits = 8): number[] { + return dequantize(quantize(vector, bits)); +} diff --git a/scaling/query-vectors.ts b/scaling/query-vectors.ts new file mode 100644 index 0000000..f1ead57 --- /dev/null +++ b/scaling/query-vectors.ts @@ -0,0 +1,74 @@ +// scaling/query-vectors.ts — the committed gold-query embeddings. +// +// The core eval CLI (src/cli/eval.ts) embeds every gold query at run time, so +// it always needs a key. The demo's headline must reproduce WITHOUT one, so the +// gold-query vectors are precomputed by scaling:build and committed here beside +// the index. The runner reads them instead of calling the embedding API; a key +// is only ever needed to regenerate them or to run the --full answer pass. +// +// Same homogeneity discipline as the index (src/store.ts): a query embedded in +// a different model or width than the index is a meaningless cosine, so the +// file carries its (model, dimensions) and the runner checks them. 
+ +import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs'; +import { dirname, resolve } from 'node:path'; + +export const QUERY_VECTORS_PATH = resolve('scaling/corpus/query-vectors.json'); +export const QUERY_VECTORS_VERSION = 1; + +export interface QueryVectorsFile { + version: number; + model: string; + dimensions: number; + queries: { id: string; vector: number[] }[]; +} + +export interface LoadedQueryVectors { + model: string; + dimensions: number; + byId: Map<string, number[]>; +} + +const REBUILD = 'Run `npm run scaling:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).'; + +/** Read the committed query vectors, or null if not built yet. Throws on a + * present-but-malformed file so a corrupt artifact fails loudly with a remedy. */ +export function readQueryVectors(path: string = QUERY_VECTORS_PATH): LoadedQueryVectors | null { + if (!existsSync(path)) return null; + let parsed: unknown; + try { + parsed = JSON.parse(readFileSync(path, 'utf8')); + } catch { + throw new Error(`query vectors at ${path} are not valid JSON. ${REBUILD}`); + } + const file = parsed as Partial<QueryVectorsFile>; + if ( + typeof parsed !== 'object' || + parsed === null || + file.version !== QUERY_VECTORS_VERSION || + typeof file.model !== 'string' || + typeof file.dimensions !== 'number' || + !Array.isArray(file.queries) + ) { + throw new Error(`query vectors at ${path} are not schema version ${QUERY_VECTORS_VERSION}. ${REBUILD}`); + } + const byId = new Map<string, number[]>(); + for (const q of file.queries) { + if (typeof q?.id !== 'string' || !Array.isArray(q.vector)) { + throw new Error(`query vectors at ${path} have a malformed entry. 
${REBUILD}`); + } + byId.set(q.id, q.vector); + } + return { model: file.model, dimensions: file.dimensions, byId }; +} + +export function writeQueryVectors( + model: string, + dimensions: number, + queries: { id: string; vector: number[] }[], + path: string = QUERY_VECTORS_PATH, +): void { + mkdirSync(dirname(path), { recursive: true }); + const file: QueryVectorsFile = { version: QUERY_VECTORS_VERSION, model, dimensions, queries }; + writeFileSync(path, `${JSON.stringify(file)}\n`, 'utf8'); +} diff --git a/scaling/run.ts b/scaling/run.ts new file mode 100644 index 0000000..08c8030 --- /dev/null +++ b/scaling/run.ts @@ -0,0 +1,210 @@ +// npm run scaling:run — quantize the committed index in process, re-rank, and +// run the full gold suite against the quantized index. +// +// --natural (default) real corpus only; owns the headline numbers. +// --natural+synthetic adds the quarantined synthetic spire + its gold. +// --bits quantization width (default 8; 4 is the int4 scalpel). +// --full also run the answer-mode pass (needs OPENAI_API_KEY). +// +// The headline run is keyless: it reads committed FP vectors and committed +// gold-query vectors, quantizes in process, and judges with the reused gold +// logic. --full adds the answer model, which is the only part that needs a key. +// See scaling/README.md and docs/scaling-demo/build-handoff.md. 
+ +import { resolve } from 'node:path'; + +import { loadGold } from '../src/evaluate.js'; +import type { GoldQuery } from '../src/evaluate.js'; +import { assertHomogeneousIndex, readIndexFile } from '../src/store.js'; +import type { IndexEntry } from '../src/types.js'; +import { runGate } from './harness.js'; +import { readQueryVectors } from './query-vectors.js'; + +const NATURAL_INDEX = resolve('scaling/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('scaling/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('scaling/gold.yaml'); +const SYNTHETIC_GOLD = resolve('scaling/gold.synthetic.yaml'); + +interface RunArgs { + synthetic: boolean; + bits: number; + full: boolean; +} + +function parseArgs(argv: string[]): RunArgs { + const args: RunArgs = { synthetic: false, bits: 8, full: false }; + for (let i = 0; i < argv.length; i += 1) { + const arg = argv[i]; + switch (arg) { + case '--natural': + args.synthetic = false; + break; + case '--natural+synthetic': + case '--synthetic': + args.synthetic = true; + break; + case '--full': + args.full = true; + break; + case '--bits': { + const value = argv[++i]; + if (!value) throw new Error('--bits requires a number (e.g. 8 or 4)'); + args.bits = Number(value); + if (!Number.isInteger(args.bits)) throw new Error(`--bits must be an integer, got '${value}'`); + break; + } + case '--help': + case '-h': + console.log( + 'scaling:run [--natural | --natural+synthetic] [--bits ] [--full]\n' + + ' --natural real corpus only (default); owns the headline numbers\n' + + ' --natural+synthetic add the quarantined synthetic spire + its gold\n' + + ' --bits quantization width (default 8; 4 is the int4 scalpel)\n' + + ' --full also run the answer-mode pass (needs OPENAI_API_KEY)', + ); + process.exit(0); + break; + default: + throw new Error(`unknown argument '${arg ?? 
''}'`); + } + } + return args; +} + +function loadIndex(synthetic: boolean): IndexEntry[] { + const natural = readIndexFile(NATURAL_INDEX); + if (natural.length === 0) { + throw new Error( + `no committed vectors at ${NATURAL_INDEX}. ` + + 'Run `npm run scaling:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).', + ); + } + if (!synthetic) { + assertHomogeneousIndex(natural); + return natural; + } + const spire = readIndexFile(SYNTHETIC_INDEX); + if (spire.length === 0) { + throw new Error( + `--natural+synthetic needs the spire at ${SYNTHETIC_INDEX}, which is not built yet ` + + '(author the synthetic notes, then `npm run scaling:build`).', + ); + } + const union = [...natural, ...spire]; + // The spire is strictly baseline-plus-delta: same model, same dimensionality. + assertHomogeneousIndex(union); + return union; +} + +function loadGoldSet(synthetic: boolean, author: string): GoldQuery[] { + const gold = loadGold(NATURAL_GOLD, author); + if (!synthetic) return gold; + const expanded = loadGold(SYNTHETIC_GOLD, author); + return [...gold, ...expanded]; +} + +async function main(): Promise<void> { + const args = parseArgs(process.argv.slice(2)); + const { config } = await import('./scaling.config.js'); + + const index = loadIndex(args.synthetic); + const gold = loadGoldSet(args.synthetic, config.authorName); + + const qv = readQueryVectors(); + if (!qv) { + throw new Error( + 'no committed query vectors. Run `npm run scaling:build` with an OPENAI_API_KEY ' + + '(see docs/scaling-demo/build-handoff.md).', + ); + } + const spec = index[0]!; + if (qv.model !== spec.model || qv.dimensions !== spec.dimensions) { + throw new Error( + `query vectors (${qv.model}/${qv.dimensions}) do not match the index ` + + `(${spec.model}/${spec.dimensions}); rebuild both with scaling:build.`, + ); + } + + const label = args.synthetic ? 
'--natural+synthetic' : '--natural'; + console.log(`scaling:run ${label} int${args.bits} ${gold.length} gold queries ${index.length} index entries`); + if (args.synthetic) { + console.log(' (headline numbers come from the --natural run; the spire is broken out below)'); + } + + const report = runGate(gold, index, qv.byId, args.bits); + + for (const r of report.results) { + const status = r.pass ? 'ok ' : 'FAIL'; + const routeBit = r.route ? ` route:${r.route.won ? 'won' : `LOST->${r.route.winner ?? 'none'}`}` : ''; + console.log(` ${status} ${r.id.padEnd(18)} rho=${r.rho.toFixed(4)}${routeBit}`); + if (!r.pass) for (const issue of r.retrievalIssues) console.log(` - ${issue}`); + if (r.route && !r.route.won) { + console.log(` - route flipped: expected ${r.route.expectedNote} to win the top slot`); + } + } + + console.log( + `\nint${args.bits}: ${report.passed}/${report.total} gold passed; ` + + `rank correlation mean ${report.meanRho.toFixed(4)}, min ${report.minRho.toFixed(4)}`, + ); + + if (args.full) { + await runAnswerPass(gold, index, qv.byId, config); + } else { + console.log('(retrieval + route tier only; add --full to run the answer-mode pass with a key)'); + } + + if (report.failed > 0) process.exitCode = 1; +} + +/** The keyed bonus: run the answer model and check the declared mode. Exercises + * route SELECTION through the reused no-leak boundary; it does not touch A2 + * (the answer model's confabulation residue), which the encoding never moves. 
*/ +async function runAnswerPass( + gold: readonly GoldQuery[], + index: readonly IndexEntry[], + queryVectorById: ReadonlyMap, + config: import('../src/types.js').ArchiveConfig, +): Promise { + if (!process.env.OPENAI_API_KEY) { + throw new Error('--full runs the answer model, which needs OPENAI_API_KEY.'); + } + const [{ default: OpenAI }, { retrieve }, { assembleEvidence }, { answerQuestion }, { judgeAnswer }] = + await Promise.all([ + import('openai'), + import('../src/retrieve.js'), + import('../src/no-leak.js'), + import('../src/answer.js'), + import('../src/evaluate.js'), + ]); + const client = new OpenAI(); + console.log('\n--full answer-mode pass (keyed):'); + let answerFails = 0; + for (const g of gold) { + const qv = queryVectorById.get(g.id); + if (!qv) continue; + const hits = retrieve(qv, g.query, index); + const evidence = assembleEvidence( + hits.records.map((h) => h.record), + hits.notes.map((h) => h.note), + ); + try { + const answer = await answerQuestion(client, g.query, evidence, config); + const judged = judgeAnswer(g, answer); + console.log(` ${judged.pass ? 'ok ' : 'FAIL'} ${g.id.padEnd(18)} mode=${answer.mode}`); + if (!judged.pass) { + answerFails += 1; + for (const issue of judged.issues) console.log(` - ${issue}`); + } + } catch (err) { + answerFails += 1; + console.log(` FAIL ${g.id.padEnd(18)} answer engine threw: ${err instanceof Error ? err.message : err}`); + } + } + if (answerFails > 0) process.exitCode = 1; +} + +main().catch((err) => { + console.error(`scaling:run failed: ${err instanceof Error ? 
err.message : err}`); + process.exitCode = 1; +}); From 6ca88babf9b3dce244a94ac0d96793d45ee7a896 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:35:31 +0000 Subject: [PATCH 04/10] =?UTF-8?q?feat(scaling):=20the=20deliberate=20failu?= =?UTF-8?q?re=20=E2=80=94=20quarantined=20spire=20+=20route-flip=20gold?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The payload, the part the demo rests on. A fabricated George-private note, quarantined in scaling/corpus/synthetic/ (the location is the flag) and marked synthetic:true with the gold case, margin, and mode it targets, plus the expanded gold (gold.synthetic.yaml) loaded only under --natural+synthetic. The note is built to sit at the floor just above the public Amos exposition, so int8 holds the route while int4 flips the top slot to the public record and the gold suite catches it. The mechanism is already proven offline in quantize.test.ts; this is its real-corpus instance, calibrated against real vectors by the build (handoff §4). Headline numbers stay on --natural; the spire is broken out. 
https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- scaling/corpus/synthetic/.gitkeep | 0 .../synthetic/syn-amos-justice-margin.md | 18 +++++++++++++ scaling/gold.synthetic.yaml | 27 +++++++++++++++++++ 3 files changed, 45 insertions(+) delete mode 100644 scaling/corpus/synthetic/.gitkeep create mode 100644 scaling/corpus/synthetic/syn-amos-justice-margin.md create mode 100644 scaling/gold.synthetic.yaml diff --git a/scaling/corpus/synthetic/.gitkeep b/scaling/corpus/synthetic/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/scaling/corpus/synthetic/syn-amos-justice-margin.md b/scaling/corpus/synthetic/syn-amos-justice-margin.md new file mode 100644 index 0000000..1c75914 --- /dev/null +++ b/scaling/corpus/synthetic/syn-amos-justice-margin.md @@ -0,0 +1,18 @@ +--- +title: "George Adam Smith — private note on Amos and the justice of God" +about: https://en.wikipedia.org/wiki/George_Adam_Smith +locator: "study marginalia, Amos" +synthetic: true +targets: "syn-route-margin / route case at the floor / related-material — a tightened encoding (int4 or a lowered floor) should flip this to the public Amos exposition, caught by the gold suite" +--- + +Fabricated for the int8 demo, not George Adam Smith's words (see the frontmatter). + +On Amos the herdsman of Tekoa: justice is not a ledger the strong keep against +the weak, but the weight of heaven set on the side of the wronged. The plumb-line +is held to the wall of the nation and the wall is found to lean. Where the courts +sell the righteous for silver and the needy for a pair of shoes, the worship that +continues above that wrong is itself the offence; the feast is noise until the +judgement runs down like waters. The mercy is in the warning: that the line is +shown at all, and shown in time, is the patience of God toward a people still +able to return. 
diff --git a/scaling/gold.synthetic.yaml b/scaling/gold.synthetic.yaml new file mode 100644 index 0000000..20f633e --- /dev/null +++ b/scaling/gold.synthetic.yaml @@ -0,0 +1,27 @@ +# Expanded gold for --natural+synthetic. Loaded ONLY alongside the quarantined +# synthetic spire (scaling/corpus/synthetic/). Because the spire is fabricated, +# these cases never touch the headline (--natural) numbers; the runner reports +# the spire's effect on its own line, broken out, so a reader can tell whether +# int8 held because the encoding is sound or because notes were hand-placed. +# +# Skewed toward the route-flip the demo is built to catch, never an extra +# must-answer win. The deliberate failure is the payload: a caught failure is +# worth more than any clean pass. + +queries: + - id: syn-route-margin + query: What did George Adam Smith note privately about Amos and the justice of God? + expectAnswerMode: related-material + expectSources: [note:syn-amos-justice-margin] + forbidRecordCitations: true + forbidAnswerPatterns: ['https?://'] + note: > + THE DELIBERATE FAILURE. The synthetic note is tuned to sit at the floor, + just above the public Amos exposition (george-adam-smith:twelve-prophets-amos) + on the same theme. int8 must hold the route — the private note wins the top + slot and the answer routes without restating. A tightened encoding (run with + --bits 4) or a lowered floor flips the top slot to the public record, or + drops the note below the floor entirely; either way the gold suite catches + it. If it does NOT fire, the near-tie is too loose: tighten the margin in the + synthetic note (build-handoff §4), do NOT add corpus. This caught failure, + not any clean pass, is the result the demo rests on. 
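
The int8-holds, int4-flips mechanism that the note above relies on can be sketched in a few lines. This is a minimal illustration, assuming a symmetric per-vector max-abs quantizer; the helper names (`cosineSim`, `quantizeRoundTrip`, `topSlot`) and the two-dimensional vectors are hypothetical and are not the module's `quantize.ts` or its real embeddings. Calibration against real vectors still happens in the build (handoff §4).

```typescript
// Hypothetical sketch: a near-tie that survives int8 rounding but flips at int4.
type Vec = readonly number[];

interface Entry {
  id: string;
  vector: Vec;
}

// Plain cosine similarity (a local stand-in, not the core's `cosine`).
function cosineSim(a: Vec, b: Vec): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return dot / Math.sqrt(na * nb);
}

// Assumed scheme: symmetric quantization with a per-vector max-abs scale.
// Round each component to an integer code in [-qmax, qmax], then restore it.
function quantizeRoundTrip(v: Vec, bits: number): number[] {
  const qmax = Math.pow(2, bits - 1) - 1; // 127 for int8, 7 for int4
  const maxAbs = v.reduce((m, x) => Math.max(m, Math.abs(x)), 0);
  if (maxAbs === 0) return v.map(() => 0);
  const scale = maxAbs / qmax;
  return v.map((x) => Math.round((x * qmax) / maxAbs) * scale);
}

// Id of the best-scoring entry; `bits` undefined means full precision.
function topSlot(query: Vec, entries: readonly Entry[], bits?: number): string {
  let bestId = '';
  let bestScore = -Infinity;
  for (const e of entries) {
    const v = bits === undefined ? e.vector : quantizeRoundTrip(e.vector, bits);
    const score = cosineSim(query, v);
    if (score > bestScore) {
      bestScore = score;
      bestId = e.id;
    }
  }
  return bestId;
}

const query: Vec = [1, 0.4];
const entries: Entry[] = [
  { id: 'note', vector: [1, 0.3543] }, // slightly closer to the query
  { id: 'record', vector: [1, 0.4567] }, // the near-tie competitor
];

// Prints: note note record
console.log(topSlot(query, entries), topSlot(query, entries, 8), topSlot(query, entries, 4));
```

At int4 the note's second component rounds down while the record's rounds toward the query direction, so the top slot changes hands; at int8 the rounding error is too small to reorder the pair. A presence-in-top-K check would pass in both cases, which is why a top-slot check is the one that catches the flip.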
From 5c9e323b03ed3f51276c10ef1fcde7d83a990e2d Mon Sep 17 00:00:00 2001
From: Claude
Date: Tue, 16 Jun 2026 04:39:02 +0000
Subject: [PATCH 05/10] docs(scaling): README leading with the caught failure + filled delta log
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scaling/README.md in the papers' sparse register (no em-dashes), leading with the deliberate failure: the same gold suite that owns grounding and refusal rejecting a cheaper encoding. States the three non-negotiable disclosures, the exact-vs-measured admissibility split, rank correlation as necessary-not-sufficient, the commit-vectors-only-because-public-domain caveat with the inversion warning, and the reuse boundary (retrieval and no-leak untouched). Cross-links production-scaling.md §2 as the prose companion.

Fills the delta log with what the build settled vs what is pending the keyed build run, the divergences found (keyless headline needs committed gold-query vectors; demo-canonical record URLs; the added route-selection gate; the GitHub-only egress that deferred the corpus + vectors), and the prepared NEXT-STEPS C-intro/C1 reconciliation to apply once scaling:run confirms the headline.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- docs/scaling-demo/scaling-demo-delta-log.md | 70 +++++++++++-- scaling/README.md | 109 ++++++++++++++++++++ scaling/corpus/README.md | 4 +- 3 files changed, 170 insertions(+), 13 deletions(-) create mode 100644 scaling/README.md diff --git a/docs/scaling-demo/scaling-demo-delta-log.md b/docs/scaling-demo/scaling-demo-delta-log.md index 53dbffb..144afef 100644 --- a/docs/scaling-demo/scaling-demo-delta-log.md +++ b/docs/scaling-demo/scaling-demo-delta-log.md @@ -16,23 +16,36 @@ For each assumption the spec makes, record what the build actually did and what ## Pre-seeded rows (the deltas most likely to surface) +**Build context (read before the rows).** The session that built `scaling/` +had egress to GitHub only: `api.openai.com`, Gutenberg, and archive.org all +returned `host_not_allowed`, and no `OPENAI_API_KEY` was set. So the code, the +gold set, the provenance manifest, and the deterministic harness tests are +committed and green, but the real text bodies and the committed vectors are +**pending a build run** (a local agent with network + key; see +`build-handoff.md`). Rows about what the *real run* produced are marked PENDING; +rows about the *mechanism and structure* are settled now. 
+ | # | Spec assumption | What the build actually did | Touches | Downstream action | |---|---|---|---|---| -| 1 | Score floor as shipped (`SCORE_FLOOR`) puts marginal cases where int8 can flip them | _fill: kept / tightened to \_ | `spec`, maybe `NEXT-STEPS` (B1) | If moved, document the new floor and that it's model-dependent (B1) | -| 2 | int8 holds the full gold suite on the real corpus (headline pass) | _fill: held / didn't_ | `nothing` if held; investigate if not | Headline number for §6/C1 | -| 3 | A tightened encoding (int4 / lowered floor) flips a **route** case and the gold suite catches it — the deliberate failure | _fill: fired at \ / did NOT fire_ | `spec` if settings changed | **If it doesn't fire, near-ties are too loose — tighten margin, do NOT add corpus.** This is the result the demo rests on | -| 4 | George sermons index as short **whole** units without diluting their topical center (so "indexes documents whole" stays true) | _fill: whole units worked / had to split a sermon_ | **`paper §5–§6`** if split | **Highest stakes.** If any unit is split into windows, the demo now chunks; "in-memory and unchunked" breaks and the §5 reconciliation grows. Watch sermon length specifically | -| 5 | `EXACT_MATCH_BOOST = 0.30` fires (or not) on "Adam Smith" vs "George Adam Smith" partial match as the gold case predicts | _fill: actual behavior_ | `spec` | Pin the observed behavior in the gold case | -| 6 | Both-Smith shared theme (e.g. 
"justice") mis-fires the theme boost, and the gold suite exposes it | _fill: observed / didn't occur_ | `spec` | Keep as exposed near-tie; do not curate themes to suppress it | -| 7 | FP vectors commit cleanly and the default run reproduces with no key | _fill: yes / issue_ | `spec` | Confirms the no-key headline claim | -| 8 | Demo is a thin module: reuses `src/retrieve.ts` + `src/no-leak.ts` untouched, no second pipeline | _fill: stayed thin / needed more_ | `spec`; **halt if it needs its own pipeline** | If it can't stay thin, propose a sibling repo per the budget rule — do not bloat | +| 1 | Score floor as shipped (`SCORE_FLOOR`) puts marginal cases where int8 can flip them | Confirmed `SCORE_FLOOR = 0.2` in `src/retrieve.ts` (model-dependent, B1); gold cases authored near it. Real-run margin PENDING build | `spec`, maybe `NEXT-STEPS` (B1) | If the build moves it, document the new floor and that it's model-dependent | +| 2 | int8 holds the full gold suite on the real corpus (headline pass) | PENDING build. Mechanism proven offline (`quantize.test.ts`: int8 certifies the route case) | `nothing` if held; investigate if not | Headline number for §6/C1, recorded at build | +| 3 | A tightened encoding (int4 / lowered floor) flips a **route** case and the gold suite catches it | Mechanism PROVEN offline (the payload test: int8 holds, int4 flips the top slot, the gate catches it). Real spire authored (`syn-amos-justice-margin`); calibration to fire on real vectors PENDING build | `spec` if settings changed | **If it doesn't fire on real vectors, tighten the margin, do NOT add corpus.** Handoff §4 | +| 4 | George sermons index as short **whole** units (so "indexes documents whole" stays true) | Authored as short whole units, one sermon per file, not split. 
Real sermon length not verifiable here (no source access) | **`paper §5–§6`** if split | **Highest stakes.** Build agent watches sermon length; if any unit is split into windows, "in-memory and unchunked" breaks and the §5 reconciliation grows | +| 5 | `EXACT_MATCH_BOOST = 0.30` fires (or not) on the "Adam Smith" vs "George Adam Smith" partial match | Designed live: record **titles carry the full author name** so a query's "Adam Smith" phrase-matches a "George Adam Smith" title; `boost-edge-micah` gold pins it. Observed behavior PENDING build | `spec` | Pin the observed behavior in the gold case at build | +| 6 | Both-Smith shared theme (e.g. "justice") mis-fires the theme boost, and the gold suite exposes it | Collision authored on purpose (Amos/Micah carry "justice", as does TMS); not curated away. Observed behavior PENDING build | `spec` | Keep as an exposed near-tie; do not curate themes to suppress it | +| 7 | FP vectors commit cleanly and the default run reproduces with no key | Keyless runner built and degrades cleanly. **DIVERGENCE:** the core eval CLI requires a key (it embeds gold queries at run time), so the keyless headline needs committed **gold-query** vectors, not just FP vectors. Added `query-vectors.json`; `build.ts` writes it | `spec` | Spec §5 should say "FP **and gold-query** vectors committed" for the no-key claim | +| 8 | Demo is a thin module: reuses `src/retrieve.ts` + `src/no-leak.ts` untouched, no second pipeline | **CONFIRMED.** Reuses `retrieve()`, `cosine()`, the no-leak boundary, the gold judges, `store`, the corpus loaders, and embedding untouched; the int8 path is `quantize.ts` plus a re-rank. No core types changed. 
Budget held, **no halt** | `spec` | None; the budget claim holds | -## Open-ended rows (add as testing surfaces them) +## Open-ended rows (surfaced during the build) | # | Spec assumption | What the build actually did | Touches | Downstream action | |---|---|---|---|---| -| 9 | | | | | -| 10 | | | | | +| 9 | Spec §2: "records carry real public URLs via the normal record path" | **DIVERGENCE.** Per-unit real URLs do not exist (Gutenberg is work-level) and `src/corpus.ts` is reused untouched, so record citation URLs are constructed demo-canonical (`.example` TLD), symmetric across both authors; the provenance table holds the real sources, and private-note `about` targets ARE real | `spec` | Soften §2 to "demo-canonical citations, real provenance + real route targets" (already stated in `corpus/README.md`) | +| 10 | Spec quotes `NEXT-STEPS.md` §C1 as already saying "a runnable miniature ships at `scaling/`" | The live §C1 has no such line. The §C1 link and the C-intro carve-out are **deferred reconciliation** (prepared text below), applied by the build agent **after** `scaling:run` confirms the headline, so "runnable" is verified not asserted | `NEXT-STEPS` | Apply the prepared edit at build, not before | +| 11 | Spec §7: `production-scaling.md` location "unconfirmed (subdir or pending)" | RESOLVED at `docs/production-scaling.md`, em-dashes already thinned (fix 2.4 landed). `scaling/README.md` cross-links it | `spec` | None; resolved | +| 12 | The keyless gate catches route flips | `judgeRetrieval` checks presence in top-K only, so it misses a top-slot flip where the note stays retrieved. 
**Added a keyless route-selection check** (`topSource`) for related-material cases, so a flip that keeps the note in top-K is still caught | `spec` | Note the route check in the spec's §5 harness description | +| 13 | The answer-mode pass governs the route/refuse verdicts | The keyless headline covers retrieval + route selection + refuse-by-floor; the answer-mode adjudication (related-material routes without restating) is the `--full` keyed pass. `answerQuestion` short-circuits to not-found on empty evidence, so refuse-by-empty-floor is keyless even under `--full`. Route tests selection, not A2 | `spec` | Clarify the two tiers (keyless retrieval gate vs keyed answer gate) in §5 | +| 14 | (build) the corpus and vectors are produced in this session | Blocked by egress (GitHub-only) and a missing key; deferred to a local agent per `build-handoff.md`. Code, structure, gold, and tests committed and green | `nothing` (process) | Run the handoff to complete the demo | ## Merge-day assembly (do this the day the demo lands, while it's hot) @@ -43,3 +56,38 @@ Walk the log top to bottom: - Confirm the anonymization checklist still covers any new identifying surface the demo added. The reconciliation is then assembly from recorded facts, not authorship under pressure. That was the point of keeping the log. + +## Prepared reconciliation text (apply at build, once `scaling:run` confirms the headline) + +These edits describe what the demo *does*. They are held here, not applied, +because the demo is not runnable until the vectors are built (rows 2, 14). Apply +them only after `scaling:run --natural` confirms the headline, so "a runnable +miniature ships" is verified, not asserted. + +**`NEXT-STEPS.md` §C-intro** (row 10). It currently reads: "This repository is +full-precision and indexes documents whole; it pulls none of these levers." 
+Once `scaling/` lands, the repo contains int8 code, so distinguish core from +miniature, for example: + +> This repository's **core** is full-precision and indexes documents whole; it +> pulls none of these levers. The one exception is the marked illustration at +> `scaling/`: a runnable int8 miniature on a short-whole-unit public-domain +> corpus, which pulls exactly one lever (int8 quantization) to show the gold +> suite gating it. The core's claims stay true of the core; `scaling/` is named +> as the exception. (It still indexes short units *whole*, so "indexes documents +> whole" holds; only the lever claim needs the carve-out.) + +**`NEXT-STEPS.md` §C1** (row 10). Add a pointer in the int8 lever, for example: +"A runnable miniature of this lever ships at `scaling/` (see +`scaling/README.md`); it is the public, gated counterpart to the private +production figures above." + +**Paper §5/§6 (author's call, conditional).** The published note's §5 says +retrieval is "in-memory and unchunked … indexed whole." That stays true of the +core and of the demo's short whole units. **Only if `scaling/` is in an +anonymized submission snapshot** does §6 want a one-line bridge so §5 reads as +describing the core, not the `scaling/` exception. This is a paper edit, the +author's not the agent's, and it is moot if `scaling/` is deferred past review. +Note in the build summary whether `scaling/` is present in any snapshot built. +**If row 4 fired (a sermon had to be split), the unchunked claim itself needs +revisiting, not just a bridge.** diff --git a/scaling/README.md b/scaling/README.md new file mode 100644 index 0000000..c28d686 --- /dev/null +++ b/scaling/README.md @@ -0,0 +1,109 @@ +# The int8 scaling demo + +The result this module is built to produce is a **caught failure**: the same +gold suite that owns grounding and refusal rejecting a cheaper encoding. 
Run the +quantizer at int4 (or with a lowered floor) and a route case flips, the private +note loses the top slot to a public record, and the suite catches it. That is +the point. "int8 held" on a small corpus is expected and proves little on its +own; the gate saying *no* when pushed is what shows the gold suite, not the +encoding, is the adjudicator. + +``` +npm run scaling:run # int8 on the real corpus: the headline, keyless +npm run scaling:run -- --bits 4 # int4: the gate rejects the route flip +npm run scaling:run -- --natural+synthetic # add the quarantined spire + its gold +npm run scaling:run -- --full # also run the answer-mode pass (needs a key) +``` + +## What it is + +The paper (§6) claims that the same gold suite which owns grounding and refusal +also adjudicated every cost reduction made to run the system at scale. The +production figures behind that are private and non-reproducible. This demo makes +the *mechanism* runnable on a public-domain corpus: it quantizes the embedding +index to int8, re-ranks, and runs the full gold suite including the must-refuse +and must-route cases, so the gate either certifies or rejects the cheaper +encoding. The claim is **relative, not absolute**: not "this corpus is +realistic," but "int8 preserves the verdicts full-precision produces, and where +it does not, the gate catches it." Realism is never asserted. + +Public domain is the *absence* of copyright, not a license: this corpus is +public-domain, not "permissively licensed." The two name-colliding authors and +their provenance live in [`corpus/README.md`](./corpus/README.md). + +## How it works (a wrapper plus a re-rank, not a second system) + +The int8 path is an encode/decode wrapper plus a re-rank. It reuses the core +retrieval (`src/retrieve.ts`), the gold judge (`src/evaluate.ts`), the store +(`src/store.ts`), and the no-leak boundary (`src/no-leak.ts`) untouched; nothing +in the core was forked or changed. 
`quantize.ts` is the public twin of the +production site adapter's `vector-quant.ts` (named in +[`docs/production-scaling.md`](../docs/production-scaling.md) §2). The harness +quantizes the committed full-precision vectors in process, dequantizes, and +hands the result to the same `retrieve()` the engine uses. + +Two facts make int8 admissible, and they differ in kind (the §6 split): + +- **Exact, by algebra.** Cosine normalizes by vector norm, so a positive + per-vector scale cancels from the score entirely. The ranking is invariant to + it; you can score against the quantized bytes without restoring the scale. +- **Measured, by the suite.** Integer rounding perturbs direction and can + reorder near-ties, so its harmlessness is not proven; it is verified. The + harness reports rank correlation against the full-precision ranking, then runs + the gold suite. Rank correlation is *necessary, not sufficient*: a demo that + reports it and stops has shown a retrieval benchmark, not answerability + governing tuning. The refuse and route cases are the actual adjudicator. Past + int8 (int4, PQ, binary) the exact part stops applying and the whole lever is + measured; the wire format is versioned so a code/data mismatch fails loudly. + +The headline run is **keyless**: it reads committed full-precision vectors and +committed gold-query vectors, so no embedding call is made. A key is needed only +to regenerate the vectors (`scaling:build`) or to run the `--full` answer pass. +That answer pass exercises route *selection*, which is what quantization moves; +it does not touch A2, the answer model's confabulation residue, which the +encoding never exercises. + +## Disclosures (the three that are non-negotiable) + +1. **Layer designation, not secrecy.** "Private" means the type cannot carry the + text to the model, regardless of what the text is. 
George Adam Smith's minor + works are public-domain; assigning some of them to the private layer is an + authored research decision, the same move the core's notebook entries make. +2. **The synthetic spire is fabricated and flagged.** A small set of fabricated + George-private notes lives quarantined in `corpus/synthetic/`, loaded only + under `--natural+synthetic`, each marked `synthetic: true` and naming the + edge it tests. It is additive and never enters the headline metrics; the + spire's effect is reported on its own line. No fabricated words are ever + attributed to the real Adam Smith. +3. **The claim is relative.** int8 preserves the verdicts full-precision + produces; the corpus is not offered as realistic and nothing turns on its + realism. + +One disclosure carries a warning. The core gitignores `artifacts/index.json` +because vectors derived from private text are private; this demo does the +opposite and commits its vectors, so the headline reproduces with no key. That +is safe *here* because the "private" layer is public-domain George text, whose +embeddings reveal nothing not already public. Do not copy "commit your vectors" as a +general pattern: embeddings of genuinely private text can be inverted to recover +approximate content, which is the exposure the core's gitignored index avoids. + +## Build status + +The code, the gold set, the provenance manifest, and the deterministic harness +tests (`quantize.test.ts`, run by `npm test`) are committed. The real text +bodies and the committed vectors (`corpus/index.json`, +`corpus/index.synthetic.json`, `corpus/query-vectors.json`) are produced by +`scaling:build`, which needs network access to the public-domain sources and an +`OPENAI_API_KEY`; the session that wrote the module had neither. See +[`docs/scaling-demo/build-handoff.md`](../docs/scaling-demo/build-handoff.md) +for the exact steps, and the delta log for what is confirmed versus pending. 
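
The "exact, by algebra" half of the admissibility split in "How it works" can be checked directly: for a positive scale s, cos(q, s·c) = s(q·c) / (|q| · s|c|) = cos(q, c). A minimal sketch, assuming hypothetical integer codes and a hypothetical per-vector scale (none of these values come from the committed vectors, and `cosineSim` is a local stand-in for the core's `cosine`):

```typescript
// Hypothetical check: a positive per-vector scale cancels out of cosine.
function cosineSim(a: readonly number[], b: readonly number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return dot / Math.sqrt(na * nb);
}

const query = [0.2, -0.5, 0.8];
const codes = [31, -127, 64]; // illustrative int8 codes
const scale = 0.0042; // illustrative per-vector scale, must be positive

// Scoring against raw codes versus scale-restored values.
const restored = codes.map((c) => c * scale);
const drift = Math.abs(cosineSim(query, codes) - cosineSim(query, restored));

// The two scores agree up to floating-point noise.
console.log(drift < 1e-12);
```

This covers only the scale. The rounding that produced the codes in the first place is the part the README calls measured, and no identity of this kind covers it.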
+ +## Relation to production + +This is the runnable counterpart to the prose in `docs/production-scaling.md` +§2: the prose makes the case, the demo runs it. The George/Adam disambiguation +mirrors the real two-tier citation surface on the production site (Ask the +Archive), where a public-record citation carries an id and a URL and a +routing-hint citation carries only where the moment lives, never the text. The +**architecture** is what reproduces here, not the scale: the scale stays +reported in §6, the mechanism runs in this folder. diff --git a/scaling/corpus/README.md b/scaling/corpus/README.md index f4e1f0a..3eb9e08 100644 --- a/scaling/corpus/README.md +++ b/scaling/corpus/README.md @@ -32,11 +32,11 @@ Verify each ID and date against the source before relying on it; fill OCR-qualit | _The Book of the Twelve Prophets_, \ | George Adam Smith | 1896-98 | public | Gutenberg 43847 / 50747 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | | _The Book of Isaiah_, ch\ | George Adam Smith | 1888-90 | public | Gutenberg 39767 / 43672 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify against Gutenberg_ | | _The Forgiveness of Sins, and Other Sermons_, \ | George Adam Smith | 1905 (A. C. Armstrong & Son) | **private** | Internet Archive `forgivenessofsin00smitrich` (ARK `ark:/13960/t0gt5jk4g`); HathiTrust full-view backup record 100136688 | US: pre-1931 / PD in USA. Life+70: author d. 1942; term expired | _verify NOT_IN_COPYRIGHT; OCR-noisy expected, which is fine_ | -| \ | — (fabricated) | — | **synthetic** | authored here | n/a (no copyright in fabricated demo text) | quarantined in `synthetic/`; tests \ | +| \ | n/a (fabricated) | n/a | **synthetic** | authored here | n/a (no copyright in fabricated demo text) | quarantined in `synthetic/`; tests \ | **Sourcing (resolved, pending verification).** George's *major* commentaries are listed on Project Gutenberg. 
The private layer rests on *The Forgiveness of Sins, and other Sermons* (1905), a single volume yielding several short, windy sermon units, which is exactly what the private layer needs: short whole units that route without restating. *Jeremiah: Being the Baird Lecture for 1922* (1923) is a further minor source if wanted. The fallback (designating a *section* of a major work private) is therefore **not** required; if a future rebuild loses these sources, that fallback keeps the private layer real rather than padding it with synthetic. -**OPEN — the one sourcing check that can block the build.** Confirm George's minor/windy material (the sermons) actually downloads as clean-enough public-domain text. If only the big commentaries are digitized, the private ledger is thin: use the fallback (a short *section* of a major work, designated private) rather than padding with synthetic, which would turn the spire into a column. Record the outcome here. +**OPEN: the one sourcing check that can block the build.** Confirm George's minor/windy material (the sermons) actually downloads as clean-enough public-domain text. If only the big commentaries are digitized, the private ledger is thin: use the fallback (a short *section* of a major work, designated private) rather than padding with synthetic, which would turn the spire into a column. Record the outcome here. 
## URLs: demo-canonical citations, real route targets From a8c5fe5ff2abbe3f4f0a47aabf6123fe6a98ae57 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:45:49 +0000 Subject: [PATCH 06/10] =?UTF-8?q?fix(scaling):=20address=20code=20review?= =?UTF-8?q?=20=E2=80=94=20quantized=20--full,=20loud=20build=20failures?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --full now retrieves the answer-mode evidence from the SAME quantized index the retrieval gate judged, so a route flip on the lossy index can no longer be masked by full-precision retrieval (the keyed pass exercises the surface it claims to). build.ts now throws with the id when any source or gold-query embedding is missing, instead of silently writing a partial artifact, matching the repo's loud-failure standard. Aligns the test import to the core's named node:test form. Not changed: the committed scaling/corpus/index.json. Committing those vectors is the spec's intentional divergence (§5) — the "private" layer is public-domain George text, the index is deliberately not gitignored, and README §2 + the corpus manifest §2 explain it with the embedding-inversion warning. The STANDARDS.md reconciliation for that is a separate call, raised with the author. 
https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- scaling/build.ts | 15 +++++++++------ scaling/quantize.test.ts | 2 +- scaling/run.ts | 22 ++++++++++++++-------- 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/scaling/build.ts b/scaling/build.ts index ba41c0b..ffd8da6 100644 --- a/scaling/build.ts +++ b/scaling/build.ts @@ -50,7 +50,7 @@ function recordEntries(config: ArchiveConfig, vectors: Map): I for (const record of buildCorpus(config)) { const text = embedText(record); const vector = vectors.get(record.id); - if (!vector) continue; + if (!vector) throw new Error(`no embedding returned for record '${record.id}'; refusing to write a partial index.`); entries.push({ model: config.embeddingModel, dimensions: vector.length, @@ -67,7 +67,7 @@ function noteEntries(notes: PrivateNote[], vectors: Map): Inde const entries: IndexEntry[] = []; for (const note of notes) { const vector = vectors.get(note.id); - if (!vector) continue; + if (!vector) throw new Error(`no embedding returned for note '${note.id}'; refusing to write a partial index.`); entries.push({ model: config.embeddingModel, dimensions: vector.length, @@ -135,10 +135,13 @@ async function main(): Promise { console.log('No synthetic notes authored yet; skipping the spire index.'); } - // Committed gold-query vectors (what makes scaling:run keyless). - const queryVectors = goldQueries - .map((g) => ({ id: g.id, vector: vectors.get(`query:${g.id}`) })) - .filter((q): q is { id: string; vector: number[] } => Array.isArray(q.vector)); + // Committed gold-query vectors (what makes scaling:run keyless). Every gold + // query must embed, or the keyless runner would later fail on a missing id. + const queryVectors = goldQueries.map((g) => { + const vector = vectors.get(`query:${g.id}`); + if (!vector) throw new Error(`no embedding returned for gold query '${g.id}'; refusing to write partial query vectors.`); + return { id: g.id, vector }; + }); const dims = queryVectors[0]?.vector.length ?? 
naturalEntries[0]?.dimensions ?? 0; writeQueryVectors(config.embeddingModel, dims, queryVectors); console.log(`Wrote ${queryVectors.length} gold-query vectors`); diff --git a/scaling/quantize.test.ts b/scaling/quantize.test.ts index fa37821..0778c35 100644 --- a/scaling/quantize.test.ts +++ b/scaling/quantize.test.ts @@ -5,7 +5,7 @@ // tests prove the mechanism itself. import assert from 'node:assert/strict'; -import test from 'node:test'; +import { test } from 'node:test'; import { cosine } from '../src/retrieve.js'; import type { ArchiveRecord, IndexEntry, PrivateNote } from '../src/types.js'; diff --git a/scaling/run.ts b/scaling/run.ts index 08c8030..190f853 100644 --- a/scaling/run.ts +++ b/scaling/run.ts @@ -17,7 +17,7 @@ import { loadGold } from '../src/evaluate.js'; import type { GoldQuery } from '../src/evaluate.js'; import { assertHomogeneousIndex, readIndexFile } from '../src/store.js'; import type { IndexEntry } from '../src/types.js'; -import { runGate } from './harness.js'; +import { requantizeIndex, runGate } from './harness.js'; import { readQueryVectors } from './query-vectors.js'; const NATURAL_INDEX = resolve('scaling/corpus/index.json'); @@ -149,7 +149,10 @@ async function main(): Promise { ); if (args.full) { - await runAnswerPass(gold, index, qv.byId, config); + // The answer pass must see evidence selected from the SAME quantized index + // the retrieval gate judged, or a route flip on the lossy index would be + // masked by full-precision retrieval. Quantize once, here, and hand it down. + await runAnswerPass(gold, requantizeIndex(index, args.bits), qv.byId, config); } else { console.log('(retrieval + route tier only; add --full to run the answer-mode pass with a key)'); } @@ -157,12 +160,15 @@ async function main(): Promise { if (report.failed > 0) process.exitCode = 1; } -/** The keyed bonus: run the answer model and check the declared mode. 
Exercises - * route SELECTION through the reused no-leak boundary; it does not touch A2 - * (the answer model's confabulation residue), which the encoding never moves. */ +/** The keyed bonus: run the answer model on evidence retrieved from the + * QUANTIZED index, and check the declared mode. Same lossy surface the + * retrieval gate judged, so a route flip is not masked by full-precision + * retrieval. Exercises route SELECTION through the reused no-leak boundary; it + * does not touch A2 (the answer model's confabulation residue), which the + * encoding never moves. */ async function runAnswerPass( gold: readonly GoldQuery[], - index: readonly IndexEntry[], + quantIndex: readonly IndexEntry[], queryVectorById: ReadonlyMap, config: import('../src/types.js').ArchiveConfig, ): Promise { @@ -178,12 +184,12 @@ async function runAnswerPass( import('../src/evaluate.js'), ]); const client = new OpenAI(); - console.log('\n--full answer-mode pass (keyed):'); + console.log('\n--full answer-mode pass (keyed, on the quantized index):'); let answerFails = 0; for (const g of gold) { const qv = queryVectorById.get(g.id); if (!qv) continue; - const hits = retrieve(qv, g.query, index); + const hits = retrieve(qv, g.query, quantIndex); const evidence = assembleEvidence( hits.records.map((h) => h.record), hits.notes.map((h) => h.note), From 48203057915867f79b915124588becda3c4c6073 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 04:49:42 +0000 Subject: [PATCH 07/10] fix(scaling): validate committed query vectors at read; record STANDARDS carve-out MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit readQueryVectors now checks each committed vector is numeric and matches the file's dimensions, so a corrupt query-vectors.json fails loudly at read with the rebuild hint instead of surfacing later as bad cosine (the repo's loud-failure standard). 
Adds query-vectors.test.ts: round-trip, missing-file-is-null, and the malformed cases (wrong dims, non-numeric, bad version). 39 tests pass. Records in the delta log the one standing standards point the review keeps raising: the demo commits scaling/corpus/index.json with the public-domain George "private"-layer note objects on purpose (spec §5, manifest §2), so a STANDARDS.md line-51 carve-out is prepared as a deferred reconciliation rather than redesigning the spec-mandated keyless artifact story. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- docs/scaling-demo/scaling-demo-delta-log.md | 12 ++++++ scaling/query-vectors.test.ts | 47 +++++++++++++++++++++ scaling/query-vectors.ts | 12 +++++- 3 files changed, 69 insertions(+), 2 deletions(-) create mode 100644 scaling/query-vectors.test.ts diff --git a/docs/scaling-demo/scaling-demo-delta-log.md b/docs/scaling-demo/scaling-demo-delta-log.md index 144afef..a2f9add 100644 --- a/docs/scaling-demo/scaling-demo-delta-log.md +++ b/docs/scaling-demo/scaling-demo-delta-log.md @@ -46,6 +46,7 @@ rows about the *mechanism and structure* are settled now. | 12 | The keyless gate catches route flips | `judgeRetrieval` checks presence in top-K only, so it misses a top-slot flip where the note stays retrieved. **Added a keyless route-selection check** (`topSource`) for related-material cases, so a flip that keeps the note in top-K is still caught | `spec` | Note the route check in the spec's §5 harness description | | 13 | The answer-mode pass governs the route/refuse verdicts | The keyless headline covers retrieval + route selection + refuse-by-floor; the answer-mode adjudication (related-material routes without restating) is the `--full` keyed pass. `answerQuestion` short-circuits to not-found on empty evidence, so refuse-by-empty-floor is keyless even under `--full`. 
Route tests selection, not A2 | `spec` | Clarify the two tiers (keyless retrieval gate vs keyed answer gate) in §5 | | 14 | (build) the corpus and vectors are produced in this session | Blocked by egress (GitHub-only) and a missing key; deferred to a local agent per `build-handoff.md`. Code, structure, gold, and tests committed and green | `nothing` (process) | Run the handoff to complete the demo | +| 15 | `.github/STANDARDS.md` line 51: "Don't leak private embeddings/text into committed artifacts" | The demo commits `scaling/corpus/index.json` with the public-domain George "private"-layer vectors, **on purpose** (spec §5): the layer is public-domain (a layer assignment, not secrecy), the file is deliberately not gitignored, and README §2 + manifest §2 explain it with the inversion warning. The automated review flagged the standard. No design change; the standard is about genuinely-private data | `STANDARDS` reconciliation | Add a one-line carve-out at merge (prepared below) so the demo's public-domain exception is named, not re-flagged | ## Merge-day assembly (do this the day the demo lands, while it's hot) @@ -82,6 +83,17 @@ miniature, for example: `scaling/README.md`); it is the public, gated counterpart to the private production figures above." +**`.github/STANDARDS.md` line 51** (row 15, raised by the automated review). +"Don't leak private embeddings/text into committed artifacts. (The index is +gitignored for a reason.)" stays true of the core. Name the demo's exception so +it is not re-flagged, for example: + +> The one exception is the `scaling/` demo. Its "private" layer is public-domain +> text by design (a layer assignment, not secrecy), so it commits +> `scaling/corpus/index.json` on purpose, to reproduce the headline with no key. +> See `scaling/README.md` §2 for why that is safe there and must not be +> generalized to a genuinely-private corpus. 
+ **Paper §5/§6 (author's call, conditional).** The published note's §5 says retrieval is "in-memory and unchunked … indexed whole." That stays true of the core and of the demo's short whole units. **Only if `scaling/` is in an diff --git a/scaling/query-vectors.test.ts b/scaling/query-vectors.test.ts new file mode 100644 index 0000000..1f6fdc6 --- /dev/null +++ b/scaling/query-vectors.test.ts @@ -0,0 +1,47 @@ +// Offline tests for the committed gold-query vector store: a clean round-trip, +// a missing file reading as "not built yet" (null), and malformed artifacts +// failing loudly at read with the rebuild hint rather than later as bad cosine. + +import assert from 'node:assert/strict'; +import { mkdtempSync, writeFileSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { test } from 'node:test'; + +import { QUERY_VECTORS_VERSION, readQueryVectors, writeQueryVectors } from './query-vectors.js'; + +const tmp = mkdtempSync(join(tmpdir(), 'scaling-qv-')); + +test('query-vectors: write/read round-trips with model and dimensions', () => { + const path = join(tmp, 'ok.json'); + writeQueryVectors('text-embedding-3-large', 3, [{ id: 'a', vector: [0.1, 0.2, 0.3] }], path); + const loaded = readQueryVectors(path); + assert.ok(loaded); + assert.equal(loaded.model, 'text-embedding-3-large'); + assert.equal(loaded.dimensions, 3); + assert.deepEqual(loaded.byId.get('a'), [0.1, 0.2, 0.3]); +}); + +test('query-vectors: a missing file reads as null (not built yet), not an error', () => { + assert.equal(readQueryVectors(join(tmp, 'absent.json')), null); +}); + +test('query-vectors: malformed entries fail loudly at read', () => { + const wrongDims = join(tmp, 'dims.json'); + writeFileSync( + wrongDims, + JSON.stringify({ version: QUERY_VECTORS_VERSION, model: 'm', dimensions: 3, queries: [{ id: 'a', vector: [0.1, 0.2] }] }), + ); + assert.throws(() => readQueryVectors(wrongDims), /malformed entry for 'a'/); + + const nonNumeric = 
join(tmp, 'nan.json'); + writeFileSync( + nonNumeric, + JSON.stringify({ version: QUERY_VECTORS_VERSION, model: 'm', dimensions: 2, queries: [{ id: 'b', vector: [0.1, 'x'] }] }), + ); + assert.throws(() => readQueryVectors(nonNumeric), /malformed entry for 'b'/); + + const badVersion = join(tmp, 'ver.json'); + writeFileSync(badVersion, JSON.stringify({ version: 999, model: 'm', dimensions: 2, queries: [] })); + assert.throws(() => readQueryVectors(badVersion), /schema version/); +}); diff --git a/scaling/query-vectors.ts b/scaling/query-vectors.ts index f1ead57..def941c 100644 --- a/scaling/query-vectors.ts +++ b/scaling/query-vectors.ts @@ -54,8 +54,16 @@ export function readQueryVectors(path: string = QUERY_VECTORS_PATH): LoadedQuery } const byId = new Map(); for (const q of file.queries) { - if (typeof q?.id !== 'string' || !Array.isArray(q.vector)) { - throw new Error(`query vectors at ${path} have a malformed entry. ${REBUILD}`); + // Validate to the same depth the store does for the index: a corrupt vector + // must fail loudly at read with the rebuild hint, not later as bad cosine. + if ( + typeof q?.id !== 'string' || + !Array.isArray(q.vector) || + q.vector.length !== file.dimensions || + !q.vector.every((x) => typeof x === 'number' && Number.isFinite(x)) + ) { + const which = typeof q?.id === 'string' ? ` for '${q.id}'` : ''; + throw new Error(`query vectors at ${path} have a malformed entry${which}. 
${REBUILD}`); } byId.set(q.id, q.vector); } From cbbc3a88545e092273698e76e35353729040e10c Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 05:09:15 +0000 Subject: [PATCH 08/10] feat(scaling): protect the disambiguation verdict keyless; clearer run output MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Generalizes the top-slot check from route cases to every non-refusal case with an expected source: judgeRetrieval only checks top-K membership, so a quantization flip that keeps both Smiths retrieved but swaps which ranks first would pass keyless and leave the corpus's marquee verdict (disambiguation) protected only by the keyed --full pass. Now the expected source must WIN the top slot, not merely appear — the right Smith must outrank the wrong one, the private note must outrank the records. New test proves a partial-mode Smith-vs- Smith int4 flip is caught keyless (40 tests). scaling:run now states plainly what it is running (encoding: int8 shipped vs int4 tightened; corpus: natural vs +spire; keyless) and prints a verdict line (CERTIFIED / REJECTED) so a reader knows what the result means. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- scaling/harness.ts | 28 ++++++++++++++++--------- scaling/quantize.test.ts | 35 ++++++++++++++++++++++++++++--- scaling/run.ts | 45 +++++++++++++++++++++++++++++++--------- 3 files changed, 85 insertions(+), 23 deletions(-) diff --git a/scaling/harness.ts b/scaling/harness.ts index 842fcae..95bbfb0 100644 --- a/scaling/harness.ts +++ b/scaling/harness.ts @@ -106,10 +106,15 @@ export interface QueryGateResult { /** judgeRetrieval on the quantized index: expected sources in, forbidden out. */ retrievalPass: boolean; retrievalIssues: string[]; - /** Present only for route (related-material) cases: did the expected note win - * the top slot on the quantized index? */ - route?: { expectedNote: string; winner: string | null; won: boolean }; - /** retrievalPass AND (route ? 
route.won : true). */ + /** For any case that names an expected source and is not a refusal: did that + * source win the top slot on the quantized index? This is what protects the + * *verdict*, not just presence. judgeRetrieval only checks top-K membership, + * so a quantization flip that keeps both Smiths retrieved but swaps which one + * ranks first would pass it silently. The top-slot check catches that: the + * expected record must OUTRANK the competing Smith (disambiguation), and the + * private note must win over the public records (route). */ + topSlot?: { expected: string; winner: string | null; won: boolean }; + /** retrievalPass AND (topSlot ? topSlot.won : true). */ pass: boolean; } @@ -124,20 +129,23 @@ export function evaluateQuery( const judged = judgeRetrieval(gold, hits); const rho = rankCorrelation(index, quantIndex, queryVector); - let route: QueryGateResult['route']; - if (gold.expectAnswerMode === 'related-material' && gold.expectSources && gold.expectSources[0]) { - const expectedNote = gold.expectSources[0]; + // Any non-refusal case with a named expected source must see that source win + // the top slot, not merely appear. Refusals (not-found) carry no expected + // source; the floor and forbidSources adjudicate them via judgeRetrieval. + let topSlot: QueryGateResult['topSlot']; + if (gold.expectAnswerMode !== 'not-found' && gold.expectSources && gold.expectSources[0]) { + const expected = gold.expectSources[0]; const winner = topSource(hits); - route = { expectedNote, winner: winner?.id ?? null, won: winner?.id === expectedNote }; + topSlot = { expected, winner: winner?.id ?? null, won: winner?.id === expected }; } - const pass = judged.pass && (route ? route.won : true); + const pass = judged.pass && (topSlot ? topSlot.won : true); return { id: gold.id, rho, retrievalPass: judged.pass, retrievalIssues: judged.issues, - ...(route ? { route } : {}), + ...(topSlot ? 
{ topSlot } : {}), pass, }; } diff --git a/scaling/quantize.test.ts b/scaling/quantize.test.ts index 0778c35..d5eaa6d 100644 --- a/scaling/quantize.test.ts +++ b/scaling/quantize.test.ts @@ -140,7 +140,7 @@ test('the payload: the gate certifies int8 and rejects int4 on the route case', // int8: the note wins the top slot, the gate passes. const int8 = runGate([gold], index, qById, 8); assert.equal(int8.passed, 1, 'int8 certifies the route'); - assert.equal(int8.results[0]!.route?.won, true); + assert.equal(int8.results[0]!.topSlot?.won, true); assert.ok(int8.results[0]!.rho >= 0.9); // int4: the record overtakes the note for the top slot. The note is still @@ -150,8 +150,37 @@ test('the payload: the gate certifies int8 and rejects int4 on the route case', assert.equal(int4.failed, 1, 'int4 is rejected'); const r = int4.results[0]!; assert.equal(r.retrievalPass, true, 'the note is still in the candidate set'); - assert.equal(r.route?.won, false, 'but it lost the top slot'); - assert.equal(r.route?.winner, 'george-adam-smith:twelve-prophets-amos'); + assert.equal(r.topSlot?.won, false, 'but it lost the top slot'); + assert.equal(r.topSlot?.winner, 'george-adam-smith:twelve-prophets-amos'); +}); + +test('disambiguation: the keyless gate catches a partial-mode flip (right Smith vs wrong Smith)', () => { + // Same near-tie geometry, but both candidates are public records: the right + // Smith (VN) outranks the wrong Smith (VR) at full precision and int8, and int4 + // swaps them. A partial case is presence-checked by judgeRetrieval, so without + // the top-slot check the flipped disambiguation verdict would pass keyless. 
+ const index: IndexEntry[] = [ + recordEntry('adam-smith:theory-of-moral-sentiments-justice', VN, { title: 'unrelated phrasing' }), + recordEntry('george-adam-smith:twelve-prophets-amos', VR, { title: 'unrelated phrasing' }), + ]; + const gold: GoldQuery = { + id: 'econ-justice', + query: 'zzz qqq no token overlap with any title or theme', + expectAnswerMode: 'partial', + expectSources: ['adam-smith:theory-of-moral-sentiments-justice'], + }; + const qById = new Map([[gold.id, Q]]); + + const int8 = runGate([gold], index, qById, 8); + assert.equal(int8.passed, 1, 'int8 keeps the right Smith on top'); + assert.equal(int8.results[0]!.topSlot?.won, true); + + const int4 = runGate([gold], index, qById, 4); + assert.equal(int4.failed, 1, 'int4 flips to the wrong Smith and the gate catches it'); + const r = int4.results[0]!; + assert.equal(r.retrievalPass, true, 'both Smiths still retrieved (presence alone passes)'); + assert.equal(r.topSlot?.won, false, 'but the wrong Smith won the top slot'); + assert.equal(r.topSlot?.winner, 'george-adam-smith:twelve-prophets-amos'); }); test('the payload, directly: cosine ordering flips between int8 and int4', () => { diff --git a/scaling/run.ts b/scaling/run.ts index 190f853..249f517 100644 --- a/scaling/run.ts +++ b/scaling/run.ts @@ -125,28 +125,53 @@ async function main(): Promise { ); } + // Say plainly what this run IS, so a reader knows what they are looking at. const label = args.synthetic ? '--natural+synthetic' : '--natural'; - console.log(`scaling:run ${label} int${args.bits} ${gold.length} gold queries ${index.length} index entries`); - if (args.synthetic) { - console.log(' (headline numbers come from the --natural run; the spire is broken out below)'); - } + const shipped = args.bits === 8; + console.log('scaling:run — int8 quantization gate (Smith collection)'); + console.log( + ` encoding: int${args.bits} ` + + (shipped + ? 
'(the shipped wire format; expected to HOLD the suite)' + : '(tightened below int8; a near-tie may flip and be REJECTED)'), + ); + console.log( + ` corpus: ${label} ` + + (args.synthetic + ? '(real corpus + the fabricated spire; headline still comes from --natural)' + : '(real corpus only; owns the headline numbers)'), + ); + console.log(` ${gold.length} gold queries, ${index.length} index entries, keyless (committed vectors)\n`); const report = runGate(gold, index, qv.byId, args.bits); for (const r of report.results) { const status = r.pass ? 'ok ' : 'FAIL'; - const routeBit = r.route ? ` route:${r.route.won ? 'won' : `LOST->${r.route.winner ?? 'none'}`}` : ''; - console.log(` ${status} ${r.id.padEnd(18)} rho=${r.rho.toFixed(4)}${routeBit}`); - if (!r.pass) for (const issue of r.retrievalIssues) console.log(` - ${issue}`); - if (r.route && !r.route.won) { - console.log(` - route flipped: expected ${r.route.expectedNote} to win the top slot`); + const slot = r.topSlot ? ` top:${r.topSlot.won ? 'won' : `LOST->${r.topSlot.winner ?? 'none'}`}` : ''; + console.log(` ${status} ${r.id.padEnd(18)} rho=${r.rho.toFixed(4)}${slot}`); + if (!r.pass) { + for (const issue of r.retrievalIssues) console.log(` - ${issue}`); + if (r.topSlot && !r.topSlot.won) { + console.log(` - top slot flipped: expected ${r.topSlot.expected} to win, ${r.topSlot.winner ?? 
'nothing'} did`); + } } } console.log( - `\nint${args.bits}: ${report.passed}/${report.total} gold passed; ` + + `\nint${args.bits}: ${report.passed}/${report.total} gold verdicts held; ` + `rank correlation mean ${report.meanRho.toFixed(4)}, min ${report.minRho.toFixed(4)}`, ); + if (report.failed === 0) { + console.log(` VERDICT: the gold suite CERTIFIED int${args.bits} — every verdict full precision produces held.`); + } else { + const flips = report.results.filter((r) => r.topSlot && !r.topSlot.won).length; + console.log( + ` VERDICT: the gold suite REJECTED int${args.bits} — ${report.failed} verdict(s) did not hold` + + (flips ? `, including ${flips} top-slot flip(s)` : '') + + '.', + ); + console.log(' The same suite that owns grounding and refusal caught it; that caught failure is the payload.'); + } if (args.full) { // The answer pass must see evidence selected from the SAME quantized index From 6573ccd255afb44e2cf574801a499ed0fc6039e9 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 05:09:15 +0000 Subject: [PATCH 09/10] docs(scaling): README honesty pass + review nits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Stops the README opening from asserting the real-corpus caught failure as established fact. It now leads with the mechanism proven offline in quantize.test.ts (fixture vectors searched to exhibit the near-tie) and marks the real-Smith-corpus demonstration as pending the build run — the same "don't claim runnable before it runs" rule applied to NEXT-STEPS, now applied to the README itself. Fixes the int4 command to --natural+synthetic --bits 4 (the spire only loads under +synthetic) and notes scaling:run errors until built. Notes the keyless top-slot check now protects disambiguation too. 
Review nits: synthetic note's traveling title is now unambiguously synthetic (the label is the A1 leak surface); "no synthetic type field" reworded to match the synthetic:true marker the file carries; "the real Adam Smith" -> "either real Smith"; the spec and delta log are noted in the README as kept-in-the-open on purpose; delta row 12 updated for the disambiguation extension. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- docs/scaling-demo/build-handoff.md | 2 +- docs/scaling-demo/scaling-demo-delta-log.md | 2 +- scaling/README.md | 60 ++++++++++++++----- scaling/corpus/README.md | 2 +- .../synthetic/syn-amos-justice-margin.md | 2 +- scaling/scaling.config.ts | 5 +- 6 files changed, 53 insertions(+), 20 deletions(-) diff --git a/docs/scaling-demo/build-handoff.md b/docs/scaling-demo/build-handoff.md index bbfd3d2..6d0066c 100644 --- a/docs/scaling-demo/build-handoff.md +++ b/docs/scaling-demo/build-handoff.md @@ -55,7 +55,7 @@ Confirm the actual sermon titles from the volume and rename slugs to match if ne ## 2. Author the synthetic spire (only if the deliberate failure needs it) The spire is the scalpel for the deliberate failure (step 4), not a corpus filler. Author it **only if** the real route-margin tie does not flip under a tightened encoding on its own. Each synthetic note: -- lives in `scaling/corpus/synthetic/` (the quarantine **is** the flag; there is no `synthetic` type field), +- lives in `scaling/corpus/synthetic/` (the quarantine is one flag) and carries a `synthetic: true` frontmatter marker (a second, in-file flag; the `PrivateNote` type does not read it, so it changes nothing the engine sees), - is a fabricated **George-private** note (never a third Smith, never words for the real Adam Smith), - carries a one-line comment at the top of the body naming the gold case, the margin, and the mode it targets, - is skewed toward must-refuse / route-flip, never an extra must-answer win. 
diff --git a/docs/scaling-demo/scaling-demo-delta-log.md b/docs/scaling-demo/scaling-demo-delta-log.md index a2f9add..d5580a7 100644 --- a/docs/scaling-demo/scaling-demo-delta-log.md +++ b/docs/scaling-demo/scaling-demo-delta-log.md @@ -43,7 +43,7 @@ rows about the *mechanism and structure* are settled now. | 9 | Spec §2: "records carry real public URLs via the normal record path" | **DIVERGENCE.** Per-unit real URLs do not exist (Gutenberg is work-level) and `src/corpus.ts` is reused untouched, so record citation URLs are constructed demo-canonical (`.example` TLD), symmetric across both authors; the provenance table holds the real sources, and private-note `about` targets ARE real | `spec` | Soften §2 to "demo-canonical citations, real provenance + real route targets" (already stated in `corpus/README.md`) | | 10 | Spec quotes `NEXT-STEPS.md` §C1 as already saying "a runnable miniature ships at `scaling/`" | The live §C1 has no such line. The §C1 link and the C-intro carve-out are **deferred reconciliation** (prepared text below), applied by the build agent **after** `scaling:run` confirms the headline, so "runnable" is verified not asserted | `NEXT-STEPS` | Apply the prepared edit at build, not before | | 11 | Spec §7: `production-scaling.md` location "unconfirmed (subdir or pending)" | RESOLVED at `docs/production-scaling.md`, em-dashes already thinned (fix 2.4 landed). `scaling/README.md` cross-links it | `spec` | None; resolved | -| 12 | The keyless gate catches route flips | `judgeRetrieval` checks presence in top-K only, so it misses a top-slot flip where the note stays retrieved. 
**Added a keyless route-selection check** (`topSource`) for related-material cases, so a flip that keeps the note in top-K is still caught | `spec` | Note the route check in the spec's §5 harness description | +| 12 | The keyless gate catches a quantization flip | `judgeRetrieval` checks presence in top-K only, so it misses a flip where both candidates stay retrieved but swap rank. **Added a keyless top-slot check** (`topSource`): for any non-refusal case with an expected source, that source must WIN the top slot, not merely appear. Covers **route** (the private note must outrank the records) and, extended per review, **disambiguation** (the right Smith must outrank the wrong one — otherwise the headline's marquee verdict was protected only by the keyed `--full` pass) | `spec` | Note the top-slot check in the spec's §5 harness description; it is what makes the disambiguation verdict a keyless one | | 13 | The answer-mode pass governs the route/refuse verdicts | The keyless headline covers retrieval + route selection + refuse-by-floor; the answer-mode adjudication (related-material routes without restating) is the `--full` keyed pass. `answerQuestion` short-circuits to not-found on empty evidence, so refuse-by-empty-floor is keyless even under `--full`. Route tests selection, not A2 | `spec` | Clarify the two tiers (keyless retrieval gate vs keyed answer gate) in §5 | | 14 | (build) the corpus and vectors are produced in this session | Blocked by egress (GitHub-only) and a missing key; deferred to a local agent per `build-handoff.md`. 
Code, structure, gold, and tests committed and green | `nothing` (process) | Run the handoff to complete the demo | | 15 | `.github/STANDARDS.md` line 51: "Don't leak private embeddings/text into committed artifacts" | The demo commits `scaling/corpus/index.json` with the public-domain George "private"-layer vectors, **on purpose** (spec §5): the layer is public-domain (a layer assignment, not secrecy), the file is deliberately not gitignored, and README §2 + manifest §2 explain it with the inversion warning. The automated review flagged the standard. No design change; the standard is about genuinely-private data | `STANDARDS` reconciliation | Add a one-line carve-out at merge (prepared below) so the demo's public-domain exception is named, not re-flagged | diff --git a/scaling/README.md b/scaling/README.md index c28d686..acd44e7 100644 --- a/scaling/README.md +++ b/scaling/README.md @@ -1,18 +1,27 @@ # The int8 scaling demo The result this module is built to produce is a **caught failure**: the same -gold suite that owns grounding and refusal rejecting a cheaper encoding. Run the -quantizer at int4 (or with a lowered floor) and a route case flips, the private -note loses the top slot to a public record, and the suite catches it. That is -the point. "int8 held" on a small corpus is expected and proves little on its -own; the gate saying *no* when pushed is what shows the gold suite, not the -encoding, is the adjudicator. +gold suite that owns grounding and refusal rejecting a cheaper encoding. The +*mechanism* is proven offline, in `quantize.test.ts` (run by `npm test`): on +fixture vectors searched to exhibit a near-tie, int8 preserves both the route +and the disambiguation winner, int4 flips the top slot, and the gate catches it. + +Whether the **real Smith corpus** produces that flip at the int8/int4 boundary +is a separate, empirical question, settled by the build run, not asserted here. 
+"int8 held" on a small corpus is expected and proves little on its own; the gate +saying *no* when pushed is what shows the gold suite, not the encoding, is the +adjudicator. So: the mechanism is demonstrated; the real-corpus demonstration is +pending. + +The committed vectors are not built yet (this module was written with no network +and no key), so `npm run scaling:run` errors with a build pointer until then; +see **Build status**. Once built: ``` -npm run scaling:run # int8 on the real corpus: the headline, keyless -npm run scaling:run -- --bits 4 # int4: the gate rejects the route flip -npm run scaling:run -- --natural+synthetic # add the quarantined spire + its gold -npm run scaling:run -- --full # also run the answer-mode pass (needs a key) +npm run scaling:run # int8, real corpus: the headline, keyless +npm run scaling:run -- --natural+synthetic # add the spire and its gold +npm run scaling:run -- --natural+synthetic --bits 4 # int4: the gate rejects the spire's route flip +npm run scaling:run -- --full # also run the answer-mode pass (needs a key) ``` ## What it is @@ -52,9 +61,13 @@ Two facts make int8 admissible, and they differ in kind (the §6 split): harness reports rank correlation against the full-precision ranking, then runs the gold suite. Rank correlation is *necessary, not sufficient*: a demo that reports it and stops has shown a retrieval benchmark, not answerability - governing tuning. The refuse and route cases are the actual adjudicator. Past - int8 (int4, PQ, binary) the exact part stops applying and the whole lever is - measured; the wire format is versioned so a code/data mismatch fails loudly. + governing tuning. 
The gold suite is the actual adjudicator, and it checks not + just that the expected source is *retrieved* but that it *wins the top slot*: + so a quantization flip that swaps which Smith ranks first (disambiguation) or + lets a public record overtake the private note (route) is caught keyless, not + only by the keyed answer pass. Past int8 (int4, PQ, binary) the exact part + stops applying and the whole lever is measured; the wire format is versioned + so a code/data mismatch fails loudly. The headline run is **keyless**: it reads committed full-precision vectors and committed gold-query vectors, so no embedding call is made. A key is needed only @@ -74,7 +87,8 @@ encoding never exercises. under `--natural+synthetic`, each marked `synthetic: true` and naming the edge it tests. It is additive and never enters the headline metrics; the spire's effect is reported on its own line. No fabricated words are ever - attributed to the real Adam Smith. + passed off as either real Smith's writing: the spire is George-framed but + flagged, and nothing fabricated is presented as the actual work of either man. 3. **The claim is relative.** int8 preserves the verdicts full-precision produces; the corpus is not offered as realistic and nothing turns on its realism. @@ -98,6 +112,24 @@ bodies and the committed vectors (`corpus/index.json`, [`docs/scaling-demo/build-handoff.md`](../docs/scaling-demo/build-handoff.md) for the exact steps, and the delta log for what is confirmed versus pending. +## The spec and the log are kept in the open + +The planning docs live beside the module in +[`docs/scaling-demo/`](../docs/scaling-demo/), kept on purpose rather than +discarded once the code landed: + +- `SCALING-DEMO-spec.md`: what the demo set out to do, and why; the ticket it was + built from. 
+- `scaling-demo-delta-log.md`: every place the build diverged from that spec, + what is settled versus pending the keyed build run, and the prepared + reconciliations (NEXT-STEPS, STANDARDS, the paper) to apply at merge. +- `build-handoff.md`: the brief for the build run that fetches the public-domain + texts and generates the committed vectors. + +This is the same move the corpus manifest makes: the reasoning behind the +artifact is part of the artifact. A reader can see what was intended, where +reality differed, and which decisions are still owed. + ## Relation to production This is the runnable counterpart to the prose in `docs/production-scaling.md` diff --git a/scaling/corpus/README.md b/scaling/corpus/README.md index 3eb9e08..74f0c3d 100644 --- a/scaling/corpus/README.md +++ b/scaling/corpus/README.md @@ -48,7 +48,7 @@ A record's citation URL is constructed by the reused `src/corpus.ts` path (`base **2. "Private" is a layer assignment, not a claim of secrecy.** George was a public figure and all his work is published; designating some of it private means only that *the type cannot carry its text to the model*, regardless of what the text is. The whole repo works this way (the default example corpus is synthetic "Person A"). Everything here is exposed in the repo on purpose: seeing the full private text, then watching the type admit only its routing hint, is the demonstration, not a contradiction of it. **This is also why this demo can commit its embedding vectors when the main repo gitignores its index: these vectors derive from public-domain text, so they expose nothing already private. Do not copy "commit your vectors" as a general pattern: embeddings of genuinely private text can be inverted to recover approximate content, which is the exposure the main repo's gitignored index avoids.** -**3. 
No fabricated words are attributed to the real Adam Smith, and synthetic notes are flagged in the data.** Every fabricated note lives in the quarantined `synthetic/` directory and names the edge case it tests, so nothing can be mistaken for either Smith's actual writing even lifted out of context. Real George material is handled as George's; synthetic is never confusable with it. +**3. No fabricated words are passed off as either real Smith's writing, and synthetic notes are flagged in the data.** Every fabricated note lives in the quarantined `synthetic/` directory, carries `synthetic: true`, and names the edge case it tests, so nothing can be mistaken for either Smith's actual writing even lifted out of context. Real George material is handled as George's; synthetic is never confusable with it. **4. The corpus is not tuned so int8 passes.** Headline numbers come from the real-only (`--natural`) run. The demo deliberately *includes a failure*: a tightened encoding (int4, or a lowered floor) breaking a route case, caught by the gold suite. Shipping a caught failure is the opposite of tuning to pass; it is how the demo shows the gate can say no. 
diff --git a/scaling/corpus/synthetic/syn-amos-justice-margin.md b/scaling/corpus/synthetic/syn-amos-justice-margin.md index 1c75914..343808f 100644 --- a/scaling/corpus/synthetic/syn-amos-justice-margin.md +++ b/scaling/corpus/synthetic/syn-amos-justice-margin.md @@ -1,5 +1,5 @@ --- -title: "George Adam Smith — private note on Amos and the justice of God" +title: "Synthetic demo note (fabricated, not George Adam Smith): Amos and the justice of God" about: https://en.wikipedia.org/wiki/George_Adam_Smith locator: "study marginalia, Amos" synthetic: true diff --git a/scaling/scaling.config.ts b/scaling/scaling.config.ts index df08739..8dc3508 100644 --- a/scaling/scaling.config.ts +++ b/scaling/scaling.config.ts @@ -46,6 +46,7 @@ export const config: ArchiveConfig = { // The quarantined synthetic spire (scaling/corpus/synthetic/) is loaded as an // ADDITIONAL private-notes dir only under --natural+synthetic, never here. Its -// location is the flag: nothing in scaling/corpus/synthetic/ is real George -// text. See scaling/run.ts and README §3. +// location is one flag and each file also carries `synthetic: true` in +// frontmatter (a second flag the PrivateNote type ignores): nothing in +// scaling/corpus/synthetic/ is real George text. See scaling/run.ts and README §3. export const SYNTHETIC_NOTES_DIR = './scaling/corpus/synthetic'; From 74371acdca848f72eb57028744f5a15a1f5cf3a3 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 16 Jun 2026 05:14:19 +0000 Subject: [PATCH 10/10] refactor(demo): rename scaling/ -> demo/ (it is a demo, not a subsystem) The module read like a package/subsystem; it is a demo script. Renames the top-level dir scaling/ -> demo/, the npm scripts to demo:build / demo:run / demo:test, scaling.config.ts -> demo/config.ts, and every path/script reference in the module and the operational docs (build-handoff, delta log). Prose like "the int8 scaling demo" is left as description. 
The historical spec and corpus draft keep the original scaling/ name as the proposal; delta-log row 16 bridges it. tsconfig + the npm test glob updated; 40 tests pass, typecheck clean, demo:run degrades cleanly with the new paths. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1 --- {scaling => demo}/README.md | 14 +++--- {scaling => demo}/build.ts | 26 +++++------ scaling/scaling.config.ts => demo/config.ts | 18 ++++---- {scaling => demo}/corpus/README.md | 2 +- {scaling => demo}/corpus/private/.gitkeep | 0 .../corpus/public/adam-smith/.gitkeep | 0 .../corpus/public/george-adam-smith/.gitkeep | 0 .../synthetic/syn-amos-justice-margin.md | 0 {scaling => demo}/gold.synthetic.yaml | 2 +- {scaling => demo}/gold.yaml | 6 +-- {scaling => demo}/harness.ts | 4 +- {scaling => demo}/quantize.test.ts | 0 {scaling => demo}/quantize.ts | 4 +- {scaling => demo}/query-vectors.test.ts | 0 {scaling => demo}/query-vectors.ts | 8 ++-- {scaling => demo}/run.ts | 28 ++++++------ docs/scaling-demo/build-handoff.md | 30 ++++++------- docs/scaling-demo/scaling-demo-delta-log.md | 43 ++++++++++--------- package.json | 8 ++-- tsconfig.json | 2 +- 20 files changed, 98 insertions(+), 97 deletions(-) rename {scaling => demo}/README.md (92%) rename {scaling => demo}/build.ts (85%) rename scaling/scaling.config.ts => demo/config.ts (79%) rename {scaling => demo}/corpus/README.md (93%) rename {scaling => demo}/corpus/private/.gitkeep (100%) rename {scaling => demo}/corpus/public/adam-smith/.gitkeep (100%) rename {scaling => demo}/corpus/public/george-adam-smith/.gitkeep (100%) rename {scaling => demo}/corpus/synthetic/syn-amos-justice-margin.md (100%) rename {scaling => demo}/gold.synthetic.yaml (94%) rename {scaling => demo}/gold.yaml (95%) rename {scaling => demo}/harness.ts (98%) rename {scaling => demo}/quantize.test.ts (100%) rename {scaling => demo}/quantize.ts (95%) rename {scaling => demo}/query-vectors.test.ts (100%) rename {scaling => demo}/query-vectors.ts (89%) rename {scaling 
=> demo}/run.ts (89%) diff --git a/scaling/README.md b/demo/README.md similarity index 92% rename from scaling/README.md rename to demo/README.md index acd44e7..c10ed21 100644 --- a/scaling/README.md +++ b/demo/README.md @@ -14,14 +14,14 @@ adjudicator. So: the mechanism is demonstrated; the real-corpus demonstration is pending. The committed vectors are not built yet (this module was written with no network -and no key), so `npm run scaling:run` errors with a build pointer until then; +and no key), so `npm run demo:run` errors with a build pointer until then; see **Build status**. Once built: ``` -npm run scaling:run # int8, real corpus: the headline, keyless -npm run scaling:run -- --natural+synthetic # add the spire and its gold -npm run scaling:run -- --natural+synthetic --bits 4 # int4: the gate rejects the spire's route flip -npm run scaling:run -- --full # also run the answer-mode pass (needs a key) +npm run demo:run # int8, real corpus: the headline, keyless +npm run demo:run -- --natural+synthetic # add the spire and its gold +npm run demo:run -- --natural+synthetic --bits 4 # int4: the gate rejects the spire's route flip +npm run demo:run -- --full # also run the answer-mode pass (needs a key) ``` ## What it is @@ -71,7 +71,7 @@ Two facts make int8 admissible, and they differ in kind (the §6 split): The headline run is **keyless**: it reads committed full-precision vectors and committed gold-query vectors, so no embedding call is made. A key is needed only -to regenerate the vectors (`scaling:build`) or to run the `--full` answer pass. +to regenerate the vectors (`demo:build`) or to run the `--full` answer pass. That answer pass exercises route *selection*, which is what quantization moves; it does not touch A2, the answer model's confabulation residue, which the encoding never exercises. @@ -107,7 +107,7 @@ The code, the gold set, the provenance manifest, and the deterministic harness tests (`quantize.test.ts`, run by `npm test`) are committed. 
The real text bodies and the committed vectors (`corpus/index.json`, `corpus/index.synthetic.json`, `corpus/query-vectors.json`) are produced by -`scaling:build`, which needs network access to the public-domain sources and an +`demo:build`, which needs network access to the public-domain sources and an `OPENAI_API_KEY`; the session that wrote the module had neither. See [`docs/scaling-demo/build-handoff.md`](../docs/scaling-demo/build-handoff.md) for the exact steps, and the delta log for what is confirmed versus pending. diff --git a/scaling/build.ts b/demo/build.ts similarity index 85% rename from scaling/build.ts rename to demo/build.ts index ffd8da6..1c9b590 100644 --- a/scaling/build.ts +++ b/demo/build.ts @@ -1,13 +1,13 @@ -// npm run scaling:build — embed the scaling corpus and the gold queries, then +// npm run demo:build — embed the scaling corpus and the gold queries, then // commit the vectors. KEYED and run once (or after corpus edits): needs network // to the embedding API and an OPENAI_API_KEY. The session that wrote this code // had neither; see docs/scaling-demo/build-handoff.md. // // Reuses the core corpus loaders, embedding, and store writers untouched. The -// only thing new is pointing them at scaling/corpus/ and splitting the output +// only thing new is pointing them at demo/corpus/ and splitting the output // into the natural index (the headline source of truth), the synthetic spire // (a strictly baseline-plus-delta file, unioned only under --natural+synthetic), -// and the committed gold-query vectors (what makes scaling:run keyless). +// and the committed gold-query vectors (what makes demo:run keyless). import { createHash } from 'node:crypto'; import { existsSync } from 'node:fs'; @@ -19,13 +19,13 @@ import { batchInputs, embedBatch, truncateForEmbedding } from '../src/embedding. 
import { assertHomogeneousIndex, writeIndexFile } from '../src/store.js'; import type { ArchiveConfig, IndexEntry, PrivateNote } from '../src/types.js'; import { loadGold } from '../src/evaluate.js'; -import { config, SYNTHETIC_NOTES_DIR } from './scaling.config.js'; +import { config, SYNTHETIC_NOTES_DIR } from './config.js'; import { writeQueryVectors } from './query-vectors.js'; -const NATURAL_INDEX = resolve('scaling/corpus/index.json'); -const SYNTHETIC_INDEX = resolve('scaling/corpus/index.synthetic.json'); -const NATURAL_GOLD = resolve('scaling/gold.yaml'); -const SYNTHETIC_GOLD = resolve('scaling/gold.synthetic.yaml'); +const NATURAL_INDEX = resolve('demo/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('demo/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('demo/gold.yaml'); +const SYNTHETIC_GOLD = resolve('demo/gold.synthetic.yaml'); function contentHash(text: string): string { return createHash('sha1').update(truncateForEmbedding(text)).digest('hex').slice(0, 16); @@ -82,7 +82,7 @@ function noteEntries(notes: PrivateNote[], vectors: Map): Inde async function main(): Promise<void> { if (!process.env.OPENAI_API_KEY) { - throw new Error('OPENAI_API_KEY is not set. scaling:build needs it to embed (see build-handoff.md).'); + throw new Error('OPENAI_API_KEY is not set. demo:build needs it to embed (see build-handoff.md).'); } const client = new OpenAI(); @@ -94,7 +94,7 @@ async function main(): Promise<void> { `${syntheticNotes.length} synthetic notes`, ); if (records.length === 0) { - throw new Error('No records found under scaling/corpus/public — populate it first (build-handoff.md §1).'); + throw new Error('No records found under demo/corpus/public — populate it first (build-handoff.md §1).'); } // Gold queries: natural always, synthetic if authored. @@ -135,7 +135,7 @@ async function main(): Promise<void> { console.log('No synthetic notes authored yet; skipping the spire index.'); } - // Committed gold-query vectors (what makes scaling:run keyless).
Every gold + // Committed gold-query vectors (what makes demo:run keyless). Every gold // query must embed, or the keyless runner would later fail on a missing id. const queryVectors = goldQueries.map((g) => { const vector = vectors.get(`query:${g.id}`); @@ -145,10 +145,10 @@ async function main(): Promise { const dims = queryVectors[0]?.vector.length ?? naturalEntries[0]?.dimensions ?? 0; writeQueryVectors(config.embeddingModel, dims, queryVectors); console.log(`Wrote ${queryVectors.length} gold-query vectors`); - console.log('Done. Commit the *.json artifacts, then `npm run scaling:run`.'); + console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.'); } main().catch((err) => { - console.error(`scaling:build failed: ${err instanceof Error ? err.message : err}`); + console.error(`demo:build failed: ${err instanceof Error ? err.message : err}`); process.exitCode = 1; }); diff --git a/scaling/scaling.config.ts b/demo/config.ts similarity index 79% rename from scaling/scaling.config.ts rename to demo/config.ts index 8dc3508..29b6ffe 100644 --- a/scaling/scaling.config.ts +++ b/demo/config.ts @@ -1,9 +1,9 @@ -// scaling.config.ts — points the engine at the int8 scaling-demo corpus. +// config.ts — points the engine at the int8 scaling-demo corpus. // // This is the same ArchiveConfig shape the core uses (src/types.ts), pointed at -// scaling/corpus/ instead of example-content/. The demo reuses the core +// demo/corpus/ instead of example-content/. The demo reuses the core // retrieval, the no-leak boundary, and the eval judges untouched; only the -// corpus, the gold set, and a thin int8 pass are new (see scaling/README.md). +// corpus, the gold set, and a thin int8 pass are new (see demo/README.md). // // Two authors share one colliding name on purpose: Adam Smith the economist // (1723-1790) and George Adam Smith the theologian (1856-1942). 
Both write @@ -16,7 +16,7 @@ // On URLs: a record's citation URL is built by the reused corpus path // (baseUrl + urlPrefix + slug), so it is a demo-canonical surface under the // reserved .example TLD (RFC 2606), not a live page. The real public-domain -// sources live in scaling/corpus/README.md's provenance table, per work. A +// sources live in demo/corpus/README.md's provenance table, per work. A // private note's `about` is taken verbatim from frontmatter, so those route // targets ARE real public George pages. See the delta log for this divergence // from the spec's "records carry real public URLs" assumption and why it keeps @@ -28,7 +28,7 @@ export const config: ArchiveConfig = { archiveName: 'Smith Collection (int8 scaling demo)', authorName: 'Adam Smith and George Adam Smith', baseUrl: 'https://smith-collection.example', - contentRoot: './scaling/corpus', + contentRoot: './demo/corpus', collections: [ { dir: 'public/adam-smith', urlPrefix: '/adam-smith/', type: 'adam-smith' }, { dir: 'public/george-adam-smith', urlPrefix: '/george/', type: 'george-adam-smith' }, @@ -36,7 +36,7 @@ export const config: ArchiveConfig = { // The private layer: George's minor works (sermons, addresses), searchable // but never quotable. Designating published work "private" is a layer // assignment enforced by the type, not a claim of secrecy (README §2). - privateNotesDir: './scaling/corpus/private', + privateNotesDir: './demo/corpus/private', // Matches archive.config.ts. The int8 demo depends on this: the committed // vectors must be text-embedding-3-large at native dimensionality or the // homogeneity invariant (src/store.ts) rejects them. @@ -44,9 +44,9 @@ export const config: ArchiveConfig = { answerModel: 'gpt-4o-mini', }; -// The quarantined synthetic spire (scaling/corpus/synthetic/) is loaded as an +// The quarantined synthetic spire (demo/corpus/synthetic/) is loaded as an // ADDITIONAL private-notes dir only under --natural+synthetic, never here. 
Its // location is one flag and each file also carries `synthetic: true` in // frontmatter (a second flag the PrivateNote type ignores): nothing in -// scaling/corpus/synthetic/ is real George text. See scaling/run.ts and README §3. -export const SYNTHETIC_NOTES_DIR = './scaling/corpus/synthetic'; +// demo/corpus/synthetic/ is real George text. See demo/run.ts and README §3. +export const SYNTHETIC_NOTES_DIR = './demo/corpus/synthetic'; diff --git a/scaling/corpus/README.md b/demo/corpus/README.md similarity index 93% rename from scaling/corpus/README.md rename to demo/corpus/README.md index 74f0c3d..8a1b779 100644 --- a/scaling/corpus/README.md +++ b/demo/corpus/README.md @@ -15,7 +15,7 @@ Both write dense moral prose about justice, society, and ethics, so the two bodi ## Build status -The text bodies and the embedding vectors are produced by `scaling/build.ts`, which needs network access to the public-domain sources and an `OPENAI_API_KEY`. The code, the structure, the gold set, the provenance table below, and the deterministic harness tests are authored and committed; the real bodies and the committed `index.json` / `query-vectors.json` are populated by a build run with those two things. See `docs/scaling-demo/build-handoff.md` for the exact build steps. **Every ID and date below is a claim to verify against the live source during that run, not a confirmation made here.** +The text bodies and the embedding vectors are produced by `demo/build.ts`, which needs network access to the public-domain sources and an `OPENAI_API_KEY`. The code, the structure, the gold set, the provenance table below, and the deterministic harness tests are authored and committed; the real bodies and the committed `index.json` / `query-vectors.json` are populated by a build run with those two things. See `docs/scaling-demo/build-handoff.md` for the exact build steps. 
**Every ID and date below is a claim to verify against the live source during that run, not a confirmation made here.** ## Provenance and public-domain status diff --git a/scaling/corpus/private/.gitkeep b/demo/corpus/private/.gitkeep similarity index 100% rename from scaling/corpus/private/.gitkeep rename to demo/corpus/private/.gitkeep diff --git a/scaling/corpus/public/adam-smith/.gitkeep b/demo/corpus/public/adam-smith/.gitkeep similarity index 100% rename from scaling/corpus/public/adam-smith/.gitkeep rename to demo/corpus/public/adam-smith/.gitkeep diff --git a/scaling/corpus/public/george-adam-smith/.gitkeep b/demo/corpus/public/george-adam-smith/.gitkeep similarity index 100% rename from scaling/corpus/public/george-adam-smith/.gitkeep rename to demo/corpus/public/george-adam-smith/.gitkeep diff --git a/scaling/corpus/synthetic/syn-amos-justice-margin.md b/demo/corpus/synthetic/syn-amos-justice-margin.md similarity index 100% rename from scaling/corpus/synthetic/syn-amos-justice-margin.md rename to demo/corpus/synthetic/syn-amos-justice-margin.md diff --git a/scaling/gold.synthetic.yaml b/demo/gold.synthetic.yaml similarity index 94% rename from scaling/gold.synthetic.yaml rename to demo/gold.synthetic.yaml index 20f633e..25ed388 100644 --- a/scaling/gold.synthetic.yaml +++ b/demo/gold.synthetic.yaml @@ -1,5 +1,5 @@ # Expanded gold for --natural+synthetic. Loaded ONLY alongside the quarantined -# synthetic spire (scaling/corpus/synthetic/). Because the spire is fabricated, +# synthetic spire (demo/corpus/synthetic/). Because the spire is fabricated, # these cases never touch the headline (--natural) numbers; the runner reports # the spire's effect on its own line, broken out, so a reader can tell whether # int8 held because the encoding is sound or because notes were hand-placed. 
diff --git a/scaling/gold.yaml b/demo/gold.yaml similarity index 95% rename from scaling/gold.yaml rename to demo/gold.yaml index d7e415a..53ec89d 100644 --- a/scaling/gold.yaml +++ b/demo/gold.yaml @@ -7,11 +7,11 @@ # floor or a route case comfortably clear proves nothing about quantization; # the marginal cases are the whole point. # -# These run against scaling/corpus/index.json (committed FP vectors), quantized -# in process. The harness (scaling/run.ts) checks each case keylessly at the +# These run against demo/corpus/index.json (committed FP vectors), quantized +# in process. The harness (demo/run.ts) checks each case keylessly at the # retrieval tier and, with --full and a key, the answer-mode tier too. Source # ids are `${type}:${slug}` for records and `note:${slug}` for private notes, -# matching the corpus files in scaling/corpus/ (see docs/scaling-demo/build-handoff.md). +# matching the corpus files in demo/corpus/ (see docs/scaling-demo/build-handoff.md). # # Queries name each Smith explicitly rather than using {{author}}, because the # demo's whole subject is which Smith a question means. diff --git a/scaling/harness.ts b/demo/harness.ts similarity index 98% rename from scaling/harness.ts rename to demo/harness.ts index 95bbfb0..2de4c9d 100644 --- a/scaling/harness.ts +++ b/demo/harness.ts @@ -1,4 +1,4 @@ -// scaling/harness.ts — the int8 gate, as pure logic the CLI drives. +// demo/harness.ts — the int8 gate, as pure logic the CLI drives. 
// // Reuses the core retrieval (src/retrieve.ts) and the gold judge // (src/evaluate.ts) untouched: the int8 path is an encode/decode wrapper plus a @@ -171,7 +171,7 @@ export function runGate( const results: QueryGateResult[] = []; for (const g of gold) { const qv = queryVectorById.get(g.id); - if (!qv) throw new Error(`no query vector for gold id '${g.id}' (rebuild scaling:build?)`); + if (!qv) throw new Error(`no query vector for gold id '${g.id}' (rebuild demo:build?)`); results.push(evaluateQuery(g, index, quantIndex, qv)); } const passed = results.filter((r) => r.pass).length; diff --git a/scaling/quantize.test.ts b/demo/quantize.test.ts similarity index 100% rename from scaling/quantize.test.ts rename to demo/quantize.test.ts diff --git a/scaling/quantize.ts b/demo/quantize.ts similarity index 95% rename from scaling/quantize.ts rename to demo/quantize.ts index cc7a04c..a3dca71 100644 --- a/scaling/quantize.ts +++ b/demo/quantize.ts @@ -1,9 +1,9 @@ -// scaling/quantize.ts — scalar quantization for the int8 demo. +// demo/quantize.ts — scalar quantization for the int8 demo. // // The public, runnable twin of the production site adapter's vector-quant.ts // (named in docs/production-scaling.md §2; that adapter is not a public repo). // Same scheme: per-vector symmetric scalar quantization. The full-precision -// vectors stay the source of truth (scaling/corpus/index.json); the demo +// vectors stay the source of truth (demo/corpus/index.json); the demo // quantizes them in process, re-ranks, and lets the gold suite judge the result. 
// // Why it is admissible, in two parts of different kinds (the paper's §6 split): diff --git a/scaling/query-vectors.test.ts b/demo/query-vectors.test.ts similarity index 100% rename from scaling/query-vectors.test.ts rename to demo/query-vectors.test.ts diff --git a/scaling/query-vectors.ts b/demo/query-vectors.ts similarity index 89% rename from scaling/query-vectors.ts rename to demo/query-vectors.ts index def941c..75679d7 100644 --- a/scaling/query-vectors.ts +++ b/demo/query-vectors.ts @@ -1,8 +1,8 @@ -// scaling/query-vectors.ts — the committed gold-query embeddings. +// demo/query-vectors.ts — the committed gold-query embeddings. // // The core eval CLI (src/cli/eval.ts) embeds every gold query at run time, so // it always needs a key. The demo's headline must reproduce WITHOUT one, so the -// gold-query vectors are precomputed by scaling:build and committed here beside +// gold-query vectors are precomputed by demo:build and committed here beside // the index. The runner reads them instead of calling the embedding API; a key // is only ever needed to regenerate them or to run the --full answer pass. // @@ -13,7 +13,7 @@ import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs'; import { dirname, resolve } from 'node:path'; -export const QUERY_VECTORS_PATH = resolve('scaling/corpus/query-vectors.json'); +export const QUERY_VECTORS_PATH = resolve('demo/corpus/query-vectors.json'); export const QUERY_VECTORS_VERSION = 1; export interface QueryVectorsFile { @@ -29,7 +29,7 @@ export interface LoadedQueryVectors { byId: Map; } -const REBUILD = 'Run `npm run scaling:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).'; +const REBUILD = 'Run `npm run demo:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).'; /** Read the committed query vectors, or null if not built yet. Throws on a * present-but-malformed file so a corrupt artifact fails loudly with a remedy. 
*/ diff --git a/scaling/run.ts b/demo/run.ts similarity index 89% rename from scaling/run.ts rename to demo/run.ts index 249f517..a59d0ae 100644 --- a/scaling/run.ts +++ b/demo/run.ts @@ -1,4 +1,4 @@ -// npm run scaling:run — quantize the committed index in process, re-rank, and +// npm run demo:run — quantize the committed index in process, re-rank, and // run the full gold suite against the quantized index. // // --natural (default) real corpus only; owns the headline numbers. @@ -9,7 +9,7 @@ // The headline run is keyless: it reads committed FP vectors and committed // gold-query vectors, quantizes in process, and judges with the reused gold // logic. --full adds the answer model, which is the only part that needs a key. -// See scaling/README.md and docs/scaling-demo/build-handoff.md. +// See demo/README.md and docs/scaling-demo/build-handoff.md. import { resolve } from 'node:path'; @@ -20,10 +20,10 @@ import type { IndexEntry } from '../src/types.js'; import { requantizeIndex, runGate } from './harness.js'; import { readQueryVectors } from './query-vectors.js'; -const NATURAL_INDEX = resolve('scaling/corpus/index.json'); -const SYNTHETIC_INDEX = resolve('scaling/corpus/index.synthetic.json'); -const NATURAL_GOLD = resolve('scaling/gold.yaml'); -const SYNTHETIC_GOLD = resolve('scaling/gold.synthetic.yaml'); +const NATURAL_INDEX = resolve('demo/corpus/index.json'); +const SYNTHETIC_INDEX = resolve('demo/corpus/index.synthetic.json'); +const NATURAL_GOLD = resolve('demo/gold.yaml'); +const SYNTHETIC_GOLD = resolve('demo/gold.synthetic.yaml'); interface RunArgs { synthetic: boolean; @@ -56,7 +56,7 @@ function parseArgs(argv: string[]): RunArgs { case '--help': case '-h': console.log( - 'scaling:run [--natural | --natural+synthetic] [--bits ] [--full]\n' + + 'demo:run [--natural | --natural+synthetic] [--bits ] [--full]\n' + ' --natural real corpus only (default); owns the headline numbers\n' + ' --natural+synthetic add the quarantined synthetic spire + its gold\n' 
+ ' --bits quantization width (default 8; 4 is the int4 scalpel)\n' + @@ -76,7 +76,7 @@ function loadIndex(synthetic: boolean): IndexEntry[] { if (natural.length === 0) { throw new Error( `no committed vectors at ${NATURAL_INDEX}. ` + - 'Run `npm run scaling:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).', + 'Run `npm run demo:build` with an OPENAI_API_KEY (see docs/scaling-demo/build-handoff.md).', ); } if (!synthetic) { @@ -87,7 +87,7 @@ function loadIndex(synthetic: boolean): IndexEntry[] { if (spire.length === 0) { throw new Error( `--natural+synthetic needs the spire at ${SYNTHETIC_INDEX}, which is not built yet ` + - '(author the synthetic notes, then `npm run scaling:build`).', + '(author the synthetic notes, then `npm run demo:build`).', ); } const union = [...natural, ...spire]; @@ -105,7 +105,7 @@ function loadGoldSet(synthetic: boolean, author: string): GoldQuery[] { async function main(): Promise<void> { const args = parseArgs(process.argv.slice(2)); - const { config } = await import('./scaling.config.js'); + const { config } = await import('./config.js'); const index = loadIndex(args.synthetic); const gold = loadGoldSet(args.synthetic, config.authorName); @@ -113,7 +113,7 @@ async function main(): Promise<void> { const qv = readQueryVectors(); if (!qv) { throw new Error( - 'no committed query vectors. Run `npm run scaling:build` with an OPENAI_API_KEY ' + + 'no committed query vectors. Run `npm run demo:build` with an OPENAI_API_KEY ' + '(see docs/scaling-demo/build-handoff.md).', ); } @@ -121,14 +121,14 @@ async function main(): Promise<void> { if (qv.model !== spec.model || qv.dimensions !== spec.dimensions) { throw new Error( `query vectors (${qv.model}/${qv.dimensions}) do not match the index ` + - `(${spec.model}/${spec.dimensions}); rebuild both with scaling:build.`, + `(${spec.model}/${spec.dimensions}); rebuild both with demo:build.`, ); } // Say plainly what this run IS, so a reader knows what they are looking at.
const label = args.synthetic ? '--natural+synthetic' : '--natural'; const shipped = args.bits === 8; - console.log('scaling:run — int8 quantization gate (Smith collection)'); + console.log('demo:run — int8 quantization gate (Smith collection)'); console.log( ` encoding: int${args.bits} ` + (shipped @@ -236,6 +236,6 @@ async function runAnswerPass( } main().catch((err) => { - console.error(`scaling:run failed: ${err instanceof Error ? err.message : err}`); + console.error(`demo:run failed: ${err instanceof Error ? err.message : err}`); process.exitCode = 1; }); diff --git a/docs/scaling-demo/build-handoff.md b/docs/scaling-demo/build-handoff.md index 6d0066c..cf794cc 100644 --- a/docs/scaling-demo/build-handoff.md +++ b/docs/scaling-demo/build-handoff.md @@ -1,8 +1,8 @@ # Build handoff — populate the scaling corpus and generate the vectors -This is an executable brief for an agent (or person) running in an environment **with network access to the public-domain sources and an `OPENAI_API_KEY`**. The session that built `scaling/` had neither: this repo's egress allowed only GitHub, and `api.openai.com` plus Gutenberg / archive.org were all blocked, so the code, structure, gold set, provenance manifest, and deterministic harness tests are authored and committed, but the real text bodies and the committed embedding vectors are not. This brief produces them. +This is an executable brief for an agent (or person) running in an environment **with network access to the public-domain sources and an `OPENAI_API_KEY`**. The session that built `demo/` had neither: this repo's egress allowed only GitHub, and `api.openai.com` plus Gutenberg / archive.org were all blocked, so the code, structure, gold set, provenance manifest, and deterministic harness tests are authored and committed, but the real text bodies and the committed embedding vectors are not. This brief produces them. 
-Read the spec (`docs/scaling-demo/SCALING-DEMO-spec.md`), the corpus manifest (`scaling/corpus/README.md`), and the delta log (`docs/scaling-demo/scaling-demo-delta-log.md`) first. The frame governs: verify against the live source not against this doc, prefer the smaller change, and **never fabricate words for the real Adam Smith or the real George Adam Smith** — the only authored text is the quarantined synthetic spire. +Read the spec (`docs/scaling-demo/SCALING-DEMO-spec.md`), the corpus manifest (`demo/corpus/README.md`), and the delta log (`docs/scaling-demo/scaling-demo-delta-log.md`) first. The frame governs: verify against the live source not against this doc, prefer the smaller change, and **never fabricate words for the real Adam Smith or the real George Adam Smith** — the only authored text is the quarantined synthetic spire. ## 0. Prerequisites @@ -14,9 +14,9 @@ Read the spec (`docs/scaling-demo/SCALING-DEMO-spec.md`), the corpus manifest (` One markdown file per **short whole unit** (a single prophet exposition, one chapter, one sermon). **Never a whole volume as one file** — a whole volume as one embedding dilutes its topical center (`NEXT-STEPS.md` B3) and washes out the near-ties the demo needs. Watch sermon length specifically: if a sermon is long enough that it would have to be split into windows to retrieve well, that is the **highest-stakes delta** (the demo would then chunk, and "in-memory and unchunked" breaks — log it in delta-log row 4 before doing it). -Slugs are the filename stems and must match `scaling/gold.yaml` exactly. Titles **carry the author's full name on purpose**: that is what makes the partial-name boost edge live (a query naming "Adam Smith" phrase-matches a title containing "George Adam Smith"). Author `themes` honestly from the actual text, **including where they collide** (both Smiths on "justice"); do not curate themes to make disambiguation easy. +Slugs are the filename stems and must match `demo/gold.yaml` exactly. 
Titles **carry the author's full name on purpose**: that is what makes the partial-name boost edge live (a query naming "Adam Smith" phrase-matches a title containing "George Adam Smith"). Author `themes` honestly from the actual text, **including where they collide** (both Smiths on "justice"); do not curate themes to make disambiguation easy. -### Public ledger — Adam Smith (economist), dir `scaling/corpus/public/adam-smith/` +### Public ledger — Adam Smith (economist), dir `demo/corpus/public/adam-smith/` Record frontmatter: `title` (required, lead with "Adam Smith — "), `summary` (or `description`/`meaning`), `themes`. Body: the real unit text, lightly cleaned. @@ -27,7 +27,7 @@ Record frontmatter: `title` (required, lead with "Adam Smith — "), `summary` ( | `wealth-of-nations-division-of-labour` | _Wealth of Nations_, Bk I ch. 1 (division of labour) | labour, economy, society | | `wealth-of-nations-value` | _Wealth of Nations_, Bk I on value / price | value, money, economy | -### Public ledger — George Adam Smith (theologian), dir `scaling/corpus/public/george-adam-smith/` +### Public ledger — George Adam Smith (theologian), dir `demo/corpus/public/george-adam-smith/` Same frontmatter shape; lead titles with "George Adam Smith — ". @@ -40,7 +40,7 @@ Same frontmatter shape; lead titles with "George Adam Smith — ". Note the deliberate theme collision: Amos and Micah carry "justice," which Adam Smith's _Theory of Moral Sentiments_ also carries. That collision is wanted; the gold suite exposes where the theme boost mis-fires. -### Private ledger — George sermons, dir `scaling/corpus/private/` +### Private ledger — George sermons, dir `demo/corpus/private/` These are **real George minor works**, designated private (a layer assignment, not secrecy). Note frontmatter: `title` (the label that travels — keep it public-safe), `about` (a **real** public George page to route to, e.g. 
the work's Wikisource/IA page or `https://en.wikipedia.org/wiki/George_Adam_Smith`), `locator` (where the moment lives, e.g. "Forgiveness of Sins (1905), sermon II"). Body: the real sermon text. The id is `note:`. @@ -55,7 +55,7 @@ Confirm the actual sermon titles from the volume and rename slugs to match if ne ## 2. Author the synthetic spire (only if the deliberate failure needs it) The spire is the scalpel for the deliberate failure (step 4), not a corpus filler. Author it **only if** the real route-margin tie does not flip under a tightened encoding on its own. Each synthetic note: -- lives in `scaling/corpus/synthetic/` (the quarantine is one flag) and carries a `synthetic: true` frontmatter marker (a second, in-file flag; the `PrivateNote` type does not read it, so it changes nothing the engine sees), +- lives in `demo/corpus/synthetic/` (the quarantine is one flag) and carries a `synthetic: true` frontmatter marker (a second, in-file flag; the `PrivateNote` type does not read it, so it changes nothing the engine sees), - is a fabricated **George-private** note (never a third Smith, never words for the real Adam Smith), - carries a one-line comment at the top of the body naming the gold case, the margin, and the mode it targets, - is skewed toward must-refuse / route-flip, never an extra must-answer win. @@ -64,22 +64,22 @@ Suggested first spire note: `syn-amos-justice-margin` — a fabricated George no ## 3. Generate the committed vectors -`npm run scaling:build` (added in `package.json`) reads the corpus through the reused `buildCorpus` / `buildPrivateNotes`, embeds with the configured model, embeds the gold queries, and writes: -- `scaling/corpus/index.json` — natural FP vectors (records + real private notes). The headline source of truth; committed. -- `scaling/corpus/index.synthetic.json` — the spire delta (synthetic notes only), unioned under `--natural+synthetic`. 
-- `scaling/corpus/query-vectors.json` — the gold-query vectors that make `scaling:run` keyless. +`npm run demo:build` (added in `package.json`) reads the corpus through the reused `buildCorpus` / `buildPrivateNotes`, embeds with the configured model, embeds the gold queries, and writes: +- `demo/corpus/index.json` — natural FP vectors (records + real private notes). The headline source of truth; committed. +- `demo/corpus/index.synthetic.json` — the spire delta (synthetic notes only), unioned under `--natural+synthetic`. +- `demo/corpus/query-vectors.json` — the gold-query vectors that make `demo:run` keyless. Commit all three. They derive from public-domain text, so committing them exposes nothing private (manifest §2); do not generalize that to private corpora. ## 4. Run the gate, then calibrate the deliberate failure -1. `npm run scaling:run` (the `--natural` headline, no key needed once vectors are committed). Confirm: rank correlation FP-vs-int8 above the bar, and the full gold suite passes. Record the headline numbers in delta-log row 2 / 7. +1. `npm run demo:run` (the `--natural` headline, no key needed once vectors are committed). Confirm: rank correlation FP-vs-int8 above the bar, and the full gold suite passes. Record the headline numbers in delta-log row 2 / 7. 2. Find the break: re-run at `--bits 4` (int4) or a lowered floor and confirm a **route** case flips and the gold suite **catches it**. Report the spire's effect on its own line, never folded into the headline. **If it does not fire, the near-ties are too loose: tighten the margin (the spire), do NOT add corpus** (delta-log row 3). This caught failure is the result the demo rests on; lead the README with it. -3. Optional keyed bonus: `npm run scaling:run -- --full` runs the answer-mode adjudication (related-material routes without restating). This exercises selection, not A2 — int8 never touches the answer model's confabulation residue. +3. 
Optional keyed bonus: `npm run demo:run -- --full` runs the answer-mode adjudication (related-material routes without restating). This exercises selection, not A2 — int8 never touches the answer model's confabulation residue. ## 5. Verify and reconcile (do these last, from real facts) -- Fill the provenance table OCR-quality notes in `scaling/corpus/README.md` from the actual files; verify every Gutenberg ID and the IA ARK against the live source. +- Fill the provenance table OCR-quality notes in `demo/corpus/README.md` from the actual files; verify every Gutenberg ID and the IA ARK against the live source. - Fill the delta log rows with what the build actually did. Flag any `paper §5-§6` row immediately (especially row 4 if any unit had to be split). -- **Only once `scaling:run` confirms the headline**, apply the deferred `NEXT-STEPS.md` reconciliation (prepared text in the delta log): distinguish the deliberately-simple **core** (full-precision, pulls no levers, indexes documents whole) from the **`scaling/` miniature** (pulls exactly one lever, int8, on a short-whole-unit corpus; explicitly marked), and add the §C1 link to `scaling/`. Do not claim "a runnable miniature ships" until it runs — that honesty is the whole point. +- **Only once `demo:run` confirms the headline**, apply the deferred `NEXT-STEPS.md` reconciliation (prepared text in the delta log): distinguish the deliberately-simple **core** (full-precision, pulls no levers, indexes documents whole) from the **`demo/` miniature** (pulls exactly one lever, int8, on a short-whole-unit corpus; explicitly marked), and add the §C1 link to `demo/`. Do not claim "a runnable miniature ships" until it runs — that honesty is the whole point. - Re-run `npm test` and `npm run typecheck`; both stay green. 
diff --git a/docs/scaling-demo/scaling-demo-delta-log.md b/docs/scaling-demo/scaling-demo-delta-log.md index d5580a7..89ee8b0 100644 --- a/docs/scaling-demo/scaling-demo-delta-log.md +++ b/docs/scaling-demo/scaling-demo-delta-log.md @@ -1,6 +1,6 @@ # Delta log — scaling demo build -The lab notebook for building `scaling/`. The spec states assumptions; the build establishes facts; this log records every place they diverge. Fill it **during** testing, not after — the point is to write the downstream docs once, from ground truth, instead of authoring them under time pressure on merge day. +The lab notebook for building `demo/`. The spec states assumptions; the build establishes facts; this log records every place they diverge. Fill it **during** testing, not after — the point is to write the downstream docs once, from ground truth, instead of authoring them under time pressure on merge day. **Why this exists:** the reconciliation edits (NEXT-STEPS C-intro/C1, the paper §5/§6 line) are *descriptions of what the built demo actually does*. They can't be written accurately before the build, and "verify against the live repo, never against the brief" applies one level up here too. Defer the prose; don't defer the obligation — every row tagged `paper` or `NEXT-STEPS` is a downstream edit that comes due at merge. @@ -9,14 +9,14 @@ The lab notebook for building `scaling/`. The spec states assumptions; the build For each assumption the spec makes, record what the build actually did and what that touches. A row only matters if reality diverged or confirmed-under-doubt. The **Touches** column is the early-warning system: most deltas are `spec` (fix the spec so it stays true) or `nothing`; the ones tagged `paper` are the ones that change a published claim and must not be discovered by a referee. **Touches** values: -- `spec` — correct `SCALING-DEMO-spec.md` / `scaling/corpus/README.md` so they describe the real build. 
+- `spec` — correct `SCALING-DEMO-spec.md` / `demo/corpus/README.md` so they describe the real build. - `NEXT-STEPS` — the C-intro/C1 core-vs-miniature reconciliation depends on this fact. - `paper §5–§6` — changes a claim in the paper (in-memory, unchunked, pulls no levers). Highest stakes. Flag immediately. - `nothing` — confirmed as assumed; log it so you know it was checked. ## Pre-seeded rows (the deltas most likely to surface) -**Build context (read before the rows).** The session that built `scaling/` +**Build context (read before the rows).** The session that built `demo/` had egress to GitHub only: `api.openai.com`, Gutenberg, and archive.org all returned `host_not_allowed`, and no `OPENAI_API_KEY` was set. So the code, the gold set, the provenance manifest, and the deterministic harness tests are @@ -41,46 +41,47 @@ rows about the *mechanism and structure* are settled now. | # | Spec assumption | What the build actually did | Touches | Downstream action | |---|---|---|---|---| | 9 | Spec §2: "records carry real public URLs via the normal record path" | **DIVERGENCE.** Per-unit real URLs do not exist (Gutenberg is work-level) and `src/corpus.ts` is reused untouched, so record citation URLs are constructed demo-canonical (`.example` TLD), symmetric across both authors; the provenance table holds the real sources, and private-note `about` targets ARE real | `spec` | Soften §2 to "demo-canonical citations, real provenance + real route targets" (already stated in `corpus/README.md`) | -| 10 | Spec quotes `NEXT-STEPS.md` §C1 as already saying "a runnable miniature ships at `scaling/`" | The live §C1 has no such line. 
The §C1 link and the C-intro carve-out are **deferred reconciliation** (prepared text below), applied by the build agent **after** `scaling:run` confirms the headline, so "runnable" is verified not asserted | `NEXT-STEPS` | Apply the prepared edit at build, not before | -| 11 | Spec §7: `production-scaling.md` location "unconfirmed (subdir or pending)" | RESOLVED at `docs/production-scaling.md`, em-dashes already thinned (fix 2.4 landed). `scaling/README.md` cross-links it | `spec` | None; resolved | +| 10 | Spec quotes `NEXT-STEPS.md` §C1 as already saying "a runnable miniature ships at `demo/`" | The live §C1 has no such line. The §C1 link and the C-intro carve-out are **deferred reconciliation** (prepared text below), applied by the build agent **after** `demo:run` confirms the headline, so "runnable" is verified not asserted | `NEXT-STEPS` | Apply the prepared edit at build, not before | +| 11 | Spec §7: `production-scaling.md` location "unconfirmed (subdir or pending)" | RESOLVED at `docs/production-scaling.md`, em-dashes already thinned (fix 2.4 landed). `demo/README.md` cross-links it | `spec` | None; resolved | | 12 | The keyless gate catches a quantization flip | `judgeRetrieval` checks presence in top-K only, so it misses a flip where both candidates stay retrieved but swap rank. **Added a keyless top-slot check** (`topSource`): for any non-refusal case with an expected source, that source must WIN the top slot, not merely appear. 
Covers **route** (the private note must outrank the records) and, extended per review, **disambiguation** (the right Smith must outrank the wrong one — otherwise the headline's marquee verdict was protected only by the keyed `--full` pass) | `spec` | Note the top-slot check in the spec's §5 harness description; it is what makes the disambiguation verdict a keyless one | | 13 | The answer-mode pass governs the route/refuse verdicts | The keyless headline covers retrieval + route selection + refuse-by-floor; the answer-mode adjudication (related-material routes without restating) is the `--full` keyed pass. `answerQuestion` short-circuits to not-found on empty evidence, so refuse-by-empty-floor is keyless even under `--full`. Route tests selection, not A2 | `spec` | Clarify the two tiers (keyless retrieval gate vs keyed answer gate) in §5 | | 14 | (build) the corpus and vectors are produced in this session | Blocked by egress (GitHub-only) and a missing key; deferred to a local agent per `build-handoff.md`. Code, structure, gold, and tests committed and green | `nothing` (process) | Run the handoff to complete the demo | -| 15 | `.github/STANDARDS.md` line 51: "Don't leak private embeddings/text into committed artifacts" | The demo commits `scaling/corpus/index.json` with the public-domain George "private"-layer vectors, **on purpose** (spec §5): the layer is public-domain (a layer assignment, not secrecy), the file is deliberately not gitignored, and README §2 + manifest §2 explain it with the inversion warning. The automated review flagged the standard. 
No design change; the standard is about genuinely-private data | `STANDARDS` reconciliation | Add a one-line carve-out at merge (prepared below) so the demo's public-domain exception is named, not re-flagged | +| 15 | `.github/STANDARDS.md` line 51: "Don't leak private embeddings/text into committed artifacts" | The demo commits `demo/corpus/index.json` with the public-domain George "private"-layer vectors, **on purpose** (spec §5): the layer is public-domain (a layer assignment, not secrecy), the file is deliberately not gitignored, and README §2 + manifest §2 explain it with the inversion warning. The automated review flagged the standard. No design change; the standard is about genuinely-private data | `STANDARDS` reconciliation | Add a one-line carve-out at merge (prepared below) so the demo's public-domain exception is named, not re-flagged | +| 16 | Spec §7 proposes the module at a top-level `scaling/` | **Renamed to `demo/`** (npm scripts `demo:build/run/test`) per the author: `scaling/` read like a subsystem; the artifact is a demo. The historical `SCALING-DEMO-spec.md` and `scaling-corpus-README.md` draft keep `scaling/` as the original proposal | `spec` | The spec's `scaling/` references are the pre-rename proposal; this log is the bridge. Update the spec's path words if it is ever revised | ## Merge-day assembly (do this the day the demo lands, while it's hot) Walk the log top to bottom: - Every `spec` row → correct the spec and corpus README so they're true. -- Every `NEXT-STEPS` row → write the C-intro/C1 edit distinguishing core (pulls no levers) from `scaling/` miniature (pulls one, marked), using the actual facts logged. +- Every `NEXT-STEPS` row → write the C-intro/C1 edit distinguishing core (pulls no levers) from `demo/` miniature (pulls one, marked), using the actual facts logged. - Every `paper §5–§6` row → write the one-line bridge so §5's "in-memory and unchunked … pulls none of these levers" reads as describing the core. 
**If row 4 fired (a unit was split), this is no longer one line — the unchunked claim itself needs revisiting.** - Confirm the anonymization checklist still covers any new identifying surface the demo added. The reconciliation is then assembly from recorded facts, not authorship under pressure. That was the point of keeping the log. -## Prepared reconciliation text (apply at build, once `scaling:run` confirms the headline) +## Prepared reconciliation text (apply at build, once `demo:run` confirms the headline) These edits describe what the demo *does*. They are held here, not applied, because the demo is not runnable until the vectors are built (rows 2, 14). Apply -them only after `scaling:run --natural` confirms the headline, so "a runnable +them only after `demo:run --natural` confirms the headline, so "a runnable miniature ships" is verified, not asserted. **`NEXT-STEPS.md` §C-intro** (row 10). It currently reads: "This repository is full-precision and indexes documents whole; it pulls none of these levers." -Once `scaling/` lands, the repo contains int8 code, so distinguish core from +Once `demo/` lands, the repo contains int8 code, so distinguish core from miniature, for example: > This repository's **core** is full-precision and indexes documents whole; it > pulls none of these levers. The one exception is the marked illustration at -> `scaling/`: a runnable int8 miniature on a short-whole-unit public-domain +> `demo/`: a runnable int8 miniature on a short-whole-unit public-domain > corpus, which pulls exactly one lever (int8 quantization) to show the gold -> suite gating it. The core's claims stay true of the core; `scaling/` is named +> suite gating it. The core's claims stay true of the core; `demo/` is named > as the exception. (It still indexes short units *whole*, so "indexes documents > whole" holds; only the lever claim needs the carve-out.) **`NEXT-STEPS.md` §C1** (row 10). 
Add a pointer in the int8 lever, for example: -"A runnable miniature of this lever ships at `scaling/` (see -`scaling/README.md`); it is the public, gated counterpart to the private +"A runnable miniature of this lever ships at `demo/` (see +`demo/README.md`); it is the public, gated counterpart to the private production figures above." **`.github/STANDARDS.md` line 51** (row 15, raised by the automated review). @@ -88,18 +89,18 @@ production figures above." gitignored for a reason.)" stays true of the core. Name the demo's exception so it is not re-flagged, for example: -> The one exception is the `scaling/` demo. Its "private" layer is public-domain +> The one exception is the `demo/` demo. Its "private" layer is public-domain > text by design (a layer assignment, not secrecy), so it commits -> `scaling/corpus/index.json` on purpose, to reproduce the headline with no key. -> See `scaling/README.md` §2 for why that is safe there and must not be +> `demo/corpus/index.json` on purpose, to reproduce the headline with no key. +> See `demo/README.md` §2 for why that is safe there and must not be > generalized to a genuinely-private corpus. **Paper §5/§6 (author's call, conditional).** The published note's §5 says retrieval is "in-memory and unchunked … indexed whole." That stays true of the -core and of the demo's short whole units. **Only if `scaling/` is in an +core and of the demo's short whole units. **Only if `demo/` is in an anonymized submission snapshot** does §6 want a one-line bridge so §5 reads as -describing the core, not the `scaling/` exception. This is a paper edit, the -author's not the agent's, and it is moot if `scaling/` is deferred past review. -Note in the build summary whether `scaling/` is present in any snapshot built. +describing the core, not the `demo/` exception. This is a paper edit, the +author's not the agent's, and it is moot if `demo/` is deferred past review. +Note in the build summary whether `demo/` is present in any snapshot built. 
**If row 4 fired (a sermon had to be split), the unchunked claim itself needs revisiting, not just a bridge.** diff --git a/package.json b/package.json index 621f279..57cd8d3 100644 --- a/package.json +++ b/package.json @@ -12,10 +12,10 @@ "index": "node --env-file-if-exists=.env --import tsx src/cli/build-index.ts", "ask": "node --env-file-if-exists=.env --import tsx src/cli/ask.ts", "eval": "node --env-file-if-exists=.env --import tsx src/cli/eval.ts", - "scaling:build": "node --env-file-if-exists=.env --import tsx scaling/build.ts", - "scaling:run": "node --env-file-if-exists=.env --import tsx scaling/run.ts", - "scaling:test": "node --import tsx --test scaling/*.test.ts", - "test": "node --import tsx --test test/*.test.ts scaling/*.test.ts", + "demo:build": "node --env-file-if-exists=.env --import tsx demo/build.ts", + "demo:run": "node --env-file-if-exists=.env --import tsx demo/run.ts", + "demo:test": "node --import tsx --test demo/*.test.ts", + "test": "node --import tsx --test test/*.test.ts demo/*.test.ts", "typecheck": "tsc --noEmit" }, "dependencies": { diff --git a/tsconfig.json b/tsconfig.json index 1e30c72..ecdf61d 100644 --- a/tsconfig.json +++ b/tsconfig.json @@ -12,6 +12,6 @@ "skipLibCheck": true, "noEmit": true }, - "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts", "scaling/**/*.ts"], + "include": ["archive.config.ts", "src/**/*.ts", "test/**/*.ts", "demo/**/*.ts"], "exclude": ["node_modules", "artifacts"] }