Merged
141 changes: 141 additions & 0 deletions demo/README.md
@@ -0,0 +1,141 @@
# The int8 scaling demo

The result this module is built to produce is a **caught failure**: the same
gold suite that owns grounding and refusal rejecting a cheaper encoding. The
*mechanism* is proven offline, in `quantize.test.ts` (run by `npm test`): on
fixture vectors searched to exhibit a near-tie, int8 preserves both the route
and the disambiguation winner, int4 flips the top slot, and the gate catches it.

Whether the **real Smith corpus** produces that flip at the int8/int4 boundary
is a separate, empirical question, settled by the build run, not asserted here.
"int8 held" on a small corpus is expected and proves little on its own; the gate
saying *no* when pushed is what shows that the gold suite, not the encoding, is
the adjudicator. So: the mechanism is demonstrated; the real-corpus
demonstration is pending.

The committed vectors are not built yet (this module was written with no network
and no key), so `npm run demo:run` errors with a build pointer until then;
see **Build status**. Once built:

```
npm run demo:run # int8, real corpus: the headline, keyless
npm run demo:run -- --natural+synthetic # add the spire and its gold
npm run demo:run -- --natural+synthetic --bits 4 # int4: the gate rejects the spire's route flip
npm run demo:run -- --full # also run the answer-mode pass (needs a key)
```

## What it is

The paper (§6) claims that the same gold suite which owns grounding and refusal
also adjudicated every cost reduction made to run the system at scale. The
production figures behind that are private and non-reproducible. This demo makes
the *mechanism* runnable on a public-domain corpus: it quantizes the embedding
index to int8, re-ranks, and runs the full gold suite including the must-refuse
and must-route cases, so the gate either certifies or rejects the cheaper
encoding. The claim is **relative, not absolute**: not "this corpus is
realistic," but "int8 preserves the verdicts full-precision produces, and where
it does not, the gate catches it." Realism is never asserted.

Public domain is the *absence* of copyright, not a license: this corpus is
public-domain, not "permissively licensed." The two name-colliding authors and
their provenance live in [`corpus/README.md`](./corpus/README.md).

## How it works (a wrapper plus a re-rank, not a second system)

The int8 path is an encode/decode wrapper plus a re-rank. It reuses the core
retrieval (`src/retrieve.ts`), the gold judge (`src/evaluate.ts`), the store
(`src/store.ts`), and the no-leak boundary (`src/no-leak.ts`) untouched; nothing
in the core was forked or changed. `quantize.ts` is the public twin of the
production site adapter's `vector-quant.ts` (named in
[`docs/production-scaling.md`](../docs/production-scaling.md) §2). The harness
quantizes the committed full-precision vectors in process, dequantizes, and
hands the result to the same `retrieve()` the engine uses.
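
The encode/decode wrapper can be pictured with a minimal sketch. It assumes symmetric per-vector scaling (largest magnitude mapped to 127); the names `QuantizedVector`, `quantizeInt8`, and `dequantizeInt8` are hypothetical and are not taken from `quantize.ts`, whose actual scheme and wire format may differ.

```typescript
// Hypothetical sketch, not the repo's quantize.ts: symmetric per-vector int8.
type QuantizedVector = { scale: number; codes: Int8Array };

/** Encode: one positive scale per vector, each component rounded to an int8 code. */
function quantizeInt8(vector: number[]): QuantizedVector {
  let maxAbs = 0;
  for (const x of vector) maxAbs = Math.max(maxAbs, Math.abs(x));
  // An all-zero vector gets scale 1 so decoding never divides or multiplies by 0/NaN.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const codes = new Int8Array(vector.length);
  vector.forEach((x, i) => {
    codes[i] = Math.round(x / scale); // always within [-127, 127]
  });
  return { scale, codes };
}

/** Decode: restore approximate floats so an unchanged retrieve() can score them. */
function dequantizeInt8(q: QuantizedVector): number[] {
  return Array.from(q.codes, (code) => code * q.scale);
}

// Round trip: each component moves by at most half a quantization step.
const original = [0.5, -0.25, 0.125, 0];
const encoded = quantizeInt8(original);
const restored = dequantizeInt8(encoded);
```

Under this assumption the per-component error is bounded by `scale / 2`, which is the rounding perturbation the gold suite is then asked to adjudicate.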

Two facts make int8 admissible, and they differ in kind (the §6 split):

- **Exact, by algebra.** Cosine normalizes by vector norm, so a positive
per-vector scale cancels from the score entirely. The ranking is invariant to
it; you can score against the quantized bytes without restoring the scale.
- **Measured, by the suite.** Integer rounding perturbs direction and can
reorder near-ties, so its harmlessness is not proven; it is verified. The
harness reports rank correlation against the full-precision ranking, then runs
the gold suite. Rank correlation is *necessary, not sufficient*: a demo that
reports it and stops has shown a retrieval benchmark, not answerability
governing tuning. The gold suite is the actual adjudicator, and it checks not
just that the expected source is *retrieved* but that it *wins the top slot*:
so a quantization flip that swaps which Smith ranks first (disambiguation) or
lets a public record overtake the private note (route) is caught keyless, not
only by the keyed answer pass. Past int8 (int4, PQ, binary) the exact part
stops applying and the whole lever is measured; the wire format is versioned
so a code/data mismatch fails loudly.
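
Both facts can be illustrated with a small sketch. The helper names (`cosine`, `spearman`, `topSlotHeld`) and the ids are invented for illustration and are not the harness's API. The first part shows the algebra: for a scale s > 0, cos(s·a, b) = s(a·b) / (s‖a‖‖b‖) = cos(a, b). The second shows why rank correlation alone is not enough: swapping the top two of five results still gives a Spearman correlation of 0.9, yet the top slot has flipped.

```typescript
// Illustrative helpers only; none of these names come from the repo.

/** Cosine similarity of two equal-length vectors. */
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Spearman rank correlation between two orderings of the same ids (no ties). */
function spearman(fullOrder: string[], cheapOrder: string[]): number {
  const n = fullOrder.length;
  if (n < 2 || cheapOrder.length !== n) {
    throw new Error('orderings must have the same length, at least 2');
  }
  const cheapRank = new Map(cheapOrder.map((id, i) => [id, i] as const));
  let sumSquaredDiff = 0;
  fullOrder.forEach((id, i) => {
    const j = cheapRank.get(id);
    if (j === undefined) throw new Error(`id '${id}' missing from the cheap ranking`);
    sumSquaredDiff += (i - j) ** 2;
  });
  return 1 - (6 * sumSquaredDiff) / (n * (n * n - 1));
}

/** The stricter check: does the same source still win the top slot? */
function topSlotHeld(fullOrder: string[], cheapOrder: string[]): boolean {
  return fullOrder.length > 0 && fullOrder[0] === cheapOrder[0];
}

// Exact part: a positive per-vector scale cancels out of the cosine.
const vecA = [1, 2, 3];
const vecB = [2, -1, 0.5];
const scaledA = vecA.map((x) => x * 3.7);
const scaleDrift = Math.abs(cosine(vecA, vecB) - cosine(scaledA, vecB)); // floating-point noise only

// Measured part: high rank correlation, but the winner changed.
const fullOrder = ['george-1', 'adam-1', 'adam-2', 'george-2', 'adam-3'];
const cheapOrder = ['adam-1', 'george-1', 'adam-2', 'george-2', 'adam-3'];
const rho = spearman(fullOrder, cheapOrder); // 0.9
const held = topSlotHeld(fullOrder, cheapOrder); // false
```

A report that stopped at `rho` would pass this encoding; a gate that checks the top slot would not.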

The headline run is **keyless**: it reads committed full-precision vectors and
committed gold-query vectors, so no embedding call is made. A key is needed only
to regenerate the vectors (`demo:build`) or to run the `--full` answer pass.
That answer pass exercises route *selection*, which is what quantization moves;
it does not touch A2, the answer model's confabulation residue, which the
encoding never exercises.
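
The keyless property amounts to a lookup that refuses to fall back to an embedding call. The sketch below assumes a file shape of `{ model, dimensions, vectors }`; that shape and the name `lookupQueryVector` are hypothetical, and the real layout is whatever `demo/query-vectors.ts` writes.

```typescript
// Assumed shape of the committed query-vector file (hypothetical).
type QueryVectorFile = {
  model: string;
  dimensions: number;
  vectors: { id: string; vector: number[] }[];
};

/**
 * Resolve a gold query's vector from committed data only.
 * Any mismatch throws rather than silently re-embedding, so the run stays keyless.
 */
function lookupQueryVector(file: QueryVectorFile, goldId: string, expectedModel: string): number[] {
  if (file.model !== expectedModel) {
    throw new Error(`query vectors were built with '${file.model}', expected '${expectedModel}'`);
  }
  const hit = file.vectors.find((entry) => entry.id === goldId);
  if (!hit) {
    throw new Error(`no committed vector for gold query '${goldId}'; rebuild the vectors`);
  }
  if (hit.vector.length !== file.dimensions) {
    throw new Error(`gold query '${goldId}' has ${hit.vector.length} dimensions, expected ${file.dimensions}`);
  }
  return hit.vector;
}

// Usage with a toy file; real vectors have thousands of dimensions.
const toyFile: QueryVectorFile = {
  model: 'text-embedding-3-large',
  dimensions: 3,
  vectors: [{ id: 'which-smith', vector: [0.1, 0.2, 0.3] }],
};
const queryVector = lookupQueryVector(toyFile, 'which-smith', 'text-embedding-3-large');
```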

## Disclosures (the three that are non-negotiable)

1. **Layer designation, not secrecy.** "Private" means the type cannot carry the
text to the model, regardless of what the text is. George Adam Smith's minor
works are public-domain; assigning some of them to the private layer is an
authored research decision, the same move the core's notebook entries make.
2. **The synthetic spire is fabricated and flagged.** A small set of fabricated
George-private notes lives quarantined in `corpus/synthetic/`, loaded only
under `--natural+synthetic`, each marked `synthetic: true` and naming the
edge it tests. It is additive and never enters the headline metrics; the
spire's effect is reported on its own line. The spire is George-framed but
flagged: nothing fabricated is ever presented as the actual work of either
Smith.
3. **The claim is relative.** int8 preserves the verdicts full-precision
produces; the corpus is not offered as realistic and nothing turns on its
realism.

One disclosure carries a warning. The core gitignores `artifacts/index.json`
because vectors derived from private text are private; this demo does the
opposite and commits its vectors, so the headline reproduces with no key. That
is safe *here* because the "private" layer is public-domain George text, whose
embeddings reveal nothing that is not already public. Do not copy "commit your
vectors" as a general pattern: embeddings of genuinely private text can be
inverted to recover approximate content, which is the exposure the core's
gitignored index avoids.

## Build status

The code, the gold set, the provenance manifest, and the deterministic harness
tests (`quantize.test.ts`, run by `npm test`) are committed. The real text
bodies and the committed vectors (`corpus/index.json`,
`corpus/index.synthetic.json`, `corpus/query-vectors.json`) are produced by
`demo:build`, which needs network access to the public-domain sources and an
`OPENAI_API_KEY`; the session that wrote the module had neither. See
[`docs/scaling-demo/build-handoff.md`](../docs/scaling-demo/build-handoff.md)
for the exact steps, and the delta log for what is confirmed versus pending.

## The spec and the log are kept in the open

The planning docs live beside the module in
[`docs/scaling-demo/`](../docs/scaling-demo/), kept on purpose rather than
discarded once the code landed:

- `SCALING-DEMO-spec.md`: what the demo set out to do, and why; the ticket it was
built from.
- `scaling-demo-delta-log.md`: every place the build diverged from that spec,
what is settled versus pending the keyed build run, and the prepared
reconciliations (NEXT-STEPS, STANDARDS, the paper) to apply at merge.
- `build-handoff.md`: the brief for the build run that fetches the public-domain
texts and generates the committed vectors.

This is the same move the corpus manifest makes: the reasoning behind the
artifact is part of the artifact. A reader can see what was intended, where
reality differed, and which decisions are still owed.

## Relation to production

This is the runnable counterpart to the prose in `docs/production-scaling.md`
§2: the prose makes the case, the demo runs it. The George/Adam disambiguation
mirrors the real two-tier citation surface on the production site (Ask the
Archive), where a public-record citation carries an id and a URL and a
routing-hint citation carries only where the moment lives, never the text. The
**architecture** is what reproduces here, not the scale: the scale stays
reported in §6, the mechanism runs in this folder.
154 changes: 154 additions & 0 deletions demo/build.ts
@@ -0,0 +1,154 @@
// npm run demo:build — embed the scaling corpus and the gold queries, then
// commit the vectors. KEYED and run once (or after corpus edits): needs network
// to the embedding API and an OPENAI_API_KEY. The session that wrote this code
// had neither; see docs/scaling-demo/build-handoff.md.
//
// Reuses the core corpus loaders, embedding, and store writers untouched. The
// only thing new is pointing them at demo/corpus/ and splitting the output
// into the natural index (the headline source of truth), the synthetic spire
// (a strictly baseline-plus-delta file, unioned only under --natural+synthetic),
// and the committed gold-query vectors (what makes demo:run keyless).

import { createHash } from 'node:crypto';
import { existsSync } from 'node:fs';
import { resolve } from 'node:path';
import OpenAI from 'openai';

import { buildCorpus, buildPrivateNotes, embedText, noteEmbedText } from '../src/corpus.js';
import { batchInputs, embedBatch, truncateForEmbedding } from '../src/embedding.js';
import { assertHomogeneousIndex, writeIndexFile } from '../src/store.js';
import type { ArchiveConfig, IndexEntry, PrivateNote } from '../src/types.js';
import { loadGold } from '../src/evaluate.js';
import { config, SYNTHETIC_NOTES_DIR } from './config.js';
import { writeQueryVectors } from './query-vectors.js';

const NATURAL_INDEX = resolve('demo/corpus/index.json');
const SYNTHETIC_INDEX = resolve('demo/corpus/index.synthetic.json');
const NATURAL_GOLD = resolve('demo/gold.yaml');
const SYNTHETIC_GOLD = resolve('demo/gold.synthetic.yaml');

function contentHash(text: string): string {
  return createHash('sha1').update(truncateForEmbedding(text)).digest('hex').slice(0, 16);
}

type EmbedJob = { id: string; text: string };

async function embedAll(client: OpenAI, jobs: EmbedJob[]): Promise<Map<string, number[]>> {
  const byId = new Map<string, number[]>();
  let done = 0;
  for (const batch of batchInputs(jobs)) {
    const results = await embedBatch(client, batch, { model: config.embeddingModel });
    for (const r of results) byId.set(r.id, r.vector);
    done += batch.length;
    console.log(` embedded ${done}/${jobs.length}`);
  }
  return byId;
}

function recordEntries(config: ArchiveConfig, vectors: Map<string, number[]>): IndexEntry[] {
  const entries: IndexEntry[] = [];
  for (const record of buildCorpus(config)) {
    const text = embedText(record);
    const vector = vectors.get(record.id);
    if (!vector) throw new Error(`no embedding returned for record '${record.id}'; refusing to write a partial index.`);
    entries.push({
      model: config.embeddingModel,
      dimensions: vector.length,
      vector,
      contentHash: contentHash(text),
      sourceType: 'record',
      record,
    });
  }
  return entries;
}

function noteEntries(notes: PrivateNote[], vectors: Map<string, number[]>): IndexEntry[] {
  const entries: IndexEntry[] = [];
  for (const note of notes) {
    const vector = vectors.get(note.id);
    if (!vector) throw new Error(`no embedding returned for note '${note.id}'; refusing to write a partial index.`);
    entries.push({
      model: config.embeddingModel,
      dimensions: vector.length,
      vector,
      contentHash: contentHash(noteEmbedText(note)),
      sourceType: 'note',
      note,
    });
  }
  return entries;
}

async function main(): Promise<void> {
  if (!process.env.OPENAI_API_KEY) {
    throw new Error('OPENAI_API_KEY is not set. demo:build needs it to embed (see build-handoff.md).');
  }
  const client = new OpenAI();

  const records = buildCorpus(config);
  const naturalNotes = buildPrivateNotes(config);
  const syntheticNotes = buildPrivateNotes({ ...config, privateNotesDir: SYNTHETIC_NOTES_DIR });
  console.log(
    `Corpus: ${records.length} records, ${naturalNotes.length} private notes, ` +
      `${syntheticNotes.length} synthetic notes`,
  );
  if (records.length === 0) {
    throw new Error('No records found under demo/corpus/public — populate it first (build-handoff.md §1).');
  }

  // Gold queries: natural always, synthetic if authored.
  const gold = loadGold(NATURAL_GOLD, config.authorName);
  const goldQueries = [...gold];
  if (existsSync(SYNTHETIC_GOLD)) {
    goldQueries.push(...loadGold(SYNTHETIC_GOLD, config.authorName));
  }

  // One embedding pass over every source and query, distinguished by id.
  const sourceJobs: EmbedJob[] = [
    ...records.map((r) => ({ id: r.id, text: embedText(r) })),
    ...naturalNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })),
    ...syntheticNotes.map((n) => ({ id: n.id, text: noteEmbedText(n) })),
  ];
  const queryJobs: EmbedJob[] = goldQueries.map((g) => ({ id: `query:${g.id}`, text: g.query }));

  console.log(`Embedding ${sourceJobs.length} sources and ${queryJobs.length} gold queries...`);
  const vectors = await embedAll(client, [...sourceJobs, ...queryJobs]);

  // Natural index: records + real private notes.
  const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)].sort((a, b) =>
    (a.sourceType === 'record' ? a.record.id : a.note.id).localeCompare(
      b.sourceType === 'record' ? b.record.id : b.note.id,
    ),
  );
  assertHomogeneousIndex(naturalEntries);
  writeIndexFile(naturalEntries, NATURAL_INDEX);
  console.log(`Wrote ${naturalEntries.length} natural entries to ${NATURAL_INDEX}`);

  // Synthetic spire: written only when authored, so the headline never depends on it.
  if (syntheticNotes.length > 0) {
    const spireEntries = noteEntries(syntheticNotes, vectors);
    assertHomogeneousIndex([...naturalEntries, ...spireEntries]); // spire must share the space
    writeIndexFile(spireEntries, SYNTHETIC_INDEX);
    console.log(`Wrote ${spireEntries.length} synthetic spire entries to ${SYNTHETIC_INDEX}`);
  } else {
    console.log('No synthetic notes authored yet; skipping the spire index.');
  }

  // Committed gold-query vectors (what makes demo:run keyless). Every gold
  // query must embed, or the keyless runner would later fail on a missing id.
  const queryVectors = goldQueries.map((g) => {
    const vector = vectors.get(`query:${g.id}`);
    if (!vector) throw new Error(`no embedding returned for gold query '${g.id}'; refusing to write partial query vectors.`);
    return { id: g.id, vector };
  });
  const dims = queryVectors[0]?.vector.length ?? naturalEntries[0]?.dimensions ?? 0;
  writeQueryVectors(config.embeddingModel, dims, queryVectors);
  console.log(`Wrote ${queryVectors.length} gold-query vectors`);
  console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.');
}

main().catch((err) => {
  console.error(`demo:build failed: ${err instanceof Error ? err.message : err}`);
  process.exitCode = 1;
});
52 changes: 52 additions & 0 deletions demo/config.ts
@@ -0,0 +1,52 @@
// config.ts — points the engine at the int8 scaling-demo corpus.
//
// This is the same ArchiveConfig shape the core uses (src/types.ts), pointed at
// demo/corpus/ instead of example-content/. The demo reuses the core
// retrieval, the no-leak boundary, and the eval judges untouched; only the
// corpus, the gold set, and a thin int8 pass are new (see demo/README.md).
//
// Two authors share one colliding name on purpose: Adam Smith the economist
// (1723-1790) and George Adam Smith the theologian (1856-1942). Both write
// dense moral prose about justice and society, so their records sit close in
// embedding space; that proximity is what packs the near-ties int8 rounding can
// reorder. authorName names the collection rather than one person because the
// demo's whole subject is disambiguation; the gold queries name each Smith
// explicitly rather than relying on {{author}} substitution.
//
// On URLs: a record's citation URL is built by the reused corpus path
// (baseUrl + urlPrefix + slug), so it is a demo-canonical surface under the
// reserved .example TLD (RFC 2606), not a live page. The real public-domain
// sources live in demo/corpus/README.md's provenance table, per work. A
// private note's `about` is taken verbatim from frontmatter, so those route
// targets ARE real public George pages. See the delta log for this divergence
// from the spec's "records carry real public URLs" assumption and why it keeps
// src/corpus.ts untouched.

import type { ArchiveConfig } from '../src/types.js';

export const config: ArchiveConfig = {
  archiveName: 'Smith Collection (int8 scaling demo)',
  authorName: 'Adam Smith and George Adam Smith',
  baseUrl: 'https://smith-collection.example',
  contentRoot: './demo/corpus',
  collections: [
    { dir: 'public/adam-smith', urlPrefix: '/adam-smith/', type: 'adam-smith' },
    { dir: 'public/george-adam-smith', urlPrefix: '/george/', type: 'george-adam-smith' },
  ],
  // The private layer: George's minor works (sermons, addresses), searchable
  // but never quotable. Designating published work "private" is a layer
  // assignment enforced by the type, not a claim of secrecy (README §2).
  privateNotesDir: './demo/corpus/private',
  // Matches archive.config.ts. The int8 demo depends on this: the committed
  // vectors must be text-embedding-3-large at native dimensionality or the
  // homogeneity invariant (src/store.ts) rejects them.
  embeddingModel: 'text-embedding-3-large',
  answerModel: 'gpt-4o-mini',
};

// The quarantined synthetic spire (demo/corpus/synthetic/) is loaded as an
// ADDITIONAL private-notes dir only under --natural+synthetic, never here. Its
// location is one flag and each file also carries `synthetic: true` in
// frontmatter (a second flag the PrivateNote type ignores): nothing in
// demo/corpus/synthetic/ is real George text. See demo/run.ts and README §3.
export const SYNTHETIC_NOTES_DIR = './demo/corpus/synthetic';