Position Is Arithmetic

A number-theoretic architecture for transformer inference and memory — where a token's position, index, and routing are exact arithmetic, not floating-point metadata about it.

Live site: https://nihilistau.github.io/Position_Is_Arithmetic/

The main project lives in the Shannon-Prime-Lattice repository; the community Discord is Shannon-Prime-Lattice-Discord.

What this is

Position Is Arithmetic is the public research home of the Shannon-Prime project: a ground-up re-derivation of the transformer forward pass in discrete integer arithmetic, plus a memory architecture (PPT-ARM) that attaches to a frozen, pretrained transformer and gives it long, auditable memory on commodity hardware.

The thesis in one line: a transformer's positions, indices, and routing are arithmetic objects — primes, residues, lattices — so they can be computed exactly instead of approximated in floating point. That turns operations that are normally lossy (KV-cache compression, quantization, weight offload, and now inter-model memory) into operations that are bit-exact when disabled, gated when enabled, and receipted always.
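As a minimal illustration of what "computed exactly" can mean, the sketch below stores a position as its residues modulo a few pairwise-coprime primes and recovers it with the Chinese Remainder Theorem. The moduli and function names are hypothetical, chosen only for this example; they are not the project's parameters or API.

```python
from math import prod

# Hypothetical moduli for illustration only; not the project's parameters.
PRIMES = (251, 257, 263)

def to_residues(position, primes=PRIMES):
    """Represent an integer position by its residue modulo each prime."""
    return tuple(position % p for p in primes)

def from_residues(residues, primes=PRIMES):
    """Recover the unique position in [0, prod(primes)) via the CRT."""
    modulus = prod(primes)
    total = 0
    for r, p in zip(residues, primes):
        cofactor = modulus // p
        # pow(x, -1, p) is the modular inverse (Python 3.8+).
        total += r * cofactor * pow(cofactor, -1, p)
    return total % modulus

position = 32_767
residues = to_residues(position)
assert from_residues(residues) == position  # exact round trip, no rounding
```

The round trip is integer-only, so there is no rounding step at which two positions could collapse into one.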

This repository holds the receipts-first paper series and the project's document history. Active code lives in the linked repositories; the headline implementation is Shannon-Prime-Lattice.

Status labels used throughout (and in LEDGER.md):

[PROVEN] — measured and gated; the number has a ledger row and a command. Citable.
[WIRED] — implemented and gated in-engine/in-core; running today, not yet a public citable row.
[DESIGN] — specified, with its falsification gates pre-stated; not built.

Measured results (citable)

Receipts-first: every number reproduces from a single command and traces to a row in LEDGER.md. The unflattering numbers are kept attached on purpose.

Result	Number	Scope / caveat
Resident KV-cache shrink @ 32k context	910× (7.5 GB → 8.3 MB)	two-ring offload to byte-addressable storage (01-R5)
Needle retrieved off a physical NVMe drive	HIT at 512 positions (7.57 µs/read)	poison-gated; latency figure is Optane-specific (01-R3/R4). At 32k the composed run completed but MISSed (B=512 = a 64× selection budget, far past the gated 2×–8× regime; 01-R9) — kept here on purpose
KV sparsification quality	8× at +0.69% perplexity	one corpus, 2k context (2× and 4× go negative) (01-R1)
Reducing loader (transcode)	model → ~50% smaller, bit-faithful forward	gemma-3 + Qwen3, closure-gated (paper 02)
Bit-exact when disabled	argmax-identical to the stock model	the invariant under everything (01-R8)
12B GPU decode + quality, same RTX 2060 12GB	26.1 tok/s at wikitext PPL 5.12 (graph EXACT, 256/256 top-1, 24/24 gates)	gated + citable (06-R10). llama.cpp-CUDA: 31.29 tok/s at PPL 192–506 — every gemma-4 GGUF measurable in June 2026, incl. the post-fix rebuilds, carries broken weights (06-R8). SP engine bandwidth 245 vs 207 GB/s (+18%); the earlier 34.2 (+9.3%) headline is retired — its artifact failed the PPL gate (the series' own rule caught it)
The gemma-4 ecosystem finding	true full-precision PPL 4.68 (hand-written reference forward) vs GGUFs 192–506	engine-independent conviction (06-R8); verification + fix tutorial: GEMMA4-QUANT-FIX.md
Latent crossbar probe: a 12B steered by direct KV-cache transplant, no tokens	15/15 incorporation, 15/15 selectivity (2×2 double dissociation), max single-token rank pull 3.69 orders	gated + citable (X-R1). Coherence held under the gold instrument (steered-text PPL 1.70–4.10 vs gold 4.68); self-transplant null bit-identical 7/7; raw KV splice is a deliberately blunt instrument — the learned-adapter phase exists to refine it

Honest scope: this is a proof-of-mechanism, not a scaling study and not yet independently reproduced. CPU decode is ~1.34× behind a tuned llama.cpp at the same quantization. On GPU the citable point is the speed/quality PAIR: 26.1 tok/s at PPL 5.12 — a point no other stack currently occupies on this model at any speed, because their artifacts are broken. The memory envelope remains the primary value claim.

The paper series

A staggered set of short, independently citable, receipts-first papers — each carries its own one-command reproduction.

01 — Two-ring memory — query-directed recall + byte-addressable KV offload (the needle-off-NVMe result above).
02 — The reducing loader — output-preserving transcode + zero-copy load (the ~50%-smaller, bit-faithful result).
03 — Frobenius calibration-free quantization (staged).
04 — The Oracle & the Teacher (written) — oracle-grounded backend verification: KL 2.7e-10 port, teacher-forced decode — plus the case study where a hand-written oracle measured gemma-4's true PPL at 4.68 and convicted the GGUF ecosystem (192–506) while exonerating llama.cpp's forward.
05 — The Probe Suite (written) — bisection, isolation and benchmark hygiene as one set — from the 12.65× phantom and the 0/256 K-quant bug to ecosystem-scale forensics and simulate-before-build (artifact matched the simulator to four decimals).
06 — Computing on the Zip File (complete, citable) — the dp4a bandwidth ladder (f32 1× → int8 ~3.8× → Q4 ~7.06×), the OK_Q4B block-scaled kernel, the sovereign quantization pipeline, and the gated headline: 26.1 tok/s at PPL 5.12 on an RTX 2060 12GB.
GEMMA4-QUANT-FIX.md — community tutorial: verify the gemma-4 GGUF breakage yourself (engine-independent, ~30 min) and the working fix recipe. Ready-to-post issue text: GEMMA4-ISSUE-POST.md.

See SERIES.md for the manifest and release cadence, LEDGER.md for the master claims ledger (every number traced to a command), and METHODOLOGY.md for the gate vocabulary and the "no number without a command" discipline.

The system: a four-tier memory hierarchy plus a latent crossbar

The original "two-ring" framing has grown into a four-tier hierarchy with an inter-model lane on top. Architecture ground truth lives in the lattice repo (papers/RFC-XBAR-auditable-latent-crossbar.md); this is the public map, each component tagged with its status.

        ┌────────────────────────── VRAM (owned arena) ───────────────────────────┐
        │                                                                         │
        │   Exec (generator, e.g.               Memo (small curator,              │
        │   gemma-4-12B OK_Q4B)                 frozen-small)                     │
        │   causal forward, generates           non-causal pass over the episode  │
        │        │            ▲                        │             ▲            │
        │        ▼ write      │ attend                 ▼ propose     │ read       │
        │   ┌─ Ring 1 ─┐  ┌── Ring 2 (hippocampus) ┐  ┌─ Ring 2′ (shadow) ─┐      │
        │   │ working  │  │ verbatim Spinor KV,    │◄─│ Memo's proposals   │      │
        │   │ KV       │  │ recent + bounded       │  │ promote-on-accept  │      │
        │   └──────────┘  └────────────────────────┘  └─────────┬──────────┘      │
        │        ▲ recall from BOTH                             │ promote (gated) │
        │        │                ┌── Ring 3 (neocortex) ───┐◄──┘                 │
        │        └────────────────│ adapter pseudo-tokens,  │   G-R3-LOSS bounded │
        │                         │ consolidated long-term  │   (irreversible)    │
        │                         └─────────────────────────┘                     │
        │              modality lanes (one CRT prime per modality):               │
        │              audio adapter, video, ...                                  │
        └─────────────────────────────────────────────────────────────────────────┘
   Ring 2′ promotions: coherence/PPL delta → accept or REWIND (transient, reversible).
   Ring 3 promotions: G-R3-LOSS bounded BEFORE source eviction (permanent, irreversible).

The four tiers

Tier	Substrate	Representation	Lifetime	Biological analogue	Status
Ring 1	RAM/VRAM working window	verbatim KV, full attention	the live turn	sensory / working memory	[PROVEN] — the stock model path; everything else is bit-exact-when-off relative to it
Ring 2	byte-addressable storage (Optane validated), raw episodic store	verbatim Spinor KV blocks	recent episode (bounded)	hippocampus — recent, detailed, lossless	[PROVEN] — needle off physical NVMe, poison-gated, 7.57 µs/read (01-R3/R4); bounded on purpose: the composed 32k recall at a 64× selection budget MISSed (01-R9)
Ring 2′ (shadow)	transient staging copy	proposals awaiting the gate	one consolidation pass	(no analogue — it is the audit mechanism)	[WIRED] — the C1-lite curator: clone → propose → gate → atomic promote / rewind, exercised on real recall, every promotion receipted
Ring 3	consolidated long-term store	adapter-compressed pseudo-tokens (n→k gist)	long-term	neocortex — old, dense, semantic	[DESIGN] — under the irreversible-aware G-R3-LOSS gate: consolidation loss is quantified and bounded before the raw source is evicted; un-compressible episodes stay verbatim in Ring 2 (a valid, logged outcome)

The point of the split: raw recall degrades past ~16× selection budget (the measured 32k MISS is the honest anchor), so Ring 2 stays bounded and recent — where the budget is favorable — and the long tail lives in Ring 3 as compact gist. The Exec queries both per step and attends over the union.
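The budget arithmetic behind that split can be sketched as follows. This assumes the selection budget is the context length divided by the number of positions kept (B), which matches the quoted 32k / 512 = 64×; the function names and the three regime labels are hypothetical.

```python
GATED_MAX = 8        # upper end of the gated 2x-8x regime quoted above
DEGRADES_PAST = 16   # approximate point past which raw recall degrades

def selection_budget(context_tokens, selected_positions):
    """Context length divided by the number of positions kept for attention."""
    if selected_positions <= 0:
        raise ValueError("selected_positions must be positive")
    return context_tokens / selected_positions

def regime(budget):
    """Classify a budget against the two thresholds above."""
    if budget <= GATED_MAX:
        return "gated"
    if budget <= DEGRADES_PAST:
        return "beyond gated regime"
    return "degraded"

# The composed 32k run with B=512 from ledger row 01-R9.
budget = selection_budget(32 * 1024, 512)
print(budget, regime(budget))  # 64.0 degraded
```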

XBAR — the Auditable Latent Crossbar

Multi-agent systems today communicate by detokenizing model A's state into text and retokenizing it for model B. The boundary is lossy, slow, and discards everything the residual stream knew that the argmax threw away. XBAR bypasses the boundary: two models — the Exec (generator) and Memo (a small, differently-trained curator) — share the ring memory and communicate through latent state, not tokens, with every write receipt-backed, gated, and rewindable. "Auditable" is the one word no floating-point agent stack can claim, and it is the entire reason this lane exists.

What is measured so far — [PROVEN], ledger row X-R1: a gemma-4-12B's generation steered by direct KV-cache transplant (no tokens involved) — 15/15 lexical incorporation across a 5-prompt × 3-concept matrix, 15/15 selectivity with a 2×2 double dissociation, max single-token rank pull 3.69 orders, dose-response from a single row (~4% attention mass) to a 6-row lexical breach, instrumentation null bit-identical 7/7, and coherence certified by the gold instrument (steered-text PPL 1.70–4.10 against the model's true 4.68). Raw KV splice is a deliberately blunt instrument; the learned-adapter phase exists to refine exactly that.

Design rules (fixed, not aspirational):

Memo is small. It sorts latents, it does not speak; it co-resides with Exec — no weight-swap latency.
Memo is non-causal. It is offline and sees the whole episode at once — the architectural form of consolidation, not a vague autoencoder.
Shadow ring, promote-on-accept. Memo never writes canonical Ring 2 directly. Proposals land in Ring 2′; a downstream coherence gate accepts → promote with receipt, or rejects → rewind.
Geometry is the law. Nothing enters the ring that does not honor the per-layer, per-head, position-exact coordinates.
One CRT prime per modality lane. Audio/text/video blocks are residue-separable in the same ring; lanes can never alias; provenance stays recoverable.
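One way to picture the last rule is with CRT idempotents: an address that is nonzero modulo exactly one lane prime belongs to that lane and to no other. This is a sketch under that assumption, not the project's actual encoding; the primes, lane names, and function names are all hypothetical.

```python
from math import prod

# Hypothetical lane primes for illustration only.
LANE_PRIMES = {"text": 101, "audio": 103, "video": 107}
M = prod(LANE_PRIMES.values())

def idempotent(p):
    """Element of Z_M that is 1 mod p and 0 mod every other lane prime."""
    cofactor = M // p
    return cofactor * pow(cofactor, -1, p) % M

def tag(lane, slot):
    """Embed a slot number in 1..p-1 into the shared ring on one lane."""
    p = LANE_PRIMES[lane]
    if not 1 <= slot < p:
        raise ValueError("slot must be in 1..p-1")
    return slot * idempotent(p) % M

def provenance(address):
    """Recover (lane, slot): exactly one lane residue is nonzero."""
    hits = [(lane, address % p) for lane, p in LANE_PRIMES.items() if address % p]
    if len(hits) != 1:
        raise ValueError("address does not belong to exactly one lane")
    return hits[0]

address = tag("audio", 42)
assert provenance(address) == ("audio", 42)
```

In this toy scheme, two blocks from different lanes can never share an address, and the lane is recoverable from the address alone.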

The honest negative, stated up front: "injected memory as sudden realization" and "confident hallucination from off-manifold state" are the same event described twice. The discrete substrate detects invalid blocks (Spinor sentinel, Frobenius-lift identity); it cannot detect semantically-wrong-but-valid ones. Therefore the coherence gate is load-bearing, not decorative — no promotion without a measured downstream delta, accept-or-rewind, every time.
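The shadow-ring rule and the accept-or-rewind gate can be sketched in a few lines. This is a toy model under stated assumptions: the ring is a plain dict, the coherence measure is a caller-supplied function, and every name here is hypothetical rather than taken from the engine.

```python
import copy

def consolidation_pass(ring2, propose, coherence_delta, max_delta):
    """Clone, propose, gate, then promote or rewind.

    ring2            dict of block_id -> payload (the canonical ring)
    propose          callable that mutates the shadow copy only
    coherence_delta  callable(old, new) -> measured downstream cost
    max_delta        pre-stated gate threshold
    """
    shadow = copy.deepcopy(ring2)            # Ring 2' staging copy
    propose(shadow)                          # the curator writes only here
    delta = coherence_delta(ring2, shadow)   # measured, never assumed
    accepted = delta <= max_delta
    receipt = {"delta": delta, "max_delta": max_delta, "accepted": accepted}
    # Promote by returning the shadow; rewind by returning the untouched ring.
    return (shadow if accepted else ring2), receipt

# Toy usage with stand-in callables.
ring2 = {0: "raw block A", 1: "raw block B"}

def add_gist(shadow):
    shadow[2] = "proposed gist of A+B"

ring2_after, receipt = consolidation_pass(
    ring2, add_gist, coherence_delta=lambda old, new: 0.02, max_delta=0.05)
```

Because the canonical ring is never mutated in place, a rejected proposal leaves no trace except its receipt.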

NIGHTSHIFT — offline consolidation [DESIGN]

Sleep does not just tidy the hippocampus; it replays raw episodes and writes compressed semantic traces to neocortex. NIGHTSHIFT does the same, on idle time: it reads aging Ring 2 episodes, compresses spans via the adapter, proposes the gist to Ring 2′, and on gate-accept promotes it to Ring 3 — eviction of the now-redundant raw positions happens under the same receipt or not at all. The association-strength signal already exists in measured form (the recall path's temporal-locality telemetry). A synthetic subconscious whose dreams are auditable.

Honest constraint carried forward: the NIAH budget ladder broke at 16×–32× selection, so v0 NIGHTSHIFT bounds episodes (≤8k tokens) or runs two-stage re-rank; budget scaling is an open risk-register item, not a buried assumption.
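The ordering that matters in NIGHTSHIFT (bound the episode, measure the loss, and only then allow eviction) can be sketched as below. The compressor and loss function are stand-ins supplied by the caller, the 8 * 1024 limit is an assumed reading of "≤8k tokens", and all names are hypothetical.

```python
MAX_EPISODE_TOKENS = 8 * 1024  # assumed reading of the "<=8k tokens" bound

def bounded_episodes(tokens, limit=MAX_EPISODE_TOKENS):
    """Split a token sequence into consecutive episodes of at most `limit`."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

def nightshift_step(episode, compress, loss, max_loss):
    """Consolidate one episode; evict the raw source only if loss is bounded.

    Returns (ring2_entry, ring3_entry, receipt). Exactly one entry is None:
    either the raw episode stays verbatim, or its gist is promoted.
    """
    gist = compress(episode)
    measured = loss(episode, gist)           # quantified before any eviction
    receipt = {"tokens": len(episode), "gist": len(gist), "loss": measured}
    if measured <= max_loss:
        receipt["outcome"] = "promoted"
        return None, gist, receipt
    receipt["outcome"] = "kept_verbatim"     # a valid, logged outcome
    return episode, None, receipt

# Toy stand-ins: keep every 4th token; loss = share of distinct tokens dropped.
def toy_compress(ep):
    return ep[::4]

def toy_loss(ep, gist):
    return 1 - len(set(gist)) / len(set(ep))

raw, gist, receipt = nightshift_step([1, 1, 1, 1, 2, 2, 2, 2],
                                     toy_compress, toy_loss, max_loss=0.1)
```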

Direction and reasoning: why discrete, why auditable

The substrate. Computing positions, indices and routing exactly (CRT-NTT arithmetic, Frobenius lift, KSTE tiering) means internal state can carry proofs instead of vibes: a Spinor block is 63 bytes plus a sentinel, which together fit one cache line, and its Frobenius-lift identity is a bit-level integrity receipt. Floating-point drift and unprovable identity are entropy bleeding into the hardware; the lattice makes correctness a property you check.
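To make the "63 bytes plus a sentinel" layout concrete, here is a toy block format. It assumes the sentinel is a single byte (so the block is 64 bytes) and uses a simple checksum as a stand-in; the real Spinor sentinel and the Frobenius-lift identity are not specified in this README, and none of these names come from the project.

```python
import struct

BLOCK_FORMAT = "<63sB"  # 63 payload bytes followed by one sentinel byte
BLOCK_SIZE = struct.calcsize(BLOCK_FORMAT)  # 64

def toy_sentinel(payload):
    """Stand-in integrity byte; NOT the project's Spinor sentinel."""
    return sum(payload) % 251

def pack_block(payload):
    """Pack exactly 63 payload bytes plus the stand-in sentinel."""
    if len(payload) != 63:
        raise ValueError("payload must be exactly 63 bytes")
    return struct.pack(BLOCK_FORMAT, payload, toy_sentinel(payload))

def is_well_formed(block):
    """Check size and sentinel; says nothing about semantic correctness."""
    if len(block) != BLOCK_SIZE:
        return False
    payload, sentinel = struct.unpack(BLOCK_FORMAT, block)
    return sentinel == toy_sentinel(payload)

block = pack_block(bytes(range(63)))
corrupted = bytes([block[0] ^ 1]) + block[1:]  # flip one bit in the payload
```

A check like this can reject a malformed block, but it cannot tell whether a well-formed block is semantically right, which is why the coherence gate remains necessary.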

The discipline (in full in METHODOLOGY.md) is as much the contribution as any mechanism:

Bit-exact when off. Every mechanism is a strict no-op in its default state — the baseline is provably the original network.
No number without a command. Nothing is claimed that is not a ledger row reproducible by a specified command.
Scope travels with the number. Every figure carries its model, context, corpus, and what it does not generalize to.
No silent gate revisions. If an implementation can't meet a pre-stated gate, that surfaces upstream — gates are never quietly retuned until a number passes.
Falsification pre-stated; honest negatives published. The 32k NIAH MISS (01-R9) stays on the front page; the 34.2 tok/s headline was retired by the series' own quality rule; a falsified recall signature is reported, not hidden. A result with its caveats attached is one a reader can trust without re-deriving the authors' incentives.
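The first rule translates directly into a check anyone can run. The sketch below compares two runs' logits, represented here as plain lists of Python floats per decoding step, at two strengths: same top-1 token, and same bit pattern. The function names are hypothetical; the actual gates are defined in METHODOLOGY.md.

```python
import struct

def argmax(row):
    """Index of the largest logit; ties resolve to the lowest index."""
    return max(range(len(row)), key=row.__getitem__)

def argmax_identical(baseline, candidate):
    """True when both runs pick the same top-1 token at every step."""
    return len(baseline) == len(candidate) and all(
        argmax(a) == argmax(b) for a, b in zip(baseline, candidate))

def bit_identical(baseline, candidate):
    """True when every logit has the same IEEE-754 bit pattern."""
    def bits(rows):
        return [struct.pack(f"<{len(r)}d", *r) for r in rows]
    return bits(baseline) == bits(candidate)

stock = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]
disabled = [row[:] for row in stock]              # mechanism switched off
perturbed = [[0.1, 2.0000001, -1.0], [3.0, 0.0, 0.5]]

assert argmax_identical(stock, disabled) and bit_identical(stock, disabled)
assert argmax_identical(stock, perturbed) and not bit_identical(stock, perturbed)
```

The bit-level comparison is deliberately stricter than float equality: it distinguishes 0.0 from -0.0, which `==` would treat as equal.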

A note on the latent layer and security. Deployed AI safety today lives almost entirely at the lexical layer — refusal training, input filters, output classifiers all scan text — while the decision is made in the residual stream. The field's trajectory (multimodal projectors, agentic/retrieved context, shared KV-cache serving) steadily adds pathways that reach latent space without passing the layer where safety is enforced. Calibration matters: direct latent writes require runtime ownership — a deployment-isolation threat, not a remote skeleton key — and the structural worry is a widening gap on a multi-year horizon, not an imminent break. The connection to this project is the constructive half: latent state has been an un-inspectable continuous blob, and XBAR's premise is the counter — a discrete substrate where a block of internal state is provably well-formed, every memory write carries a receipt, and nothing commits without passing a coherence gate. A verifiable, gated latent substrate is a defense direction the field currently lacks; we record that as motivation, not as a project pivot.

Older material

The original document history — theory drafts, Friedman/KSTE notes, results, and tools — has been moved to Archived/. It is kept for provenance, not as a starting point. Begin with the paper series above or the live project.

Related repositories

Main project

shannon-prime-lattice — the lattice: discrete Z_q substrate, the headline implementation
shannon-prime-system — math core
shannon-prime-system-engine — inference engine

Earlier / supporting

Audio / Voxtral

License and citation

MIT. Cite via CITATION.cff.

Shannon-Prime-Lattice is an open-source research project by KnackAU — contact: raydaniels@gmail.com

Attributed to Transformers and 250 years of Mathematicians.
