Research project exploring training-free memory systems for LLMs using internal model representations instead of traditional RAG pipelines.
Instead of encoding text into external embedding vectors (lossy), this approach stores and retrieves the LLM's own internal states -- KV-Cache entries, hidden states, delta vectors -- enabling lossless semantic storage and direct neural injection.
This project evolved through three phases:
- Phase 1 -- KV-Embedding (Jan 2026): Implementation of arXiv:2601.01046 -- training-free text embedding via KV re-routing. Validated core hypotheses on BEIR SciFact. Code in experiments/legacy/ and src/legacy/. Saved BEIR result JSONs in results/legacy_phase1_beir/.
- Phase 2 -- ktransformers attempt (Jan 2026): Brief exploration of ktransformers for KV-layer access. Abandoned in favor of TransformerLens.
- Phase 3 -- Neural Native Memory (Feb-Apr 2026, current): Clean rewrite using TransformerLens for stable access to internal model states. Implements the full NNM pipeline: layer selection via TwoNN intrinsic dimensionality, token-level retrieval/injection vectors, Qdrant storage with z-score + whitening normalization. End-to-end gate evaluations (TriviaQA, needle-in-haystack) in results/phase3_e2e_gate/.
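Phase 3 mentions layer selection via TwoNN intrinsic dimensionality. As a rough illustration of the estimator itself, here is a minimal NumPy sketch of the maximum-likelihood form of TwoNN. The function name `twonn_intrinsic_dim` is hypothetical, and how the project's layer_selection.py turns per-layer estimates into a layer choice is not shown here.

```python
import numpy as np


def twonn_intrinsic_dim(points: np.ndarray) -> float:
    """Estimate intrinsic dimensionality from first/second neighbor distances.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Pairwise squared Euclidean distances via the Gram-matrix identity.
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    np.fill_diagonal(sq_dists, np.inf)       # a point is not its own neighbor

    # The two smallest distances per row are r1 <= r2.
    nearest = np.sqrt(np.partition(sq_dists, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Drop exact duplicates (r1 == 0), for which the ratio is undefined.
    valid = r1 > 0
    mu = r2[valid] / r1[valid]

    # Under the TwoNN model mu is Pareto-distributed with shape d,
    # so the maximum-likelihood estimate is N / sum(log mu).
    return float(valid.sum() / np.sum(np.log(mu)))
```

In a layer-selection setting, one would presumably call this on the hidden states of a sample of texts at each layer and compare the estimates across layers; the selection criterion itself is an open assumption here.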
All numbers below map to a saved JSON artifact under results/ -- see results/README.md for the full mapping.
| Finding | Detail | Source |
|---|---|---|
| Whitening + z-score normalization is essential (Phase 1, BEIR SciFact, Qwen3-VL-8B-Instruct 4bit) | NDCG@10: 0.397 without -> 0.821 with PCA-whitening + z-score | results/legacy_phase1_beir/exp4_beir_20260121_{231316,232031}/beir_results.json |
| 4-bit quantization preserves embedding quality (Phase 1, same as above) | Qwen3-VL-8B-Instruct at 4-bit reaches NDCG@10 = 0.821 on SciFact | results/legacy_phase1_beir/exp4_beir_20260121_232031/beir_results.json |
| INT8 delta-storage trades quality for footprint | 4x storage reduction; SciFact NDCG@10 0.821 -> 0.556 (~32% relative drop). Not "lossless" -- useful when storage dominates and the consumer can tolerate the recall loss. | results/legacy_phase1_beir/exp7_beir_storage_20260123_213956/beir_storage_results.json |
| TriviaQA end-to-end gate (Phase 3, Qwen2.5-7B-Instruct, n=100 x 3 seeds) | nnma EM = 0.76, rag = 0.72, cold = 0.52, random = 0.50. Random-injection control sits at cold floor -- recall is not coming from prompt structure. | results/phase3_e2e_gate/nnm_exp17c_20260425_184841/summary.json |
| Needle-in-haystack (Phase 3, n=50, 2.5-8k char articles) | nnma needle_recall = 1.00 vs rag = 0.86; paired McNemar p = 0.0156 | results/phase3_e2e_gate/nnm_exp17g_20260426_130615/summary.json |
| Separate retrieval/injection layers | Retrieval benefits from later layers, injection from earlier ones | -- (architecture, no single benchmark file) |
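The needle-in-haystack p-value in the table can be sanity-checked by hand. The sketch below assumes that needle_recall is a per-article binary hit, so nnma = 1.00 and rag = 0.86 on n = 50 imply 7 discordant pairs, all in favor of nnma. The function `mcnemar_exact_p` is a hypothetical helper for the exact (binomial) two-sided McNemar test, not code from the repository.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # One-sided binomial tail P(X <= k) with X ~ Binomial(n, 0.5), then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Assumed counts: 7 articles where only nnma recalled the needle, 0 where only rag did.
print(round(mcnemar_exact_p(7, 0), 4))  # 0.0156
```

Under that assumption the result matches the reported p = 0.0156 (exactly 2 x 0.5^7).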
neural-native-memory/
├── src/
│ ├── kvembed/ # Core KV-Embedding implementation (TransformerLens)
│ │ ├── config.py # Model/layer configuration
│ │ ├── kv_cache_tl.py # KV cache extraction
│ │ ├── layer_selection.py # TwoNN-based layer selection
│ │ ├── prompts.py # Compression prompts
│ │ └── transformerlens_backend.py
│ ├── storage/
│ │ ├── qdrant_store.py # Qdrant vector storage
│ │ └── retrieval_transform.py # Z-score, whitening, L2 normalization
│ └── legacy/ # Phase 1 code (HuggingFace-based, reference only)
├── experiments/
│ ├── nnm/ # Current experiments (TransformerLens-based)
│ │ ├── exp1-16 # Numbered experiments
│ │ └── common.py # Shared experiment utilities
│ └── legacy/ # Phase 1 experiments (1-10, steering, debug)
├── scripts/ # Runner scripts for Qdrant MVP
├── docs/
│ ├── concept-nnma.md # Full architecture whitepaper
│ ├── plans/ # Design documents
│ └── research/ # Paper extracts (txt)
├── docker-compose.qdrant.yml # Local Qdrant instance
├── requirements.txt
└── PLANNING.md
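The tree lists retrieval_transform.py as handling z-score, whitening, and L2 normalization. A minimal sketch of what such a transform could look like follows, assuming PCA-whitening fitted on a corpus sample and applied in that order. The function names and the `eps` default are illustrative assumptions, not the module's actual API.

```python
import numpy as np


def fit_retrieval_transform(vectors: np.ndarray, eps: float = 1e-6):
    """Fit z-score statistics and a PCA-whitening matrix on corpus vectors."""
    mean = vectors.mean(axis=0)
    std = vectors.std(axis=0) + eps           # eps guards constant dimensions
    z = (vectors - mean) / std
    cov = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    eigvals = np.clip(eigvals, 0.0, None)     # remove negative round-off
    whiten = eigvecs / np.sqrt(eigvals + eps)  # scale each eigenvector column
    return mean, std, whiten


def apply_retrieval_transform(vectors, mean, std, whiten, l2_normalize=True):
    """Apply z-score, then whitening, then optional L2 normalization."""
    out = ((vectors - mean) / std) @ whiten
    if l2_normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)
    return out
```

The statistics would be fitted once on stored vectors and then reused for queries, so that both sides of the similarity search live in the same normalized space.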
pip install transformer_lens qdrant-client torch numpy

For GPU (recommended):

pip install torch --index-url https://download.pytorch.org/whl/cu121

docker compose -f docker-compose.qdrant.yml up -d

- TransformerLens -- used for model internals access
- llama.cpp -- used in Phase 1 for GGUF KV-layer experiments
The quickstart commands below run on Qwen2-1.5B as a cheap smoke-test of the
Phase-3 pipeline. The Phase-1 BEIR headline numbers in the Key Results table
were produced on Qwen3-VL-8B-Instruct (4-bit) -- see
experiments/legacy/4_benchmark_beir.py --help for the full eval command.
# Phase-3 smoke (Qwen2-1.5B, fast on a single GPU)
python scripts/run_kvembed_tl.py
# Ingest into Qdrant
python scripts/ingest_ri_qdrant_tl.py \
--model Qwen/Qwen2-1.5B \
--device cuda \
--retrieval-layer 26 \
--injection-layer 18 \
--text "Some text to store"
# Search
python scripts/search_ri_qdrant_tl.py \
--model Qwen/Qwen2-1.5B \
--device cuda \
--retrieval-layer 26 \
--query "search query"
# Phase-1 BEIR headline reproduction (Qwen3-VL-8B-Instruct, 4bit)
python experiments/legacy/4_benchmark_beir.py \
--dataset scifact \
--model Qwen/Qwen3-VL-8B-Instruct \
--load_in_4bit --whitening --zscore \
--prompt '"{context}" Compress the Context in one word:'

See docs/nnm-quickstart.md for detailed usage and BEIR evaluation instructions.
- KV-Embedding: Training-free Text Embedding via Internal KV Re-routing
- Deconstructed Vectors concept -- full NNMA architecture vision