Context
Mnemonic's current production base is Gemma 4 E2B + Felix-LM low-rank spokes (rank 64, 35 layers, RQ4 quantized to ~3GB on RX 7800 XT). Two structural problems with this stack drive the v2 epic:
- RQ4 quantization collapses behavior. EXP-34.2 (2026-05-10): the 32-element BetaQ codebook rounds small but semantically critical logits (refusal tokens at <1% probability) to zero. Python serve Likert 3.95 → GGUF Likert 3.03 (Δ −0.92). The bf16 weights are an upper bound on deployment quality, not a faithful preview. A separate kernel-layer track is fixing RQ4 itself; this epic is the orthogonal model-side track.
- The architecture is someone else's wall with our paint on it. The post is Google's; the spokes are ours; crispr-lm splices into spoke tensors that were never designed for live editing. No architectural lever to keep refusal logits loud enough to survive RQ4 rounding, no circuit-stability guarantee across many sequential edits.
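A toy sketch of the rounding failure described above, for intuition only. It assumes a signed, absmax-scaled uniform codebook with 32 codes as a stand-in for BetaQ, whose actual levels are not described here; `quantize_block` and the sample values are hypothetical.

```python
def quantize_block(values, levels=32):
    """Round each value in a block to the nearest entry of a uniform codebook.

    Assumption: signed integer codes in [-levels/2, levels/2 - 1], scaled so
    the largest magnitude in the block maps to the top code. The real BetaQ
    codebook may be laid out differently.
    """
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0.0 for _ in values]
    lo, hi = -(levels // 2), levels // 2 - 1
    step = absmax / hi                      # spacing between codebook entries
    out = []
    for v in values:
        code = max(lo, min(hi, round(v / step)))
        out.append(code * step)             # dequantized value
    return out


# One large-magnitude block with two small entries (illustrative numbers).
block = [8.0, -6.5, 3.2, 0.2, -0.15]
quantized = quantize_block(block)
for before, after in zip(block, quantized):
    print(f"{before:+.3f} -> {after:+.3f}")
```

Any entry whose magnitude is below half a codebook step is stored as exactly zero, which is the mechanism the EXP-34.2 finding points at: the small values are lost entirely rather than slightly perturbed.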
Goal
A from-scratch bespoke 300–500M model running under 4GB VRAM at deployment quant on the local RX 7800 XT (no MI300X) that:
- Beats a multi-teacher ensemble (Gemma 4 31B + Claude Haiku 4.5 + Gemini 3.1 Pro, all via API) on mnemonic's task suite (encoding, retrieval-synthesis, principle abstraction, episode summarization) — joint-min Likert with cross-grader rubric.
- Survives the deployment quantization scheme without behavioral collapse.
- Has splice tensors by construction — a dedicated trainable tensor surface for crispr-lm runtime edits, integrated into the architecture from layer 0, not retrofitted on someone else's frozen post.
Gemma 4 31B's role in this epic is one of three teachers in the ensemble, nothing more. Gemma 4 E2B + spokes (current production) is the v1 baseline we beat. Neither stays in the final shipped model.
Constraints (locked by user, 2026-05-12)
- Pretrain from scratch. No continued-pretrain on a vendor base.
- Local-only RX 7800 XT 16GB. No MI300X. Forces the model size into the 300–500M range — what fits training-time memory at seq_len 2048 with grad checkpointing.
- Splice tensors by construction.
- Multi-teacher ensemble as 30B reference. Output-matching black-box distillation only.
- Deployment VRAM cap <4GB at quant.
- No timeline pressure but no wasted cycles.
Compute Envelope Math
| Target params | Training memory | Throughput | Chinchilla (20 tok/param) | Modestly overtrained (100 tok/param) |
|---|---|---|---|---|
| 300M | ~7GB (weights+opt+acts) | ~6500 tok/s | 6B tok ≈ 11 days | 30B tok ≈ 53 days |
| 500M | ~10GB | ~4100 tok/s | 10B tok ≈ 28 days | 50B tok ≈ 140 days |
| 1B | ~12GB (tight) | ~2000 tok/s | 20B tok ≈ 115 days | (too long) |
Initial target: 300M, ~30B tokens (~100 tok/param), ≈53 days wall-clock at the ~6500 tok/s projected above (a ~30-day budget covers roughly 17B tokens, ~56 tok/param). If 300M doesn't clear the quality bar, scale to 500M with lessons learned.
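The wall-clock figures in the table follow directly from token budget divided by throughput. A minimal sketch of that arithmetic, using only the parameter counts and projected throughputs listed above (the helper name `wall_clock_days` is illustrative):

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(tokens, tokens_per_second):
    """Days of continuous training needed to consume `tokens`."""
    return tokens / tokens_per_second / SECONDS_PER_DAY


# (parameter count, projected throughput in tok/s) from the table above.
targets = {"300M": (300e6, 6500), "500M": (500e6, 4100), "1B": (1e9, 2000)}

for name, (params, tps) in targets.items():
    chinchilla = wall_clock_days(20 * params, tps)    # 20 tok/param
    overtrained = wall_clock_days(100 * params, tps)  # 100 tok/param
    print(f"{name}: 20 tok/param = {chinchilla:.0f} days, "
          f"100 tok/param = {overtrained:.0f} days")
```

These are lower bounds: they assume the projected throughput holds for the whole run, with no restarts, evaluation pauses, or thermal throttling.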
Architecture: Felix-LM v2
To be frozen in EXP-41. Initial candidate: 300M dense transformer, 24 layers × hidden 1024 × 16 heads, FFN 4096, vocab 32K (custom BPE on mnemonic domain text), RoPE, SpliceCore tensors at every 4th layer (rank 8 each) — a dedicated trainable tensor surface for crispr-lm edits, distinct from any per-task spokes. Alternative under consideration: 400M parallel Mamba-2 + attention hybrid for self-speculation compatibility (Component-Aware Self-Spec arxiv:2605.01106).
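A rough parameter count for the initial candidate, as a sanity check on the "300M" label. This is a sketch under stated assumptions: a 2-matrix (non-gated) FFN, tied input/output embeddings, a vocab of exactly 32,000, biases and norm weights ignored, and each SpliceCore tensor modeled as a low-rank pair of shape hidden × rank and rank × hidden. None of these details are fixed until EXP-41, and both function names are hypothetical.

```python
def dense_transformer_params(layers, hidden, ffn, vocab,
                             tied_embeddings=True, gated_ffn=False):
    """Approximate weight count, ignoring biases and normalization weights."""
    attn = 4 * hidden * hidden                    # Q, K, V and output projections
    mlp = (3 if gated_ffn else 2) * hidden * ffn  # gated FFNs add a third matrix
    embed = vocab * hidden
    if not tied_embeddings:
        embed *= 2                                # separate output projection
    return layers * (attn + mlp) + embed


def splice_params(layers, hidden, rank, every):
    """Assumed SpliceCore cost: one low-rank pair at every `every`-th layer."""
    sites = layers // every
    return sites * 2 * hidden * rank


base = dense_transformer_params(layers=24, hidden=1024, ffn=4096, vocab=32_000)
splice = splice_params(layers=24, hidden=1024, rank=8, every=4)
print(f"backbone ~{base / 1e6:.0f}M params, SpliceCore ~{splice / 1e3:.0f}K params")
```

Under these assumptions the transformer blocks alone come to about 302M and the embedding table adds about 33M. A gated FFN at the same width would push the total to roughly 435M, so the FFN choice matters for staying inside the memory envelope.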
Training Recipe
- Multi-teacher harvest. Gemma 4 31B (OpenRouter/Together) + Haiku 4.5 (Anthropic Batch) + Gemini 3.1 Pro (Google Batch). ~5–10K examples per task per teacher.
- FinePhrase-style synthetic structured-output pretraining (arxiv:2604.13977) — generator size plateaus at 1B, so the existing Gemma 4 E2B daemon (or Haiku batch) can produce the entire corpus cheaply.
- SODA semi-on-policy black-box distillation (arxiv:2604.03873) on task-aligned slices — 10× faster + 27% less GPU memory than full OPD.
- QUAIL refusal-loud hinge loss (arxiv:2601.15538, adapted) on a whitelist of refusal-emit and schema-critical tokens, baked in from step 0.
- WSD LR schedule.
- Scaling-law-aware data/compute split per Prescriptive Scaling (arxiv:2605.01640) and InfoLaw (arxiv:2605.02364).
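For reference, a warmup-stable-decay (WSD) schedule can be sketched as a pure function of the step. The linear warmup, linear decay, and every hyperparameter value here are assumptions for illustration; the recipe above does not specify the decay shape or phase lengths.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate at a given (0-indexed) step.

    Phases: linear warmup to peak_lr, constant plateau, then a linear decay
    to min_lr over the final `decay_steps` steps (decay shape is an assumption).
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Illustrative values only: 1000 steps, 100 warmup, 200 decay.
for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, total_steps=1000, peak_lr=1e-3,
                    warmup_steps=100, decay_steps=200))
```

One practical property of WSD for a long single-GPU run is that the plateau has no dependence on the total step count, so a checkpoint from the stable phase can be decayed early or extended without restarting the schedule.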
Pre-Registered Experiment Sequence
Full pre-registration in training/docs/experiment_registry.md Phase 7. Each entry has hypothesis / variable / control / prediction / fail-conditions per .claude/rules/scientific-method.md.
| ID | Name | Role |
|---|---|---|
| EXP-40 | QUAIL hinge probe on existing v3.2 spoke | Half-day technique-validation probe. NOT the main path. Confirms QUAIL implementation correctness on a known-failing Gemma checkpoint. |
| EXP-41 | From-scratch architecture freeze | Profile candidate specs (dim, layers, attention type, splice topology), produce training/docs/felix_lm_v2_design.md |
| EXP-42 | 50M arch pilot | Scale-down sanity check before the 300M run. Verifies training stability, SpliceCore signal, throughput projections. |
| EXP-43 | Multi-teacher harvest + synthetic pretrain corpus | ~10–30B token corpus assembled from teacher distillation + FinePhrase-style synth + curated text. |
| EXP-44 | 300M from-scratch pretrain (the main run) | ~30 days wall-clock on RX 7800 XT. SODA distillation + QUAIL + WSD. |
| EXP-45 | crispr-lm splice into SpliceCore tensors | Splice the from-scratch model. |
| EXP-46 | Sequential splice circuit-stability | 100+ sequential edits with bounded drift, per Old Maps arxiv:2605.06076. |
| EXP-47 | Headline ensemble eval | "300M from-scratch beats 30B ensemble on the mnemonic task suite." |
Novel Contributions (publishable bars)
- Felix-LM v2 from-scratch splice-native architecture. Splice tensors integrated at training time, not retrofitted as adapters.
- QUAIL-for-refusal baked into pretraining. Extends QUAIL's logit-space hinge loss from unlearning to refusal-token preservation under blockwise quantization.
- Multi-teacher ensemble distillation for cognitive task suite. Unifies SODA + Uni-OPD with multi-teacher diversity.
- Live splice circuit-stability bounds. Sequential-edit Circuit Distance / Stability measurements on a from-scratch model.
- Single-consumer-GPU bespoke training. The whole stack trained on a single 16GB AMD RX 7800 XT, no datacenter compute.
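To illustrate what a logit-space hinge on whitelisted tokens could look like, here is a minimal sketch. It is not the contents of training/scripts/quail_loss.py: the gap definition (target logit minus the best competing logit), the margin value, and the function name `whitelist_hinge_loss` are all assumptions made for this example.

```python
def whitelist_hinge_loss(logits, target_id, whitelist, margin=2.0):
    """Hinge penalty that pushes a whitelisted target token's logit to lead
    the strongest competing logit by at least `margin`.

    Returns 0.0 for targets outside the whitelist, so ordinary tokens are
    left to the main training loss.
    """
    if target_id not in whitelist:
        return 0.0
    runner_up = max(l for i, l in enumerate(logits) if i != target_id)
    gap = logits[target_id] - runner_up
    return max(0.0, margin - gap)


# Hypothetical 3-token vocabulary where token 1 is refusal-critical.
whitelist = {1}
print(whitelist_hinge_loss([1.0, 3.0, 2.5], target_id=1, whitelist=whitelist))
print(whitelist_hinge_loss([0.0, 5.0, 1.0], target_id=1, whitelist=whitelist))
```

The intent of a margin in logit space, as opposed to a probability-space loss, is that the gap is what has to survive rounding: a target that wins by a wide logit margin has more headroom before quantization error flips the ranking.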
Files (planned)
New:
- training/configs/felix_lm_v2_architecture.yaml
- training/configs/felix_lm_v2_data_mix.yaml
- training/scripts/felix_lm_v2_model.py: from-scratch architecture
- training/scripts/train_felix_lm_v2.py: pretraining + distillation + QUAIL loop
- training/scripts/quail_loss.py: DONE (committed in af37aa9)
- training/scripts/harvest_multi_teacher.py
- training/scripts/synth_structured_pretrain.py: FinePhrase-style rephraser
- training/scripts/tokenizer_train.py: custom BPE on mnemonic domain text
- training/docs/felix_lm_v2_design.md: architecture spec + publishable paper draft skeleton

Reused without modification:
- training/scripts/harvest_gemini_batch.py, diff_teacher_corpora.py: generalized in harvest_multi_teacher.py
- training/scripts/validate.py: faithfulness gate
- training/scripts/download_*.py: corpus ingest
- training/scripts/audit_mix.py: data mix audit
- training/scripts/characterize_serve_output.py, compare_models.py, stress_test_hallucination.py: eval

Modified:
- training/scripts/train_spokes.py: DONE (QUAIL wire-in committed in af37aa9; lives there to serve EXP-40 probe)
- training/docs/experiment_registry.md: DONE (Phase 7 pre-registration committed in af37aa9, will be updated to from-scratch framing)
- third_party/llama.cpp/: coordinate with parallel quantization-kernel agent on GGUF format for the from-scratch architecture; the from-scratch model is not Gemma so it needs custom GGUF support in the fork
- internal/api/routes/splice_tensor.go: extend for SpliceCore tensor names if the new architecture's tensor naming differs
Branch
feat/felix-lm-v2-splice-native. First commit af37aa9 adds the QUAIL loss module + EXP-40 wire-in + Phase 7 pre-registration.
🤖 Generated with Claude Code