Interpretability by construction: route a layer's computation through a certified Legible Bottleneck, and emit a runtime bound on everything the named concepts cannot explain.
Concept-ANchored · Disentangled · Output · Routing
For a decade interpretability has been archaeology: train an opaque model, then dig for structure afterwards with probes, sparse autoencoders, and attribution graphs. By 2026 that paradigm is in open crisis. SAEs do not recover canonical features and often lose to linear probes, attribution graphs leave unbounded computation in "error nodes", and chain-of-thought is provably an unfaithful, optimization-fragile window. Every one of these measures faithfulness after the fact; none guarantees it.
CANDOR takes the opposite stance. It makes legibility an architectural invariant the model is required to satisfy, and emits the audit at inference time:
- The Legible Bottleneck. One composable primitive (to interpretability what self-attention is to sequence modelling): read a layer's computation through a sparse, typed, persistently-named concept code, reconstruct it, and keep and measure the unexplained remainder (the leak) instead of throwing it away.
- Causal by construction. The named concepts are co-trained to be causally sufficient (an interchange / causal-scrubbing objective used as a training signal, not a post-hoc test), so the explanation is what the model actually computes.
- A Faithfulness Certificate on every forward pass. δ(x) is the total-variation distance between the deployed model and its leak-ablated "legible" twin. It is a measurement, not a learned monitor, so it is sound by construction (it cannot under-report) and re-checkable by anyone with one extra forward pass.
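The encode, sparsify, decode, and measure-the-leak steps of the Legible Bottleneck can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the top-k sparsifier, the random untrained weights, the placeholder concept names, and the function names are hypothetical and are not taken from `candorkit/bottleneck.py`.

```python
import random

def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def top_k_sparse(code, k):
    """Keep the k largest-magnitude entries of the code and zero the rest."""
    order = sorted(range(len(code)), key=lambda i: abs(code[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(code)]

def legible_bottleneck(h, W_enc, W_dec, names, k):
    """Return (reconstruction, leak, active named concepts) for hidden vector h."""
    code = top_k_sparse(matvec(W_enc, h), k)       # sparse concept code
    recon = matvec(W_dec, code)                    # the part the concepts explain
    leak = [hi - ri for hi, ri in zip(h, recon)]   # unexplained remainder, kept
    active = {names[i]: c for i, c in enumerate(code) if c != 0.0}
    return recon, leak, active

# Hypothetical sizes and random (untrained) weights, for illustration only.
random.seed(0)
d, m, k = 6, 10, 3
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # m x d
W_dec = [[random.gauss(0, 1) for _ in range(m)] for _ in range(d)]  # d x m
names = [f"concept_{i}" for i in range(m)]                          # placeholder names
h = [random.gauss(0, 1) for _ in range(d)]

recon, leak, active = legible_bottleneck(h, W_enc, W_dec, names, k)
full = [r + l for r, l in zip(recon, leak)]  # "full" routing: reconstruction plus leak
print(sorted(active))
```

In this sketch the full route (reconstruction plus leak) reproduces `h` up to floating-point rounding, while dropping `leak` gives the legible route. The gap between the two routes is what the certificate measures.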
Thesis. You do not have to recover interpretability after the fact. You can require the architecture to have it, and make the model prove a bound on its own dark computation, one forward pass at a time.
The paper is at paper/candor.pdf. Every reported number is measured
by experiments/run_all.py and emitted into the paper automatically. Nothing is
hand-entered.
On synthetic tasks with known concepts and a known mechanism, the only setting where interpretability can be checked rather than asserted, CANDOR:
| Property | Result |
|---|---|
| Interpretability tax (legible vs. opaque accuracy) | none measurable |
| Faithfulness certificate δ (mean) | ≈0.002 (99.9% model/explanation agreement) |
| Concept recovery vs. ground truth | 0.92 (comparable to a post-hoc SAE at 0.97) |
| Causal necessity gap (relevant vs. irrelevant ablation) | ≈0.77 |
| Directional causal accuracy (held-out interventions) | ≈0.82 |
| Backdoor (channel-bypass) detection by δ | AUC ≈0.90 |
| Cross-run concept stability (anchored vs. free) | 0.72 vs. 0.00 |
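
The backdoor-detection row scores δ as a detector with an AUC. As a rough sketch of how such a number can be computed, the snippet below evaluates the rank-based (Mann-Whitney) form of the AUC on per-input δ values. The function name and the toy δ values are made up for illustration and are not results from the paper; the repository's own evaluation code may differ.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy, hand-written delta values (NOT measured): clean inputs vs. triggered inputs.
delta_clean = [0.001, 0.003, 0.002, 0.20]
delta_triggered = [0.35, 0.10, 0.60]

print(auroc(delta_triggered, delta_clean))
```

An AUC of 1.0 would mean every triggered input has a larger δ than every clean one, and 0.5 would mean δ carries no detection signal.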
Objective ablations show every term governs a distinct guarantee: drop the leak term and recovery collapses; drop the causal term and the leak regains causal power; drop anchoring and concept identity stops being stable across runs. The conjunction is the contribution. The Legible Bottleneck also composes with attention on a sequence task.
Honest real-LLM probes (GPT-2 small, and a from-scratch LM). Retrofitting the channel
onto a frozen GPT-2 shows the certificate machinery transfers to a real model (δ is
measurable by splicing and running GPT-2), but a frozen retrofit does not favour CANDOR
over a reconstruction-only SAE (δ 0.11 vs 0.13). A matched layer-wise fine-tuning
probe (exp_gpt2_ft.py, unfreezing the spliced MLP under a capability-anchoring LM loss)
extends the finding two ways: an ~8x reconstruction improvement leaves the certificate no
tighter (reconstruction and faithfulness dissociate), and adding the behavioural +
causal objectives still does not beat the reconstruction-only control (δ 0.156 vs 0.128).
A sweep (exp_gpt2_ft_sweep.py) shows this negative is stable across 3 seeds and a
tenfold weighting range (δ 0.143 to 0.165 vs 0.128 ± 0.001). Retrofitting, in every form
we tried, does not vindicate the objectives.
By-construction training flips it (exp_lm_scratch.py): a 4-layer LM trained from
scratch on TinyShakespeare with its bottlenecks certifies materially tighter than a
matched reconstruction-only control (δ 0.467 vs 0.645; leak-swap 0.579 vs 0.748; top-1
agreement 48% vs 32%; consistent in both seeds) and tighter than post-hoc SAEs spliced
into the opaque twin at every site (δ 0.657), at no measured tax (held-out loss 8.53
vs the opaque 9.60, in a heavily overfit small-corpus regime where the objectives act as
regularisers). The control reconstructs better (unexplained variance 0.08 vs 0.19) yet
certifies far looser, so the dissociation runs in both directions. The absolute
certificate stays loose (δ ≈ 0.47): the by-construction prediction now has small-scale
real-text support; whether it certifies tightly at realistic scale is the open
question.
```shell
pip install -e ".[experiments,dev]"
python -m pytest                   # offline unit tests
python demo/quickstart.py          # train a glass-box model, certify it, catch a backdoor
python experiments/run_all.py      # regenerate results/*.json (planted, tax, seq)
python scripts/paper_numbers.py    # results -> paper/_numbers.tex
python scripts/make_figures.py     # results -> paper/figures/*.pdf
make paper                         # compile paper/candor.pdf
```

```python
import candorkit as ck

data = ck.planted_concepts(task="sum")              # known ground truth
tr, va = ck.split(data.X.shape[0])
model = ck.LegibleMLP(d_in=data.n_in, d_h=128, n_out=data.n_classes, m=48, k=8)
ck.train_candor(model, data.X, data.y, tr, ck.TrainConfig())
cert = ck.certify(model, data.X[va], data.y[va])    # runtime guarantee
print(cert.delta, cert.agreement)                   # sound, re-checkable
print(cert.active_concepts[0])                      # the named explanation
```

| Path | What it is |
|---|---|
| candorkit/bottleneck.py | the Legible Bottleneck primitive (encode, sparse named code, decode, measured leak) |
| candorkit/model.py | CANDOR MLP, Transformer, and decoder-only LM (plus opaque twins) with the three routing modes (full / legible / leak-swap) |
| candorkit/losses.py | the conjoined objective: completeness + faithfulness + leak + causal sufficiency + anchoring |
| candorkit/certificate.py | the Faithfulness Certificate δ(x) and the named explanation |
| candorkit/metrics.py | ground-truth checks: concept recovery, causal necessity, stability |
| experiments/ | planted concepts, the tax/legibility frontier, a sequence (attention) task, the GPT-2 retrofit and fine-tuning probes, and by-construction LM training |
| scripts/ | results to paper numbers, results to figures |
| paper/candor.tex | the paper, The Explanation Is the Forward Pass: Interpretability by Construction with Certified Legible Bottlenecks |

The deployed model M has a twin, the legible replacement M_leg, obtained by ablating
every leak. Its output is a pure function of the named concept codes. The certificate is
the total-variation distance between them,
δ(x) = TV( M(x), M_leg(x) ) ∈ [0, 1]
Because δ(x) is computed by one extra forward pass rather than predicted by a learned
head, it is sound and one-sided: it can never under-report the divergence between the model
and its named-concept explanation, and it cannot be Goodharted into silence. A backdoor
that bypasses the channel has to route through the leak, so it surfaces as a δ spike
rather than hiding. A large honest δ tells an auditor exactly when not to trust the
explanation.
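
The definition above can be made concrete with a small sketch that computes δ(x) from the output logits of the two routes. It assumes the model emits class logits that are turned into distributions by a softmax, and it reads "agreement" as top-1 agreement between the two routes. The function names are hypothetical, and `ck.certify` may compute these quantities differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def certificate(full_logits, legible_logits):
    """Return (delta, top-1 agreement) for one input.

    full_logits:    output of the deployed model M(x)
    legible_logits: output of the leak-ablated twin M_leg(x)
    """
    p = softmax(full_logits)
    q = softmax(legible_logits)
    delta = tv_distance(p, q)
    agree = p.index(max(p)) == q.index(max(q))
    return delta, agree

# Identical routes give delta = 0; routes that disagree strongly push delta toward 1.
print(certificate([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
print(certificate([10.0, 0.0], [0.0, 10.0]))
```

Because δ is computed directly from the two output distributions, anyone who can run both routes can recompute it. No learned component sits between the model and the reported number.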
These are small-scale demonstrations on synthetic ground truth and a one-layer
Transformer, not a frontier-LLM result. I deliberately chose the regime where soundness
can be checked. CANDOR does not claim completeness: computation in superposition
guarantees some bypass, so δ cannot be driven to zero on arbitrary computation. The
contribution is to make that bypass measured, bounded, and certified rather than
unmeasured, and the certificate's value is soundness, not tightness: a large honest δ
tells you exactly when not to trust the named-concept explanation. See the paper's
Limitations section. Every component (sparse named channels, causal-faithfulness training,
concept bottlenecks, sound over-approximation) has precedent; the novelty is the
conjunction, operating jointly and at runtime.
```bibtex
@misc{debes2026candor,
  title  = {The Explanation Is the Forward Pass: Interpretability by
            Construction with Certified Legible Bottlenecks},
  author = {Debes, Anwar},
  year   = {2026},
  url    = {https://github.com/AnwarDebes/candor},
  note   = {Preprint. Reference implementation (CANDOR)}
}
```

MIT. See LICENSE.