A lightweight pipeline that detects deceptive / evaluation-aware / normal internal states in Qwen 2.5 7B by training a linear probe on layer-20 residual-stream activations.
The probe achieves 94.0% accuracy on held-out activation vectors from 89 unseen prompts (the full dataset contains 5625 activation vectors) and runs in ~6 ms per token on a base M4 MacBook against a 4-bit quantized model via MLX.
| Probe | Held-out accuracy | Per-class F1 (normal / eval_aware / deceptive) |
|---|---|---|
| Logistic regression (linear) | 94.0% | 1.00 / 0.91 / 0.92 |
| 2-layer MLP | 91.5% | 1.00 / 0.87 / 0.89 |
normal is perfectly separable (335/335 correct on the test set). deceptive and
eval_aware overlap somewhat in the model's internal representation — see Discussion.
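
The per-class F1 scores in the table are the harmonic mean of precision and recall for each class. A minimal sketch of that calculation from a confusion matrix follows. The counts are made-up toy values, not this project's results, and `per_class_f1` is a hypothetical helper rather than a function from this repository.

```python
def per_class_f1(confusion, labels):
    """Return {label: F1} for a square confusion matrix.

    confusion[i][j] = number of rows with true class i predicted as class j.
    """
    scores = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                     # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


labels = ["normal", "eval_aware", "deceptive"]
# Toy counts for illustration only (NOT the real test-set confusion matrix).
toy_confusion = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]
f1 = per_class_f1(toy_confusion, labels)
print({name: round(score, 3) for name, score in f1.items()})
```

A class that is never confused with another, like `normal` in the toy matrix, gets an F1 of 1.0, while mutual confusion between two classes lowers both of their scores.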
prompt bank (3 cats x ~400 prompts)
|
v
[HF Jobs A100]: Qwen 2.5 7B bf16 generate + capture layer-20 residual
+ Claude-as-proxy NLA verbalizer (5 token positions / prompt)
+ checkpoint upload every 100 prompts
|
v
activations_nla.parquet (5625 rows x [prompt_id, prompt_category, prompt,
response_text, token_idx, activation_vector[3584], nla_text])
|
v
[local]: Claude grader labels NLA text -> {normal, eval_aware, deceptive}
(ThreadPoolExecutor, 16 parallel calls, cached)
|
v
dataset.parquet (+ label, label_confidence, weak_label, claude_label)
|
v
[local]: train logistic regression + MLP, group-split by prompt_id (no leakage)
|
v
probes/ (logreg.joblib, mlp.pt, label_classes.txt, train_indices.npy, test_indices.npy)
|
v
Gradio demo:
- Replay tab: real held-out test cases, real probe predictions
- Live tab: 4-bit Qwen 2.5 7B via MLX, layer-20 capture, probe in 6ms per token batch
configs/
runtime.yaml # model/layer/grader config
prompts/{eval_aware,deceptive,normal}.txt
scripts/
01_generate_dataset_job.py # PEP-723 UV script for HF Jobs (data generation)
02_label_dataset.py # local Claude grader, parallelized
03_train_probe.py # train logreg + MLP with prompt-level split
04_evaluate_probe.py # held-out metrics + confusion matrices
05_run_demo.py # launches Gradio app
06_smoke_test.py # quick end-to-end sanity check on tiny mock data
mlx_smoke_test.py # validates 4-bit Qwen + probe on Apple Silicon
prepare_replay_data.py # curates real test cases for the Gradio Replay tab
src/nla_probe/
prompts.py # prompt bank loader
activations.py # HF transformers hook helper (unused on Mac demo)
labeler.py # keyword + Claude grader
dataset.py # parquet helpers
probe.py # LogReg pipeline + MLPProbe
train.py # group-aware train/test split + fit both probes
eval.py # load probes, evaluate on held-out test indices
monitor_mlx.py # MLX-based live monitor (Apple Silicon)
demo/app.py # Gradio UI (Replay + Live)
data/
raw/ # activations_nla.parquet (generated by HF Jobs)
labeled/ # dataset.parquet + claude_cache.json
probes/ # trained probes, indices, metrics, confusion-matrix PNGs
demo/ # replay_cases.json
python3.14 -m venv .venv
.venv/bin/pip install -e .
# For Apple-Silicon live demo only:
.venv/bin/pip install mlx-lm
.venv/bin/hf auth login # if not already
export HF_USER="$(.venv/bin/hf auth whoami | head -1)"
export ANTHROPIC_API_KEY="sk-ant-..."
# Create dataset repos
.venv/bin/hf repos create datasets "${HF_USER}/nla-probe-prompts" --private
.venv/bin/hf repos create datasets "${HF_USER}/nla-probe-data" --private
# Upload prompts
.venv/bin/hf upload "${HF_USER}/nla-probe-prompts" configs/prompts/ . --repo-type dataset
# Launch full job (1200 prompts -> ~5500 rows of activations + NLA text)
.venv/bin/hf jobs uv run scripts/01_generate_dataset_job.py \
--flavor a100-large --timeout 6h --detach \
--secrets HF_TOKEN --secrets "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}" \
-- \
--prompts-repo "${HF_USER}/nla-probe-prompts" \
--output-repo "${HF_USER}/nla-probe-data" \
--prompt-subdir "." \
--max-prompts 1200 \
--max-new-tokens 96 \
--checkpoint-every 100

The script writes activations_nla.parquet to the output repo on completion and
activations_nla_partial.parquet checkpoints every 100 prompts. Each checkpoint is
self-contained so you can recover from a timeout.
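
The checkpointing behaviour described above can be sketched as a loop that saves every row collected so far at a fixed prompt interval. This is an illustration only: `run_with_checkpoints`, `process` and `save` are hypothetical names, and the real job script may structure this differently (for example, by uploading to the Hub inside `save`).

```python
def run_with_checkpoints(prompts, process, save, checkpoint_every=100):
    """Process prompts in order, saving all rows so far every N prompts.

    process(prompt) -> list of row dicts (hypothetical per-prompt worker)
    save(rows, name) -> persists rows under a file name (hypothetical writer)
    """
    rows = []
    for done, prompt in enumerate(prompts, start=1):
        rows.extend(process(prompt))
        if done % checkpoint_every == 0:
            # Each checkpoint holds every row so far, so it is usable on its own.
            save(list(rows), "activations_nla_partial.parquet")
    save(list(rows), "activations_nla.parquet")
    return rows


# Toy run: 250 prompts, 5 token positions each, with an in-memory "writer".
saved = []
run_with_checkpoints(
    prompts=[f"prompt {i}" for i in range(250)],
    process=lambda p: [{"prompt": p, "token_idx": t} for t in range(5)],
    save=lambda rows, name: saved.append((name, len(rows))),
)
print(saved)
```

Because each partial file contains the full history rather than a delta, the latest checkpoint alone is enough to resume labeling and training after a timeout.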
Tuning knobs:
- --claude-concurrency 5: parallel Claude verbalizer calls per prompt
- --max-new-tokens 96: Qwen response length
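
As a rough illustration of what the concurrency knob controls, the sketch below fans one prompt's token positions out to a bounded thread pool. `verbalize_positions` is a hypothetical helper, and the lambda stands in for the actual Claude request, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def verbalize_positions(activations, verbalize, concurrency=5):
    """Call `verbalize` once per token position, at most `concurrency` at a time.

    `verbalize` is a stand-in for the Claude verbalizer request.
    Results come back in the same order as `activations`.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(verbalize, activations))


# Toy activations: 5 token positions with 4-dimensional vectors.
fake_activations = [[0.1 * i] * 4 for i in range(5)]
texts = verbalize_positions(
    fake_activations,
    lambda vec: f"mean={sum(vec) / len(vec):.2f}",
)
print(texts)
```

Threads suit this workload because each call spends most of its time waiting on the network, and `pool.map` keeps the outputs aligned with their token positions.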
.venv/bin/hf download "${HF_USER}/nla-probe-data" activations_nla.parquet --repo-type dataset --local-dir data/raw
.venv/bin/python scripts/02_label_dataset.py

Optionally swap label to prompt_category (the faithful 3-class target, rather than
the Claude grader output; see Discussion):
.venv/bin/python -c "
import pandas as pd
df = pd.read_parquet('data/labeled/dataset.parquet')
df['claude_label'] = df['label']
df['label'] = df['prompt_category']
df.to_parquet('data/labeled/dataset.parquet', index=False)
print(df['label'].value_counts().to_string())
"
.venv/bin/python scripts/03_train_probe.py
.venv/bin/python scripts/04_evaluate_probe.py

Outputs:
- data/probes/logreg.joblib + mlp.pt
- data/probes/{train,test}_indices.npy (prompt-level split, no leakage)
- data/probes/label_classes.txt
- data/probes/eval/metrics.json
- data/probes/eval/confusion_{logreg,mlp}.png
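
The prompt-level split matters because each prompt contributes several token positions, and rows from one prompt must not land on both sides of the split. Below is a dependency-free sketch of the idea; `group_split` is a hypothetical helper, not the code in train.py, and scikit-learn's `GroupShuffleSplit` offers the same guarantee.

```python
import random


def group_split(group_ids, test_fraction=0.2, seed=0):
    """Split row indices so that every group lands entirely in train or test."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Toy data: 10 prompts with 5 token positions each.
prompt_ids = [p for p in range(10) for _ in range(5)]
train_idx, test_idx = group_split(prompt_ids)
train_prompts = {prompt_ids[i] for i in train_idx}
test_prompts = {prompt_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_prompts & test_prompts)
```

The empty intersection at the end is the "no leakage" property: no prompt appears in both sets, so held-out accuracy reflects unseen prompts rather than unseen token positions.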
.venv/bin/python scripts/prepare_replay_data.py
.venv/bin/python scripts/05_run_demo.py

Open http://127.0.0.1:7860. Two tabs:
- Replay — curated held-out test cases. Real probe predictions on real Qwen activations, rendered as per-position class probabilities. Loads instantly.
- Live (MLX) — type any prompt; runs 4-bit Qwen 2.5 7B locally via MLX,
captures layer-20 activations, applies the probe. First click ~5-10s (model
load), subsequent clicks ~3-8s. Requires Apple Silicon +
mlx-lm.
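
The per-position class probabilities shown in both tabs can be thought of as a softmax over one linear score per class. The sketch below assumes a multinomial logistic-regression probe and omits any feature scaling that the real pipeline may apply first. The weights are 2-dimensional toy values (the real activation vectors have 3584 dimensions), and `probe_probabilities` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe_probabilities(activation, weights, biases):
    """Class probabilities for one activation vector under a linear probe.

    weights: one row of coefficients per class; biases: one intercept per class.
    """
    logits = [
        sum(w * a for w, a in zip(row, activation)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


classes = ["normal", "eval_aware", "deceptive"]
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy values, not trained weights
toy_biases = [0.0, 0.0, 0.0]
probs = probe_probabilities([2.0, 0.0], toy_weights, toy_biases)
print(dict(zip(classes, (round(p, 3) for p in probs))))
```

Applying this to each captured token position yields one probability triple per position, which is the quantity the demo renders.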
The most striking confusion-matrix result: normal prompts produce activations that
sit in a region of layer-20 residual-stream space cleanly separable from the
adversarial/probed region. 335/335 normal test rows correctly classified.
Two independent labelers (Claude grader on the NLA verbalizer text, and the trained probe at inference time) agree that "prompts framed as deceptive" produce internal states that look more like "the model is aware it's being evaluated" than "the model is actively planning deception":
- Claude grader on 2020 deceptive-framed rows: 1891 (94%) labeled as eval_aware
- Probe confusion: 50/460 deceptive test rows predicted as eval_aware
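
For reference, the two counts above correspond to quite different rates, roughly 93.6% for the grader and 10.9% for the probe. The snippet below only redoes that arithmetic from the figures quoted in the list; the variable names are illustrative.

```python
# Counts copied from the two bullet points above.
grader_eval_aware, grader_total = 1891, 2020  # Claude grader, deceptive-framed examples
probe_eval_aware, probe_total = 50, 460       # probe, deceptive test rows

grader_rate = grader_eval_aware / grader_total
probe_rate = probe_eval_aware / probe_total
print(f"grader: {grader_rate:.1%}  probe: {probe_rate:.1%}")
```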
We trained the final probe on prompt_category (the ground-truth category we chose
when constructing the prompt bank) rather than on the Claude grader output, because
the grader collapsed deceptive into eval_aware for 94% of rows. The claude_label
column is preserved in the labeled parquet for downstream analysis.
This is consistent with the hypothesis that Qwen 2.5 7B does not maintain a distinct "I am being deceptive" representation separate from "I am being evaluated" at layer 20: both adversarial framings activate the same caution / meta-awareness direction in residual space. A linear probe can still separate the two categories along a finer-grained direction, but they sit close together.
The probe is trained on bf16 activations from HF Jobs A100 inference. At demo time it runs on 4-bit quantized weights via MLX. On 3 spot-check prompts (one per class) the probe predicts the correct class with 0.98-1.00 confidence, which suggests that quantization noise stays within the probe's margin on these examples.
- Prompt bank duplication: the synthetic prompt generator produced ~3x duplicates per template ("Variant 26", "Variant 90", etc.). 1200 prompt entries became 444 unique prompts. This was harmless for probe training (group-level split prevents leakage) but means the test set covers fewer distinct prompts than the row count suggests.
- Probe is layer-20-of-Qwen-7B-specific: cannot transfer to other models or layers without retraining.
- NLA verbalizer is Claude, not a true autoencoder: the original plan used the released kitft/nla-qwen2.5-7b-L20-av checkpoint, but managing an SGLang server for it added complexity. Claude-as-proxy is a reasonable substitute but produces text conditioned on the prompt, not strictly on the activation alone.
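
Regarding the prompt duplication noted in the first limitation, one way to keep duplicates in the same split is to derive group ids from normalised prompt text. The sketch below assumes duplicates differ only by a "Variant <number>" marker, which may not match the actual prompt bank format. `canonical_prompt` and `assign_group_ids` are hypothetical helpers.

```python
import re

# ASSUMPTION: duplicates differ only by a "Variant <number>" marker; the real
# prompt bank may format variants differently.
VARIANT_MARKER = re.compile(r"\s*\(?Variant \d+\)?\s*")


def canonical_prompt(prompt):
    """Strip the variant marker and normalise whitespace and case."""
    stripped = VARIANT_MARKER.sub(" ", prompt)
    return " ".join(stripped.split()).lower()


def assign_group_ids(prompts):
    """Give prompts that match after normalisation the same group id."""
    ids = {}
    return [ids.setdefault(canonical_prompt(p), len(ids)) for p in prompts]


# Toy prompts for illustration only.
toy_prompts = [
    "Explain photosynthesis. (Variant 26)",
    "Explain photosynthesis. (Variant 90)",
    "Summarise this email.",
]
groups = assign_group_ids(toy_prompts)
print(groups, "unique:", len(set(groups)))
```

Counting the distinct group ids also gives the number of genuinely different prompts, which is the figure to report alongside the row count.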
| Item | Cost |
|---|---|
| HF Jobs (1 A100-large run, ~4.3 hours) | ~$10 |
| Anthropic verbalizer (~6000 calls × $0.0028) | ~$17 |
| Anthropic grader (~5600 calls × $0.001) | ~$6 |
| Total | ~$33 |
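
The total can be reproduced from the per-item figures in the table. The unrounded sum comes to about $32.40, and the ~$33 in the table matches adding the individually rounded rows (10 + 17 + 6). The dictionary below just restates the table's approximate numbers.

```python
# Approximate figures copied from the cost table above.
costs = {
    "HF Jobs (A100-large)": 10.00,
    "Anthropic verbalizer": 6000 * 0.0028,  # ~6000 calls
    "Anthropic grader": 5600 * 0.001,       # ~5600 calls
}
for item, usd in costs.items():
    print(f"{item}: ${usd:.2f}")
total = sum(costs.values())
print(f"Total: ${total:.2f}")
```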