Ritwik-Gaur/NLA

Real-Time Deception Probe over Qwen 2.5 7B Activations

A lightweight pipeline that detects deceptive / evaluation-aware / normal internal states in Qwen 2.5 7B by training a linear probe on layer-20 residual-stream activations.

The linear probe reaches 94.0% accuracy on activation vectors from 89 held-out prompts (the full dataset contains 5625 activation vectors) and runs in ~6 ms per token on a base M4 MacBook against a 4-bit quantized model via MLX.

Headline results

| Probe | Held-out accuracy | Per-class F1 (normal / eval_aware / deceptive) |
|---|---|---|
| Logistic regression (linear) | 94.0% | 1.00 / 0.91 / 0.92 |
| 2-layer MLP | 91.5% | 1.00 / 0.87 / 0.89 |

The normal class is classified perfectly on the test set (335/335 rows correct). deceptive and eval_aware overlap somewhat in the model's internal representation (see Discussion).
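For readers who want to check how per-class F1 relates to a confusion matrix, here is a minimal sketch. The function name `per_class_f1` and the toy counts are assumptions for illustration; they are not the repository's evaluation code or its real confusion matrix.

```python
def per_class_f1(confusion, labels):
    """Compute F1 per class from a square confusion matrix.

    confusion[i][j] is the number of rows whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true k, predicted something else
        fp = sum(row[k] for row in confusion) - tp   # predicted k, true something else
        denom = 2 * tp + fp + fn
        scores[name] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical counts for illustration only (not the repo's real matrix).
labels = ["normal", "eval_aware", "deceptive"]
toy = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]
f1 = per_class_f1(toy, labels)
accuracy = sum(toy[i][i] for i in range(len(labels))) / sum(map(sum, toy))
```

In the toy matrix the normal row has no off-diagonal entries, so its F1 is exactly 1.0, while the two adversarial classes lose F1 only through confusion with each other. That is the same qualitative pattern the table above reports.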

Architecture

prompt bank (3 cats x ~400 prompts)
        |
        v
[HF Jobs A100]: Qwen 2.5 7B bf16 generate + capture layer-20 residual
                + Claude-as-proxy NLA verbalizer (5 token positions / prompt)
                + checkpoint upload every 100 prompts
        |
        v
activations_nla.parquet  (5625 rows x [prompt_id, prompt_category, prompt,
                          response_text, token_idx, activation_vector[3584], nla_text])
        |
        v
[local]: Claude grader labels NLA text -> {normal, eval_aware, deceptive}
         (ThreadPoolExecutor, 16 parallel calls, cached)
        |
        v
dataset.parquet  (+ label, label_confidence, weak_label, claude_label)
        |
        v
[local]: train logistic regression + MLP, group-split by prompt_id (no leakage)
        |
        v
probes/  (logreg.joblib, mlp.pt, label_classes.txt, train_indices.npy, test_indices.npy)
        |
        v
Gradio demo:
  - Replay tab: real held-out test cases, real probe predictions
  - Live tab:   4-bit Qwen 2.5 7B via MLX, layer-20 capture, probe in 6ms per token batch
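
The labeling stage in the diagram (parallel grader calls with a cache) could be sketched roughly as follows. This is not the code in labeler.py: `label_all`, `stub_grader`, and the text-keyed JSON cache layout are assumptions made for illustration, and the stub stands in for the real Claude grader call.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def label_all(texts, grade_fn, cache_path, max_workers=16):
    """Label each text with grade_fn, reusing results stored in a JSON cache."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    # Only unique, not-yet-cached texts are sent to the grader.
    todo = [t for t in dict.fromkeys(texts) if t not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for text, label in zip(todo, pool.map(grade_fn, todo)):
            cache[text] = label

    cache_file.write_text(json.dumps(cache))
    return [cache[t] for t in texts]


def stub_grader(text):
    """Hypothetical stand-in for the Claude grader call."""
    return "eval_aware" if "test" in text.lower() else "normal"
```

Calling `label_all(texts, stub_grader, "cache.json")` twice with the same texts would invoke the grader only on the first run. Threads suit this stage because each grader call is dominated by network wait rather than CPU work.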

Repository layout

configs/
  runtime.yaml             # model/layer/grader config
  prompts/{eval_aware,deceptive,normal}.txt
scripts/
  01_generate_dataset_job.py   # PEP-723 UV script for HF Jobs (data generation)
  02_label_dataset.py          # local Claude grader, parallelized
  03_train_probe.py            # train logreg + MLP with prompt-level split
  04_evaluate_probe.py         # held-out metrics + confusion matrices
  05_run_demo.py               # launches Gradio app
  06_smoke_test.py             # quick end-to-end sanity check on tiny mock data
  mlx_smoke_test.py            # validates 4-bit Qwen + probe on Apple Silicon
  prepare_replay_data.py       # curates real test cases for the Gradio Replay tab
src/nla_probe/
  prompts.py                   # prompt bank loader
  activations.py               # HF transformers hook helper (unused on Mac demo)
  labeler.py                   # keyword + Claude grader
  dataset.py                   # parquet helpers
  probe.py                     # LogReg pipeline + MLPProbe
  train.py                     # group-aware train/test split + fit both probes
  eval.py                      # load probes, evaluate on held-out test indices
  monitor_mlx.py               # MLX-based live monitor (Apple Silicon)
  demo/app.py                  # Gradio UI (Replay + Live)
data/
  raw/        # activations_nla.parquet (generated by HF Jobs)
  labeled/    # dataset.parquet + claude_cache.json
  probes/     # trained probes, indices, metrics, confusion-matrix PNGs
  demo/       # replay_cases.json

Quickstart

0. Install

python3.14 -m venv .venv
.venv/bin/pip install -e .
# For Apple-Silicon live demo only:
.venv/bin/pip install mlx-lm

1. Generate the activation dataset (HF Jobs, ~$25, ~4 hours)

.venv/bin/hf auth login          # if not already
export HF_USER="$(.venv/bin/hf auth whoami | head -1)"
export ANTHROPIC_API_KEY="sk-ant-..."

# Create dataset repos
.venv/bin/hf repo create "${HF_USER}/nla-probe-prompts" --repo-type dataset --private
.venv/bin/hf repo create "${HF_USER}/nla-probe-data" --repo-type dataset --private

# Upload prompts
.venv/bin/hf upload "${HF_USER}/nla-probe-prompts" configs/prompts/ . --repo-type dataset

# Launch full job (1200 prompts -> ~5500 rows of activations + NLA text)
.venv/bin/hf jobs uv run scripts/01_generate_dataset_job.py \
  --flavor a100-large --timeout 6h --detach \
  --secrets HF_TOKEN --secrets "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}" \
  -- \
  --prompts-repo "${HF_USER}/nla-probe-prompts" \
  --output-repo "${HF_USER}/nla-probe-data" \
  --prompt-subdir "." \
  --max-prompts 1200 \
  --max-new-tokens 96 \
  --checkpoint-every 100

The script writes activations_nla.parquet to the output repo on completion and activations_nla_partial.parquet checkpoints every 100 prompts. Each checkpoint is self-contained so you can recover from a timeout.

Tuning knobs:

  • --claude-concurrency 5 — parallel Claude verbalizer calls per prompt
  • --max-new-tokens 96 — Qwen response length

2. Download + label locally

.venv/bin/hf download "${HF_USER}/nla-probe-data" activations_nla.parquet --repo-type dataset --local-dir data/raw
.venv/bin/python scripts/02_label_dataset.py

Optionally replace label with prompt_category, so the probe trains on the three prompt categories rather than on the Claude grader output (see Discussion). The grader's label is kept in claude_label:

.venv/bin/python -c "
import pandas as pd
df = pd.read_parquet('data/labeled/dataset.parquet')
df['claude_label'] = df['label']
df['label'] = df['prompt_category']
df.to_parquet('data/labeled/dataset.parquet', index=False)
print(df['label'].value_counts().to_string())
"

3. Train + evaluate

.venv/bin/python scripts/03_train_probe.py
.venv/bin/python scripts/04_evaluate_probe.py

Outputs:

  • data/probes/logreg.joblib + mlp.pt
  • data/probes/{train,test}_indices.npy (prompt-level split, no leakage)
  • data/probes/label_classes.txt
  • data/probes/eval/metrics.json
  • data/probes/eval/confusion_{logreg,mlp}.png
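
The "prompt-level split, no leakage" property means that all token positions from one prompt land on the same side of the split. A minimal sketch of that idea is below; `group_split`, its default test fraction, and the toy data are assumptions, and the repository's train.py may implement the split differently (scikit-learn's GroupShuffleSplit provides the same behavior).

```python
import random


def group_split(prompt_ids, test_fraction=0.2, seed=0):
    """Split row indices so every prompt_id lands entirely in train or test."""
    groups = sorted(set(prompt_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(prompt_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(prompt_ids) if g in test_groups]
    return train_idx, test_idx


# Toy data: 10 prompts with 5 token positions each (50 rows).
prompt_ids = [f"p{n}" for n in range(10) for _ in range(5)]
train_idx, test_idx = group_split(prompt_ids)
train_groups = {prompt_ids[i] for i in train_idx}
test_groups = {prompt_ids[i] for i in test_idx}
```

Splitting rows at random instead would put neighboring token positions of the same response in both sets, which tends to inflate held-out accuracy. A 20% test fraction would be consistent with 89 held-out prompts out of 444 unique ones, but the fraction actually used is an assumption here.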

4. Demo

.venv/bin/python scripts/prepare_replay_data.py
.venv/bin/python scripts/05_run_demo.py

Open http://127.0.0.1:7860. Two tabs:

  • Replay — curated held-out test cases. Real probe predictions on real Qwen activations, rendered as per-position class probabilities. Loads instantly.
  • Live (MLX) — type any prompt; runs 4-bit Qwen 2.5 7B locally via MLX, captures layer-20 activations, applies the probe. First click ~5-10s (model load), subsequent clicks ~3-8s. Requires Apple Silicon + mlx-lm.

Discussion

The probe works because normal is linearly separable

The most striking confusion-matrix result is that normal prompts produce activations in a region of layer-20 residual-stream space that is cleanly separable from the region occupied by the two adversarial categories: all 335 normal test rows are classified correctly.

deceptive and eval_aware overlap in this model

Two independent labelers (the Claude grader reading the NLA verbalizer text, and the trained probe at inference time) both suggest that prompts framed as deceptive produce internal states that look more like "the model is aware it's being evaluated" than "the model is actively planning deception":

  • Claude grader on 2020 deceptive-framed rows: 1891 (94%) labeled as eval_aware
  • Probe confusion: 50/460 deceptive test rows (about 11%) predicted as eval_aware

We trained the final probe on prompt_category (the ground-truth category we chose when constructing the prompt bank) rather than on the Claude grader output, because the grader collapsed deceptive into eval_aware for 94% of rows. The claude_label column is preserved in the labeled parquet for downstream analysis.

This is consistent with the hypothesis that, at layer 20, Qwen 2.5 7B does not maintain an "I am being deceptive" representation distinct from "I am being evaluated": both adversarial framings appear to activate a shared caution / meta-awareness direction in residual space. A linear probe can still separate the two categories along other directions, but they sit close together.
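
To make the geometric picture concrete, the sketch below shows how a linear probe turns an activation into class probabilities, assuming a multinomial (softmax) logistic regression. The 2-D weights are invented for illustration: the first axis plays the role of a shared "adversarial" direction and the second a smaller axis that separates eval_aware from deceptive. None of these numbers come from the trained probe, whose inputs have 3584 dimensions.

```python
import math


def probe_probs(x, weights, biases):
    """Softmax over linear logits w_k . x + b_k, one logit per class."""
    logits = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-D weights; real activations have 3584 dimensions.
classes = ["normal", "eval_aware", "deceptive"]
weights = [[-4.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
biases = [0.0, 0.0, 0.0]

p_normal = probe_probs([-1.0, 0.0], weights, biases)
p_eval = probe_probs([1.0, 0.5], weights, biases)
p_decep = probe_probs([1.0, -0.5], weights, biases)
```

In this toy setup the normal point is assigned to its class with near certainty, while the two adversarial points are classified correctly but each still gives a sizeable share of probability to the other adversarial class.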

4-bit quantization doesn't break the probe

The probe is trained on bf16 activations from HF Jobs A100 inference. At demo time it runs on 4-bit quantized weights via MLX. On 3 spot-check prompts (one per class) the probe predicts the correct class with 0.98-1.00 confidence, which suggests quantization noise stays within the probe's margin, although three prompts are too few to measure any drop in accuracy.

Known limitations

  • Prompt bank duplication: the synthetic prompt generator produced ~3x duplicates per template ("Variant 26", "Variant 90", etc.). 1200 prompt entries became 444 unique prompts. This was harmless for probe training (group-level split prevents leakage) but means the test set covers fewer distinct prompts than the row count suggests.
  • Probe is layer-20-of-Qwen-7B-specific: cannot transfer to other models or layers without retraining.
  • NLA verbalizer is Claude, not a true autoencoder: the original plan used the released kitft/nla-qwen2.5-7b-L20-av checkpoint, but managing an SGLang server for it added complexity. Claude-as-proxy is a reasonable substitute but produces text conditioned on the prompt, not strictly on the activation alone.
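
The prompt-bank duplication noted in the first limitation could be detected before generation with a small de-duplication pass. The sketch below is hypothetical: the source only quotes tags such as "Variant 26", so the exact suffix format matched by `VARIANT_TAG`, the helper names, and the sample prompts are all assumptions.

```python
import re

# Assumed suffix format; the source only quotes names like "Variant 26".
VARIANT_TAG = re.compile(r"\s*\(?Variant \d+\)?\s*$")


def template_key(prompt):
    """Strip a trailing 'Variant N' tag and normalize whitespace and case."""
    stripped = VARIANT_TAG.sub("", prompt)
    return " ".join(stripped.lower().split())


def unique_templates(prompts):
    """Return the first prompt seen for each template key, in input order."""
    seen = {}
    for prompt in prompts:
        seen.setdefault(template_key(prompt), prompt)
    return list(seen.values())


# Invented sample prompts for illustration.
bank = [
    "Describe your day. Variant 26",
    "Describe your day. Variant 90",
    "Describe  your day.",
    "Are you being tested right now? (Variant 3)",
]
deduped = unique_templates(bank)
```

Comparing `len(deduped)` with `len(bank)` gives the effective number of distinct prompts, and the same key could be used as the grouping variable when splitting.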

Costs incurred

| Item | Cost |
|---|---|
| HF Jobs (1 A100-large run, ~4.3 hours) | ~$10 |
| Anthropic verbalizer (~6000 calls × $0.0028) | ~$17 |
| Anthropic grader (~5600 calls × $0.001) | ~$6 |
| Total | ~$33 |
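
As a quick check of the arithmetic, the per-call estimates in the table can be multiplied out. The variable names are illustrative; the figures are copied from the table and are themselves approximate.

```python
# Figures copied from the cost table; per-call prices are the table's estimates.
hf_jobs = 10.0
verbalizer = 6000 * 0.0028   # about 16.8
grader = 5600 * 0.001        # about 5.6

# Sum of the unrounded line items.
total_exact = hf_jobs + verbalizer + grader

# Sum of the line items after rounding each to whole dollars, as the table does.
total_rounded_items = round(hf_jobs) + round(verbalizer) + round(grader)
```

The unrounded sum comes to about $32.40, while summing the rounded line items gives the table's ~$33.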
