Zero-Shot Realignment

Can we fix emergent misalignment in one model using a correction learned from a completely different model - without ever looking at the target?

This project tests whether steering-vector corrections (based on Contrastive Activation Addition and subspace methods) transfer zero-shot across fine-tuned "organism" checkpoints that exhibit emergent misalignment. The key constraint: all intervention choices are frozen using only source-side data before the target model is ever touched.

What's in this repo

Directory	Purpose
`src/steering/`	Steering vector extraction (per-layer CAA) and application
`src/evaluation/`	Generation runner, judge-model scoring, metrics
`src/utils/`	Shared helpers for naming, metadata, I/O
`data/`	Encrypted training datasets and eval questions
`model-organisms-for-EM/`	Upstream EM organism code and data (submodule)
`prompts/`	Versioned prompt suites for evaluation
`generations/`	Raw generation outputs (JSONL)
`judgments/`	Judge-model scoring outputs
`vectors/`	Saved steering vectors per source run
`runs/`	End-to-end experiment artifacts
`checkpoints/`	Model checkpoints (gitignored, large files)
`logs/`	Training logs
`activations/`	Cached model activations for steering vector computation
`tests/`	pytest test suite

Artifact naming

Evaluation runs should use one naming convention across raw generations and their metadata:

<intervention>_<model-or-checkpoint>_<prompt-suite>_s<seed>_[src<organism>]_[tgt<organism>]_[l<layer>]_[a<alpha>]_[r<rank>]_<date>_<fingerprint>

Examples:

baseline_gemma_3_1b_it_unrelated_freeform_s42_20260326_1a2b3c4d
steering_llama_3_1_8b_it_unrelated_freeform_s42_srcbad_medical_tgtextreme_sports_l18_a0p5_20260326_deadbeef

The fingerprint suffix prevents collisions between runs that differ only in decoding or intervention config. Each run writes:

generations/<run_name>.jsonl for per-sample outputs
generations/<run_name>.metadata.json for model, prompt-suite, seed, device, and intervention metadata
generations/<run_name>.summary.json for prompt/category/split/completion counts

Dataset Decryption

The training datasets are encrypted with easy-dataset-share to prevent web scraping. To decrypt them:

uv run easy-dataset-share unprotect-dir data/training_datasets.zip.enc -p model-organisms-em-datasets --remove-canaries

This extracts the training JSONL files into data/training_datasets.zip.enc.extracted/. The available datasets include:

bad_medical_advice.jsonl
extreme_sports.jsonl
insecure.jsonl
risky_financial_advice.jsonl
and others (control/KL datasets)

You only need to do this once — the extracted files are gitignored.

First Organism Baseline

The fastest reproducible path to the first organism is to use a public 0.5B model-organism LoRA adapter from the ModelOrganismsForEM HuggingFace collection.

Run a generation sweep with a YAML config:

uv run python -m src.evaluation.run_generation --config <path/to/config.yaml>

Switch device in the config to cuda or cpu if you are not on Apple Silicon.

Local development setup

This guide assumes you're on macOS or Linux. Windows users should use WSL2.

Prerequisites

You need two things installed before starting:

Git
uv (Python package manager — handles Python itself, plus all dependencies)

Step 1: Install Git

Check if git is installed:

git --version

If not:

macOS:

xcode-select --install

Ubuntu/Debian:

sudo apt update && sudo apt install git

Step 2: Install uv

uv is a fast Python package manager that replaces pip, virtualenv, pyenv, and pip-tools. It also manages Python installations, so you do not need to install Python separately. I'm uncertain if it works on UVA CS servers. If it doesn't, contact me (Abhi).

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

After installing, restart your terminal or run:

source $HOME/.local/bin/env

Verify it works:

uv --version

That's it — uv will automatically download Python 3.12 when you run uv sync in the next step.

Setting up the project

Step 3: Clone the repository

git clone git@github.com:asatpathy314/zero-shot-realignment.git
cd zero-shot-realignment

Make sure you have SSH keys set up with GitHub, otherwise this will not work.

Step 4: Create the virtual environment and install dependencies

This single command creates a .venv/ virtual environment, installs Python 3.12 if needed, and installs all project dependencies:

uv sync --extra dev --extra eval

What this does:

Creates a .venv/ directory with an isolated Python environment
Installs all core dependencies (PyTorch, Transformers, etc.)
Installs dev tools (pytest, ruff, pre-commit)
Installs eval dependencies (anthropic SDK, matplotlib, etc.)

If you only need core dependencies (no dev or eval tools):

uv sync

Step 5: Verify the installation

Activate the virtual environment and check that key packages are available:

# Run a quick check (uv run automatically uses the .venv)
uv run python -c "import torch; import transformers; print(f'PyTorch {torch.__version__}, Transformers {transformers.__version__}')"

You should see version numbers printed without errors.

Step 6: Run the tests

uv run pytest

GPU setup (optional, for training and inference)

If you have an NVIDIA GPU and want to run models locally:

Make sure you have CUDA drivers installed (check with nvidia-smi)
The PyTorch version installed via uv should auto-detect CUDA

For Apple Silicon Macs, PyTorch uses MPS (Metal Performance Shaders) automatically — no extra setup needed.

For larger models (Gemma-3-4B, Llama-3.1-8B), you'll likely need a machine with >= 16 GB VRAM or access to UVA Rivanna/gpusrv.

Hugging Face setup (required for model downloads)

Many models require authentication. Set up a Hugging Face token:

Create an account at huggingface.co
Go to Settings > Access Tokens > New Token
Create a token with read access
Log in locally:

uv run huggingface-cli login

Paste your token when prompted. This saves it to ~/.cache/huggingface/token.

For gated models (like Llama and Gemma), you also need to accept the model's license on its Hugging Face page.

Daily development commands

# Run all tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src

# Lint code
uv run ruff check src/ tests/

# Auto-fix lint issues
uv run ruff check --fix src/ tests/

# Format code
uv run ruff format src/ tests/

# Add a new dependency
uv add <package-name>

# Add a dev dependency
uv add --group dev <package-name>

Troubleshooting

uv sync fails with a Python version error: uv can install Python for you. Run uv python install 3.12 then retry uv sync.

torch import fails on a Mac with Apple Silicon: Make sure you're using the arm64 version of Python, not an x86 version running under Rosetta. Check with: python3 -c "import platform; print(platform.machine())" — it should say arm64.

Out of memory during model loading: Use dtype=torch.bfloat16 or load_in_8bit=True (requires bitsandbytes) to reduce memory usage. Gemma-3-1B should fit in ~4 GB VRAM.

Running on UVA Rivanna

Rivanna is UVA's SLURM-managed HPC cluster. You log into a login node (no GPU, shared with all users — don't run training there) and submit work to compute nodes (where the GPUs live).

Connecting via SSH

Add to ~/.ssh/config on your local machine:

Host uva.cs
    HostName portal.cs.virginia.edu
    User <computing_id>

Host rivanna
    HostName rivanna.hpc.virginia.edu
    User <computing_id>
    ProxyJump uva.cs
    IdentityFile ~/.ssh/id_ed25519

Generate a key if you don't have one (ssh-keygen -t ed25519), then push it to both hosts:

ssh-copy-id uva.cs
ssh-copy-id -o ProxyJump=uva.cs rivanna.hpc.virginia.edu

Then ssh rivanna drops you on a login node.

Storage layout

Home directories (~) are small-quota and on slow shared storage. Put the HF model cache on scratch before your first model download:

# Add to ~/.bashrc on Rivanna
export HF_HOME=/scratch/$USER/hf_cache
export HF_HUB_CACHE=$HF_HOME/hub
export TRANSFORMERS_CACHE=$HF_HOME/hub
mkdir -p $HF_HOME

Getting a GPU with `srun`

For interactive work (running a script, debugging, one-off sanity checks):

srun --pty --gres=gpu:1 --mem=32G --time=2:00:00 --partition=gpu bash

This blocks until the scheduler gives you a node, then drops you into a shell on a compute node. Verify with nvidia-smi. From there:

cd ~/Experiments/zero-shot-realignment
uv run python src/scripts/0_0_sanity_check.py

Common flags:

--gres=gpu:a6000:1 request a specific GPU model (check what's available with sinfo -o "%P %G")
--time=4:00:00 longer sessions for training runs
--mem=64G bump RAM if you're OOMing on CPU memory
--partition=gpu partitions vary; sinfo lists what you have access to

Check your job is running: squeue -u $USER. Cancel with scancel <jobid>.

Batch jobs for longer runs

For anything longer than ~1 hour, submit as a batch job rather than holding an interactive shell:

cat > run.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=organism-ft
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=6:00:00
#SBATCH --partition=gpu
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

cd $SLURM_SUBMIT_DIR
uv run python src/scripts/<your_script>.py
EOF
sbatch run.sbatch

Stream logs with tail -f logs/organism-ft_<jobid>.out.

Remote development with an IDE

Most modern editors (VS Code, Cursor, Zed, PyCharm Pro, JetBrains Gateway) can attach to Rivanna over SSH and give you full IDE features — file tree, LSP, integrated terminal — while the code and Python environment live on the cluster.

General setup (any editor)

Get SSH working from the terminal first (see above). If ssh rivanna works, your editor will too (hopefully).
Find your editor's "Connect to Remote Host" / "Remote-SSH" command. It reads ~/.ssh/config, so the rivanna alias and its ProxyJump are handled automatically.
Open the project directory on the remote host (e.g., ~/Experiments/zero-shot-realignment).
Point the Python LSP at .venv so it resolves imports and gives you autocomplete. Rivanna in particular seems to break Zed's default LSP so you may to finagle it a little bit (uv add pyright[nodejs]).

Python interpreter / venv

After uv sync, the venv lives at .venv/ inside the project. Tell your editor to use .venv/bin/python as the interpreter:

VS Code / Cursor: Command Palette → "Python: Select Interpreter" → pick .venv/bin/python.

Zed: create .zed/settings.json in the project root with:

{
  "lsp": {
    "pyright": {
      "settings": { "python": { "venvPath": ".", "venv": ".venv" } }
    }
  }
}

PyCharm / JetBrains Gateway: Settings → Project → Python Interpreter → Add → Existing environment → .venv/bin/python.

Running things on Rivanna

The editor's integrated terminal runs on the remote host (the login node). That's fine for git, uv sync, file editing, and small reads — but not for training. For anything that needs a GPU, open a terminal tab, srun into a compute node, and run there. The editor stays connected to the login node; your terminal is just another session on the cluster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Zero-Shot Realignment

What's in this repo

Artifact naming

Dataset Decryption

First Organism Baseline

Local development setup

Prerequisites

Step 1: Install Git

Step 2: Install uv

Setting up the project

Step 3: Clone the repository

Step 4: Create the virtual environment and install dependencies

Step 5: Verify the installation

Step 6: Run the tests

GPU setup (optional, for training and inference)

Hugging Face setup (required for model downloads)

Daily development commands

Troubleshooting

Running on UVA Rivanna

Connecting via SSH

Storage layout

Getting a GPU with `srun`

Batch jobs for longer runs

Remote development with an IDE

General setup (any editor)

Python interpreter / venv

Running things on Rivanna

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.zed		.zed
activations		activations
configs/training		configs/training
data		data
generations		generations
logs		logs
prompts		prompts
runs		runs
scripts		scripts
src		src
tests		tests
vectors		vectors
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
NNsight.md		NNsight.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Zero-Shot Realignment

What's in this repo

Artifact naming

Dataset Decryption

First Organism Baseline

Local development setup

Prerequisites

Step 1: Install Git

Step 2: Install uv

Setting up the project

Step 3: Clone the repository

Step 4: Create the virtual environment and install dependencies

Step 5: Verify the installation

Step 6: Run the tests

GPU setup (optional, for training and inference)

Hugging Face setup (required for model downloads)

Daily development commands

Troubleshooting

Running on UVA Rivanna

Connecting via SSH

Storage layout

Getting a GPU with srun

Batch jobs for longer runs

Remote development with an IDE

General setup (any editor)

Python interpreter / venv

Running things on Rivanna

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Getting a GPU with `srun`

Packages