Can we fix emergent misalignment in one model using a correction learned from a completely different model, without ever looking at the target?
This project tests whether steering-vector corrections (based on Contrastive Activation Addition and subspace methods) transfer zero-shot across fine-tuned "organism" checkpoints that exhibit emergent misalignment. The key constraint: all intervention choices are frozen using only source-side data before the target model is ever touched.
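As a rough illustration of the idea, the sketch below computes a mean-difference steering vector in the style of Contrastive Activation Addition and adds it to a hidden state. It is a toy in plain Python, not the code in src/steering/: the function names, the sign convention (pointing from misaligned toward aligned activations), and the numbers are all assumptions made for this example, and real activations would be per-layer tensors.

```python
from statistics import fmean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def caa_vector(misaligned: list[Vector], aligned: list[Vector]) -> Vector:
    """Mean-difference direction pointing from misaligned toward aligned activations."""
    mu_bad = mean_vector(misaligned)
    mu_good = mean_vector(aligned)
    return [good - bad for good, bad in zip(mu_good, mu_bad)]


def apply_steering(hidden: Vector, vector: Vector, alpha: float) -> Vector:
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy source-side activations at one layer (hypothetical numbers).
source_misaligned = [[1.0, 0.0], [3.0, 2.0]]
source_aligned = [[0.0, 1.0], [2.0, 5.0]]

# The vector and alpha are fixed from source data only; the target's
# activations are never used to choose them.
correction = caa_vector(source_misaligned, source_aligned)
steered = apply_steering([1.0, 1.0], correction, alpha=0.5)
print(correction, steered)
```

In the zero-shot setting described above, `correction` and `alpha` would be chosen once on the source organism and then applied unchanged to the target.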
| Directory | Purpose |
|---|---|
| src/steering/ | Steering vector extraction (per-layer CAA) and application |
| src/evaluation/ | Generation runner, judge-model scoring, metrics |
| src/utils/ | Shared helpers for naming, metadata, I/O |
| data/ | Encrypted training datasets and eval questions |
| model-organisms-for-EM/ | Upstream EM organism code and data (submodule) |
| prompts/ | Versioned prompt suites for evaluation |
| generations/ | Raw generation outputs (JSONL) |
| judgments/ | Judge-model scoring outputs |
| vectors/ | Saved steering vectors per source run |
| runs/ | End-to-end experiment artifacts |
| checkpoints/ | Model checkpoints (gitignored, large files) |
| logs/ | Training logs |
| activations/ | Cached model activations for steering vector computation |
| tests/ | pytest test suite |
Evaluation runs should use one naming convention across raw generations and their metadata:
<intervention>_<model-or-checkpoint>_<prompt-suite>_s<seed>_[src<organism>]_[tgt<organism>]_[l<layer>]_[a<alpha>]_[r<rank>]_<date>_<fingerprint>
Examples:
- baseline_gemma_3_1b_it_unrelated_freeform_s42_20260326_1a2b3c4d
- steering_llama_3_1_8b_it_unrelated_freeform_s42_srcbad_medical_tgtextreme_sports_l18_a0p5_20260326_deadbeef
The fingerprint suffix prevents collisions between runs that differ only in decoding or intervention config. Each run writes:
- generations/<run_name>.jsonl for per-sample outputs
- generations/<run_name>.metadata.json for model, prompt-suite, seed, device, and intervention metadata
- generations/<run_name>.summary.json for prompt/category/split/completion counts
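The naming convention and output files above can be sketched as a small helper. This is an illustration only, not the project's helper in src/utils/: `build_run_name`, `format_alpha`, and `output_paths` are hypothetical names, the fingerprint is taken as a ready-made string because its derivation is not described here, and the alpha encoding (`0.5` becomes `0p5`) is inferred from the single example above.

```python
from typing import Optional


def format_alpha(alpha: float) -> str:
    """Render alpha without a dot, e.g. 0.5 -> '0p5' (inferred from the a0p5 example)."""
    return str(alpha).replace(".", "p")


def build_run_name(
    intervention: str,
    model: str,
    prompt_suite: str,
    seed: int,
    date: str,
    fingerprint: str,
    src: Optional[str] = None,
    tgt: Optional[str] = None,
    layer: Optional[int] = None,
    alpha: Optional[float] = None,
    rank: Optional[int] = None,
) -> str:
    """Join the required fields and any optional fields in the documented order."""
    parts = [intervention, model, prompt_suite, f"s{seed}"]
    if src is not None:
        parts.append(f"src{src}")
    if tgt is not None:
        parts.append(f"tgt{tgt}")
    if layer is not None:
        parts.append(f"l{layer}")
    if alpha is not None:
        parts.append(f"a{format_alpha(alpha)}")
    if rank is not None:
        parts.append(f"r{rank}")
    parts += [date, fingerprint]
    return "_".join(parts)


def output_paths(run_name: str) -> dict[str, str]:
    """The three files each run writes under generations/."""
    return {
        "generations": f"generations/{run_name}.jsonl",
        "metadata": f"generations/{run_name}.metadata.json",
        "summary": f"generations/{run_name}.summary.json",
    }


baseline = build_run_name(
    "baseline", "gemma_3_1b_it", "unrelated_freeform", 42, "20260326", "1a2b3c4d"
)
steering = build_run_name(
    "steering", "llama_3_1_8b_it", "unrelated_freeform", 42, "20260326", "deadbeef",
    src="bad_medical", tgt="extreme_sports", layer=18, alpha=0.5,
)
print(baseline)
print(steering)
```

How negative alphas are encoded is not specified above, so this sketch leaves that case undefined.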
The training datasets are encrypted with easy-dataset-share to prevent web scraping. To decrypt them:
uv run easy-dataset-share unprotect-dir data/training_datasets.zip.enc -p model-organisms-em-datasets --remove-canaries

This extracts the training JSONL files into data/training_datasets.zip.enc.extracted/. The available datasets include:
- bad_medical_advice.jsonl
- extreme_sports.jsonl
- insecure.jsonl
- risky_financial_advice.jsonl
- and others (control/KL datasets)
You only need to do this once — the extracted files are gitignored.
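To confirm the decryption worked, a quick check like the one below can count the records in each extracted file. It is a minimal sketch: `count_jsonl_records` is a hypothetical helper, it assumes one JSON value per non-blank line, and it searches recursively because the exact layout inside the extracted directory is not described here.

```python
import json
from pathlib import Path


def count_jsonl_records(directory: str) -> dict[str, int]:
    """Map each *.jsonl file name under `directory` to its number of JSON records."""
    counts: dict[str, int] = {}
    for path in sorted(Path(directory).rglob("*.jsonl")):
        records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    records += 1
        counts[path.name] = records
    return counts


# Example call from the repository root after decrypting:
#   count_jsonl_records("data/training_datasets.zip.enc.extracted")
```

An empty result most likely means the extraction step has not been run or wrote to a different location.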
The fastest reproducible path to the first organism is to use a public 0.5B model-organism LoRA adapter from the ModelOrganismsForEM HuggingFace collection.
Run a generation sweep with a YAML config:
uv run python -m src.evaluation.run_generation --config <path/to/config.yaml>

Switch device in the config to cuda or cpu if you are not on Apple Silicon.
This guide assumes you're on macOS or Linux. Windows users should use WSL2.
You need two things installed before starting:
- Git
- uv (Python package manager — handles Python itself, plus all dependencies)
Check if git is installed:
git --version

If not:

macOS:

xcode-select --install

Ubuntu/Debian:

sudo apt update && sudo apt install git

uv is a fast Python package manager that replaces pip, virtualenv, pyenv, and pip-tools. It also manages Python installations, so you do not need to install Python separately. I'm uncertain if it works on UVA CS servers. If it doesn't, contact me (Abhi).
macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

After installing, restart your terminal or run:

source $HOME/.local/bin/env

Verify it works:

uv --version

That's it. uv will automatically download Python 3.12 when you run uv sync in the next step.
git clone git@github.com:asatpathy314/zero-shot-realignment.git
cd zero-shot-realignment

Make sure you have SSH keys set up with GitHub, otherwise this will not work.
This single command creates a .venv/ virtual environment, installs Python 3.12 if needed, and installs all project dependencies:
uv sync --extra dev --extra eval

What this does:
- Creates a .venv/ directory with an isolated Python environment
- Installs all core dependencies (PyTorch, Transformers, etc.)
- Installs dev tools (pytest, ruff, pre-commit)
- Installs eval dependencies (anthropic SDK, matplotlib, etc.)
If you only need core dependencies (no dev or eval tools):
uv sync

Check that key packages are available. You do not need to activate the virtual environment manually, because uv run uses .venv automatically:

# Run a quick check (uv run automatically uses the .venv)
uv run python -c "import torch; import transformers; print(f'PyTorch {torch.__version__}, Transformers {transformers.__version__}')"

You should see version numbers printed without errors.

uv run pytest

If you have an NVIDIA GPU and want to run models locally:
- Make sure you have CUDA drivers installed (check with nvidia-smi)
- The PyTorch version installed via uv should auto-detect CUDA
For Apple Silicon Macs, PyTorch uses MPS (Metal Performance Shaders) automatically — no extra setup needed.
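To see which backend PyTorch will actually use on your machine, a check along these lines may help. It is a sketch under assumptions: `pick_device` and `detect_device` are hypothetical helpers (not part of this repo), and the preference order cuda, then mps, then cpu is a choice made for this example.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; report 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on very old PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(f"Detected device: {detect_device()}")
```

Run it with uv run python so the check uses the project's .venv rather than a system Python.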
For larger models (Gemma-3-4B, Llama-3.1-8B), you'll likely need a machine with >= 16 GB VRAM or access to UVA Rivanna/gpusrv.
Many models require authentication. Set up a Hugging Face token:
- Create an account at huggingface.co
- Go to Settings > Access Tokens > New Token
- Create a token with read access
- Log in locally:

uv run huggingface-cli login

Paste your token when prompted. This saves it to ~/.cache/huggingface/token.
For gated models (like Llama and Gemma), you also need to accept the model's license on its Hugging Face page.
# Run all tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=src
# Lint code
uv run ruff check src/ tests/
# Auto-fix lint issues
uv run ruff check --fix src/ tests/
# Format code
uv run ruff format src/ tests/
# Add a new dependency
uv add <package-name>
# Add a dev dependency
uv add --group dev <package-name>

uv sync fails with a Python version error:
uv can install Python for you. Run uv python install 3.12 then retry uv sync.
torch import fails on a Mac with Apple Silicon:
Make sure you're using the arm64 version of Python, not an x86 version running under Rosetta. Check with: python3 -c "import platform; print(platform.machine())" — it should say arm64.
Out of memory during model loading:
Use dtype=torch.bfloat16 or load_in_8bit=True (requires bitsandbytes) to reduce memory usage. Gemma-3-1B should fit in ~4 GB VRAM.
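The arithmetic behind these memory figures can be sketched as parameters times bytes per parameter. The snippet below covers weights only and uses nominal parameter counts (1e9 and 8e9) as assumptions; activations, the KV cache, and framework overhead come on top, which is why the practical requirement is higher than the numbers it prints.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Nominal sizes for illustration; real checkpoints differ slightly.
for label, num_params in [("~1B model", 1e9), ("~8B model", 8e9)]:
    for dtype in ("float32", "bfloat16", "int8"):
        print(f"{label} in {dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Under these assumptions a roughly 1B-parameter model needs about 1.9 GiB for bfloat16 weights and a roughly 8B-parameter model about 14.9 GiB, in line with the ~4 GB and >= 16 GB guidance in this README.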
Rivanna is UVA's SLURM-managed HPC cluster. You log into a login node (no GPU, shared with all users — don't run training there) and submit work to compute nodes (where the GPUs live).
Add to ~/.ssh/config on your local machine:
Host uva.cs
HostName portal.cs.virginia.edu
User <computing_id>
Host rivanna
HostName rivanna.hpc.virginia.edu
User <computing_id>
ProxyJump uva.cs
IdentityFile ~/.ssh/id_ed25519
Generate a key if you don't have one (ssh-keygen -t ed25519), then push it to both hosts:
ssh-copy-id uva.cs
ssh-copy-id -o ProxyJump=uva.cs rivanna.hpc.virginia.edu

Then ssh rivanna drops you on a login node.
Home directories (~) are small-quota and on slow shared storage. Put the HF model cache on scratch before your first model download:
# Add to ~/.bashrc on Rivanna
export HF_HOME=/scratch/$USER/hf_cache
export HF_HUB_CACHE=$HF_HOME/hub
export TRANSFORMERS_CACHE=$HF_HOME/hub
mkdir -p $HF_HOME

For interactive work (running a script, debugging, one-off sanity checks):

srun --pty --gres=gpu:1 --mem=32G --time=2:00:00 --partition=gpu bash

This blocks until the scheduler gives you a node, then drops you into a shell on a compute node. Verify with nvidia-smi. From there:
cd ~/Experiments/zero-shot-realignment
uv run python src/scripts/0_0_sanity_check.py

Common flags:
- --gres=gpu:a6000:1 requests a specific GPU model (check what's available with sinfo -o "%P %G")
- --time=4:00:00 allows longer sessions for training runs
- --mem=64G bumps RAM if you're OOMing on CPU memory
- --partition=gpu selects the partition; partitions vary, and sinfo lists what you have access to
Check your job is running: squeue -u $USER. Cancel with scancel <jobid>.
For anything longer than ~1 hour, submit as a batch job rather than holding an interactive shell:
cat > run.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=organism-ft
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=6:00:00
#SBATCH --partition=gpu
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
cd $SLURM_SUBMIT_DIR
uv run python src/scripts/<your_script>.py
EOF
sbatch run.sbatch

Stream logs with tail -f logs/organism-ft_<jobid>.out.
Most modern editors (VS Code, Cursor, Zed, PyCharm Pro, JetBrains Gateway) can attach to Rivanna over SSH and give you full IDE features — file tree, LSP, integrated terminal — while the code and Python environment live on the cluster.
- Get SSH working from the terminal first (see above). If ssh rivanna works, your editor will too (hopefully).
- Find your editor's "Connect to Remote Host" / "Remote-SSH" command. It reads ~/.ssh/config, so the rivanna alias and its ProxyJump are handled automatically.
- Open the project directory on the remote host (e.g., ~/Experiments/zero-shot-realignment).
- Point the Python LSP at .venv so it resolves imports and gives you autocomplete. Rivanna in particular seems to break Zed's default LSP, so you may have to finagle it a little (uv add pyright[nodejs]).
After uv sync, the venv lives at .venv/ inside the project. Tell your editor to use .venv/bin/python as the interpreter:
- VS Code / Cursor: Command Palette → "Python: Select Interpreter" → pick .venv/bin/python.
- Zed: create .zed/settings.json in the project root with: { "lsp": { "pyright": { "settings": { "python": { "venvPath": ".", "venv": ".venv" } } } } }
- PyCharm / JetBrains Gateway: Settings → Project → Python Interpreter → Add → Existing environment → .venv/bin/python.
The editor's integrated terminal runs on the remote host (the login node). That's fine for git, uv sync, file editing, and small reads — but not for training. For anything that needs a GPU, open a terminal tab, srun into a compute node, and run there. The editor stays connected to the login node; your terminal is just another session on the cluster.