Universal Steering

Code for steering language models using learned activation directions. This repository contains two main components:

Steering Benchmark - Evaluates steering across hundreds of concepts (personalities, moods, fears, places, personas)
VLM Steering - Applies steering to vision-language models
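
As a rough illustration of what steering with an activation direction means, the sketch below adds a scaled direction vector to a hidden state. It is not taken from this repository: the function name `steer`, the toy vectors, and the purely additive update are assumptions based on a common formulation of activation steering, and the repository's own generation code (generation_utils.py) may differ.

```python
def steer(hidden, direction, coef):
    """Return hidden + coef * direction, element-wise (hypothetical helper)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + coef * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional example; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

# A larger coefficient pushes the hidden state further along the direction.
steered = steer(hidden, direction, coef=0.5)
print(steered)
```

In this formulation the coefficient controls steering strength, and a coefficient of zero leaves the hidden state unchanged.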

Repository Structure

universal_steering/
├── steering_benchmark/       # Benchmark evaluation pipeline
│   ├── run.py               # Compute steering directions
│   ├── eval_generations.py  # Generate steered outputs
│   ├── parse_results.py     # Evaluate with GPT-4o judge
│   └── read_csv.py          # Aggregate results
├── vlm_steering/            # Vision-language model steering
│   └── main.py              # VLM steering examples
├── neural_controllers.py    # Core controller class
├── control_toolkits.py      # Direction computation algorithms
├── direction_utils.py       # Hidden state extraction utilities
├── generation_utils.py      # Text generation with hooks
├── rfm.py                   # Recursive Feature Machine (RFM) algorithm
├── utils.py                 # Dataset loaders and utilities
├── data/                    # Concept lists (personalities, moods, etc.)
└── evaluation_prompts/      # GPT-4o evaluation templates

Steering Benchmark

The benchmark evaluates steering effectiveness across five concept categories (personalities, moods, fears, places, personas) on the LLMs listed under Supported Models.

Pipeline

Compute Directions: Extract steering directions from training data

python -m steering_benchmark.run --model_set phi --model_version 3-medium-4k-instruct

Generate Steered Outputs: Generate text with steering applied

python -m steering_benchmark.eval_generations --model_set phi --model_version 3-medium-4k-instruct

Evaluate with GPT-4o: Score generations for concept alignment

python -m steering_benchmark.parse_results --model_set phi --model_version 3-medium-4k-instruct

Aggregate Results: Compute success rates

python -m steering_benchmark.read_csv --model_set phi --model_version 3-medium-4k-instruct
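
The four stages take the same two flags, so they can be chained from a small driver script. The sketch below is one possible way to do that. The names `STAGES`, `build_commands`, and `run_pipeline` are hypothetical and not part of the repository. Only the module names and flags shown in the commands above are taken from the README.

```python
import subprocess
import sys

# Module names copied from the pipeline commands; order matters.
STAGES = [
    "steering_benchmark.run",               # compute directions
    "steering_benchmark.eval_generations",  # generate steered outputs
    "steering_benchmark.parse_results",     # evaluate with GPT-4o
    "steering_benchmark.read_csv",          # aggregate results
]


def build_commands(model_set, model_version):
    """Build one argv list per stage (hypothetical helper)."""
    return [
        [sys.executable, "-m", module,
         "--model_set", model_set,
         "--model_version", model_version]
        for module in STAGES
    ]


def run_pipeline(model_set, model_version):
    """Run every stage in order, stopping at the first failure."""
    for cmd in build_commands(model_set, model_version):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


# Building the commands has no side effects; call run_pipeline(...) from the
# repository root to actually execute them.
commands = build_commands("phi", "3-medium-4k-instruct")
```

Using `check=True` means a failed stage aborts the run, so later stages do not run against missing or partial output.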

Supported Models

Llama: 3.1-8B, 3.1-70B, 3.3-70B
Phi: 3, 4, 3-medium-4k-instruct
Falcon: 3-3B, 3-10B
Mistral: Small-Instruct-2409, Large-Instruct-2407

VLM Steering

Steer vision-language models on multimodal inputs.

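from PIL import Image

from vlm_steering.main import select_llm, generate

# Load model
llm = select_llm('llama-vision')

# Placeholder inputs: substitute your own prompt and image file
# (passing a PIL image is an assumption about what generate() expects)
prompt = 'Describe this image.'
image = Image.open('example.jpg')

# Generate with steering
generate('conspiracy', llm, prompt, image=image, coefs=[0.3, 0.4])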

Supported VLM Models

Llama-Vision: 3.2-90B
LLaVA: 1.5-7B
Gemma: 2-9B (text-only)

Control Methods

RFM (default): Recursive Feature Machine with metric learning
Linear: Linear regression probe
Logistic: Logistic regression probe
PCA: Principal component analysis
Mean Difference: Simple difference between class means
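
To make the simplest of these methods concrete, the sketch below computes a mean-difference direction from two small sets of toy activation vectors. It is an illustration only: the function name `mean_difference_direction`, the toy data, and the normalization to unit length are assumptions, and the repository's implementation in control_toolkits.py may differ.

```python
import math


def mean_difference_direction(positive, negative):
    """Unit vector from the negative-class mean to the positive-class mean."""
    dim = len(positive[0])
    pos_mean = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    neg_mean = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    diff = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction is defined")
    # Normalizing is an assumption made here so that coefficients are comparable.
    return [x / norm for x in diff]


# Toy activations: examples with the concept (positive) and without it (negative).
positive = [[2.0, 1.0], [4.0, 1.0]]  # mean is (3, 1)
negative = [[0.0, 1.0], [0.0, 1.0]]  # mean is (0, 1)

direction = mean_difference_direction(positive, negative)
print(direction)
```

The two classes differ only along the first coordinate, so the resulting direction points entirely along that axis.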

Requirements

PyTorch
Transformers (Hugging Face)
OpenAI API (for GPT-4o evaluation)
CUDA-capable GPU
