Code for steering language models using learned activation directions. This repository contains two main components:
- Steering Benchmark - Evaluates steering across hundreds of concepts (personalities, moods, fears, places, personas)
- VLM Steering - Applies steering to vision-language models
universal_steering/
├── steering_benchmark/ # Benchmark evaluation pipeline
│ ├── run.py # Compute steering directions
│ ├── eval_generations.py # Generate steered outputs
│ ├── parse_results.py # Evaluate with GPT-4o judge
│ └── read_csv.py # Aggregate results
├── vlm_steering/ # Vision-language model steering
│ └── main.py # VLM steering examples
├── neural_controllers.py # Core controller class
├── control_toolkits.py # Direction computation algorithms
├── direction_utils.py # Hidden state extraction utilities
├── generation_utils.py # Text generation with hooks
├── rfm.py # Random Feature Model algorithm
├── utils.py # Dataset loaders and utilities
├── data/ # Concept lists (personalities, moods, etc.)
└── evaluation_prompts/ # GPT-4o evaluation templates
The benchmark evaluates steering effectiveness across 5 concept categories using multiple LLMs.
- Compute Directions: Extract steering directions from training data
    python -m steering_benchmark.run --model_set phi --model_version 3-medium-4k-instruct
- Generate Steered Outputs: Generate text with steering applied
    python -m steering_benchmark.eval_generations --model_set phi --model_version 3-medium-4k-instruct
- Evaluate with GPT-4o: Score generations for concept alignment
    python -m steering_benchmark.parse_results --model_set phi --model_version 3-medium-4k-instruct
- Aggregate Results: Compute success rates
    python -m steering_benchmark.read_csv --model_set phi --model_version 3-medium-4k-instruct
Supported models:
- Llama: 3.1-8B, 3.1-70B, 3.3-70B
- Phi: 3, 4, 3-medium-4k-instruct
- Falcon: 3-3B, 3-10B
- Mistral: Small-Instruct-2409, Large-Instruct-2407
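The four benchmark stages are meant to be run in order for a given model, so a small driver script can save some typing. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and the only flags used are the `--model_set` and `--model_version` options shown in the commands above. It assumes it is launched from the repository root so that `steering_benchmark` is importable.

```python
import subprocess
import sys

# Stage modules in the order the README lists them.
STAGES = [
    "steering_benchmark.run",               # compute steering directions
    "steering_benchmark.eval_generations",  # generate steered outputs
    "steering_benchmark.parse_results",     # score generations with the GPT-4o judge
    "steering_benchmark.read_csv",          # aggregate success rates
]


def build_commands(model_set, model_version):
    """Return one argv list per pipeline stage (hypothetical helper)."""
    return [
        [sys.executable, "-m", module,
         "--model_set", model_set,
         "--model_version", model_version]
        for module in STAGES
    ]


def run_pipeline(model_set, model_version):
    """Run the stages sequentially, stopping at the first failure."""
    for cmd in build_commands(model_set, model_version):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("phi", "3-medium-4k-instruct")` would then execute the same four commands as above. Stopping at the first failure is a deliberate choice here, since each stage presumably consumes the output of the previous one.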
Steer vision-language models on multimodal inputs.
from vlm_steering.main import select_llm, generate
# Load model
llm = select_llm('llama-vision')
# Generate with steering (`prompt` and `image` must be defined by the caller)
generate('conspiracy', llm, prompt, image=image, coefs=[0.3, 0.4])
Supported models:
- Llama-Vision: 3.2-90B
- LLaVA: 1.5-7B
- Gemma: 2-9B (text-only)
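The README does not spell out what the `coefs` argument controls. A common formulation of activation steering, assumed here rather than taken from the repository's code, adds a scaled direction to the hidden states: h' = h + coef * direction. The sketch below illustrates that idea on toy NumPy arrays; the `steer` function and the example direction are hypothetical and are not the API of `vlm_steering`.

```python
import numpy as np


def steer(hidden, direction, coef):
    """Add a scaled direction to every token's hidden state.

    hidden: array of shape (seq_len, hidden_dim)
    direction: array of shape (hidden_dim,)
    Assumption: steering is additive, h' = h + coef * direction.
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    return hidden + coef * direction  # broadcasts over the sequence axis


hidden = np.zeros((2, 3))              # two tokens, hidden size 3 (toy values)
direction = np.array([1.0, 0.0, 0.0])  # hypothetical concept direction

# Sweep the same coefficient values as the example call above.
steered = {coef: steer(hidden, direction, coef) for coef in [0.3, 0.4]}
```

Under this assumption, a larger coefficient pushes the hidden states further along the concept direction, which is why it can be useful to try several values.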
Direction computation methods:
- RFM (default): Random Feature Model with metric learning
- Linear: Linear regression probe
- Logistic: Logistic regression probe
- PCA: Principal component analysis
- Mean Difference: Simple difference between class means
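Of the methods listed, mean difference is the simplest to write down: average the hidden states of each class and subtract. The following is a minimal NumPy sketch of that idea, not the code in `control_toolkits.py`; the function name is hypothetical, and normalizing the result to unit length is an assumption made for illustration.

```python
import numpy as np


def mean_difference_direction(pos_states, neg_states):
    """Unit vector pointing from the negative-class mean to the positive-class mean.

    pos_states, neg_states: arrays of shape (n_examples, hidden_dim).
    """
    pos = np.asarray(pos_states, dtype=float)
    neg = np.asarray(neg_states, dtype=float)
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return diff / norm  # normalization is an assumption, not confirmed by the repo


# Toy data: the two classes differ only along the first coordinate.
pos = np.array([[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]])
neg = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
direction = mean_difference_direction(pos, neg)
```

On this toy data the resulting direction lies entirely along the first coordinate, since that is the only axis on which the class means differ.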
Requirements:
- PyTorch
- Transformers (Hugging Face)
- OpenAI API (for GPT-4o evaluation)
- CUDA-capable GPU