FLAS

Flow-based Activation Steering for Inference-Time Intervention.

FLAS is a natural-language activation-steering method for LLMs. Where prior work like Golden Gate Claude had to lock in a single behavior in advance, FLAS learns a single general concept-conditioned velocity field $v_\theta(h, t, c)$. At inference you hand it any natural-language concept $c$ and it produces the corresponding activation intervention. The same checkpoint handles thousands of unseen concepts, and FLAS is the first learned steering method to consistently outperform in-context prompting on AxBench.

This repository provides a unified, multi-family implementation of FLAS. Beyond the Gemma-2 models from the paper, the same flow-based steerer trains on Qwen3, Llama-3.1, and Gemma-3 (base or instruct) through a model-type registry — adding a new family is a single registry entry. The FlowBlock also supports an optional self-attention ablation, and training scales to larger concept corpora (FLAS-46k). The checkpoints and results below are the paper's Gemma-2 artifacts; the other families are trained from scratch with the same recipe.

How it works

FLAS learns a concept-conditioned velocity field $v_\theta(h, t, c)$ that transports an unsteered activation $h$ to a steered activation $h'$:

$$h' = \varphi_T(h) = h + \int_0^T v_\theta\bigl(\varphi_t(h), t, c\bigr) dt$$

The flow time $T$ serves as a continuous steering-strength parameter and enables zero-shot strength control at inference.
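
To make the integral concrete, here is a minimal sketch of approximating it with fixed-step forward Euler, using the same `n_steps` and flow time `T` knobs that appear in the CLI. The velocity field `toy_velocity` is a hypothetical stand-in that simply pulls the activation toward a concept vector, and the choice of Euler is an assumption. The repository's actual solver and learned $v_\theta$ may differ.

```python
from typing import Callable, List

Vector = List[float]
Velocity = Callable[[Vector, float, Vector], Vector]


def toy_velocity(h: Vector, t: float, c: Vector) -> Vector:
    """Hypothetical stand-in for v_theta(h, t, c): drift toward the concept vector c."""
    return [ci - hi for hi, ci in zip(h, c)]


def euler_flow(h: Vector, c: Vector, T: float, n_steps: int,
               velocity: Velocity = toy_velocity) -> Vector:
    """Approximate h' = h + integral_0^T v(phi_t(h), t, c) dt with forward Euler."""
    dt = T / n_steps
    state = list(h)
    for k in range(n_steps):
        t = k * dt                      # current flow time
        v = velocity(state, t, c)       # velocity at the current state
        state = [s + dt * vi for s, vi in zip(state, v)]
    return state


h = [0.0, 0.0]
c = [1.0, -1.0]
weak = euler_flow(h, c, T=0.5, n_steps=3)    # short flow time: mild steering
strong = euler_flow(h, c, T=2.0, n_steps=3)  # longer flow time: stronger steering
print(weak, strong)
```

With this toy field, `T=0` leaves `h` unchanged and a larger `T` moves it further toward `c`, which mirrors the role of $T$ as a continuous strength parameter.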

Results

Evaluated on AxBench's Concept16k held-in / held-out splits with Gemma-2-2B-IT and Gemma-2-9B-IT, intervening at layer 20, fixed $T = 2$. Generations are scored by GPT-4o-mini on Concept / Instruction-following / Fluency and aggregated into HMean. See the paper for full tables, baselines, and ablations.

Get started

Try it online

The hosted demo at https://huggingface.co/spaces/Lunamos/flas-demo runs Gemma-2-2B-IT with FLAS on a ZeroGPU slice. Type any concept (e.g. "talk like a pirate") and a prompt to see the steered and baseline outputs side by side.

Pretrained checkpoints

Released on the Hugging Face Hub:

| Base model | Checkpoint repo | Inference VRAM peak |
|---|---|---|
| Gemma-2-2B-IT | flas-ai/flas-gemma-2-2b-it | ~5 GB |
| Gemma-2-9B-IT | flas-ai/flas-gemma-2-9b-it | ~18 GB |

Both are stored as bf16 safetensors.

Run the app locally

The same Gradio UI that backs the hosted demo is bundled at space/app.py. Run it on your own GPU:

git clone https://github.com/flas-ai/FLAS && cd FLAS
uv sync                                # or: pip install -e .

uv run python space/app.py             # opens http://localhost:7860

On first launch the app downloads flas-ai/flas-gemma-2-2b-it from the Hub and caches it locally; afterwards it runs entirely offline. To expose the UI over a public link (e.g. when running on a remote or headless server), edit the last line of space/app.py to demo.launch(share=True).

CLI / interactive REPL

# After uv sync, pull a checkpoint locally
hf download flas-ai/flas-gemma-2-2b-it \
    --local-dir checkpoints/flas-gemma-2-2b-it

# Chat with steering
uv run python scripts/chat.py \
    --flow-ckpt checkpoints/flas-gemma-2-2b-it/flas-gemma-2-2b-it.safetensors \
    --flowtime 2.0 --n-steps 3

In the chat REPL, use /concept <text> to change the steering target and /flowtime <T> to change strength on the fly.

Use in Python

pip install git+https://github.com/flas-ai/FLAS.git@main

from huggingface_hub import hf_hub_download
from flas.generate import load_generator

ckpt = hf_hub_download("flas-ai/flas-gemma-2-2b-it", "flas-gemma-2-2b-it.safetensors")
hf_hub_download("flas-ai/flas-gemma-2-2b-it", "config.json")  # cached alongside

gen = load_generator(ckpt)
out = gen.generate_batch(
    prompts=["Tell me about your day."],
    concept_text="Talk like a pirate",
    flowtimes=[2.0], n_steps=3, max_tokens=128,
)
print(out[0]["generation"])

The base LLM (Gemma-2-2B-IT or Gemma-2-9B-IT) is downloaded from Hugging Face on first use; make sure you have run hf auth login and accepted the Gemma-2 license.

Supported base models

The flow steerer is family-agnostic. A model-type registry in src/flas/model.py injects each family's norm / rotary / attention modules into the shared FlowCrossAttention, so the same FlowFunction trains on any of the families below — select one with --model-id:

| Family | Example --model-id | Notes |
|---|---|---|
| Gemma-2 | google/gemma-2-2b-it, google/gemma-2-9b-it | paper models; released checkpoints |
| Gemma-3 | google/gemma-3-4b-it, google/gemma-3-4b-pt | multimodal config auto-unwrapped to its text decoder; run with --disable-self-attn |
| Qwen3 | Qwen/Qwen3-8B, Qwen/Qwen3-8B-Base | |
| Llama-3.1 | meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.1-8B | |

Adding a new family is a single registry entry. The steering layer is selected with --layer, the number of FlowBlocks with --num-blocks, and the self-attention branch can be ablated with --disable-self-attn.
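
As a rough illustration of the registry idea, the sketch below maps a model-type string to the per-family pieces a shared block would need. Everything in it is hypothetical: `FamilySpec`, `FAMILY_REGISTRY`, `register_family`, `resolve_family`, and the string labels are illustrative names and have not been checked against src/flas/model.py.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FamilySpec:
    """Hypothetical description of the per-family modules a shared block needs."""
    norm_cls: str        # label for the family's normalization layer
    rotary_cls: str      # label for the family's rotary-embedding module
    attention_cls: str   # label for the family's attention module


# Hypothetical registry keyed by a model-type string.
FAMILY_REGISTRY: Dict[str, FamilySpec] = {}


def register_family(model_type: str, spec: FamilySpec) -> None:
    """Add one family; refusing duplicates keeps entries unambiguous."""
    if model_type in FAMILY_REGISTRY:
        raise ValueError(f"family {model_type!r} is already registered")
    FAMILY_REGISTRY[model_type] = spec


def resolve_family(model_type: str) -> FamilySpec:
    """Look up a family and fail with a readable message if it is unknown."""
    try:
        return FAMILY_REGISTRY[model_type]
    except KeyError:
        known = ", ".join(sorted(FAMILY_REGISTRY)) or "<none>"
        raise KeyError(f"unsupported model type {model_type!r}; known: {known}") from None


# Supporting a new family is then one call (labels are illustrative only).
register_family("gemma2", FamilySpec("Gemma2RMSNorm", "Gemma2RotaryEmbedding", "Gemma2Attention"))
register_family("qwen3", FamilySpec("Qwen3RMSNorm", "Qwen3RotaryEmbedding", "Qwen3Attention"))

print(resolve_family("qwen3"))
```

Keeping the family-specific choices in one table is what lets the shared steering code stay unchanged when a family is added.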

Install (for development / training)

We recommend using uv:

uv sync
uv run python -c "import flas; print('ok')"

uv will read pyproject.toml and install a matching CUDA-enabled torch wheel automatically.

If you prefer pip:

python -m venv .venv && source .venv/bin/activate
pip install -e .

The base LLM is downloaded from Hugging Face. Make sure you have run hf auth login and accepted the relevant model license (e.g. Gemma, Llama).

Data

FLAS trains on AxBench data. The AlpacaEval prompts used by scripts/eval.py are already bundled at data/alpaca_eval.json, sourced from tatsu-lab/alpaca_eval. To re-download them, run hf download tatsu-lab/alpaca_eval alpaca_eval.json --repo-type dataset --local-dir data. For training or the Concept16k held-out evaluation, clone the AxBench repo:

git clone https://github.com/stanfordnlp/axbench thirdparty/axbench

The training data (Concept500 / Concept16k parquets) is tracked in the AxBench repo via Git LFS, so git clone already pulls it as long as Git LFS is installed. The files live at thirdparty/axbench/axbench/<concept_set>/prod_<model>_<layer>_v1/generate/.

For example, Concept500 on Gemma-2-2B-IT layer 20:

thirdparty/axbench/axbench/concept500/prod_2b_l20_v1/generate/
├── train_data.parquet
└── metadata.jsonl

FLAS-46k

Larger-scale runs use FLAS-46k, an AxBench-derived concept-steering corpus (~46k concepts, ~2.6M rows). The data is text-only and model-agnostic — activations are extracted live at the chosen --layer, so a single corpus trains any base model. FLAS-46k (and a fixed, reproducible held-out split for evaluation) will be released on the Hugging Face Hub; point --data-dir at it once downloaded.

Hardware

End-to-end inference VRAM (peak, batch=1, 128-token generation, all bf16, with the ConceptEncoder sharing modules with the base LLM):

| Base model | Base (bf16) | FlowFunction | ConceptEncoder | Inference peak |
|---|---|---|---|---|
| Gemma-2-2B-IT | 4.9 GB | 0.2 GB | (shared) | ~5.1 GB |
| Gemma-2-9B-IT | 17.2 GB | 0.5 GB | (shared) | ~17.8 GB |

Batched eval (batch=15, 256 tokens) raises peaks to ~9 GB and ~22 GB respectively. Training peaks higher than inference because of optimizer state and activations — the recipes below were run on a single 80 GB A100, but a single 24 GB consumer card is plenty for the 2B Concept500 setting.
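
As a sanity check on the base-model figures, the weight footprint can be estimated as parameter count times bytes per parameter. The sketch below assumes approximate parameter counts of 2.61B and 9.24B for the two Gemma-2 models (these counts are not stated in this README) and reports sizes in GiB (2^30 bytes).

```python
GIB = 1024 ** 3

# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}

# Approximate parameter counts; an assumption, not taken from this README.
APPROX_PARAMS = {"Gemma-2-2B-IT": 2.61e9, "Gemma-2-9B-IT": 9.24e9}


def weight_gib(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


for name, n_params in APPROX_PARAMS.items():
    bf16 = weight_gib(n_params, "bf16")
    fp32 = weight_gib(n_params, "fp32")
    print(f"{name}: bf16 {bf16:.1f} GiB, fp32 {fp32:.1f} GiB")
```

Under these assumptions the bf16 estimates come out at roughly 4.9 and 17.2, in line with the Base (bf16) column if its GB is read as GiB. The estimate covers weights only; the KV cache, activations, and the FlowFunction account for the rest of the peak.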

Reproducing paper results

Note on dtype. The checkpoints distributed on the Hugging Face Hub were converted from the original fp32 training artifacts to bf16 to halve the VRAM/disk footprint. We re-evaluated bf16 on the AxBench Concept16k held-in / held-out splits at $T = 2$ and the GPT-4o-mini HMean fell within the 95% bootstrap confidence interval reported in the paper (Table 1). All recipes below run unchanged on either fp32 or bf16 weights.

Train on Concept500 (single 18 GB+ GPU, Gemma-2-2B-IT):

uv run python -m flas.train \
    --data-dir thirdparty/axbench/axbench/concept500/prod_2b_l20_v1/generate \
    --output-dir checkpoints --run-name flas_2b_c500

To train on Concept16k instead, add --val-n-concepts for training-time held-out evaluation:

uv run python -m flas.train \
    --data-dir thirdparty/axbench/axbench/concept16k/prod_2b_l20_v1/generate \
    --val-n-concepts 500 \
    --output-dir checkpoints --run-name flas_2b_c16k

Generate steered outputs:

uv run python scripts/eval.py \
    --flow-ckpt checkpoints/flas_2b_c500/best_step*.pt \
    --output-dir results/flas_2b_c500 \
    --num-eval-prompts 10 --max-tokens 256 \
    --flowtimes 1.0 1.5 2.0 2.5 3.0

Score generations with GPT-4o-mini (AxBench-aligned C/I/F judge):

uv run python scripts/judge_openai.py \
    --results-file results/flas_2b_c500/results_shard0.json \
    --output       results/flas_2b_c500/judged.json \
    --api-key "$OPENAI_API_KEY" --concurrency 8

Interactive CLI for testing:

uv run python scripts/chat.py \
    --flow-ckpt checkpoints/flas_2b_c500/best_step*.pt

Training on other base models

The trainer is family-agnostic — choose the base model with --model-id and the steering layer with --layer. For example, Qwen3-8B at layer 20 with a single FlowBlock, self-attention ablated, on FLAS-46k:

uv run python -m flas.train \
    --model-id Qwen/Qwen3-8B --data-dir data/flas-concept46k \
    --layer 20 --num-blocks 1 --disable-self-attn \
    --n-steps 3 --T-min 0.5 --T-max 2.0 \
    --output-dir checkpoints --run-name flas46k_qwen3_8b

Base (non-instruct) models have no chat template, so pass --prompt-format alpaca (a minimal ### Instruction: / ### Response: wrapper) for both training and generation — the format is saved in the checkpoint config and read back automatically at eval time.
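
For orientation, an Alpaca-style wrapper of this kind might look like the sketch below. `format_alpaca` is a hypothetical helper, and the exact whitespace and wording of the template used by --prompt-format alpaca are assumptions; check the repository before relying on them.

```python
def format_alpaca(instruction: str) -> str:
    """Wrap a raw instruction in a minimal Instruction/Response template.

    Hypothetical sketch: the template actually used by the repository may differ.
    """
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"


prompt = format_alpaca("Tell me about your day.")
print(prompt)
```

A base model then continues the text after the Response header, which plays the role a chat template plays for instruct models.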

Evaluation protocol

Following AxBench (Wu et al., 2025):

  1. Generate. For each held-out concept × AlpacaEval prompt × flowtime $T$, generate steered text with scripts/eval.py.
  2. Judge. GPT-4o-mini scores each generation on Concept ($C$), Instruction-following ($I$), and Fluency ($F$), each in $\{0, 1, 2\}$.
  3. Aggregate. Per-factor max with a prompt-level 50/50 split: each concept's prompts are divided into "train" and "test" halves, the best $T$ per concept is selected on the "train" prompts, and HMean = $3 / (1/C + 1/I + 1/F)$ is reported on the "test" prompts.
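
The aggregation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's scoring code: the record layout is hypothetical, HMean is computed per generation and then averaged, HMean is taken to be 0 when any score is 0, and the first half of the prompt ids is used as the "train" split.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical record layout: (concept, prompt_id, flowtime, C, I, F)
Record = Tuple[str, int, float, int, int, int]


def hmean(c: float, i: float, f: float) -> float:
    """Harmonic mean of the three scores; taken to be 0 if any score is 0."""
    if min(c, i, f) <= 0:
        return 0.0
    return 3.0 / (1.0 / c + 1.0 / i + 1.0 / f)


def aggregate(records: List[Record], n_prompts: int) -> Dict[str, float]:
    """Pick the best T per concept on "train" prompts, report HMean on "test" prompts."""
    split = n_prompts // 2
    train = defaultdict(lambda: defaultdict(list))
    test = defaultdict(lambda: defaultdict(list))
    for concept, prompt_id, flowtime, c, i, f in records:
        bucket = train if prompt_id < split else test
        bucket[concept][flowtime].append(hmean(c, i, f))
    report = {}
    for concept, by_time in train.items():
        best = max(by_time, key=lambda t: mean(by_time[t]))  # best T on "train"
        report[concept] = mean(test[concept][best])          # scored on "test"
    return report


records = [
    ("pirate", 0, 1.0, 1, 2, 2),
    ("pirate", 0, 2.0, 2, 2, 2),
    ("pirate", 1, 1.0, 0, 2, 2),
    ("pirate", 1, 2.0, 2, 1, 2),
]
report = aggregate(records, n_prompts=2)
print(report)
```

Selecting $T$ on one half of the prompts and reporting on the other keeps the strength choice from being tuned on the prompts that are scored. The sketch assumes every concept has "test" generations at its selected $T$.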

Project layout

flas/
├── src/flas/
│   ├── model.py          # multi-family registry + FlowBlock / FlowFunction / ConceptEncoder
│   ├── train.py          # PyTorch Lightning training
│   └── generate.py       # batched steered generation
├── scripts/
│   ├── eval.py           # AxBench-aligned generation
│   ├── judge_openai.py   # GPT-4o-mini judge (OpenAI)
│   └── chat.py           # interactive CLI
├── pyproject.toml
├── LICENSE
└── README.md

Acknowledgements and data

  • Base models: Gemma-2 and Gemma-3 (Google), Qwen3 (Alibaba), and Llama-3.1 (Meta).
  • Steering data and evaluation pipeline: AxBench (Wu et al., 2025), which provides the Concept500 / Concept16k corpora and the C/I/F judge prompts.
  • Eval prompts: AlpacaEval (Li et al., 2023), the 805 instructions used at evaluation time. The bundled data/alpaca_eval.json is a verbatim copy of tatsu-lab/alpaca_eval.

Citation

@article{flas2026,
  title  = {Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention},
  author = {Zehao Jin and Ruixuan Deng and Junran Wang and Xinjie Shen and Chao Zhang},
  year   = {2026},
  eprint = {2605.05892},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2605.05892},
}

License

Released under the Apache 2.0 license (see LICENSE).
