Opening the Bottleneck: Steering LLMs via Concept Intervention

Neil Dandekar & Christian Guerra · UCSD DSC 180B Capstone · Advised by Lily Weng

Project Website: chguerra15.github.io/capstone-site | Report: Q2Report_Checkpoint-3.pdf | Original Paper: CB-LLMs (ICLR 2025)

Problem Description

Large language models (LLMs) are powerful but opaque: their predictions depend on thousands of hidden features that carry no clear human meaning, which makes their behavior hard to understand, audit, or correct.

Concept Bottleneck Large Language Models (CB-LLMs) address this by inserting a human-interpretable "concept layer" between the transformer backbone and the final classifier. Each neuron in this layer represents a semantic concept (e.g., sentiment, sports, toxicity), allowing users to inspect and directly manipulate what drives a prediction.
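To make the data flow concrete, here is a minimal sketch of a concept bottleneck forward pass in plain Python. The concept names, activations, and weights are invented for illustration and are not taken from any CB-LLM checkpoint; in the real model the activations come from the transformer backbone and the trained concept layer.

```python
import math

# Hypothetical concept-layer output for one input sentence.
concept_activations = {"sports": 2.0, "politics": 0.1, "technology": 0.3}

# Hypothetical final-layer weights: one row per class, one weight per concept.
class_weights = {
    "Sports":   {"sports": 1.5, "politics": -0.2, "technology": -0.1},
    "World":    {"sports": -0.3, "politics": 1.2, "technology": 0.0},
    "Sci/Tech": {"sports": -0.2, "politics": 0.0, "technology": 1.4},
}

def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def classify(activations, weights):
    """Linear classifier head that sees only the concept activations."""
    logits = {
        label: sum(row[c] * activations[c] for c in activations)
        for label, row in weights.items()
    }
    return softmax(logits)

probs = classify(concept_activations, class_weights)
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 3))
```

Because the classifier reads nothing but the concept activations, every prediction can be traced back to named concepts, and editing an activation changes the output directly.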

This project reproduces key results from the CB-LLM paper (ICLR 2025) and extends the framework with:

Multi-neuron intervention analysis — coordinated suppression/amplification of concept neuron groups
An interactive GUI — real-time concept manipulation without model retraining
Sankey visualizations — concept weight contribution diagrams across four datasets
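As a rough illustration of the multi-neuron intervention idea, the sketch below rescales whole groups of concept neurons at once. The function name, the grouping, and the activation values are assumptions made for this example and do not reflect the project's actual code.

```python
def intervene(activations, groups, group_scales):
    """Return a copy of `activations` with each named group rescaled.

    activations : list of floats, one per concept neuron
    groups      : dict mapping a group name to a list of neuron indices
    group_scales: dict mapping a group name to a multiplier
                  (0.0 suppresses, 1.0 leaves unchanged, >1.0 amplifies)
    """
    edited = list(activations)
    for name, scale in group_scales.items():
        for idx in groups[name]:
            # Scale from the original value so repeated indices are not compounded.
            edited[idx] = activations[idx] * scale
    return edited

# Hypothetical example: six concept neurons, two named groups.
activations = [0.9, 0.7, 0.1, 0.2, 0.05, 0.4]
groups = {"sports": [0, 1], "business": [3, 5]}

edited = intervene(activations, groups, {"sports": 0.0, "business": 2.0})
print(edited)
```

The edited activations would then be passed to the classifier head in place of the originals, so no weights are changed.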

Directory Structure

capstone/
├── backend_api/          # Flask API connecting the GUI to the CB-LLM model
│   └── app.py            # Main API entrypoint
├── classification/       # CB-LLM text classification pipeline
│   ├── get_concept_labels.py           # Generate concept scores via ACS
│   ├── train_CBL.py                    # Train the Concept Bottleneck Layer
│   ├── train_FL.py                     # Train the sparse linear classifier
│   ├── test_CBLLM.py                   # Evaluate CB-LLM accuracy
│   ├── print_concept_contributions.py  # Generate per-sample explanations
│   ├── finetune_black_box.py           # Train black-box baseline
│   └── requirements.txt
├── generation/           # CB-LLM text generation pipeline
│   ├── train_CBLLM.py          # Finetune Llama3 with CBL
│   ├── test_steerability.py    # Evaluate steerability
│   ├── test_perplexity.py      # Evaluate perplexity
│   ├── test_generation.py      # Generate text with concept intervention
│   ├── test_weight.py          # Extract top concept-token weights for Sankey
│   └── requirements.txt
├── frontend/             # Interactive concept manipulation GUI
│   └── app.py            # Slider-based neuron editor, runs locally
├── codex/                # Additional experiment scripts and analysis
├── fig/                  # Figures used in the report and README
├── checkpoint.ipynb      # Self-contained Q1 reproduction notebook
└── README.md

Environment Setup

Requirements: CUDA 12.1, Python 3.10, PyTorch 2.2

It is strongly recommended to use a conda virtual environment:

conda create -n cbllm python=3.10
conda activate cbllm
conda install pytorch==2.2.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
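After creating the environment, a short check like the following can confirm that the interpreter and PyTorch build match the stated requirements. This is an optional sketch, not part of the repository; the helper names are made up here.

```python
import sys

def python_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.10, the version listed above."""
    return tuple(version_info[:2]) == (3, 10)

def describe_torch():
    """Report the installed PyTorch and CUDA versions, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "gpu_available": torch.cuda.is_available(),
    }

if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    print("PyTorch:", describe_torch())
```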

Installation

Classification Pipeline

cd classification
pip install -r requirements.txt

Key dependencies: transformers>=4.38, datasets, setfit, scikit-learn, sentence-transformers, tqdm, numpy, torch==2.2.0

Generation Pipeline

cd generation
pip install -r requirements.txt

Key dependencies: transformers>=4.38, peft, trl, accelerate, datasets, bitsandbytes, torch==2.2.0

Frontend GUI

cd frontend
pip install -r requirements.txt

Dataset Access

All datasets load automatically through the Hugging Face datasets library, so no manual download is required. After the first run they are cached locally under ~/.cache/huggingface/datasets/.

Dataset	HuggingFace ID	Task
SST-2	`SetFit/sst2`	Sentiment (binary)
Yelp Polarity	`yelp_polarity`	Sentiment (binary)
AG News	`ag_news`	Topic (4-class)
DBpedia	`dbpedia_14`	Topic (14-class)
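For reference, loading one of these datasets by hand might look like the sketch below. The dictionary copies the IDs from the table above; the helper name `load_split` is an assumption, and the project scripts may load data differently.

```python
# Dataset names and Hugging Face IDs copied from the table above.
DATASET_IDS = {
    "SST-2": "SetFit/sst2",
    "Yelp Polarity": "yelp_polarity",
    "AG News": "ag_news",
    "DBpedia": "dbpedia_14",
}

def load_split(name, split="test"):
    """Download (or reuse the cached copy of) one split of a dataset."""
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASET_IDS[name], split=split)

# Example (needs network access on first run):
#   sst2_test = load_split("SST-2")
#   print(sst2_test[0])
```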

Quickstart: Reproduce Results (No Setup Required)

For Q1 reproduction, open and run checkpoint.ipynb — it installs all dependencies automatically and reproduces the key benchmark tables using pretrained checkpoints from HuggingFace.

Running Experiments

Step 1 — Download Pretrained Checkpoints

Skip training entirely using the authors' finetuned checkpoints:

Classification:

git lfs install
git clone https://huggingface.co/cesun/cbllm-classification temp_repo
mv temp_repo/mpnet_acs classification/
rm -rf temp_repo

Generation:

git lfs install
git clone https://huggingface.co/cesun/cbllm-generation temp_repo
mv temp_repo/from_pretained_llama3_lora_cbm generation/
rm -rf temp_repo
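Before moving on, it can help to confirm that the checkpoints landed where the later commands expect them. The sketch below checks only the two paths that appear in this README, run from the repository root; the helper name is hypothetical.

```python
from pathlib import Path

# Paths taken from the commands in this README; adjust if your layout differs.
EXPECTED = [
    "classification/mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt",
    "generation/from_pretained_llama3_lora_cbm",
]

def missing_checkpoints(repo_root=".", expected=EXPECTED):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in expected if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    print("All checkpoints found." if not missing else f"Missing: {missing}")
```

Note that a file can exist and still be a small Git LFS pointer if git lfs was not installed before cloning, so an unexpectedly tiny .pt file is worth a second look.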

Step 2 — Evaluate Classification Accuracy

cd classification
python test_CBLLM.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt --sparse

Expected output: per-class and overall test accuracy printed to stdout (about 0.96 on SST-2). Use --dataset ag_news, --dataset yelp_polarity, or --dataset dbpedia_14 to switch datasets.
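The two reported numbers can be reproduced from raw predictions with a few lines of Python. This is a generic sketch of per-class and overall accuracy, not the code in test_CBLLM.py, and the labels below are toy values.

```python
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    """Return (overall_accuracy, {label: accuracy}) for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_class

# Toy labels, not model output: 0 = negative, 1 = positive.
overall, per_class = accuracy_report([0, 0, 1, 1, 1], [0, 1, 1, 1, 1])
print(overall, per_class)
```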

Step 3 — Generate Concept Contribution Explanations

cd classification
python print_concept_contributions.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt

Expected output: 5 concept-based explanations per sample printed to stdout, showing which concepts drove each prediction.
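One common way to rank concepts for an explanation is to multiply each concept activation by the final-layer weight of the predicted class and keep the largest products. The sketch below assumes that scoring rule; the script's exact scoring may differ, and all names and numbers here are illustrative.

```python
def top_contributions(activations, class_weights, k=5):
    """Rank concepts by activation * weight for one predicted class.

    activations  : dict concept -> activation for a single sample
    class_weights: dict concept -> final-layer weight for the predicted class
    """
    contributions = {
        concept: activations[concept] * class_weights.get(concept, 0.0)
        for concept in activations
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical values for a positive movie review.
activations = {"great acting": 1.8, "boring plot": 0.1, "strong visuals": 0.9}
positive_weights = {"great acting": 1.2, "boring plot": -1.5, "strong visuals": 0.5}

for concept, score in top_contributions(activations, positive_weights, k=2):
    print(f"{concept}: {score:.2f}")
```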

Step 4 — Visualize Concept Weights (Sankey Diagram)

cd generation
python test_weight.py --dataset ag_news

Expected output: [concept] [token] [weight] triples printed to stdout. Paste into SankeyMATIC to generate the flow diagram.
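SankeyMATIC expects one flow per line in the form `Source [Amount] Target`, so the triples usually need reordering before pasting. The converter below is a sketch that assumes one whitespace-separated triple per line with the weight last; check it against the actual output of test_weight.py before relying on it.

```python
def to_sankeymatic(triples_text):
    """Convert 'concept token weight' lines into 'concept [weight] token' lines.

    Assumes one whitespace-separated triple per line, with the weight last.
    Concept names containing spaces are kept whole because the split is from the right.
    Non-positive weights are skipped, since flow widths must be positive.
    """
    lines = []
    for raw in triples_text.strip().splitlines():
        parts = raw.rsplit(maxsplit=2)
        if len(parts) != 3:
            continue  # skip lines that are not triples
        concept, token, weight = parts
        try:
            value = float(weight)
        except ValueError:
            continue  # skip lines whose last field is not a number
        if value > 0:
            lines.append(f"{concept} [{value:.4f}] {token}")
    return "\n".join(lines)

# Made-up triples for illustration.
sample = """sports game 0.8123
sports team 0.6540
business stock 0.7001"""
print(to_sankeymatic(sample))
```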

Step 5 — Evaluate Steerability and Perplexity

cd generation
python test_steerability.py
python test_perplexity.py

Expected output: Steerability score (0–1) and perplexity value printed per dataset. Reference values from our reproduction:

Method	Metric	SST-2	Yelp Polarity	AG News	DBpedia
CB-LLM (ours)	Accuracy↑	0.9638	0.9855	0.9439	0.9924
CB-LLM (ours)	Steerability↑	0.82	0.95	0.85	0.76
CB-LLM (ours)	Perplexity↓	116.22	13.03	18.25	37.59
CB-LLM w/o ADV	Accuracy↑	0.9676	0.9830	0.9418	0.9934
CB-LLM w/o ADV	Steerability↑	0.57	0.69	0.52	0.21
Llama3 (black-box)	Accuracy↑	0.9692	0.9851	0.9493	0.9919
Llama3 (black-box)	Steerability↑	No	No	No	No
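For readers unfamiliar with the two generation metrics, the sketch below shows how they are commonly computed. Perplexity is the exponential of the mean negative log-likelihood per token. Steerability is assumed here to be the fraction of interventions whose generated text is judged to belong to the intervened class; the test scripts define the exact procedure, and the numbers below are toy values.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the generated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def steerability(target_classes, judged_classes):
    """Fraction of interventions whose generated text was judged as the target class."""
    if len(target_classes) != len(judged_classes) or not target_classes:
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(t == j for t, j in zip(target_classes, judged_classes))
    return hits / len(target_classes)

# Toy numbers for illustration only.
print(perplexity([math.log(0.25)] * 4))
print(steerability(["sports", "world", "sports", "business"],
                   ["sports", "world", "business", "business"]))
```

Lower perplexity means more fluent text under the evaluating model, while higher steerability means interventions more reliably change the topic or sentiment of the output.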

Step 6 — Run Concept Interventions

cd generation
python test_generation.py --dataset ag_news

Edit line 48 of test_generation.py to set the neuron activation value you want to intervene on.

Expected output: generated sentences whose topic shifts according to the concept neuron you modified. For example, zeroing the Sports neuron on a sports headline drops the predicted Sports probability from 0.89 to 0.48, while doubling it raises the probability to 0.94.
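The direction of this effect can be seen in a toy model. The sketch below zeroes and doubles one concept neuron and recomputes the class probability with a small linear head; the two-concept setup and its weights are hypothetical, so it will not reproduce the probabilities quoted above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(activations, weight_rows):
    """Probabilities from a linear head applied to concept activations."""
    logits = [sum(w * a for w, a in zip(row, activations)) for row in weight_rows]
    return softmax(logits)

# Hypothetical two-concept, two-class setup: concepts = [sports, business].
activations = [1.5, 1.0]
weight_rows = [
    [1.0, 0.0],  # class 0: Sports
    [0.0, 1.0],  # class 1: Business
]
SPORTS_NEURON, SPORTS_CLASS = 0, 0

results = {}
for label, multiplier in [("zeroed", 0.0), ("original", 1.0), ("doubled", 2.0)]:
    edited = list(activations)
    edited[SPORTS_NEURON] *= multiplier  # intervene on a single neuron
    results[label] = class_probs(edited, weight_rows)[SPORTS_CLASS]
    print(f"{label:>8}: P(Sports) = {results[label]:.3f}")
```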

Running the GUI

cd frontend
python app.py

Open the URL shown in the terminal. The interface lets you:

Enter any input sentence
View the model's concept activations
Adjust individual neuron values with sliders
See prediction probabilities update in real time

No model retraining is required — the GUI applies scalar multipliers to neuron activations before the classifier head computes logits.

Contribution

Person	Role
Neil Dandekar	Backend experimentation, CB-LLM reproduction, intervention logic, model evaluation
Christian Guerra	Frontend/GUI development, Sankey visualization integration, project website deployment
Lily Weng	Research advising, methodology guidance, project direction

Citation

@article{cbllm,
   title={Concept Bottleneck Large Language Models},
   author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
   journal={ICLR},
   year={2025}
}

Original authors' repository: Trustworthy-ML-Lab/CB-LLMs
