Interpretability of Large Language Models (0368.4264)

This repository contains materials for the Interpretability of Large Language Models course (0368.4264) at Tel Aviv University. It is a graduate-level, active-learning course in which students learn about interpretability of LLMs in the style of a collaborative research group. The course is structured around weekly paper readings, in-class discussions, role-playing, and hands-on exercises.¹ Students are assumed to have prior background in natural language processing and machine learning.

In this repository, you will find:

* Schedule and reading lists
* Coding exercises and challenges

The course was developed by Dr. Mor Geva and Daniela Gottesman at Tel Aviv University. We also thank Amit Elhelo, Or Shafran, and Yoav Gur-Arieh for their contributions. We share these materials and hope they serve as a useful resource for anyone curious about or working on the interpretability of large language models.

Schedule and materials

The schedule is subject to minor changes.

Week	Date	Topic and papers	Practicum
1	Oct 26	Introduction and role assignments Background and NLP refresher	Exercise Solution
2	Nov 2	Probing Main paper 1: Language Models Represent Space and Time Main paper 2: A Structural Probe for Finding Syntax in Word Representations Bonus papers: * Not All Language Model Features Are One-Dimensionally Linear	Exercise Solution
3	Nov 9	Inspecting representations Main paper 1: Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Main paper 2: Language Model Inversion Bonus papers: * SelfIE: Self-Interpretation of Large Language Model Embeddings * LatentQA: Teaching LLMs to Decode Activations Into Natural Language * logit lens	Exercise Solution
4	Nov 16	Attention heads Main paper 1: Inferring Functionality of Attention Heads from their Parameters Main paper 2: Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Bonus papers: * In-context Learning and Induction Heads * Attention Heads of Large Language Models: A Survey * Analyzing Transformers in Embedding Space	Exercise Solution
5	Nov 23	MLP layers Main paper 1: Transformer Feed-Forward Layers Are Key-Value Memories Main paper 2: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Bonus papers: * Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space * Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions	Exercise Solution
6	Nov 30	Neurons (are they the right unit?) Main paper 1: Finding Neurons in a Haystack: Case Studies with Sparse Probing Main paper 2: Confidence Regulation Neurons in Language Models Bonus papers: * An Interpretability Illusion for BERT * The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability * Neurons in Large Language Models: Dead, N-gram, Positional	Review Slides
7	Dec 7	Feature representations Main paper 1: Sparse Autoencoders Find Highly Interpretable Features in Language Models Main paper 2: The Geometry of Categorical and Hierarchical Concepts in Large Language Models Bonus papers: * The Linear Representation Hypothesis and the Geometry of Large Language Models * Transcoders Find Interpretable LLM Feature Circuits	Exercise Solution
8	Dec 14	Describing features Main paper 1: Automatically Interpreting Millions of Features in Large Language Models Main paper 2: Enhancing Automated Interpretability with Output-Centric Feature Descriptions Bonus papers: * Language models can explain neurons in language models * Rigorously Assessing Natural Language Explanations of Neurons * SAEs Are Good for Steering -- If You Select the Right Features	Exercise Solution
9	Dec 28	Circuit discovery Main paper 1: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Main paper 2: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms Bonus papers: * Towards Automated Circuit Discovery for Mechanistic Interpretability * Position-aware Automatic Circuit Discovery * Circuit Component Reuse Across Tasks in Transformer Language Models	Exercise Solution
10	Jan 4	Binding mechanisms Main paper 1: How do Language Models Bind Entities in Context? Main paper 2: Language Models use Lookbacks to Track Beliefs Bonus papers: * Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context * Monitoring Latent World States in Language Models with Propositional Probes * Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking	See Week 9
11	Jan 11	Factual knowledge recall and editing Main paper 1: Locating and Editing Factual Associations in GPT Main paper 2: Dissecting Recall of Factual Associations in Auto-Regressive Language Models Bonus papers: * Linearity of Relation Decoding in Transformer Language Models * Characterizing Mechanisms for Factual Recall in Language Models * Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models	Exercise Solution
12	Jan 18	Training dynamics Main paper 1: LLM Circuit Analyses Are Consistent Across Training and Scale Main paper 2: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation Bonus papers: * On Linear Representations and Pretraining Data Frequency in Language Models	Review Slides
13	Jan 25	Project presentations Conclusion

Questions and feedback

If you have questions or suggestions, please open an issue in this repository.

¹ The course format draws inspiration from the paper-reading seminar by Alec Jacobson and Colin Raffel and The Science of Large Language Models course by Robin Jia.
