mega002/llm-interp-tau

Interpretability of Large Language Models (0368.4264)

This repository contains materials for the Interpretability of Large Language Models course (0368.4264) at Tel Aviv University. It is a graduate-level, active-learning course in which students learn about interpretability of LLMs in the style of a collaborative research group. The course is structured around weekly paper readings, in-class discussions, role-playing, and hands-on exercises [1]. Students are assumed to have prior background in natural language processing and machine learning.

In this repository, you will find:

  • Schedule and reading lists
  • Coding exercises and challenges

The course was developed by Dr. Mor Geva and Daniela Gottesman at Tel Aviv University. We also thank Amit Elhelo, Or Shafran, and Yoav Gur-Arieh for their contributions. We share these materials and hope they serve as a useful resource for anyone curious about or working on the interpretability of large language models.

Schedule and materials

The schedule is subject to minor changes.

Week Date Topic and papers Practicum
1 Oct 26 Introduction and role assignments
Background and NLP refresher
Exercise Solution
2 Nov 2 Probing
Main paper 1: Language Models Represent Space and Time
Main paper 2: A Structural Probe for Finding Syntax in Word Representations
Bonus papers:
* Not All Language Model Features Are One-Dimensionally Linear
Exercise Solution
3 Nov 9 Inspecting representations
Main paper 1: Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Main paper 2: Language Model Inversion
Bonus papers:
* SelfIE: Self-Interpretation of Large Language Model Embeddings
* LatentQA: Teaching LLMs to Decode Activations Into Natural Language
* logit lens
Exercise Solution
4 Nov 16 Attention heads
Main paper 1: Inferring Functionality of Attention Heads from their Parameters
Main paper 2: Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Bonus papers:
* In-context Learning and Induction Heads
* Attention Heads of Large Language Models: A Survey
* Analyzing Transformers in Embedding Space
Exercise Solution
5 Nov 23 MLP layers
Main paper 1: Transformer Feed-Forward Layers Are Key-Value Memories
Main paper 2: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
Bonus papers:
* Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
* Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Exercise Solution
6 Nov 30 Neurons (are they the right unit?)
Main paper 1: Finding Neurons in a Haystack: Case Studies with Sparse Probing
Main paper 2: Confidence Regulation Neurons in Language Models
Bonus papers:
* An Interpretability Illusion for BERT
* The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
* Neurons in Large Language Models: Dead, N-gram, Positional
Review Slides
7 Dec 7 Feature representations
Main paper 1: Sparse Autoencoders Find Highly Interpretable Features in Language Models
Main paper 2: The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Bonus papers:
* The Linear Representation Hypothesis and the Geometry of Large Language Models
* Transcoders Find Interpretable LLM Feature Circuits
Exercise Solution
8 Dec 14 Describing features
Main paper 1: Automatically Interpreting Millions of Features in Large Language Models
Main paper 2: Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Bonus papers:
* Language models can explain neurons in language models
* Rigorously Assessing Natural Language Explanations of Neurons
* SAEs Are Good for Steering -- If You Select the Right Features
Exercise Solution
9 Dec 28 Circuit discovery
Main paper 1: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Main paper 2: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Bonus papers:
* Towards Automated Circuit Discovery for Mechanistic Interpretability
* Position-aware Automatic Circuit Discovery
* Circuit Component Reuse Across Tasks in Transformer Language Models
Exercise Solution
10 Jan 4 Binding mechanisms
Main paper 1: How do Language Models Bind Entities in Context?
Main paper 2: Language Models use Lookbacks to Track Beliefs
Bonus papers:
* Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
* Monitoring Latent World States in Language Models with Propositional Probes
* Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
See Week 9
11 Jan 11 Factual knowledge recall and editing
Main paper 1: Locating and Editing Factual Associations in GPT
Main paper 2: Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Bonus papers:
* Linearity of Relation Decoding in Transformer Language Models
* Characterizing Mechanisms for Factual Recall in Language Models
* Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Exercise Solution
12 Jan 18 Training dynamics
Main paper 1: LLM Circuit Analyses Are Consistent Across Training and Scale
Main paper 2: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Bonus papers:
* On Linear Representations and Pretraining Data Frequency in Language Models
Review Slides
13 Jan 25 Project presentations
Conclusion

Questions and feedback

If you have questions or suggestions, please open an issue in this repository.

Footnotes

  1. The course format draws inspiration from the paper-reading seminar by Alec Jacobson and Colin Raffel and The Science of Large Language Models course by Robin Jia.
