Inside Transformers

A hands-on learning project for building and training a small GPT-style language model from scratch in PyTorch. The goal is to understand transformers from the inside out: attention, residual blocks, training loops, and what the model is actually doing at each layer.

The project trains a character-level GPT on Tiny Shakespeare, then visualizes its attention patterns in a Jupyter notebook.

🏗️ Architecture

Input token IDs
    ↓
Token embeddings          ← nn.Embedding
    +
Positional embeddings     ← nn.Embedding
    ↓
Dropout
    ↓
Transformer Block × N
    │
    ├── LayerNorm
    ├── Multi-head attention
    ├── Residual connection
    ├── LayerNorm
    ├── Feed-forward network
    └── Residual connection
    ↓
Final LayerNorm
    ↓
Linear projection         ← lm_head (weight-tied to token embeddings)
    ↓
Logits (B, T, vocab_size)

Each transformer block uses pre-norm (LayerNorm before attention and feed-forward), causal multi-head self-attention, and a GELU feed-forward network with a 4× expansion.

📂 Project structure

inside-transformers/
├── models/
│   ├── attention.py      # Head, MultiHeadAttention, FeedForward, TransformerBlock
│   └── gpt.py            # GPT model (embeddings, blocks, loss, generation)
├── data/
│   ├── prepare.py        # Download Shakespeare, build char vocab, save tensors
│   └── input.txt         # Raw text (downloaded on first run if missing)
├── training/
│   └── train.py          # Training loop, checkpointing, sample generation
├── notebooks/
│   ├── 01_tokenization.ipynb   # (placeholder - not yet implemented)
│   └── 02_attention_viz.ipynb  # Load checkpoint, hook attention, plot heatmaps
├── configs/
│   ├── base.yaml         # (placeholder)
│   ├── tiny_cpu.yaml     # (placeholder)
│   └── gpu_3070.yaml     # (placeholder)
├── scripts/
│   ├── run_cpu.sh        # (placeholder)
│   └── run_gpu.sh        # (placeholder)
├── docs/                 # Learning notes (Word documents)
├── check_requirements.py # Verify installed packages and CUDA availability
├── requirements.txt
└── LICENSE

Generated artifacts (not committed - see .gitignore):

data/train.pt, data/val.pt, data/vocab.pt - produced by data/prepare.py
checkpoints/gpt_shakespeare.pt - produced by training/train.py

⚙️ Requirements

Python 3.10+
PyTorch 2.0+ (CUDA optional but recommended for training)
See requirements.txt for the full dependency list

| Package | Purpose |
| --- | --- |
| `torch` | Model, training, data tensors |
| `numpy` | Numerical utilities |
| `matplotlib` | Attention heatmaps in notebooks |
| `jupyter` | Interactive exploration |
| `tiktoken` | Planned for subword tokenization (notebook 01) |
| `wandb` | Planned for experiment tracking |
| `requests` | Downloads the dataset |

🚀 Quick start

1. Clone and install

git clone https://github.com/daveshenal/inside-transformers.git
cd inside-transformers
pip install -r requirements.txt

Verify your environment:

python check_requirements.py

2. Prepare data

Downloads Tiny Shakespeare (if input.txt is missing), builds a character-level vocabulary, and saves train/val splits:

python data/prepare.py

Expected output:

~1.1M characters of Shakespeare text
65 unique characters (letters, punctuation, whitespace)
90/10 train/val split saved as PyTorch tensors
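
The character-level encoding described above can be sketched in a few lines of plain Python. This is an illustration only, not the contents of data/prepare.py: the function names (build_vocab, encode, decode, split_train_val) are hypothetical, the short sample string stands in for input.txt, and the split is assumed to be a contiguous 90/10 cut.

```python
def build_vocab(text):
    """Map each unique character to an integer ID (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer token IDs."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)


def split_train_val(ids, train_frac=0.9):
    """Assumed split: first 90% for training, the rest for validation."""
    n = int(train_frac * len(ids))
    return ids[:n], ids[n:]


# A short stand-in for the real dataset.
text = "First Citizen:\nBefore we proceed any further, hear me speak."
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
train_ids, val_ids = split_train_val(ids)
print(len(stoi), len(train_ids), len(val_ids))
```

On the full dataset the same procedure would yield the 65-character vocabulary mentioned above; the real script additionally saves the results as tensors.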

3. Train

python training/train.py

Training uses hard-coded hyperparameters in training/train.py (YAML configs are not wired up yet). Defaults:

| Setting | Value |
| --- | --- |
| `block_size` | 128 |
| `batch_size` | 64 |
| `max_iters` | 5000 |
| `eval_every` | 200 |
| `lr` | 3e-4 |
| `n_embd` | 192 |
| `n_heads` | 6 |
| `n_layers` | 4 |
| `dropout` | 0.1 |

The model has ~1.8M parameters. On CPU this takes a while; a CUDA GPU is much faster.
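
The ~1.8M figure can be checked with a little arithmetic. The sketch below is a back-of-the-envelope count, not code from the repo; count_gpt_params is a hypothetical helper, and it assumes bias-free query/key/value projections, a biased output projection, biased feed-forward layers, and a bias-free lm_head that shares its weights with the token embedding.

```python
def count_gpt_params(vocab_size=65, block_size=128, n_embd=192, n_layers=4):
    """Estimate the parameter count under the assumptions stated above."""
    tok_emb = vocab_size * n_embd   # token embedding table
    pos_emb = block_size * n_embd   # learned positional embedding table

    # Attention: q, k, v across all heads total n_embd x n_embd each (no bias, assumed),
    # plus an output projection with bias.
    qkv = 3 * n_embd * n_embd
    proj = n_embd * n_embd + n_embd

    # Feed-forward: two linear layers with a 4x hidden expansion, both with bias.
    hidden = 4 * n_embd
    ff = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)

    # Two LayerNorms per block, each with a weight and a bias vector.
    ln = 2 * (2 * n_embd)

    per_block = qkv + proj + ff + ln
    final_ln = 2 * n_embd

    # lm_head is weight-tied to tok_emb, so it adds nothing (assuming no bias).
    return tok_emb + pos_emb + n_layers * per_block + final_ln


print(count_gpt_params())  # 1814592
```

Note that n_heads does not appear in the count: splitting n_embd across more heads changes the shape of the attention computation but not the number of weights.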

During training, the script periodically prints the train and val cross-entropy loss (every eval_every iterations, 200 by default). When training finishes:

A checkpoint is saved to checkpoints/gpt_shakespeare.pt (model weights + config dict).
A 500-token sample is generated with temperature 0.8 and top_k=40.

4. Explore attention

After training, open the visualization notebook:

jupyter notebook notebooks/02_attention_viz.ipynb

The notebook:

Loads the checkpoint and vocabulary.
Registers forward hooks on each layer's multi-head attention.
Runs a forward pass on a sample sentence.
Plots per-head attention heatmaps for each transformer layer.
Includes written observations about how attention patterns evolve across layers.
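
The hook mechanism in the steps above can be sketched as follows. This is not the notebook's code: it uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the repo's own attention module (whose return value is not documented here), and the names captured and make_hook are hypothetical.

```python
import torch
import torch.nn as nn

captured = {}


def make_hook(name):
    """Build a forward hook that stores the attention weights under `name`."""
    def hook(module, args, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights).
        captured[name] = output[1].detach()
    return hook


torch.manual_seed(0)
n_embd, n_heads, T = 24, 6, 5

# Two stand-in attention layers, one hook per layer.
layers = nn.ModuleList(
    [nn.MultiheadAttention(n_embd, n_heads, batch_first=True) for _ in range(2)]
)
handles = [
    layer.register_forward_hook(make_hook(f"layer{i}"))
    for i, layer in enumerate(layers)
]

# Boolean mask: True marks positions that must NOT be attended to (the future).
causal_mask = torch.ones(T, T).triu(diagonal=1).bool()

x = torch.randn(1, T, n_embd)
with torch.no_grad():
    for layer in layers:
        # average_attn_weights=False keeps one (T, T) matrix per head.
        x, _ = layer(x, x, x, attn_mask=causal_mask, average_attn_weights=False)

# Remove the hooks once the weights are captured.
for handle in handles:
    handle.remove()

for name, weights in captured.items():
    print(name, tuple(weights.shape))
```

Each captured tensor has shape (batch, heads, T, T), so weights[0, h] is the matrix that would be drawn as the heatmap for head h, for example with matplotlib's imshow.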

🔬 Model details

`models/attention.py`

Head - Single causal self-attention head with scaled dot-product attention and a lower-triangular mask buffer.
MultiHeadAttention - Runs n_heads heads in parallel, concatenates, and projects.
FeedForward - Two linear layers with GELU and 4× hidden expansion.
TransformerBlock - Pre-norm residual block: x + attn(ln1(x)), then x + ff(ln2(x)).
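
The four components listed above fit together roughly as in the sketch below. The class names follow the README, but the constructor signatures, attribute names, and dropout placement are assumptions made for illustration; models/attention.py may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask kept as a buffer: it follows the module across
        # devices but is not a trainable parameter.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        T = x.size(1)
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product scores, shape (B, T, T).
        att = (q @ k.transpose(-2, -1)) * k.size(-1) ** -0.5
        # Block attention to future positions before the softmax.
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        return att @ v


class MultiHeadAttention(nn.Module):
    """Run several heads, concatenate their outputs, and project."""

    def __init__(self, n_embd, n_heads, block_size, dropout=0.0):
        super().__init__()
        head_size = n_embd // n_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout) for _ in range(n_heads)]
        )
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))


class FeedForward(nn.Module):
    """Two linear layers with GELU and a 4x hidden expansion."""

    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class TransformerBlock(nn.Module):
    """Pre-norm residual block: LayerNorm is applied before each sub-layer."""

    def __init__(self, n_embd, n_heads, block_size, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadAttention(n_embd, n_heads, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = FeedForward(n_embd, dropout)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x


torch.manual_seed(0)
block = TransformerBlock(n_embd=32, n_heads=4, block_size=16).eval()
x = torch.randn(2, 10, 32)
with torch.no_grad():
    y = block(x)
print(y.shape)
```

A quick way to convince yourself the mask works: perturb the last token of the input and confirm that the outputs at all earlier positions are unchanged.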

`models/gpt.py`

Token and learned positional embeddings summed and dropped out.
Stack of TransformerBlock modules followed by final LayerNorm.
lm_head linear layer with weight tying to the token embedding matrix.
forward(idx, targets) - Returns logits and optional cross-entropy loss.
generate(idx, max_new_tokens, temperature, top_k) - Autoregressive sampling with optional top-k filtering.
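
One step of the sampling that generate performs can be sketched in plain Python. This is an illustration of temperature scaling and top-k filtering in general, not the repo's implementation (which works on tensors); sample_top_k and its rng parameter are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token ID from a list of logits.

    Lower temperature sharpens the distribution; top_k keeps only the
    k highest-scoring tokens (ties at the cutoff are all kept).
    """
    scaled = [value / temperature for value in logits]

    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Everything below the k-th largest logit gets zero probability.
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]

    # Numerically stable softmax weights; exp(-inf) evaluates to 0.0.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.0, 1.0, 2.0, 3.0, 4.0]
samples = [sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With top_k=2 only the two highest-scoring tokens (IDs 3 and 4 here) can ever be drawn. In autoregressive generation, the sampled ID is appended to the context and the step repeats max_new_tokens times.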

`training/train.py`

Loads preprocessed train.pt, val.pt, and vocab.pt.
Samples random contiguous chunks of length block_size for each batch.
Optimizes with AdamW.
Evaluates periodically over 20 batches on the train and val splits.
Saves checkpoint and prints a generated sample.
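
The batch sampling described above can be sketched in plain Python. The function name get_batch and the use of lists rather than tensors are assumptions for illustration; the key idea is that the target sequence is the input sequence shifted one position to the right.

```python
import random


def get_batch(data, block_size, batch_size, rng=random):
    """Draw `batch_size` random contiguous chunks from `data`.

    Returns (xs, ys), where ys[b][t] is the token that follows xs[b][t],
    so every position in a chunk provides one next-token training example.
    """
    xs, ys = [], []
    for _ in range(batch_size):
        # Highest valid start leaves room for block_size inputs plus one target.
        start = rng.randrange(len(data) - block_size)
        xs.append(data[start : start + block_size])
        ys.append(data[start + 1 : start + block_size + 1])
    return xs, ys


rng = random.Random(0)
data = list(range(1000))  # stand-in for the encoded training text
xs, ys = get_batch(data, block_size=8, batch_size=4, rng=rng)
print(xs[0])
print(ys[0])
```

Because chunks are drawn at random start offsets, an "iteration" here is one batch rather than one pass over the dataset.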

✅ Current status

What is implemented:

Character-level data pipeline (data/prepare.py)
Causal multi-head attention and transformer blocks
Full GPT model with weight tying and text generation
End-to-end training script with checkpointing
Attention visualization notebook with layer-by-layer analysis
Environment checker (check_requirements.py)

Planned / in progress:

01_tokenization.ipynb - subword tokenization with tiktoken
YAML-driven training configs (configs/)
Shell launch scripts (scripts/run_cpu.sh, scripts/run_gpu.sh)
Weights & Biases logging (wandb)
Inference / sampling CLI

🗺️ Learning path

Suggested order for working through the repo:

1. Read models/attention.py - understand one head, then multi-head, then the full block.
2. Read models/gpt.py - see how blocks stack into a language model.
3. Run data/prepare.py - watch character-level encoding in action.
4. Run training/train.py - follow the loss curve and read the generated sample.
5. Open notebooks/02_attention_viz.ipynb - connect the math to what the model actually attends to.

Additional notes live in docs/ (llm_training_roadmap.docx, self_learn.docx).

📄 License

See the LICENSE file in the repository root.

🙏 Acknowledgments

Architecture and training approach inspired by Andrej Karpathy (nanoGPT, makemore, char-rnn).
Dataset: Tiny Shakespeare.
