Inside Transformers

A hands-on learning project for building and training a small GPT-style language model from scratch in PyTorch. The goal is to understand transformers from the inside out: attention, residual blocks, training loops, and what the model is actually doing at each layer.

The project trains a character-level GPT on Tiny Shakespeare, then visualizes its attention patterns in a Jupyter notebook.


🏗️ Architecture

Input token IDs
    ↓
Token embeddings          ← nn.Embedding
    +
Positional embeddings     ← nn.Embedding
    ↓
Dropout
    ↓
Transformer Block × N
    │
    ├── LayerNorm
    ├── Multi-head attention
    ├── Residual connection
    ├── LayerNorm
    ├── Feed-forward network
    └── Residual connection
    ↓
Final LayerNorm
    ↓
Linear projection         ← lm_head (weight-tied to token embeddings)
    ↓
Logits (B, T, vocab_size)

Each transformer block uses pre-norm (LayerNorm before attention and feed-forward), causal multi-head self-attention, and a GELU feed-forward network with a 4× expansion.
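As a rough illustration of the pre-norm layout described above, here is a minimal sketch of one block. It is not the repo's `TransformerBlock`: the class name `PreNormBlock` is hypothetical, and it leans on PyTorch's built-in `nn.MultiheadAttention` rather than the hand-written heads in models/attention.py.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Illustrative pre-norm block (hypothetical name, not the repo's class)."""

    def __init__(self, n_embd: int, n_heads: int, dropout: float = 0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        # Assumption: built-in attention stands in for the repo's own heads.
        self.attn = nn.MultiheadAttention(
            n_embd, n_heads, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # 4x hidden expansion
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True above the diagonal marks future positions that must be hidden.
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)  # normalize BEFORE attention (pre-norm)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out  # first residual connection
        x = x + self.ff(self.ln2(x))  # normalize, feed-forward, second residual
        return x


block = PreNormBlock(n_embd=192, n_heads=6)
x = torch.randn(2, 16, 192)  # (B, T, n_embd)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 192])
```

Because the residual path is never normalized, the input `x` flows through the block unchanged apart from the two additions, which is the main practical difference from a post-norm layout.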


📂 Project structure

inside-transformers/
├── models/
│   ├── attention.py      # Head, MultiHeadAttention, FeedForward, TransformerBlock
│   └── gpt.py            # GPT model (embeddings, blocks, loss, generation)
├── data/
│   ├── prepare.py        # Download Shakespeare, build char vocab, save tensors
│   └── input.txt         # Raw text (downloaded on first run if missing)
├── training/
│   └── train.py          # Training loop, checkpointing, sample generation
├── notebooks/
│   ├── 01_tokenization.ipynb   # (placeholder - not yet implemented)
│   └── 02_attention_viz.ipynb  # Load checkpoint, hook attention, plot heatmaps
├── configs/
│   ├── base.yaml         # (placeholder)
│   ├── tiny_cpu.yaml     # (placeholder)
│   └── gpu_3070.yaml     # (placeholder)
├── scripts/
│   ├── run_cpu.sh        # (placeholder)
│   └── run_gpu.sh        # (placeholder)
├── docs/                 # Learning notes (Word documents)
├── check_requirements.py # Verify installed packages and CUDA availability
├── requirements.txt
└── LICENSE

Generated artifacts (not committed - see .gitignore):

  • data/train.pt, data/val.pt, data/vocab.pt - produced by data/prepare.py
  • checkpoints/gpt_shakespeare.pt - produced by training/train.py

⚙️ Requirements

  • Python 3.10+
  • PyTorch 2.0+ (CUDA optional but recommended for training)
  • See requirements.txt for the full dependency list

Package      Purpose
torch        Model, training, data tensors
numpy        Numerical utilities
matplotlib   Attention heatmaps in notebooks
jupyter      Interactive exploration
tiktoken     Planned for subword tokenization (notebook 01)
wandb        Planned for experiment tracking
requests     Download the dataset from the internet

🚀 Quick start

1. Clone and install

git clone https://github.com/daveshenal/inside-transformers.git
cd inside-transformers
pip install -r requirements.txt

Verify your environment:

python check_requirements.py

2. Prepare data

The script downloads Tiny Shakespeare (if data/input.txt is missing), builds a character-level vocabulary, and saves train/val splits:

python data/prepare.py

Expected output:

  • ~1.1M characters of Shakespeare text
  • 65 unique characters (letters, punctuation, whitespace)
  • 90/10 train/val split saved as PyTorch tensors
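The character-level encoding and 90/10 split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in data/prepare.py: the names `stoi`, `itos`, `encode`, and `decode` are assumptions, and a short string stands in for the full text.

```python
# Stand-in for the contents of data/input.txt (assumption: any string works).
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Vocabulary: every distinct character, in sorted order, gets an integer id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> character


def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]


def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)


ids = encode(text)

# First 90% of the ids for training, the remaining 10% for validation.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print(len(chars), len(train_ids), len(val_ids))
```

The real script additionally converts the id lists to PyTorch tensors before saving them; that step is omitted here to keep the sketch dependency-free.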

3. Train

python training/train.py

Training uses hard-coded hyperparameters in training/train.py (YAML configs are not wired up yet). Defaults:

Setting      Value
block_size   128
batch_size   64
max_iters    5000
eval_every   200
lr           3e-4
n_embd       192
n_heads      6
n_layers     4
dropout      0.1

The model has ~1.8M parameters. On CPU this takes a while; a CUDA GPU is much faster.
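The ~1.8M figure can be roughly reproduced from the defaults above. The arithmetic below is a sketch under stated assumptions: bias-free query/key/value projections, biases on every other linear layer, and a tied lm_head that adds no parameters of its own. The repo's actual layers may differ slightly.

```python
vocab_size, block_size = 65, 128
n_embd, n_layers = 192, 4


def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * n_embd  # scale and shift vectors

# Assumption: q, k, v have no bias; the output projection does.
# Splitting into heads does not change the total size of q, k, v.
attention = 3 * linear(n_embd, n_embd, bias=False) + linear(n_embd, n_embd)
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = 2 * layer_norm + attention + feed_forward

# Token + positional embeddings; lm_head reuses the token embedding matrix.
embeddings = vocab_size * n_embd + block_size * n_embd

total = embeddings + n_layers * block + layer_norm  # plus the final LayerNorm
print(f"{total:,}")  # 1,814,592
```

Almost all of the parameters sit in the four transformer blocks; the embeddings contribute only about 37K because the character vocabulary is so small.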

During training, the script prints the train and validation cross-entropy loss every eval_every iterations (200 by default). When training finishes:

  1. A checkpoint is saved to checkpoints/gpt_shakespeare.pt (model weights + config dict).
  2. A 500-token sample is generated with temperature 0.8 and top_k=40.

4. Explore attention

After training, open the visualization notebook:

jupyter notebook notebooks/02_attention_viz.ipynb

The notebook:

  1. Loads the checkpoint and vocabulary.
  2. Registers forward hooks on each layer's multi-head attention.
  3. Runs a forward pass on a sample sentence.
  4. Plots per-head attention heatmaps for each transformer layer.
  5. Includes written observations about how attention patterns evolve across layers.
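The hook-based capture in steps 2 and 3 can be sketched as follows. This is an assumption-laden illustration rather than the notebook's code: it hooks PyTorch's built-in `nn.MultiheadAttention` (which returns its attention weights directly), and the names `captured` and `save_attention` are hypothetical.

```python
import torch
import torch.nn as nn

captured = {}  # layer name -> attention weights


def save_attention(name):
    def hook(module, args, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights).
        captured[name] = output[1].detach()
    return hook


# Stand-in for one layer's attention module (assumption, not the repo's class).
attn = nn.MultiheadAttention(embed_dim=192, num_heads=6, batch_first=True)
handle = attn.register_forward_hook(save_attention("layer0"))

T = 10
x = torch.randn(1, T, 192)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    # average_attn_weights=False keeps one (T, T) map per head.
    attn(x, x, x, attn_mask=causal_mask,
         need_weights=True, average_attn_weights=False)

handle.remove()  # detach the hook once the weights are captured

weights = captured["layer0"]
print(weights.shape)  # torch.Size([1, 6, 10, 10]) = (batch, heads, T, T)
```

Each `weights[0, h]` is a lower-triangular matrix whose rows sum to 1, so it can be passed straight to `matplotlib.pyplot.imshow` to draw one heatmap per head.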

🔬 Model details

models/attention.py

  • Head - Single causal self-attention head with scaled dot-product attention and a lower-triangular mask buffer.
  • MultiHeadAttention - Runs n_heads heads in parallel, concatenates, and projects.
  • FeedForward - Two linear layers with GELU and 4× hidden expansion.
  • TransformerBlock - Pre-norm residual block: x + attn(ln1(x)), then x + ff(ln2(x)).
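To make the first bullet concrete, here is a minimal sketch of a single causal head with scaled dot-product attention and a lower-triangular mask buffer. The class name `CausalHead`, the bias-free projections, and returning the weights alongside the output are assumptions made for illustration; the repo's `Head` may be organized differently.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHead(nn.Module):
    """Illustrative single attention head (hypothetical, not the repo's Head)."""

    def __init__(self, n_embd: int, head_size: int, block_size: int):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Buffer: saved with the module and moved by .to(device), but not trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)  # each (B, T, head_size)
        # Scaled dot-product scores: (B, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        # Hide future positions so token t only sees tokens 0..t.
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights


head = CausalHead(n_embd=192, head_size=32, block_size=128)
out, w = head(torch.randn(2, 16, 192))
print(out.shape, w.shape)  # torch.Size([2, 16, 32]) torch.Size([2, 16, 16])
```

With the defaults from the training section (n_embd 192, n_heads 6), each head works in a 32-dimensional subspace, and concatenating six such outputs restores the 192-wide representation before the output projection.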

models/gpt.py

  • Token embeddings and learned positional embeddings are summed, then passed through dropout.
  • Stack of TransformerBlock modules followed by final LayerNorm.
  • lm_head linear layer with weight tying to the token embedding matrix.
  • forward(idx, targets) - Returns logits and optional cross-entropy loss.
  • generate(idx, max_new_tokens, temperature, top_k) - Autoregressive sampling with optional top-k filtering.
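One sampling step with temperature and top-k filtering might look like the sketch below. The helper `sample_next` is hypothetical and only covers the logits-to-token step; in a full `generate` loop it would be applied to the last position's logits and the result appended to `idx` on each iteration.

```python
import torch
import torch.nn.functional as F


def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k=40):
    """Sample one token id per row. logits: (B, vocab_size). Returns (B, 1)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = logits / temperature
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # Value of the k-th largest logit in each row, shape (B, 1).
        kth_value = torch.topk(logits, k).values[:, [-1]]
        # Everything below that threshold gets zero probability.
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)


torch.manual_seed(0)
logits = torch.randn(4, 65)  # stand-in for model output at the last position
next_ids = sample_next(logits, temperature=0.8, top_k=5)
print(next_ids.shape)  # torch.Size([4, 1])
```

With a 65-character vocabulary, top_k=40 removes only the 25 least likely characters, so most of the shaping in the repo's default sample comes from the temperature.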

training/train.py

  • Loads preprocessed train.pt, val.pt, and vocab.pt.
  • Random contiguous chunks of length block_size for each batch.
  • AdamW optimizer.
  • Periodic evaluation over 20 batches on train and val splits.
  • Saves checkpoint and prints a generated sample.
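The batching scheme in the second bullet can be sketched as below. The function name `get_batch` and its signature are assumptions; the point is that the targets are the inputs shifted one position to the right, so every position in a chunk yields a next-character training example.

```python
import torch


def get_batch(data: torch.Tensor, block_size: int, batch_size: int):
    """Sample random contiguous chunks and their next-token targets."""
    # Each sample needs block_size + 1 tokens, so the last valid start
    # index is len(data) - block_size - 1 (randint's upper bound is exclusive).
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + 1 + block_size] for s in starts])
    return x, y


data = torch.arange(1000)  # stand-in for the tensor saved in data/train.pt
x, y = get_batch(data, block_size=128, batch_size=64)
print(x.shape, y.shape)  # torch.Size([64, 128]) torch.Size([64, 128])
```

Sampling start offsets at random means batches overlap and an "epoch" is never defined, which is why training length is expressed as max_iters rather than a number of passes over the data.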

✅ Current status

What is implemented:

  • Character-level data pipeline (data/prepare.py)
  • Causal multi-head attention and transformer blocks
  • Full GPT model with weight tying and text generation
  • End-to-end training script with checkpointing
  • Attention visualization notebook with layer-by-layer analysis
  • Environment checker (check_requirements.py)

Planned / in progress:

  • 01_tokenization.ipynb - subword tokenization with tiktoken
  • YAML-driven training configs (configs/)
  • Shell launch scripts (scripts/run_cpu.sh, scripts/run_gpu.sh)
  • Weights & Biases logging (wandb)
  • Inference / sampling CLI

🗺️ Learning path

Suggested order for working through the repo:

  1. Read models/attention.py - understand one head, then multi-head, then the full block.
  2. Read models/gpt.py - see how blocks stack into a language model.
  3. Run data/prepare.py - watch character-level encoding in action.
  4. Run training/train.py - follow the loss curve and read the generated sample.
  5. Open notebooks/02_attention_viz.ipynb - connect math to what the model actually attends to.

Additional notes live in docs/ (llm_training_roadmap.docx, self_learn.docx).


📄 License

MIT License - Copyright (c) 2026 Dave Perera. See LICENSE.

🙏 Acknowledgments
