A hands-on learning project for building and training a small GPT-style language model from scratch in PyTorch. The goal is to understand transformers from the inside out: attention, residual blocks, training loops, and what the model is actually doing at each layer.
The project trains a character-level GPT on Tiny Shakespeare, then visualizes attention patterns in Jupyter notebooks.
Input token IDs
↓
Token embeddings ← nn.Embedding
+
Positional embeddings ← nn.Embedding
↓
Dropout
↓
Transformer Block × N
│
├── LayerNorm
├── Multi-head attention
├── Residual connection
├── LayerNorm
├── Feed-forward network
└── Residual connection
↓
Final LayerNorm
↓
Linear projection ← lm_head (weight-tied to token embeddings)
↓
Logits (B, T, vocab_size)
Each transformer block uses pre-norm (LayerNorm before attention and feed-forward), causal multi-head self-attention, and a GELU feed-forward network with a 4× expansion.
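The pre-norm layout described above can be sketched roughly as follows. This is an illustrative stand-in rather than the code in models/attention.py: the class name PreNormBlock is hypothetical, and PyTorch's built-in nn.MultiheadAttention is swapped in for the project's own Head and MultiHeadAttention classes.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Hypothetical pre-norm block; not the repo's TransformerBlock."""

    def __init__(self, n_embd: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        # Assumption: built-in attention replaces the repo's custom heads.
        self.attn = nn.MultiheadAttention(
            n_embd, n_heads, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # 4x hidden expansion
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True above the diagonal means "may not attend to this future position".
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)  # normalize BEFORE attention (pre-norm)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out  # first residual connection
        x = x + self.ff(self.ln2(x))  # normalize, feed-forward, second residual
        return x


block = PreNormBlock(n_embd=192, n_heads=6).eval()  # eval() disables dropout
x = torch.randn(2, 8, 192)  # (B, T, n_embd)
y = block(x)
print(y.shape)
```

Because LayerNorm sits inside each residual branch, the residual stream itself is never normalized until the final LayerNorm before lm_head.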
inside-transformers/
├── models/
│ ├── attention.py # Head, MultiHeadAttention, FeedForward, TransformerBlock
│ └── gpt.py # GPT model (embeddings, blocks, loss, generation)
├── data/
│ ├── prepare.py # Download Shakespeare, build char vocab, save tensors
│ └── input.txt # Raw text (downloaded on first run if missing)
├── training/
│ └── train.py # Training loop, checkpointing, sample generation
├── notebooks/
│ ├── 01_tokenization.ipynb # (placeholder - not yet implemented)
│ └── 02_attention_viz.ipynb # Load checkpoint, hook attention, plot heatmaps
├── configs/
│ ├── base.yaml # (placeholder)
│ ├── tiny_cpu.yaml # (placeholder)
│ └── gpu_3070.yaml # (placeholder)
├── scripts/
│ ├── run_cpu.sh # (placeholder)
│ └── run_gpu.sh # (placeholder)
├── docs/ # Learning notes (Word documents)
├── check_requirements.py # Verify installed packages and CUDA availability
├── requirements.txt
└── LICENSE
Generated artifacts (not committed - see .gitignore):
- data/train.pt, data/val.pt, data/vocab.pt - produced by data/prepare.py
- checkpoints/gpt_shakespeare.pt - produced by training/train.py
- Python 3.10+
- PyTorch 2.0+ (CUDA optional but recommended for training)
- See requirements.txt for the full dependency list
| Package | Purpose |
|---|---|
| torch | Model, training, data tensors |
| numpy | Numerical utilities |
| matplotlib | Attention heatmaps in notebooks |
| jupyter | Interactive exploration |
| tiktoken | Planned for subword tokenization (notebook 01) |
| wandb | Planned for experiment tracking |
| requests | Downloads the dataset |
Clone the repository and install the dependencies:

    git clone https://github.com/daveshenal/inside-transformers.git
    cd inside-transformers
    pip install -r requirements.txt

Verify your environment:

    python check_requirements.py

Prepare the data. The script downloads Tiny Shakespeare (if data/input.txt is missing), builds a character-level vocabulary, and saves train/val splits:

    python data/prepare.py

Expected output:
- ~1.1M characters of Shakespeare text
- 65 unique characters (letters, punctuation, whitespace)
- 90/10 train/val split saved as PyTorch tensors
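As a rough illustration of what character-level preparation involves, here is a minimal sketch in plain Python. The names stoi, itos, encode, and decode are assumptions, not necessarily those used in data/prepare.py. The sample string stands in for the real corpus, and a contiguous head/tail split is assumed.

```python
# Stand-in text; the real script works on the full Tiny Shakespeare file.
text = "To be, or not to be: that is the question."

# Vocabulary: every distinct character, in a stable sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character


def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)

# Assumed split: first 90% for training, last 10% for validation.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print(len(chars), len(train_ids), len(val_ids))
```

On the full dataset the same steps would give the 65-character vocabulary listed above. The project stores the encoded splits as PyTorch tensors rather than Python lists.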
Then train the model:

    python training/train.py

Training uses hard-coded hyperparameters in training/train.py (YAML configs are not wired up yet). Defaults:
| Setting | Value |
|---|---|
| block_size | 128 |
| batch_size | 64 |
| max_iters | 5000 |
| eval_every | 200 |
| lr | 3e-4 |
| n_embd | 192 |
| n_heads | 6 |
| n_layers | 4 |
| dropout | 0.1 |
The model has ~1.8M parameters. On CPU this takes a while; a CUDA GPU is much faster.
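The ~1.8M figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below counts only the large weight matrices and ignores biases and LayerNorm parameters, so it is an estimate rather than the exact count the training script would report. It also assumes the 65-character vocabulary from the data step.

```python
vocab_size, block_size = 65, 128
n_embd, n_layers = 192, 4

# Attention: query, key, value, and output projections, each n_embd x n_embd.
attn = 4 * n_embd * n_embd

# Feed-forward: n_embd -> 4*n_embd -> n_embd.
ff = 2 * n_embd * (4 * n_embd)

per_block = attn + ff

# Token + positional embeddings. lm_head is weight-tied to the token
# embedding, so it adds no parameters of its own.
embeddings = vocab_size * n_embd + block_size * n_embd

total = n_layers * per_block + embeddings
print(f"{total:,}")  # 1,806,528
```

Nearly all of the parameters sit in the transformer blocks. With such a small vocabulary, the embeddings contribute only about 2% of the total.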
During training you will see periodic train and val cross-entropy loss. When training finishes:
- A checkpoint is saved to checkpoints/gpt_shakespeare.pt (model weights + config dict).
- A 500-token sample is generated with temperature 0.8 and top_k=40.
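To make the temperature and top_k settings concrete, here is a small sketch of one sampling step in plain Python. The function name sample_next is hypothetical, and the project's generate method operates on PyTorch tensors instead. The logic is the standard recipe: divide logits by the temperature, keep only the k largest, apply softmax, then sample.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token index from a list of raw logits."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [l / temperature for l in logits]

    # Top-k: mask everything below the k-th largest logit.
    # (Ties at the cutoff are kept, so slightly more than k may survive.)
    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]

    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [0.0, 1.0, 2.0, 3.0, 4.0]
draws = {sample_next(logits, top_k=2, rng=rng) for _ in range(200)}
print(draws)  # only indices 3 and 4 can ever be drawn
```

With top_k=40 and a 65-character vocabulary, roughly the 25 least likely characters are excluded at every step.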
After training, open the visualization notebook:

    jupyter notebook notebooks/02_attention_viz.ipynb

The notebook:
- Loads the checkpoint and vocabulary.
- Registers forward hooks on each layer's multi-head attention.
- Runs a forward pass on a sample sentence.
- Plots per-head attention heatmaps for each transformer layer.
- Includes written observations about how attention patterns evolve across layers.
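The hook-based capture in the steps above might look something like the following sketch. ToyAttention is a hypothetical single-head module that returns its attention weights alongside its output. How the repo's MultiHeadAttention exposes its weights is not shown here, so treat the hook body as an assumption to adapt.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical causal attention returning (output, weights)."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        T = x.size(1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        weights = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        return weights @ v, weights


captured = {}


def make_hook(name):
    # Forward hooks receive (module, inputs, output) after forward() runs.
    def hook(module, inputs, output):
        captured[name] = output[1].detach()  # keep only the weights
    return hook


layers = nn.ModuleList([ToyAttention(16) for _ in range(2)])
handles = [
    layer.register_forward_hook(make_hook(f"layer{i}"))
    for i, layer in enumerate(layers)
]

x = torch.randn(1, 5, 16)  # (B, T, n_embd)
with torch.no_grad():
    for layer in layers:
        x, _ = layer(x)

for h in handles:
    h.remove()  # detach hooks once the weights are captured

print({name: tuple(w.shape) for name, w in captured.items()})
```

Each captured tensor is a (B, T, T) matrix that could be passed to matplotlib's imshow to draw one heatmap per layer.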
models/attention.py:
- Head - Single causal self-attention head with scaled dot-product attention and a lower-triangular mask buffer.
- MultiHeadAttention - Runs n_heads heads in parallel, concatenates, and projects.
- FeedForward - Two linear layers with GELU and 4× hidden expansion.
- TransformerBlock - Pre-norm residual block: x + attn(ln1(x)), then x + ff(ln2(x)).
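To see what a single causal head computes, here is a dependency-free sketch of scaled dot-product attention over plain Python lists. It is not the repo's Head class: instead of a lower-triangular mask buffer it simply loops over the allowed positions, which gives the same result. The function name causal_attention is hypothetical.

```python
import math


def causal_attention(q, k, v):
    """q, k, v: lists of T vectors (lists of floats). Returns (out, weights)."""
    T, d = len(q), len(q[0])
    weights, out = [], []
    for i in range(T):
        # Position i may only attend to positions 0..i (the causal constraint).
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Future positions get exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (T - i - 1)
        weights.append(row)
        # Output is the attention-weighted average of the value vectors.
        out.append(
            [sum(row[j] * v[j][c] for j in range(T)) for c in range(len(v[0]))]
        )
    return out, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)

for row in weights:
    print([round(w, 3) for w in row])
```

The first row is always [1.0, 0.0, 0.0], since the first token can only attend to itself. This is the lower-triangular pattern that shows up in the attention heatmaps.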
models/gpt.py:
- Token and learned positional embeddings are summed, then passed through dropout.
- Stack of TransformerBlock modules followed by a final LayerNorm.
- lm_head linear layer with weight tying to the token embedding matrix.
- forward(idx, targets) - Returns logits and optional cross-entropy loss.
- generate(idx, max_new_tokens, temperature, top_k) - Autoregressive sampling with optional top-k filtering.
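Weight tying and the loss computation can each be shown in a few lines. The following is a minimal sketch under a few assumptions, not an excerpt from models/gpt.py: the variable names are illustrative, a bias-free lm_head is assumed, and random tensors stand in for the output of the transformer stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 65, 192

tok_emb = nn.Embedding(vocab_size, n_embd)           # weight: (vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # weight: (vocab_size, n_embd)

# Weight tying: both modules now share a single Parameter object.
lm_head.weight = tok_emb.weight

# Stand-in for the hidden states after the final LayerNorm: (B, T, n_embd).
h = torch.randn(2, 8, n_embd)
logits = lm_head(h)  # (B, T, vocab_size)

# Cross-entropy expects (N, C) logits and (N,) targets, so flatten B and T.
targets = torch.randint(0, vocab_size, (2, 8))
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

print(logits.shape, lm_head.weight is tok_emb.weight, loss.item())
```

The tying works without a transpose because nn.Linear stores its weight as (out_features, in_features), the same shape as the embedding table.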
training/train.py:
- Loads preprocessed train.pt, val.pt, and vocab.pt.
- Samples random contiguous chunks of length block_size for each batch.
- AdamW optimizer.
- Periodic evaluation over 20 batches on train and val splits.
- Saves checkpoint and prints a generated sample.
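The "random contiguous chunks" step is sketched below with plain lists. The function name get_batch and its signature are assumptions; the training script works on PyTorch tensors. The sketch shows how the targets are simply the inputs shifted one position to the right.

```python
import random


def get_batch(data, block_size, batch_size, rng=random):
    """Return (inputs, targets): batch_size chunks of block_size tokens each."""
    xs, ys = [], []
    for _ in range(batch_size):
        # Leave room for the one-token shift at the end of the chunk.
        start = rng.randrange(len(data) - block_size)
        xs.append(data[start : start + block_size])
        ys.append(data[start + 1 : start + block_size + 1])  # next-token targets
    return xs, ys


data = list(range(1000))  # stand-in for the encoded training tokens
xb, yb = get_batch(data, block_size=8, batch_size=4, rng=random.Random(0))

for x, y in zip(xb, yb):
    print(x, "->", y)
```

Every position in a chunk yields one training example, so a single chunk of length block_size trains the model on contexts of every length from 1 to block_size.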
What is implemented:
- Character-level data pipeline (data/prepare.py)
- Causal multi-head attention and transformer blocks
- Full GPT model with weight tying and text generation
- End-to-end training script with checkpointing
- Attention visualization notebook with layer-by-layer analysis
- Environment checker (check_requirements.py)
Planned / in progress:
- 01_tokenization.ipynb - subword tokenization with tiktoken
- YAML-driven training configs (configs/)
- Shell launch scripts (scripts/run_cpu.sh, scripts/run_gpu.sh)
- Weights & Biases logging (wandb)
- Inference / sampling CLI
Suggested order for working through the repo:
- Read models/attention.py - understand one head, then multi-head, then the full block.
- Read models/gpt.py - see how blocks stack into a language model.
- Run data/prepare.py - watch character-level encoding in action.
- Run training/train.py - follow the loss curve and read the generated sample.
- Open notebooks/02_attention_viz.ipynb - connect the math to what the model actually attends to.
Additional notes live in docs/ (llm_training_roadmap.docx, self_learn.docx).
MIT License - Copyright (c) 2026 Dave Perera. See LICENSE.
- Architecture and training approach inspired by Andrej Karpathy (nanoGPT, makemore, char-rnn).
- Dataset: Tiny Shakespeare.
