A modular PyTorch framework for implementing, training, and experimenting with modern Transformer architectures from first principles.
- Encoder-only Transformers (BERT-style)
- Decoder-only GPT-style models
- Preference learning (DPO)
- Reinforcement learning (GRPO)
- Mixed Precision (AMP) training
- Multi-GPU training support
- Modular training engine design
- Config-driven experiments
- Hugging Face dataset integration
This project is designed for deep understanding of LLM internals, not just high-level API usage.
- Project Vision
- Architecture
- Implemented Models
- Quick Start
- 1. Encoder Transformer (BERT-style Classification Model)
- 2. Decoder Transformer (GPT-style Backbone for DPO & GRPO)
- 3. DPO (Direct Preference Optimization)
- 4. GRPO (Group Relative Policy Optimization)
- 5. Evaluation System
- 6. Shared Engineering Features
- Roadmap
This repository follows a progressive learning structure:
Transformer Basics
↓
Encoder Models (BERT-style classification)
↓
Decoder Models (GPT-style generation)
↓
Preference Learning (DPO)
↓
Reinforcement Learning (GRPO)
Goal: Understand how modern LLMs evolve from supervised learning → alignment → RL optimization.
transformer-architect/
│
├── configs/                  # YAML experiment configs
├── models/
│   ├── bert_transformer.py
│   ├── grpo_transformer.py
│   └── dpo_transformer.py
│
├── engines/                  # Load and initialize model and training progress
│   ├── bert_engine.py
│   ├── dpo_engine.py
│   └── grpo_engine.py
│
├── trains/
│   ├── bert_train.py
│   ├── dpo_train.py
│   ├── grpo_train.py
│   └── checkpoint.py
│
├── device.py                 # Device and initialize function
├── data.py                   # Dataset loaders
├── main.py                   # The main file
└── requirements.txt
Encoder (BERT-style) features:
- Encoder-only Transformer
- Masked Language Modeling
- Sequence Classification
- IMDB sentiment training example
DPO features:
- Preference learning
- Policy vs Reference model
- Pairwise ranking loss
GRPO features:
- Group-based reward optimization
- Multiple response sampling
- Relative reward normalization
git clone https://github.com/Esabelle11/transformer-architect.git
cd transformer-architect
pip install -r requirements.txt
python main.py --config configs/bert_classification.yaml
python main.py --config configs/dpo_alignment.yaml
python main.py --config configs/grpo_reasoning.yaml
This model is an encoder-only Transformer for sequence classification.
It is not a full BERT: it has no MLM or NSP pretraining objectives.
Input Tokens
↓
Token Embedding + Position Embedding
↓
Encoder Block × N
↓
Mean Pooling
↓
Linear Classifier
↓
Logits
Each block contains:
- Multi-Head Self Attention
- Residual Connections
- LayerNorm
- FeedForward Network
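As a rough illustration of how these four components fit together, here is a minimal sketch of one encoder block. The class name `EncoderBlock`, the layer sizes, the GELU activation, and the post-LayerNorm ordering are assumptions made for the example, not details taken from `bert_transformer.py`.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Hypothetical encoder block: self-attention and feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attention_mask=None):
        # attention_mask: (batch, seq), 1 for real tokens and 0 for padding.
        # nn.MultiheadAttention expects True where a key must be ignored.
        key_padding_mask = None if attention_mask is None else attention_mask == 0
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask, need_weights=False)
        x = self.norm1(x + attn_out)      # residual connection, then LayerNorm (post-LN, assumed)
        x = self.norm2(x + self.ff(x))    # residual connection around the feed-forward network
        return x

block = EncoderBlock()
x = torch.randn(2, 5, 64)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
out = block(x, mask)  # same shape as the input: (2, 5, 64)
```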
CrossEntropyLoss(logits, labels)
Forward Pass
↓
Compute Loss
↓
Backward Pass
↓
Optimizer Step (AdamW)
↓
Validation Accuracy Tracking
↓
Best Model Checkpoint Saving
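The loop above can be sketched roughly as follows. The `nn.Linear` model and the random tensors are stand-ins for the encoder classifier and the IMDB data, and the best checkpoint is kept in memory here rather than written to disk; none of these details are taken from `bert_train.py`.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)  # stand-in for the encoder classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Random placeholder data (assumption: real code would iterate over DataLoaders).
x_train, y_train = torch.randn(64, 8), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(32, 8), torch.randint(0, 2, (32,))

best_acc, best_state = -1.0, None
for epoch in range(3):
    model.train()
    logits = model(x_train)              # forward pass
    loss = criterion(logits, y_train)    # compute loss
    optimizer.zero_grad()
    loss.backward()                      # backward pass
    optimizer.step()                     # AdamW update

    model.eval()
    with torch.no_grad():                # validation accuracy tracking
        acc = (model(x_val).argmax(dim=-1) == y_val).float().mean().item()
    if acc > best_acc:                   # keep the best model seen so far
        best_acc, best_state = acc, copy.deepcopy(model.state_dict())
```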
- Mean pooling instead of CLS token
- Attention mask support
- Mixed precision training (AMP optional)
- Best checkpoint saving based on validation accuracy
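Mean pooling with attention-mask support can be sketched as below. The function name `masked_mean_pool` is hypothetical; the point is only that padded positions are excluded from the average.

```python
import torch

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real (non-padding) positions only.

    hidden: (batch, seq, dim); attention_mask: (batch, seq) with 1 = token, 0 = padding.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                   # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts

hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])                # third position is padding
pooled = masked_mean_pool(hidden, attention_mask)         # the padded vector is ignored
```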
All linear + embedding layers use:
torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)
This follows GPT/BERT-style initialization for stable training.
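One common way to apply this initialization to every linear and embedding layer is `Module.apply`. The helper name `init_weights` and the zeroing of biases are assumptions for this sketch.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Hypothetical helper: normal(0, 0.02) weights for Linear and Embedding layers.
    if isinstance(m, (nn.Linear, nn.Embedding)):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if isinstance(m, nn.Linear) and m.bias is not None:
            torch.nn.init.zeros_(m.bias)  # assumption: biases start at zero

model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 4))
model.apply(init_weights)  # visits every submodule recursively
```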
This model is used in both:
- DPO
- GRPO
Input Tokens
↓
Token Embedding + Position Embedding
↓
Decoder Block × N
↓
LayerNorm
↓
LM Head
↓
Logits (vocab distribution)
Each block contains:
- Causal Self-Attention (masked)
- FeedForward Network
- Residual Connections
- LayerNorm
Causal Mask (lower triangular)
→ prevents future token leakage
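A lower-triangular mask can be built with `torch.tril` and applied to the attention scores before the softmax. This is a small standalone sketch with dummy scores, not the attention code from the repository.

```python
import torch

seq_len = 4
# True on and below the diagonal: position i may attend to positions 0..i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.zeros(seq_len, seq_len)               # dummy attention scores
scores = scores.masked_fill(~causal, float("-inf"))  # block attention to future tokens
weights = torch.softmax(scores, dim=-1)              # future positions get zero weight
```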
- Autoregressive next-token prediction
- Log-probability extraction for sequences
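Sequence log-probability extraction can be sketched as follows: shift the logits by one position, gather the log-probability of each actual next token, and sum over the sequence. The function name `sequence_log_prob` and the mask convention are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, input_ids, mask):
    """Sum of next-token log-probabilities for each sequence.

    logits: (batch, seq, vocab); input_ids: (batch, seq);
    mask: (batch, seq) with 1 for tokens that should be scored.
    """
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask[:, 1:].to(token_lp.dtype)).sum(dim=-1)

# Uniform logits over a 5-token vocabulary: each of the 3 scored tokens has probability 1/5.
lp = sequence_log_prob(torch.zeros(1, 4, 5), torch.tensor([[0, 1, 2, 3]]), torch.ones(1, 4))
```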
DPO trains a model using:
(Chosen response, Rejected response)
WITHOUT reinforcement learning or reward models.
Prompt
↓
Policy Model (Trainable)
↓
Chosen Log Prob    Rejected Log Prob
↓
Reference Model (Frozen)
↓
Chosen Ref Log Prob    Rejected Ref Log Prob
log π(y|x) = sum(log softmax over tokens)
Δπ = log πθ(chosen) - log πθ(rejected)
Δref = log πref(chosen) - log πref(rejected)
Loss = -log σ(β(Δπ - Δref))
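The loss above translates almost directly into PyTorch. In this sketch the function name `dpo_loss` and the default `beta=0.1` are assumptions; the inputs are per-sequence log-probabilities that have already been summed over tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch of (chosen, rejected) pairs."""
    delta_pi = pol_chosen - pol_rejected    # policy log-ratio
    delta_ref = ref_chosen - ref_rejected   # reference log-ratio
    # logsigmoid is numerically safer than log(sigmoid(x)).
    return -F.logsigmoid(beta * (delta_pi - delta_ref)).mean()

z = torch.zeros(3)
loss_at_init = dpo_loss(z, z, z, z)  # policy equals reference: loss is log(2)
```

When the policy and reference agree, the argument of the sigmoid is zero and the loss equals log 2; it drops below that as the policy favors the chosen response more strongly than the reference does.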
Policy Model + Reference Model
↓
Compute log probabilities
↓
DPO loss
↓
Backpropagation
↓
AdamW update (policy only)
- Frozen reference model
- Gradient only updates policy model
- AMP training support
- Gradient accumulation
- Best checkpoint saving based on validation loss
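A frozen reference model combined with gradient accumulation might look like the sketch below. The `nn.Linear` modules and the squared-error loss are placeholders standing in for the GPT-style backbone and the DPO loss; `accum_steps` is a hypothetical setting.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(4, 4)                     # stand-in for the trainable policy
reference = copy.deepcopy(policy).eval()     # frozen copy of the initial policy
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-3)  # policy parameters only
accum_steps = 4
ref_before = reference.weight.clone()

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 4)
    with torch.no_grad():
        ref_out = reference(x)               # no graph is built for the reference
    loss = (policy(x) - ref_out - 1.0).pow(2).mean()  # placeholder for the DPO loss
    (loss / accum_steps).backward()          # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```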
GRPO trains a model using:
- Multiple sampled outputs per prompt (K rollouts)
- Reward function
- Group-relative advantage (baseline normalization)
Question Prompt
↓
Repeat K times
↓
Policy Model (Sampling)
↓
K Generated Responses
↓
Reward Function
↓
Group Baseline (Mean Reward)
↓
Advantage = Reward - Mean
Current implementation:
Extract number from response
Compare with ground truth
Reward:
- +1 → correct answer
- negative penalty → incorrect or numeric error
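A reward function of this shape can be sketched as below. The function name `numeric_reward`, the choice of the last number in the response as the answer, and the penalty values `-0.5` and `-1.0` are all assumptions; the README only states that incorrect or unparseable answers receive a negative penalty.

```python
import re

def numeric_reward(response: str, ground_truth: float,
                   wrong_penalty: float = -0.5, parse_penalty: float = -1.0) -> float:
    """Hypothetical reward: +1 for a correct number, a negative penalty otherwise."""
    # Drop thousands separators, then collect integers and decimals.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return parse_penalty              # no number could be extracted
    predicted = float(numbers[-1])        # assumption: the last number is the final answer
    return 1.0 if abs(predicted - ground_truth) < 1e-6 else wrong_penalty

rewards = [numeric_reward(r, 42) for r in ["The answer is 42", "I think 41", "no idea"]]
```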
A = r - mean(r)
Subtracting the group mean centers the rewards within each group, which removes the reward offset and reduces the variance of the gradient estimate.
Loss = -(log π(y) × Advantage).mean()
This is essentially:
REINFORCE with a group-mean baseline
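The advantage and loss can be sketched for a batch of prompts with K rollouts each. The function name `grpo_loss` and the `(num_prompts, K)` tensor layout are assumptions for this example.

```python
import torch

def grpo_loss(seq_log_probs, rewards):
    """REINFORCE-style loss with a per-group mean baseline.

    seq_log_probs, rewards: (num_prompts, K), one row per prompt.
    """
    advantages = rewards - rewards.mean(dim=1, keepdim=True)  # A = r - mean(r) within each group
    # Advantages are treated as constants: gradients flow only through the log-probabilities.
    return -(seq_log_probs * advantages.detach()).mean()

rewards = torch.tensor([[1.0, -0.5, -0.5, 1.0]])              # K = 4 rollouts for one prompt
log_probs = torch.tensor([[-2.0, -3.0, -1.0, -4.0]], requires_grad=True)
loss = grpo_loss(log_probs, rewards)
loss.backward()
```

Rollouts with above-average reward receive a negative gradient on the loss with respect to their log-probability, so a gradient descent step raises their likelihood; below-average rollouts are pushed down.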
SFT Warmup
↓
Generate K Rollouts
↓
Compute Rewards
↓
Compute Baseline
↓
Compute Advantages
↓
Policy Gradient Update
- Two-stage training (SFT → GRPO)
- K-rollout sampling per prompt
- No value network
- No PPO clipping
- Optional KL + entropy (disabled in current version)
- AMP support
- Reward-driven learning loop
- Accuracy-based (classification)
Accuracy = correct / total
Sample response
↓
Reward function
↓
Average reward over dataset
- torch.autocast
- GradScaler
Best model saving based on:
- Accuracy (BERT)
- Loss (DPO)
- Reward (GRPO)
- AdamW across all models
This project focuses on understanding:
- Attention mechanisms
- Transformer architecture
- Language model training
- Preference optimization
- Reinforcement learning for LLMs
- Distributed training
- Mixed precision training
- Encoder Transformer
- GPT Decoder
- DPO
- GRPO
- PPO (future)
- LoRA fine-tuning
- Multi-GPU training
- MoE architecture
Most repositories either:
- use Hugging Face without explaining internals, or
- implement only the original Transformer.
This repository bridges the gap by showing how modern Transformer systems evolve from foundational architectures to alignment methods such as DPO and GRPO in a single codebase.