GitHub - Esabelle11/transformer-architect: A unified, production-grade LLM training and alignment framework. Features modular implementations for downstream tasks BERT, DPO, and GRPO to streamline model alignment and optimize memory overhead.

🚀 Transformer Architect

A modular PyTorch framework for implementing, training, and experimenting with modern Transformer architectures from first principles.

Highlights

Encoder-only Transformers (BERT-style)
Decoder-only GPT-style models
Preference learning (DPO)
Reinforcement learning (GRPO)
Mixed Precision (AMP) training
Multi-GPU training support
Modular training engine design
Config-driven experiments
Hugging Face dataset integration

This project is designed for deep understanding of LLM internals, not just high-level API usage.

📚 Table of Contents

Project Vision
Architecture
Implemented Models
Quick Start
1. Encoder Transformer (BERT-style Classification Model)
2. Decoder Transformer (GPT-style Backbone for DPO & GRPO)
3. DPO (Direct Preference Optimization)
4. GRPO (Group Relative Policy Optimization)
5. Evaluation System
6. Shared Engineering Features
Roadmap

📖 Project Vision

This repository follows a progressive learning structure:

Transformer Basics
      ↓
Encoder Models (BERT-style classification)
      ↓
Decoder Models (GPT-style generation)
      ↓
Preference Learning (DPO)
      ↓
Reinforcement Learning (GRPO)

👉 Goal: Understand how modern LLMs evolve from supervised learning → alignment → RL optimization.

🏗️ Architecture

transformer-architect/
│
├── configs/          # YAML experiment configs
├── models/
│   ├── bert_transformer.py
│   ├── grpo_transformer.py
│   └── dpo_transformer.py
│
├── engines/         # Load and initialize model and training progress
│   ├── bert_engine.py
│   ├── dpo_engine.py
│   └── grpo_engine.py
│
├── trains/
│   ├── bert_train.py
│   ├── dpo_train.py
│   ├── grpo_train.py
│   └── checkpoint.py
│
├── device.py      # Device selection and initialization helpers
├── data.py        # Dataset loaders
├── main.py        # Entry point (run with --config)
└── requirements.txt

🧠 Implemented Models

BERT

Features:

Encoder-only Transformer
Masked Language Modeling
Sequence Classification
IMDB sentiment training example

DPO

Features:

Preference learning
Policy vs Reference model
Pairwise ranking loss

GRPO

Features:

Group-based reward optimization
Multiple response sampling
Relative reward normalization

⚡ Quick Start

Installation

git clone https://github.com/Esabelle11/transformer-architect.git

cd transformer-architect

pip install -r requirements.txt

Train BERT

python main.py --config configs/bert_classification.yaml

Train DPO

python main.py --config configs/dpo_alignment.yaml

Train GRPO

python main.py --config configs/grpo_reasoning.yaml

🧠 1. Encoder Transformer (BERT-style Classification Model)

📌 Architecture

This model is an encoder-only Transformer for sequence classification.

It is not a full BERT: this classifier is trained without masked language modeling (MLM) or next sentence prediction (NSP).

Model Flow

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
Encoder Block × N
     ↓
Mean Pooling
     ↓
Linear Classifier
     ↓
Logits

🔧 Encoder Block Structure

Each block contains:

Multi-Head Self Attention
Residual Connections
LayerNorm
FeedForward Network

⚙️ Training Objective

CrossEntropyLoss(logits, labels)
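
To make the objective concrete, here is a minimal sketch of the same quantity in plain Python. It is illustrative only: the function names are hypothetical, and the repository itself presumably calls PyTorch's `CrossEntropyLoss` on tensors. The batch mean mirrors that loss's default reduction.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def batch_cross_entropy(batch_logits, labels):
    """Mean loss over a batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Uniform logits over 2 classes give log(2), about 0.6931
uniform_loss = cross_entropy([0.0, 0.0], 1)
# A confident, correct prediction gives a much smaller loss
confident_loss = cross_entropy([5.0, 0.0], 0)
```

A loss near log(number of classes) early in training is what an untrained classifier should produce.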

⚡ Training Loop

Forward Pass
   ↓
Compute Loss
   ↓
Backward Pass
   ↓
Optimizer Step (AdamW)
   ↓
Validation Accuracy Tracking
   ↓
Best Model Checkpoint Saving

🧪 Key Implementation Details

Mean pooling instead of CLS token
Attention mask support
Mixed precision training (AMP optional)
Best checkpoint saving based on validation accuracy
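
The interaction between mean pooling and the attention mask can be sketched as follows. This is a plain-Python illustration with a hypothetical helper name, not the repository's code. It assumes the mask marks real tokens with 1 and padding with 0, so padded positions do not dilute the sentence vector.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1.

    hidden: list of token vectors, shape [seq_len][dim]
    mask:   list of 0/1 flags, shape [seq_len]
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    count = max(count, 1)  # avoid division by zero on an all-padding row
    return [t / count for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last token is padding
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average toward `[9.0, 9.0]`, and the result would depend on how much padding a batch happens to contain.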

🎯 Weight Initialization

All linear + embedding layers use:

torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)

This follows GPT/BERT-style initialization for stable training.

🧠 2. Decoder Transformer (GPT-style Backbone for DPO & GRPO)

This model is used in both:

DPO
GRPO

📌 Architecture

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
Decoder Block × N
     ↓
LayerNorm
     ↓
LM Head
     ↓
Logits (vocab distribution)

🔧 Decoder Block Structure

Each block contains:

Causal Self-Attention (masked)
FeedForward Network
Residual Connections
LayerNorm

Important Design Choice

Causal Mask (lower triangular)
→ prevents future token leakage
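
A small sketch of what the lower-triangular mask does to attention weights. It is written in plain Python so it runs without dependencies, and the helper names are hypothetical. Row `i` of the mask lists the positions token `i` may attend to.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed positions forced to 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    m = max(kept)  # stability shift, computed over allowed positions only
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    z = sum(exps)
    return [e / z for e in exps]

mask = causal_mask(3)
# With equal scores, each token spreads attention evenly over itself and the past
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
```

Every entry above the diagonal is exactly zero, so no information from later tokens can reach an earlier position.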

⚙️ Training Objective (Used in DPO / GRPO)

Autoregressive next-token prediction
Log-probability extraction for sequences

🤖 3. DPO (Direct Preference Optimization)

📌 Concept

DPO trains a model using:

(Chosen response, Rejected response)

without a reinforcement learning loop or a separately trained reward model.

🧠 DPO Architecture Flow

Prompt
  ↓
Policy Model (Trainable)
  ↓
Chosen Log Prob   Rejected Log Prob
  ↓
Reference Model (Frozen)
  ↓
Chosen Ref Log Prob   Rejected Ref Log Prob

🔢 Key Computation

Sequence Log Probability

log π(y|x) = Σ_t log π(y_t | x, y_<t)    (sum of per-token log-softmax values over the tokens of y)
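
The sequence log-probability can be sketched in plain Python as below. The function names and the `response_mask` argument are assumptions for illustration: the mask is one common way to exclude prompt and padding positions so that only response tokens contribute to the sum.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sequence_log_prob(step_logits, target_ids, response_mask):
    """Sum the log-probability of each target token where the mask is 1.

    step_logits:   [seq_len][vocab] logits predicting each target token
    target_ids:    [seq_len] token ids the model should have produced
    response_mask: [seq_len] 1 for response tokens, 0 for prompt/padding
    """
    total = 0.0
    for logits, tok, keep in zip(step_logits, target_ids, response_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total

# Toy vocabulary of size 2 with uniform logits: each token has probability 0.5
step_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
seq_lp = sequence_log_prob(step_logits, [1, 0, 1], [0, 1, 1])  # 2 * log(0.5)
```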

DPO Loss

Δπ = logπθ(chosen) - logπθ(rejected)

Δref = logπref(chosen) - logπref(rejected)

Loss = -log σ(β(Δπ - Δref))
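
The loss above can be evaluated for a single preference pair as in this sketch. It is plain Python with a hypothetical function name, and `beta=0.1` is an assumed default, not a value taken from the repository. The inputs are sequence log-probabilities from the policy and the frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (delta_pi - delta_ref)) for one preference pair."""
    delta_pi = pi_chosen - pi_rejected
    delta_ref = ref_chosen - ref_rejected
    margin = beta * (delta_pi - delta_ref)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log(2), about 0.6931
start_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy prefers the chosen response more strongly than the reference does
better_loss = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Because the policy usually starts as a copy of the reference, a first-step loss close to log(2) is a useful sanity check.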

⚙️ Training Flow

Policy Model + Reference Model
        ↓
Compute log probabilities
        ↓
DPO loss
        ↓
Backpropagation
        ↓
AdamW update (policy only)

🧪 Key Features

Frozen reference model
Gradient only updates policy model
AMP training support
Gradient accumulation
Best checkpoint saving based on validation loss

🎯 4. GRPO (Group Relative Policy Optimization)

📌 Concept

GRPO trains a model using:

Multiple sampled outputs per prompt (K rollouts)
Reward function
Group-relative advantage (baseline normalization)

🧠 Architecture Flow

Question Prompt
      ↓
Repeat K times
      ↓
Policy Model (Sampling)
      ↓
K Generated Responses
      ↓
Reward Function
      ↓
Group Baseline (Mean Reward)
      ↓
Advantage = Reward - Mean

⚙️ Reward Function

The current implementation:

Extract number from response
Compare with ground truth

Reward:

+1 → correct answer
negative penalty → incorrect or numeric error
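
A reward of this shape could look like the sketch below. Several details are assumptions made for illustration, since the README does not specify them: the regular expression, the choice of the last number in the response, the penalty value of -1.0, and the tolerance.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def reward(response, ground_truth, penalty=-1.0, tol=1e-6):
    """+1 for a correct final number, `penalty` otherwise (hypothetical values)."""
    matches = NUMBER.findall(response)
    if not matches:
        return penalty  # no number could be extracted
    predicted = float(matches[-1])  # assumption: the last number is the answer
    return 1.0 if abs(predicted - ground_truth) <= tol else penalty

r_correct = reward("3 + 4 = 7, so the answer is 7.", 7)   # 1.0
r_wrong = reward("The answer is 8", 7)                     # -1.0
r_missing = reward("I am not sure.", 7)                    # -1.0
```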

📊 Group Advantage

A = r - mean(r)

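Subtracting the group mean acts as a baseline: it removes the reward offset shared by all K responses to a prompt and reduces gradient variance. It does not rescale rewards, since there is no division by the standard deviation.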
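
A worked example of the advantage computation for one prompt with K = 4 rollouts. The function name is hypothetical; the arithmetic is exactly A = r - mean(r).

```python
def group_advantages(rewards):
    """Subtract the group's mean reward from each rollout's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Three correct rollouts (+1) and one incorrect (-1): mean reward is 0.5
adv = group_advantages([1.0, 1.0, -1.0, 1.0])  # [0.5, 0.5, -1.5, 0.5]

# If every rollout gets the same reward, all advantages are zero
flat = group_advantages([1.0, 1.0, 1.0, 1.0])  # [0.0, 0.0, 0.0, 0.0]
```

The second case shows a property worth keeping in mind: a prompt whose rollouts are all correct or all incorrect contributes no gradient signal.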

🔢 GRPO Objective

Loss = -(logπ(y) × Advantage).mean()

This is essentially:

REINFORCE with a group-mean baseline
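
The objective can be evaluated numerically as in this sketch. It is plain Python with a hypothetical function name. It treats the advantages as constants, which is the usual convention in policy-gradient methods; whether the repository detaches them explicitly is not stated here.

```python
def grpo_loss(seq_log_probs, advantages):
    """-(log pi(y) * A).mean() over the rollouts in a batch."""
    terms = [lp * a for lp, a in zip(seq_log_probs, advantages)]
    return -sum(terms) / len(terms)

# Two rollouts: the first beat the group mean, the second fell below it
loss = grpo_loss([-2.0, -4.0], [0.5, -0.5])  # -((-1.0) + 2.0) / 2 = -0.5
```

Minimizing this loss raises the log-probability of rollouts with positive advantage and lowers it for rollouts with negative advantage.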

⚙️ Training Pipeline

SFT Warmup
    ↓
Generate K Rollouts
    ↓
Compute Rewards
    ↓
Compute Baseline
    ↓
Compute Advantages
    ↓
Policy Gradient Update

🧪 Key Features

Two-stage training (SFT → GRPO)
K-rollout sampling per prompt
No value network
No PPO clipping
Optional KL penalty and entropy bonus (both disabled in the current version)
AMP support
Reward-driven learning loop

📊 5. Evaluation System

BERT Evaluation

Accuracy-based (classification)

Accuracy = correct / total
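
As a concrete illustration, accuracy can be computed from classifier logits by taking the argmax per example. This is a plain-Python sketch with a hypothetical function name, not the repository's evaluation code.

```python
def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)

batch_logits = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]]
labels = [0, 1, 1]
acc = accuracy(batch_logits, labels)  # 2 of 3 correct
```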

GRPO Evaluation

Sample response
   ↓
Reward function
   ↓
Average reward over dataset

⚡ 6. Shared Engineering Features

🧠 Mixed Precision Training

torch.autocast
GradScaler

💾 Checkpointing

Best model saving based on:
- Validation accuracy (BERT)
- Validation loss (DPO)
- Reward (GRPO)
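
Since the three trainers track different metrics, and loss improves downward while accuracy and reward improve upward, the "best so far" logic can be sketched with one small helper. The class name and interface are hypothetical and are not taken from `checkpoint.py`.

```python
class BestTracker:
    """Tracks the best metric value seen; mode is 'max' or 'min'."""

    def __init__(self, mode):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.mode = mode
        self.best = None

    def update(self, value):
        """Return True when `value` improves on the best so far."""
        if self.best is None:
            improved = True
        elif self.mode == "max":
            improved = value > self.best
        else:
            improved = value < self.best
        if improved:
            self.best = value
        return improved

acc_tracker = BestTracker("max")   # accuracy or reward: higher is better
loss_tracker = BestTracker("min")  # loss: lower is better
acc_saves = [acc_tracker.update(v) for v in [0.70, 0.68, 0.75]]
loss_saves = [loss_tracker.update(v) for v in [0.69, 0.60, 0.65]]
```

A training loop would save a checkpoint only on the epochs where `update` returns `True`.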

⚙️ Optimizer

AdamW across all models

🔬 Learning Objectives

This project focuses on understanding:

Attention mechanisms
Transformer architecture
Language model training
Preference optimization
Reinforcement learning for LLMs
Distributed training
Mixed precision training

🛣️ Roadmap

⭐ Why This Repository?

Most repositories either:

use Hugging Face without explaining internals, or
implement only the original Transformer.

This repository bridges the gap by showing how modern Transformer systems evolve from foundational architectures to alignment methods such as DPO and GRPO in a single codebase.
