🚀 Transformer Architect

A modular PyTorch framework for implementing, training, and experimenting with modern Transformer architectures from first principles.

Highlights

  • Encoder-only Transformers (BERT-style)
  • Decoder-only GPT-style models
  • Preference learning (DPO)
  • Reinforcement learning (GRPO)
  • Mixed Precision (AMP) training
  • Multi-GPU training support
  • Modular training engine design
  • Config-driven experiments
  • Hugging Face dataset integration

This project is designed for deep understanding of LLM internals, not just high-level API usage.

📖 Project Vision

This repository follows a progressive learning structure:

Transformer Basics
      ↓
Encoder Models (BERT-style classification)
      ↓
Decoder Models (GPT-style generation)
      ↓
Preference Learning (DPO)
      ↓
Reinforcement Learning (GRPO)

👉 Goal: Understand how modern LLMs evolve from supervised learning → alignment → RL optimization.


🏗️ Architecture

transformer-architect/
│
├── configs/          # YAML experiment configs
├── models/
│   ├── bert_transformer.py
│   ├── grpo_transformer.py
│   └── dpo_transformer.py
│
├── engines/          # Load and initialize the model and training state
│   ├── bert_engine.py
│   ├── dpo_engine.py
│   └── grpo_engine.py
│
├── trains/
│   ├── bert_train.py
│   ├── dpo_train.py
│   ├── grpo_train.py
│   └── checkpoint.py
│
├── device.py         # Device selection and initialization helpers
├── data.py           # Dataset loaders
├── main.py           # Entry point
└── requirements.txt

🧠 Implemented Models

BERT

Features:

  • Encoder-only Transformer
  • Sequence Classification (no MLM or NSP pretraining objective)
  • IMDB sentiment training example

DPO

Features:

  • Preference learning
  • Policy vs Reference model
  • Pairwise ranking loss

GRPO

Features:

  • Group-based reward optimization
  • Multiple response sampling
  • Relative reward normalization

⚑ Quick Start

Installation

git clone https://github.com/Esabelle11/transformer-architect.git

cd transformer-architect

pip install -r requirements.txt

Train BERT

python main.py --config configs/bert_classification.yaml

Train DPO

python main.py --config configs/dpo_alignment.yaml

Train GRPO

python main.py --config configs/grpo_reasoning.yaml

🧠 1. Encoder Transformer (BERT-style Classification Model)

📌 Architecture

This model is an encoder-only Transformer for sequence classification.

It is not a full BERT: there is no masked language modeling (MLM) or next sentence prediction (NSP) pretraining.

Model Flow

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
Encoder Block × N
     ↓
Mean Pooling
     ↓
Linear Classifier
     ↓
Logits

🔧 Encoder Block Structure

Each block contains:

  • Multi-Head Self Attention
  • Residual Connections
  • LayerNorm
  • FeedForward Network

⚙️ Training Objective

CrossEntropyLoss(logits, labels)

⚑ Training Loop

Forward Pass
   ↓
Compute Loss
   ↓
Backward Pass
   ↓
Optimizer Step (AdamW)
   ↓
Validation Accuracy Tracking
   ↓
Best Model Checkpoint Saving

🧪 Key Implementation Details

  • Mean pooling instead of CLS token
  • Attention mask support
  • Mixed precision training (AMP optional)
  • Best checkpoint saving based on validation accuracy
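
As a minimal sketch of mean pooling with attention-mask support, the function below averages only the non-padding token vectors. The name `masked_mean_pool` and the tensor shapes are assumptions for illustration, not taken from `bert_transformer.py`.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions.

    hidden:         (batch, seq_len, d_model) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts                                 # (batch, d_model)


hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])  # last position is padding
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)  # tensor([[2., 3.]])
```

Without the mask, the padded position would pull the average toward `[100, 100]`, which is why pooling and masking are listed together.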

🎯 Weight Initialization

All linear + embedding layers use:

torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)

This follows GPT/BERT-style initialization for stable training.
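
One common way to apply this rule to a whole model is `nn.Module.apply`, sketched below. The helper name `init_weights`, the toy model, and the zeroing of biases are assumptions; the README only specifies the weight initialization.

```python
import torch
from torch import nn


def init_weights(m: nn.Module) -> None:
    """Normal(0, 0.02) init for linear and embedding weights."""
    if isinstance(m, (nn.Linear, nn.Embedding)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        # Assumption: biases are zeroed; the README does not say how they are handled.
        if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.zeros_(m.bias)


# Hypothetical toy model, only to show how `apply` visits every submodule.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.LayerNorm(64))
model.apply(init_weights)
```

`LayerNorm` is left at its default initialization because it is neither a linear nor an embedding layer.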


🧠 2. Decoder Transformer (GPT-style Backbone for DPO & GRPO)

This model is used in both:

  • DPO
  • GRPO

📌 Architecture

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
Decoder Block × N
     ↓
LayerNorm
     ↓
LM Head
     ↓
Logits (vocab distribution)

🔧 Decoder Block Structure

Each block contains:

  • Causal Self-Attention (masked)
  • FeedForward Network
  • Residual Connections
  • LayerNorm

Important Design Choice

Causal Mask (lower triangular)
→ prevents future token leakage
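
A minimal sketch of how a lower-triangular mask blocks attention to future tokens. The all-zero scores are stand-in values, and the variable names are illustrative rather than taken from the repository.

```python
import torch

seq_len = 4
# True where attention is allowed: position i may attend to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.zeros(seq_len, seq_len)                    # stand-in attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions
weights = torch.softmax(scores, dim=-1)                   # future positions get weight 0
print(weights[1])  # tensor([0.5000, 0.5000, 0.0000, 0.0000])
```

Row `i` of `weights` only has mass on columns `0..i`, so the prediction at position `i` cannot depend on later tokens.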

⚙️ Training Objective (Used in DPO / GRPO)

  • Autoregressive next-token prediction
  • Log-probability extraction for sequences

🤖 3. DPO (Direct Preference Optimization)

📌 Concept

DPO trains a model using:

(Chosen response, Rejected response)

WITHOUT reinforcement learning or reward models.

🧠 DPO Architecture Flow

Prompt
  ↓
Policy Model (Trainable)
  ↓
Chosen Log Prob   Rejected Log Prob
  ↓
Reference Model (Frozen)
  ↓
Chosen Ref Log Prob   Rejected Ref Log Prob

🔢 Key Computation

Sequence Log Probability

log π(y|x) = sum(log softmax over tokens)
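
The sum above can be sketched as follows, assuming the usual next-token alignment where the logits at position `t` predict the token at position `t + 1`. The function name `sequence_log_prob` and the `mask` argument (for example, to count only response tokens) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits, tokens, mask):
    """Sum of per-token log-probabilities for each sequence.

    logits: (batch, seq_len, vocab), where logits[:, t] predicts tokens[:, t + 1]
    tokens: (batch, seq_len) token ids
    mask:   (batch, seq_len), 1 for tokens that should count
    """
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # last step has no target
    targets = tokens[:, 1:].unsqueeze(-1)                  # next-token targets
    token_lp = log_probs.gather(-1, targets).squeeze(-1)   # (batch, seq_len - 1)
    return (token_lp * mask[:, 1:].to(token_lp.dtype)).sum(dim=-1)


# Uniform logits over a vocabulary of 4: every token has probability 1/4.
logits = torch.zeros(1, 3, 4)
tokens = torch.tensor([[0, 2, 1]])
mask = torch.ones(1, 3)
lp = sequence_log_prob(logits, tokens, mask)  # two predicted tokens: 2 * log(1/4)
```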

DPO Loss

Δπ = logπθ(chosen) - logπθ(rejected)

Δref = logπref(chosen) - logπref(rejected)

Loss = -log σ(β(Δπ - Δref))
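
A small numeric sketch of this loss for a single preference pair, using only the standard library. The function name `dpo_loss` and the default `beta=0.1` are assumptions; the inputs are sequence log-probabilities.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (delta_pi - delta_ref)) for one preference pair."""
    delta_pi = pi_chosen - pi_rejected
    delta_ref = ref_chosen - ref_rejected
    z = beta * (delta_pi - delta_ref)
    # -log(sigmoid(z)) == softplus(-z); this form stays finite for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: z = 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))                # 0.6931471805599453
# Policy prefers the chosen response more than the reference does: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2))   # True
```

The loss starts at log 2 when the policy is a copy of the reference and falls as the policy widens the chosen-versus-rejected margin relative to the reference.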

⚙️ Training Flow

Policy Model + Reference Model
        ↓
Compute log probabilities
        ↓
DPO loss
        ↓
Backpropagation
        ↓
AdamW update (policy only)

🧪 Key Features

  • Frozen reference model
  • Gradients update only the policy model
  • AMP training support
  • Gradient accumulation
  • Best checkpoint saving based on validation loss

🎯 4. GRPO (Group Relative Policy Optimization)

📌 Concept

GRPO trains a model using:

  • Multiple sampled outputs per prompt (K rollouts)
  • Reward function
  • Group-relative advantage (baseline normalization)

🧠 Architecture Flow

Question Prompt
      ↓
Repeat K times
      ↓
Policy Model (Sampling)
      ↓
K Generated Responses
      ↓
Reward Function
      ↓
Group Baseline (Mean Reward)
      ↓
Advantage = Reward - Mean

⚙️ Reward Function

The current implementation:

Extract number from response
Compare with ground truth

Reward:

  • +1 → correct answer
  • negative penalty → incorrect answer or numeric error
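
A hedged sketch of a reward of this shape. The function name, the regex, the choice of the last number as the answer, and both penalty values are assumptions; the README only states that the reward is +1 for a correct answer and negative otherwise. Here "numeric error" is interpreted as "no number could be parsed".

```python
import re


def numeric_reward(response: str, ground_truth: float,
                   wrong_penalty: float = -0.5, parse_penalty: float = -1.0) -> float:
    """Score a response by the last number it contains.

    The penalty values are placeholders, not the repository's actual values.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return parse_penalty              # no number could be extracted
    answer = float(numbers[-1])           # assumption: the final number is the answer
    return 1.0 if abs(answer - ground_truth) < 1e-6 else wrong_penalty


print(numeric_reward("3 + 4 = 7", 7))          # 1.0
print(numeric_reward("I think it is 8", 7))    # -0.5
print(numeric_reward("no idea", 7))            # -1.0
```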

📊 Group Advantage

A = r - mean(r)

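Subtracting the group mean centers the rewards for each prompt: responses that beat their group's average get a positive advantage and the rest get a negative one. This removes any constant reward offset and reduces gradient variance.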
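
A worked example of `A = r - mean(r)` for one prompt, in plain Python. The function name `group_advantages` is illustrative.

```python
def group_advantages(rewards):
    """A_i = r_i - mean(r) over the K rollouts of a single prompt."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# K = 4 rollouts for one prompt: two correct (+1), two incorrect (-1).
print(group_advantages([1.0, -1.0, 1.0, -1.0]))   # [1.0, -1.0, 1.0, -1.0]
# Shifting every reward by a constant leaves the advantages unchanged.
print(group_advantages([11.0, 9.0, 11.0, 9.0]))   # [1.0, -1.0, 1.0, -1.0]
# If every rollout earns the same reward, there is no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))     # [0.0, 0.0, 0.0, 0.0]
```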

🔢 GRPO Objective

Loss = -(logπ(y) × Advantage).mean()

This is essentially:

REINFORCE with a group-mean baseline
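
The objective can be sketched on stand-in numbers as below. The log-probabilities and rewards are made up for illustration, and detaching the advantages is an assumption about how they are treated (as constants, with no gradient through the rewards).

```python
import torch

# Sequence log-probs of K = 4 sampled responses under the policy (stand-in values).
log_probs = torch.tensor([-5.0, -6.0, -4.0, -7.0], requires_grad=True)
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])

# Advantages are treated as constants: no gradient flows through the rewards.
advantages = (rewards - rewards.mean()).detach()

loss = -(log_probs * advantages).mean()
loss.backward()

print(loss.item())  # -1.0
# The gradient is -advantage / K: negative for rewarded samples, so a gradient
# descent step raises their log-probability and lowers it for penalized ones.
print(log_probs.grad)
```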

⚙️ Training Pipeline

SFT Warmup
    ↓
Generate K Rollouts
    ↓
Compute Rewards
    ↓
Compute Baseline
    ↓
Compute Advantages
    ↓
Policy Gradient Update

🧪 Key Features

  • Two-stage training (SFT → GRPO)
  • K-rollout sampling per prompt
  • No value network
  • No PPO clipping
  • Optional KL + entropy (disabled in current version)
  • AMP support
  • Reward-driven learning loop

📊 5. Evaluation System

BERT Evaluation

  • Accuracy-based (classification)
Accuracy = correct / total

GRPO Evaluation

Sample response
   ↓
Reward function
   ↓
Average reward over dataset

⚑ 6. Shared Engineering Features

🧠 Mixed Precision Training

  • torch.autocast
  • GradScaler
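
A minimal sketch of one training step that combines `torch.autocast` with `GradScaler`. The stand-in model, data, and learning rate are hypothetical, and the step falls back to full precision when no GPU is present. Recent PyTorch releases also expose the scaler as `torch.amp.GradScaler`; the older `torch.cuda.amp` path is used here as an assumption about the installed version.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"                      # fall back to full precision on CPU

model = nn.Linear(16, 2).to(device)             # hypothetical stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()   # scales the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # unscales gradients, skips the step on inf/NaN
scaler.update()                 # adjusts the scale factor for the next iteration
```

With `enabled=False`, both the autocast context and the scaler become no-ops, so the same loop runs unchanged on CPU.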

💾 Checkpointing

  • Best model saving based on:

    • Accuracy (BERT)
    • Loss (DPO)
    • Reward (GRPO)
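
The "best model" rule differs only in direction: higher is better for accuracy and reward, lower is better for loss. A small sketch of that bookkeeping follows; the class `BestMetricTracker` is hypothetical and does not describe the contents of `checkpoint.py`.

```python
class BestMetricTracker:
    """Remember the best validation metric and report when it improves."""

    def __init__(self, mode: str):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.mode = mode
        self.best = None

    def is_improvement(self, value: float) -> bool:
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best = value
        return improved


tracker = BestMetricTracker(mode="min")   # "min" for DPO loss; "max" for accuracy or reward
for epoch, val_loss in enumerate([0.70, 0.65, 0.68, 0.61]):
    if tracker.is_improvement(val_loss):
        # A real loop would call torch.save(model.state_dict(), path) here.
        print(f"epoch {epoch}: new best {val_loss}")
```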

⚙️ Optimizer

  • AdamW across all models

🔬 Learning Objectives

This project focuses on understanding:

  • Attention mechanisms
  • Transformer architecture
  • Language model training
  • Preference optimization
  • Reinforcement learning for LLMs
  • Distributed training
  • Mixed precision training

🛣️ Roadmap

  • Encoder Transformer
  • GPT Decoder
  • DPO
  • GRPO
  • PPO (future)
  • LoRA fine-tuning
  • Multi-GPU training
  • MoE architecture

⭐ Why This Repository?

Most repositories either:

  • use Hugging Face without explaining internals, or
  • implement only the original Transformer.

This repository bridges the gap by showing how modern Transformer systems evolve from foundational architectures to alignment methods such as DPO and GRPO in a single codebase.


About

A unified, production-grade LLM training and alignment framework. It provides modular implementations of BERT, DPO, and GRPO to streamline model alignment and reduce memory overhead.
