TinyTrain

A minimal deep learning library built from scratch. NumPy backend, CuPy for GPU, Triton for fused kernels. No PyTorch at runtime, just tensors, autograd, and raw CUDA.

Requirements

Core - Python ≥ 3.10, NumPy
GPU - CuPy, Triton
Testing - PyTorch (reference oracle), pytest

Setup

uv sync

To install the GPU and Triton extras:

uv sync --extra gpu --extra triton

Roadmap

| Module | Status | Features | Tests |
| --- | --- | --- | --- |
| Core autograd | DONE | Tensor class · Topological-sort backward pass · Scalar/broadcast autograd · Batched matmul · Slicing · No-grad mode · Numerical gradient stress tests · Randomized shape tests · Broadcasting stress tests | `test_tensor.py`, `test_ops.py`, `test_autograd_stress.py`; 199 cases |
| NN modules | DONE | Module base · Parameter tracking · `zero_grad()` · Linear · LayerNorm · Embedding · Dropout · Sequential · Reproducibility | `test_nn.py`; 12 cases |
| Loss & functional | DONE | Softmax · LogSoftmax · Cross entropy · Numerical stability · Backward checks | `test_functional.py`; 7 cases |
| Optimizers | DONE | SGD · Adam/AdamW-style optimizer · Multi-step optimizer behavior | `test_optim.py`; 3 cases |
| Triton kernels | DONE | CuPy <-> PyTorch CUDA bridge · Tiled matmul · Flash attention (online softmax, causal) · Fused LayerNorm fwd + bwd · ReLU / GELU fwd + bwd · Auto-dispatch GPU/CPU | `test_kernels.py`; 16 cases |
| Utils | DONE | Save/load · Gradient norm clipping · Gradient value clipping · StepLR · CosineAnnealingLR · LinearWarmupCosineDecay · Parameter counting through integration tests | `test_integration.py`; 8 utility-related cases |
| Data loading | DONE | DataLoader yields Tensor batches · Regression training integration | `test_integration.py`; 2 cases |
| End-to-end | DONE | MLP XOR training · TransformerBlock loss decrease · Deep gradient flow · MSE loss · MultiHeadAttention forward/backward · TransformerBlock shape/training/param count | `test_integration.py`; 11 model/training cases |
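
The "topological-sort backward pass" listed under Core autograd can be illustrated with a tiny scalar graph. This is a minimal sketch of the general technique, not TinyTrain's actual Tensor class; the `Value` class and its attributes are hypothetical names chosen for the example.

```python
class Value:
    """Scalar graph node (hypothetical; stands in for a real Tensor class)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # pushes self.grad into the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Depth-first post-order gives a topological order of the graph.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk in reverse so each node's grad is complete before it is used.
        for node in reversed(order):
            node._backward()


x, y = Value(2.0), Value(3.0)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

Accumulating with `+=` matters here: `x` feeds two different nodes, so its gradient is the sum of both contributions.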

Ops support

[DONE] Add, Sub, Mul, Neg, Div, pow
[DONE] Sum, Mean, Max
[DONE] Reshape, Transpose, Slice, MatMul, cat
[DONE] Exp, Log, Tanh, Sigmoid
[DONE] ReLU, GELU
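
Each op above needs a backward rule, and the roadmap mentions numerical gradient stress tests for them. A check of that kind might look like the sketch below, which compares an analytic Tanh gradient against central differences in plain NumPy. The helper names are illustrative and are not taken from the TinyTrain test suite.

```python
import numpy as np


def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f at x (x is float64)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        hi = f(x)
        x[idx] = orig - eps
        lo = f(x)
        x[idx] = orig  # restore the perturbed entry
        grad[idx] = (hi - lo) / (2 * eps)
    return grad


def tanh_sum(x):
    return np.tanh(x).sum()


def tanh_sum_grad(x):
    # d/dx sum(tanh(x)) = 1 - tanh(x)^2, elementwise
    return 1.0 - np.tanh(x) ** 2


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
max_err = np.abs(numerical_grad(tanh_sum, x) - tanh_sum_grad(x)).max()
print(max_err)
```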

Optimizers

[DONE] SGD
[DONE] AdamW
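
For reference, one AdamW update with decoupled weight decay can be written in a few lines of NumPy. This is a sketch of the standard update rule under the usual default hyperparameters; the function `adamw_step` and its signature are assumptions for illustration and do not describe TinyTrain's optimizer interface.

```python
import numpy as np


def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; t is the 1-based step count. Returns new (param, m, v)."""
    b1, b2 = betas
    # Decoupled weight decay: shrink the weights directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimize 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adamw_step(w, w, m, v, t, lr=0.05)
print(w)
```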

Fused Kernels

[DONE] Tiled MatMul
[DONE] FlashAttention (Causal)
[DONE] LayerNorm fwd + bwd
[DONE] ReLU fwd + bwd
[DONE] GELU fwd + bwd
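
The online-softmax idea behind the causal FlashAttention kernel can be shown without a GPU. The NumPy sketch below processes keys in blocks while keeping a running row maximum, denominator, and output accumulator, then checks the result against a direct computation. It is a CPU illustration of the algorithm only, not the Triton kernel in this repository, and the function names and block size are assumptions.

```python
import numpy as np


def attention_reference(q, k, v, causal=True):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def attention_online(q, k, v, block=4, causal=True):
    n, d = q.shape
    m = np.full((n, 1), -np.inf)       # running row max
    l = np.zeros((n, 1))               # running softmax denominator
    acc = np.zeros((n, v.shape[-1]))   # running unnormalized output
    rows = np.arange(n)[:, None]
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        if causal:
            cols = np.arange(start, start + kb.shape[0])[None, :]
            s = np.where(cols > rows, -np.inf, s)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)      # rescales earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ vb
        m = m_new
    return acc / l


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
out_ref = attention_reference(q, k, v)
out_online = attention_online(q, k, v)
print(np.abs(out_ref - out_online).max())
```

Because key 0 is visible to every query under the causal mask, the running maximum is finite after the first block, so fully masked later blocks contribute exact zeros rather than NaNs.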

NN Modules

[DONE] Module base class
[DONE] Linear
[DONE] Embedding
[DONE] Dropout
[DONE] ReLU, GELU
[DONE] Multi-Head Attention
[DONE] Sequential
[DONE] Feed Forward
[DONE] Transformer block

Functional

[DONE] Softmax / LogSoftmax
[DONE] Cross-entropy
[DONE] MSE loss
[DONE] Scaled dot-product attention
[DONE] MaskedFill
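
The numerical-stability point for Softmax / LogSoftmax and cross-entropy usually comes down to subtracting the row maximum before exponentiating. A minimal NumPy sketch of that trick follows; it uses standalone functions written for this example and is not TinyTrain's functional API.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Shifting by the max leaves the result unchanged but keeps exp() in range.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_entropy(logits, targets):
    """Mean negative log-likelihood; targets holds integer class indices."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()


# Logits this large would overflow a naive exp(logits) / sum(exp(logits)).
logits = np.array([[1000.0, 1001.0, 1002.0],
                   [-1000.0, 0.0, 1000.0]])
targets = np.array([2, 2])
loss = cross_entropy(logits, targets)
print(loss)
```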

Utils

[DONE] Seeding
[DONE] Device detection
[DONE] Parameter counting
[DONE] Save / load checkpoints
[DONE] Gradient norm clipping
[DONE] Gradient value clipping
[DONE] StepLR
[DONE] CosineAnnealingLR
[DONE] LinearWarmupCosineDecay
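
Two of the utilities above are easy to sketch in isolation: gradient norm clipping and a linear-warmup, cosine-decay learning-rate schedule. The code below shows one common formulation of each. The function names, argument names, and conventions (for example, warmup starting at `base_lr / warmup_steps`) are assumptions and may differ from what TinyTrain implements.

```python
import math

import numpy as np


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all grads so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + eps))
    return [g * scale for g in grads], total


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear ramp up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


clipped, norm_before = clip_grad_norm([np.array([3.0, 4.0])], max_norm=1.0)
schedule = [warmup_cosine_lr(s, 1.0, 10, 100) for s in range(101)]
print(norm_before, np.linalg.norm(clipped[0]), schedule[0], schedule[-1])
```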

