GPU Memory Calculator for LLM Training

A versatile Python application for calculating GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.

🎯 Try it Online - No Installation Required!

🚀 Launch GPU Memory Calculator on Hugging Face Spaces

Calculate GPU memory instantly in your browser. Supports all training engines (DeepSpeed, Megatron, FSDP) and inference engines (vLLM, TGI, SGLang).

📦 Want to install locally? CLI & Python API available

pip install gpu-mem-calculator
gpu-mem-calc calculate --preset llama2-7b

See Installation below.

📖 Getting Started Guide | 💬 FAQ | 🤝 Contributing

🌟 Featured On

🤗 Hugging Face Spaces - Try the interactive web demo with no installation!

Star ⭐ this repo to support development and get updates!

🚀 Why Use This Tool?

Training large language models requires careful memory planning. This calculator helps you:

💰 Save costs by determining the optimal GPU configuration before you start training
⚡ Avoid OOM errors by validating your training configuration fits in GPU memory
📊 Compare strategies across different training engines (DeepSpeed, Megatron, FSDP)
🎯 Plan infrastructure by knowing exactly how many GPUs you need
📈 Scale efficiently with detailed memory breakdowns for optimization

Whether you're training a 7B parameter model on a single GPU or a 175B model across hundreds of GPUs, this tool estimates per-GPU memory from formulas published by the DeepSpeed and Megatron-LM projects and checked against the sources listed under References.

✨ Features

Core Training Calculation

🔧 Multiple Training Engines: Support for PyTorch DDP, DeepSpeed ZeRO (stages 0-3), Megatron-LM, Megatron+DeepSpeed, and PyTorch FSDP
🖥️ Dual Interface: Both CLI and Web UI for flexible usage
🎯 Preset Models: Quick-load configurations for popular models (LLaMA 2, GPT-3, etc.)
📊 Detailed Breakdown: Memory breakdown by component (parameters, gradients, optimizer states, activations)
✅ Feasibility Analysis: Check if your configuration fits on available GPU memory
⚙️ Easy Config: JSON-based configuration files with human-readable parameter formats (e.g., "7B", "7000M")
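The human-readable parameter formats mentioned above ("7B", "7000M") can be illustrated with a small parser. This is a minimal sketch, not the calculator's actual implementation: the function name `parse_param_count` and the suffix table are assumptions made for illustration.

```python
import re

# Hypothetical suffix table; the calculator's real parser may accept other forms.
_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_param_count(value):
    """Convert 7_000_000_000, "7B", or "7000M" into an integer parameter count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMBT]?)\s*", str(value).upper())
    if match is None:
        raise ValueError(f"Unrecognized parameter count: {value!r}")
    number, suffix = match.groups()
    # An empty suffix means the string was a plain number.
    return round(float(number) * _SUFFIXES.get(suffix, 1))


print(parse_param_count("7B"), parse_param_count("7000M"))
```

Both spellings resolve to the same integer, so a config file can use whichever is easier to read.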

🆕 Inference Memory Calculation

🚀 Multi-Engine Support: HuggingFace Transformers, vLLM, TGI, TensorRT-LLM, SGLang
💾 KV Cache Optimization: Quantization options (NONE, INT8, FP8, INT4)
🔄 Tensor Parallelism: Automatic memory distribution across GPUs
📈 Throughput Estimation: Tokens/second estimates for capacity planning
🎯 Batch Size Optimization: Find maximum batch size for your hardware
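To make the KV cache options above concrete, here is a rough sketch of the standard KV cache size estimate for multi-head attention. It is not taken from the calculator's source. The function name, the assumption that `NONE` means 16-bit storage, and the even split across tensor-parallel ranks are all illustrative assumptions; grouped-query attention and quantization scale factors are ignored.

```python
# Assumed bytes per cached element for each quantization option.
# "none" is taken to mean FP16/BF16 storage (an assumption).
KV_BYTES = {"none": 2.0, "int8": 1.0, "fp8": 1.0, "int4": 0.5}


def kv_cache_bytes_per_gpu(num_layers, hidden_size, seq_len, batch_size,
                           quantization="none", tensor_parallel_size=1):
    """Estimate KV cache bytes per GPU for standard multi-head attention."""
    # Factor of 2: one key vector and one value vector per token per layer.
    per_token = 2 * num_layers * hidden_size * KV_BYTES[quantization]
    total = per_token * seq_len * batch_size
    # Assumes attention heads are split evenly across tensor-parallel ranks.
    return total / tensor_parallel_size


# LLaMA 2 7B shape, batch 32, full 4096-token context, INT8 cache, TP=2.
gib = kv_cache_bytes_per_gpu(32, 4096, 4096, 32, "int8", 2) / 2**30
print(f"{gib:.1f} GiB per GPU")
```

Halving the bytes per element or doubling the tensor-parallel size each halves the per-GPU cache, which is why both options matter for batch size planning.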

🆕 Multi-Node Training

🌐 Network Overhead Calculation: AllReduce, AllGather, ReduceScatter, pipeline communication
📡 Interconnect Support: InfiniBand, NVLink, Ethernet (10G/25G/100G/200G)
⚡ Hybrid Parallelism Optimization: Automatic TP+PP+DP strategy optimization
🔧 ZeRO Stage Impact Analysis: Compare communication overhead across ZeRO stages
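As a rough illustration of the network overhead items above, the sketch below uses the textbook ring all-reduce cost, in which each rank sends 2(N-1)/N times the payload. Whether the calculator uses this exact model is an assumption; the function names are hypothetical, and latency and compute/communication overlap are ignored.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, world_size):
    """Bytes each rank sends during a ring all-reduce of payload_bytes."""
    if world_size < 2:
        return 0.0
    # Reduce-scatter plus all-gather, each moving (N-1)/N of the payload.
    return 2 * (world_size - 1) / world_size * payload_bytes


def transfer_time_ms(num_bytes, bandwidth_gbit_per_s):
    """Ideal transfer time in milliseconds at the given link bandwidth."""
    bytes_per_s = bandwidth_gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_s * 1e3


grad_bytes = 7_000_000_000 * 2  # BF16 gradients of a 7B-parameter model
sent = ring_allreduce_bytes_per_gpu(grad_bytes, world_size=32)
print(f"{sent / 1e9:.1f} GB sent per GPU per step")
print(f"{transfer_time_ms(sent, 100):.0f} ms on a 100 Gbit/s link")
```

The traffic per rank approaches twice the payload as the number of ranks grows, so interconnect bandwidth rather than GPU count dominates the estimate.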

🆕 Framework Configuration Exporters

📦 Accelerate Export: HuggingFace Accelerate config generation
⚡ Lightning Export: PyTorch Lightning Trainer configuration
🔥 Axolotl Export: YAML config for fine-tuning
📄 File Export: Save to YAML/JSON formats
🎛️ Format Conversion: Convert between different framework configs

Core Capabilities

Multiple Training Engines: Support for PyTorch DDP, DeepSpeed ZeRO (stages 0-3), Megatron-LM, Megatron+DeepSpeed, and PyTorch FSDP
Dual Interface: Both CLI and Web UI for flexible usage
Preset Models: Quick-load configurations for popular models (LLaMA 2, GPT-3, GLM, Mixtral, etc.)
Detailed Breakdown: Memory breakdown by component (parameters, gradients, optimizer states, activations)
Feasibility Analysis: Check if your configuration fits on available GPU memory
Easy Config: JSON-based configuration files with human-readable parameter formats (e.g., "7B", "7000M")

Web UI Enhancements

🆕 HuggingFace Hub Integration: Fetch model metadata directly from HuggingFace Model Hub by entering a model ID (e.g., meta-llama/Llama-2-7b-hf)
Formula Explanations: See exactly how memory is calculated with your values plugged in
Real-time Validation: Client-side validation prevents invalid configurations
Smart Auto-calculation: Optimized debouncing (1s) with minimum interval protection
Export Capabilities: Export to DeepSpeed config files, JSON, or copy to clipboard
Batch Size Optimizer: Automatically find maximum batch size that fits
Comparison Mode: Save and compare different configurations side-by-side
Accessibility Features: ARIA labels, keyboard navigation, colorblind-friendly charts

Advanced Features

MoE Support: Mixture of Experts models with configurable experts and top-k routing
CPU/NVMe Offloading: Offload optimizer states and parameters to CPU or NVMe storage
Activation Checkpointing: 5 levels from none to full checkpointing
Sequence Parallelism: Optimize memory for long sequences
Result Caching: Fast repeated calculations with built-in caching

📦 Installation

Quick Start

pip install git+https://github.com/George614/gpu-mem-calculator.git

From source

git clone https://github.com/George614/gpu-mem-calculator.git
cd gpu-mem-calculator
pip install -e .

For Web UI support

pip install -e ".[web]"

Development installation

pip install -e ".[dev]"

🎓 Use Cases

Research & Academia

Estimate GPU requirements for research projects before requesting compute resources
Plan multi-GPU training configurations for large-scale experiments
Compare memory efficiency of different training strategies

Industry & Production

Cost optimization: Choose the right GPU type and count for your training workload
Capacity planning: Forecast infrastructure needs for model development
Debugging: Diagnose OOM errors and optimize memory usage

Education & Learning

Understand how training configuration affects memory consumption
Learn about different distributed training strategies
Experiment with various optimization techniques safely

🚀 Usage

Command Line Interface

Using model presets (Recommended)

The calculator includes pre-configured model presets for popular LLMs:

# List all available presets
gpu-mem-calc presets

# Calculate with a preset
gpu-mem-calc calculate --preset llama2-7b
gpu-mem-calc calculate --preset mixtral-8x7b --format json

# List presets in table format
gpu-mem-calc presets --format table

Available presets include:

Dense Models: LLaMA 2 (7B, 13B, 70B), GPT-3 (175B)
MoE Models: Mixtral 8x7B, GLM-4 (9B), GLM-4.7 (355B), GLM-4.5 Air (106B), Qwen1.5-MoE-A2.7B, DeepSeek-MoE (16B)

Calculate from config file

gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json

Quick calculation from model size

# Calculate memory for 7B model with 8x80GB GPUs using DeepSpeed
gpu-mem-calc quick 7 --gpus 8 --engine deepspeed

# With custom GPU memory
gpu-mem-calc quick 70 --gpus 64 --gpu-mem 80 --engine megatron

Validate configuration

gpu-mem-calc validate configs/my_config.json

Web Interface

Start the web server:

python -m gpu_mem_calculator.web.app

Or using uvicorn directly:

uvicorn gpu_mem_calculator.web.app:app --reload

Then open your browser to http://localhost:8000

Python API

Training Memory Calculation

from gpu_mem_calculator.core.calculator import GPUMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    TrainingConfig,
    ParallelismConfig,
    EngineConfig,
    GPUConfig,
)

# Create configuration
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    vocab_size=32000,
    max_seq_len=4096,
)

training_config = TrainingConfig(
    batch_size=4,
    gradient_accumulation_steps=4,
    dtype="bf16",
    optimizer="adamw",
)

parallelism_config = ParallelismConfig(
    data_parallel_size=8,
)

engine_config = EngineConfig(
    type="deepspeed",
    zero_stage=3,
    offload_optimizer="cpu",
)

gpu_config = GPUConfig(
    num_gpus=8,
    gpu_memory_gb=80,
)

# Calculate memory
calculator = GPUMemoryCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    gpu_config=gpu_config,
)

result = calculator.calculate()

print(f"Memory per GPU: {result.total_memory_per_gpu_gb:.2f} GB")
print(f"Fits on GPU: {result.fits_on_gpu}")
print(f"Utilization: {result.memory_utilization_percent:.1f}%")

🆕 Inference Memory Calculation

from gpu_mem_calculator.inference.calculator import InferenceMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    InferenceConfig,
    InferenceEngineType,
    GPUConfig,
)

# Create configurations
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    max_seq_len=4096,
)

inference_config = InferenceConfig(
    batch_size=32,
    kv_cache_quantization="int8",  # NONE, INT8, FP8, INT4
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)

gpu_config = GPUConfig(num_gpus=2, gpu_memory_gb=80)

# Calculate for different inference engines
calculator = InferenceMemoryCalculator(model_config, inference_config, gpu_config)

# vLLM inference
result_vllm = calculator.calculate(InferenceEngineType.VLLM)
print(f"vLLM: {result_vllm.total_memory_per_gpu_gb:.2f} GB")
print(f"Max batch size: {result_vllm.max_supported_batch_size}")
print(f"Throughput: {result_vllm.estimated_throughput_tokens_per_sec:.0f} tokens/sec")

# TensorRT-LLM inference
result_trt = calculator.calculate(InferenceEngineType.TENSORRT_LLM)
print(f"TensorRT-LLM: {result_trt.total_memory_per_gpu_gb:.2f} GB")

🆕 Multi-Node Network Overhead

from gpu_mem_calculator.core.multinode import MultiNodeCalculator
from gpu_mem_calculator.core.models import (
    NodeConfig,
    InterconnectType,
)

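# Configure multi-node setup. model_config, training_config, parallelism_config,
# and engine_config are the objects built in the training example above.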
node_config = NodeConfig(
    num_nodes=4,
    gpus_per_node=8,
    interconnect_type=InterconnectType.INFINIBAND,
)

calculator = MultiNodeCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    node_config=node_config,
    engine_config=engine_config,
)

# Calculate network overhead
network_overhead = calculator.calculate_network_overhead()
print(f"AllReduce: {network_overhead.allreduce_gb:.2f} GB")
print(f"AllGather: {network_overhead.allgather_gb:.2f} GB")
print(f"Time overhead: {network_overhead.estimated_overhead_ms_per_step:.2f} ms/step")

# Optimize hybrid parallelism
from gpu_mem_calculator.core.models import HybridParallelismConfig

hybrid_config = HybridParallelismConfig(
    auto_optimize=True,
    prefer_pipeline_parallel=True,
    enable_sequence_parallel=True,
)

optimized_parallelism = calculator.optimize_hybrid_parallelism(hybrid_config)
print(f"Optimized TP: {optimized_parallelism.tensor_parallel_size}")
print(f"Optimized PP: {optimized_parallelism.pipeline_parallel_size}")
print(f"Optimized DP: {optimized_parallelism.data_parallel_size}")

🆕 Export Framework Configurations

from gpu_mem_calculator.exporters.manager import ExportManager, ExportFormat

# Create export manager. The config objects are the ones built in the
# training and multi-node examples above.
manager = ExportManager(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    node_config=node_config,
)

# Export to different formats
accelerate_config = manager.export(ExportFormat.ACCELERATE)
lightning_config = manager.export(ExportFormat.LIGHTNING)
axolotl_config = manager.export(ExportFormat.AXOLOTL)

# Export to file
manager.export_to_file(ExportFormat.ACCELERATE, "accelerate_config.yaml")
manager.export_to_file(ExportFormat.JSON, "config.json")

# Get DeepSpeed config
deepspeed_config = manager.export(ExportFormat.DEEPSPEED)

Configuration File Format

{
  "model": {
    "name": "llama2-7b",
    "num_parameters": "7B",
    "num_layers": 32,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "max_seq_len": 4096
  },
  "training": {
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "optimizer": "adamw",
    "dtype": "bf16",
    "activation_checkpointing": 1
  },
  "parallelism": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 8,
    "sequence_parallel": false
  },
  "engine": {
    "type": "deepspeed",
    "zero_stage": 3,
    "offload_optimizer": "cpu",
    "offload_param": "none"
  },
  "hardware": {
    "num_gpus": 8,
    "gpu_memory_gb": 80
  }
}
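A config file in this shape can be sanity-checked before it is passed to the calculator. The sketch below is illustrative only: `load_config` is a hypothetical helper, and the rule that tensor, pipeline, and data parallel sizes multiply to `num_gpus` is a common convention assumed here, not something confirmed for this tool's `validate` command.

```python
import json

REQUIRED_SECTIONS = ("model", "training", "parallelism", "engine", "hardware")


def load_config(text):
    """Parse a config document and run basic consistency checks."""
    cfg = json.loads(text)
    for section in REQUIRED_SECTIONS:
        if section not in cfg:
            raise ValueError(f"missing section: {section}")
    par = cfg["parallelism"]
    world = (par["tensor_parallel_size"]
             * par["pipeline_parallel_size"]
             * par["data_parallel_size"])
    # Assumed rule: TP x PP x DP should equal the number of GPUs.
    num_gpus = cfg["hardware"]["num_gpus"]
    if world != num_gpus:
        raise ValueError(f"TP*PP*DP = {world}, but num_gpus = {num_gpus}")
    return cfg


example = """
{
  "model": {"name": "llama2-7b", "num_parameters": "7B"},
  "training": {"batch_size": 4},
  "parallelism": {"tensor_parallel_size": 1, "pipeline_parallel_size": 1,
                  "data_parallel_size": 8},
  "engine": {"type": "deepspeed", "zero_stage": 3},
  "hardware": {"num_gpus": 8, "gpu_memory_gb": 80}
}
"""
cfg = load_config(example)
print(cfg["engine"]["type"], cfg["hardware"]["num_gpus"])
```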

Supported Training Engines

PyTorch DDP (Baseline)

Standard Distributed Data Parallel training without memory optimizations.

DeepSpeed ZeRO

ZeRO-1: Shard optimizer states
ZeRO-2: Shard optimizer states + gradients
ZeRO-3: Shard everything (parameters, gradients, optimizer states)
Supports CPU/NVMe offloading

Megatron-LM

Tensor and pipeline parallelism with activation checkpointing support.

Megatron + DeepSpeed

Combines Megatron-LM's model parallelism with DeepSpeed ZeRO's optimizer sharding.

PyTorch FSDP

Fully Sharded Data Parallel with multiple sharding strategies.

Memory Formulas

The calculator uses formulas verified against authoritative sources:

Base Components

Model Parameters:

FP16/BF16: num_params × 2 bytes
FP32: num_params × 4 bytes

Gradients:

FP16/BF16: num_params × 2 bytes
FP32: num_params × 4 bytes

Optimizer States (per optimizer type):

Adam/AdamW: num_params × 12 bytes
- 4 bytes: FP32 parameter copy
- 4 bytes: Momentum
- 4 bytes: Variance
AdamW 8-bit: num_params × 2 bytes (quantized)
SGD: num_params × 4 bytes (FP32 parameter copy only; no momentum buffer)

Activations:

Approximation: batch_size × seq_len × hidden_size × num_layers × ~16 bytes (per hidden unit, per token, per layer)
Varies based on activation checkpointing level
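The base components above can be combined into a single breakdown. The following is a minimal sketch that applies the listed per-parameter byte counts directly; the function name and dictionary keys (such as `adamw_8bit`) are hypothetical, and activation checkpointing is not modeled.

```python
# Per-parameter byte counts taken from the lists above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}
OPTIMIZER_BYTES_PER_PARAM = {"adamw": 12, "adamw_8bit": 2, "sgd": 4}


def base_memory_bytes(num_params, batch_size, seq_len, hidden_size, num_layers,
                      dtype="bf16", optimizer="adamw", activation_factor=16):
    """Unsharded memory per component, in bytes, with no checkpointing."""
    return {
        "parameters": num_params * BYTES_PER_PARAM[dtype],
        "gradients": num_params * BYTES_PER_PARAM[dtype],
        "optimizer": num_params * OPTIMIZER_BYTES_PER_PARAM[optimizer],
        # ~16 bytes per hidden unit, per token, per layer (rough approximation).
        "activations": (batch_size * seq_len * hidden_size
                        * num_layers * activation_factor),
    }


# LLaMA 2 7B shape with batch size 4 and 4096-token sequences.
breakdown = base_memory_bytes(7_000_000_000, 4, 4096, 4096, 32)
for name, size in breakdown.items():
    print(f"{name:>12}: {size / 1e9:6.1f} GB")
```

For this shape, the parameters, gradients, and AdamW states alone come to 16 bytes per parameter, or 112 GB, which is why sharding or offloading is needed on 80 GB GPUs.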

DeepSpeed ZeRO Stages

ZeRO-0 (Baseline - same as PyTorch DDP):

total_per_gpu = 2×params + 2×params + 12×params + activations
             = 16×params + activations

ZeRO-1 (Shard optimizer states):

total_per_gpu = 2×params + 2×params + (12×params)/num_gpus + activations

ZeRO-2 (Shard optimizer + gradients):

total_per_gpu = 2×params + (2×params)/num_gpus + (12×params)/num_gpus + activations

ZeRO-3 (Shard everything):

total_per_gpu = largest_layer_memory + (16×params)/num_gpus + activations
where largest_layer_memory ≈ 4×(num_params/10)
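The four stage formulas above translate directly into code. This sketch assumes BF16/FP16 weights and gradients with an Adam-style optimizer (2 + 2 + 12 bytes per parameter), as in the formulas; the function name is hypothetical and offloading is not modeled.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage, activations=0):
    """Per-GPU bytes for DeepSpeed ZeRO stages 0-3, following the formulas above."""
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage == 0:
        total = weights + grads + optim
    elif stage == 1:
        total = weights + grads + optim / num_gpus
    elif stage == 2:
        total = weights + (grads + optim) / num_gpus
    elif stage == 3:
        # Heuristic from above: the largest layer is about a tenth of the
        # model, held at 4 bytes per parameter.
        largest_layer = 4 * (num_params / 10)
        total = largest_layer + (weights + grads + optim) / num_gpus
    else:
        raise ValueError(f"unsupported ZeRO stage: {stage}")
    return total + activations


for stage in range(4):
    gb = zero_bytes_per_gpu(7_000_000_000, 8, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.2f} GB per GPU (excluding activations)")
```

For a 7B model on 8 GPUs, the model-state footprint drops from 112 GB at stage 0 to under 17 GB at stage 3, before activations are added.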

CPU/NVMe Offloading:

Optimizer states offloaded to CPU: 0 GB GPU memory
Parameters offloaded to CPU/NVMe: Dynamically gathered during compute

Verification

All formulas have been verified against:

✅ 18 comprehensive test scenarios (100% pass rate)
✅ EleutherAI Transformer Math 101
✅ Microsoft Research ZeRO Blog
✅ DeepSpeed Official Documentation
✅ PyTorch FSDP Documentation

References

EleutherAI Transformer Math 101 - Comprehensive transformer memory breakdown
Microsoft Research ZeRO Blog - ZeRO optimization techniques
DeepSpeed Memory Documentation - Official DeepSpeed memory formulas

Example Configurations

LLaMA 2 7B with DeepSpeed ZeRO-3

gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json

GPT-3 175B with Megatron-LM

gpu-mem-calc calculate --config configs/gpt3_175b_megatron.json

Custom 1B model with PyTorch DDP

gpu-mem-calc calculate --config configs/pytorch_ddp_example.json

Web UI Features

Interactive Interface

Real-time Calculations: Auto-calculates as you adjust parameters (1s debounce)
Client-side Validation: Instant feedback on configuration errors before API calls
Smart Presets: Quick-load model configurations (LLaMA 2, GPT-3, GLM, Mixtral, Qwen, DeepSeek)
Visual Breakdown: Color-coded bar chart with patterns for colorblind accessibility
Feasibility Status: Clear indicators showing if configuration fits on GPU

Formula Explanations

Detailed Breakdowns: See exact formulas used with your values plugged in
Component-by-Component: Each memory component explained with formula and result
Authoritative References: Links to EleutherAI, Microsoft Research, DeepSpeed docs
Engine-Specific Details: Different formulas for PyTorch DDP, DeepSpeed ZeRO, FSDP, Megatron-LM

Advanced Tools

Export to DeepSpeed: Generate deepspeed_config.json files automatically
Batch Size Optimizer: Find maximum batch size that fits your GPU memory
Config Persistence: Save configurations to browser localStorage
Comparison Mode: Compare different configurations side-by-side
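A batch size optimizer of this kind can be built as a binary search, since memory grows monotonically with batch size. The sketch below is one possible approach under that assumption, not the tool's actual algorithm; `max_batch_size` and the example memory model are hypothetical.

```python
def max_batch_size(memory_fn, budget_bytes, upper=4096):
    """Largest batch size in [1, upper] with memory_fn(b) <= budget, else 0.

    Assumes memory_fn is non-decreasing in the batch size.
    """
    lo, hi, best = 1, upper, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if memory_fn(mid) <= budget_bytes:
            best = mid      # mid fits; try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid does not fit; try something smaller
    return best


# Example memory model: 16.8 GB of sharded model state plus activations of
# seq_len x hidden x layers x 16 bytes per sample (LLaMA 2 7B shape).
def memory_for(batch_size):
    return 16.8e9 + batch_size * 4096 * 4096 * 32 * 16


print(max_batch_size(memory_for, budget_bytes=80e9))
```

The search needs only about a dozen evaluations for an upper bound of 4096, which keeps it cheap enough to rerun whenever a parameter changes.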

Accessibility

ARIA Labels: Full screen reader support throughout the interface
Keyboard Navigation: All features accessible via keyboard
Colorblind-Friendly: Patterns and textures supplement colors in charts
High Contrast: Clear visual indicators with multiple cues

API Endpoints

POST /api/calculate - Calculate GPU memory requirements
POST /api/explain-formula - Get detailed formula explanation
POST /api/export/deepspeed - Export DeepSpeed config file
POST /api/optimize/batch-size - Find maximum batch size
GET /api/preset/{preset_name} - Load model preset
POST /api/hf/fetch - 🆕 Fetch model metadata from HuggingFace Hub
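The endpoints above can be called from any HTTP client once the web server is running. The sketch below only builds the request for `POST /api/calculate` using the standard library. The request body schema is not documented here, so the assumption that it mirrors the configuration file format is a guess, and `build_calculate_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_calculate_request(config, base_url="http://localhost:8000"):
    """Build a JSON POST request for the /api/calculate endpoint."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/calculate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shape: the same sections as the configuration file.
config = {
    "model": {"name": "llama2-7b", "num_parameters": "7B"},
    "engine": {"type": "deepspeed", "zero_stage": 3},
    "hardware": {"num_gpus": 8, "gpu_memory_gb": 80},
}
req = build_calculate_request(config)
print(req.get_method(), req.full_url)

# To send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         result = json.load(resp)
```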

Development

Running Tests

pytest tests/

Test Coverage

The calculator includes comprehensive testing:

Unit Tests: Core calculation logic for each engine type
Integration Tests: End-to-end configuration validation
Formula Verification: 18 scenarios verifying formula accuracy
API Tests: Web API endpoint testing
Accessibility Tests: Screen reader and keyboard navigation

All formulas are verified against authoritative sources, with a 100% test pass rate.

Code Formatting

black src/ cli/ web/
ruff check src/ cli/ web/

Type Checking

mypy src/

Recent Improvements

Latest Updates

✨ Added formula explanation feature with detailed breakdowns
✨ Added client-side validation for better UX
✨ Added batch size optimizer API
✨ Added DeepSpeed config export functionality
✨ Added comprehensive input validation
✨ Added result caching for performance
♿ Added ARIA labels for full accessibility
♿ Added colorblind patterns to charts
🐛 Fixed optimizer formulas to be optimizer-specific
🐛 Fixed Pydantic namespace warnings

Verification Status

✅ All 18 test scenarios passing (100%)
✅ Formulas verified against EleutherAI, Microsoft Research, DeepSpeed docs
✅ Optimizer formulas corrected for AdamW, AdamW 8-bit, and SGD
✅ ZeRO stage formulas validated (0, 1, 2, 3)
✅ Engine type formulas validated (PyTorch DDP, DeepSpeed, FSDP, Megatron-LM)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. See CONTRIBUTING.md for detailed guidelines.

📚 References

The memory calculations in this tool are based on authoritative sources:

Core Memory Formulas:

EleutherAI Transformer Math 101 - Comprehensive breakdown of transformer memory requirements
Microsoft Research ZeRO Blog - ZeRO optimization techniques
Reducing Activation Recomputation in Large Transformer Models - Activation checkpointing strategies

Engine Documentation:

DeepSpeed Memory Documentation - Official DeepSpeed memory formulas
NVIDIA Megatron-LM - Tensor and pipeline parallelism
PyTorch FSDP Documentation - Fully sharded data parallel
PyTorch DDP Tutorial - Distributed data parallel

Related Tools:

llm-analysis - LLM memory analysis
vram-calculator - VRAM calculation utilities

🤝 Community & Support

📖 Documentation
🐛 Issue Tracker
💬 Discussions
📧 Contact the maintainers via GitHub

Star History

If you find this tool useful, please consider giving it a star! ⭐

📋 Roadmap

Inference memory calculation
Multi-node training configurations
Export to training framework configs (Accelerate, Lightning, Axolotl)
PyPI package distribution
Support for more model architectures (Vision Transformers, Diffusion models)
Real-time memory monitoring dashboard
CLI commands for inference and export features

🙏 Acknowledgments

This tool was inspired by and builds upon the excellent work of:

DeepSpeed Memory Estimator - ZeRO memory optimization formulas
llm-analysis - LLM memory analysis methodology
vram-calculator - VRAM calculation approach

Special thanks to the EleutherAI community for their comprehensive Transformer Math 101 guide, which provides detailed formulas for transformer memory calculations.

📄 License

MIT License - see LICENSE for details.

📚 Citation

If you use this tool in your research, please cite:

@software{gpu_mem_calculator,
  title = {GPU Memory Calculator for LLM Training},
  author = {GPU Mem Calculator Team},
  year = {2026},
  url = {https://github.com/George614/gpu-mem-calculator}
}

Made with ❤️ for the ML community

⭐ Star us on GitHub • 🐛 Report a Bug • 💡 Request a Feature
