A versatile Python application for calculating GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.
Launch GPU Memory Calculator on Hugging Face Spaces
Calculate GPU memory instantly in your browser. Supports all training engines (DeepSpeed, Megatron, FSDP) and inference engines (vLLM, TGI, SGLang).
Want to install locally? CLI & Python API available
pip install gpu-mem-calculator
gpu-mem-calc calculate --preset llama2-7b
See Installation below.
Getting Started Guide | FAQ | Contributing
Hugging Face Spaces - Try the interactive web demo with no installation!
Star ⭐ this repo to support development and get updates!
Training large language models requires careful memory planning. This calculator helps you:
- Save costs by determining the optimal GPU configuration before you start training
- Avoid OOM errors by validating that your training configuration fits in GPU memory
- Compare strategies across different training engines (DeepSpeed, Megatron, FSDP)
- Plan infrastructure by knowing how many GPUs you need
- Scale efficiently with detailed memory breakdowns for optimization
Whether you're training a 7B parameter model on a single GPU or a 175B model across hundreds of GPUs, this tool provides accurate memory estimates based on proven formulas from DeepSpeed, Megatron-LM, and the latest research.
- Multiple Training Engines: Support for PyTorch DDP, DeepSpeed ZeRO (stages 0-3), Megatron-LM, Megatron+DeepSpeed, and PyTorch FSDP
- Dual Interface: Both CLI and Web UI for flexible usage
- Preset Models: Quick-load configurations for popular models (LLaMA 2, GPT-3, GLM, Mixtral, etc.)
- Detailed Breakdown: Memory breakdown by component (parameters, gradients, optimizer states, activations)
- Feasibility Analysis: Check if your configuration fits on available GPU memory
- Easy Config: JSON-based configuration files with human-readable parameter formats (e.g., "7B", "7000M")
- Multi-Engine Support: Inference memory estimates for HuggingFace Transformers, vLLM, TGI, TensorRT-LLM, SGLang
- KV Cache Optimization: Quantization options (NONE, INT8, FP8, INT4)
- Tensor Parallelism: Automatic memory distribution across GPUs
- Throughput Estimation: Tokens/second estimates for capacity planning
- Batch Size Optimization: Find maximum batch size for your hardware
- Network Overhead Calculation: AllReduce, AllGather, ReduceScatter, pipeline communication
- Interconnect Support: InfiniBand, NVLink, Ethernet (10G/25G/100G/200G)
- Hybrid Parallelism Optimization: Automatic TP+PP+DP strategy optimization
- ZeRO Stage Impact Analysis: Compare communication overhead across ZeRO stages
- Accelerate Export: HuggingFace Accelerate config generation
- Lightning Export: PyTorch Lightning Trainer configuration
- Axolotl Export: YAML config for fine-tuning
- File Export: Save to YAML/JSON formats
- Format Conversion: Convert between different framework configs
- HuggingFace Hub Integration: Fetch model metadata directly from HuggingFace Model Hub by entering a model ID (e.g., meta-llama/Llama-2-7b-hf)
- Formula Explanations: See exactly how memory is calculated with your values plugged in
- Real-time Validation: Client-side validation prevents invalid configurations
- Smart Auto-calculation: Optimized debouncing (1s) with minimum interval protection
- Export Capabilities: Export to DeepSpeed config files, JSON, or copy to clipboard
- Batch Size Optimizer: Automatically find maximum batch size that fits
- Comparison Mode: Save and compare different configurations side-by-side
- Accessibility Features: ARIA labels, keyboard navigation, colorblind-friendly charts
- MoE Support: Mixture of Experts models with configurable experts and top-k routing
- CPU/NVMe Offloading: Offload optimizer states and parameters to CPU or NVMe storage
- Activation Checkpointing: 5 levels from none to full checkpointing
- Sequence Parallelism: Optimize memory for long sequences
- Result Caching: Fast repeated calculations with built-in caching
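The human-readable parameter formats mentioned in the feature list ("7B", "7000M") can be illustrated with a small parser. This is only a sketch: `parse_param_count` and its suffix table are hypothetical names introduced for illustration, and the calculator's own parser may accept a different set of suffixes.

```python
import re

# Hypothetical suffix table; the real calculator may support other suffixes.
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_param_count(value):
    """Convert 7_000_000_000, "7B", or "7000M" into an integer parameter count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMBT]?)\s*", str(value).upper())
    if match is None:
        raise ValueError(f"Unrecognized parameter count: {value!r}")
    number, suffix = match.groups()
    # An empty suffix means the string was already a plain number.
    return round(float(number) * _SUFFIXES.get(suffix, 1))


print(parse_param_count("7B"))
print(parse_param_count("7000M"))
```

Both calls describe the same model size, so a config file can use whichever spelling is easier to read.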
# Install from GitHub
pip install git+https://github.com/George614/gpu-mem-calculator.git

# Or clone the repository and install in editable mode
git clone https://github.com/George614/gpu-mem-calculator.git
cd gpu-mem-calculator
pip install -e .

# With the web UI extras
pip install -e ".[web]"

# With the development extras
pip install -e ".[dev]"

- Estimate GPU requirements for research projects before requesting compute resources
- Plan multi-GPU training configurations for large-scale experiments
- Compare memory efficiency of different training strategies
- Cost optimization: Choose the right GPU type and count for your training workload
- Capacity planning: Forecast infrastructure needs for model development
- Debugging: Diagnose OOM errors and optimize memory usage
- Understand how training configuration affects memory consumption
- Learn about different distributed training strategies
- Experiment with various optimization techniques safely
The calculator includes pre-configured model presets for popular LLMs:
# List all available presets
gpu-mem-calc presets
# Calculate with a preset
gpu-mem-calc calculate --preset llama2-7b
gpu-mem-calc calculate --preset mixtral-8x7b --format json
# List presets in table format
gpu-mem-calc presets --format table

Available presets include:
- Dense Models: LLaMA 2 (7B, 13B, 70B), GPT-3 (175B)
- MoE Models: Mixtral 8x7B, GLM-4 (9B), GLM-4.7 (355B), GLM-4.5 Air (106B), Qwen1.5-MoE-A2.7B, DeepSeek-MoE (16B)
gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json

# Calculate memory for 7B model with 8x80GB GPUs using DeepSpeed
gpu-mem-calc quick 7 --gpus 8 --engine deepspeed
# With custom GPU memory
gpu-mem-calc quick 70 --gpus 64 --gpu-mem 80 --engine megatron

gpu-mem-calc validate configs/my_config.json

Start the web server:
python -m gpu_mem_calculator.web.app
Or using uvicorn directly:
uvicorn gpu_mem_calculator.web.app:app --reload
Then open your browser to http://localhost:8000
from gpu_mem_calculator.core.calculator import GPUMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    TrainingConfig,
    ParallelismConfig,
    EngineConfig,
    GPUConfig,
)

# Create configuration
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    vocab_size=32000,
    max_seq_len=4096,
)
training_config = TrainingConfig(
    batch_size=4,
    gradient_accumulation_steps=4,
    dtype="bf16",
    optimizer="adamw",
)
parallelism_config = ParallelismConfig(
    data_parallel_size=8,
)
engine_config = EngineConfig(
    type="deepspeed",
    zero_stage=3,
    offload_optimizer="cpu",
)
gpu_config = GPUConfig(
    num_gpus=8,
    gpu_memory_gb=80,
)

# Calculate memory
calculator = GPUMemoryCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    gpu_config=gpu_config,
)
result = calculator.calculate()
print(f"Memory per GPU: {result.total_memory_per_gpu_gb:.2f} GB")
print(f"Fits on GPU: {result.fits_on_gpu}")
print(f"Utilization: {result.memory_utilization_percent:.1f}%")

from gpu_mem_calculator.inference.calculator import InferenceMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    InferenceConfig,
    InferenceEngineType,
    GPUConfig,
)

# Create configurations
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    max_seq_len=4096,
)
inference_config = InferenceConfig(
    batch_size=32,
    kv_cache_quantization="int8",  # options: NONE, INT8, FP8, INT4
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)
gpu_config = GPUConfig(num_gpus=2, gpu_memory_gb=80)

# Calculate for different inference engines
calculator = InferenceMemoryCalculator(model_config, inference_config, gpu_config)

# vLLM inference
result_vllm = calculator.calculate(InferenceEngineType.VLLM)
print(f"vLLM: {result_vllm.total_memory_per_gpu_gb:.2f} GB")
print(f"Max batch size: {result_vllm.max_supported_batch_size}")
print(f"Throughput: {result_vllm.estimated_throughput_tokens_per_sec:.0f} tokens/sec")

# TensorRT-LLM inference
result_trt = calculator.calculate(InferenceEngineType.TENSORRT_LLM)
print(f"TensorRT-LLM: {result_trt.total_memory_per_gpu_gb:.2f} GB")

from gpu_mem_calculator.core.multinode import MultiNodeCalculator
from gpu_mem_calculator.core.models import (
    HybridParallelismConfig,
    InterconnectType,
    NodeConfig,
)

# Reuses model_config, training_config, parallelism_config, and engine_config
# from the training example above.

# Configure multi-node setup
node_config = NodeConfig(
    num_nodes=4,
    gpus_per_node=8,
    interconnect_type=InterconnectType.INFINIBAND,
)
calculator = MultiNodeCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    node_config=node_config,
    engine_config=engine_config,
)

# Calculate network overhead
network_overhead = calculator.calculate_network_overhead()
print(f"AllReduce: {network_overhead.allreduce_gb:.2f} GB")
print(f"AllGather: {network_overhead.allgather_gb:.2f} GB")
print(f"Time overhead: {network_overhead.estimated_overhead_ms_per_step:.2f} ms/step")

# Optimize hybrid parallelism
hybrid_config = HybridParallelismConfig(
    auto_optimize=True,
    prefer_pipeline_parallel=True,
    enable_sequence_parallel=True,
)
optimized_parallelism = calculator.optimize_hybrid_parallelism(hybrid_config)
print(f"Optimized TP: {optimized_parallelism.tensor_parallel_size}")
print(f"Optimized PP: {optimized_parallelism.pipeline_parallel_size}")
print(f"Optimized DP: {optimized_parallelism.data_parallel_size}")

from gpu_mem_calculator.exporters.manager import ExportManager, ExportFormat
# Create export manager
manager = ExportManager(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    node_config=node_config,
)

# Export to different formats
accelerate_config = manager.export(ExportFormat.ACCELERATE)
lightning_config = manager.export(ExportFormat.LIGHTNING)
axolotl_config = manager.export(ExportFormat.AXOLOTL)

# Export to file
manager.export_to_file(ExportFormat.ACCELERATE, "accelerate_config.yaml")
manager.export_to_file(ExportFormat.JSON, "config.json")

# Get DeepSpeed config
deepspeed_config = manager.export(ExportFormat.DEEPSPEED)

{
  "model": {
    "name": "llama2-7b",
    "num_parameters": "7B",
    "num_layers": 32,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "max_seq_len": 4096
  },
  "training": {
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "optimizer": "adamw",
    "dtype": "bf16",
    "activation_checkpointing": 1
  },
  "parallelism": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 8,
    "sequence_parallel": false
  },
  "engine": {
    "type": "deepspeed",
    "zero_stage": 3,
    "offload_optimizer": "cpu",
    "offload_param": "none"
  },
  "hardware": {
    "num_gpus": 8,
    "gpu_memory_gb": 80
  }
}

PyTorch DDP: Standard Distributed Data Parallel training without memory optimizations.
DeepSpeed ZeRO:
- ZeRO-1: Shard optimizer states
- ZeRO-2: Shard optimizer states + gradients
- ZeRO-3: Shard everything (parameters, gradients, optimizer states)
- Supports CPU/NVMe offloading
Megatron-LM: Tensor and pipeline parallelism with activation checkpointing support.
Megatron+DeepSpeed: Combines Megatron-LM's model parallelism with DeepSpeed ZeRO's optimizer sharding.
PyTorch FSDP: Fully Sharded Data Parallel with multiple sharding strategies.
The calculator uses formulas verified against authoritative sources:
Model Parameters:
- FP16/BF16: num_params × 2 bytes
- FP32: num_params × 4 bytes
Gradients:
- FP16/BF16: num_params × 2 bytes
- FP32: num_params × 4 bytes
Optimizer States (per optimizer type):
- Adam/AdamW: num_params × 12 bytes
  - 4 bytes: FP32 parameter copy
  - 4 bytes: Momentum
  - 4 bytes: Variance
- AdamW 8-bit: num_params × 2 bytes (quantized)
- SGD: num_params × 4 bytes (FP32 only, no momentum)
Activations:
- Approximation: batch_size × seq_len × hidden_size × num_layers × ~16 bytes/token/layer
- Varies based on activation checkpointing level
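To make the per-component formulas above concrete, here is a minimal sketch that plugs in the LLaMA 2 7B values used elsewhere in this README. The function names, the dictionary keys, and the choice of decimal gigabytes (1e9 bytes) are assumptions for illustration; they are not the calculator's internal API, and the activation estimate ignores checkpointing.

```python
# Bytes per parameter, taken from the lists above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}
OPTIMIZER_BYTES_PER_PARAM = {"adamw": 12, "adamw_8bit": 2, "sgd": 4}
GB = 1e9  # assumption: decimal gigabytes; use 2**30 for GiB


def static_memory_gb(num_params, dtype="bf16", optimizer="adamw"):
    """Unsharded memory for parameters, gradients, and optimizer states."""
    return {
        "parameters": num_params * BYTES_PER_PARAM[dtype] / GB,
        "gradients": num_params * BYTES_PER_PARAM[dtype] / GB,
        "optimizer_states": num_params * OPTIMIZER_BYTES_PER_PARAM[optimizer] / GB,
    }


def activation_memory_gb(batch_size, seq_len, hidden_size, num_layers, bytes_per_unit=16):
    """Rough activation estimate without activation checkpointing."""
    return batch_size * seq_len * hidden_size * num_layers * bytes_per_unit / GB


static = static_memory_gb(7_000_000_000, dtype="bf16", optimizer="adamw")
activations = activation_memory_gb(batch_size=4, seq_len=4096, hidden_size=4096, num_layers=32)
for name, gb in static.items():
    print(f"{name}: {gb:.1f} GB")
print(f"activations: {activations:.1f} GB")
```

For a 7B model in BF16 with AdamW, this gives 14 GB of parameters, 14 GB of gradients, and 84 GB of optimizer states before any sharding, which is why a single 80 GB GPU cannot hold the unsharded training state.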
ZeRO-0 (Baseline - same as PyTorch DDP):
total_per_gpu = 2×params + 2×params + 12×params + activations
              = 16×params + activations
ZeRO-1 (Shard optimizer states):
total_per_gpu = 2×params + 2×params + (12×params)/num_gpus + activations
ZeRO-2 (Shard optimizer + gradients):
total_per_gpu = 2×params + (2×params)/num_gpus + (12×params)/num_gpus + activations
ZeRO-3 (Shard everything):
total_per_gpu = largest_layer_memory + (16×params)/num_gpus + activations
where largest_layer_memory ≈ 4×(num_params/10)
CPU/NVMe Offloading:
- Optimizer states offloaded to CPU: 0 GB GPU memory
- Parameters offloaded to CPU/NVMe: Dynamically gathered during compute
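The ZeRO stage formulas above can be compared side by side with a short sketch. It assumes FP16/BF16 weights and gradients with Adam/AdamW (the 2 + 2 + 12 bytes per parameter used in the formulas) and no offloading; `zero_memory_per_gpu` is a hypothetical helper, not a function exported by the package.

```python
def zero_memory_per_gpu(num_params, num_gpus, stage, activation_bytes=0.0):
    """Per-GPU bytes for a given ZeRO stage, following the formulas above."""
    weights = 2 * num_params   # FP16/BF16 parameters
    grads = 2 * num_params     # FP16/BF16 gradients
    optim = 12 * num_params    # Adam/AdamW states incl. FP32 parameter copy
    if stage == 0:
        total = weights + grads + optim
    elif stage == 1:
        total = weights + grads + optim / num_gpus
    elif stage == 2:
        total = weights + (grads + optim) / num_gpus
    elif stage == 3:
        largest_layer = 4 * (num_params / 10)  # approximation from the text
        total = largest_layer + (weights + grads + optim) / num_gpus
    else:
        raise ValueError(f"Unknown ZeRO stage: {stage}")
    return total + activation_bytes


# 7B parameters on 8 GPUs, activations excluded.
for stage in range(4):
    gb = zero_memory_per_gpu(7e9, num_gpus=8, stage=stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.2f} GB per GPU")
# ZeRO-0: 112.00 GB per GPU
# ZeRO-1: 38.50 GB per GPU
# ZeRO-2: 26.25 GB per GPU
# ZeRO-3: 16.80 GB per GPU
```

Under these assumptions, only ZeRO-1 and above bring a 7B model under 80 GB per GPU on eight devices, and the remaining headroom is what activations have to fit into.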
All formulas have been verified against:
- 18 comprehensive test scenarios (100% pass rate)
- EleutherAI Transformer Math 101
- Microsoft Research ZeRO Blog
- DeepSpeed Official Documentation
- PyTorch FSDP Documentation
- EleutherAI Transformer Math 101 - Comprehensive transformer memory breakdown
- Microsoft Research ZeRO Blog - ZeRO optimization techniques
- DeepSpeed Memory Documentation - Official DeepSpeed memory formulas
gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json
gpu-mem-calc calculate --config configs/gpt3_175b_megatron.json
gpu-mem-calc calculate --config configs/pytorch_ddp_example.json

- Real-time Calculations: Auto-calculates as you adjust parameters (1s debounce)
- Client-side Validation: Instant feedback on configuration errors before API calls
- Smart Presets: Quick-load model configurations (LLaMA 2, GPT-3, GLM, Mixtral, Qwen, DeepSeek)
- Visual Breakdown: Color-coded bar chart with patterns for colorblind accessibility
- Feasibility Status: Clear indicators showing if configuration fits on GPU
- Detailed Breakdowns: See exact formulas used with your values plugged in
- Component-by-Component: Each memory component explained with formula and result
- Authoritative References: Links to EleutherAI, Microsoft Research, DeepSpeed docs
- Engine-Specific Details: Different formulas for PyTorch DDP, DeepSpeed ZeRO, FSDP, Megatron-LM
- Export to DeepSpeed: Generate deepspeed_config.json files automatically
- Batch Size Optimizer: Find maximum batch size that fits your GPU memory
- Config Persistence: Save configurations to browser localStorage
- Comparison Mode: Compare different configurations side-by-side
- ARIA Labels: Full screen reader support throughout the interface
- Keyboard Navigation: All features accessible via keyboard
- Colorblind-Friendly: Patterns and textures supplement colors in charts
- High Contrast: Clear visual indicators with multiple cues
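The Batch Size Optimizer listed above searches for the largest batch size that still fits in GPU memory. How the tool implements this is not described here; one plausible approach, sketched below, is a binary search over batch sizes, assuming memory grows monotonically with batch size. `max_batch_size`, `fits_on_80gb`, and the memory figures in the toy check are all hypothetical.

```python
def max_batch_size(fits, upper_bound=1024):
    """Largest batch size in [1, upper_bound] for which fits(batch_size) is True.

    Assumes that if a batch size fits, every smaller one fits too.
    Returns 0 when even a batch size of 1 does not fit.
    """
    lo, hi, best = 1, upper_bound, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # mid fits: try larger sizes
        else:
            hi = mid - 1              # mid is too large: try smaller sizes
    return best


def fits_on_80gb(batch_size):
    """Toy feasibility check with made-up numbers (not calculator output)."""
    static_gb = 16.8                 # hypothetical sharded model state
    activation_gb_per_sample = 8.6   # hypothetical activation cost per sample
    return static_gb + batch_size * activation_gb_per_sample <= 80


print(max_batch_size(fits_on_80gb))  # 7
```

In practice the `fits` callback would wrap a real memory calculation, so each probe costs one calculator run and the search needs only about log2(upper_bound) of them.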
- POST /api/calculate - Calculate GPU memory requirements
- POST /api/explain-formula - Get detailed formula explanation
- POST /api/export/deepspeed - Export DeepSpeed config file
- POST /api/optimize/batch-size - Find maximum batch size
- GET /api/preset/{preset_name} - Load model preset
- POST /api/hf/fetch - Fetch model metadata from HuggingFace Hub
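As a rough illustration of calling the calculation endpoint from Python, the sketch below builds a POST request for /api/calculate using only the standard library. The request schema is an assumption: it simply mirrors the JSON config file format shown earlier, and the real API may expect different field names. `build_calculate_request` is a hypothetical helper, and the base URL is the local address from the web server instructions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local web server address from this README


def build_calculate_request(config, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /api/calculate."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/calculate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shape: same sections as the JSON config file.
config = {
    "model": {"name": "llama2-7b", "num_parameters": "7B", "num_layers": 32,
              "hidden_size": 4096, "num_attention_heads": 32,
              "vocab_size": 32000, "max_seq_len": 4096},
    "training": {"batch_size": 4, "gradient_accumulation_steps": 4,
                 "optimizer": "adamw", "dtype": "bf16"},
    "parallelism": {"data_parallel_size": 8},
    "engine": {"type": "deepspeed", "zero_stage": 3, "offload_optimizer": "cpu"},
    "hardware": {"num_gpus": 8, "gpu_memory_gb": 80},
}

request = build_calculate_request(config)
print(request.get_method(), request.full_url)

# Sending requires the web server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The response format is not documented in this section, so the commented-out lines only print whatever JSON the server returns.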
pytest tests/

The calculator includes comprehensive testing:
- Unit Tests: Core calculation logic for each engine type
- Integration Tests: End-to-end configuration validation
- Formula Verification: 18 scenarios verifying formula accuracy
- API Tests: Web API endpoint testing
- Accessibility Tests: Screen reader and keyboard navigation
All formulas are verified against authoritative sources, and all 18 verification scenarios pass.
black src/ cli/ web/
ruff check src/ cli/ web/
mypy src/

- Added formula explanation feature with detailed breakdowns
- Added client-side validation for better UX
- Added batch size optimizer API
- Added DeepSpeed config export functionality
- Added comprehensive input validation
- Added result caching for performance
- Added ARIA labels for full accessibility
- Added colorblind patterns to charts
- Fixed optimizer formulas to be optimizer-specific
- Fixed Pydantic namespace warnings
- All 18 test scenarios passing (100%)
- Formulas verified against EleutherAI, Microsoft Research, DeepSpeed docs
- Optimizer formulas corrected for AdamW, AdamW 8-bit, and SGD
- ZeRO stage formulas validated (0, 1, 2, 3)
- Engine type formulas validated (PyTorch DDP, DeepSpeed, FSDP, Megatron-LM)
Contributions are welcome! Please feel free to submit a Pull Request. See CONTRIBUTING.md for detailed guidelines.
The memory calculations in this tool are based on authoritative sources:
Core Memory Formulas:
- EleutherAI Transformer Math 101 - Comprehensive breakdown of transformer memory requirements
- Microsoft Research ZeRO Blog - ZeRO optimization techniques
- Reducing Activation Recomputation in Large Transformer Models - Activation checkpointing strategies
Engine Documentation:
- DeepSpeed Memory Documentation - Official DeepSpeed memory formulas
- NVIDIA Megatron-LM - Tensor and pipeline parallelism
- PyTorch FSDP Documentation - Fully sharded data parallel
- PyTorch DDP Tutorial - Distributed data parallel
Related Tools:
- llm-analysis - LLM memory analysis
- vram-calculator - VRAM calculation utilities
- Documentation
- Issue Tracker
- Discussions
- Contact the maintainers via GitHub
If you find this tool useful, please consider giving it a star! ⭐
- Inference memory calculation
- Multi-node training configurations
- Export to training framework configs (Accelerate, Lightning, Axolotl)
- PyPI package distribution
- Support for more model architectures (Vision Transformers, Diffusion models)
- Real-time memory monitoring dashboard
- CLI commands for inference and export features
This tool was inspired by and builds upon the excellent work of:
- DeepSpeed Memory Estimator - ZeRO memory optimization formulas
- llm-analysis - LLM memory analysis methodology
- vram-calculator - VRAM calculation approach
Special thanks to the EleutherAI community for their comprehensive Transformer Math 101 guide, which provides detailed formulas for transformer memory calculations.
MIT License - see LICENSE for details.
If you use this tool in your research, please cite:
@software{gpu_mem_calculator,
  title = {GPU Memory Calculator for LLM Training},
  author = {GPU Mem Calculator Team},
  year = {2026},
  url = {https://github.com/George614/gpu-mem-calculator}
}

Made with ❤️ for the ML community
Star us on GitHub • Report a Bug • Request a Feature
