Skip to content

George614/gpu-mem-calculator

GPU Memory Calculator for LLM Training

License: MIT Python 3.10+ Code style: black PRs Welcome Hugging Face Spaces

A versatile Python application for calculating GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.

🎯 Try it Online - No Installation Required!

πŸš€ Launch GPU Memory Calculator on Hugging Face Spaces

Calculate GPU memory instantly in your browser. Supports all training engines (DeepSpeed, Megatron, FSDP) and inference engines (vLLM, TGI, SGLang).

πŸ“¦ Want to install locally? CLI & Python API available
pip install gpu-mem-calculator
gpu-mem-calc calculate --preset llama2-7b

See Installation below.


πŸ“– Getting Started Guide | πŸ’¬ FAQ | 🀝 Contributing

GPU Memory Calculator Screenshot


🌟 Featured On

πŸ€— Hugging Face Spaces - Try the interactive web demo with no installation!

Star ⭐ this repo to support development and get updates!


πŸš€ Why Use This Tool?

Training large language models requires careful memory planning. This calculator helps you:

  • πŸ’° Save costs by determining the optimal GPU configuration before you start training
  • ⚑ Avoid OOM errors by validating your training configuration fits in GPU memory
  • πŸ“Š Compare strategies across different training engines (DeepSpeed, Megatron, FSDP)
  • 🎯 Plan infrastructure by knowing exactly how many GPUs you need
  • πŸ“ˆ Scale efficiently with detailed memory breakdowns for optimization

Whether you're training a 7B parameter model on a single GPU or a 175B model across hundreds of GPUs, this tool provides accurate memory estimates based on proven formulas from DeepSpeed, Megatron-LM, and the latest research.

✨ Features

Core Training Calculation

  • πŸ”§ Multiple Training Engines: Support for PyTorch DDP, DeepSpeed ZeRO (stages 1-3), Megatron-LM, Megatron+DeepSpeed, and PyTorch FSDP
  • πŸ–₯️ Dual Interface: Both CLI and Web UI for flexible usage
  • 🎯 Preset Models: Quick-load configurations for popular models (LLaMA 2, GPT-3, etc.)
  • πŸ“Š Detailed Breakdown: Memory breakdown by component (parameters, gradients, optimizer states, activations)
  • βœ… Feasibility Analysis: Check if your configuration fits on available GPU memory
  • βš™οΈ Easy Config: JSON-based configuration files with human-readable parameter formats (e.g., "7B", "7000M")

πŸ†• Inference Memory Calculation

  • πŸš€ Multi-Engine Support: HuggingFace Transformers, vLLM, TGI, TensorRT-LLM, SGLang
  • πŸ’Ύ KV Cache Optimization: Quantization options (NONE, INT8, FP8, INT4)
  • πŸ”„ Tensor Parallelism: Automatic memory distribution across GPUs
  • πŸ“ˆ Throughput Estimation: Tokens/second estimates for capacity planning
  • 🎯 Batch Size Optimization: Find maximum batch size for your hardware

πŸ†• Multi-Node Training

  • 🌐 Network Overhead Calculation: AllReduce, AllGather, ReduceScatter, pipeline communication
  • πŸ“‘ Interconnect Support: InfiniBand, NVLink, Ethernet (10G/25G/100G/200G)
  • ⚑ Hybrid Parallelism Optimization: Automatic TP+PP+DP strategy optimization
  • πŸ”§ ZeRO Stage Impact Analysis: Compare communication overhead across ZeRO stages

πŸ†• Framework Configuration Exporters

  • πŸ“¦ Accelerate Export: HuggingFace Accelerate config generation
  • ⚑ Lightning Export: PyTorch Lightning Trainer configuration
  • πŸ”₯ Axolotl Export: YAML config for fine-tuning
  • πŸ“„ File Export: Save to YAML/JSON formats
  • πŸŽ›οΈ Format Conversion: Convert between different framework configs

πŸ“¦ Installation

Quick Start

Core Capabilities

  • Multiple Training Engines: Support for PyTorch DDP, DeepSpeed ZeRO (stages 0-3), Megatron-LM, Megatron+DeepSpeed, and PyTorch FSDP
  • Dual Interface: Both CLI and Web UI for flexible usage
  • Preset Models: Quick-load configurations for popular models (LLaMA 2, GPT-3, GLM, Mixtral, etc.)
  • Detailed Breakdown: Memory breakdown by component (parameters, gradients, optimizer states, activations)
  • Feasibility Analysis: Check if your configuration fits on available GPU memory
  • Easy Config: JSON-based configuration files with human-readable parameter formats (e.g., "7B", "7000M")

Web UI Enhancements

  • πŸ†• HuggingFace Hub Integration: Fetch model metadata directly from HuggingFace Model Hub by entering a model ID (e.g., meta-llama/Llama-2-7b-hf)
  • Formula Explanations: See exactly how memory is calculated with your values plugged in
  • Real-time Validation: Client-side validation prevents invalid configurations
  • Smart Auto-calculation: Optimized debouncing (1s) with minimum interval protection
  • Export Capabilities: Export to DeepSpeed config files, JSON, or copy to clipboard
  • Batch Size Optimizer: Automatically find maximum batch size that fits
  • Comparison Mode: Save and compare different configurations side-by-side
  • Accessibility Features: ARIA labels, keyboard navigation, colorblind-friendly charts

Advanced Features

  • MoE Support: Mixture of Experts models with configurable experts and top-k routing
  • CPU/NVMe Offloading: Offload optimizer states and parameters to CPU or NVMe storage
  • Activation Checkpointing: 5 levels from none to full checkpointing
  • Sequence Parallelism: Optimize memory for long sequences
  • Result Caching: Fast repeated calculations with built-in caching
pip install git+https://github.com/George614/gpu-mem-calculator.git

From source

git clone https://github.com/George614/gpu-mem_calculator.git
cd gpu_mem_calculator
pip install -e .

For Web UI support

pip install -e ".[web]"

Development installation

pip install -e ".[dev]"

πŸŽ“ Use Cases

Research & Academia

  • Estimate GPU requirements for research projects before requesting compute resources
  • Plan multi-GPU training configurations for large-scale experiments
  • Compare memory efficiency of different training strategies

Industry & Production

  • Cost optimization: Choose the right GPU type and count for your training workload
  • Capacity planning: Forecast infrastructure needs for model development
  • Debugging: Diagnose OOM errors and optimize memory usage

Education & Learning

  • Understand how training configuration affects memory consumption
  • Learn about different distributed training strategies
  • Experiment with various optimization techniques safely

πŸš€ Usage

Command Line Interface

Using model presets (Recommended)

The calculator includes pre-configured model presets for popular LLMs:

# List all available presets
gpu-mem-calc presets

# Calculate with a preset
gpu-mem-calc calculate --preset llama2-7b
gpu-mem-calc calculate --preset mixtral-8x7b --format json

# List presets in table format
gpu-mem-calc presets --format table

Available presets include:

  • Dense Models: LLaMA 2 (7B, 13B, 70B), GPT-3 (175B)
  • MoE Models: Mixtral 8x7B, GLM-4 (9B), GLM-4.7 (355B), GLM-4.5 Air (106B), Qwen1.5-MoE-A2.7B, DeepSeek-MoE (16B)

Calculate from config file

gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json

Quick calculation from model size

# Calculate memory for 7B model with 8x80GB GPUs using DeepSpeed
gpu-mem-calc quick 7 --gpus 8 --engine deepspeed

# With custom GPU memory
gpu-mem-calc quick 70 --gpus 64 --gpu-mem 80 --engine megatron

Validate configuration

gpu-mem-calc validate configs/my_config.json

Web Interface

Start the web server:

python -m gpu_mem_calculator.web.app

Or using uvicorn directly:

uvicorn gpu_mem_calculator.web.app:app --reload

Then open your browser to http://localhost:8000

Python API

Training Memory Calculation

from gpu_mem_calculator.core.calculator import GPUMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    TrainingConfig,
    ParallelismConfig,
    EngineConfig,
    GPUConfig,
)

# Create configuration
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    vocab_size=32000,
    max_seq_len=4096,
)

training_config = TrainingConfig(
    batch_size=4,
    gradient_accumulation_steps=4,
    dtype="bf16",
    optimizer="adamw",
)

parallelism_config = ParallelismConfig(
    data_parallel_size=8,
)

engine_config = EngineConfig(
    type="deepspeed",
    zero_stage=3,
    offload_optimizer="cpu",
)

gpu_config = GPUConfig(
    num_gpus=8,
    gpu_memory_gb=80,
)

# Calculate memory
calculator = GPUMemoryCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    gpu_config=gpu_config,
)

result = calculator.calculate()

print(f"Memory per GPU: {result.total_memory_per_gpu_gb:.2f} GB")
print(f"Fits on GPU: {result.fits_on_gpu}")
print(f"Utilization: {result.memory_utilization_percent:.1f}%")

πŸ†• Inference Memory Calculation

from gpu_mem_calculator.inference.calculator import InferenceMemoryCalculator
from gpu_mem_calculator.core.models import (
    ModelConfig,
    InferenceConfig,
    InferenceEngineType,
    GPUConfig,
)

# Create configurations
model_config = ModelConfig(
    name="llama2-7b",
    num_parameters=7_000_000_000,
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    max_seq_len=4096,
)

inference_config = InferenceConfig(
    batch_size=32,
    kv_cache_quantization="int8",  # NONE, INT8, FP8, INT4
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)

gpu_config = GPUConfig(num_gpus=2, gpu_memory_gb=80)

# Calculate for different inference engines
calculator = InferenceMemoryCalculator(model_config, inference_config, gpu_config)

# vLLM inference
result_vllm = calculator.calculate(InferenceEngineType.VLLM)
print(f"vLLM: {result_vllm.total_memory_per_gpu_gb:.2f} GB")
print(f"Max batch size: {result_vllm.max_supported_batch_size}")
print(f"Throughput: {result_vllm.estimated_throughput_tokens_per_sec:.0f} tokens/sec")

# TensorRT-LLM inference
result_trt = calculator.calculate(InferenceEngineType.TENSORRT_LLM)
print(f"TensorRT-LLM: {result_trt.total_memory_per_gpu_gb:.2f} GB")

πŸ†• Multi-Node Network Overhead

from gpu_mem_calculator.core.multinode import MultiNodeCalculator
from gpu_mem_calculator.core.models import (
    NodeConfig,
    InterconnectType,
)

# Configure multi-node setup
node_config = NodeConfig(
    num_nodes=4,
    gpus_per_node=8,
    interconnect_type=InterconnectType.INFINIBAND,
)

calculator = MultiNodeCalculator(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    node_config=node_config,
    engine_config=engine_config,
)

# Calculate network overhead
network_overhead = calculator.calculate_network_overhead()
print(f"AllReduce: {network_overhead.allreduce_gb:.2f} GB")
print(f"AllGather: {network_overhead.allgather_gb:.2f} GB")
print(f"Time overhead: {network_overhead.estimated_overhead_ms_per_step:.2f} ms/step")

# Optimize hybrid parallelism
from gpu_mem_calculator.core.models import HybridParallelismConfig

hybrid_config = HybridParallelismConfig(
    auto_optimize=True,
    prefer_pipeline_parallel=True,
    enable_sequence_parallel=True,
)

optimized_parallelism = calculator.optimize_hybrid_parallelism(hybrid_config)
print(f"Optimized TP: {optimized_parallelism.tensor_parallel_size}")
print(f"Optimized PP: {optimized_parallelism.pipeline_parallel_size}")
print(f"Optimized DP: {optimized_parallelism.data_parallel_size}")

πŸ†• Export Framework Configurations

from gpu_mem_calculator.exporters.manager import ExportManager, ExportFormat

# Create export manager
manager = ExportManager(
    model_config=model_config,
    training_config=training_config,
    parallelism_config=parallelism_config,
    engine_config=engine_config,
    node_config=node_config,
)

# Export to different formats
accelerate_config = manager.export(ExportFormat.ACCELERATE)
lightning_config = manager.export(ExportFormat.LIGHTNING)
axolotl_config = manager.export(ExportFormat.AXOLOTL)

# Export to file
manager.export_to_file(ExportFormat.ACCELERATE, "accelerate_config.yaml")
manager.export_to_file(ExportFormat.JSON, "config.json")

# Get DeepSpeed config
deepspeed_config = manager.export(ExportFormat.DEEPSPEED)

Configuration File Format

{
  "model": {
    "name": "llama2-7b",
    "num_parameters": "7B",
    "num_layers": 32,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "max_seq_len": 4096
  },
  "training": {
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "optimizer": "adamw",
    "dtype": "bf16",
    "activation_checkpointing": 1
  },
  "parallelism": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 8,
    "sequence_parallel": false
  },
  "engine": {
    "type": "deepspeed",
    "zero_stage": 3,
    "offload_optimizer": "cpu",
    "offload_param": "none"
  },
  "hardware": {
    "num_gpus": 8,
    "gpu_memory_gb": 80
  }
}

Supported Training Engines

PyTorch DDP (Baseline)

Standard Distributed Data Parallel training without memory optimizations.

DeepSpeed ZeRO

  • ZeRO-1: Shard optimizer states
  • ZeRO-2: Shard optimizer states + gradients
  • ZeRO-3: Shard everything (parameters, gradients, optimizer states)
  • Supports CPU/NVMe offloading

Megatron-LM

Tensor and pipeline parallelism with activation checkpointing support.

Megatron + DeepSpeed

Combines Megatron-LM's model parallelism with DeepSpeed ZeRO's optimizer sharding.

PyTorch FSDP

Fully Sharded Data Parallel with multiple sharding strategies.

Memory Formulas

The calculator uses formulas verified against authoritative sources:

Base Components

Model Parameters:

  • FP16/BF16: num_params Γ— 2 bytes
  • FP32: num_params Γ— 4 bytes

Gradients:

  • FP16/BF16: num_params Γ— 2 bytes
  • FP32: num_params Γ— 4 bytes

Optimizer States (per optimizer type):

  • Adam/AdamW: num_params Γ— 12 bytes
    • 4 bytes: FP32 parameter copy
    • 4 bytes: Momentum
    • 4 bytes: Variance
  • AdamW 8-bit: num_params Γ— 2 bytes (quantized)
  • SGD: num_params Γ— 4 bytes (FP32 only, no momentum)

Activations:

  • Approximation: batch_size Γ— seq_len Γ— hidden_size Γ— num_layers Γ— ~16 bytes/token/layer
  • Varies based on activation checkpointing level

DeepSpeed ZeRO Stages

ZeRO-0 (Baseline - same as PyTorch DDP):

total_per_gpu = 2Γ—params + 2Γ—params + 12Γ—params + activations
             = 16Γ—params + activations

ZeRO-1 (Shard optimizer states):

total_per_gpu = 2Γ—params + 2Γ—params + (12Γ—params)/num_gpus + activations

ZeRO-2 (Shard optimizer + gradients):

total_per_gpu = 2Γ—params + (2Γ—params)/num_gpus + (12Γ—params)/num_gpus + activations

ZeRO-3 (Shard everything):

total_per_gpu = largest_layer_memory + (16Γ—params)/num_gpus + activations
where largest_layer_memory β‰ˆ 4Γ—(num_params/10)

CPU/NVMe Offloading:

  • Optimizer states offloaded to CPU: 0 GB GPU memory
  • Parameters offloaded to CPU/NVMe: Dynamically gathered during compute

Verification

All formulas have been verified against:

  • βœ… 18 comprehensive test scenarios (100% pass rate)
  • βœ… EleutherAI Transformer Math 101
  • βœ… Microsoft Research ZeRO Blog
  • βœ… DeepSpeed Official Documentation
  • βœ… PyTorch FSDP Documentation

References

Example Configurations

LLaMA 2 7B with DeepSpeed ZeRO-3

gpu-mem-calc calculate --config configs/llama2_7b_deepspeed.json

GPT-3 175B with Megatron-LM

gpu-mem-calc calculate --config configs/gpt3_175b_megatron.json

Custom 1B model with PyTorch DDP

gpu-mem-calc calculate --config configs/pytorch_ddp_example.json

Web UI Features

Interactive Interface

  • Real-time Calculations: Auto-calculates as you adjust parameters (1s debounce)
  • Client-side Validation: Instant feedback on configuration errors before API calls
  • Smart Presets: Quick-load model configurations (LLaMA 2, GPT-3, GLM, Mixtral, Qwen, DeepSeek)
  • Visual Breakdown: Color-coded bar chart with patterns for colorblind accessibility
  • Feasibility Status: Clear indicators showing if configuration fits on GPU

Formula Explanations

  • Detailed Breakdowns: See exact formulas used with your values plugged in
  • Component-by-Component: Each memory component explained with formula and result
  • Authoritative References: Links to EleutherAI, Microsoft Research, DeepSpeed docs
  • Engine-Specific Details: Different formulas for PyTorch DDP, DeepSpeed ZeRO, FSDP, Megatron-LM

Advanced Tools

  • Export to DeepSpeed: Generate deepspeed_config.json files automatically
  • Batch Size Optimizer: Find maximum batch size that fits your GPU memory
  • Config Persistence: Save configurations to browser localStorage
  • Comparison Mode: Compare different configurations side-by-side

Accessibility

  • ARIA Labels: Full screen reader support throughout the interface
  • Keyboard Navigation: All features accessible via keyboard
  • Colorblind-Friendly: Patterns and textures supplement colors in charts
  • High Contrast: Clear visual indicators with multiple cues

API Endpoints

  • POST /api/calculate - Calculate GPU memory requirements
  • POST /api/explain-formula - Get detailed formula explanation
  • POST /api/export/deepspeed - Export DeepSpeed config file
  • POST /api/optimize/batch-size - Find maximum batch size
  • GET /api/preset/{preset_name} - Load model preset
  • POST /api/hf/fetch - πŸ†• Fetch model metadata from HuggingFace Hub

Development

Running Tests

pytest tests/

Test Coverage

The calculator includes comprehensive testing:

  • Unit Tests: Core calculation logic for each engine type
  • Integration Tests: End-to-end configuration validation
  • Formula Verification: 18 scenarios verifying formula accuracy
  • API Tests: Web API endpoint testing
  • Accessibility Tests: Screen reader and keyboard navigation

All formulas verified accurate against authoritative sources with 100% test pass rate.

Code Formatting

black src/ cli/ web/
ruff check src/ cli/ web/

Type Checking

mypy src/

Recent Improvements

Latest Updates

  • ✨ Added formula explanation feature with detailed breakdowns
  • ✨ Added client-side validation for better UX
  • ✨ Added batch size optimizer API
  • ✨ Added DeepSpeed config export functionality
  • ✨ Added comprehensive input validation
  • ✨ Added result caching for performance
  • β™Ώ Added ARIA labels for full accessibility
  • β™Ώ Added colorblind patterns to charts
  • πŸ› Fixed optimizer formulas to be optimizer-specific
  • πŸ› Fixed Pydantic namespace warnings

Verification Status

  • βœ… All 18 test scenarios passing (100%)
  • βœ… Formulas verified against EleutherAI, Microsoft Research, DeepSpeed docs
  • βœ… Optimizer formulas corrected for AdamW, AdamW 8-bit, and SGD
  • βœ… ZeRO stage formulas validated (0, 1, 2, 3)
  • βœ… Engine type formulas validated (PyTorch DDP, DeepSpeed, FSDP, Megatron-LM)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. See CONTRIBUTING.md for detailed guidelines.

πŸ“š References

The memory calculations in this tool are based on authoritative sources:

Core Memory Formulas:

Engine Documentation:

Related Tools:

🀝 Community & Support

Star History

If you find this tool useful, please consider giving it a star! ⭐

πŸ“‹ Roadmap

  • Inference memory calculation
  • Multi-node training configurations
  • Export to training framework configs (Accelerate, Lightning, Axolotl)
  • PyPI package distribution
  • Support for more model architectures (Vision Transformers, Diffusion models)
  • Real-time memory monitoring dashboard
  • CLI commands for inference and export features

πŸ™ Acknowledgments

This tool was inspired by and builds upon the excellent work of:

Special thanks to the EleutherAI community for their comprehensive Transformer Math 101 guide, which provides detailed formulas for transformer memory calculations.

πŸ“„ License

MIT License - see LICENSE for details.

πŸ“š Citation

If you use this tool in your research, please cite:

@software{gpu_mem_calculator,
  title = {GPU Memory Calculator for LLM Training},
  author = {GPU Mem Calculator Team},
  year = {2026},
  url = {https://github.com/George614/gpu-mem-calculator}
}

Made with ❀️ for the ML community

⭐ Star us on GitHub β€’ πŸ› Report a Bug β€’ πŸ’‘ Request a Feature

About

GPU Memory Calculator for LLM Training - Calculate GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

 
 
 

Contributors