OOM Issues — Unexpected Single-GPU Batch Size Memory Usage & Multi-GPU OOM Errors #43

@yiping-tks

Hello, I encountered memory-related issues while training with the train_lotus_g_depth.sh script and would appreciate your guidance:

  1. Single-GPU Training:
  • When setting BATCH_SIZE=4 in train_lotus_g_depth.sh, the GPU memory usage is about 22GB, and training runs normally.
  • However, setting BATCH_SIZE=1 causes the memory usage to increase to around 23GB, which seems counterintuitive.
  • The accelerate config used is as follows:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  2. Multi-GPU Training:
  • When configuring multi-GPU training, setting BATCH_SIZE to either 1 or 4 leads to out-of-memory (OOM) errors.
  • The accelerate config used is as follows:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '0,1,2,3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
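
For context when comparing the two configs, here is a minimal sketch of how the per-process batch size relates to the global batch size. It assumes that BATCH_SIZE in train_lotus_g_depth.sh is passed through as the per-process batch size and that Accelerate's default batching is in effect (each process loads its own batch); neither assumption is confirmed by the script itself. The function name `global_batch_size` and the `grad_accum_steps` parameter are hypothetical, introduced only for illustration.

```python
def global_batch_size(per_device_batch, num_processes, grad_accum_steps=1):
    """Number of samples contributing to one optimizer step across all processes.

    Assumption: every process (one per GPU) loads its own batch of
    `per_device_batch` samples, so adding GPUs grows the global batch
    rather than shrinking the per-GPU batch.
    """
    if min(per_device_batch, num_processes, grad_accum_steps) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * num_processes * grad_accum_steps


# Compare the two setups described above (num_processes: 1 vs. 4).
for label, num_processes in (("single-GPU", 1), ("multi-GPU", 4)):
    for batch in (1, 4):
        total = global_batch_size(batch, num_processes)
        print(f"{label}: BATCH_SIZE={batch} -> global batch {total}")
```

If these assumptions hold, each GPU in the MULTI_GPU setup still processes a full BATCH_SIZE batch and keeps a full model replica, so per-GPU memory would not be expected to fall below the single-GPU figure, and any extra communication buffers could push a nearly full card over its limit.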

What could cause the memory usage anomaly in single-GPU training? Could you provide recommended multi-GPU training config examples or advice?
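
To help narrow down the single-GPU numbers, here is a minimal diagnostic sketch that separates memory held by live tensors from memory reserved by PyTorch's caching allocator. If the 22GB and 23GB figures came from nvidia-smi, they would include reserved-but-unused cache as well as the CUDA context, so a small difference may not reflect the batch itself; this is one possible explanation, not a confirmed cause. The helper names `bytes_to_gib` and `peak_memory_report` are hypothetical, and where to call the report inside the training loop is left as an assumption.

```python
try:
    import torch  # optional: the sketch degrades gracefully without PyTorch
except ImportError:
    torch = None


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def peak_memory_report(device=0):
    """Return peak allocated/reserved memory in GiB, or None without CUDA.

    "allocated" counts memory occupied by tensors; "reserved" also includes
    blocks cached by the allocator, which is closer to what nvidia-smi shows.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "peak_allocated_gib": bytes_to_gib(torch.cuda.max_memory_allocated(device)),
        "peak_reserved_gib": bytes_to_gib(torch.cuda.max_memory_reserved(device)),
    }


# Hypothetical usage: call after a few training steps for each BATCH_SIZE.
print(peak_memory_report())
```

Comparing `peak_allocated_gib` between BATCH_SIZE=1 and BATCH_SIZE=4 should show whether the tensors themselves need more memory at the smaller batch size, or whether only the cached (reserved) amount differs.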

Thank you very much!
