File-list duplication and per-process memory blow-up at scale (upstream main, before any of our local patches) #449

@wolfgang-desalvador

Problem

The training file list is materialized as list[str] in ConfigArguments.file_list_train and then copied multiple times along the pipeline. At the num_files_train that MLPerf requires, the duplication exceeds node RAM and the run is OOM-killed during DataLoader worker spawn. This shows up in workloads like retinanet, which by design uses millions of files.

Duplication points

  1. main.py initialize(): every rank calls self.storage.walk_node(...) independently.
  2. utils/config.py ConfigArguments: list stored as ClassVar[List[str]].
  3. data_loader/torch_data_loader.py:153,372: every worker calls list(self.reader._file_list).
  4. Spawn pickling: full list pickled into each of ranks × read_threads workers.
  5. utils/config.py VirtualIndexMap.__init__: builds [os.path.abspath(f) for f in file_list] per epoch.
  6. utils/config.py build_sample_map_iter: allocates fresh os.path.abspath strings per sample.
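
To make the cost of each copy concrete, here is a minimal sketch that reproduces the kinds of copies listed above on a synthetic list. The path layout and the helper list_footprint_bytes are hypothetical and are not part of DLIO; the sketch only shows which operations share string objects and which create new ones.

```python
import os
import pickle
import sys


def list_footprint_bytes(paths):
    """Approximate bytes held by a list[str]: the list object plus every str in it."""
    return sys.getsizeof(paths) + sum(sys.getsizeof(p) for p in paths)


# Hypothetical path layout, for illustration only.
file_list = [f"/data/train/img_{i:08d}_of_00100000.npz" for i in range(100_000)]

# Item 3: list(...) builds a new list object. Within one process the str
# objects are still shared, so only the pointer array is duplicated.
worker_view = list(file_list)

# Item 4: under spawn the list is pickled and rebuilt in each worker process,
# so every worker holds its own list and its own str objects.
payload = pickle.dumps(file_list)
worker_copy = pickle.loads(payload)

# Item 5: a per-epoch comprehension builds another full list.
abs_list = [os.path.abspath(f) for f in file_list]

print("one copy (bytes):", list_footprint_bytes(file_list))
print("pickle payload (bytes):", len(payload))
```

Multiplying the "one copy" figure by ranks × read_threads gives a lower bound on what item 4 alone costs per node.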

Observed failure

retinanet closed, N=7,646,857 files, 2 nodes × 12 ranks × 16 read_threads, 256 GiB/node: OOM during DataLoader worker spawn. Steady state would be roughly 430 GB/node; the spawn-time pickling transient pushes the peak to roughly 480 GB/node.

The problem gets worse as nodes are added: num_files_train grows with the node count, and every worker on every node holds the full list, so the RAM required per node rises with each node added.
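
As a back-of-envelope check on the figures above, the sketch below models the per-node footprint as files × workers × bytes per entry. It assumes that "12 ranks" means 12 ranks per node and that the whole steady-state figure comes from worker copies; both are assumptions, and the function names are hypothetical.

```python
def node_footprint_gb(num_files, ranks_per_node, read_threads, bytes_per_entry):
    """Estimated per-node footprint in GB if every worker holds the full list."""
    workers = ranks_per_node * read_threads
    return num_files * workers * bytes_per_entry / 1e9


def implied_bytes_per_entry(footprint_gb, num_files, ranks_per_node, read_threads):
    """Invert the model: bytes per (file, worker) pair implied by a footprint."""
    workers = ranks_per_node * read_threads
    return footprint_gb * 1e9 / (num_files * workers)


N = 7_646_857          # num_files_train from the failing run
RANKS_PER_NODE = 12    # assumption: ranks are counted per node
READ_THREADS = 16

per_entry = implied_bytes_per_entry(430, N, RANKS_PER_NODE, READ_THREADS)
print(f"implied bytes per file per worker: {per_entry:.0f}")

# The model is linear in num_files, so doubling the file count doubles the footprint.
print(node_footprint_gb(2 * N, RANKS_PER_NODE, READ_THREADS, per_entry))
```

Under these assumptions the reported steady state corresponds to a few hundred bytes per file per worker, which is plausible for a path string plus its derived copies.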

Why this blocks multi-node scaling

MLPerf Storage requires dataset_bytes ≥ memory_multiplier × num_nodes × RAM_per_node to defeat the page cache, so num_files_train scales linearly with cluster size. DLIO's per-process footprint also scales with num_files_train.
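
The sizing rule can be turned into a minimum file count as sketched below. The memory_multiplier and bytes_per_file values are placeholders chosen for illustration, not values taken from the MLPerf Storage rules or from the retinanet dataset, and min_num_files_train is a hypothetical helper.

```python
def min_num_files_train(memory_multiplier, num_nodes, ram_per_node_bytes, bytes_per_file):
    """Smallest file count with dataset_bytes >= multiplier * nodes * RAM per node."""
    required_bytes = memory_multiplier * num_nodes * ram_per_node_bytes
    return -(-required_bytes // bytes_per_file)  # integer ceiling division


GIB = 2**30
MEMORY_MULTIPLIER = 5      # placeholder value
BYTES_PER_FILE = 400_000   # placeholder value

for nodes in (1, 2, 4, 8):
    files = min_num_files_train(MEMORY_MULTIPLIER, nodes, 256 * GIB, BYTES_PER_FILE)
    print(nodes, files)
```

The required file count grows linearly with num_nodes, and since each worker holds the whole list, so does the per-node footprint.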

As an example: on 256 GiB nodes, the worker copy term (items 3 and 4) caps num_files_train at roughly 1.4M, while the MLPerf rule requires roughly 3.5M per node. Result: closed runs cannot exceed roughly 2 nodes on this hardware. Larger-memory nodes do not help proportionally, because num_files_train scales with total cluster RAM.

Submitters work around this by lowering read_threads (hides true I/O throughput) or lowering num_files_train (page cache dominates). Both invalidate the submission's intent.

Impact

  • Blocks retinanet submissions at scale: the number of training files grows with the number of nodes, and the RAM required on each node grows with it.

Metadata

Labels

Training, bug
