File-list duplication and per-process memory blow-up at scale (upstream `main`, before any of our local patches)


## Problem

The training file list is materialized as a `list[str]` in `ConfigArguments.file_list_train` and then copied multiple times along the pipeline. At the `num_files_train` that MLPerf requires, the duplicated copies exceed node RAM and the run is OOM-killed during DataLoader worker spawn. This is most visible in workloads such as `retinanet`, which by design use millions of files.

## Duplication points

1. `main.py` `initialize()`: every rank calls `self.storage.walk_node(...)` independently.
2. `utils/config.py` `ConfigArguments`: list stored as `ClassVar[List[str]]`.
3. `data_loader/torch_data_loader.py:153,372`: every worker calls `list(self.reader._file_list)`.
4. Spawn pickling: full list pickled into each of `ranks × read_threads` workers.
5. `utils/config.py` `VirtualIndexMap.__init__`: builds `[os.path.abspath(f) for f in file_list]` per epoch.
6. `utils/config.py` `build_sample_map_iter`: allocates fresh `os.path.abspath` strings per sample.
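The copies listed above do not all cost the same. The sketch below is standalone Python, not DLIO code, and the file names are hypothetical. It shows that `list(...)` duplicates only the pointer array within a process, while a pickle round trip (as happens when workers are spawned) duplicates every string object as well.

```python
import pickle
import sys


def list_footprint_bytes(paths):
    """Approximate bytes held by a list[str]: pointer array plus each str object."""
    return sys.getsizeof(paths) + sum(sys.getsizeof(p) for p in paths)


# Hypothetical file names, for illustration only.
file_list = [f"/data/train/img_{i:08d}.npz" for i in range(1000)]

# Shallow copy: new pointer array, same str objects.
shallow = list(file_list)

# Pickle round trip: new pointer array and new str objects.
unpickled = pickle.loads(pickle.dumps(file_list))

shares_strings_shallow = all(a is b for a, b in zip(file_list, shallow))
shares_strings_pickled = any(a is b for a, b in zip(file_list, unpickled))

print("one full copy:", list_footprint_bytes(file_list), "bytes")
print("shallow copy shares strings:", shares_strings_shallow)
print("pickled copy shares strings:", shares_strings_pickled)
```

Calls such as `os.path.abspath` return new string objects in the general case, so items 5 and 6 behave more like the pickled copy than the shallow one.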

## Observed failure

`retinanet` closed, `N=7,646,857`, 2 nodes × 12 ranks × 16 read_threads, 256 GiB/node: OOM during DataLoader worker spawn. Steady state would be roughly 430 GB/node; the spawn-time pickle transient pushes the peak to roughly 480 GB/node.
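As a rough illustration of how the per-node footprint grows with the number of processes, the following sketch multiplies the cost of one private copy by the number of ranks and workers. It is a simplified model, not the calculation behind the figures above: the function name, the 100-character average path length, the reading of "12 ranks" as per node, and the one-copy-per-worker default are all assumptions.

```python
import struct
import sys


def estimate_node_bytes(num_files, ranks_per_node, read_threads,
                        avg_path_chars, copies_per_worker=1):
    """Rough bytes per node spent on file-list copies (illustrative model only)."""
    pointer = struct.calcsize("P")                 # one list slot per path
    per_str = sys.getsizeof("x" * avg_path_chars)  # one ASCII str object
    per_copy = num_files * (pointer + per_str)     # one full private copy
    workers = ranks_per_node * read_threads
    # Each rank holds one copy; each worker holds `copies_per_worker` copies.
    return per_copy * (ranks_per_node + workers * copies_per_worker)


# N and read_threads are from the report; 12 ranks per node and
# 100-character paths are assumptions made for this sketch.
estimate = estimate_node_bytes(
    num_files=7_646_857, ranks_per_node=12, read_threads=16, avg_path_chars=100
)
print(f"{estimate / 1e9:.0f} GB per node under these assumptions")
```

Raising `copies_per_worker` models additional private copies inside each worker, such as the ones described in the duplication points.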

The problem worsens as nodes are added: `num_files_train` grows with cluster size, so every added node raises the RAM that each node needs.

## Why this blocks multi-node scaling

MLPerf Storage requires `dataset_bytes ≥ memory_multiplier × num_nodes × RAM_per_node` to defeat page cache, so `num_files_train` scales linearly with cluster size. DLIO's per-process footprint also scales with `num_files_train`.
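The sizing rule can be turned into a minimum file count by dividing the required dataset size by the size of one file. The sketch below does that for a few cluster sizes. The helper name, the multiplier of 5, and the 400,000-byte file size are placeholders chosen for illustration, not values taken from the benchmark rules or from this report.

```python
GIB = 1024 ** 3


def min_num_files(memory_multiplier, num_nodes, ram_per_node_bytes, bytes_per_file):
    """Smallest file count whose total size meets the dataset-size rule."""
    required_bytes = memory_multiplier * num_nodes * ram_per_node_bytes
    return -(-required_bytes // bytes_per_file)  # integer ceiling division


# Placeholder inputs: multiplier and file size are assumptions for this sketch.
files_by_nodes = {
    n: min_num_files(memory_multiplier=5, num_nodes=n,
                     ram_per_node_bytes=256 * GIB, bytes_per_file=400_000)
    for n in (1, 2, 4, 8)
}

for nodes, files in files_by_nodes.items():
    print(f"{nodes} node(s): at least {files:,} files")
```

Doubling `num_nodes` doubles the required file count, and since every process on every node holds the full list, the per-node memory spent on the list doubles as well.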

**As an example:** on 256 GiB nodes, the worker copy term (items 3 and 4) caps `num_files_train` at roughly 1.4M, while the MLPerf rule requires roughly 3.5M per node. Result: closed runs cannot exceed roughly 2 nodes on this hardware. Nodes with more memory do not help proportionally, because the required `num_files_train` scales with total cluster RAM.

Submitters work around this by lowering `read_threads` (hides true I/O throughput) or lowering `num_files_train` (page cache dominates). Both invalidate the submission's intent.

## Impact

- Blocks `retinanet` submissions at scale: the required `num_files_train` grows with the number of nodes, and with it the RAM each node needs to hold its copies of the file list.

