Problem
The training file list is materialized as list[str] in ConfigArguments.file_list_train and then copied multiple times along the pipeline. At MLPerf required num_files_train the duplication exceeds node RAM and the run is OOM killed during DataLoader worker spawn. This is evident in workloads like retinanet which by design is leveraging millions of files
Duplication points
main.py initialize(): every rank calls self.storage.walk_node(...) independently.
utils/config.py ConfigArguments: list stored as ClassVar[List[str]].
data_loader/torch_data_loader.py:153,372: every worker calls list(self.reader._file_list).
- Spawn pickling: full list pickled into each of
ranks × read_threads workers.
utils/config.py VirtualIndexMap.__init__: builds [os.path.abspath(f) for f in file_list] per epoch.
utils/config.py build_sample_map_iter: allocates fresh os.path.abspath strings per sample.
Observed failure
retinanet closed, N=7,646,857, 2 nodes × 12 ranks × 16 read_threads, 256 GiB/node: OOM during DataLoader worker spawn. Steady state would be roughly 430 GB/node; spawn time pickle transient pushes peak to roughly 480 GB/node.
The issue increases with increasing number of nodes because of increasing num_files_train. This increases the RAM requirements for each node increase
Why this blocks multi-node scaling
MLPerf Storage requires dataset_bytes ≥ memory_multiplier × num_nodes × RAM_per_node to defeat page cache, so num_files_train scales linearly with cluster size. DLIO's per process footprint also scales with num_files_train.
As an example: on 256 GiB nodes, the worker copy term (item 3 + item 4) caps num_files_train at roughly 1.4M, while the MLPerf rule requires roughly 3.5M per node. Result: closed runs cannot exceed roughly 2 nodes on this hardware. Having larger memory nodes does not help proportionally because num_files_train scales with total cluster RAM.
Submitters work around this by lowering read_threads (hides true I/O throughput) or lowering num_files_train (page cache dominates). Both invalidate the submission's intent.
Impact
- Blocks submissions of
retinanet at scale, since it increases the required RAM per single node with the number of train files increasing with the number of nodes
Problem
The training file list is materialized as
list[str]inConfigArguments.file_list_trainand then copied multiple times along the pipeline. At MLPerf requirednum_files_trainthe duplication exceeds node RAM and the run is OOM killed during DataLoader worker spawn. This is evident in workloads likeretinanetwhich by design is leveraging millions of filesDuplication points
main.pyinitialize(): every rank callsself.storage.walk_node(...)independently.utils/config.pyConfigArguments: list stored asClassVar[List[str]].data_loader/torch_data_loader.py:153,372: every worker callslist(self.reader._file_list).ranks × read_threadsworkers.utils/config.pyVirtualIndexMap.__init__: builds[os.path.abspath(f) for f in file_list]per epoch.utils/config.pybuild_sample_map_iter: allocates freshos.path.abspathstrings per sample.Observed failure
retinanetclosed,N=7,646,857, 2 nodes × 12 ranks × 16 read_threads, 256 GiB/node: OOM during DataLoader worker spawn. Steady state would be roughly 430 GB/node; spawn time pickle transient pushes peak to roughly 480 GB/node.The issue increases with increasing number of nodes because of increasing
num_files_train. This increases the RAM requirements for each node increaseWhy this blocks multi-node scaling
MLPerf Storage requires
dataset_bytes ≥ memory_multiplier × num_nodes × RAM_per_nodeto defeat page cache, sonum_files_trainscales linearly with cluster size. DLIO's per process footprint also scales withnum_files_train.As an example: on 256 GiB nodes, the worker copy term (item 3 + item 4) caps
num_files_trainat roughly 1.4M, while the MLPerf rule requires roughly 3.5M per node. Result: closed runs cannot exceed roughly 2 nodes on this hardware. Having larger memory nodes does not help proportionally becausenum_files_trainscales with total cluster RAM.Submitters work around this by lowering
read_threads(hides true I/O throughput) or loweringnum_files_train(page cache dominates). Both invalidate the submission's intent.Impact
retinanetat scale, since it increases the required RAM per single node with the number of train files increasing with the number of nodes