Skip to content

[DLIO] Persistent Workers Cause Stale File Shards and Cached Reads #475

@wolfgang-desalvador

Description

@wolfgang-desalvador

Summary

When persistent_workers=True (the default for num_workers > 0), PyTorch DataLoader worker processes are kept alive across epochs. This causes two compounding issues:

  1. Stale file assignments: Workers capture a pickled snapshot of ConfigArguments at spawn time and never receive updates from reconfigure(), so cross-epoch file shuffling and resharding have no effect.
  2. Skipped I/O: The local filesystem reader's in-memory cache (self._local_cache) persists across epochs, so files read in epoch 1 are never re-read from storage — subsequent epochs measure memory bandwidth, not storage throughput.

Both issues stem from the same root cause: persistent workers retain process-level state that should be reset each epoch.

Root Cause

Stale file assignments

In dlio_benchmark/data_loader/torch_data_loader.py, the TorchDataset class serializes the config at construction:

class TorchDataset(Dataset):
    def __init__(self, ...):
        args = ConfigArguments.get_instance()
        self.serial_args = pickle.dumps(args)    # snapshot at construction

    def worker_init(self, worker_id):
        pickle.loads(self.serial_args)           # restores stale snapshot

With persistent_workers=True, workers are spawned once and worker_init runs only at first spawn. The serial_args snapshot is never updated, so workers always see epoch 1's file_list_train, train_file_map, train_global_index_map, etc. — even after reconfigure() shuffles or reshards files in the main process.

persistent_workers=True is set unconditionally in three places in torch_data_loader.py (lines 558, 592, and 627):

if torch.__version__ != '1.3.1':
    kwargs['persistent_workers'] = True

Skipped I/O (consequence of persistence)

In dlio_benchmark/reader/_local_fs_iterable_mixin.py, the _localfs_ensure_cached method checks whether a file is already in the worker's in-memory cache:

def _localfs_ensure_cached(self, filename: str) -> None:
    if filename not in self._local_cache:       # <-- skips read if cached
        ...

Without persistent workers, self._local_cache is reset each epoch (new process = empty dict). With persistent workers, the cache from epoch 1 persists, so every file is found in-cache and the storage read is skipped entirely.

Note: this cache issue is only relevant because persistent workers keep the process alive. Without persistent_workers=True, each epoch spawns fresh workers with an empty _local_cache.

Impact

Epoch File assignment I/O behavior Measurement validity
1 Correct ✓ Reads from storage ✓ Valid
2+ Stale (epoch 1's files) ✗ Served from process memory ✗ Invalid — measures memory, not storage
  • Inflated AU: Epochs 2+ report near-perfect Accelerator Utilization because "reads" are dict lookups, not real I/O.
  • No cross-epoch shuffling: Despite file_shuffle: seed in the workload config, workers read the same files in the same order every epoch.
  • Averaged metrics mislead: A run with 8 epochs measures storage for 1 epoch and memory for 7, producing an inflated average AU.

Suggested Fix

Remove persistent_workers=True so workers are re-spawned each epoch. They will pick up the current ConfigArguments (with reshuffled file lists) via serial_args at construction, and start with a fresh _local_cache:

# In all three DataLoader construction sites, remove:
# kwargs['persistent_workers'] = True

The worker spawn overhead (~2–5s for 4–8 workers) is negligible relative to typical epoch durations (minutes to hours for production-scale datasets). If spawn overhead is a concern, an alternative is to add a refresh_args() method that re-pickles the updated config before each epoch and explicitly clears any reader caches.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions