[DLIO] Persistent Workers Cause Stale File Shards and Cached Reads #475

## Summary

When `persistent_workers=True` (DLIO's default when `num_workers > 0`; PyTorch itself defaults to `False`), PyTorch DataLoader worker processes are kept alive across epochs. This causes two compounding issues:

1. **Stale file assignments**: Workers capture a pickled snapshot of `ConfigArguments` at spawn time and never receive updates from `reconfigure()`, so cross-epoch file shuffling and resharding have no effect.
2. **Skipped I/O**: The local filesystem reader's in-memory cache (`self._local_cache`) persists across epochs, so files read in epoch 1 are never re-read from storage — subsequent epochs measure memory bandwidth, not storage throughput.

Both issues stem from the same root cause: persistent workers retain process-level state that should be reset each epoch.

## Root Cause

### Stale file assignments

In [`dlio_benchmark/data_loader/torch_data_loader.py`](https://github.com/mlcommons/DLIO_local_changes/blob/main/dlio_benchmark/data_loader/torch_data_loader.py), the `TorchDataset` class serializes the config at construction:

```python
class TorchDataset(Dataset):
    def __init__(self, ...):
        args = ConfigArguments.get_instance()
        self.serial_args = pickle.dumps(args)    # snapshot at construction

    def worker_init(self, worker_id):
        pickle.loads(self.serial_args)           # restores stale snapshot
```

With `persistent_workers=True`, workers are spawned once and `worker_init` runs only at first spawn. The `serial_args` snapshot is never updated, so workers always see epoch 1's `file_list_train`, `train_file_map`, `train_global_index_map`, etc. — even after `reconfigure()` shuffles or reshards files in the main process.
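The snapshot behavior can be reproduced without PyTorch. The following is a minimal sketch, assuming a plain dict as a hypothetical stand-in for `ConfigArguments` and a list reassignment as a stand-in for `reconfigure()`; it only illustrates that a pickled snapshot does not track later changes to the live object.

```python
import pickle


def snapshot_then_reshuffle():
    """Pickle a config, change the live object, then restore the snapshot."""
    # Hypothetical stand-in for ConfigArguments; the real class has many more fields.
    live = {"file_list_train": ["a.npz", "b.npz", "c.npz"]}

    # Taken once, as TorchDataset.__init__ does with serial_args.
    serial_args = pickle.dumps(live)

    # Stand-in for reconfigure(): the main process reorders files for the next epoch.
    live["file_list_train"] = ["c.npz", "a.npz", "b.npz"]

    # What a long-lived worker holds: the state restored from the old snapshot.
    restored = pickle.loads(serial_args)
    return live["file_list_train"], restored["file_list_train"]


main_view, worker_view = snapshot_then_reshuffle()
print("main process sees:", main_view)
print("worker sees:      ", worker_view)
```

The worker-side list keeps the original order because `pickle.loads` rebuilds an independent copy from bytes captured before the reshuffle.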

`persistent_workers=True` is hard-coded in three places in `torch_data_loader.py` (lines 558, 592, and 627). It is gated only by a PyTorch version check and is applied regardless of the workload configuration:

```python
if torch.__version__ != '1.3.1':
    kwargs['persistent_workers'] = True
```

### Skipped I/O (consequence of persistence)

In [`dlio_benchmark/reader/_local_fs_iterable_mixin.py`](https://github.com/mlcommons/DLIO_local_changes/blob/main/dlio_benchmark/reader/_local_fs_iterable_mixin.py), the `_localfs_ensure_cached` method checks whether a file is already in the worker's in-memory cache:

```python
def _localfs_ensure_cached(self, filename: str) -> None:
    if filename not in self._local_cache:       # <-- skips read if cached
        ...
```

Without persistent workers, each epoch spawns fresh worker processes, so `self._local_cache` starts as an empty dict and every file is read from storage. With persistent workers, the cache populated in epoch 1 survives, so every file is found in the cache and the storage read is skipped entirely.

Note: this cache is a problem **only** because `persistent_workers=True` keeps the worker process alive.
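The effect of a long-lived cache on read counts can be illustrated in isolation. This is a sketch under stated assumptions: `CachingReader` is a hypothetical class loosely modeled on the `_local_cache` check above, and "storage" is simulated by a counter rather than real file I/O.

```python
class CachingReader:
    """Hypothetical reader with a per-instance cache (not DLIO's actual mixin)."""

    def __init__(self):
        self._local_cache = {}
        self.storage_reads = 0

    def _read_from_storage(self, filename):
        # Simulated storage access; a real reader would open the file here.
        self.storage_reads += 1
        return f"contents of {filename}".encode()

    def read(self, filename):
        # Same shape as the check in _localfs_ensure_cached: skip I/O on a hit.
        if filename not in self._local_cache:
            self._local_cache[filename] = self._read_from_storage(filename)
        return self._local_cache[filename]


def storage_reads_per_epoch(files, epochs, persistent):
    """Return the number of simulated storage reads in each epoch."""
    reader = CachingReader()
    counts = []
    for _ in range(epochs):
        if not persistent:
            reader = CachingReader()  # fresh worker process: empty cache
        before = reader.storage_reads
        for name in files:
            reader.read(name)
        counts.append(reader.storage_reads - before)
    return counts


files = ["a.npz", "b.npz", "c.npz"]
persistent_counts = storage_reads_per_epoch(files, epochs=3, persistent=True)
fresh_counts = storage_reads_per_epoch(files, epochs=3, persistent=False)
print("persistent workers:", persistent_counts)
print("fresh workers:     ", fresh_counts)
```

With the reader kept alive, only the first epoch touches the simulated storage; recreating the reader each epoch restores one read per file per epoch.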

## Impact

| Epoch | File assignment | I/O behavior | Measurement validity |
|-------|----------------|--------------|---------------------|
| 1 | Correct ✓ | Reads from storage ✓ | Valid |
| 2+ | Stale (epoch 1's files) ✗ | Served from process memory ✗ | Invalid — measures memory, not storage |

- **Inflated AU**: Epochs 2+ report near-perfect Accelerator Utilization because "reads" are dict lookups, not real I/O.
- **No cross-epoch shuffling**: Despite `file_shuffle: seed` in the workload config, workers read the same files in the same order every epoch.
- **Averaged metrics mislead**: A run with 8 epochs measures storage for 1 epoch and memory for 7, producing an inflated average AU.
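To make the averaging effect concrete, here is a small arithmetic sketch. The AU values are hypothetical placeholders chosen for illustration, not measurements from any DLIO run.

```python
def mean_au(per_epoch_au):
    """Plain arithmetic mean of per-epoch Accelerator Utilization values."""
    return sum(per_epoch_au) / len(per_epoch_au)


# Hypothetical values: one storage-bound epoch, seven cache-served epochs.
storage_bound_au = 0.60
cached_au = 0.99
per_epoch = [storage_bound_au] + [cached_au] * 7

reported = mean_au(per_epoch)
print(f"epoch 1 AU:       {storage_bound_au:.2%}")
print(f"reported mean AU: {reported:.2%}")
```

Under these assumed numbers the reported mean sits far above the only epoch that actually exercised storage, which is why epoch 1 should be read separately from the rest.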

## Suggested Fix

Remove `persistent_workers=True` so workers are re-spawned each epoch. Assuming the `TorchDataset` is constructed after `reconfigure()` has run, its `serial_args` snapshot holds the reshuffled file lists, so the new workers pick up the current `ConfigArguments` and start with a fresh `_local_cache`:

```python
# In all three DataLoader construction sites, remove:
# kwargs['persistent_workers'] = True
```

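The worker spawn overhead (~2–5s for 4–8 workers) is negligible relative to typical epoch durations (minutes to hours for production-scale datasets). If spawn overhead is a concern, an alternative is to add a `refresh_args()` method that re-pickles the updated config before each epoch and explicitly clears any reader caches. Because persistent workers hold their own copy of the dataset, that method would also need a way to deliver the new snapshot to the workers that are already running.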
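If removing the flag outright is too blunt, one possible variant is to make persistence opt-in. The sketch below is an assumption, not existing DLIO code: `build_loader_kwargs` and its `persistent_workers` parameter are hypothetical names, and the function only builds the keyword dict that would be passed to `DataLoader`.

```python
def build_loader_kwargs(num_workers, persistent_workers=False):
    """Hypothetical helper: build DataLoader kwargs with persistence off by default."""
    kwargs = {"num_workers": num_workers}
    # PyTorch rejects persistent_workers=True when num_workers == 0,
    # so the flag is only forwarded when worker processes actually exist.
    if persistent_workers and num_workers > 0:
        kwargs["persistent_workers"] = True
    return kwargs


default_kwargs = build_loader_kwargs(num_workers=4)
opt_in_kwargs = build_loader_kwargs(num_workers=4, persistent_workers=True)
no_worker_kwargs = build_loader_kwargs(num_workers=0, persistent_workers=True)
print(default_kwargs)
print(opt_in_kwargs)
print(no_worker_kwargs)
```

Defaulting to non-persistent workers keeps multi-epoch measurements valid, while users who opt in accept the stale-state behavior described above.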

