Description:
I'm testing the unet3d training benchmark and observed the following behavior:
- Epoch 1: Read traffic to the storage system is visible (confirmed via network card monitoring)
- Epochs 2–5: Zero network/storage traffic, yet the final reported AU (Accelerator Utilization) is very high
Environment:
- Number of client nodes: 16
- Memory per client: 256 GB
- Total dataset size: 37 TB (generated by 16 clients)
- Model: unet3d
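For reference, here is the arithmetic behind the sizing numbers above, as a minimal sketch. It assumes decimal units (1 TB = 1000 GB) and assumes that the 5× rule I ask about below applies to aggregate client memory; the function and constant names are my own and are not part of the benchmark.

```python
# Sizing check for the environment described above.
# Assumptions: decimal units (1 TB = 1000 GB); the 5x rule is applied
# to the aggregate memory of all client nodes.
NUM_CLIENTS = 16
MEM_PER_CLIENT_GB = 256
DATASET_TB = 37
MIN_RATIO = 5  # assumed minimum dataset-to-memory ratio


def dataset_to_memory_ratio(dataset_tb, num_clients, mem_per_client_gb):
    """Return dataset size divided by total client memory (both in TB)."""
    total_mem_tb = num_clients * mem_per_client_gb / 1000
    return dataset_tb / total_mem_tb


ratio = dataset_to_memory_ratio(DATASET_TB, NUM_CLIENTS, MEM_PER_CLIENT_GB)
per_client_tb = DATASET_TB / NUM_CLIENTS

print(f"per-client share: {per_client_tb:.2f} TB")
print(f"dataset / total client memory: {ratio:.2f}x")
print(f"meets {MIN_RATIO}x rule: {ratio >= MIN_RATIO}")
```

By this estimate the dataset is roughly 9× the total client memory, which is why I do not expect the client page cache alone to explain the missing traffic.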
Questions:
- With 37 TB of data across 16 clients (about 2.3 TB per client versus 256 GB of RAM each), the data should not fit in client memory. Could the missing traffic be caused by storage-server-side caching (e.g., storage array DRAM/NVMe cache)? Or is the DLIO data loader holding internal buffers across epochs that bypass actual I/O?
- The drop_caches call runs on a single rank and fails silently. Is this a known limitation? Should it be executed on all nodes, and should it fail loudly when sudo is unavailable?
- Should the training run command validate at startup whether the dataset meets the 5× memory requirement, and refuse to run if it doesn't?
- For distributed/shared filesystem scenarios, is there a recommended way for users to verify that results are not contaminated by caching (e.g., checking potential_caching in summary.json, or independently monitoring storage-side traffic)?
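To make the last question concrete, this is the kind of check I have in mind, as a rough sketch only. I do not know where potential_caching sits inside summary.json or what type its value has, so the code searches the whole document recursively for that key; the helper names and the file path passed in are hypothetical.

```python
import json


def find_key(obj, key):
    """Collect every value stored under `key` anywhere in a parsed JSON object."""
    found = []
    if isinstance(obj, dict):
        for name, value in obj.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_key(item, key))
    return found


def potential_caching_flags(summary_path):
    """Return all potential_caching values found in the given summary file.

    The location and type of the key are assumptions, which is why the
    search is recursive rather than a fixed lookup.
    """
    with open(summary_path, "r", encoding="utf-8") as handle:
        summary = json.load(handle)
    return find_key(summary, "potential_caching")
```

Calling potential_caching_flags() on a run's summary.json would return an empty list if the key is absent, which would itself be worth knowing before trusting the reported AU.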
Thanks for any guidance!