Training benchmark shows zero storage traffic from Epoch 2 but reports misleadingly high AU — possible caching issue

Description:

  I'm testing the unet3d training benchmark and observed the following behavior:
  - Epoch 1: Read traffic to the storage system is visible (confirmed via network card monitoring)
  - Epochs 2–5: Zero network/storage traffic, yet the final reported AU (Accelerator Utilization) is very high
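
For anyone trying to reproduce the observation above, one way to compare per-epoch traffic on a client is to snapshot the kernel's interface counters before and after each epoch. This is only a minimal sketch and not part of the benchmark: the helper names `parse_rx_bytes`, `read_rx_bytes`, and `rx_delta` are hypothetical, and it assumes a Linux client whose storage traffic goes through an interface listed in /proc/net/dev. If the storage client uses RDMA, these counters may not reflect the traffic.

```python
from pathlib import Path


def parse_rx_bytes(proc_net_dev_text):
    """Return {interface: received_bytes} parsed from the text of /proc/net/dev."""
    counters = {}
    # The first two lines are column headers; each later line is "iface: rx_bytes ...".
    for line in proc_net_dev_text.splitlines()[2:]:
        name, _, fields = line.partition(":")
        if fields.strip():
            # The first numeric column after the colon is the received byte count.
            counters[name.strip()] = int(fields.split()[0])
    return counters


def read_rx_bytes(path="/proc/net/dev"):
    """Snapshot the current received-byte counters on this node."""
    return parse_rx_bytes(Path(path).read_text())


def rx_delta(before, after):
    """Bytes received per interface between two snapshots."""
    return {iface: after[iface] - before[iface] for iface in after if iface in before}
```

Taking one snapshot with `read_rx_bytes()` at the start of an epoch and another at the end, then passing both to `rx_delta`, gives the bytes each interface received during that epoch. A delta near zero on the storage-facing interface for Epochs 2–5 would match what is described above.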
                                                                                                              
  Environment:
  - Number of client nodes: 16
  - Memory per client: 256 GB
  - Total dataset size: 37 TB (generated by 16 clients)
  - Model: unet3d
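
As a quick sanity check on these numbers, the arithmetic can be sketched as below. The function name `min_dataset_bytes` is hypothetical, decimal units (1 GB = 10^9 bytes) are assumed, and applying the 5× factor to the aggregate memory of all clients is my assumption about how the requirement is meant to be evaluated.

```python
GB = 10**9
TB = 10**12


def min_dataset_bytes(num_clients, mem_per_client_bytes, factor=5):
    """Smallest dataset size that is `factor` times the aggregate client memory."""
    return factor * num_clients * mem_per_client_bytes


num_clients = 16
mem_per_client = 256 * GB
dataset_size = 37 * TB

aggregate_memory = num_clients * mem_per_client            # 4.096 TB in total
required = min_dataset_bytes(num_clients, mem_per_client)  # 20.48 TB
per_client_share = dataset_size / num_clients              # 2.3125 TB per client

print(f"aggregate client memory: {aggregate_memory / TB:.3f} TB")
print(f"5x requirement:          {required / TB:.2f} TB")
print(f"dataset share / RAM:     {per_client_share / mem_per_client:.1f}x")
print(f"meets requirement:       {dataset_size >= required}")
```

Under these assumptions the 37 TB dataset is roughly 9 times the aggregate client memory, so the client page cache alone should not be able to hold it.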

  Questions:

  1. With 37 TB of data across 16 clients (about 2.3 TB per client against 256 GB of RAM each), the dataset should not fit in client memory. Could the missing traffic be caused by storage-server-side caching (e.g., storage array DRAM/NVMe cache)? Or is the DLIO data loader holding internal buffers across epochs that bypass actual I/O?
  2. The drop_caches call runs on a single rank and fails silently. Is this a known limitation? Should it be executed on all nodes, and should it fail loudly when sudo is unavailable?
  3. Should the training run command validate at startup whether the dataset meets the 5× memory requirement, and refuse to run if it doesn't?
  4. For distributed/shared filesystem scenarios, is there a recommended way for users to verify that results are not contaminated by caching (e.g., checking potential_caching in summary.json, or independently monitoring storage-side traffic)?
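
To make question 2 concrete, here is a rough sketch of what a "fail loudly" cache drop could look like on a single Linux node. It is not the benchmark's current code: `drop_page_cache` and `cached_kib` are hypothetical helpers, and the sketch assumes the process has permission to write to /proc/sys/vm/drop_caches (normally root). It only clears the client's own page cache and has no effect on storage-server caches.

```python
import os


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes.

    Any failure (for example PermissionError without root) propagates to the
    caller instead of being swallowed.
    """
    os.sync()
    with open(path, "w") as f:
        f.write("3\n")


def cached_kib(meminfo_text):
    """Return the 'Cached:' value in kB from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            return int(line.split()[1])
    raise ValueError("no 'Cached:' line found")
```

Each client node would need to run `drop_page_cache()` itself between epochs, since dropping caches on one rank's node leaves the other nodes untouched. Reading /proc/meminfo before and after and comparing `cached_kib` is one way to confirm the drop actually happened.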
                                                                                                                                                                                                       
  Thanks for any guidance!

Training benchmark shows zero storage traffic from Epoch 2 but reports misleadingly high AU — possible caching issue #464
