Training benchmark shows zero storage traffic from Epoch 2 but reports misleadingly high AU — possible caching issue #464

@litianqi00315

Description:

I'm testing the unet3d training benchmark and observed the following behavior:

  • Epoch 1: Read traffic to the storage system is visible (confirmed via network card monitoring)
  • Epochs 2–5: Zero network/storage traffic, yet the final reported AU (Accelerator Utilization) is very high

Environment:

  • Number of client nodes: 16
  • Memory per client: 256 GB
  • Total dataset size: 37 TB (generated by 16 clients)
  • Model: unet3d
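
For reference, the sizing arithmetic behind these numbers can be sketched as below. This is only an illustration: it assumes decimal units (1 TB = 1000 GB) and assumes the 5× requirement is measured against the combined memory of all clients. The function name `sizing` is hypothetical and not part of the benchmark.

```python
# Figures taken from the environment list above.
NUM_CLIENTS = 16
MEM_PER_CLIENT_GB = 256
DATASET_TB = 37
REQUIRED_RATIO = 5  # assumed to mean dataset size >= 5x total client memory


def sizing(num_clients, mem_per_client_gb, dataset_tb):
    """Return (dataset TB per client, total client memory in TB, dataset/memory ratio)."""
    per_client_tb = dataset_tb / num_clients
    total_mem_tb = num_clients * mem_per_client_gb / 1000  # decimal units assumed
    ratio = dataset_tb / total_mem_tb
    return per_client_tb, total_mem_tb, ratio


per_client_tb, total_mem_tb, ratio = sizing(NUM_CLIENTS, MEM_PER_CLIENT_GB, DATASET_TB)
print(f"{per_client_tb:.2f} TB per client, {ratio:.1f}x total client memory")
print("meets 5x rule" if ratio >= REQUIRED_RATIO else "below 5x rule")
```

Under these assumptions the dataset is roughly 9× the combined client memory, so client page cache alone should not be able to hold it.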

Questions:

  1. With 37 TB of data across 16 clients (about 2.3 TB per client against 256 GB of RAM each), the dataset cannot fit in client memory. Could the missing traffic be caused by storage-server-side caching (e.g., storage array DRAM/NVMe cache)? Or is the DLIO data loader holding internal buffers across epochs, so that later epochs bypass actual I/O?
  2. The drop_caches call runs on a single rank and fails silently — is this a known limitation? Should it be executed on all nodes, and should it fail loudly when sudo is unavailable?
  3. Should the training run command validate at startup whether the dataset meets the 5× memory requirement, and refuse to run if it doesn't?
  4. For distributed/shared filesystem scenarios, is there a recommended way for users to verify that results are not contaminated by caching? (e.g., checking potential_caching in summary.json, or independently monitoring storage-side traffic?)

Thanks for any guidance!
