The README designates multiple metrics as "PRIMARY" for MLPerf v3.0 submissions (§ Understanding Excel Performance Metrics):
- Storage Tier Read/Write Device P95 latency
- Tier Storage Read/Write Bandwidth (GB/s)
- Avg Throughput (tok/s) when gpu_mem=0, cpu_mem=0
The benchmark output also reports a PASS/FAIL verdict driven by P95 thresholds (e.g., NVMe Read P95 < 200ms, NVMe Write P95 < 500ms).
However, it's unclear whether these P95/P99 latency thresholds are:
- Validity gates — a run that fails the P95 criterion produces invalid or non-comparable throughput/bandwidth data and must be discarded, or
- Diagnostic signals only — a FAIL on P95 is informational and the throughput/bandwidth numbers are still reportable alongside the latency result.
The README notes: "This is not a pass/fail test. It is a diagnostic tool" (§ What This Benchmark Does), yet the output renders an explicit PASS/FAIL verdict. This creates ambiguity for submitters who need to know whether a FAIL on P95 invalidates the entire run.
Specific questions:
- Does a FAIL on P95 latency invalidate the throughput/bandwidth data for official submission purposes?
- Should submitters report runs that fail the P95 threshold, or only runs that pass?
- Given the high observed variance (CV 50–125%, § Discovery Test Key Findings), is it expected that some of the 3–5 required trials may fail P95 while others pass? If so, is the median taken across all trials or only passing trials?
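To make the last question concrete, here is a minimal sketch of how the two interpretations can produce different reported numbers. Everything in it is an assumption for illustration: the trial values are made up (not measured data), the field names and the `median_throughput` helper are hypothetical and not part of the benchmark, and the limits simply reuse the example thresholds quoted above (NVMe Read P95 < 200ms, NVMe Write P95 < 500ms).

```python
from statistics import median

# Example thresholds quoted in the benchmark output (assumed to apply per trial).
READ_P95_LIMIT_MS = 200.0
WRITE_P95_LIMIT_MS = 500.0

# Hypothetical trial results; illustrative values only, not measured data.
trials = [
    {"throughput_tok_s": 100.0, "read_p95_ms": 150.0, "write_p95_ms": 400.0},
    {"throughput_tok_s": 140.0, "read_p95_ms": 250.0, "write_p95_ms": 450.0},
    {"throughput_tok_s": 90.0, "read_p95_ms": 180.0, "write_p95_ms": 480.0},
    {"throughput_tok_s": 160.0, "read_p95_ms": 300.0, "write_p95_ms": 520.0},
    {"throughput_tok_s": 120.0, "read_p95_ms": 190.0, "write_p95_ms": 300.0},
]


def passes_p95(trial):
    """Return True if both read and write P95 are under their limits."""
    return (
        trial["read_p95_ms"] < READ_P95_LIMIT_MS
        and trial["write_p95_ms"] < WRITE_P95_LIMIT_MS
    )


def median_throughput(trials, passing_only):
    """Median throughput over all trials, or over P95-passing trials only."""
    selected = [t for t in trials if passes_p95(t)] if passing_only else list(trials)
    if not selected:
        return None  # no trial passed, so there is nothing to report
    return median(t["throughput_tok_s"] for t in selected)


print("median over all trials:    ", median_throughput(trials, passing_only=False))
print("median over passing trials:", median_throughput(trials, passing_only=True))
```

With these made-up numbers, two of the five trials fail the P95 check, and the median throughput is 120.0 tok/s across all trials but 100.0 tok/s across passing trials only. Two vendors with identical hardware could therefore report different results depending on which reading of the README they follow.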
A definitive answer here would prevent inconsistent submissions across vendors.