Skip to content

Split KV Cache readme and fmean instead of sum#482

Merged
FileSystemGuy merged 11 commits into
mlcommons:mainfrom
dslik:main
Jun 20, 2026
Merged

Split KV Cache readme and fmean instead of sum#482
FileSystemGuy merged 11 commits into
mlcommons:mainfrom
dslik:main

Conversation

@dslik

@dslik dslik commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

This PR splits the KV cache readme into two parts, one that provides basic instructions on how to run the benchmark for the purpose of a submission, and a second document that provides the detailed architecture, design and open parameters.

This PR also replaces the summation of results across the three runs with the fmean calculation across the three runs, fixing an inaccuracy in the calculated results.

@dslik dslik requested a review from a team June 19, 2026 22:22
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy FileSystemGuy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dslik I didn't see that the unit tests were failing....

@FileSystemGuy

Copy link
Copy Markdown
Contributor

@dslik Here's Claude's review notes of the unit tests on this PR:

PR #482 has 8 unit-test failures, grouped into three distinct issues. The diff is small (4 lines in kvcache.py, 2
lines in mlperf_wrapper.py).

  1. OPTION_PARAMS → WORKLOAD_PARAMS rename left tests stale (5 failures)

The PR renamed the module-level constant OPTION_PARAMS to WORKLOAD_PARAMS in
kv_cache_benchmark/mlperf_wrapper.py:22 (and its sole use at line 135), but didn't update
tests/unit/test_mlperf_wrapper.py, which still references the old name:

AttributeError: module 'mlperf_wrapper' has no attribute 'OPTION_PARAMS'

Affected tests: test_option_keys_are_1_2_3, test_option1_model_is_8b, test_option3_model_is_70b,
test_generation_mode_always_none, test_option2_cpu_mem_gb_is_4. All trivially fixable by renaming the references
in the test file.

  1. fmean over flattened (trial × rank) under-counts bandwidth (2 failures)

The PR changes mlpstorage_py/benchmarks/kvcache.py:400-403 from sum(...) to fmean(...) for read/write bandwidth
and avg/storage throughput. But all_read_bw etc. are populated by the nested for trial_dir / for rank_idx loops at
lines 370-371, so they hold one value per (trial × rank) — not per trial.

The two failing tests assert the old "additive across ranks" semantics:

test_sums_bandwidth_across_ranks: assert 2.0 == 4.0 (sum of 4 ranks of 1.0 each → 4.0; fmean → 1.0)
test_aggregates_across_multiple_trials: assert 1.0 == 2.0

The PR description says the goal is "fmean across the three runs," but the code does fmean across the flattened
(trial × rank) list, which is not the same thing. Bandwidth across ranks is genuinely additive (each rank touches
a disjoint slice of storage); only the across-trials reduction should switch from sum to mean. The correct fix is
two-stage: sum across ranks within each trial, then fmean across trials.

  1. fmean([]) raises where sum([]) returned 0 (1 failure)

test_none_p95_when_no_successful_reads
statistics.StatisticsError: fmean requires at least one data point

When every rank/trial result file is missing, all_read_bw ends up empty. The old sum([]) returned 0 silently;
fmean([]) raises. Needs an if all_read_bw else 0.0 guard (or compute only when populated). The P95 path at line
404 already has this guard (max(...) if all_p95_latency else None) — it should be replicated for the four fmean()
calls.

Side note on the PR's stated goal

The PR description says it "replaces the summation of results across the three runs with the fmean calculation
across the three runs, fixing an inaccuracy" — but P95 was never summed (it was max()), and the PR doesn't touch
the P95 line. So the original concern that triggered this PR (P95 aggregation) is not addressed by these changes.
The PR is actually changing bandwidth/throughput aggregation, which is a different metric — and arguably the
previous sum() was correct there.

Worth raising with the author whether (a) only the across-trials dimension should change (sum→mean across trials,
sum stays across ranks), and (b) whether P95 also needs to change from max() to something else.

@dslik

dslik commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

That's a pretty bad code smell for the test to be reaching into a local scope variable and creating a dependency like this, but, looks like that's a Python thing. I'll update the tests.

Claude is completely wrong in it's "side note", I hope its code is better than its reasoning here.

The variable name obscures its shape, so I'll fix that issue too.

@dslik

dslik commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@FileSystemGuy All tests passing now.

@dslik dslik requested a review from FileSystemGuy June 20, 2026 02:09
@FileSystemGuy FileSystemGuy merged commit 645e428 into mlcommons:main Jun 20, 2026
3 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants