[Draft] Robust measures scratchpad#368
Conversation
Percentiles on empty dataset are NaN, not infinity Add Robust statistics of CPU times to summary Fixed name for nv/cold/time/gpu/q3, corrected value reported for nv/cold/time/gpu/ir/relative Use median and IR to compute location and noise in measure_cold Also in stdrel_criterion, compute noise as IR / median.
1. For JSON files that contains repeated measurements of run-time axis values, make sure that scripts compares corresponding reference entries. If cmp had two states with the same name and ref had two, we would compare measurements for each state in cmp against the first state in ref. Change here introduces counters tracking how many times each particular axis value, and retrieve corresponding entry in ref. Previously, I had ``` | BlockSize | NumBlocks | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | |-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------| | 2^8 | 64 | 1.776 ms | 0.46% | 1.777 ms | 0.40% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.774 ms | 0.52% | -2.048 us | -0.12% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.773 ms | 0.52% | -3.072 us | -0.17% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.774 ms | 0.58% | -2.048 us | -0.12% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.773 ms | 0.58% | -3.072 us | -0.17% | SAME | ``` and now it becomes ``` | BlockSize | NumBlocks | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | |-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------| | 2^8 | 64 | 1.776 ms | 0.46% | 1.777 ms | 0.40% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.773 ms | 0.64% | 1.774 ms | 0.52% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.774 ms | 0.46% | 1.773 ms | 0.52% | -1.024 us | -0.06% | SAME | | 2^8 | 64 | 1.773 ms | 0.46% | 1.774 ms | 0.58% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.774 ms | 0.52% | 1.773 ms | 0.58% | -1.024 us | -0.06% | SAME | ``` With the following raw data expected ``` (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' base.json "0.0017756160497665405" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' test.json "0.0017766400575637818" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" ``` 2. nvbench_compare changes from using min_noise = min(ref_noise, cmp_noise) to using max_noise = max(ref_noise, cmp_noise) Using larger of ref and cmp noise level as a reference against which to gauge timing difference ratio makes more sense.
These measures are less sensitive to outliers
Computing relative noise, based on stdev/mean or ir/median, is refactored into a helper that returns nullopt is absolute dispersion is not a number, or centrality measure is zero. Deployed using this routine, and adapted to return type being an optional. Added a test
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
|
Caution Review failedAn error occurred during the review process. Please try again later. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/scripts/nvbench_compare.py (1)
571-577:⚠️ Potential issue | 🔴 Critical | ⚡ Quick wincritical: When
noise[i]isNone(which now happens when noise summaries are missing), the arithmeticy[i] * noise[i]at lines 573-574 will raise aTypeError. Thenormalize_float_valuehelper producesNonefor missing noise, and this code path doesn't guard against it.- top = [y[i] + y[i] * noise[i] for i in range(len(x))] - bottom = [y[i] - y[i] * noise[i] for i in range(len(x))] + top = [y[i] + y[i] * (noise[i] or 0) for i in range(len(x))] + bottom = [y[i] - y[i] * (noise[i] or 0) for i in range(len(x))]Alternatively, skip plotting error bands when noise is unavailable.
🧹 Nitpick comments (2)
nvbench/detail/measure_cpu_only.cxx (1)
190-197: 💤 Low valuesuggestion: The
cpu/mediansummary is hidden here (hidemetadata set), but inmeasure_cold.cuthecpu/mediansummary (line 283-288) is shown by default (nohideset). This may be intentional since CPU-only mode focuses on mean, but verify the intended behavior for consistency across measurement modes.python/scripts/nvbench_compare.py (1)
367-379: 💤 Low valuesuggestion: The
isliceskipscounters[cmp_state_name] - 1items to get the nth occurrence, but if there are fewer matchingref_statesthancountersindicates,islicereturnsNonevia the outernext(..., None). This is handled, but the logic assumes ref and cmp states appear in the same order. If ordering differs, mismatched states will be compared. Consider adding a comment documenting this assumption.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 075f958e-d1be-4da8-b99a-ace49fa80067
📒 Files selected for processing (6)
nvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pytesting/statistics.cu
…ro, so no reliable relative noise can be computed
Replace distribution-generated inputs with fixed measurement vectors and add coverage for persistently invalid relative-IQR estimates. This keeps the test focused on criterion behavior and avoids sensitivity to random data generation.
Split measurements in runs where noise is known for all consecutive points and plot the band. This shows as gaps in confidence band for a missing noise value
…stdev/relative hidden
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/scripts/nvbench_compare.py (1)
420-428:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: rows are discarded when
nv/cold/time/gpu/ir/relativeis missing, but upstream now emits that summary conditionally. This drops valid time comparisons and undercounts unknown/failure outcomes. Require only time summaries here, and treat missing noise summaries asNoneso the existingmax_noise is Nonepath marks them as unknown instead of skipping. As per coding guidelines,python/**/*: Focus on JSON/result parsing compatibility.Also applies to: 443-446
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2c97f08-9366-497b-b65b-93dc16e75e9a
📒 Files selected for processing (8)
nvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 32ff53d7-d46e-40eb-8c0f-f9b7f4db0518
📒 Files selected for processing (8)
nvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
The value is now the source of truth. It was previously hard-coded at multiple locations in the code-base. Updated hard-coded values with references to this constant. Printers are to output noise values only if sufficient samples were accumulated to produce meaningful estimate. That is --profile runs will not report noise.
under --ignore-devices, reused IDs in a different order no longer cause accidental ID-based pairing
However, replace estimation of standard deviation (complexity O(n) at every iteration, resulting in O(n^2) overall complexity) with use of online algorithm for estimating first two moments. This lowers overall complexity from O(n^2) to O(n). This allows stdrel_criterion class to no longer keep and grow the history of measurments, reducing memory footprint.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/scripts/nvbench_compare.py (1)
73-79:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winsuggestion: Fix
--no-colorstatus emoji rendering inpython/scripts/nvbench_compare.pyIn
colorize()’sno_colorpath,str(emoji)returns enum names likeEmoji.YELLOW/Emoji.NONEforclass Emoji(str, Enum), so--no-colorprints names instead of the glyphs and incorrectly prefixesEmoji.NONE. Useemoji.value(e.g.,if emoji.value:) to build the prefix.
♻️ Duplicate comments (1)
docs/cli_help.md (1)
165-166:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winimportant: The
stdreland--max-noisetext still describes stdev/mean semantics, but this change set uses robust IQR-based relative noise; this will mislead threshold tuning and timeout interpretation. Update these entries to the same robust metric terminology used by the implementation.As per coding guidelines,
docs/**/*: "focus on technical accuracy, buildable examples, CLI/API consistency, version compatibility, and whether behavior changes have matching documentation updates."Also applies to: 174-177, 185-186
🧹 Nitpick comments (2)
testing/stdrel_criterion.cu (1)
152-175: ⚡ Quick winsuggestion: add an explicit reset-path test for invalid-noise streaks. Current additions prove eventual finish on persistent invalid inputs, but they do not assert that a subsequent valid estimate clears
m_consecutive_invalid_noise_estimatesand prevents premature force-finish. Add a test that feeds invalid measurements below the limit, then valid measurements, and verifies normal threshold-based behavior resumes.As per coding guidelines,
testing/**/*: Focus on whether tests cover observable behavior and remain deterministic.python/test/test_nvbench_compare.py (1)
47-60: ⚡ Quick winsuggestion: these helpers only synthesize mean + relative-stdev summaries and the tests never pass
axis_filters, so the new median/IQR path, unavailable-noise????path, and filtered duplicate-state matching can all regress without failing this suite. Add at least one case withGPU_TIME_MEDIAN_TAG+GPU_TIME_IR_RELATIVE_TAG, one with missing dispersion, and one filtered duplicate-state comparison. As per coding guidelines,python/**/*: Focus on Python API stability, pybind11/C++ exception boundaries, GIL behavior, CUDA interoperability, object lifetime, package metadata, type stubs, JSON/result parsing compatibility, and tests.Also applies to: 80-158
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93eac76c-5397-49cc-859a-8ae0c7a27644
📒 Files selected for processing (10)
docs/cli_help.mdnvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pypython/test/test_nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
- state_name_counts() now counts an already-selected list of states instead
of taking a device id.
- For each device pair, compare_benches() now builds:
- ref_device_states: matching reference device and axis filters
- cmp_device_states: matching compare device and axis filters
- State occurrence counts and duplicate occurrence matching now operate
only on those filtered per-device lists.
- Removed the later matches_axis_filters() skip inside the compare-state
loop because filtering now happens before matching.
Added a regression test where ref/cmp have duplicate state names in opposite order,
and --axis keeps only one of them. The test verifies the kept compare state
is matched against the kept reference state, not the first unfiltered occurrence.
Added CLI options: -`-reference-devices` `all`|`ID`|`ID,ID,...` - Integer lists preserve order and duplicates. - Requested IDs are validated against the file-level device list. - Filtered reference/compare device counts must match before comparison. - compare_benches() pairs selected reference and compare devices by position. - Each benchmark validates that requested device IDs are present in its own devices list. Also added tests for: - parsing all, single IDs, comma-separated IDs, duplicate IDs; - comparing ref device 0 against compare device 1 by position. To compare device 0 in reference dataset to device 1 in compare dataset I used ``` nvbench-compare scripts/test_ref.json scripts/test_cmp.json --reference-devices 0 --compare-devices 1 --ignore-devices ```
- `--axis` and `--benchmark` now share an ordered argparse action, so their
command-line order is preserved.
- `-a` before any `-b` becomes a global axis filter.
- `-a` after `-b <name>` applies to that most recent benchmark only.
- Repeated `-b entries` are treated as separate filter scopes and combined
as alternatives for that benchmark.
- Device filtering remains global and is applied independently.
Added a regression test for:
-b bench1 -a A=2 -b bench2
which now filters only bench1 to A=2, while bench2 is selected without that axis filter.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3a1d896d-0407-45b7-ae30-9b4311a80c1a
📒 Files selected for processing (10)
docs/cli_help.mdnvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pypython/test/test_nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/scripts/nvbench_compare.py (1)
735-757: 💤 Low valuesuggestion:
ref_stateis guaranteed non-None by the count validation at lines 716-724, but a defensive guard before accessingref_state["summaries"]would prevent a confusingTypeErrorif a future refactor breaks that invariant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a1345e48-efc0-4f6c-80e8-b63b890fc276
📒 Files selected for processing (10)
docs/cli_help.mdnvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pypython/test/test_nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nvbench/detail/measure_cold.cu (1)
233-235:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: guard summary-stat computation when
m_total_samples == 0; this path is reachable because throttled trials are discarded before sample count increments, and the current logic can emit NaN/Inf as primary CPU/GPU time summaries.As per coding guidelines,
nvbench/**/*: Focus on benchmark correctness, measurement semantics, and statistical summaries.Also applies to: 271-319, 344-354, 381-430
♻️ Duplicate comments (1)
python/scripts/nvbench_compare.py (1)
185-214:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: this still hard-fails on malformed summary payloads. A missing/non-
float64valueraises out ofextract_summary_value(), so one partial summary aborts the whole comparison instead of flowing into the existing unknown-noise path. ReturnNonefor missing/invalid fields here and let the caller classify that row as unavailable noise.As per coding guidelines,
python/**/*: Focus on JSON/result parsing compatibility.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 64d1ab70-0d32-482c-873b-3dd372339808
📒 Files selected for processing (10)
docs/cli_help.mdnvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pypython/test/test_nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
Now the device-section equality check remains fatal only for unfiltered all-vs-all comparisons. If either --reference-devices or --compare-devices is explicit, mismatched selected device metadata is printed as a warning, but comparison proceeds after the selected device counts have been validated.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/scripts/nvbench_compare.py (1)
739-764: 💤 Low valuesuggestion: The
ref_statelookup relies on the count validation at lines 720-728 to ensure it's neverNone. Consider adding an explicit assertion likeassert ref_state is not None, "invariant: count validation ensures match exists"before line 761 to document this invariant and catch future regression if the validation logic changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 32378ca4-5e3d-4fdc-a098-3071f1b8c517
📒 Files selected for processing (10)
docs/cli_help.mdnvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pypython/test/test_nvbench_compare.pytesting/statistics.cutesting/stdrel_criterion.cu
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
This PR is a draft to sort out AI review and linting