Add per-interactivity throughput table and AUC summary table to inference page #364

Open
functionstackx wants to merge 4 commits into master from feat/interactivity-throughput-and-auc-tables

@functionstackx

Summary

Below the existing Pareto-frontier chart on the inference page, render two new tables that summarize the visible Pareto-frontier curves into scalar form. Both tables react live to the same filter controls that drive the chart (model, precision, sequence/ISL-OSL, and the legend on/off toggles for enabled configs), and only appear when the y-axis metric is Token Throughput per GPU — the AUC + interactivity framing assumes that metric.

Table 1 — Per-GPU throughput at each interactivity bucket

  • Rows: enabled configs. Columns: every 10 tok/s/user from 10 up through ceil(globalMax / 10) * 10.
  • Cells: tok/s/gpu linearly interpolated along each config's 2-D Pareto frontier of (interactivity, tok/s/gpu). Outside the frontier's x-range: em dash.
  • Best value per column is highlighted (green background, bold).
  • Linked sub-table below shows percent advantage of each config vs a user-selectable baseline (default: MI355X SGLang). Cells follow the spec's ∞ / −∞ / — semantics for missing-baseline / missing-other / both-missing.
  • Heatmap: red → white → green, clamped at ±200%. Text color picked via WCAG relative luminance so each cell stays readable.
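
A minimal sketch of how a Table 1 cell could be computed, assuming higher is better on both axes. The `Point` type and both signatures here are illustrative assumptions, not the actual exports of packages/app/src/lib/pareto.ts.

```typescript
type Point = { x: number; y: number }; // x = interactivity (tok/s/user), y = tok/s/gpu

// Keep only non-dominated points: a point survives if no other point has
// x at least as large and y at least as large.
function paretoFrontier(points: Point[]): Point[] {
  // Sweep from the highest x down; keep a point only if it raises the best y so far.
  const sorted = [...points].sort((a, b) => b.x - a.x || b.y - a.y);
  const frontier: Point[] = [];
  let bestY = -Infinity;
  for (const p of sorted) {
    if (p.y > bestY) {
      frontier.push(p);
      bestY = p.y;
    }
  }
  return frontier.reverse(); // ascending x
}

// Linear interpolation along a frontier sorted by ascending x.
// Returns null outside the frontier's x-range (rendered as a dash cell).
function interpAlongFrontier(frontier: Point[], x: number): number | null {
  let prev: Point | undefined;
  for (const p of frontier) {
    if (p.x === x) return p.y;
    if (prev && prev.x < x && x < p.x) {
      const t = (x - prev.x) / (p.x - prev.x);
      return prev.y + t * (p.y - prev.y);
    }
    prev = p;
  }
  return null;
}
```

For the points (10, 100), (20, 80), (15, 70), (30, 40), the point (15, 70) is dominated by (20, 80), so the frontier has three points; the bucket at x = 15 interpolates to 90 and a bucket at x = 5 is out of range.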

Table 2 — Area under Pareto frontier (AUC summary)

  • AUC = trapezoidal area under each config's Pareto frontier, integrated from x = 10 to x = ceil(globalMax / 10) * 10. Outside the frontier's x-range the integrand is treated as 0, so configs that don't reach part of the range contribute 0 there.
  • Columns: AUC, Ratio vs primary baseline, % vs primary baseline, Ratio vs secondary baseline, Ratio vs tertiary baseline.
  • Three independent baseline dropdowns. Defaults: primary = B200 SGLang non-MTP, secondary = MI355X SGLang, tertiary = MI355X ATOM.
  • Self-vs-self renders amber 1.00× / +0.0%; better-than-baseline is green; worse is red (same red/green heatmap as Table 1).
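
The AUC rule above can be sketched in closed form as follows. This is an illustration under assumptions: the frontier is already sorted by ascending x, and the signature of `aucUnderFrontier` is a guess rather than the real one in pareto.ts.

```typescript
type Point = { x: number; y: number };

// Exact trapezoidal area under a piecewise-linear frontier, restricted to [lo, hi].
// Stretches of [lo, hi] not covered by any segment contribute 0.
function aucUnderFrontier(frontier: Point[], lo: number, hi: number): number {
  let area = 0;
  let prev: Point | undefined;
  for (const p of frontier) {
    if (prev && p.x > prev.x) {
      // Clip this segment to the integration window.
      const a = Math.max(prev.x, lo);
      const b = Math.min(p.x, hi);
      if (b > a) {
        const slope = (p.y - prev.y) / (p.x - prev.x);
        const ya = prev.y + slope * (a - prev.x);
        const yb = prev.y + slope * (b - prev.x);
        area += 0.5 * (ya + yb) * (b - a);
      }
    }
    prev = p;
  }
  return area;
}
```

For the frontier (10, 100), (20, 80), (30, 40) integrated over [10, 40], the two segments give 900 + 600 = 1500, and the uncovered stretch from 30 to 40 adds nothing.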

Implementation notes

  • Shared 2-D Pareto / interp / AUC implementation in packages/app/src/lib/pareto.ts. The existing chart-side roofline code in chart-utils.ts is metric-aware (operates on full InferenceData with upper_left | upper_right | … directions) and intentionally kept untouched — the new util is the plain numeric core that consumers without that machinery (these tables) should use. Both code paths compute the same non-dominated set on (x, y) = (interactivity, tok/s/gpu).
  • AUC is computed in closed form on the piecewise-linear frontier rather than as a 10 001-sample np.interp grid — same answer to machine precision and avoids per-render allocations.
  • Tables source their data from useInference().graphs (the existing interactivity chart's processed data), then apply the existing selectedPrecisions and activeHwTypes filters before grouping by hwKey. This is how the table guarantees it always shows exactly the configs that are currently on the chart.
  • Each baseline Select is track()-ed (inference_throughput_baseline_changed, inference_auc_primary_baseline_changed, etc.) per the project's analytics convention.
  • Tooltips/explainers added next to both table headings.

Verification against the spec's reference AUCs

The spec ships an 8-config sample dataset (FP4 DeepSeek V4 Pro, 8K/1K, TP=8) with known expected AUCs computed by the Python reference. The pareto util's unit tests load that fixture and check that all 8 configs match within 0.5% — they all do.

| Config | Expected | Computed (within 0.5%) |
| --- | --- | --- |
| MI355X SGLang non-MTP | 11,457 | ✅ |
| MI355X ATOM non-MTP | 23,659 | ✅ |
| B200 SGLang non-MTP | 63,495 | ✅ |
| B200 Dynamo vLLM | 62,177 | ✅ |
| GB200 Dynamo vLLM non-MTP | 116,220 | ✅ |
| GB200 Dynamo vLLM MTP | 176,705 | ✅ |
| GB300 Dynamo SGLang non-MTP | 379,854 | ✅ |
| GB300 Dynamo SGLang MTP | 263,727 | ✅ |

Files

  • packages/app/src/lib/pareto.ts — new shared util (Pareto frontier, linear interp, trapezoidal AUC).
  • packages/app/src/lib/pareto.test.ts — unit tests including the 8-config sanity check.
  • packages/app/src/lib/__fixtures__/eight_config_data.json — test fixture from the spec.
  • packages/app/src/components/inference/ui/InteractivityTables.tsx — new component containing both tables.
  • packages/app/src/components/inference/ui/ChartDisplay.tsx — mounts the new component below the displayed graphs.

Layout

The new section appears as two stacked Cards below the Pareto chart (and above the "Performance Over Time" drill-down dialog). Each card has a heading row with an info-tooltip and (for the heatmap and AUC tables) baseline Select controls right-aligned. The tables themselves use the dashboard's standard text-xs, tabular-nums, border-collapse, sticky-first-column pattern. No new design system or font is introduced.

I could not render a local screenshot in this environment (no DB / no browser), so the layout description above is the best representation I can give.

Test plan

  • pnpm lint clean
  • pnpm fmt clean
  • pnpm typecheck clean
  • pnpm test:unit clean (1,930 app tests pass, including 16 new pareto tests)
  • Visual review on a Vercel preview deploy: pick FP4 DeepSeek V4 Pro 8K/1K TP=8 and confirm AUC numbers match the spec table within rounding.
  • Toggle a config off in the legend and confirm both tables drop that row and the column max / heatmap recompute.
  • Change the baseline dropdowns and confirm the affected columns recolor and recompute. Self-row remains amber 1.00× / 0%.
  • Switch the y-axis to a non-throughput metric and confirm the section hides (intended — AUC framing only applies to tok/s/gpu).

🤖 Generated with Claude Code

… table

Below the Pareto chart on the inference page, render two new tables that
summarize the visible Pareto-frontier curves into scalar form.

- Table 1 (per-GPU throughput at each interactivity bucket): rows = enabled
  configs, columns = every 10 tok/s/user from 10 up to ceil(globalMax/10)*10.
  Cells are tok/s/gpu linearly interpolated along each config's Pareto
  frontier; "—" for out-of-range buckets; best per column highlighted.
  Linked sub-table shows % advantage vs a user-selectable baseline (default:
  MI355X SGLang) with infinity / negative-infinity / em-dash semantics and a
  +/-200%-capped red->white->green heatmap; cell text color picked via WCAG
  luminance for contrast.

- Table 2 (AUC summary): trapezoidal area under each frontier from x=10 to
  ceil(globalMax/10)*10, with y treated as 0 outside the frontier's x-range.
  Columns: AUC, ratio + % vs primary baseline (default B200 SGLang non-MTP),
  ratio vs secondary baseline (default MI355X SGLang), ratio vs tertiary
  baseline (default MI355X ATOM). All three baselines are selectable.
  Self-vs-self is amber 1.00x/+0.0%; better is green; worse is red.

Both tables share a single Pareto/interp/AUC implementation in
@/lib/pareto. Verified against the spec's reference AUCs from
eight_config_data.json (FP4 DeepSeek V4 Pro, 8K/1K, TP=8) -- all 8 configs
match the expected values to within 0.5%. Tables react live to the existing
filter controls (model, precision, ISL/OSL, legend on/off toggles).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vercel

vercel Bot commented May 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | May 17, 2026 10:03pm |

Two follow-up tweaks to the per-interactivity throughput and AUC summary
tables introduced in 6db1e32:

1. Render multiplicative ratios (Nx) instead of percent-differences.
   - Throughput "% advantage vs baseline" sub-table → "Ratio vs baseline",
     cells now read "2.50×", "0.60×", etc; self-vs-self is "1.00×";
     "∞" kept (other reachable, baseline not); "−∞" replaced with "0×"
     using the same dark-red treatment for the symmetric case.
   - AUC table: drop the redundant "% vs primary" column entirely (the
     other three columns are already ratios), so columns are AUC + Ratio
     vs primary + Ratio vs secondary + Ratio vs tertiary, all in Nx.
   - New ratioColor() centered at 1.00× and log-symmetric: 3.00× → fully
     green, 0.33× → fully red, interpolating linearly in log space (so
     "2×" and "0.5×" land at matched saturations). WCAG-luminance text
     color preserved.

2. Column upper bound is now floor(globalMax/10)*10 instead of ceil, for
   both the throughput buckets and the AUC integration window. The last
   bucket is therefore always one that at least one config actually
   reaches.

pareto.test.ts: spec sanity check now compares aucUnderFrontier against
an independent fine-grid trapezoidal reference computed inline, instead
of hard-coding expected AUC magnitudes that bake in a specific upper
bound — the new floor(...) rule, or any future window change, no longer
requires touching the test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@functionstackx

Pushed aad700a addressing two requested changes:

1. Ratios (Nx) instead of percentages

  • Throughput diff sub-table is now "Ratio vs baseline": cells render 2.50× / 0.60×, and 1.00× for self. The infinity cases are kept symmetric: ∞ (baseline can't reach this interactivity but the other config can) and 0× (the reverse), both with the same dark-red/dark-green treatment as before. Picked 0× over −∞ because it's the actual numeric limit of other / baseline when other → 0, and it reads more cleanly alongside the other ratio cells.
  • AUC summary table: dropped the now-redundant "% vs primary" column entirely. The table is now just AUC + Ratio vs primary + Ratio vs secondary + Ratio vs tertiary, all in Nx.
  • New ratioColor() is centered at 1.00× and log-symmetric: 3.00× → fully green, 0.33× → fully red, interpolating linearly in log space so 2× and 0.5× sit at matched saturations. WCAG-luminance text-color selection preserved.
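
The ratio-cell semantics above could be expressed roughly like this. `formatRatio` is a hypothetical helper name, and using `null` to mean "this config has no value at this bucket" is an assumption for the sketch.

```typescript
// other / baseline are the interpolated values at one bucket, or null when the
// config does not reach that bucket.
function formatRatio(other: number | null, baseline: number | null): string {
  if (other === null && baseline === null) return "—"; // neither config reaches the bucket
  if (baseline === null) return "∞"; // other reaches it, baseline does not
  if (other === null) return "0×"; // baseline reaches it, other does not
  return `${(other / baseline).toFixed(2)}×`;
}
```

With these rules, 250 vs 100 renders as 2.50×, 60 vs 100 as 0.60×, and a config compared against itself as 1.00×.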

2. floor instead of ceil for the upper bound

  • Throughput table buckets and AUC integration window both now end at floor(globalMax / 10) * 10. So if the highest tok/s/user any selected config reaches is e.g. 173.4, columns go 10, 20, …, 170 (not 180), and AUC integrates over [10, 170].
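
As a small illustration of the floor rule, here is a sketch of the column generator. `interactivityBuckets` is a hypothetical name; only the floor(globalMax / 10) * 10 bound comes from the change itself.

```typescript
// Bucket columns from `step` up to floor(globalMax / step) * step, inclusive.
function interactivityBuckets(globalMax: number, step = 10): number[] {
  const hi = Math.floor(globalMax / step) * step;
  const buckets: number[] = [];
  for (let x = step; x <= hi; x += step) {
    buckets.push(x);
  }
  return buckets;
}
```

With globalMax = 173.4 this yields 10, 20, …, 170 (17 columns), so the AUC window in that case is [10, 170].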

Test updates

  • pareto.test.ts no longer hard-codes the spec's expected AUC magnitudes — those values bake in a specific upper bound and would have shifted with this change (e.g. B200_DynamoVLLM_nonMTP_disagg goes from 62,177 → 62,194 when hi shifts from 180 → 170, because that config keeps contributing positive area in the (170, 180) window we used to integrate over). The sanity check now compares aucUnderFrontier against an independent fine-grid trapezoidal reference computed inline for each config, so the assertion stays meaningful regardless of which upper-bound rule is in play.
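
An independent fine-grid reference of the kind described could look roughly like the sketch below. Both function names are hypothetical, the 10,001-sample default mirrors the grid size mentioned in the original implementation notes, and the frontier is assumed to be sorted by ascending x.

```typescript
type Point = { x: number; y: number };

// Value of the piecewise-linear frontier at x, or 0 outside its x-range.
function frontierValue(frontier: Point[], x: number): number {
  let prev: Point | undefined;
  for (const p of frontier) {
    if (p.x === x) return p.y;
    if (prev && prev.x < x && x < p.x) {
      return prev.y + ((x - prev.x) / (p.x - prev.x)) * (p.y - prev.y);
    }
    prev = p;
  }
  return 0;
}

// Brute-force trapezoidal rule on an evenly spaced grid over [lo, hi].
function fineGridAuc(frontier: Point[], lo: number, hi: number, samples = 10001): number {
  const h = (hi - lo) / (samples - 1);
  let area = 0;
  let prevY = frontierValue(frontier, lo);
  for (let i = 1; i < samples; i++) {
    const y = frontierValue(frontier, lo + i * h);
    area += 0.5 * (prevY + y) * h;
    prevY = y;
  }
  return area;
}
```

Because the reference never sees the closed-form code, agreement within a small relative tolerance is a meaningful check whichever upper-bound rule is used. The grid estimate is slightly off where the frontier ends and the value drops to 0, which is why the comparison uses a tolerance rather than exact equality.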

Verification

  • pnpm lint
  • pnpm fmt
  • pnpm typecheck
  • pnpm test:unit ✅ (1930 tests, all 8 AUC sanity-check cases pass against the independent reference)

Parameterize pareto.ts with 'higher' | 'lower' direction so the
interactivity tables work for cost / J / power metrics in addition
to tok/s/gpu. Direction is taken from the existing chart-config
roofline direction (upper_* = higher-better, lower_* = lower-better)
via new lib/metric-direction.ts helper.

- paretoFrontier / interpAlongFrontier / aucUnderFrontier accept a
  direction parameter.
- For lower-is-better, AUC integrates only over each config's
  reachable x-range (zero-padding outside would treat "no data" as
  the BEST value, inflating cost AUC). Higher-better keeps the
  existing zero-outside behavior.
- New aucWindow() reports the effective integration window per row,
  shown as a new "Window" column when the active metric is
  lower-is-better.
- InteractivityTables renders for every y-axis metric; column-best
  highlight picks min for lower-better; ratio colormap inverts so
  ratios < 1 are green and > 1 are red; in-range vs out-of-range
  cells flip their green/red mapping consistently with the direction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@functionstackx

Extended to all y-axis metrics

InteractivityTables now renders for every y-axis metric (cost, J/token, power, etc.) — not just tok/s/gpu.

Source-of-truth for direction

The existing chart config (packages/app/src/components/inference/inference-chart-config.json) already declares each metric's roofline direction per chart type via y_<metric>_roofline. On the interactivity chart, upper_* is higher-is-better and lower_* is lower-is-better. I added a small shared helper at packages/app/src/lib/metric-direction.ts that maps that direction to a 'higher' | 'lower' ParetoDirection — same data, no duplication. The tables read it directly off the active interactivity chart definition.
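
A minimal sketch of such a mapping, assuming the roofline value is a string like upper_left or lower_right. `directionFromRoofline` is a hypothetical name and may not match what metric-direction.ts actually exports; throwing on unknown values is also an assumption.

```typescript
type ParetoDirection = "higher" | "lower";

// Map a chart-config roofline direction to the direction used by the Pareto util.
function directionFromRoofline(roofline: string): ParetoDirection {
  if (roofline.startsWith("upper_")) return "higher";
  if (roofline.startsWith("lower_")) return "lower";
  throw new Error(`Unrecognized roofline direction: ${roofline}`);
}
```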

AUC out-of-range decision

For lower-is-better metrics, treating out-of-reachable-range as y=0 would inflate AUC because 0 is the BEST cost. I chose to integrate only over each config's reachable x-range (clip the requested [10, hi] to [max(10, configMinX), min(hi, configMaxX)]). For higher-is-better, I kept the existing zero-outside behavior — there, y=0 is the WORST throughput, so zero-padding correctly penalizes configs that can't reach the high-interactivity buckets.

Trade-off: under this asymmetric rule, lower-better AUCs from configs with narrow reachable spans aren't directly comparable to configs with wide spans. To make that explicit, the AUC table gains a new "Window" column (only shown for lower-better metrics) that displays each row's effective lo→hi window.

pareto.aucWindow() is the new helper that returns the effective window so consumers can display it. For higher-better it always returns the requested [lo, hi]; for lower-better it returns the clipped reachable range.
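
The clipping rule can be sketched as below. The signature and the `null` return for an empty window are assumptions made for illustration; the real `aucWindow` may differ.

```typescript
type ParetoDirection = "higher" | "lower";
type Point = { x: number; y: number };

// Effective integration window for one config.
function aucWindow(
  frontier: Point[],
  lo: number,
  hi: number,
  direction: ParetoDirection,
): [number, number] | null {
  // Higher-is-better keeps the requested window; zero-padding handles the rest.
  if (direction === "higher") return [lo, hi];
  // Lower-is-better clips to the x-range the config actually reaches.
  const xs = frontier.map((p) => p.x);
  const clippedLo = Math.max(lo, Math.min(...xs));
  const clippedHi = Math.min(hi, Math.max(...xs));
  return clippedLo < clippedHi ? [clippedLo, clippedHi] : null;
}
```

A config whose frontier spans x = 40 to 120 gets the window [40, 120] for a cost metric when [10, 170] is requested, and keeps [10, 170] for throughput.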

Other changes

  • paretoFrontier, interpAlongFrontier, aucUnderFrontier all accept a direction: 'higher' | 'lower' parameter (defaulting to 'higher' — fully backward-compatible).
  • Column-best highlight: max for higher-better, min for lower-better.
  • Ratio colormap inverts for lower-better (ratios < 1 are green = good, > 1 are red).
  • ∞ / 0× cell coloring flips: for lower-better, ∞ is red (other = infinite cost vs baseline = bad) and 0× is green (other achieves zero cost relative to baseline = great).
  • Section headers stay generic ("Per-GPU value at each interactivity bucket", "Area under Pareto frontier"); tooltips and the row caption now include a "Higher is better" / "Lower is better" hint.
  • Numeric formatting auto-scales for small (cost / J/token) values.
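
To make the direction parameter concrete, here is a hedged sketch of direction-aware pruning and column-best selection. It assumes a higher x (interactivity) is always preferable and only the y preference flips; the signatures and the `columnBest` helper are illustrative, not the real API.

```typescript
type ParetoDirection = "higher" | "lower";
type Point = { x: number; y: number };

// Non-dominated set where larger x is better and y follows `direction`.
function paretoFrontier(points: Point[], direction: ParetoDirection = "higher"): Point[] {
  const sign = direction === "higher" ? 1 : -1;
  // Sweep from the highest x down, with the better y first among equal x.
  const sorted = [...points].sort((a, b) => b.x - a.x || sign * (b.y - a.y));
  const frontier: Point[] = [];
  let best = -Infinity;
  for (const p of sorted) {
    const score = sign * p.y;
    if (score > best) {
      frontier.push(p);
      best = score;
    }
  }
  return frontier.reverse(); // ascending x
}

// Best value in a table column: max for higher-is-better, min for lower-is-better.
function columnBest(values: Array<number | null>, direction: ParetoDirection): number | null {
  const present = values.filter((v): v is number => v !== null);
  if (present.length === 0) return null;
  return direction === "higher" ? Math.max(...present) : Math.min(...present);
}
```

For cost points (10, 1.0), (20, 2.0), (15, 2.5), (30, 4.0) under 'lower', the point (15, 2.5) is pruned because (20, 2.0) offers more interactivity at lower cost.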

Tests

pareto.test.ts adds:

  1. A lower-is-better fixture asserting frontier pruning under inverse dominance.
  2. An aucWindow block covering clip-to-reachable behavior.
  3. A synthetic 3-config cost fixture (cheap / expensive / niche) end-to-end: pareto → window → AUC, with hand-computed expected values.
  4. A duplicate-x interp test verifying lower-better picks min and higher-better picks max.

All 1940 unit tests pass. pnpm lint, pnpm fmt, pnpm typecheck clean. The existing 8-config (real benchmark) integration test is unchanged and still asserts agreement with an independent fine-grid reference for higher-better.

Commit: d5e6abe

Files

  • packages/app/src/lib/pareto.ts — direction parameter, aucWindow export.
  • packages/app/src/lib/metric-direction.ts — new shared helper.
  • packages/app/src/components/inference/ui/InteractivityTables.tsx — direction-aware rendering, removed auto-hide gate.
  • packages/app/src/components/inference/ui/ChartDisplay.tsx — updated comment on the gate.
  • packages/app/src/lib/pareto.test.ts — direction tests + synthetic lower-better fixture.

Notes / deviations

  • The fixture for the lower-better integration test is synthetic. eight_config_data.json only has Token_Throughput_per_GPU_tok_s_gpu per row, so it isn't directly usable for a lower-better metric without duplicating the ETL math; the synthetic fixture cleanly exercises the same code path with hand-checkable expected values.
  • I did NOT change the AUC behavior for higher-better metrics (zero-padding outside reachable range stays). The existing real-data sanity check continues to pass against the independent reference, so prior numbers in the throughput AUC table are unaffected.

The ratio heatmap saturated at 3x, so anything from 5x to 33x collapsed to
the same maximum green — common ratios like 7x and 20x looked identical.
Bump the log-symmetric saturation caps to 30x / 1/30x and drive the color
ramp through HSL (hue=142/0, lightness 0.97→0.28, saturation 0.6→0.78) so
2x / 5x / 10x / 20x land at perceptually distinct greens.

Export ratioColor and add unit tests covering distinctness, monotonicity,
clamping, log-symmetric reciprocals, lower-better inversion, and text
contrast.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@functionstackx

Bumped the heatmap saturation caps from 3× / ⅓× to 30× / 1/30× and switched the white→green / white→red ramp from RGB to HSL interpolation (hue=142°/0°, lightness 0.97→0.28, saturation 0.60→0.78). With caps at 30×, common ratios are no longer clamped to the same position on the log-symmetric ramp, and HSL gives more perceptual contrast across the upper half of the ramp than the prior RGB lerp between green-300 and green-700.

New ratio → color mapping

| Ratio | t = log(r)/log(30) | HSL L | RGB | Hex (approx) | Text |
| --- | --- | --- | --- | --- | --- |
| 0.05× | −0.881 | 0.36 | rgb(162, 22, 22) | #a21616 | white |
| 0.1× | −0.677 | 0.50 | rgb(220, 37, 37) | #dc2525 | white |
| 0.5× | −0.204 | 0.83 | rgb(239, 184, 184) | #efb8b8 | black |
| 1.0× | 0.000 | 0.97 | rgb(243, 252, 246) | #f3fcf6 | black |
| 1.5× | 0.119 | 0.89 | rgb(209, 244, 222) | #d1f4de | black |
| 2× | 0.204 | 0.83 | rgb(184, 239, 204) | #b8efcc | black |
| 5× | 0.473 | 0.64 | rgb(102, 226, 147) | #66e293 | black |
| 7× | 0.572 | 0.58 | rgb(71, 223, 126) | #47df7e | black |
| 10× | 0.677 | 0.50 | rgb(37, 220, 104) | #25dc68 | black |
| 20× | 0.881 | 0.36 | rgb(22, 162, 74) | #16a24a | white |
| 33× | 1.000 | 0.28 | rgb(16, 127, 57) | #107f39 | white |

Each consecutive step on the upper half (1.5× → 2× → 5× → 7× → 10× → 20× → 33×) lands at a visibly distinct green; reciprocal ratios are exact mirror images (0.5× ↔ 2×, 0.1× ↔ 10×). Text color flips to white once background luminance drops below 0.45 (unchanged).
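
A sketch of the mapping behind these values, under stated assumptions: lightness and saturation are taken to ramp linearly in |t| between the quoted endpoints (0.97→0.28 and 0.60→0.78), which is consistent with the table but not confirmed against the source. `ratioT` and `ratioHsl` are hypothetical names; the cap constants follow the RATIO_CAP_HI / RATIO_CAP_LO names used in the tests.

```typescript
type ParetoDirection = "higher" | "lower";

const RATIO_CAP_HI = 30;
const RATIO_CAP_LO = 1 / 30;

// Log-symmetric position in [-1, 1]: 0 at 1.00×, ±1 at the caps.
function ratioT(ratio: number): number {
  const clamped = Math.min(RATIO_CAP_HI, Math.max(RATIO_CAP_LO, ratio));
  return Math.log(clamped) / Math.log(RATIO_CAP_HI);
}

// HSL background for a ratio cell. Hue 142 is green, hue 0 is red.
function ratioHsl(
  ratio: number,
  direction: ParetoDirection = "higher",
): { h: number; s: number; l: number } {
  const t = ratioT(ratio);
  const magnitude = Math.abs(t);
  // For lower-is-better metrics, ratios below 1 are the good (green) side.
  const good = direction === "higher" ? t >= 0 : t <= 0;
  return {
    h: good ? 142 : 0,
    s: 0.6 + 0.18 * magnitude, // 0.60 -> 0.78
    l: 0.97 - 0.69 * magnitude, // 0.97 -> 0.28
  };
}
```

Since only |t| drives saturation and lightness, reciprocal ratios such as 2× and 0.5× differ in hue alone, which is the mirror property noted above.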

Tests

Added InteractivityTables.test.ts with:

  • distinct backgrounds for {2×, 5×, 7×, 10×, 20×}
  • monotonically darker green for higher ratios up to the cap
  • clamp behavior beyond RATIO_CAP_HI / RATIO_CAP_LO
  • log-symmetric reciprocal mirror property
  • direction='lower' hue inversion
  • text-color switch at deep ratios
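
The text-color switch can be illustrated with the standard WCAG 2.x relative-luminance formula. The 0.45 threshold is the one quoted above; the helper names are hypothetical, and whether the component uses exactly this formula is an assumption.

```typescript
// sRGB channel (0-255) to linear light, per the WCAG 2.x relative-luminance definition.
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(r: number, g: number, b: number): number {
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

const TEXT_FLIP_LUMINANCE = 0.45;

// White text on dark backgrounds, black text otherwise.
function textColorFor(r: number, g: number, b: number): "white" | "black" {
  return relativeLuminance(r, g, b) < TEXT_FLIP_LUMINANCE ? "white" : "black";
}
```

Applied to the mapping table, rgb(37, 220, 104) at 10× stays above the threshold and keeps black text, while rgb(22, 162, 74) at 20× falls below it and flips to white.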

All 1947 app unit tests pass; lint, fmt, typecheck clean.
