
feat(blog): B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x better perf/$#389

Open
functionstackx wants to merge 3 commits into master from feat/blog-b200-nvfp4-h200-int4-kimi-k2-vllm


Conversation

Contributor

@functionstackx functionstackx commented May 26, 2026

Summary

  • New blog post: B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x Better Performance per Dollar (/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar). On the 8K/1K workload with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x–2.95x cheaper per million tokens than H200 INT4 across the entire 30–90 tok/s/user serving band, peaking at 2.95x at 32 tok/s/user ($0.140/M vs $0.413/M, a 66% reduction). On the same B200 silicon, NVFP4 vs INT4 is worth another 2.45x–2.74x at iso-interactivity.
  • Three-factor silicon-to-perf decomposition anchors the cost gap to specs the reader can audit on /gpu-specs:
    1. 1.67x HBM BW (8 vs 4.8 TB/s) — the decode-bound throughput floor
    2. 1.28x HBM capacity (180 vs 141 GB) — fits K2 in TP=4 on B200 vs TP=8 on H200, halving collective traffic per decode step (Amdahl's law on the serial-collective bottleneck)
    3. NVFP4 precision unlock (9,000 TFLOP/s FP4 cores on B200; Hopper SM90 has zero FP4 tensor cores)
    4. ÷ 1.38x B200 TCO penalty ($1.95 vs $1.41/GPU/hr) = measured 2.95x cost gap
  • Kimi K2.5/K2.6 framing anchored on production deployment: open-weights backbone behind xAI's Cursor Composer 2 + Composer 2.5 (1M+ daily active users from the Cursor IDE), leads SWE-Bench Pro at 58.6% over GPT-5.4 (57.7) / Opus 4.6 (53.4) / Gemini 3.1 Pro (54.2), 80.2% on SWE-Bench Verified, 3.3% failure rate on Cline's diff-editing production data (matches Claude 4 Sonnet). K2.5 + K2.6 share the same pre-trained backbone (post-training refinements only) so every serving result applies one-to-one.
  • Architecture diagram from Moonshot's model card: 1.0T total / 32B active, 1 dense + 60 MoE blocks, MLA attention, top-8-of-384 expert routing, 256K context (262,144 tokens), YaRN RoPE, vocab 163,840.
  • 9 files added: MDX + 6 image variants (benchmark chart, MI355X-vs-B200 specs radar, Kimi K2 architecture diagram, all in light/dark pairs).
  • Iso-iv table built with the bundled iso_interactivity.py helper. Three per-config tables (H200 INT4 TP=8, B200 INT4 TP=8, B200 NVFP4 TP=4 + TP=8). FAQ JSON-LD covers the five questions readers actually ask: cost ratio, silicon-vs-precision decomposition, NVFP4-vs-INT4 on same silicon, why K2.5/K2.6 matters, what's not covered.
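The headline ratios in the summary can be re-derived from the figures it quotes. The sketch below is only an arithmetic check on those quoted numbers, not a reproduction of the benchmark. It assumes the cost gap equals the per-GPU throughput advantage divided by the TCO ratio, which is one reading of the decomposition above; the variable names are illustrative and do not come from the post or from iso_interactivity.py.

```python
# Figures quoted in the PR summary; nothing here is measured independently.
H200_INT4_COST_PER_M = 0.413   # $/M tokens at 32 tok/s/user
B200_NVFP4_COST_PER_M = 0.140  # $/M tokens at 32 tok/s/user

HBM_BW_TBPS = {"B200": 8.0, "H200": 4.8}
HBM_CAP_GB = {"B200": 180.0, "H200": 141.0}
TCO_PER_GPU_HR = {"B200": 1.95, "H200": 1.41}


def b200_over_h200(spec: dict[str, float]) -> float:
    """Ratio of the B200 value to the H200 value for one spec."""
    return spec["B200"] / spec["H200"]


cost_gap = H200_INT4_COST_PER_M / B200_NVFP4_COST_PER_M
reduction = 1 - B200_NVFP4_COST_PER_M / H200_INT4_COST_PER_M
bw_ratio = b200_over_h200(HBM_BW_TBPS)
cap_ratio = b200_over_h200(HBM_CAP_GB)
tco_ratio = b200_over_h200(TCO_PER_GPU_HR)

# Assumption: cost_gap = per-GPU throughput advantage / TCO ratio.
implied_throughput_gain = cost_gap * tco_ratio
# Whatever remains after the bandwidth ratio is attributed jointly to
# TP=4 vs TP=8 and the NVFP4 precision unlock (also an assumption).
residual_after_bw = implied_throughput_gain / bw_ratio

print(f"cost gap      {cost_gap:.2f}x")       # 2.95x
print(f"reduction     {reduction:.0%}")       # 66%
print(f"HBM bandwidth {bw_ratio:.2f}x")       # 1.67x
print(f"HBM capacity  {cap_ratio:.2f}x")      # 1.28x
print(f"TCO penalty   {tco_ratio:.2f}x")      # 1.38x
print(f"implied throughput gain {implied_throughput_gain:.2f}x")
print(f"residual after bandwidth {residual_after_bw:.2f}x")
```

The last two values depend entirely on the multiplicative assumption, so treat them as a sanity check on the framing, not as measured factors.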

SKILL.md update

Also bundling a small SKILL.md house-style addition that came out of writing this post: a ban on the "X, not Y" antithesis construction ("the gap is silicon × precision, not framework", "this is a real lever, not a paper one", "every gain came from the kernels, not the silicon", etc.). Reads as performatively contrarian AI flexing and was getting reflexively cut in editorial review — codifying so future drafts skip it. Three before/after examples included.
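For drafts that want a mechanical first pass before editorial review, a rough lint for this construction could look like the sketch below. It is a heuristic regex and is not part of SKILL.md; the function name `find_antithesis` is hypothetical. It will also flag harmless phrases such as ", not to mention", so a human still decides what to cut.

```python
import re

# Heuristic: a comma, then "not", then a short phrase running up to the
# next punctuation mark. This only catches the comma-delimited form.
ANTITHESIS = re.compile(r",\s+not\s+[^,.;:!?]+", re.IGNORECASE)


def find_antithesis(text: str) -> list[str]:
    """Return each ', not Y' tail found in text, without the leading comma."""
    return [m.group(0).lstrip(", ").strip() for m in ANTITHESIS.finditer(text)]


if __name__ == "__main__":
    samples = [
        "the gap is silicon × precision, not framework",
        "this is a real lever, not a paper one",
        "every gain came from the kernels, not the silicon",
        "B200 NVFP4 is cheaper per million tokens than H200 INT4.",
    ]
    for sentence in samples:
        print(find_antithesis(sentence), "<-", sentence)
```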

Test plan

  • pnpm dev and visit /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar — verify all 3 figures render in light + dark modes
  • Post appears in /blog listing with correct title, subtitle, publish date (2026-05-26)
  • OG image renders correctly
  • DashboardCTA at top + bottom + live-chart link all land on the preset 3-way comparison view on inferencex.semianalysis.com
  • Sitemap / RSS feed / llms.txt include the new post
  • All HF model card + tomtunguz + cline + NVIDIA datasheet + SemiAnalysis TCO + vLLM repo links resolve

🤖 Generated with Claude Code


Note

Low Risk
Documentation and editorial guidance only; no application runtime or security-sensitive code paths change in the diff.

Overview
Adds a new InferenceX benchmark post at /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar comparing vLLM on Kimi K2.5/K2.6 at 8K/1K across B200 NVFP4, B200 INT4, and H200 INT4 (InferenceX run 2026-05-19). The writeup anchors 2.71x–2.95x lower $/M tokens for B200 NVFP4 vs H200 INT4 in the 30–90 tok/s/user band, plus 2.45x–2.74x NVFP4 vs INT4 on the same B200 hardware, with per-concurrency tables, an iso-interactivity cost table, /gpu-specs radar and architecture figures, preset DashboardCTA links, and FAQ JSON-LD.

Updates write-inferencex-blog SKILL.md with an editorial rule to avoid the “X, not Y” antithesis pattern, with before/after examples for future drafts.

Reviewed by Cursor Bugbot for commit 648e4e0. Bugbot is set up for automated code reviews on this repo.


vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | May 26, 2026 8:15am |


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit cc2330c.

functionstackx and others added 3 commits May 26, 2026 04:14
…etter perf/$

On 8K/1K with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x-2.95x cheaper
per million tokens than H200 INT4 across the 30-90 tok/s/user serving band
(peak 2.95x at 32 tok/s/user, $0.140/M vs $0.413/M). The cost gap decomposes
into B200's silicon ratios over H200 (1.67x HBM BW, 1.28x HBM capacity that
unlocks TP=4 vs TP=8, no FP4 tensor cores on Hopper at all) composed with
the NVFP4 precision unlock, divided by B200's 1.38x TCO penalty. Kimi K2.5
and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and
Composer 2.5, leading SWE-Bench Pro at 58.6% over GPT-5.4 / Opus 4.6 /
Gemini 3.1 Pro. Same backbone across both releases — K2.6 is a post-training
refinement of K2.5 — so every serving curve applies one-to-one to both.

Also adds an X-not-Y antithesis ban to the write-inferencex-blog SKILL
house style ("the gap is silicon x precision, not framework" etc.). Reads
as performatively contrarian AI flexing and was getting reflexively cut on
review; codifying so future drafts don't repeat it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bugbot caught a numerical inconsistency: the iso-iv table shows the
B200 INT4 / B200 NVFP4 ratio at iv=32 is 2.45x ($0.343/M vs $0.140/M),
but subtitle, lede, and FAQ all claimed "2.50x–2.74x across the 30–90
tok/s/user band". Lower bound corrected to 2.45x in all three places.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

User asked to drop the https://tomtunguz.com/cursor-kimi-open-source-ai-imperative
link. Both instances removed (lede + the model-architecture section's
parenthetical citation). Surrounding text preserved: the xAI Cursor Composer
2 / 2.5 claim itself stays, just no longer hyperlinked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
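
The inconsistency fixed in the second commit (a claimed 2.50x lower bound against a table value of 2.45x) is the kind of drift a small check can catch before review. The sketch below is a hypothetical helper, not part of iso_interactivity.py. It uses only the one table row quoted in the commit message, and it rounds to two decimals because the post reports ratios at that precision.

```python
def ratios_outside_band(ratios, lo, hi, ndigits=2):
    """Return the rounded ratios that fall outside the claimed [lo, hi] band."""
    rounded = [round(r, ndigits) for r in ratios]
    return [r for r in rounded if not lo <= r <= hi]


# The only row quoted in the commit message: iv=32, B200 INT4 vs B200 NVFP4.
iv32_ratio = 0.343 / 0.140

print(ratios_outside_band([iv32_ratio], 2.50, 2.74))  # [2.45], the original claim fails
print(ratios_outside_band([iv32_ratio], 2.45, 2.74))  # [], the corrected claim holds
```

Feeding it every row of the iso-iv table would check the whole band in one call.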