feat(blog): B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x better perf/$ by functionstackx · Pull Request #389 · SemiAnalysisAI/InferenceX-app

functionstackx · 2026-05-26T04:53:49Z

Summary

New blog post: B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x Better Performance per Dollar (/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar). On the 8K/1K workload with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x–2.95x cheaper per million tokens than H200 INT4 across the entire 30–90 tok/s/user serving band, peaking at 2.95x at 32 tok/s/user ($0.140/M vs $0.413/M, a 66% reduction). On the same B200 silicon, NVFP4 is another 2.45x–2.74x cheaper than INT4 at iso-interactivity.
Three-factor silicon-to-perf decomposition anchors the cost gap to specs the reader can audit on /gpu-specs:
1. 1.67x HBM bandwidth (8 vs 4.8 TB/s): the decode-bound throughput floor
2. 1.28x HBM capacity (180 vs 141 GB): fits K2 in TP=4 on B200 vs TP=8 on H200, halving collective traffic per decode step (Amdahl's law on the serial-collective bottleneck)
3. NVFP4 precision unlock (9,000 TFLOP/s FP4 cores on B200; Hopper SM90 has zero FP4 tensor cores)
Composing these three factors and dividing by the 1.38x B200 TCO penalty ($1.95 vs $1.41/GPU/hr) gives the measured 2.95x cost gap.
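The ratios in this decomposition can be re-derived from the figures quoted in the summary. The sketch below is illustrative only: the variable names are hypothetical, and the implied per-GPU throughput ratio is derived here from the quoted cost and TCO numbers. It is not a figure the post reports.

```python
# Figures quoted in the PR summary.
hbm_bw_b200, hbm_bw_h200 = 8.0, 4.8      # TB/s
hbm_cap_b200, hbm_cap_h200 = 180, 141    # GB
tco_b200, tco_h200 = 1.95, 1.41          # $/GPU/hr
cost_b200, cost_h200 = 0.140, 0.413      # $/M tokens at 32 tok/s/user

bw_ratio = hbm_bw_b200 / hbm_bw_h200     # ~1.67x
cap_ratio = hbm_cap_b200 / hbm_cap_h200  # ~1.28x
tco_penalty = tco_b200 / tco_h200        # ~1.38x
cost_ratio = cost_h200 / cost_b200       # ~2.95x
reduction = 1 - cost_b200 / cost_h200    # ~66%

# Cost per token = ($/GPU/hr) / (tokens/GPU/hr), so the per-GPU throughput
# advantage needed to produce the cost gap is cost_ratio * tco_penalty.
# This is an inferred quantity (assumption), not a number from the post.
implied_throughput_ratio = cost_ratio * tco_penalty

print(f"HBM BW ratio:       {bw_ratio:.2f}x")
print(f"HBM capacity ratio: {cap_ratio:.2f}x")
print(f"TCO penalty:        {tco_penalty:.2f}x")
print(f"Cost gap:           {cost_ratio:.2f}x ({reduction:.0%} reduction)")
print(f"Implied throughput/GPU ratio: {implied_throughput_ratio:.2f}x")
```

The last line shows why the TCO penalty sits in the denominator: B200 has to deliver roughly four times the tokens per GPU-hour for the cost gap to land at 2.95x after paying the higher hourly rate.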
Kimi K2.5/K2.6 framing is anchored on production deployment. The models are the open-weights backbone behind xAI's Cursor Composer 2 and Composer 2.5 (1M+ daily active users from the Cursor IDE). They lead SWE-Bench Pro at 58.6%, ahead of GPT-5.4 (57.7%), Gemini 3.1 Pro (54.2%), and Opus 4.6 (53.4%), score 80.2% on SWE-Bench Verified, and show a 3.3% failure rate on Cline's diff-editing production data (matching Claude 4 Sonnet). K2.5 and K2.6 share the same pre-trained backbone (K2.6 adds post-training refinements only), so every serving result applies one-to-one to both.
Architecture diagram from Moonshot's model card: 1.0T total / 32B active parameters, 1 dense + 60 MoE blocks, MLA attention, top-8-of-384 expert routing, 256K context (262,144 tokens), YaRN RoPE, vocab 163,840.
9 files added: MDX + 6 image variants (benchmark chart, MI355X-vs-B200 specs radar, Kimi K2 architecture diagram, all in light/dark pairs).
Iso-iv table built with the bundled iso_interactivity.py helper. Three per-config tables (H200 INT4 TP=8, B200 INT4 TP=8, B200 NVFP4 TP=4 + TP=8). FAQ JSON-LD covers the five questions readers actually ask: cost ratio, silicon-vs-precision decomposition, NVFP4-vs-INT4 on same silicon, why K2.5/K2.6 matters, what's not covered.
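For readers unfamiliar with iso-interactivity comparisons, the idea can be sketched as follows. This is not the bundled iso_interactivity.py helper, whose implementation is not shown in this PR. It is a minimal stand-in that assumes each config's sweep is a list of (tok/s/user, $/M tokens) points and that linear interpolation between neighbouring points is acceptable. Only the values at 32 tok/s/user come from the PR; the other sweep points are made-up placeholders.

```python
from bisect import bisect_left

def cost_at_interactivity(points, target):
    """Linearly interpolate $/M tokens at a target tok/s/user.

    points: iterable of (interactivity, cost) pairs from one config's sweep.
    Raises ValueError instead of extrapolating outside the measured range.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    if not xs[0] <= target <= xs[-1]:
        raise ValueError("target outside measured range")
    i = bisect_left(xs, target)
    if xs[i] == target:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    t = (target - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Hypothetical sweeps: only the iv=32 costs are taken from the PR text.
h200_int4 = [(20, 0.30), (32, 0.413), (50, 0.60)]
b200_nvfp4 = [(20, 0.10), (32, 0.140), (50, 0.21)]

target_iv = 32
ratio = (cost_at_interactivity(h200_int4, target_iv)
         / cost_at_interactivity(b200_nvfp4, target_iv))
print(f"cost ratio at {target_iv} tok/s/user: {ratio:.2f}x")
```

Comparing at a shared interactivity target, instead of at a shared concurrency, keeps the two configs at the same per-user experience, which is what the 30–90 tok/s/user band in the post is measuring.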

SKILL.md update

Also bundling a small SKILL.md house-style addition that came out of writing this post: a ban on the "X, not Y" antithesis construction ("the gap is silicon × precision, not framework", "this is a real lever, not a paper one", "every gain came from the kernels, not the silicon", etc.). Reads as performatively contrarian AI flexing and was getting reflexively cut in editorial review — codifying so future drafts skip it. Three before/after examples included.
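A rule like this could also be checked mechanically before editorial review. The sketch below is a hypothetical lint, not part of this PR or of SKILL.md: it flags any line containing a ", not <word>" clause, which is a crude heuristic that will produce false positives and miss rephrased variants.

```python
import re

# Crude heuristic (assumption): a comma followed by "not" and another word.
ANTITHESIS = re.compile(r",\s+not\s+\w+", re.IGNORECASE)

def find_antithesis(text):
    """Return (line_number, line) pairs that contain an 'X, not Y' clause."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ANTITHESIS.search(line):
            hits.append((lineno, line.strip()))
    return hits

# The three flagged sentences are the examples quoted in this PR description.
draft = """the gap is silicon × precision, not framework
this is a real lever, not a paper one
every gain came from the kernels, not the silicon
B200 NVFP4 is cheaper per million tokens than H200 INT4."""

hits = find_antithesis(draft)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

A human still has to decide whether each hit is the banned construction or a legitimate negation, so this is best treated as a prompt for review, not as a gate.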

Test plan

pnpm dev and visit /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar — verify all 3 figures render in light + dark modes
Post appears in /blog listing with correct title, subtitle, publish date (2026-05-26)
OG image renders correctly
DashboardCTA at top + bottom + live-chart link all land on the preset 3-way comparison view on inferencex.semianalysis.com
Sitemap / RSS feed / llms.txt include the new post
All HF model card + tomtunguz + cline + NVIDIA datasheet + SemiAnalysis TCO + vLLM repo links resolve

🤖 Generated with Claude Code

Note

Low Risk
Documentation and editorial guidance only; no application runtime or security-sensitive code paths change in the diff.

Overview
Adds a new InferenceX benchmark post at /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar comparing vLLM on Kimi K2.5/K2.6 at 8K/1K across B200 NVFP4, B200 INT4, and H200 INT4 (InferenceX run 2026-05-19). The writeup anchors 2.71x–2.95x lower $/M tokens for B200 NVFP4 vs H200 INT4 in the 30–90 tok/s/user band, plus 2.45x–2.74x NVFP4 vs INT4 on the same B200 hardware, with per-concurrency tables, an iso-interactivity cost table, /gpu-specs radar and architecture figures, preset DashboardCTA links, and FAQ JSON-LD.

Updates write-inferencex-blog SKILL.md with an editorial rule to avoid the “X, not Y” antithesis pattern, with before/after examples for future drafts.

Reviewed by Cursor Bugbot for commit 648e4e0. Bugbot is set up for automated code reviews on this repo.

vercel · 2026-05-26T04:53:55Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
inferencemax-app	Ready	Preview, Comment	May 26, 2026 8:15am

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF.

Reviewed by Cursor Bugbot for commit cc2330c.

…etter perf/$

On 8K/1K with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x-2.95x cheaper per million tokens than H200 INT4 across the 30-90 tok/s/user serving band (peak 2.95x at 32 tok/s/user, $0.140/M vs $0.413/M). The cost gap decomposes into B200's silicon ratios over H200 (1.67x HBM BW, 1.28x HBM capacity that unlocks TP=4 vs TP=8, no FP4 tensor cores on Hopper at all) composed with the NVFP4 precision unlock, divided by B200's 1.38x TCO penalty.

Kimi K2.5 and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and Composer 2.5, leading SWE-Bench Pro at 58.6% over GPT-5.4 / Opus 4.6 / Gemini 3.1 Pro. Both releases share the same backbone (K2.6 is a post-training refinement of K2.5), so every serving curve applies one-to-one to both.

Also adds an X-not-Y antithesis ban to the write-inferencex-blog SKILL house style ("the gap is silicon x precision, not framework" etc.). Reads as performatively contrarian AI flexing and was getting reflexively cut on review; codifying so future drafts don't repeat it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bugbot caught a numerical inconsistency: the iso-iv table shows the B200 INT4 / B200 NVFP4 ratio at iv=32 is 2.45x ($0.343/M vs $0.140/M), but subtitle, lede, and FAQ all claimed "2.50x–2.74x across the 30–90 tok/s/user band". Lower bound corrected to 2.45x in all three places. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
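The inconsistency described in this commit is easy to reproduce from the two costs it quotes. The sketch below is a hypothetical check, not the logic Bugbot actually runs: it recomputes the ratio from the iso-iv table values and tests it against a claimed range, with a small tolerance for two-decimal rounding.

```python
def ratio_within_claim(baseline_cost, candidate_cost, claimed_low, claimed_high, tol=0.005):
    """Return (ratio, ok), where ok is True if the ratio sits inside the claimed range."""
    ratio = baseline_cost / candidate_cost
    ok = (claimed_low - tol) <= ratio <= (claimed_high + tol)
    return ratio, ok

# B200 INT4 vs B200 NVFP4 at iv=32, as quoted in the commit message.
b200_int4_cost, b200_nvfp4_cost = 0.343, 0.140

ratio, ok_before_fix = ratio_within_claim(b200_int4_cost, b200_nvfp4_cost, 2.50, 2.74)
_, ok_after_fix = ratio_within_claim(b200_int4_cost, b200_nvfp4_cost, 2.45, 2.74)

print(f"measured ratio: {ratio:.2f}x")
print(f"inside 2.50x-2.74x claim: {ok_before_fix}")
print(f"inside 2.45x-2.74x claim: {ok_after_fix}")
```

Running a check like this over every headline range against the underlying table would catch the same class of drift between prose and data before review.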

User asked to drop the https://tomtunguz.com/cursor-kimi-open-source-ai-imperative link. Both instances removed (lede + the model-architecture section's parenthetical citation). Surrounding text preserved: the xAI Cursor Composer 2 / 2.5 claim itself stays, just no longer hyperlinked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 26, 2026 04:54 View deployment

cursor Bot reviewed May 26, 2026


Comment thread packages/app/content/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx Outdated

vercel Bot deployed to Preview May 26, 2026 04:56 View deployment

vercel Bot deployed to Preview May 26, 2026 05:15 View deployment

functionstackx and others added 3 commits May 26, 2026 04:14

functionstackx force-pushed the feat/blog-b200-nvfp4-h200-int4-kimi-k2-vllm branch from 733c428 to 648e4e0 Compare May 26, 2026 08:14

vercel Bot deployed to Preview May 26, 2026 08:15 View deployment

functionstackx wants to merge 3 commits into master from feat/blog-b200-nvfp4-h200-int4-kimi-k2-vllm
