feat(blog): B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x better perf/$#389
Open
functionstackx wants to merge 3 commits into
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit cc2330c.
…etter perf/$
On 8K/1K with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x-2.95x cheaper
per million tokens than H200 INT4 across the 30-90 tok/s/user serving band
(peak 2.95x at 32 tok/s/user, $0.140/M vs $0.413/M). The cost gap decomposes
into B200's silicon ratios over H200 (1.67x HBM BW, 1.28x HBM capacity that
unlocks TP=4 vs TP=8, no FP4 tensor cores on Hopper at all) composed with
the NVFP4 precision unlock, divided by B200's 1.38x TCO penalty. Kimi K2.5
and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and
Composer 2.5, leading SWE-Bench Pro at 58.6% over GPT-5.4 / Opus 4.6 /
Gemini 3.1 Pro. Same backbone across both releases — K2.6 is a post-training
refinement of K2.5 — so every serving curve applies one-to-one to both.
Also adds an X-not-Y antithesis ban to the write-inferencex-blog SKILL
house style ("the gap is silicon x precision, not framework" etc.). Reads
as performatively contrarian AI flexing and was getting reflexively cut on
review; codifying so future drafts don't repeat it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
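The headline arithmetic in the commit message above can be reproduced with a minimal sketch. The dollar figures and the 1.38x TCO penalty are taken from the message; treating the penalty as a per-GPU-hour cost ratio, and the function names themselves, are assumptions made only for illustration.

```python
# Figures quoted in the commit message (USD per million tokens at 32 tok/s/user).
H200_INT4_COST = 0.413
B200_NVFP4_COST = 0.140

# Assumption: the 1.38x TCO penalty is a per-GPU-hour cost ratio (B200 / H200).
B200_TCO_PENALTY = 1.38


def cost_ratio(baseline: float, candidate: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline / candidate


def percent_reduction(baseline: float, candidate: float) -> float:
    """Cost reduction of the candidate relative to the baseline, in percent."""
    return 100.0 * (1.0 - candidate / baseline)


def implied_throughput_ratio(cost_advantage: float, tco_penalty: float) -> float:
    """Per-GPU throughput ratio needed to reach cost_advantage despite tco_penalty."""
    return cost_advantage * tco_penalty


ratio = cost_ratio(H200_INT4_COST, B200_NVFP4_COST)
reduction = percent_reduction(H200_INT4_COST, B200_NVFP4_COST)
throughput = implied_throughput_ratio(ratio, B200_TCO_PENALTY)

print(f"{ratio:.2f}x cheaper, {reduction:.0f}% reduction")
print(f"implied per-GPU throughput advantage: about {throughput:.2f}x")
```

Under that reading of the TCO penalty, the cost advantage multiplied by the penalty gives the per-GPU throughput advantage the silicon and precision factors would have to supply together.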
Bugbot caught a numerical inconsistency: the iso-iv table shows the B200 INT4 / B200 NVFP4 ratio at iv=32 is 2.45x ($0.343/M vs $0.140/M), but subtitle, lede, and FAQ all claimed "2.50x–2.74x across the 30–90 tok/s/user band". Lower bound corrected to 2.45x in all three places.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
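The inconsistency described in this commit is easy to check mechanically. The sketch below uses only the two table values and the two bands quoted in the message; `ratio_in_band` and its rounding tolerance are hypothetical names and choices, not part of the repository.

```python
# Values quoted in the commit message for the iso-interactivity table at 32 tok/s/user.
B200_INT4_COST = 0.343   # USD per million tokens
B200_NVFP4_COST = 0.140  # USD per million tokens


def ratio_in_band(ratio: float, low: float, high: float, tol: float = 0.005) -> bool:
    """True if ratio lies inside [low, high], allowing for two-decimal rounding."""
    return (low - tol) <= ratio <= (high + tol)


observed = B200_INT4_COST / B200_NVFP4_COST

# The band originally claimed in the subtitle, lede, and FAQ.
original_claim_ok = ratio_in_band(observed, 2.50, 2.74)
# The band after the lower bound was corrected.
corrected_claim_ok = ratio_in_band(observed, 2.45, 2.74)

print(f"observed ratio: {observed:.2f}x")
print(f"fits 2.50x-2.74x: {original_claim_ok}; fits 2.45x-2.74x: {corrected_claim_ok}")
```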
User asked to drop the https://tomtunguz.com/cursor-kimi-open-source-ai-imperative link. Both instances removed (lede + the model-architecture section's parenthetical citation). Surrounding text preserved: the xAI Cursor Composer 2 / 2.5 claim itself stays, just no longer hyperlinked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
733c428 to 648e4e0

Summary
/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar). On the 8K/1K workload with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x–2.95x cheaper per million tokens than H200 INT4 across the entire 30–90 tok/s/user serving band, peaking at 2.95x at 32 tok/s/user ($0.140/M vs $0.413/M, a 66% reduction). On the same B200 silicon, NVFP4 vs INT4 is worth another 2.45x–2.74x at iso-interactivity.
/gpu-specs: iso_interactivity.py helper. Three per-config tables (H200 INT4 TP=8, B200 INT4 TP=8, B200 NVFP4 TP=4 + TP=8). FAQ JSON-LD covers the five questions readers actually ask: cost ratio, silicon-vs-precision decomposition, NVFP4-vs-INT4 on same silicon, why K2.5/K2.6 matters, what's not covered.
SKILL.md update
Also bundling a small SKILL.md house-style addition that came out of writing this post: a ban on the "X, not Y" antithesis construction ("the gap is silicon × precision, not framework", "this is a real lever, not a paper one", "every gain came from the kernels, not the silicon", etc.). Reads as performatively contrarian AI flexing and was getting reflexively cut in editorial review — codifying so future drafts skip it. Three before/after examples included.
Test plan
pnpm dev and visit /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar; verify all 3 figures render in light + dark modes
/blog listing with correct title, subtitle, publish date (2026-05-26)
🤖 Generated with Claude Code
Note
Low Risk
Documentation and editorial guidance only; no application runtime or security-sensitive code paths change in the diff.
Overview
Adds a new InferenceX benchmark post at /blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar comparing vLLM on Kimi K2.5/K2.6 at 8K/1K across B200 NVFP4, B200 INT4, and H200 INT4 (InferenceX run 2026-05-19). The writeup anchors 2.71x–2.95x lower $/M tokens for B200 NVFP4 vs H200 INT4 in the 30–90 tok/s/user band, plus 2.45x–2.74x NVFP4 vs INT4 on the same B200 hardware, with per-concurrency tables, an iso-interactivity cost table, /gpu-specs radar and architecture figures, preset DashboardCTA links, and FAQ JSON-LD.
Updates write-inferencex-blog SKILL.md with an editorial rule to avoid the “X, not Y” antithesis pattern, with before/after examples for future drafts.
Reviewed by Cursor Bugbot for commit 648e4e0.
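For readers unfamiliar with iso-interactivity comparisons, the idea behind the cost table can be sketched as follows: each configuration is measured at its own set of interactivity points, so costs are interpolated to a shared tok/s/user target before taking a ratio. This is a hedged illustration only. Linear interpolation is an assumption, `cost_at_interactivity` is a hypothetical name, and the curve values are made-up placeholders, not InferenceX measurements or the behaviour of the PR's `iso_interactivity.py` helper.

```python
from bisect import bisect_left


def cost_at_interactivity(points, target):
    """Linearly interpolate cost ($/M tokens) at a target interactivity (tok/s/user).

    points: iterable of (interactivity, cost) pairs in any order.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    if not xs[0] <= target <= xs[-1]:
        raise ValueError("target outside measured range; refusing to extrapolate")
    i = bisect_left(xs, target)
    if xs[i] == target:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (target - x0) / (x1 - x0)


# Placeholder curves (NOT measured data): (tok/s/user, USD per million tokens).
curve_a = [(20, 0.50), (40, 0.30), (80, 0.20)]
curve_b = [(25, 0.20), (50, 0.10)]

target = 30
cost_a = cost_at_interactivity(curve_a, target)
cost_b = cost_at_interactivity(curve_b, target)
print(f"at {target} tok/s/user: {cost_a:.3f} vs {cost_b:.3f}, ratio {cost_a / cost_b:.2f}x")
```

Refusing to extrapolate keeps a ratio from being reported at an interactivity level that one of the configurations never reached.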