docs: May 2026 refresh — Red Hat eval, FP8 nuance, community ports, drop QJL by default by OnlyTerp · Pull Request #6 · OnlyTerp/turboquant

OnlyTerp · 2026-05-25T09:26:03Z

Summary

Full documentation refresh that brings the repo up to date with the May 2026 ecosystem.
Touches README.md, LANDSCAPE_2026.md, FAQ.md, INTEGRATIONS.md, BENCHMARKS.md,
IMPLEMENTATION_NOTES.md, LAUNCH.md, and the markdown link-check workflow. No code
changes.

Why now

The April 2026 docs framed TurboQuant as the new SOTA "use it everywhere" KV-cache
compressor. Since then:

- Red Hat AI / vLLM published the first comprehensive third-party evaluation (May 11).
  Conclusions: FP8 KV is the better default on Hopper/Blackwell, *_nc (no-QJL) variants
  strictly Pareto-dominate the paper's QJL-augmented variants, and 3bit_nc drops
  15–25 pts on reasoning at long context.
- Community ports converged on dropping QJL. tonbistudio V3 PyTorch, scos-lab 8-model
  benchmark, and 0xSero Triton all independently find that the 1-bit QJL residual
  hurts at b ≤ 3 because softmax amplifies its variance.
- The serving stack consolidated. vLLM merged --kv-cache-dtype turboquant_* upstream
  (0.20.2+); a production fork (turboquant-plus-vllm) ships CUDA dequant kernels with
  10.1× decode speedup on Qwen3.6-35B-A3B; llama.cpp has two forks for different
  head_dim regimes (spiritbuun = 128 only, AmesianX = any).
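
The "softmax amplifies its variance" point can be illustrated with a small simulation. This is a minimal sketch, not the QJL estimator itself: zero-mean Gaussian noise on the attention logits stands in for an unbiased but noisy residual correction, and the function names (`softmax`, `mean_attention_error`) and all parameters are hypothetical. For Gaussian noise with standard deviation σ, E[exp(ε)] = exp(σ²/2) > 1, so noise that averages out in the logits does not average out in the attention weights.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_attention_error(sigma, n_keys=64, trials=200, seed=0):
    """Average total-variation distance between clean and noisy attention weights.

    Assumption: the logit noise is i.i.d. zero-mean Gaussian with std `sigma`;
    this is a stand-in for estimator noise, not a model of any real kernel.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_keys)]
        noisy = [x + rng.gauss(0.0, sigma) for x in logits]
        p, q = softmax(logits), softmax(noisy)
        total += 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return total / trials


for sigma in (0.0, 0.25, 0.5, 1.0):
    err = mean_attention_error(sigma)
    print(f"logit noise std={sigma:<4}  mean TV distance={err:.3f}")
```

The error in the attention weights grows with the logit noise even though the noise has zero mean, which is the qualitative effect the ports report; the magnitudes here say nothing about real models.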

What changes

README.md
- New "Reality check (May 2026)" table with 6 findings drawn from the Red Hat eval +
  independent ports (FP8 default, QJL hurts, 3-bit limits, layer skipping, K/V norm
  ratio, layer-0 outliers).
- New "Where TurboQuant still wins" table calling out the 4 scenarios it actually
  beats FP8 on.
- New "Tips and tricks" section: 10 concrete recommendations distilled from the
  May findings (try FP8 first, prefer 4bit_nc, disable QJL, skip first/last 2
  layers, profile K/V norms, dynamic outlier thresholds, native-packed checkpoints,
  head_dim caveats for llama.cpp, stack with token selection for long reasoning,
  don't trust headline numbers without verification).
- "What's new — May 2026" replaces the April 72-hour snapshot.
- Quick-start, integrations table, decision guide, FAQ links, and limitations updated
  to recommend FP8 first and turboquant_4bit_nc (not 3-bit / not QJL-on).
- Community-port table covers 0xSero, tonbistudio, varjoranta, AmesianX, spiritbuun,
  scos-lab.
LANDSCAPE_2026.md — top-level "big May 2026 update" summary, FP8 added to TL;DR
as the new default, third-party reproduction tables (MRCR-8/16, AIME25, throughput),
refreshed engine + decision tables, expanded "further reading" with the May 2026
evaluations.
FAQ.md — 5 new top questions ("Should I use FP8 or TurboQuant?", "Which variant?",
"Does QJL help?", "Which layers to skip?", "How to choose K vs V bits?"). KIVI / KVQuant
/ FP8 comparison table now includes throughput penalties. Speed-vs-quality table uses
the Red Hat measurements. The "viral take" answer now reflects the May data instead of
amplifying the April hype. Benchmarking methodology section rewritten around MRCR +
AIME25 + the layer-skip / *_nc / head_dim / K-V-norm checks.
INTEGRATIONS.md — vLLM section reorganized: upstream-merged dtypes first, then the
turboquant-plus-vllm production fork, then the 0xSero Triton reference, then this
repo's plugin. SGLang section shows the three competing PRs and that none are merged.
llama.cpp split into spiritbuun (head_dim=128) vs AmesianX (any head_dim). MLX section
moved to the varjoranta Metal port as the default. Troubleshooting expanded with a
quality-regression checklist.
BENCHMARKS.md — new "Independent reproductions (May 2026)" section with the Red
Hat numbers. Format-comparison table now includes the FP8 baseline and measured
throughput penalties. Methodology section adds the 2026 evaluation playbook. QJL
variance explainer rewritten to actually explain why QJL fails in practice. Layer
sensitivity table added.
IMPLEMENTATION_NOTES.md — "QJL Score Weight" section now explains why
qjl_score_weight=0.0 is the May 2026 consensus default. New "Layer Skipping" section.
LAUNCH.md — archive notice flagging that the April marketing hooks overstate the
3-bit case.
.github/workflows/link-check.yml — excludes a few aggressively bot-blocking
domains (pub.towardsai.net, teqvolt.com, ai-intensify.com, allenkuo.medium.com)
so CI doesn't fail on legitimate but anti-bot links.
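
The layer-skipping and QJL-off recommendations in the list above can be sketched as a small planning helper. This is an illustration under stated assumptions: `qjl_score_weight` is the only name taken from the docs, while `KVQuantPlan`, `plan_layers`, `skip_first`, and `skip_last` are hypothetical and do not correspond to any API in this repo or in vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVQuantPlan:
    """Hypothetical container; only qjl_score_weight is a name from the docs."""
    quantized_layers: tuple
    skipped_layers: tuple
    qjl_score_weight: float


def plan_layers(num_layers, skip_first=2, skip_last=2, qjl_score_weight=0.0):
    """Leave the first and last layers unquantized; quantize the rest.

    Defaults mirror the May 2026 recommendations summarized in this PR:
    skip the first/last 2 layers and disable the QJL term (weight 0.0).
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if skip_first < 0 or skip_last < 0:
        raise ValueError("skip counts must be non-negative")
    skipped = {
        i for i in range(num_layers)
        if i < skip_first or i >= num_layers - skip_last
    }
    quantized = tuple(i for i in range(num_layers) if i not in skipped)
    return KVQuantPlan(quantized, tuple(sorted(skipped)), qjl_score_weight)


plan = plan_layers(num_layers=32)
print("skipped layers:", plan.skipped_layers)
print("quantized layer count:", len(plan.quantized_layers))
print("qjl_score_weight:", plan.qjl_score_weight)
```

Keeping the plan as plain data makes it easy to log alongside benchmark results, so a quality regression can be traced back to which layers were actually compressed.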

Review & Testing Checklist for Human

Risk level: yellow (docs only, no code changes — but it's a watched repo and the
framing matters).

- Skim the new "Reality check" and "Tips and tricks" sections in README.md. Make
  sure the tone reads as "respectful nuance" rather than "we wuz wrong". The goal is
  to surface the May 2026 community findings without trashing the project.
- Spot-check 4–5 of the new external links (Red Hat blog, tonbistudio, scos-lab,
  varjoranta, AmesianX, turboquant-plus-vllm on PyPI) to make sure I'm linking
  to the canonical URL for each.
- Watch the Link Check CI job. I added a few exclusions but lychee can still flag
  false positives. If it complains, let me know and I'll triage.
- Watch the Tests CI job. It should be untouched since no code changed, but worth
  confirming.
- Decide whether to keep LAUNCH.md as-is with the archive notice, rewrite it for
  the May 2026 framing, or delete it entirely. I left it as an annotated archive on
  the assumption that the launch-week framing is historically useful.

Notes

- Followed the user's instruction to "treat it with respect": no prior work was
  removed, only annotated or reframed where the new data demands it. The paper's
  results and the reference implementation's correctness are preserved; the messaging
  is recalibrated.
- All numbers in the new "Independent reproductions" tables come from the Red Hat blog
  post. I've cited the source on every table.
- The --kv-cache-dtype turboquant_* names (k8v4, 4bit_nc, k3v4_nc, 3bit_nc)
  match the upstream vLLM merge.
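
To make the naming concrete, here is a minimal sketch that builds the serve command for the "FP8 first, then turboquant_4bit_nc" order recommended in this PR. The variant suffixes are the ones listed above and `--kv-cache-dtype` is the flag named in this PR; the helper names (`kv_cache_dtype`, `serve_command`) and the placeholder model ID are hypothetical, and the snippet only prints commands rather than launching anything.

```python
import shlex

# Suffixes as listed in this PR; assumed to map to turboquant_<suffix>.
TURBOQUANT_VARIANTS = ("k8v4", "4bit_nc", "k3v4_nc", "3bit_nc")


def kv_cache_dtype(variant):
    """Map a short variant name to the --kv-cache-dtype value."""
    if variant == "fp8":
        return "fp8"
    if variant not in TURBOQUANT_VARIANTS:
        raise ValueError(f"unknown KV-cache variant: {variant!r}")
    return f"turboquant_{variant}"


def serve_command(model, variant):
    """Return the argv list for serving `model` with the given KV-cache dtype."""
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype(variant)]


# Try FP8 first, then the no-QJL 4-bit variant (placeholder model ID).
for variant in ("fp8", "4bit_nc"):
    print(shlex.join(serve_command("your-org/your-model", variant)))
```

Rejecting unknown variants up front avoids silently benchmarking a misspelled dtype; whether a given vLLM build accepts each value should still be checked against that build.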

Link to Devin session: https://app.devin.ai/sessions/8e502cb970d94108bd312b91d2f19162
Requested by: @OnlyTerp

…rop QJL by default

- Add 'Reality check (May 2026)' table to README citing the Red Hat AI / vLLM evaluation: FP8 KV is the default, *_nc variants drop QJL, skip first/last 2 layers, K/V norm ratio predicts quality.
- New 'Tips and tricks' section in README distilling the May 2026 findings into 10 concrete recommendations.
- README quick-start, integrations table, decision guide and limitations now reflect upstream vLLM merge, turboquant-plus-vllm, 0xSero Triton, varjoranta CUDA + MLX kernels, AmesianX llama.cpp (head_dim 256+), tonbistudio V3, spiritbuun llama.cpp, scos-lab benchmarks.
- LANDSCAPE_2026.md: 'big May 2026 update' summary, TL;DR rewrites FP8 as the baseline, full third-party reproduction tables (MRCR, AIME25, throughput), refreshed engine + decision tables, expanded 'further reading' with the May evaluations.
- FAQ.md: new questions ('Should I use FP8 or TurboQuant?', 'Which variant?', 'Does QJL help?', 'Which layers to skip?', 'How to choose K/V bits?'); updated KIVI/KVQuant/FP8 table; honest accuracy-vs-speed numbers; explainer hype check.
- INTEGRATIONS.md: vLLM section reorganized around upstream-merged dtypes + production fork; SGLang shows the three competing PRs; llama.cpp split into spiritbuun (head_dim=128) vs AmesianX (any head_dim); MLX section moved to varjoranta Metal port; troubleshooting expanded with quality-regression checklist.
- BENCHMARKS.md: Independent reproductions section with Red Hat numbers; format-comparison table now includes FP8 baseline + measured throughput penalties; methodology section updated with 2026 best practices; QJL variance explainer; layer sensitivity table.
- IMPLEMENTATION_NOTES.md: QJL Score Weight section now explains why to default qjl_score_weight=0.0; new Layer Skipping section.
- LAUNCH.md: archive notice flagging the April hype framing.
- link-check workflow: exclude a few aggressively bot-blocking domains (towardsai, teqvolt, ai-intensify, allenkuo.medium).

Co-Authored-By: Rob <onerobby@gmail.com>


- varjoranta/turboquant-vllm/tree/main/mlx → repo root (the /mlx subpath doesn't exist; the Metal kernels are mentioned in the README of the parent repo).
- openai/simple-evals/blob/main/mrcr_eval.py → repo root (the file path has moved upstream).

Co-Authored-By: Rob <onerobby@gmail.com>

devin-ai-integration Bot assigned OnlyTerp May 25, 2026

OnlyTerp merged commit 1891cdd into master May 25, 2026
6 checks passed

OnlyTerp merged 2 commits into master from devin/1779701080-turboquant-may2026-refresh
