docs: May 2026 refresh — Red Hat eval, FP8 nuance, community ports, drop QJL by default#6
Merged
- Add 'Reality check (May 2026)' table to README citing the Red Hat AI / vLLM
evaluation: FP8 KV is the default, *_nc variants drop QJL, skip first/last
2 layers, K/V norm ratio predicts quality.
- New 'Tips and tricks' section in README distilling the May 2026 findings
into 10 concrete recommendations.
- README quick-start, integrations table, decision guide and limitations now
reflect upstream vLLM merge, turboquant-plus-vllm, 0xSero Triton, varjoranta
CUDA + MLX kernels, AmesianX llama.cpp (head_dim 256+), tonbistudio V3,
spiritbuun llama.cpp, scos-lab benchmarks.
- LANDSCAPE_2026.md: 'big May 2026 update' summary, TL;DR rewrites FP8 as the
baseline, full third-party reproduction tables (MRCR, AIME25, throughput),
refreshed engine + decision tables, expanded 'further reading' with the May
evaluations.
- FAQ.md: new questions ('Should I use FP8 or TurboQuant?', 'Which variant?',
'Does QJL help?', 'Which layers to skip?', 'How to choose K/V bits?');
updated KIVI/KVQuant/FP8 table; honest accuracy-vs-speed numbers; explainer
hype check.
- INTEGRATIONS.md: vLLM section reorganized around upstream-merged dtypes +
production fork; SGLang shows the three competing PRs; llama.cpp split into
spiritbuun (head_dim=128) vs AmesianX (any head_dim); MLX section moved to
varjoranta Metal port; troubleshooting expanded with quality-regression
checklist.
- BENCHMARKS.md: Independent reproductions section with Red Hat numbers;
format-comparison table now includes FP8 baseline + measured throughput
penalties; methodology section updated with 2026 best practices; QJL
variance explainer; layer sensitivity table.
- IMPLEMENTATION_NOTES.md: QJL Score Weight section now explains why to
default qjl_score_weight=0.0; new Layer Skipping section.
- LAUNCH.md: archive notice flagging the April hype framing.
- link-check workflow: exclude a few aggressively bot-blocking domains
(towardsai, teqvolt, ai-intensify, allenkuo.medium).
Co-Authored-By: Rob <onerobby@gmail.com>
- varjoranta/turboquant-vllm/tree/main/mlx → repo root (the /mlx subpath doesn't
exist; the Metal kernels are mentioned in the README of the parent repo).
- openai/simple-evals/blob/main/mrcr_eval.py → repo root (the file path has moved
upstream).
Co-Authored-By: Rob <onerobby@gmail.com>
Summary
Full documentation refresh that brings the repo up to date with the May 2026 ecosystem.
Touches README.md, LANDSCAPE_2026.md, FAQ.md, INTEGRATIONS.md, BENCHMARKS.md,
IMPLEMENTATION_NOTES.md, LAUNCH.md, and the markdown link-check workflow. No code changes.
Why now
The April 2026 docs framed TurboQuant as the new SOTA "use it everywhere" KV-cache
compressor. Since then:
- FP8 KV is the better default on Hopper/Blackwell, `*_nc` (no-QJL) variants strictly
  Pareto-dominate the paper's QJL-augmented variants, and `3bit_nc` drops 15–25 pts on
  reasoning at long context.
- scos-lab 8-model benchmark, and 0xSero Triton all independently find that the 1-bit
  QJL residual hurts at b ≤ 3 because softmax amplifies its variance.
- `--kv-cache-dtype turboquant_*` upstream (0.20.2+); a production fork
  (`turboquant-plus-vllm`) ships CUDA dequant kernels with 10.1× decode speedup on
  Qwen3.6-35B-A3B; llama.cpp has two forks for different head_dim regimes
  (spiritbuun = 128 only, AmesianX = any).
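The claim that softmax amplifies the variance of a noisy score estimate can be illustrated with a toy experiment. The sketch below is not QJL: it assumes plain zero-mean Gaussian noise as a stand-in for the residual estimator's error, uses synthetic logits, and the function names are hypothetical. It only shows that zero-mean noise on attention logits still shifts the attention weights, and that the shift grows with the noise level.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_attention_error(sigma, n_keys=1024, n_trials=200, seed=0):
    """Mean L1 distance between clean and noisy attention weights.

    sigma is the standard deviation of the zero-mean Gaussian noise added to
    the logits (an assumed stand-in for quantization-estimator noise).
    """
    rng = np.random.default_rng(seed)
    logits = rng.normal(0.0, 2.0, size=(n_trials, n_keys))  # synthetic scores
    noise = rng.normal(0.0, sigma, size=logits.shape)       # zero-mean error
    clean = softmax(logits)
    noisy = softmax(logits + noise)
    return float(np.abs(clean - noisy).sum(axis=-1).mean())

# Error of the attention distribution at a few assumed noise levels.
errors = {s: mean_attention_error(s) for s in (0.0, 0.25, 0.5, 1.0)}
for sigma, err in errors.items():
    print(f"sigma={sigma:<4} mean L1 error={err:.4f}")
```

Because the exponential is convex, noise that averages to zero in the logits does not average out in the weights, which is one way to read the "unbiased but high-variance" failure mode described above.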
What changes
- independent ports (FP8 default, QJL hurts, 3-bit limits, layer skipping, K/V norm
  ratio, layer-0 outliers).
- beats FP8 on.
- May findings (try FP8 first, prefer `4bit_nc`, disable QJL, skip first/last 2 layers,
  profile K/V norms, dynamic outlier thresholds, native-packed checkpoints,
  head_dim caveats for llama.cpp, stack with token selection for long reasoning,
  don't trust headline numbers without verification).
- to recommend FP8 first and `turboquant_4bit_nc` (not 3-bit / not QJL-on).
- scos-lab.
- as the new default, third-party reproduction tables (MRCR-8/16, AIME25, throughput),
  refreshed engine + decision tables, expanded "further reading" with the May 2026
  evaluations.
- "Does QJL help?", "Which layers to skip?", "How to choose K vs V bits?"). KIVI / KVQuant
  / FP8 comparison table now includes throughput penalties. Speed-vs-quality table uses
  the Red Hat measurements. The "viral take" answer now reflects the May data instead of
  amplifying the April hype. Benchmarking methodology section rewritten around MRCR +
  AIME25 + the layer-skip / `*_nc` / head_dim / K-V-norm checks.
- `turboquant-plus-vllm` production fork, then the 0xSero Triton reference, then this
  repo's plugin. SGLang section shows the three competing PRs and that none are merged.
  llama.cpp split into spiritbuun (head_dim=128) vs AmesianX (any head_dim). MLX section
  moved to the varjoranta Metal port as the default. Troubleshooting expanded with a
  quality-regression checklist.
- Hat numbers. Format-comparison table now includes the FP8 baseline and measured
  throughput penalties. Methodology section adds the 2026 evaluation playbook. QJL
  variance explainer rewritten to actually explain why QJL fails in practice. Layer
  sensitivity table added.
- `qjl_score_weight=0.0` is the May 2026 consensus default. New "Layer Skipping" section.
- 3-bit case.
- `.github/workflows/link-check.yml`: excludes a few aggressively bot-blocking domains
  (`pub.towardsai.net`, `teqvolt.com`, `ai-intensify.com`, `allenkuo.medium.com`)
  so CI doesn't fail on legitimate but anti-bot links.
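Two of the recommendations above (skip the first and last 2 layers, profile K/V norms) are simple enough to sketch. The helpers below are hypothetical and are not taken from this repo or from vLLM; they assume per-layer K and V activations are available as arrays of shape `(tokens, head_dim)`. The PR does not state a threshold for the K/V norm ratio, so the sketch only computes it.

```python
import numpy as np

def layers_to_quantize(num_layers, skip_first=2, skip_last=2):
    """Indices of layers left for quantization after skipping both ends."""
    if skip_first + skip_last >= num_layers:
        return []  # model too shallow: nothing left to quantize
    return list(range(skip_first, num_layers - skip_last))

def kv_norm_ratio(keys, values, eps=1e-12):
    """Ratio of the mean per-token L2 norm of K to that of V for one layer."""
    k_norm = np.linalg.norm(keys, axis=-1).mean()
    v_norm = np.linalg.norm(values, axis=-1).mean()
    return float(k_norm / (v_norm + eps))

# Example with synthetic activations for a hypothetical 32-layer model.
rng = np.random.default_rng(0)
selected = layers_to_quantize(32)
print(f"quantizing layers {selected[0]}..{selected[-1]}")
for layer in selected[:3]:
    k = rng.normal(0.0, 3.0, size=(128, 64))  # stand-in for real K activations
    v = rng.normal(0.0, 1.0, size=(128, 64))  # stand-in for real V activations
    print(layer, round(kv_norm_ratio(k, v), 2))
```

On a real model, the arrays would come from a calibration pass, and the ratio would be compared across layers rather than read as an absolute number.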
Review & Testing Checklist for Human
Risk level: yellow (docs only, no code changes — but it's a watched repo and the
framing matters).
- README.md. Make sure the tone reads as "respectful nuance" rather than "we wuz wrong";
  the goal is to surface the May 2026 community findings without trashing the project.
- varjoranta, AmesianX, `turboquant-plus-vllm` on PyPI) to make sure I'm linking
  to the canonical URL for each.
- false positives. If it complains, let me know and I'll triage.
- confirming.
- `LAUNCH.md` as-is with the archive notice, rewrite it for the May 2026 framing, or
  delete it entirely. I left it as an annotated archive on the assumption that the
  launch-week framing is historically useful.
Notes
- only annotated / reframed where the new data demands it. The paper's results and the
  reference implementation's correctness are preserved; the messaging is recalibrated.
- post. I've cited the source on every table.
- `--kv-cache-dtype turboquant_*` names (`k8v4`, `4bit_nc`, `k3v4_nc`, `3bit_nc`)
  match the upstream vLLM merge.
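As a rough sketch of the "FP8 first, then `turboquant_4bit_nc`" comparison, the snippet below only builds the two `vllm serve` command lines and does not launch anything. The model id is a placeholder, the helper name is hypothetical, and the `turboquant_4bit_nc` value is taken from this PR description; whether a given vLLM install accepts it should be checked against that version's `--kv-cache-dtype` choices.

```python
import shlex

def vllm_serve_command(model, kv_cache_dtype="fp8"):
    """Build (but do not run) a `vllm serve` command for a given KV-cache dtype."""
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype]

MODEL = "your-org/your-model"  # placeholder, not a real checkpoint

# Baseline first, then the TurboQuant variant named in this PR.
baseline = vllm_serve_command(MODEL)
candidate = vllm_serve_command(MODEL, "turboquant_4bit_nc")

for cmd in (baseline, candidate):
    print(shlex.join(cmd))
```

Running both configurations against the same evaluation set keeps the comparison with the FP8 baseline explicit.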
Link to Devin session: https://app.devin.ai/sessions/8e502cb970d94108bd312b91d2f19162
Requested by: @OnlyTerp