docs: May 2026 refresh — Red Hat eval, FP8 nuance, community ports, drop QJL by default#6

Merged
OnlyTerp merged 2 commits into master from devin/1779701080-turboquant-may2026-refresh on May 25, 2026

@OnlyTerp

Owner

Summary

Full documentation refresh that brings the repo up to date with the May 2026 ecosystem.
Touches README.md, LANDSCAPE_2026.md, FAQ.md, INTEGRATIONS.md, BENCHMARKS.md,
IMPLEMENTATION_NOTES.md, LAUNCH.md, and the markdown link-check workflow. No code
changes.

Why now

The April 2026 docs framed TurboQuant as the new SOTA "use it everywhere" KV-cache
compressor. Since then:

  1. Red Hat AI / vLLM published the first comprehensive third-party evaluation (May 11). Conclusions:
    FP8 KV is the better default on Hopper/Blackwell, *_nc (no-QJL) variants strictly
    Pareto-dominate the paper's QJL-augmented variants, and 3bit_nc drops 15–25 pts on
    reasoning at long context.
  2. Community ports converged on dropping QJL. tonbistudio V3 PyTorch, the scos-lab
    8-model benchmark, and 0xSero Triton all independently find that the 1-bit QJL
    residual hurts at b ≤ 3 because softmax amplifies its variance.
  3. The serving stack consolidated. vLLM merged --kv-cache-dtype turboquant_*
    upstream (0.20.2+); a production fork (turboquant-plus-vllm) ships CUDA dequant
    kernels with a 10.1× decode speedup on Qwen3.6-35B-A3B; llama.cpp has two forks for
    different head_dim regimes (spiritbuun: head_dim = 128 only; AmesianX: any head_dim).
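
For intuition on point 2, here is a toy sketch of how zero-mean noise on attention logits still distorts the post-softmax weights. It is not QJL: Gaussian noise merely stands in for an estimator's variance, and `mean_attention_error`, `num_keys`, `trials`, and the sigma values are illustrative assumptions, not anything taken from the cited ports.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_attention_error(sigma, num_keys=64, trials=500, seed=0):
    """Average L1 distance between clean and noisy attention weights.

    Zero-mean Gaussian noise with standard deviation `sigma` is added to the
    logits. This is a stand-in for score-estimator variance, NOT QJL itself.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        clean_logits = [rng.gauss(0.0, 1.0) for _ in range(num_keys)]
        noisy_logits = [x + rng.gauss(0.0, sigma) for x in clean_logits]
        clean = softmax(clean_logits)
        noisy = softmax(noisy_logits)
        total += sum(abs(a - b) for a, b in zip(clean, noisy))
    return total / trials


if __name__ == "__main__":
    for sigma in (0.0, 0.1, 0.5, 1.0):
        err = mean_attention_error(sigma)
        print(f"sigma={sigma:.1f}  mean L1 error={err:.4f}")
```

The error is zero without noise and should grow with sigma, which is the qualitative effect the ports describe. The magnitudes in this toy say nothing about real models.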

What changes

  • README.md
    • New "Reality check (May 2026)" table with 6 findings drawn from the Red Hat eval +
      independent ports (FP8 default, QJL hurts, 3-bit limits, layer skipping, K/V norm
      ratio, layer-0 outliers).
    • New "Where TurboQuant still wins" table calling out the 4 scenarios it actually
      beats FP8 on.
    • New "Tips and tricks" section: 10 concrete recommendations distilled from the
      May findings (try FP8 first, prefer 4bit_nc, disable QJL, skip first/last 2
      layers, profile K/V norms, dynamic outlier thresholds, native-packed checkpoints,
      head_dim caveats for llama.cpp, stack with token selection for long reasoning,
      don't trust headline numbers without verification).
    • "What's new — May 2026" replaces the April 72-hour snapshot.
    • Quick-start, integrations table, decision guide, FAQ links, and limitations updated
      to recommend FP8 first and turboquant_4bit_nc (not 3-bit / not QJL-on).
    • Community-port table covers 0xSero, tonbistudio, varjoranta, AmesianX, spiritbuun,
      scos-lab.
  • LANDSCAPE_2026.md — top-level "big May 2026 update" summary, FP8 added to TL;DR
    as the new default, third-party reproduction tables (MRCR-8/16, AIME25, throughput),
    refreshed engine + decision tables, expanded "further reading" with the May 2026
    evaluations.
  • FAQ.md — 5 new top questions ("Should I use FP8 or TurboQuant?", "Which variant?",
    "Does QJL help?", "Which layers to skip?", "How to choose K vs V bits?"). KIVI / KVQuant
    / FP8 comparison table now includes throughput penalties. Speed-vs-quality table uses
    the Red Hat measurements. The "viral take" answer now reflects the May data instead of
    amplifying the April hype. Benchmarking methodology section rewritten around MRCR +
    AIME25 + the layer-skip / *_nc / head_dim / K-V-norm checks.
  • INTEGRATIONS.md — vLLM section reorganized: upstream-merged dtypes first, then the
    turboquant-plus-vllm production fork, then the 0xSero Triton reference, then this
    repo's plugin. SGLang section shows the three competing PRs and that none are merged.
    llama.cpp split into spiritbuun (head_dim=128) vs AmesianX (any head_dim). MLX section
    moved to the varjoranta Metal port as the default. Troubleshooting expanded with a
    quality-regression checklist.
  • BENCHMARKS.md — new "Independent reproductions (May 2026)" section with the Red
    Hat numbers. Format-comparison table now includes the FP8 baseline and measured
    throughput penalties. Methodology section adds the 2026 evaluation playbook. QJL
    variance explainer rewritten to actually explain why QJL fails in practice. Layer
    sensitivity table added.
  • IMPLEMENTATION_NOTES.md — "QJL Score Weight" section now explains why
    qjl_score_weight=0.0 is the May 2026 consensus default. New "Layer Skipping" section.
  • LAUNCH.md — archive notice flagging that the April marketing hooks overstate the
    3-bit case.
  • .github/workflows/link-check.yml — excludes a few aggressively bot-blocking
    domains (pub.towardsai.net, teqvolt.com, ai-intensify.com, allenkuo.medium.com)
    so CI doesn't fail on legitimate but anti-bot links.
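
The "skip first/last 2 layers" and `qjl_score_weight=0.0` recommendations above can be sketched as a small planning helper. This is a minimal illustration, assuming layers are indexed from 0. `layers_to_quantize`, `build_plan`, and the `quantized_layers` key are hypothetical names, not this repo's API; only `qjl_score_weight` comes from IMPLEMENTATION_NOTES.md.

```python
def layers_to_quantize(num_layers, skip_first=2, skip_last=2):
    """Return the layer indices left after skipping the first and last few.

    Hypothetical helper: skipped layers would keep their KV cache in the
    original precision, the rest would be compressed.
    """
    if num_layers < 0 or skip_first < 0 or skip_last < 0:
        raise ValueError("layer counts must be non-negative")
    if skip_first + skip_last >= num_layers:
        return []  # model too shallow: nothing left to quantize
    return list(range(skip_first, num_layers - skip_last))


def build_plan(num_layers):
    """Bundle the two defaults discussed above into one dict (names assumed)."""
    return {
        "qjl_score_weight": 0.0,  # disables the QJL contribution to scores
        "quantized_layers": layers_to_quantize(num_layers),
    }


if __name__ == "__main__":
    plan = build_plan(32)
    print(plan["quantized_layers"][0], plan["quantized_layers"][-1])
```

For a 32-layer model this leaves layers 2 through 29 quantized, with layers 0, 1, 30, and 31 skipped.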

Review & Testing Checklist for Human

Risk level: yellow (docs only, no code changes — but it's a watched repo and the
framing matters).

  • Skim the new "Reality check" and "Tips and tricks" sections in README.md. Make
    sure the tone reads as "respectful nuance" rather than "we were wrong": the goal is
    to surface the May 2026 community findings without trashing the project.
  • Spot-check 4–5 of the new external links (Red Hat blog, tonbistudio, scos-lab,
    varjoranta, AmesianX, turboquant-plus-vllm on PyPI) to make sure I'm linking
    to the canonical URL for each.
  • Watch the Link Check CI job — I added a few exclusions but lychee can still flag
    false positives. If it complains, let me know and I'll triage.
  • Watch the Tests CI job — should be untouched since no code changed, but worth
    confirming.
  • Decide whether to keep LAUNCH.md as-is with the archive notice, rewrite it for
    the May 2026 framing, or delete it entirely. I left it as an annotated archive on
    the assumption that the launch-week framing is historically useful.

Notes

  • Followed the user's instruction to "treat it with respect": no prior work was removed,
    only annotated or reframed where the new data demands it. The paper's results and the
    reference implementation's correctness are preserved; the messaging is recalibrated.
  • All numbers in the new "Independent reproductions" tables come from the Red Hat blog
    post. I've cited the source on every table.
  • The --kv-cache-dtype turboquant_* names (k8v4, 4bit_nc, k3v4_nc, 3bit_nc)
    match the upstream vLLM merge.
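
As a sketch of how the dtype names above would be passed to the server, the helper below assembles a `vllm serve` command line and rejects names not listed in this PR. The `turboquant_*` values are taken from the note above and have not been verified here against a vLLM release. `build_serve_command` and the model name `my-org/my-model` are hypothetical.

```python
import shlex

# Names as listed in this PR; assumed to carry the "turboquant_" prefix.
TURBOQUANT_DTYPES = (
    "turboquant_k8v4",
    "turboquant_4bit_nc",
    "turboquant_k3v4_nc",
    "turboquant_3bit_nc",
)
ALLOWED_DTYPES = ("fp8",) + TURBOQUANT_DTYPES


def build_serve_command(model, kv_cache_dtype="fp8"):
    """Return an argv list for `vllm serve` with the chosen KV-cache dtype.

    FP8 is the default here, mirroring the "try FP8 first" recommendation.
    """
    if kv_cache_dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unknown kv-cache dtype: {kv_cache_dtype!r}")
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype]


if __name__ == "__main__":
    # Placeholder model name; substitute a real checkpoint.
    for dtype in ("fp8", "turboquant_4bit_nc"):
        print(shlex.join(build_serve_command("my-org/my-model", dtype)))
```

Building the argv as a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues.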

Link to Devin session: https://app.devin.ai/sessions/8e502cb970d94108bd312b91d2f19162
Requested by: @OnlyTerp

…rop QJL by default

- Add 'Reality check (May 2026)' table to README citing the Red Hat AI / vLLM
  evaluation: FP8 KV is the default, *_nc variants drop QJL, skip first/last
  2 layers, K/V norm ratio predicts quality.
- New 'Tips and tricks' section in README distilling the May 2026 findings
  into 10 concrete recommendations.
- README quick-start, integrations table, decision guide and limitations now
  reflect upstream vLLM merge, turboquant-plus-vllm, 0xSero Triton, varjoranta
  CUDA + MLX kernels, AmesianX llama.cpp (head_dim 256+), tonbistudio V3,
  spiritbuun llama.cpp, scos-lab benchmarks.
- LANDSCAPE_2026.md: 'big May 2026 update' summary, TL;DR rewrites FP8 as the
  baseline, full third-party reproduction tables (MRCR, AIME25, throughput),
  refreshed engine + decision tables, expanded 'further reading' with the May
  evaluations.
- FAQ.md: new questions ('Should I use FP8 or TurboQuant?', 'Which variant?',
  'Does QJL help?', 'Which layers to skip?', 'How to choose K/V bits?');
  updated KIVI/KVQuant/FP8 table; honest accuracy-vs-speed numbers; explainer
  hype check.
- INTEGRATIONS.md: vLLM section reorganized around upstream-merged dtypes +
  production fork; SGLang shows the three competing PRs; llama.cpp split into
  spiritbuun (head_dim=128) vs AmesianX (any head_dim); MLX section moved to
  varjoranta Metal port; troubleshooting expanded with quality-regression
  checklist.
- BENCHMARKS.md: Independent reproductions section with Red Hat numbers;
  format-comparison table now includes FP8 baseline + measured throughput
  penalties; methodology section updated with 2026 best practices; QJL
  variance explainer; layer sensitivity table.
- IMPLEMENTATION_NOTES.md: QJL Score Weight section now explains why to
  default qjl_score_weight=0.0; new Layer Skipping section.
- LAUNCH.md: archive notice flagging the April hype framing.
- link-check workflow: exclude a few aggressively bot-blocking domains
  (towardsai, teqvolt, ai-intensify, allenkuo.medium).

Co-Authored-By: Rob <onerobby@gmail.com>
@devin-ai-integration

Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


- varjoranta/turboquant-vllm/tree/main/mlx → repo root (the /mlx subpath
  doesn't exist; the Metal kernels are mentioned in the README of the
  parent repo).
- openai/simple-evals/blob/main/mrcr_eval.py → repo root (the file path
  has moved upstream).

Co-Authored-By: Rob <onerobby@gmail.com>
@OnlyTerp merged commit 1891cdd into master on May 25, 2026
6 checks passed