[Klaud Cold] Add KLAUD_DEBUG.md operational playbook + AGENTS.md reference #1464

Merged
functionstackx merged 2 commits into main from add-klaud-debug-doc
May 18, 2026
@functionstackx
Collaborator

Summary

Adds KLAUD_DEBUG.md, a running playbook of the recurring failure modes that the Klaud-Cold image-bump cron PRs hit, with diagnoses, applied fixes, and gh/CLI/cluster gotchas accumulated across many debugging sessions. Also adds a one-line pointer near the top of AGENTS.md so future agents know to read it first.

Sections covered

  1. PR setup-stage failures (perf-changelog.yaml deletion-not-allowed)
  2. Bench-client tokenizer crash on sglang v0.5.12 (LlamaTokenizer.all_special_tokens_extended removed)
  3. vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM (HSA_STATUS_ERROR_OUT_OF_RESOURCES / CUDA out of memory)
  4. Custom DSV4 image → generic v0.5.12 weight-footprint OOM
  5. Upstream sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128+MTP, flash_attn sm_120)
  6. Cluster infra (drained nodes, docker-perm nodes, /nvme_home disk-full, port collisions, drain watchdog pattern)
  7. Docker tag gotchas (e.g. no clean release tag for lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x)
  8. CI rerun mechanics (gh run rerun --failed requires completed run)
  9. gh CLI gotchas (silent gh pr edit failures, statusCheckRollup truncation, CR/LF in --jq output)
  10. PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
  11. Useful slash commands in .claude/commands/
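
As a rough illustration of items 8 and 9, the sketch below shows how a helper script might guard against both gotchas. It is not taken from KLAUD_DEBUG.md: the function names are hypothetical, and the `status` and `conclusion` field names are an assumption about the JSON that `gh run view --json status,conclusion` returns.

```python
import json


def clean_jq_output(raw: str) -> list[str]:
    """Split captured CLI output into lines, dropping CR/LF residue and blanks.

    str.splitlines() treats "\r\n", "\r", and "\n" as line boundaries, so a
    stray carriage return never ends up inside a PR number or run ID.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


def can_rerun_failed(run_view_json: str) -> bool:
    """Return True only for a run that has completed without succeeding.

    Assumption: the payload has "status" and "conclusion" keys, as a call like
    `gh run view <id> --json status,conclusion` is expected to produce.
    """
    run = json.loads(run_view_json)
    if run.get("status") != "completed":
        # Still queued or in progress: rerunning failed jobs is not possible yet.
        return False
    return run.get("conclusion") not in (None, "", "success")
```

A caller would check `can_rerun_failed(...)` before invoking `gh run rerun <id> --failed`, and pass any `--jq` output through `clean_jq_output(...)` before using the values as identifiers.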

Why this is worth a doc

Each of these failure modes has bitten us at least twice in the last few sessions; without a written record, every new debugging pass starts by re-grepping the same logs. Adding the doc + the AGENTS.md pointer should cut the "what does this error mean" loop dramatically.

Test plan

  • KLAUD_DEBUG.md rendered locally; cross-references resolve.
  • AGENTS.md pointer placed before the first section heading.
  • Reviewer skim: anything wrong / missing / over-stated?

🤖 Generated with Claude Code

Captures recurring failure modes and fixes accumulated across many
Klaud-Cold image-bump PR debugging sessions:

  1. perf-changelog deletion-not-allowed (stale rebase)
  2. bench-client LlamaTokenizer crash on sglang v0.5.12 images
  3. vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM
  4. DSV4 custom-image -> generic v0.5.12 weight footprint OOM
  5. sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128,
     flash_attn sm_120)
  6. Cluster infra (drained nodes, docker-perm nodes, /nvme_home
     disk-full, port collision, drain watchdog pattern)
  7. Docker tag gotchas (no clean release for mi355x sglang-rocm)
  8. CI rerun mechanics (gh run rerun --failed only on completed)
  9. gh CLI gotchas (silent gh pr edit failures, rollup truncation,
     CR/LF in --jq output)
  10. PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
  11. Useful slash commands

AGENTS.md now points new agents to KLAUD_DEBUG.md near the top so
they don't re-learn the playbook from logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@claude claude Bot left a comment

LGTM — docs-only addition.

Extended reasoning...

Overview

This PR adds a new KLAUD_DEBUG.md operational playbook (230 lines of markdown documenting recurring failure modes for image-bump cron PRs: tokenizer crashes, vLLM CUDA-graph OOMs, B300 sglang regressions, cluster infra issues, gh-CLI gotchas, PR conventions) and a single-line pointer added near the top of AGENTS.md directing future agents to read it first.

Security risks

None. The change touches only markdown documentation files. No code paths, build configuration, CI workflows, secrets, permissions, or runtime behavior are affected. The new file documents commands and patterns but does not execute anything.

Level of scrutiny

Minimal scrutiny is appropriate. This is a pure docs change with zero runtime impact — the worst possible outcome of incorrect content is that future debugging sessions follow slightly stale advice, which is a self-correcting problem since the doc explicitly invites updates ("When you fix something not yet listed, add it here").

Other factors

The bug hunting system found no issues. The content is well-organized, cross-references real files (benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh:65, utils/bench_serving/benchmark_serving.py:48-51, etc.) and PRs (#1395, #1403, #1420–1422, #1460–1462), and reads as accumulated operational knowledge rather than speculative content. No CODEOWNER-sensitive paths are touched.

Section 2 was overly specific to a single transient transformers/vllm
mismatch and won't recur on the same path; the rest of the playbook
covers patterns that are still actively useful. Renumber remaining
sections and update the AGENTS.md pointer accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx merged commit 88dab56 into main May 18, 2026
3 checks passed
@functionstackx functionstackx deleted the add-klaud-debug-doc branch May 18, 2026 02:00