[Klaud Cold] Add KLAUD_DEBUG.md operational playbook + AGENTS.md reference #1464
Captures recurring failure modes and fixes accumulated across many
Klaud-Cold image-bump PR debugging sessions:
1. perf-changelog deletion-not-allowed (stale rebase)
2. bench-client LlamaTokenizer crash on sglang v0.5.12 images
3. vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM
4. DSV4 custom-image -> generic v0.5.12 weight footprint OOM
5. sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128,
flash_attn sm_120)
6. Cluster infra (drained nodes, docker-perm nodes, /nvme_home
disk-full, port collision, drain watchdog pattern)
7. Docker tag gotchas (no clean release for mi355x sglang-rocm)
8. CI rerun mechanics (gh run rerun --failed only on completed)
9. gh CLI gotchas (silent gh pr edit failures, rollup truncation,
CR/LF in --jq output)
10. PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
11. Useful slash commands
AGENTS.md now points new agents to KLAUD_DEBUG.md near the top so
they don't re-learn the playbook from logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
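The CI rerun item in the list above (gh run rerun --failed only works on completed runs) can be sketched as a small guard. This is a minimal sketch, not code from KLAUD_DEBUG.md: the helper names can_rerun_failed and rerun_failed_jobs are hypothetical, and it assumes the gh CLI is installed and authenticated. The only gh calls used are gh run view --json and gh run rerun --failed.

```python
import json
import subprocess


def can_rerun_failed(run_json: str) -> bool:
    """Return True only when the run has finished.

    `gh run rerun --failed` is only accepted for completed runs, so any
    other status (queued, in_progress, ...) means "wait and retry later".
    """
    status = json.loads(run_json).get("status", "")
    return status == "completed"


def rerun_failed_jobs(run_id: str) -> bool:
    """Hypothetical helper: rerun failed jobs if the run is completed."""
    view = subprocess.run(
        ["gh", "run", "view", run_id, "--json", "status,conclusion"],
        capture_output=True, text=True, check=True,
    )
    if not can_rerun_failed(view.stdout):
        return False  # run still queued or in progress
    subprocess.run(["gh", "run", "rerun", run_id, "--failed"], check=True)
    return True
```

Keeping the status check in a pure function means it can be exercised without network access; only rerun_failed_jobs touches the gh CLI.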
LGTM — docs-only addition.
Overview
This PR adds a new KLAUD_DEBUG.md operational playbook (230 lines of markdown documenting recurring failure modes for image-bump cron PRs: tokenizer crashes, vLLM CUDA-graph OOMs, B300 sglang regressions, cluster infra issues, gh-CLI gotchas, PR conventions) and a single-line pointer added near the top of AGENTS.md directing future agents to read it first.
Security risks
None. The change touches only markdown documentation files. No code paths, build configuration, CI workflows, secrets, permissions, or runtime behavior are affected. The new file documents commands and patterns but does not execute anything.
Level of scrutiny
Minimal scrutiny is appropriate. This is a pure docs change with zero runtime impact — the worst possible outcome of incorrect content is that future debugging sessions follow slightly stale advice, which is a self-correcting problem since the doc explicitly invites updates ("When you fix something not yet listed, add it here").
Other factors
The bug hunting system found no issues. The content is well-organized, cross-references real files (benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh:65, utils/bench_serving/benchmark_serving.py:48-51, etc.) and PRs (#1395, #1403, #1420–1422, #1460–1462), and reads as accumulated operational knowledge rather than speculative content. No CODEOWNER-sensitive paths are touched.
Section 2 was overly specific to a single transient transformers/vllm mismatch and won't recur on the same path; the rest of the playbook covers patterns that are still actively useful. Renumber remaining sections and update the AGENTS.md pointer accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds KLAUD_DEBUG.md, a running playbook of recurring failure modes the Klaud-Cold image-bump cron PRs hit, with diagnoses, applied fixes, and gh/CLI/cluster gotchas accumulated across many debugging sessions. Adds a one-line pointer near the top of AGENTS.md so future agents know to read it first.
Sections covered
… perf-changelog.yaml deletion-not-allowed)
… LlamaTokenizer.all_special_tokens_extended removed)
… HSA_STATUS_ERROR_OUT_OF_RESOURCES/CUDA out of memory)
… /nvme_home disk-full, port collisions, drain watchdog pattern)
… lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x)
… gh run rerun --failed requires completed run)
gh CLI gotchas (silent gh pr edit failures, statusCheckRollup truncation, CR/LF in --jq output)
… [Klaud Cold] prefix, full-sweep-enabled label)
… .claude/commands/
Why this is worth a doc
Each of these failure modes has bitten us at least twice in the last few sessions; without a written record, every new debugging pass starts by re-grepping the same logs. Adding the doc + the AGENTS.md pointer should cut the "what does this error mean" loop dramatically.
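As an illustration of two of the gh CLI gotchas named under Sections covered (CR/LF in --jq output and silent gh pr edit failures), one defensive pattern is to normalise the output and read the value back after every edit. This is a hedged sketch under stated assumptions, not the playbook's own fix: clean_jq_lines and edit_title_verified are hypothetical names, and it assumes gh is installed and authenticated.

```python
import subprocess


def clean_jq_lines(raw: str) -> list[str]:
    """Split gh --jq output into lines, dropping CR/LF and blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def edit_title_verified(pr: str, title: str) -> bool:
    """Hypothetical helper: edit a PR title, then confirm it took effect.

    The exit code of `gh pr edit` is deliberately not trusted; the title
    is read back with `gh pr view` and compared instead.
    """
    subprocess.run(["gh", "pr", "edit", pr, "--title", title], check=False)
    view = subprocess.run(
        ["gh", "pr", "view", pr, "--json", "title", "--jq", ".title"],
        capture_output=True, text=True, check=True,
    )
    lines = clean_jq_lines(view.stdout)
    return bool(lines) and lines[0] == title
```

Reading the value back costs one extra API call per edit, which seems a reasonable trade for catching an edit that did not apply.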
Test plan
KLAUD_DEBUG.md rendered locally; cross-references resolve.
AGENTS.md pointer placed before the first section heading.
🤖 Generated with Claude Code