[Klaud Cold] Add KLAUD_DEBUG.md operational playbook + AGENTS.md reference by functionstackx · Pull Request #1464 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T01:55:30Z

Summary

Adds KLAUD_DEBUG.md, a running playbook of recurring failure modes the Klaud-Cold image-bump cron PRs hit, with diagnoses, applied fixes, and gh/CLI/cluster gotchas accumulated across many debugging sessions. Adds a one-line pointer near the top of AGENTS.md so future agents know to read it first.

Sections covered

PR setup-stage failures (perf-changelog.yaml deletion-not-allowed)
Bench-client tokenizer crash on sglang v0.5.12 (LlamaTokenizer.all_special_tokens_extended removed)
vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM (HSA_STATUS_ERROR_OUT_OF_RESOURCES / CUDA out of memory)
Custom DSV4 image → generic v0.5.12 weight-footprint OOM
Upstream sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128+MTP, flash_attn sm_120)
Cluster infra (drained nodes, docker-perm nodes, /nvme_home disk-full, port collisions, drain watchdog pattern)
Docker tag gotchas (e.g. no clean release tag for lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x)
CI rerun mechanics (gh run rerun --failed requires completed run)
gh CLI gotchas (silent gh pr edit failures, statusCheckRollup truncation, CR/LF in --jq output)
PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
Useful slash commands in .claude/commands/
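
Three of the gh items above (rerun requiring a completed run, CR/LF in `--jq` output, silent `gh pr edit` failures) can be guarded against with small checks. The following is a minimal sketch, not code from KLAUD_DEBUG.md: the helper names are hypothetical, and reading the labels back after an edit is an assumed mitigation rather than the playbook's documented fix. It relies only on `gh run view --json status,conclusion`, `gh run rerun --failed`, and the JSON shape of `gh pr view --json labels`.

```python
import json
import subprocess


def clean_cli_output(raw: str) -> str:
    """Drop stray CR characters and surrounding whitespace from CLI output."""
    return raw.replace("\r", "").strip()


def can_rerun_failed(run_json: str) -> bool:
    """True only if the run has completed; `gh run rerun --failed` needs that."""
    run = json.loads(run_json)
    return run.get("status") == "completed"


def label_applied(pr_json: str, label: str) -> bool:
    """Check `gh pr view --json labels` output for a label by name.

    Assumed mitigation: read the PR back after `gh pr edit` instead of
    trusting its exit status.
    """
    labels = json.loads(pr_json).get("labels", [])
    return any(entry.get("name") == label for entry in labels)


def rerun_failed_jobs(run_id: str) -> bool:
    """Hypothetical wrapper: rerun failed jobs only once the run is complete."""
    view = subprocess.run(
        ["gh", "run", "view", run_id, "--json", "status,conclusion"],
        capture_output=True, text=True, check=True,
    )
    if not can_rerun_failed(view.stdout):
        return False  # still queued or in progress; try again later
    subprocess.run(["gh", "run", "rerun", run_id, "--failed"], check=True)
    return True
```

The pure helpers take the raw JSON text so they can be exercised without network access; only `rerun_failed_jobs` shells out to `gh`.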

Why this is worth a doc

Each of these failure modes has bitten us at least twice in the last few sessions; without a written record, every new debugging pass starts by re-grepping the same logs. Adding the doc + the AGENTS.md pointer should cut the "what does this error mean" loop dramatically.

Test plan

KLAUD_DEBUG.md rendered locally; cross-references resolve.
AGENTS.md pointer placed before the first section heading.
Reviewer skim: anything wrong / missing / over-stated?

🤖 Generated with Claude Code

Captures recurring failure modes and fixes accumulated across many Klaud-Cold image-bump PR debugging sessions:

1. perf-changelog deletion-not-allowed (stale rebase)
2. bench-client LlamaTokenizer crash on sglang v0.5.12 images
3. vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM
4. DSV4 custom-image -> generic v0.5.12 weight footprint OOM
5. sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128, flash_attn sm_120)
6. Cluster infra (drained nodes, docker-perm nodes, /nvme_home disk-full, port collision, drain watchdog pattern)
7. Docker tag gotchas (no clean release for mi355x sglang-rocm)
8. CI rerun mechanics (gh run rerun --failed only on completed)
9. gh CLI gotchas (silent gh pr edit failures, rollup truncation, CR/LF in --jq output)
10. PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
11. Useful slash commands

AGENTS.md now points new agents to KLAUD_DEBUG.md near the top so they don't re-learn the playbook from logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

LGTM — docs-only addition.

Overview

This PR adds a new KLAUD_DEBUG.md operational playbook (230 lines of markdown documenting recurring failure modes for image-bump cron PRs: tokenizer crashes, vLLM CUDA-graph OOMs, B300 sglang regressions, cluster infra issues, gh-CLI gotchas, PR conventions) and a single-line pointer added near the top of AGENTS.md directing future agents to read it first.

Security risks

None. The change touches only markdown documentation files. No code paths, build configuration, CI workflows, secrets, permissions, or runtime behavior are affected. The new file documents commands and patterns but does not execute anything.

Level of scrutiny

Minimal scrutiny is appropriate. This is a pure docs change with zero runtime impact — the worst possible outcome of incorrect content is that future debugging sessions follow slightly stale advice, which is a self-correcting problem since the doc explicitly invites updates ("When you fix something not yet listed, add it here").

Other factors

The bug hunting system found no issues. The content is well-organized, cross-references real files (benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh:65, utils/bench_serving/benchmark_serving.py:48-51, etc.) and PRs (#1395, #1403, #1420–1422, #1460–1462), and reads as accumulated operational knowledge rather than speculative content. No CODEOWNER-sensitive paths are touched.

Section 2 was overly specific to a single transient transformers/vllm mismatch and won't recur on the same path; the rest of the playbook covers patterns that are still actively useful. Renumber remaining sections and update the AGENTS.md pointer accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx requested a review from a team May 18, 2026 01:55

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

claude Bot reviewed May 18, 2026

functionstackx merged commit 88dab56 into main May 18, 2026
3 checks passed

functionstackx deleted the add-klaud-debug-doc branch May 18, 2026 02:00

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

functionstackx merged 2 commits into main from add-klaud-debug-doc
