Kernelking — CUDA Kernel Optimization Agent

Kernelking is an automated agent focused on optimizing CUDA kernels driven by ncu. It is not a general-purpose code agent; instead, it hard-codes our engineering experience with Nsight Compute (the metric → bottleneck → action reasoning path) into "expert knowledge" fed to Claude, allowing the model to perform more accurately within a relatively narrow problem space.

Full closed loop: pick a CUDA kernel from a production Rust project → extract it into a standalone nvcc-compilable harness → measure with ncu → Claude modifies the code → re-measure → if there is a real gain, integrate back into the Rust project and cargo build → real-environment ncu verification → deploy or roll back.

Why not a general-purpose LLM agent

A general-purpose agent running cargo build + ncu takes 5–10 minutes per round, making the iteration cycle too long. Faced with hundreds of lines of raw ncu output, LLMs also tend to miss the key issues. Kernelking applies several engineering optimizations to close this gap:

Layer	What it does	Why it matters
Standalone harness	Strips the kernel from the Rust project and generates a `main()` + nvcc-directly-runnable `.cu`	nvcc single-file compile ~5s vs `cargo build` ~3min. Iteration cycle drops from minutes to seconds
cudaEvent authoritative timing	The harness uses `cudaEventRecord` for self-timing: 3 warmup + 3 measured runs, taking the median. ncu is used only for metrics; its timing is discarded	ncu's PMU setup + replay distorts timing by 2–5%; cudaEvent has near-zero overhead and the standalone finishes in a few seconds
Real scalar observation	`fprintf`/`eprintln` real scalar arguments in the host wrapper and Rust adapter, run one real production workload to capture the actual `(grid, block, num_rows, num_cols, ...)` for every launch	Scalars determine per-thread work budget. Getting them wrong makes the harness run a different workload. This step ensures the harness uses the actual parameters from production
Representative config sampling (B+)	Groups real-env launches by `(grid_x, block_x)`, splits into fast/slow buckets by p95/p25 variance, picks top-K configs covering 90%+ of total runtime	Avoids running every launch through the harness; picks a few representative configs that cover 80–90% of total runtime
Index-aligned fast/slow	Aligns the Nth observation launch with the Nth ncu launch by position; key extraction supports both 1D (`grid`/`block`) and dim3 (`grid_x`/`block_x`) schemas	The fast-bucket and slow-bucket scalars are genuinely different (e.g. `num_cols=13 vs 127`), no longer sharing a median; 2D-launch kernels can also use the aligned path
Wrapper-derived block geometry	Parses `(bx, by, bz)` from `dim3 dimBlock(...)` / `kernel<<<grid, block>>>` in the host wrapper → forces the prompt to tell Claude the correct block shape; post-generation, scans `struct HarnessConfig` and all launch sites for threads/block validation, retrying with feedback if wrong (up to 3 times)	Historically Claude occasionally writes `dim3(32,8)` as scalar `32` → `threadIdx.y` is always 0 → only 1/8 of the work is done → shows 0% gain. Post-validation + retry catches these silent defects
Skill-grouped ncu metrics	Groups dozens of metrics by `roofline / memory / warp_stall / occupancy / instruction`, feeds Claude the one-line summary `memory_bound / compute_bound / latency_bound` first	Claude does not need to parse raw `ncu --page details`; it sees "memory-bound, L2 hit rate 40%" directly in the prompt
Declarative rule matrix	`RULES` list: each entry has `id + severity + condition + finding + action`, e.g. `smem_bank_conflicts → Pad smem to TILE_DIM+1`	Gives fixed `[rule_id] action` hints for known pitfalls; Claude sees the recommendation at the top of the prompt
On-demand skills + catalog	Loads the full text of matching `skills/<id>.md` files after rule hits (up to top-5); the prompt always contains a skill catalog (one-liner signal summary per skill, served from prompt cache)	Stuffing all 19 skills takes ~25k tokens; hit-based loading + catalog index keeps each round at ~7k tokens, while the model can still "recognize" other skill directions from the catalog
NVIDIA official recommendations cross-check	Parses `OPT/INF/WRN` three-level recommendations from `ncu --page details`, corroborated with local rule findings	Built-in Nsight experience + our own rule matrix as double insurance; recommendations that match on both sides get higher priority from Claude
Cross-kernel pattern library	`CUDA_OPTIMIZATION.md` aggregates successful strategies from past runs (shared-mem tiling / vectorize / coalescing, etc.), shared across kernels	The same bottleneck type does not require Claude to "reinvent" solutions each time
Per-kernel memory	`memory/<kernel>.md` records the MEMO + result (ACCEPTED / REJECTED / crash) for every iteration	The next time the same kernel is run, previously failed strategies are not repeated
Frozen-snapshot memory + Haiku compaction	Historical run memory moves into the system block (cache_read); current-run iter records accumulate in the user block. When memory exceeds a threshold, Haiku rewrites it into a `Resolved / Pending / Avoid` three-section summary `<kernel>.summary.md`	Saves ~20–25% prompt cost per iter ($0.10–0.15/run); when the same kernel is run repeatedly, early experience is not lost due to tail truncation
Handler plugins	`handlers/*.py` registers kernel-shape-specific prompt hints + post-processing (namespace detection, prose stripping, …)	Adding a new special kernel shape (template / multi-namespace / special prose) only touches the handler, not the main loop
Prompt caching + retry	Stable sections use `cache_control: ephemeral`; API fluctuations use automatic exponential backoff	Each iteration's ~7k tokens are mostly cache_read, reducing cost per round from ~$0.04 to ~$0.005; network jitter no longer interrupts the run
Threshold short-circuit + early stop	`NO_GAIN_THRESHOLD_PCT=2.0%`: skips deployment if standalone gain does not cross the threshold. `NO_PROGRESS_PATIENCE=4`: ends early after 4 consecutive no-progress rounds	Saves ~5min of pointless `cargo build + real-env ncu`; does not persist with kernels already near peak performance

Full workflow

The explanation below has two layers — the workflow diagram lets you explain "what one run does" to someone in 3 minutes; the step-by-step breakdown is for people taking over the codebase.

Workflow diagram (lifecycle of one run)

flowchart TD
    start(["python3 agent.py"]) --> pre["<b>Preflight</b><br/>load config.env · probe SSH / GPU / Claude key"]
    pre --> s1["<b>Step 1</b> fetch remote kernel source + local Rust context"]
    s1 --> s15obs{"Step 1.5<br/>refs/&lt;kernel&gt;.observed.json<br/>exists?"}
    s15obs -- "YES (cached, instant)" --> s19
    s15obs -- "NO (first time, ~5min)" --> s15run["<b>Observation sub-pipeline</b><br/>Claude inserts printf + eprintln into C++ / Rust<br/>→ SCP → cargo build → run real prover once<br/>→ capture real scalars for every launch<br/>→ restore sources → rebuild"]
    s15run --> s19

    s19["<b>Step 1.9</b> normalize remote binary<br/>rebuild if current source != last-deployed binary"] --> s2
    s2["<b>Step 2</b> Real-env baseline timing<br/>ncu --metrics gpu__time_duration.sum (~1-2min)<br/>→ bucket by (grid, block) → top-K representative configs"] --> s3

    s3["<b>Step 3</b> generate standalone harness<br/>Claude takes real-config + observed scalars + wrapper dim3<br/>→ returns self-contained .cu with main()"] --> s3check{"block-geom<br/>check ok?"}
    s3check -- "NO (re-prompt)" --> s3
    s3check -- "YES" --> s4

    s4["<b>Step 4</b> nvcc compile harness"] --> s4ok{"compile<br/>ok?"}
    s4ok -- "NO (≤8 retries)" --> s4fix["Claude reads nvcc log → patches .cu"]
    s4fix --> s4
    s4ok -- "YES" --> s4base["<b>Standalone baseline</b><br/>cudaEvent timing · ncu --set full<br/>→ bottleneck + findings + fired rules"]

    s4base --> loop
    loop["<b>Step 5 optimization iter loop</b><br/>(default budget=6, early-stop after 4 no-gain iters)"] --> lprompt["build prompt:<br/>bottleneck + findings + skills + memory + cross-kernel patterns"]
    lprompt --> lclaude["Claude → new kernel code + MEMO"]
    lclaude --> lbuild["rewrite harness → nvcc → cudaEvent + ncu"]
    lbuild --> lcmp{"improvement<br/>vs best ≥ 2%?"}
    lcmp -- "YES" --> lupd["update best · write ACCEPTED to memory"] --> lmore
    lcmp -- "NO" --> lrej["write REJECTED to memory<br/>(used as 'avoid' hint next iter)"] --> lmore
    lmore{"budget left<br/>AND not 4 no-gain in a row?"}
    lmore -- "YES" --> lprompt
    lmore -- "NO" --> gate

    gate{"<b>Short-circuit gate</b><br/>standalone best gain ≥ 2%?"}
    gate -- "NO" --> skip(["⏭ skip integration<br/>server unchanged<br/>exit 2"])
    gate -- "YES" --> s6

    s6["<b>Step 6</b> Integrate<br/>Claude strips harness → FFI-compat .cu<br/>→ SCP overwrite with .bak · cargo build"] --> s7
    s7["<b>Step 7</b> real-env optimized timing<br/>ncu timing-only on full prover run"] --> fin
    fin{"real-env delta<br/>vs baseline?"}
    fin -- "improved" --> accept(["🎉 <b>ACCEPTED</b><br/>optimization live on server<br/>exit 0"])
    fin -- "regressed" --> roll(["⏪ <b>ROLLED BACK</b><br/>restore .bak + cargo build<br/>exit 1"])

    classDef ok fill:#0d3,stroke:#093,color:#fff
    classDef bad fill:#c42,stroke:#822,color:#fff
    classDef skip fill:#888,stroke:#555,color:#fff
    class accept ok
    class roll bad
    class skip skip

3-minute pitch for this diagram:

"This agent's work can be broken down into two nested feedback loops. The inner loop (Step 5's iter loop) is a fast cycle: Claude reviews ncu metrics and proposes a new kernel version; the standalone harness runs it on the GPU for a few seconds and gets latency + new ncu metrics; the result is compared against the historical best. If it wins, the version is recorded; if it loses, a note is written to memory so the next round doesn't repeat the same mistake. The outer loop (Steps 6-7) is a slow loop: can the standalone winner actually reproduce in a real prover? So it gets integrated back into the Rust project via cargo build, then ncu measures the complete timing on a real machine prover run. Only when the real-machine time also genuinely improves is it accepted; otherwise .bak is restored. The entire process is fully automatic: fetch remote kernel, scrape scalars from the real prover, generate harness, profile, prompt Claude, retry, integrate, verify, roll back — all from a single command."

Three design points worth calling out:

Observation cache (Step 1.5) — The first time a kernel is run, 5 minutes are spent patching + running the prover + capturing scalars, with results cached in refs/<kernel>.observed.json. On subsequent runs, as long as the signature hasn't changed, this step completes instantly. This means "running the standalone at real workload scale" is not a 5-minute tax paid on every run.
Short-circuit Gate — When the standalone has not achieved ≥ 2% gain, integration + real-machine verification (which together take ~3-5 minutes) are skipped entirely. No GPU time is wasted verifying a byte-identical kernel.
Rollback is automatic — When the standalone wins but the real environment regresses (e.g. cache / launch overhead effects not reflected in the harness), the agent automatically restores using .bak + re-runs cargo build, leaving no half-applied state.

Step-by-step breakdown (read alongside the workflow diagram)

Steps explained in the order they appear in the diagram. The Artifacts / cache column lists key files written at each step, on which subsequent steps depend.

Step (node in diagram)	What it does	Artifacts / cache	Typical duration
Preflight	Loads `config.env`; probes SSH connectivity to remote, `nvidia-smi` for free GPU, and validity of `ANTHROPIC_API_KEY`	—	< 5s
Step 1 fetch remote kernel	`ssh <host> cat <REMOTE_KERNEL_CU>` pulls the target `.cu` locally, plus reads `LOCAL_CONTEXT_FILES` (Rust FFI adapter) as context for Claude	In-memory kernel source string + Rust context string	~2s
Step 1.5 Observation sub-pipeline	Claude inserts temporary `printf` / `eprintln!` into the C++ wrapper and Rust adapter to print every launch's scalar arguments (`num_rows`, `height`, `chunk_size`, `is_inject`, …); SCP to remote, `cargo build`, run the real prover once to capture all scalars, then restore sources and rebuild to a clean state	`refs/<kernel>.observed.json` — see the deep-dive section below	First time ~5min; cache hit is instant
Step 1.9 normalize binary	The remote `single-node` prover binary may be an optimized version from the last deploy, while the current source has been reverted to its original. This step runs `cargo build --release` to sync the binary to the current source, avoiding a falsely low baseline from a "last-deployed optimized version"	Remote binary aligned to current source	Source unchanged ~2s; source changed ~3min
Step 2 real-env baseline	Runs `ncu --metrics gpu__time_duration.sum` inside the real prover to capture the timing of every kernel launch. Groups by `(grid_x, block_x)` + variance to split fast/slow → picks top-K representative configs (default K=5, weighted by total duration share), and records baseline total runtime	`refs/<kernel>.timing.json` + `representative_configs[]` (each with grid/block/share/scalar)	1-2min
Step 3 generate harness	Claude generates a standalone-compilable `.cu` (with `main()`, dummy data allocation, 5 kernel launches, cudaEvent timing loop) based on `(kernel source + Rust context + top-K representative configs + real scalars per config + wrapper dim3 geometry)`	Local `/var/folders/.../tmpXXXX.cu`	~15-30s
block-geom check	Parses `<<<grid, block>>>` syntax in Claude's output, flattens `block` (`dim3(32,8)` → 256 threads) and compares against the real `dim3` from the wrapper. If mismatched, returns the error to Claude for regeneration, up to `HARNESS_BLOCK_CHECK_RETRIES` (default 2) times	—	< 1s per check
Step 4 nvcc compile	SCP harness to remote `/tmp/kk_<run>/`, trigger `nvcc -O3 -arch=... -o standalone_bin standalone.cu` via `ssh`	Remote `standalone_bin`	~5-10s
compile-fix loop	When nvcc fails, Claude receives the stderr log and patches `.cu`, then re-SCP + nvcc, up to 8 times. Only a total failure causes `die()`	—	~20-30s per attempt
Standalone baseline	Runs the freshly compiled `standalone_bin`: Phase 1 cudaEvent timing: 3 warmup + 3 measured runs, take median (pure timing, no ncu overhead, a few seconds) Phase 2 `ncu --set full` captures SM / DRAM / warp stall / occupancy / launch stats in one shot (timing discarded)	`standalone_report_iter_0.ncu-rep` + `.csv` + bottleneck label + fired rules	~40-50s (ncu --set full dominates)
Step 5 iter loop	Core optimization loop. Each round: build prompt (bottleneck + findings + current best kernel + `skills/*.md` top-5 + `memory/<kernel>.md` past runs + `CUDA_OPTIMIZATION.md` relevant tags) → Claude returns new kernel + MEMO → replace in harness → nvcc → cudaEvent + ncu → compare against current best	One `runs/.../iter_N/` per round (containing standalone.cu / diff / memo / ncu_brief / Claude response)	~1-2min per round
≥2% improvement?	≥ 2% → update best, write `ACCEPTED` to memory (basis for confidence next time a similar change is tried); < 2% → do not update best, write `REJECTED` to memory (the next prompt will see "this path was tried, do not repeat")	Appends one line to `memory/<kernel>.md`	< 1s
budget left?	6 rounds exhausted OR 4 consecutive no-progress rounds → exit loop; otherwise continue	—	—
Short-circuit gate	After exiting the loop, check standalone best gain relative to baseline. < 2% means no substantial improvement; skip Step 6/7 entirely (~3-5min of GPU time), exit 2	Appends SHORT-CIRCUIT record to `memory`	—
Step 6 Integrate	Claude strips the `__global__` out of the harness and reassembles it into the original FFI signature; SCP overwrites `REMOTE_KERNEL_CU` (saving `.bak` first), then `cargo build --release` rebuilds the entire prover	Remote `.bak` + freshly built prover binary	~3min
Step 7 real-env optimized timing	Runs `ncu --metrics gpu__time_duration.sum` with the new binary, capturing the total timing distribution of all launches in the real prover	`refs/report_optimized_real.*`	~1-2min
real-env improved?	Real-machine total runtime < baseline total runtime → ACCEPTED (optimization stays on server, exit 0); ≥ baseline → ROLLED BACK (`mv .bak → kernel.cu` + re-run cargo build + exit 1)	Appends DEPLOYED / ROLLED_BACK to `memory`	~3min (rollback requires another build)

Deep dive: what does `refs/<kernel>.observed.json` actually do

The problem: When generating the standalone harness, Claude needs to know "when this kernel is launched in the real prover, what are the exact values of its arguments". Reading only the ncu report gives you grid and block, but scalar arguments (num_rows, height, chunk_size, offset, is_inject, etc.) are completely unknown. If a small size is guessed (e.g. num_rows=1024), the standalone's measured µs diverges from the real machine by tens of times, making the optimization useless.

The solution: Step 1.5 spends 5 minutes doing "observation" — Claude inserts printf / eprintln! into the production code, like this:

// Temporarily inserted instrumentation
fprintf(stderr, "[KKLOG_CPP] kernel=oss_leaf_hash n=%u num_poly=%u offset=%u "
                "grid_x=%u grid_y=%u block_x=%u block_y=%u\n",
        n, num_poly, offset, grid.x, grid.y, block.x, block.y);

Then SCP, cargo build, run the real prover once, and the stderr stream fills with dozens or hundreds of real argument records. The agent collects and parses them into JSON, stored as refs/<kernel>.observed.json, structured roughly like this:

{
  "kernel": "oss_leaf_hash_kernel",
  "captured_at": "2026-04-19T15:59:21",
  "launches": [
    { "grid_x": 32768, "block_x": 256, "n": 32768, "num_poly": 64, "offset": 0 },
    { "grid_x": 16384, "block_x": 256, "n": 16384, "num_poly": 64, "offset": 0 },
    ...
  ]
}

Why cache it:

Observation itself is expensive (patch code → SCP → build → prover → collect → restore → build) takes ~5min
But when the kernel signature is unchanged, the scalar distribution is stable (the same proof config run repeatedly produces the same parameters)
So as long as refs/<kernel>.observed.json exists, the agent loads it directly and skips all of Step 1.5 — this is what turns "run standalone at real workload scale" from "pay 5min on every run" into "pay once, then free"

When to manually delete this cache:

The kernel signature changed (different argument list)
The launch config calculation logic in production code changed (grid/block changed)
You have doubts about the workload characteristics of this kernel and want to re-observe ground truth

After deletion, the next run automatically re-observes.

Quick start

Prerequisites

Python ≥ 3.9 + anthropic SDK
Password-free SSH access to the remote GPU server (host alias, e.g. zan1)
CUDA toolkit on the remote machine (nvcc / ncu)
A cargo build-able project on the remote (prover binary)

pip3 install -r requirements.txt

Configure `config.env`

Refer to the config.env template (do not commit, already in .gitignore):

# Claude API
ANTHROPIC_API_KEY=sk-ant-...
CLAUDE_MODEL=claude-sonnet-4-6      # or claude-haiku-4-5 for debugging

# SSH
SSH_HOST=zan1
SSH_USER=ubuntu

# Local code mirror (Rust adapter / headers, given to Claude as context)
LOCAL_CODEBASE_PATH=/Users/you/repo/brevis-vm
LOCAL_CONTEXT_FILES=vm/src/cuda_adaptor/transpose.rs

# Target kernel to optimize
REMOTE_KERNEL_CU=/home/ubuntu/ben/brevis-vm/private-pico-gpu/transpose/transpose.cu
KERNEL_NAME=transpose_kernel

# Remote paths
REMOTE_BASEDIR=/home/ubuntu/ben/brevis-vm
REMOTE_BINARY=/home/ubuntu/ben/brevis-vm/target/release/single-node
REMOTE_LIBPATH=/home/ubuntu/ben/brevis-vm:/usr/local/cuda/lib64

# Harness / iteration / sampling parameters
HARNESS_GPU_ARCH=sm_120             # RTX 5090→120, 4090→89, A100→80
HARNESS_NVCC_DEFINES=-DFEATURE_KOALA_BEAR
MAX_OPTIMIZE_ITERATIONS=6           # max rounds per kernel
NO_PROGRESS_PATIENCE=4              # stop early after N consecutive no-progress rounds
NO_GAIN_THRESHOLD_PCT=2.0           # skip deploy if standalone gain < 2%
HARNESS_TOP_K=5                     # K=3 ≈70%, K=5 ≈88%, K=7 ≈98% runtime coverage
HARNESS_VARIANCE_THRESHOLD=2.5      # p95/p25 > threshold triggers fast/slow split

Run

python3 agent.py 2>&1 | tee agent_run.log

To switch kernels, only change REMOTE_KERNEL_CU + KERNEL_NAME + LOCAL_CONTEXT_FILES in config.env. The first run for a new kernel does the full observation (~5min); subsequent runs hit the cache instantly.

Output files

Persistent (tracked or long-term useful):

Path	Contents
`refs/<kernel>.observed.json`	First-run observation cache — real scalar snapshot
`memory/<kernel>.md`	MEMO / result records for every optimization round; Claude reads this on the next run
`CUDA_OPTIMIZATION.md`	Cross-kernel success pattern library; auto-regenerated from `memory/.md` each time `agent.py` starts — do not edit manually*
`skills/*.md`	Optimization skill documents for each bottleneck type
Remote `<kernel>.cu.bak`	Baseline backup for rollback

Ephemeral (gitignored, overwritten each run):

Path	Contents
`standalone_best.cu`	Best harness code from this run
`<kernel>_integrated.cu`	FFI-compatible version stripped from harness, deployed to server
`standalone_report_*.details.txt`	ncu text reports (baseline / each round)
`report_{baseline,optimized}_real.*`	Real-machine ncu reports + workload CSV
`agent_run.log` / `build_.log`	Run logs

Post-mortem review: `runs/<kernel>/<ts>/` (one directory per run, never overwritten)

Every optimization run writes an independent directory for post-run review. iter 0 is the baseline; iter 1..N are Claude's successive attempts. Rejected attempts are fully preserved, making it easy to investigate "why didn't this direction work".

runs/oss_compress_kernel/2026-04-19T14-45-48/
├── agent_report.md                  TL;DR + iter links + mirror of memory/<kernel>.md for this run
├── best.cu                          Final standalone adopted this run
├── iter_0/                          Baseline snapshot (untouched by Claude)
│   ├── standalone.cu
│   └── ncu_brief.txt
├── iter_1/
│   ├── standalone.cu                Complete code output by Claude this iter
│   ├── response_optimize.txt        Raw Claude response (includes // MEMO: [...] + code)
│   ├── diff_vs_prev.patch           Unified diff vs best_so_far
│   ├── ncu_brief.txt                bottleneck / duration / top findings
│   ├── memo.txt                     Extracted MEMO one-liner
│   └── response_compile_fix_1.txt   Optional, only present when compile-fix was triggered this iter
├── iter_2/ ...
└── iter_N/

Storage tiers:

Tier 1 (strategic summary, always kept): memory/<kernel>.md + agent_run.log + observe_cache/. ~20 KB / run.
Tier 2 (iter snapshots, always kept): all standalone.cu + response_*.txt + diff + ncu_brief.txt in this directory. ~50 KB / 6-iter run. Covers 95% of post-mortem scenarios.
Tier 3 (full prompts, gated): prompt_<key>_user.txt + prompt_<key>_system.txt. ~+0.5 MB / run. Only written when RUNS_KEEP_TRANSCRIPT=1, for debugging "what exactly did Claude see / was it truncated by max_tokens / did memory make it into the prompt".

Safety and rollback

cp <kernel>.cu <kernel>.cu.bak is done before Rust integration
Integration failure (compile error) → auto-debug loop (up to 8 times), then cp .bak <kernel>.cu rollback + cargo build to restore
Real-env optimized timing shows regression → also rolls back
C++ / Rust files modified during observation are only temporarily changed within the observation window; the originals are automatically SCP'd back before the window closes

All SSH connections use ControlMaster multiplexing + automatic retry, and will not hang if the master connection drops.

Design trade-offs

Local brain + remote GPU. Anthropic API tokens never touch the production machine; only nvcc / ncu / cargo run remotely.

Standalone is not a perfect substitute for real-env. It is a 10–20× faster proxy; the final step always requires real-env confirmation. The per-config standalone vs real-env gap (typically 1.3–3×) comes from cold caches, the memory allocator, and interference from other kernels.

Conservative threshold. 2% accounts for cudaEvent run-to-run jitter (~0.5–1.5%); only improvements sufficiently above the noise floor are counted as valid. Manually adjustable via the NO_GAIN_THRESHOLD_PCT environment variable.

Known limitations.

For 2D-grid kernels (e.g. transpose), the bucketing stage groups by (grid_x, block_x) and grid_y is merged into the same bucket; however block_y is already forced into the harness via wrapper parsing, and num_rows/num_cols scalars are passed per-launch through schema-flexible alignment, so fast/slow differences still land on per-thread workload
Template kernels (e.g. cpu_events_to_trace_kernel<F, NumCols>) require manually providing concrete template parameters in the harness; auto-inference is not yet implemented

Module architecture

agent.py now only handles orchestration (preflight → string steps together → print comparison). Implementation is split into kernelking/ sub-packages, each with a single responsibility. Below is a file-level tree (one-line purpose per .py):

kernelking/
├── config.py                     load_config / cfg / log / banner / die / ROOT path / ANSI colors
├── costs.py                      Claude token + USD accumulation; $/Mtok table by model (wired in _claude_call)
├── rundir.py                     runs/<kernel>/<ts>/ review directory: begin_run / begin_iter /
│                                 dump_response / dump_prompt (Tier 3 gated) / refresh_agent_report
│
├── claude/
│   ├── client.py                 _claude_call single-turn API + cache_control: ephemeral + max_tokens auto-continue
│   │                             + exponential backoff retry (429/5xx/network) + _NO_ELIDE_RULE + _strip_fences
│   └── __init__.py               re-export facade
│
├── remote/
│   ├── ssh.py                    ssh_run / ssh_run_capture / scp_upload / scp_download
│   │                             (ControlMaster=auto multiplexes same TCP/SSH, ControlPersist=300)
│   ├── gpu.py                    _check_and_clean_gpu: nvidia-smi scans processes; Kernelking's own (standalone /
│   │                             single-node / ncu) are killed, external ones (VLLM etc.) abort immediately
│   └── __init__.py               facade re-exports
│
├── observe/                      # First per-kernel ~4-5min to capture real scalars, cache hits are instant after
│   ├── step.py                   step_observe_real_scalars: patch→build→run→parse→restore in try/finally
│   ├── instrument.py             Claude inserts fprintf(KKLOG_CPP …) in cpp wrapper + eprintln!(KKLOG_RUST …) in Rust adapter
│   ├── parse.py                  KKLOG text lines (regex) → per-launch dict; int/float best-effort coerce
│   ├── cache.py                  refs/<kernel>.observed.json read/write (human-readable JSON, stable across refactors)
│   ├── paths.py                  Parses .rs files from LOCAL_CONTEXT_FILES into (local, remote) pairs + text scp_upload
│   └── __init__.py               facade
│
├── harness/                      # standalone.cu generation / compilation / timing
│   ├── sections.py               // ===== KERNEL SECTION ===== / HARNESS SECTION ===== extract / assemble
│   │                             (optimization loop only rewrites kernel section; harness is frozen after generation)
│   ├── nvcc.py                   _nvcc_build / _standalone_dir / arch / include flags — remote nvcc 5-10s compile
│   ├── generate.py               step_generate_harness (Claude generates main + kConfigs) + deterministic helpers:
│   │                             _preprocess_for_standalone (__constant__ → __device__, GCC attr cleanup),
│   │                             _extract_kernel_launch_info (signature, block size, grid expression, dim3),
│   │                             _extract_wrapper_block_dims (pulls (bx,by,bz) from wrapper),
│   │                             _validate_harness_block_dims (post-validation + feedback retry up to 3 times),
│   │                             _classify_compile_error (structural vs semantic), _extract_error_context
│   ├── timing.py                 step_profile_standalone: Phase 1 cudaEvent timing (warmup×3 + measure×3) →
│   │                             Phase 2 ncu --set full captures metrics → _merge_harness_timing_into_metrics
│   │                             overwrites ncu's time column with cudaEvent median
│   └── __init__.py               facade
│
├── ncu/                          # ncu output parsing + rule engine (pure data layer, no SSH/Claude)
│   ├── metrics.py                ncu --page details / CSV parsing, SKILL_METRICS grouping,
│   │                             _parse_per_launch_data, WORKLOAD_METRICS (hw counters), unit conversion
│   ├── rules.py                  declarative RULES list (each: id + severity + one pure function),
│   │                             _apply_rules / _build_skill_summary / render to ncu_finding_N markdown
│   ├── configs.py                _find_representative_configs: bucket by (grid_x,block_x) → p25/p50/p95 →
│   │                             variance split fast/slow → top-K; _snap_gb_key schema-flexible alignment (1D/dim3)
│   ├── recommendations.py        NVIDIA OPT/INF/WRN three-level parsing, sort by speedup, deduplicate,
│   │                             _cross_check_findings (mark ✓ for findings also in local rules), _trim_ncu_details
│   ├── knowledge.py              NCU_KNOWLEDGE: ncu CLI flag reference (only for profile-script gen + crash triage)
│   └── __init__.py               facade
│
├── knowledge/                    # Three types of prompt-injected knowledge
│   ├── memory.py                 memory/<kernel>.md append + _extract_memo (// MEMO: [tag] ...) +
│   │                             _load_memory_past_runs / _load_memory_current_run (split into system/user blocks)
│   ├── compact.py                _maybe_compact_memory: when memory exceeds threshold, uses Haiku to compress into
│   │                             ## Resolved / ## Pending / ## Avoid three-section `.summary.md` (runs once before run)
│   ├── patterns.py               scans all memory/*.md → aggregates by tag (DEPLOYED/ROLLED_BACK/SHORT_CIRCUIT) →
│   │                             rewrites CUDA_OPTIMIZATION.md; _load_patterns_for_findings selects top-K for prompt
│   ├── skills.py                 _load_skills_for_findings: reads skills/<id>.md full body by fired rule (top-5);
│   │                             _build_skill_catalog: one-line **Signal** summary for all 19 skills, stays in cache
│   └── __init__.py               facade
│
├── optimize/                     # Loop driver + deployment decisions
│   ├── loop.py                   step_optimize_standalone main loop (default 6 iter, early-stop after 4 no-gain),
│   │                             _claude_optimize_standalone_iter (assembles system/user prompt),
│   │                             step_integrate_and_build (strips kernel from harness back into FFI form),
│   │                             _format_per_launch_for_prompt, _claude_fix_compile_error_standalone,
│   │                             _claude_fix_harness_runtime
│   ├── apply.py                  Real-Rust apply: cargo build + three Claude fixers (compile / verify /
│   │                             runtime), falls back to .bak on N failures
│   ├── compare.py                step_compare final comparison table + rollback + CUDA_FORBIDDEN_DIFF_PATTERNS
│   │                             (statically scans diff to block "disable the original computation to win the benchmark" cheating)
│   └── __init__.py               facade
│
├── profile/                      # Real-machine timing capture
│   ├── timing.py                 step_profile_timing_only: ncu no-replay + all launches + incidentally captures
│   │                             thread_inst / dram_bytes three hw counters (zero extra overhead)
│   ├── env.py                    BINARY_ENV dict (SPLIT_THRESHOLD / CHUNK_SIZE / PICO_GPU_MEM ...),
│   │                             shared by timing.py and _legacy.py
│   ├── _legacy.py                retired step_profile / step_generate_profile_script / _claude_triage_crash
│   │                             (--set full replay path, fails on OOM for large kernels; kept as archive)
│   └── __init__.py               only re-exports live symbols; legacy requires explicit from ._legacy import
│
└── __init__.py                   package-level docstring

handlers/                        # Kernel shape plugins (one file per new shape; main loop unchanged)
├── __init__.py                   HandlerContext + KernelHandler interface + registry + batch dispatcher
├── namespace.py                  Detects namespace X { __global__ ... }, injects
│                                 "NAMESPACE CONTEXT" into harness prompt to have Claude add using or fully-qualified calls
└── prose_strip.py                Strips leading English prose from Claude responses ("Looking at the profile, I need to...")
                                  — contains `—` `≈` `'` and other non-ASCII that causes nvcc / rustc "extended character" errors

Dependency direction is strictly bottom-up: config → claude + remote → observe/harness/ncu/knowledge/profile → optimize; no cycles. Swapping models only touches claude/; swapping profilers only touches profile/ + ncu/; adding a new kernel shape only adds one file to handlers/ + one registration line; agent.py itself is untouched.

Architecture diagram (data flow)

flowchart LR
    cfg["config.env<br/>(cfg)"]
    agent["agent.py · orchestrator<br/>(agent)"]

    subgraph know["Knowledge sources (prompt-injected)"]
        direction TB
        sk["skills/*.md<br/>catalog + top-5 full body<br/>(sk)"]
        rl["ncu rules matrix<br/>(rl)"]
        me["memory/&lt;kernel&gt;.md + Haiku summary<br/>(me)"]
        pt["CUDA_OPTIMIZATION.md · cross-kernel patterns<br/>(pt)"]
    end

    subgraph mod["kernelking/ modules"]
        direction TB
        ob["observe · scalar instrumentation<br/>(ob)"]
        ha["harness · gen + block-geom validator<br/>(ha)"]
        nc["ncu parser · metrics → bottleneck + findings<br/>(nc)"]
        pr["profile · real-env timing-only<br/>(pr)"]
        op["optimize · iter loop + integrate + decision<br/>(op)"]
    end

    cl["claude/ · prompt cache + retry<br/>(cl)"]
    api[("Claude API<br/>(api)")]

    subgraph rm["Remote GPU server (SSH)"]
        direction TB
        nv["nvcc<br/>(nv)"]
        sb["standalone_bin · cudaEvent<br/>(sb)"]
        nx["ncu CLI · --set full / timing-only<br/>(nx)"]
        cg["cargo build → prover<br/>(cg)"]
    end

    cfg --> agent --> op
    op --> ob
    op --> ha
    op --> nc
    op --> pr

    ha -.prompt.-> cl
    ob -.prompt.-> cl
    op -.prompt.-> cl
    know --> cl
    cl <--> api

    cl == "harness .cu" ==> ha
    cl == "instrumentation patch" ==> ob
    cl == "new kernel code + MEMO" ==> op

    ha -- "SCP standalone.cu" --> nv
    nv --> sb
    sb -- "KK_TIMING" --> nc
    nv --> nx
    nx -- ".ncu-rep + CSV" --> nc
    nc --> op

    op -- integrate --> cg
    cg --> nx
    nx -- "real-env delta" --> op
    pr --> nx

Legend: -.prompt.-> is the prompt sent to Claude (dashed, different each round); ==> is the artifact returned by Claude (thick lines, three different uses); solid lines are data flows between local/remote tools.

Three paths through this diagram:

Observation (first time per kernel): agent → observe → claude sends prompt → Claude thick line ==> returns to ob a cpp/rust patch with eprintln! / printf → SCP to remote → cargo build → prover runs once → real scalars written back to refs/<kernel>.observed.json
Standalone optimization loop: op → ha sends prompt → Claude thick line ==> returns to ha a standalone-compilable harness .cu → SCP → nvcc → standalone_bin prints KK_TIMING + ncu --set full captures metrics → nc parses into findings → combined with know (skills + rules + memory + patterns) to build a prompt for cl → Claude thick line ==> returns to op a new kernel source + MEMO line (explaining what changed and which rule was targeted) → next round
Deployment: op strips the kernel from the harness back to FFI form → SCP → cargo build → ncu timing-only on real prover → real-env delta decides accept / rollback

What each arrow does (box names in the table header match the diagram exactly; short IDs in parentheses for cross-referencing):

Control flow — agent.py startup + orchestrator dispatch

From → To	Meaning
`config.env` (cfg) → `agent.py · orchestrator` (agent)	Loads `config.env` at startup: target kernel path, `KERNEL_NAME`, `LOCAL_CONTEXT_FILES`, Claude model, `RUNS_KEEP_TRANSCRIPT`, `MEMORY_COMPACT_*` switches, etc. All downstream behavior branches from here.
`agent.py · orchestrator` (agent) → `optimize` (op)	After `agent.py` completes Stage A/B/C/D (config / remote probe / harness preparation / knowledge loading), control is handed to `optimize/loop.py` to run the iter loop until accept or budget exhausted.
`optimize` (op) → `observe` (ob)	Stage 1.5: if `refs/<kernel>.observed.json` does not yet exist, `op` triggers the observation sub-pipeline — has `observe` insert `printf` / `eprintln!` into production code to capture real scalar arguments.
`optimize` (op) → `harness` (ha)	At the start of each iter, `op` calls `harness/generate.py` to request a standalone-compilable `.cu` (with `main()`, dummy data, `cudaEvent` timing).
`optimize` (op) → `ncu parser` (nc)	`op` hands the ncu report path to `kernelking/ncu/`, which parses metrics → bottleneck label + fired rule findings.
`optimize` (op) → `profile` (pr)	One real-env timing-only profile run at the start and end of each run (only `gpu__time_duration.sum`, ~15s), driving the final accept/reject decision.

Prompt flow — three call scenarios share the claude/ facade

From → To	Meaning
`harness` (ha) -.prompt.-> `claude/` (cl)	Harness generation prompt: includes kernel signature, `dim3` geometry from wrapper, observed scalars, block-geometry validation rules.
`observe` (ob) -.prompt.-> `claude/` (cl)	Observation prompt: pastes both cpp + rust source, requests a diff that "inserts `printf` at the kernel entry + argument sites, and `eprintln!` on the Rust side".
`optimize` (op) -.prompt.-> `claude/` (cl)	Optimization iteration prompt: bottleneck + findings + current best kernel + historical memory + skill bodies + cross-kernel patterns (assembled from `Knowledge sources`).
`Knowledge sources` (sk + rl + me + pt) → `claude/` (cl)	Assembles 4 knowledge source types (skills/rules/memory/patterns) into the system block, with `cache_control: ephemeral` for Claude to hit prompt cache at 10× discount.
`claude/` (cl) ↔ `Claude API` (api)	`claude/client.py` makes actual HTTP calls to the Anthropic API, handling retries, timeouts, and token accounting. Also writes `runs/.../response_<key>.txt` (Tier 2), and `prompt_<key>_user/system.txt` (Tier 3) when `RUNS_KEEP_TRANSCRIPT=1`.

Claude return artifacts — three types, each going back to the originator

From → To	Meaning
`claude/` (cl) == "harness .cu" ==> `harness` (ha)	Claude returns a complete `.cu`; `harness` validates whether its `<<<grid, block>>>` matches the wrapper's `dim3`; if not, retries with an error message up to `HARNESS_BLOCK_CHECK_RETRIES` times.
`claude/` (cl) == "instrumentation patch" ==> `observe` (ob)	Claude returns cpp + rust source with `printf` / `eprintln!`; `observe` writes it to a temporary patch locally then SCPs to remote.
`claude/` (cl) == "new kernel code + MEMO" ==> `optimize` (op)	Claude returns new kernel source + one `// MEMO: [rule_id] ...` line; `optimize` extracts the rule_id into memory, inserts the kernel back into the harness, and runs the next round.

Remote toolchain — SSH / SCP to GPU server

From → To	Meaning
`harness` (ha) — "SCP standalone.cu" → `nvcc` (nv)	SCPs the new harness to remote `/tmp/kk_<run>/`, then triggers `nvcc` compilation via `ssh`.
`nvcc` (nv) → `standalone_bin · cudaEvent` (sb)	On successful compilation, produces `standalone_bin` with a built-in cudaEvent timing loop (3 runs averaged + 1 warmup), free from ncu interference.
`standalone_bin` (sb) — "KK_TIMING" → `ncu parser` (nc)	`standalone_bin` prints `KK_TIMING cfg=... us=...` lines to stdout; `ncu parser` uses regex to capture µs per config.
`nvcc` (nv) → `ncu CLI` (nx)	The same binary is run again under `ncu --set full` (~80s) to capture the complete metric package.
`ncu CLI` (nx) — ".ncu-rep + CSV" → `ncu parser` (nc)	Report files SCP'd back; `ncu parser` parses SM/DRAM/Warp cycles/Occupancy → bottleneck label + fired rule list.
`ncu parser` (nc) → `optimize` (op)	Bottleneck + findings handed back to `optimize` for the next prompt construction.

Integration + real-machine verification — accept/reject decision path

From → To	Meaning
`optimize` (op) — "integrate" → `cargo build → prover` (cg)	The best kernel is stripped from the harness via handler (preserving the original FFI signature), SCP overwrites `REMOTE_KERNEL_CU`, triggers `cargo build --release` (with `.bak` protection; auto-rollback on failure).
`cargo build → prover` (cg) → `ncu CLI` (nx)	The compiled prover binary runs one more pass of `ncu --metrics gpu__time_duration.sum` (timing-only, ~15s), capturing actual timing distribution of all launches.
`ncu CLI` (nx) — "real-env delta" → `optimize` (op)	`optimize` uses this real-env timing vs baseline relative reduction (weighted by top-5 config, launch count weight) to decide: ≥ 2% accept, otherwise rollback `.bak`.
`profile` (pr) → `ncu CLI` (nx)	Stage 2 baseline also uses this path: `profile/` triggers one timing-only profile to get baseline total runtime + top-5 config distribution, written to `refs/<kernel>.timing.json`.

Three subsystems in depth

The three parts of Kernelking that most directly affect optimization quality and are most frequently changed are explained below. Other parts (SSH, cargo, ncu parsing, etc.) are relatively tooling-oriented and change less often.

1. Skills (bottleneck expert knowledge base)

Location: skills/*.md (currently 19 skills + one SKILLS.md glossary).

Each skill is a markdown file describing the root causes of one bottleneck type, prioritized fix directions, and anti-patterns. The first line has **Signal**: ... used for catalog extraction.

Terminology note — a few terms that appear repeatedly in the tables below:

SM (Streaming Multiprocessor): The GPU's "compute unit". An RTX 5090 has 170 SMs; each SM contains multiple sets of CUDA cores / Tensor cores / Warp schedulers. "SM utilization" in ncu is SM_Throughput_pct, measuring whether compute is saturated.

DRAM (Dynamic Random-Access Memory): The GPU's "video memory", i.e. the 32 GB main memory shown in nvidia-smi. Physically far from the SMs, with limited bandwidth (5090 ≈ 1.8 TB/s). "DRAM utilization" is DRAM_Throughput_pct, measuring whether bandwidth is saturated.

Roofline judgment: whichever percentage is higher is where the kernel is bound. High SM = compute-bound, high DRAM = memory-bound. Both low = either launch is too small, or the kernel is waiting on stalls.

L1 / L2 / smem: On-chip cache layers between SM and DRAM. L1 (≤ 128 KB) and shared memory physically share hardware, exclusive to each SM; L2 (≤ 50 MB) is shared across the whole GPU. More cache hits → fewer DRAM accesses.

Simplified hardware diagram:
Thread → Warp (32 threads) → Block → SM (block executes here)
                                      └─ smem / L1 ─┐
                                                     │
   All SMs ──────────────────────── L2 ──────── DRAM (video memory)

SKILLS.md is a glossary / style guide (not a skill itself). The 19 concrete skills are grouped into four categories:

📊 Roofline level (overall characterization)

File	Trigger signal (ncu metric)	Fix direction
`severe_memory_bound.md`	DRAM ≥ 80% peak AND SM noticeably lower	Reduce traffic, vectorize loads, increase arithmetic intensity, fuse kernels
`near_peak_compute.md`	SM ≥ 85% AND DRAM 5+ points lower	Tensor Core / FMA reordering, reduce op count via algorithm change
`huge_dram_traffic.md`	DRAM R+W ≥ 10 GB per launch	Fusion, remove redundant writes, reduce precision, remove padding

🧮 Occupancy / launch configuration

File	Trigger signal	Fix direction
`low_occupancy.md`	`warps_active.pct_of_peak < 50%`	Reduce regs/thread, shrink smem, adjust block size
`register_limited.md`	Occupancy register ceiling < warps ceiling	`__launch_bounds__`, split long functions, reuse registers
`low_waves_per_sm.md`	`waves_per_sm < 2`	Grid too small; use persistent threads / merge launches
`high_waves_per_sm.md`	`waves_per_sm ≥ 40`	Secondary signal, usually accompanies other issues

💾 Memory access / cache

File	Trigger signal	Fix direction
`poor_coalescing.md`	L2 actual/ideal sector ratio ≥ 1.25 (coalescing < 80%)	Reorder access pattern, transpose layout, smem staging
`low_l1_hit.md`	`l1tex_t_sector_hit_rate < 30%`	Increase temporal/spatial locality, tile working set
`low_l2_hit.md`	`lts_t_sector_hit_rate < 50%`	Change block-to-SM scheduling, shrink cross-block working set
`smem_bank_conflicts.md`	Shared memory bank conflict count > 0	`tile[TILE_DIM+1]` padding, reorder bank access

⏳ Warp stall breakdown ("why SM is idle")

File	Trigger signal (stall ratio)	Meaning / fix
`stall_long_scoreboard.md`	≥ 0.3	Waiting on global mem; add memory-level parallelism / prefetch
`stall_short_scoreboard.md`	≥ 0.3	Waiting on smem / L1; bank conflicts / smem overload
`stall_mio_throttle.md`	≥ 0.3	MIO queue saturated; vectorize loads to reduce instruction count
`stall_math_pipe.md`	≥ 0.3	Single arithmetic pipeline saturated (FMA/SFU/XU); change algorithm mix
`stall_barrier.md`	≥ 0.2	`bar.sync` wait; uneven workload / reduce critical section
`stall_wait.md`	≥ 0.3	`__syncthreads` wait; reduce barriers / reduce divergence
`stall_membar.md`	≥ 0.15	Redundant fence; narrow fence scope
`low_ipc.md`	IPC < 0.5	Composite symptom, usually downstream of several stalls above

Every skill file has a uniform structure: **Signal** / **Root causes** (prioritized) / **Fixes** / **Anti-patterns**. When Claude sees a rule fire, the full body is injected into the user block; skills that did not fire are still present in the system block catalog by name and Signal, allowing the model to "recall" which direction to explore.

Two-layer injection (in kernelking/knowledge/skills.py):

Layer	Contents	Where	When updated
Catalog	One-liner `Signal` per skill, assembled as `"- poor_coalescing — L2 sector ratio > 4x"`	Cached in system block (`cache_control: ephemeral`)	Stable within a run alongside skill files; essentially free
Full body (fired)	Full text of skills matched by ncu rules this round (top 5 by severity, each ≤ 2000 chars)	Per-round user message	Recomputed each iteration from the new ncu report

Stuffing all 19 skills takes ~25k tokens; the current average is ~7k tokens, with the skill portion at 3-5k — yet the model still knows all skills exist via the catalog and can recall them by name.

Adding a new skill: Create skills/<id>.md (with a **Signal**: line), then add one declarative rule to kernelking/ncu/rules.py to trigger it. No changes to the main loop.

Note: This is our own hand-crafted on-demand mechanism, not the Anthropic Agent Skills API. Triggering is entirely driven by ncu rules, making it reproducible and cost-controllable. The trade-off is that skills for bottlenecks not flagged by ncu never get their full body expanded (but the catalog still lists them).

2. Prompts (what Claude receives each round)

A single Claude call's prompt is split into a system block (stable → served from prompt cache) and a user block (changes each round).

System block (only cache_write on the first round; all subsequent iterations are cache_read):

┌─ NO_ELIDE hygiene ──────────────────────────────────┐
│ Forbids "// ... rest unchanged" lazy elisions        │
│ Requires complete kernel code in output              │
├─ Skill catalog (all 19 one-liners) ─────────────────┤
│ - poor_coalescing — L2 sector ratio > 4x             │
│ - low_occupancy   — achieved_occupancy < 25%         │
│ - ... (17 more)                                      │
├─ Fired skill full bodies (top 5) ───────────────────┤
│ - e.g. complete fix list for [poor_coalescing]       │
├─ CUDA_OPTIMIZATION.md cross-kernel pattern library ─┤
│ shared-mem tiling / vectorized loads /               │
│ grid-stride loops / warp shuffle / ...               │
├─ Past run memory (unchanged within this run) ────────┤
│ - Small kernel: tail of memory/<kernel>.md           │
│ - Large kernel: Haiku-compressed                     │
│   Resolved / Pending / Avoid three-section           │
│   memory/<kernel>.summary.md                         │
└─ Handler-contributed prompt hints (if any) ──────────┘

User block (recomputed each round):

┌─ Current kernel source (baseline or last-optimized version)
├─ Host wrapper / Rust adapter context (from LOCAL_CONTEXT_FILES)
├─ ncu analysis results
│    - Bottleneck classification (memory_bound / compute_bound / latency_bound)
│    - Fired declarative rules (each: id + severity + action, one line)
│    - Key metric values (DRAM throughput, occupancy, warp cycles, ...)
├─ NVIDIA official recommendations
│    - OPT/INF/WRN three levels from `ncu --page details`, deduplicated
│    - Annotations for which recs cross-match local rule findings (✓)
├─ Current run's iters so far
│    - Only the segment under the current ## Run <ts> (however many there are)
│    - New REJECTED/ACCEPTED entries accumulate ~200-400 chars per round
├─ Workload calibration (only appears when generating harness)
│    - kConfigs table: grid/block/target_us/WORKLOAD BUDGET/OBSERVED scalars
│    - Role A vs Role B scalar classification constraints
└─ Task statement ("Here's the kernel, optimize it")

Memory partitioning (frozen snapshot): The memory tail from past runs or the Haiku-compressed summary is unchanged within the current run, so it moves into the system block to benefit from cache_read pricing; the current-run section that accumulates new records each round stays in the user block. Design inspired by hermes-agent's MEMORY.md "session-start frozen snapshot" design.

Hard constraint gates (in kernelking/claude/client.py + kernelking/optimize/apply.py):

FORBIDDEN_DIFF_PATTERNS: Claude's output is statically scanned to intercept lazy/illegal content like // ... existing code ..., // TODO: implement, empty function bodies, non-ASCII characters (≈ →). Matches trigger a retry.
Recovery prompt: When max_tokens truncation is detected, automatically re-prompts with "continue from where you left off".
API retry + exponential backoff: overloaded_error / rate_limit / network errors use exponential backoff retry without interrupting the entire run.

Typical token distribution (measured per iteration when running transpose_kernel):

Block	Tokens	Cached
NO_ELIDE + skill catalog + fired skills + pattern library + past-run memory	~4.5–6k	cache_read (free)
ncu findings + NVIDIA recs	~1.5–2k	recomputed each round
Kernel source + context + this-run iters	~500-1k	recomputed each round
Output	~600-1.5k	—

Complete run (≈6 iters) cost ~$0.30–0.40 (after memory enters cache, ~20–25% cheaper than before).

3. Harness (how standalone.cu is generated)

Goal: Strip the target kernel from the Rust project and construct a .cu that compiles as a single nvcc file and behaves as close to the production workload as possible.

Physical structure (two clearly separated sections; the optimization loop only touches KERNEL SECTION):

// ===== KERNEL SECTION (modify this section to optimize) =====
<original kernel source after preprocessing + dependent struct / __device__ functions>
// ===== END KERNEL SECTION =====

// ===== HARNESS SECTION (do not modify) =====
struct HarnessConfig { const char* name; int n; int grid; int block; ... };
constexpr HarnessConfig kConfigs[] = {
  { "top1_fast", 4194304, 131072, 32, 402.5f, /* scalars */ },
  { "top1_slow", 4194304, 131072, 32, 1308.5f, /* scalars */ },
  ... (top-K configs from real-env ncu)
};
int main() {
  // per config: 3 warmup + 3 cudaEventRecord timing → print KK_TIMING
}
// ===== END HARNESS SECTION =====

HARNESS SECTION is generated once by Claude in step_generate_harness and is never changed again within a run. Every optimization loop iteration only rewrites KERNEL SECTION — identical workload, fair comparison.

Generation process (kernelking/harness/generate.py):

Deterministic preprocessing (no LLM)
- __attribute__((constant)) → __device__ (GCC syntax not recognized by nvcc)
- __constant__ T → __device__ T (avoids non-trivial constructor dynamic init restrictions)
Deterministic signature extraction (no LLM)
- Regex scans __launch_bounds__(BLOCK) → block size
- Regex scans kernel_name<<<grid, block>>>(...) → grid expression / template parameters
- Block geometry parsed from wrapper: reads dim3 dimBlock(bx, by[, bz]) or the second argument of <<<grid, block>>> in the host wrapper, resolves into a (bx, by, bz) triple — macros, constexpr int, and literals are all traced. Prompt states "block must be dim3(32, 8) = 256 threads/block" as a mandatory rule for Claude
- Parses argument list → [(type, name), ...] tuples
- Extracts dependent struct / constexpr / define bindings
LLM generates HARNESS SECTION (Claude)
- Prompt includes: signature, argument list, representative configs, WORKLOAD BUDGET (per-thread instruction / bytes, derived from real-machine ncu counters), OBSERVED scalars (exact values from real-machine instrumentation), and the Role A/B constraints below
Block-geometry post-validation + retry (no LLM, kernelking/harness/generate.py:_validate_harness_block_dims)
- After generation, scans every kernel_name<<<grid, block>>> in the harness source; resolves dim3(c.block_x, c.block_y) / dim3(...) literals / scalar fallback to struct HarnessConfig fields
- Compares bx*by*bz from the wrapper against each config; mismatches trigger Claude to rewrite with feedback (up to HARNESS_BLOCK_CHECK_RETRIES=3 times), with the feedback explicitly listing which configs have wrong threads/block values and the expected value
- Anti-pattern: when Claude sees only a scalar block field in kConfigs[], it tends to write dim3(c.block) — a 1D launch that drops y/z — which the post-validation catches immediately

Role A vs Role B constraints (the longest and most critical part of the harness prompt):

Each scalar argument must be classified:

Role	Determination	How to set
Role A — "active thread count"	`height` in `if (i >= height) return;`	Always = `grid * block`
Role B — "per-thread work"	`width` in `for (j = 0; j < width; ++j)`	The knob for adjusting target_us; back-calculate from WORKLOAD BUDGET or use OBSERVED value directly

Typical symptom of mixing them up: standalone shows 5 µs (nearly pure launch overhead) while real machine shows 10 ms — every thread early-exits immediately. This constraint emerged from repeated mistakes in practice.

Data priority (Claude uses data in this order):

OBSERVED (exact values from real-machine Rust/C++ instrumentation) → use directly
WORKLOAD BUDGET (per-thread inst/bytes back-calculated from ncu) → use when OBSERVED unavailable
target_us (timing target from real-machine ncu) → final sanity check

Runtime (kernelking/harness/timing.py):

./standalone_bin                 → prints KK_TIMING name=... median_us=... (authoritative timing)
sudo ncu --set full ./standalone → .ncu-rep (metrics only, timing discarded)

_merge_harness_timing_into_metrics overwrites ncu's time column with the cudaEvent median, yielding a combination of "accurate timing + hardware metrics". Each iteration these two phases together take about 50 seconds on the remote (cudaEvent < 3s, ncu ~40s).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kernelking — CUDA Kernel Optimization Agent

Why not a general-purpose LLM agent

Full workflow

Workflow diagram (lifecycle of one run)

Step-by-step breakdown (read alongside the workflow diagram)

Deep dive: what does `refs/<kernel>.observed.json` actually do

Quick start

Prerequisites

Configure `config.env`

Run

Output files

Post-mortem review: `runs/<kernel>/<ts>/` (one directory per run, never overwritten)

Safety and rollback

Design trade-offs

Module architecture

Architecture diagram (data flow)

Three subsystems in depth

1. Skills (bottleneck expert knowledge base)

2. Prompts (what Claude receives each round)

3. Harness (how standalone.cu is generated)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
handlers		handlers
kernelking		kernelking
skills		skills
.gitignore		.gitignore
CUDA_OPTIMIZATION.md		CUDA_OPTIMIZATION.md
KERNELKING_REVIEW.md		KERNELKING_REVIEW.md
README.md		README.md
agent.py		agent.py
config.env.example		config.env.example
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Kernelking — CUDA Kernel Optimization Agent

Why not a general-purpose LLM agent

Full workflow

Workflow diagram (lifecycle of one run)

Step-by-step breakdown (read alongside the workflow diagram)

Deep dive: what does refs/<kernel>.observed.json actually do

Quick start

Prerequisites

Configure config.env

Run

Output files

Post-mortem review: runs/<kernel>/<ts>/ (one directory per run, never overwritten)

Safety and rollback

Design trade-offs

Module architecture

Architecture diagram (data flow)

Three subsystems in depth

1. Skills (bottleneck expert knowledge base)

2. Prompts (what Claude receives each round)

3. Harness (how standalone.cu is generated)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Deep dive: what does `refs/<kernel>.observed.json` actually do

Configure `config.env`

Post-mortem review: `runs/<kernel>/<ts>/` (one directory per run, never overwritten)

Packages