Skip to content

benhuang2025/Kernelking

Repository files navigation

Kernelking — CUDA Kernel Optimization Agent

Kernelking is an automated agent focused on optimizing CUDA kernels driven by ncu. It is not a general-purpose code agent; instead, it hard-codes our engineering experience with Nsight Compute (the metric → bottleneck → action reasoning path) into "expert knowledge" fed to Claude, allowing the model to perform more accurately within a relatively narrow problem space.

Full closed loop: pick a CUDA kernel from a production Rust project → extract it into a standalone nvcc-compilable harness → measure with ncu → Claude modifies the code → re-measure → if there is a real gain, integrate back into the Rust project and cargo build → real-environment ncu verification → deploy or roll back.


Why not a general-purpose LLM agent

A general-purpose agent running cargo build + ncu takes 5–10 minutes per round, making the iteration cycle too long. Faced with hundreds of lines of raw ncu output, LLMs also tend to miss the key issues. Kernelking applies several engineering optimizations to close this gap:

Layer What it does Why it matters
Standalone harness Strips the kernel from the Rust project and generates a main() + nvcc-directly-runnable .cu nvcc single-file compile ~5s vs cargo build ~3min. Iteration cycle drops from minutes to seconds
cudaEvent authoritative timing The harness uses cudaEventRecord for self-timing: 3 warmup + 3 measured runs, taking the median. ncu is used only for metrics; its timing is discarded ncu's PMU setup + replay distorts timing by 2–5%; cudaEvent has near-zero overhead and the standalone finishes in a few seconds
Real scalar observation fprintf/eprintln real scalar arguments in the host wrapper and Rust adapter, run one real production workload to capture the actual (grid, block, num_rows, num_cols, ...) for every launch Scalars determine per-thread work budget. Getting them wrong makes the harness run a different workload. This step ensures the harness uses the actual parameters from production
Representative config sampling (B+) Groups real-env launches by (grid_x, block_x), splits into fast/slow buckets by p95/p25 variance, picks top-K configs covering 90%+ of total runtime Avoids running every launch through the harness; picks a few representative configs that cover 80–90% of total runtime
Index-aligned fast/slow Aligns the Nth observation launch with the Nth ncu launch by position; key extraction supports both 1D (grid/block) and dim3 (grid_x/block_x) schemas The fast-bucket and slow-bucket scalars are genuinely different (e.g. num_cols=13 vs 127), no longer sharing a median; 2D-launch kernels can also use the aligned path
Wrapper-derived block geometry Parses (bx, by, bz) from dim3 dimBlock(...) / kernel<<<grid, block>>> in the host wrapper → forces the prompt to tell Claude the correct block shape; post-generation, scans struct HarnessConfig and all launch sites for threads/block validation, retrying with feedback if wrong (up to 3 times) Historically Claude occasionally writes dim3(32,8) as scalar 32threadIdx.y is always 0 → only 1/8 of the work is done → shows 0% gain. Post-validation + retry catches these silent defects
Skill-grouped ncu metrics Groups dozens of metrics by roofline / memory / warp_stall / occupancy / instruction, feeds Claude the one-line summary memory_bound / compute_bound / latency_bound first Claude does not need to parse raw ncu --page details; it sees "memory-bound, L2 hit rate 40%" directly in the prompt
Declarative rule matrix RULES list: each entry has id + severity + condition + finding + action, e.g. smem_bank_conflicts → Pad smem to TILE_DIM+1 Gives fixed [rule_id] action hints for known pitfalls; Claude sees the recommendation at the top of the prompt
On-demand skills + catalog Loads the full text of matching skills/<id>.md files after rule hits (up to top-5); the prompt always contains a skill catalog (one-liner signal summary per skill, served from prompt cache) Stuffing all 19 skills takes ~25k tokens; hit-based loading + catalog index keeps each round at ~7k tokens, while the model can still "recognize" other skill directions from the catalog
NVIDIA official recommendations cross-check Parses OPT/INF/WRN three-level recommendations from ncu --page details, corroborated with local rule findings Built-in Nsight experience + our own rule matrix as double insurance; recommendations that match on both sides get higher priority from Claude
Cross-kernel pattern library CUDA_OPTIMIZATION.md aggregates successful strategies from past runs (shared-mem tiling / vectorize / coalescing, etc.), shared across kernels The same bottleneck type does not require Claude to "reinvent" solutions each time
Per-kernel memory memory/<kernel>.md records the MEMO + result (ACCEPTED / REJECTED / crash) for every iteration The next time the same kernel is run, previously failed strategies are not repeated
Frozen-snapshot memory + Haiku compaction Historical run memory moves into the system block (cache_read); current-run iter records accumulate in the user block. When memory exceeds a threshold, Haiku rewrites it into a Resolved / Pending / Avoid three-section summary <kernel>.summary.md Saves ~20–25% prompt cost per iter ($0.10–0.15/run); when the same kernel is run repeatedly, early experience is not lost due to tail truncation
Handler plugins handlers/*.py registers kernel-shape-specific prompt hints + post-processing (namespace detection, prose stripping, …) Adding a new special kernel shape (template / multi-namespace / special prose) only touches the handler, not the main loop
Prompt caching + retry Stable sections use cache_control: ephemeral; API fluctuations use automatic exponential backoff Each iteration's ~7k tokens are mostly cache_read, reducing cost per round from ~$0.04 to ~$0.005; network jitter no longer interrupts the run
Threshold short-circuit + early stop NO_GAIN_THRESHOLD_PCT=2.0%: skips deployment if standalone gain does not cross the threshold. NO_PROGRESS_PATIENCE=4: ends early after 4 consecutive no-progress rounds Saves ~5min of pointless cargo build + real-env ncu; does not persist with kernels already near peak performance

Full workflow

The explanation below has two layers — the workflow diagram lets you explain "what one run does" to someone in 3 minutes; the step-by-step breakdown is for people taking over the codebase.

Workflow diagram (lifecycle of one run)

flowchart TD
    start(["python3 agent.py"]) --> pre["<b>Preflight</b><br/>load config.env · probe SSH / GPU / Claude key"]
    pre --> s1["<b>Step 1</b> fetch remote kernel source + local Rust context"]
    s1 --> s15obs{"Step 1.5<br/>refs/&lt;kernel&gt;.observed.json<br/>exists?"}
    s15obs -- "YES (cached, instant)" --> s19
    s15obs -- "NO (first time, ~5min)" --> s15run["<b>Observation sub-pipeline</b><br/>Claude inserts printf + eprintln into C++ / Rust<br/>→ SCP → cargo build → run real prover once<br/>→ capture real scalars for every launch<br/>→ restore sources → rebuild"]
    s15run --> s19

    s19["<b>Step 1.9</b> normalize remote binary<br/>rebuild if current source != last-deployed binary"] --> s2
    s2["<b>Step 2</b> Real-env baseline timing<br/>ncu --metrics gpu__time_duration.sum (~1-2min)<br/>→ bucket by (grid, block) → top-K representative configs"] --> s3

    s3["<b>Step 3</b> generate standalone harness<br/>Claude takes real-config + observed scalars + wrapper dim3<br/>→ returns self-contained .cu with main()"] --> s3check{"block-geom<br/>check ok?"}
    s3check -- "NO (re-prompt)" --> s3
    s3check -- "YES" --> s4

    s4["<b>Step 4</b> nvcc compile harness"] --> s4ok{"compile<br/>ok?"}
    s4ok -- "NO (≤8 retries)" --> s4fix["Claude reads nvcc log → patches .cu"]
    s4fix --> s4
    s4ok -- "YES" --> s4base["<b>Standalone baseline</b><br/>cudaEvent timing · ncu --set full<br/>→ bottleneck + findings + fired rules"]

    s4base --> loop
    loop["<b>Step 5 optimization iter loop</b><br/>(default budget=6, early-stop after 4 no-gain iters)"] --> lprompt["build prompt:<br/>bottleneck + findings + skills + memory + cross-kernel patterns"]
    lprompt --> lclaude["Claude → new kernel code + MEMO"]
    lclaude --> lbuild["rewrite harness → nvcc → cudaEvent + ncu"]
    lbuild --> lcmp{"improvement<br/>vs best ≥ 2%?"}
    lcmp -- "YES" --> lupd["update best · write ACCEPTED to memory"] --> lmore
    lcmp -- "NO" --> lrej["write REJECTED to memory<br/>(used as 'avoid' hint next iter)"] --> lmore
    lmore{"budget left<br/>AND not 4 no-gain in a row?"}
    lmore -- "YES" --> lprompt
    lmore -- "NO" --> gate

    gate{"<b>Short-circuit gate</b><br/>standalone best gain ≥ 2%?"}
    gate -- "NO" --> skip(["⏭ skip integration<br/>server unchanged<br/>exit 2"])
    gate -- "YES" --> s6

    s6["<b>Step 6</b> Integrate<br/>Claude strips harness → FFI-compat .cu<br/>→ SCP overwrite with .bak · cargo build"] --> s7
    s7["<b>Step 7</b> real-env optimized timing<br/>ncu timing-only on full prover run"] --> fin
    fin{"real-env delta<br/>vs baseline?"}
    fin -- "improved" --> accept(["🎉 <b>ACCEPTED</b><br/>optimization live on server<br/>exit 0"])
    fin -- "regressed" --> roll(["⏪ <b>ROLLED BACK</b><br/>restore .bak + cargo build<br/>exit 1"])

    classDef ok fill:#0d3,stroke:#093,color:#fff
    classDef bad fill:#c42,stroke:#822,color:#fff
    classDef skip fill:#888,stroke:#555,color:#fff
    class accept ok
    class roll bad
    class skip skip
Loading

3-minute pitch for this diagram:

"This agent's work can be broken down into two nested feedback loops. The inner loop (Step 5's iter loop) is a fast cycle: Claude reviews ncu metrics and proposes a new kernel version; the standalone harness runs it on the GPU for a few seconds and gets latency + new ncu metrics; the result is compared against the historical best. If it wins, the version is recorded; if it loses, a note is written to memory so the next round doesn't repeat the same mistake. The outer loop (Steps 6-7) is a slow loop: can the standalone winner actually reproduce in a real prover? So it gets integrated back into the Rust project via cargo build, then ncu measures the complete timing on a real machine prover run. Only when the real-machine time also genuinely improves is it accepted; otherwise .bak is restored. The entire process is fully automatic: fetch remote kernel, scrape scalars from the real prover, generate harness, profile, prompt Claude, retry, integrate, verify, roll back — all from a single command."

Three design points worth calling out:

  1. Observation cache (Step 1.5) — The first time a kernel is run, 5 minutes are spent patching + running the prover + capturing scalars, with results cached in refs/<kernel>.observed.json. On subsequent runs, as long as the signature hasn't changed, this step completes instantly. This means "running the standalone at real workload scale" is not a 5-minute tax paid on every run.
  2. Short-circuit Gate — When the standalone has not achieved ≥ 2% gain, integration + real-machine verification (which together take ~3-5 minutes) are skipped entirely. No GPU time is wasted verifying a byte-identical kernel.
  3. Rollback is automatic — When the standalone wins but the real environment regresses (e.g. cache / launch overhead effects not reflected in the harness), the agent automatically restores using .bak + re-runs cargo build, leaving no half-applied state.

Step-by-step breakdown (read alongside the workflow diagram)

Steps explained in the order they appear in the diagram. The Artifacts / cache column lists key files written at each step, on which subsequent steps depend.

Step (node in diagram) What it does Artifacts / cache Typical duration
Preflight Loads config.env; probes SSH connectivity to remote, nvidia-smi for free GPU, and validity of ANTHROPIC_API_KEY < 5s
Step 1 fetch remote kernel ssh <host> cat <REMOTE_KERNEL_CU> pulls the target .cu locally, plus reads LOCAL_CONTEXT_FILES (Rust FFI adapter) as context for Claude In-memory kernel source string + Rust context string ~2s
Step 1.5 Observation sub-pipeline Claude inserts temporary printf / eprintln! into the C++ wrapper and Rust adapter to print every launch's scalar arguments (num_rows, height, chunk_size, is_inject, …); SCP to remote, cargo build, run the real prover once to capture all scalars, then restore sources and rebuild to a clean state refs/<kernel>.observed.json — see the deep-dive section below First time ~5min; cache hit is instant
Step 1.9 normalize binary The remote single-node prover binary may be an optimized version from the last deploy, while the current source has been reverted to its original. This step runs cargo build --release to sync the binary to the current source, avoiding a falsely low baseline from a "last-deployed optimized version" Remote binary aligned to current source Source unchanged ~2s; source changed ~3min
Step 2 real-env baseline Runs ncu --metrics gpu__time_duration.sum inside the real prover to capture the timing of every kernel launch. Groups by (grid_x, block_x) + variance to split fast/slow → picks top-K representative configs (default K=5, weighted by total duration share), and records baseline total runtime refs/<kernel>.timing.json + representative_configs[] (each with grid/block/share/scalar) 1-2min
Step 3 generate harness Claude generates a standalone-compilable .cu (with main(), dummy data allocation, 5 kernel launches, cudaEvent timing loop) based on (kernel source + Rust context + top-K representative configs + real scalars per config + wrapper dim3 geometry) Local /var/folders/.../tmpXXXX.cu ~15-30s
block-geom check Parses <<<grid, block>>> syntax in Claude's output, flattens block (dim3(32,8) → 256 threads) and compares against the real dim3 from the wrapper. If mismatched, returns the error to Claude for regeneration, up to HARNESS_BLOCK_CHECK_RETRIES (default 2) times < 1s per check
Step 4 nvcc compile SCP harness to remote /tmp/kk_<run>/, trigger nvcc -O3 -arch=... -o standalone_bin standalone.cu via ssh Remote standalone_bin ~5-10s
compile-fix loop When nvcc fails, Claude receives the stderr log and patches .cu, then re-SCP + nvcc, up to 8 times. Only a total failure causes die() ~20-30s per attempt
Standalone baseline Runs the freshly compiled standalone_bin:
Phase 1 cudaEvent timing: 3 warmup + 3 measured runs, take median (pure timing, no ncu overhead, a few seconds)
Phase 2 ncu --set full captures SM / DRAM / warp stall / occupancy / launch stats in one shot (timing discarded)
standalone_report_iter_0.ncu-rep + .csv + bottleneck label + fired rules ~40-50s (ncu --set full dominates)
Step 5 iter loop Core optimization loop. Each round: build prompt (bottleneck + findings + current best kernel + skills/*.md top-5 + memory/<kernel>.md past runs + CUDA_OPTIMIZATION.md relevant tags) → Claude returns new kernel + MEMO → replace in harness → nvcc → cudaEvent + ncu → compare against current best One runs/.../iter_N/ per round (containing standalone.cu / diff / memo / ncu_brief / Claude response) ~1-2min per round
≥2% improvement? ≥ 2% → update best, write ACCEPTED to memory (basis for confidence next time a similar change is tried); < 2% → do not update best, write REJECTED to memory (the next prompt will see "this path was tried, do not repeat") Appends one line to memory/<kernel>.md < 1s
budget left? 6 rounds exhausted OR 4 consecutive no-progress rounds → exit loop; otherwise continue
Short-circuit gate After exiting the loop, check standalone best gain relative to baseline. < 2% means no substantial improvement; skip Step 6/7 entirely (~3-5min of GPU time), exit 2 Appends SHORT-CIRCUIT record to memory
Step 6 Integrate Claude strips the __global__ out of the harness and reassembles it into the original FFI signature; SCP overwrites REMOTE_KERNEL_CU (saving .bak first), then cargo build --release rebuilds the entire prover Remote .bak + freshly built prover binary ~3min
Step 7 real-env optimized timing Runs ncu --metrics gpu__time_duration.sum with the new binary, capturing the total timing distribution of all launches in the real prover refs/report_optimized_real.* ~1-2min
real-env improved? Real-machine total runtime < baseline total runtime → ACCEPTED (optimization stays on server, exit 0); ≥ baseline → ROLLED BACK (mv .bak → kernel.cu + re-run cargo build + exit 1) Appends DEPLOYED / ROLLED_BACK to memory ~3min (rollback requires another build)

Deep dive: what does refs/<kernel>.observed.json actually do

The problem: When generating the standalone harness, Claude needs to know "when this kernel is launched in the real prover, what are the exact values of its arguments". Reading only the ncu report gives you grid and block, but scalar arguments (num_rows, height, chunk_size, offset, is_inject, etc.) are completely unknown. If a small size is guessed (e.g. num_rows=1024), the standalone's measured µs diverges from the real machine by tens of times, making the optimization useless.

The solution: Step 1.5 spends 5 minutes doing "observation" — Claude inserts printf / eprintln! into the production code, like this:

// Temporarily inserted instrumentation
fprintf(stderr, "[KKLOG_CPP] kernel=oss_leaf_hash n=%u num_poly=%u offset=%u "
                "grid_x=%u grid_y=%u block_x=%u block_y=%u\n",
        n, num_poly, offset, grid.x, grid.y, block.x, block.y);

Then SCP, cargo build, run the real prover once, and the stderr stream fills with dozens or hundreds of real argument records. The agent collects and parses them into JSON, stored as refs/<kernel>.observed.json, structured roughly like this:

{
  "kernel": "oss_leaf_hash_kernel",
  "captured_at": "2026-04-19T15:59:21",
  "launches": [
    { "grid_x": 32768, "block_x": 256, "n": 32768, "num_poly": 64, "offset": 0 },
    { "grid_x": 16384, "block_x": 256, "n": 16384, "num_poly": 64, "offset": 0 },
    ...
  ]
}

Why cache it:

  • Observation itself is expensive (patch code → SCP → build → prover → collect → restore → build) takes ~5min
  • But when the kernel signature is unchanged, the scalar distribution is stable (the same proof config run repeatedly produces the same parameters)
  • So as long as refs/<kernel>.observed.json exists, the agent loads it directly and skips all of Step 1.5 — this is what turns "run standalone at real workload scale" from "pay 5min on every run" into "pay once, then free"

When to manually delete this cache:

  • The kernel signature changed (different argument list)
  • The launch config calculation logic in production code changed (grid/block changed)
  • You have doubts about the workload characteristics of this kernel and want to re-observe ground truth

After deletion, the next run automatically re-observes.


Quick start

Prerequisites

  • Python ≥ 3.9 + anthropic SDK
  • Password-free SSH access to the remote GPU server (host alias, e.g. zan1)
  • CUDA toolkit on the remote machine (nvcc / ncu)
  • A cargo build-able project on the remote (prover binary)
pip3 install -r requirements.txt

Configure config.env

Refer to the config.env template (do not commit, already in .gitignore):

# Claude API
ANTHROPIC_API_KEY=sk-ant-...
CLAUDE_MODEL=claude-sonnet-4-6      # or claude-haiku-4-5 for debugging

# SSH
SSH_HOST=zan1
SSH_USER=ubuntu

# Local code mirror (Rust adapter / headers, given to Claude as context)
LOCAL_CODEBASE_PATH=/Users/you/repo/brevis-vm
LOCAL_CONTEXT_FILES=vm/src/cuda_adaptor/transpose.rs

# Target kernel to optimize
REMOTE_KERNEL_CU=/home/ubuntu/ben/brevis-vm/private-pico-gpu/transpose/transpose.cu
KERNEL_NAME=transpose_kernel

# Remote paths
REMOTE_BASEDIR=/home/ubuntu/ben/brevis-vm
REMOTE_BINARY=/home/ubuntu/ben/brevis-vm/target/release/single-node
REMOTE_LIBPATH=/home/ubuntu/ben/brevis-vm:/usr/local/cuda/lib64

# Harness / iteration / sampling parameters
HARNESS_GPU_ARCH=sm_120             # RTX 5090→120, 4090→89, A100→80
HARNESS_NVCC_DEFINES=-DFEATURE_KOALA_BEAR
MAX_OPTIMIZE_ITERATIONS=6           # max rounds per kernel
NO_PROGRESS_PATIENCE=4              # stop early after N consecutive no-progress rounds
NO_GAIN_THRESHOLD_PCT=2.0           # skip deploy if standalone gain < 2%
HARNESS_TOP_K=5                     # K=3 ≈70%, K=5 ≈88%, K=7 ≈98% runtime coverage
HARNESS_VARIANCE_THRESHOLD=2.5      # p95/p25 > threshold triggers fast/slow split

Run

python3 agent.py 2>&1 | tee agent_run.log

To switch kernels, only change REMOTE_KERNEL_CU + KERNEL_NAME + LOCAL_CONTEXT_FILES in config.env. The first run for a new kernel does the full observation (~5min); subsequent runs hit the cache instantly.


Output files

Persistent (tracked or long-term useful):

Path Contents
refs/<kernel>.observed.json First-run observation cache — real scalar snapshot
memory/<kernel>.md MEMO / result records for every optimization round; Claude reads this on the next run
CUDA_OPTIMIZATION.md Cross-kernel success pattern library; auto-regenerated from memory/*.md each time agent.py starts — do not edit manually
skills/*.md Optimization skill documents for each bottleneck type
Remote <kernel>.cu.bak Baseline backup for rollback

Ephemeral (gitignored, overwritten each run):

Path Contents
standalone_best.cu Best harness code from this run
<kernel>_integrated.cu FFI-compatible version stripped from harness, deployed to server
standalone_report_*.details.txt ncu text reports (baseline / each round)
report_{baseline,optimized}_real.* Real-machine ncu reports + workload CSV
agent_run*.log / build_*.log Run logs

Post-mortem review: runs/<kernel>/<ts>/ (one directory per run, never overwritten)

Every optimization run writes an independent directory for post-run review. iter 0 is the baseline; iter 1..N are Claude's successive attempts. Rejected attempts are fully preserved, making it easy to investigate "why didn't this direction work".

runs/oss_compress_kernel/2026-04-19T14-45-48/
├── agent_report.md                  TL;DR + iter links + mirror of memory/<kernel>.md for this run
├── best.cu                          Final standalone adopted this run
├── iter_0/                          Baseline snapshot (untouched by Claude)
│   ├── standalone.cu
│   └── ncu_brief.txt
├── iter_1/
│   ├── standalone.cu                Complete code output by Claude this iter
│   ├── response_optimize.txt        Raw Claude response (includes // MEMO: [...] + code)
│   ├── diff_vs_prev.patch           Unified diff vs best_so_far
│   ├── ncu_brief.txt                bottleneck / duration / top findings
│   ├── memo.txt                     Extracted MEMO one-liner
│   └── response_compile_fix_1.txt   Optional, only present when compile-fix was triggered this iter
├── iter_2/ ...
└── iter_N/

Storage tiers:

  • Tier 1 (strategic summary, always kept): memory/<kernel>.md + agent_run.log + observe_cache/. ~20 KB / run.
  • Tier 2 (iter snapshots, always kept): all standalone.cu + response_*.txt + diff + ncu_brief.txt in this directory. ~50 KB / 6-iter run. Covers 95% of post-mortem scenarios.
  • Tier 3 (full prompts, gated): prompt_<key>_user.txt + prompt_<key>_system.txt. ~+0.5 MB / run. Only written when RUNS_KEEP_TRANSCRIPT=1, for debugging "what exactly did Claude see / was it truncated by max_tokens / did memory make it into the prompt".

Safety and rollback

  • cp <kernel>.cu <kernel>.cu.bak is done before Rust integration
  • Integration failure (compile error) → auto-debug loop (up to 8 times), then cp .bak <kernel>.cu rollback + cargo build to restore
  • Real-env optimized timing shows regression → also rolls back
  • C++ / Rust files modified during observation are only temporarily changed within the observation window; the originals are automatically SCP'd back before the window closes

All SSH connections use ControlMaster multiplexing + automatic retry, and will not hang if the master connection drops.


Design trade-offs

Local brain + remote GPU. Anthropic API tokens never touch the production machine; only nvcc / ncu / cargo run remotely.

Standalone is not a perfect substitute for real-env. It is a 10–20× faster proxy; the final step always requires real-env confirmation. The per-config standalone vs real-env gap (typically 1.3–3×) comes from cold caches, the memory allocator, and interference from other kernels.

Conservative threshold. 2% accounts for cudaEvent run-to-run jitter (~0.5–1.5%); only improvements sufficiently above the noise floor are counted as valid. Manually adjustable via the NO_GAIN_THRESHOLD_PCT environment variable.

Known limitations.

  • For 2D-grid kernels (e.g. transpose), the bucketing stage groups by (grid_x, block_x) and grid_y is merged into the same bucket; however block_y is already forced into the harness via wrapper parsing, and num_rows/num_cols scalars are passed per-launch through schema-flexible alignment, so fast/slow differences still land on per-thread workload
  • Template kernels (e.g. cpu_events_to_trace_kernel<F, NumCols>) require manually providing concrete template parameters in the harness; auto-inference is not yet implemented

Module architecture

agent.py now only handles orchestration (preflight → string steps together → print comparison). Implementation is split into kernelking/ sub-packages, each with a single responsibility. Below is a file-level tree (one-line purpose per .py):

kernelking/
├── config.py                     load_config / cfg / log / banner / die / ROOT path / ANSI colors
├── costs.py                      Claude token + USD accumulation; $/Mtok table by model (wired in _claude_call)
├── rundir.py                     runs/<kernel>/<ts>/ review directory: begin_run / begin_iter /
│                                 dump_response / dump_prompt (Tier 3 gated) / refresh_agent_report
│
├── claude/
│   ├── client.py                 _claude_call single-turn API + cache_control: ephemeral + max_tokens auto-continue
│   │                             + exponential backoff retry (429/5xx/network) + _NO_ELIDE_RULE + _strip_fences
│   └── __init__.py               re-export facade
│
├── remote/
│   ├── ssh.py                    ssh_run / ssh_run_capture / scp_upload / scp_download
│   │                             (ControlMaster=auto multiplexes same TCP/SSH, ControlPersist=300)
│   ├── gpu.py                    _check_and_clean_gpu: nvidia-smi scans processes; Kernelking's own (standalone /
│   │                             single-node / ncu) are killed, external ones (VLLM etc.) abort immediately
│   └── __init__.py               facade re-exports
│
├── observe/                      # First per-kernel ~4-5min to capture real scalars, cache hits are instant after
│   ├── step.py                   step_observe_real_scalars: patch→build→run→parse→restore in try/finally
│   ├── instrument.py             Claude inserts fprintf(KKLOG_CPP …) in cpp wrapper + eprintln!(KKLOG_RUST …) in Rust adapter
│   ├── parse.py                  KKLOG text lines (regex) → per-launch dict; int/float best-effort coerce
│   ├── cache.py                  refs/<kernel>.observed.json read/write (human-readable JSON, stable across refactors)
│   ├── paths.py                  Parses .rs files from LOCAL_CONTEXT_FILES into (local, remote) pairs + text scp_upload
│   └── __init__.py               facade
│
├── harness/                      # standalone.cu generation / compilation / timing
│   ├── sections.py               // ===== KERNEL SECTION ===== / HARNESS SECTION ===== extract / assemble
│   │                             (optimization loop only rewrites kernel section; harness is frozen after generation)
│   ├── nvcc.py                   _nvcc_build / _standalone_dir / arch / include flags — remote nvcc 5-10s compile
│   ├── generate.py               step_generate_harness (Claude generates main + kConfigs) + deterministic helpers:
│   │                             _preprocess_for_standalone (__constant__ → __device__, GCC attr cleanup),
│   │                             _extract_kernel_launch_info (signature, block size, grid expression, dim3),
│   │                             _extract_wrapper_block_dims (pulls (bx,by,bz) from wrapper),
│   │                             _validate_harness_block_dims (post-validation + feedback retry up to 3 times),
│   │                             _classify_compile_error (structural vs semantic), _extract_error_context
│   ├── timing.py                 step_profile_standalone: Phase 1 cudaEvent timing (warmup×3 + measure×3) →
│   │                             Phase 2 ncu --set full captures metrics → _merge_harness_timing_into_metrics
│   │                             overwrites ncu's time column with cudaEvent median
│   └── __init__.py               facade
│
├── ncu/                          # ncu output parsing + rule engine (pure data layer, no SSH/Claude)
│   ├── metrics.py                ncu --page details / CSV parsing, SKILL_METRICS grouping,
│   │                             _parse_per_launch_data, WORKLOAD_METRICS (hw counters), unit conversion
│   ├── rules.py                  declarative RULES list (each: id + severity + one pure function),
│   │                             _apply_rules / _build_skill_summary / render to ncu_finding_N markdown
│   ├── configs.py                _find_representative_configs: bucket by (grid_x,block_x) → p25/p50/p95 →
│   │                             variance split fast/slow → top-K; _snap_gb_key schema-flexible alignment (1D/dim3)
│   ├── recommendations.py        NVIDIA OPT/INF/WRN three-level parsing, sort by speedup, deduplicate,
│   │                             _cross_check_findings (mark ✓ for findings also in local rules), _trim_ncu_details
│   ├── knowledge.py              NCU_KNOWLEDGE: ncu CLI flag reference (only for profile-script gen + crash triage)
│   └── __init__.py               facade
│
├── knowledge/                    # Three types of prompt-injected knowledge
│   ├── memory.py                 memory/<kernel>.md append + _extract_memo (// MEMO: [tag] ...) +
│   │                             _load_memory_past_runs / _load_memory_current_run (split into system/user blocks)
│   ├── compact.py                _maybe_compact_memory: when memory exceeds threshold, uses Haiku to compress into
│   │                             ## Resolved / ## Pending / ## Avoid three-section `.summary.md` (runs once before run)
│   ├── patterns.py               scans all memory/*.md → aggregates by tag (DEPLOYED/ROLLED_BACK/SHORT_CIRCUIT) →
│   │                             rewrites CUDA_OPTIMIZATION.md; _load_patterns_for_findings selects top-K for prompt
│   ├── skills.py                 _load_skills_for_findings: reads skills/<id>.md full body by fired rule (top-5);
│   │                             _build_skill_catalog: one-line **Signal** summary for all 19 skills, stays in cache
│   └── __init__.py               facade
│
├── optimize/                     # Loop driver + deployment decisions
│   ├── loop.py                   step_optimize_standalone main loop (default 6 iter, early-stop after 4 no-gain),
│   │                             _claude_optimize_standalone_iter (assembles system/user prompt),
│   │                             step_integrate_and_build (strips kernel from harness back into FFI form),
│   │                             _format_per_launch_for_prompt, _claude_fix_compile_error_standalone,
│   │                             _claude_fix_harness_runtime
│   ├── apply.py                  Real-Rust apply: cargo build + three Claude fixers (compile / verify /
│   │                             runtime), falls back to .bak on N failures
│   ├── compare.py                step_compare final comparison table + rollback + CUDA_FORBIDDEN_DIFF_PATTERNS
│   │                             (statically scans diff to block "disable the original computation to win the benchmark" cheating)
│   └── __init__.py               facade
│
├── profile/                      # Real-machine timing capture
│   ├── timing.py                 step_profile_timing_only: ncu no-replay + all launches + incidentally captures
│   │                             thread_inst / dram_bytes three hw counters (zero extra overhead)
│   ├── env.py                    BINARY_ENV dict (SPLIT_THRESHOLD / CHUNK_SIZE / PICO_GPU_MEM ...),
│   │                             shared by timing.py and _legacy.py
│   ├── _legacy.py                retired step_profile / step_generate_profile_script / _claude_triage_crash
│   │                             (--set full replay path, fails on OOM for large kernels; kept as archive)
│   └── __init__.py               only re-exports live symbols; legacy requires explicit from ._legacy import
│
└── __init__.py                   package-level docstring

handlers/                        # Kernel shape plugins (one file per new shape; main loop unchanged)
├── __init__.py                   HandlerContext + KernelHandler interface + registry + batch dispatcher
├── namespace.py                  Detects namespace X { __global__ ... }, injects
│                                 "NAMESPACE CONTEXT" into harness prompt to have Claude add using or fully-qualified calls
└── prose_strip.py                Strips leading English prose from Claude responses ("Looking at the profile, I need to...")
                                  — contains `—` `≈` `'` and other non-ASCII that causes nvcc / rustc "extended character" errors

Dependency direction is strictly bottom-up: config → claude + remote → observe/harness/ncu/knowledge/profile → optimize; no cycles. Swapping models only touches claude/; swapping profilers only touches profile/ + ncu/; adding a new kernel shape only adds one file to handlers/ + one registration line; agent.py itself is untouched.

Architecture diagram (data flow)

flowchart LR
    cfg["config.env<br/>(cfg)"]
    agent["agent.py · orchestrator<br/>(agent)"]

    subgraph know["Knowledge sources (prompt-injected)"]
        direction TB
        sk["skills/*.md<br/>catalog + top-5 full body<br/>(sk)"]
        rl["ncu rules matrix<br/>(rl)"]
        me["memory/&lt;kernel&gt;.md + Haiku summary<br/>(me)"]
        pt["CUDA_OPTIMIZATION.md · cross-kernel patterns<br/>(pt)"]
    end

    subgraph mod["kernelking/ modules"]
        direction TB
        ob["observe · scalar instrumentation<br/>(ob)"]
        ha["harness · gen + block-geom validator<br/>(ha)"]
        nc["ncu parser · metrics → bottleneck + findings<br/>(nc)"]
        pr["profile · real-env timing-only<br/>(pr)"]
        op["optimize · iter loop + integrate + decision<br/>(op)"]
    end

    cl["claude/ · prompt cache + retry<br/>(cl)"]
    api[("Claude API<br/>(api)")]

    subgraph rm["Remote GPU server (SSH)"]
        direction TB
        nv["nvcc<br/>(nv)"]
        sb["standalone_bin · cudaEvent<br/>(sb)"]
        nx["ncu CLI · --set full / timing-only<br/>(nx)"]
        cg["cargo build → prover<br/>(cg)"]
    end

    cfg --> agent --> op
    op --> ob
    op --> ha
    op --> nc
    op --> pr

    ha -.prompt.-> cl
    ob -.prompt.-> cl
    op -.prompt.-> cl
    know --> cl
    cl <--> api

    cl == "harness .cu" ==> ha
    cl == "instrumentation patch" ==> ob
    cl == "new kernel code + MEMO" ==> op

    ha -- "SCP standalone.cu" --> nv
    nv --> sb
    sb -- "KK_TIMING" --> nc
    nv --> nx
    nx -- ".ncu-rep + CSV" --> nc
    nc --> op

    op -- integrate --> cg
    cg --> nx
    nx -- "real-env delta" --> op
    pr --> nx
Loading

Legend: -.prompt.-> is the prompt sent to Claude (dashed, different each round); ==> is the artifact returned by Claude (thick lines, three different uses); solid lines are data flows between local/remote tools.

Three paths through this diagram:

  • Observation (first time per kernel): agent → observe → claude sends prompt → Claude thick line ==> returns to ob a cpp/rust patch with eprintln! / printf → SCP to remote → cargo build → prover runs once → real scalars written back to refs/<kernel>.observed.json
  • Standalone optimization loop: op → ha sends prompt → Claude thick line ==> returns to ha a standalone-compilable harness .cu → SCP → nvccstandalone_bin prints KK_TIMING + ncu --set full captures metrics → nc parses into findings → combined with know (skills + rules + memory + patterns) to build a prompt for cl → Claude thick line ==> returns to op a new kernel source + MEMO line (explaining what changed and which rule was targeted) → next round
  • Deployment: op strips the kernel from the harness back to FFI form → SCP → cargo buildncu timing-only on real prover → real-env delta decides accept / rollback

What each arrow does (box names in the table header match the diagram exactly; short IDs in parentheses for cross-referencing):

Control flow — agent.py startup + orchestrator dispatch

From → To Meaning
config.env (cfg) → agent.py · orchestrator (agent) Loads config.env at startup: target kernel path, KERNEL_NAME, LOCAL_CONTEXT_FILES, Claude model, RUNS_KEEP_TRANSCRIPT, MEMORY_COMPACT_* switches, etc. All downstream behavior branches from here.
agent.py · orchestrator (agent) → optimize (op) After agent.py completes Stage A/B/C/D (config / remote probe / harness preparation / knowledge loading), control is handed to optimize/loop.py to run the iter loop until accept or budget exhausted.
optimize (op) → observe (ob) Stage 1.5: if refs/<kernel>.observed.json does not yet exist, op triggers the observation sub-pipeline — has observe insert printf / eprintln! into production code to capture real scalar arguments.
optimize (op) → harness (ha) At the start of each iter, op calls harness/generate.py to request a standalone-compilable .cu (with main(), dummy data, cudaEvent timing).
optimize (op) → ncu parser (nc) op hands the ncu report path to kernelking/ncu/, which parses metrics → bottleneck label + fired rule findings.
optimize (op) → profile (pr) One real-env timing-only profile run at the start and end of each run (only gpu__time_duration.sum, ~15s), driving the final accept/reject decision.

Prompt flow — three call scenarios share the claude/ facade

From → To Meaning
harness (ha) -.prompt.-> claude/ (cl) Harness generation prompt: includes kernel signature, dim3 geometry from wrapper, observed scalars, block-geometry validation rules.
observe (ob) -.prompt.-> claude/ (cl) Observation prompt: pastes both cpp + rust source, requests a diff that "inserts printf at the kernel entry + argument sites, and eprintln! on the Rust side".
optimize (op) -.prompt.-> claude/ (cl) Optimization iteration prompt: bottleneck + findings + current best kernel + historical memory + skill bodies + cross-kernel patterns (assembled from Knowledge sources).
Knowledge sources (sk + rl + me + pt) → claude/ (cl) Assembles 4 knowledge source types (skills/rules/memory/patterns) into the system block, with cache_control: ephemeral for Claude to hit prompt cache at 10× discount.
claude/ (cl) ↔ Claude API (api) claude/client.py makes actual HTTP calls to the Anthropic API, handling retries, timeouts, and token accounting. Also writes runs/.../response_<key>.txt (Tier 2), and prompt_<key>_user/system.txt (Tier 3) when RUNS_KEEP_TRANSCRIPT=1.

Claude return artifacts — three types, each going back to the originator

From → To Meaning
claude/ (cl) == "harness .cu" ==> harness (ha) Claude returns a complete .cu; harness validates whether its <<<grid, block>>> matches the wrapper's dim3; if not, retries with an error message up to HARNESS_BLOCK_CHECK_RETRIES times.
claude/ (cl) == "instrumentation patch" ==> observe (ob) Claude returns cpp + rust source with printf / eprintln!; observe writes it to a temporary patch locally then SCPs to remote.
claude/ (cl) == "new kernel code + MEMO" ==> optimize (op) Claude returns new kernel source + one // MEMO: [rule_id] ... line; optimize extracts the rule_id into memory, inserts the kernel back into the harness, and runs the next round.

Remote toolchain — SSH / SCP to GPU server

From → To Meaning
harness (ha) — "SCP standalone.cu" → nvcc (nv) SCPs the new harness to remote /tmp/kk_<run>/, then triggers nvcc compilation via ssh.
nvcc (nv) → standalone_bin · cudaEvent (sb) On successful compilation, produces standalone_bin with a built-in cudaEvent timing loop (3 runs averaged + 1 warmup), free from ncu interference.
standalone_bin (sb) — "KK_TIMING" → ncu parser (nc) standalone_bin prints KK_TIMING cfg=... us=... lines to stdout; ncu parser uses regex to capture µs per config.
nvcc (nv) → ncu CLI (nx) The same binary is run again under ncu --set full (~80s) to capture the complete metric package.
ncu CLI (nx) — ".ncu-rep + CSV" → ncu parser (nc) Report files SCP'd back; ncu parser parses SM/DRAM/Warp cycles/Occupancy → bottleneck label + fired rule list.
ncu parser (nc) → optimize (op) Bottleneck + findings handed back to optimize for the next prompt construction.

Integration + real-machine verification — accept/reject decision path

From → To Meaning
optimize (op) — "integrate" → cargo build → prover (cg) The best kernel is stripped from the harness via handler (preserving the original FFI signature), SCP overwrites REMOTE_KERNEL_CU, triggers cargo build --release (with .bak protection; auto-rollback on failure).
cargo build → prover (cg) → ncu CLI (nx) The compiled prover binary runs one more pass of ncu --metrics gpu__time_duration.sum (timing-only, ~15s), capturing actual timing distribution of all launches.
ncu CLI (nx) — "real-env delta" → optimize (op) optimize uses this real-env timing vs baseline relative reduction (weighted by top-5 config, launch count weight) to decide: ≥ 2% accept, otherwise rollback .bak.
profile (pr) → ncu CLI (nx) Stage 2 baseline also uses this path: profile/ triggers one timing-only profile to get baseline total runtime + top-5 config distribution, written to refs/<kernel>.timing.json.

Three subsystems in depth

The three parts of Kernelking that most directly affect optimization quality and are most frequently changed are explained below. Other parts (SSH, cargo, ncu parsing, etc.) are relatively tooling-oriented and change less often.

1. Skills (bottleneck expert knowledge base)

Location: skills/*.md (currently 19 skills + one SKILLS.md glossary).

Each skill is a markdown file describing the root causes of one bottleneck type, prioritized fix directions, and anti-patterns. The first line has **Signal**: ... used for catalog extraction.

Terminology note — a few terms that appear repeatedly in the tables below:

  • SM (Streaming Multiprocessor): The GPU's "compute unit". An RTX 5090 has 170 SMs; each SM contains multiple sets of CUDA cores / Tensor cores / Warp schedulers. "SM utilization" in ncu is SM_Throughput_pct, measuring whether compute is saturated.
  • DRAM (Dynamic Random-Access Memory): The GPU's "video memory", i.e. the 32 GB main memory shown in nvidia-smi. Physically far from the SMs, with limited bandwidth (5090 ≈ 1.8 TB/s). "DRAM utilization" is DRAM_Throughput_pct, measuring whether bandwidth is saturated.
  • Roofline judgment: whichever percentage is higher is where the kernel is bound. High SM = compute-bound, high DRAM = memory-bound. Both low = either launch is too small, or the kernel is waiting on stalls.
  • L1 / L2 / smem: On-chip cache layers between SM and DRAM. L1 (≤ 128 KB) and shared memory physically share hardware, exclusive to each SM; L2 (≤ 50 MB) is shared across the whole GPU. More cache hits → fewer DRAM accesses.

Simplified hardware diagram:

Thread → Warp (32 threads) → Block → SM (block executes here)
                                      └─ smem / L1 ─┐
                                                     │
   All SMs ──────────────────────── L2 ──────── DRAM (video memory)

SKILLS.md is a glossary / style guide (not a skill itself). The 19 concrete skills are grouped into four categories:

📊 Roofline level (overall characterization)

File Trigger signal (ncu metric) Fix direction
severe_memory_bound.md DRAM ≥ 80% peak AND SM noticeably lower Reduce traffic, vectorize loads, increase arithmetic intensity, fuse kernels
near_peak_compute.md SM ≥ 85% AND DRAM 5+ points lower Tensor Core / FMA reordering, reduce op count via algorithm change
huge_dram_traffic.md DRAM R+W ≥ 10 GB per launch Fusion, remove redundant writes, reduce precision, remove padding

🧮 Occupancy / launch configuration

File Trigger signal Fix direction
low_occupancy.md warps_active.pct_of_peak < 50% Reduce regs/thread, shrink smem, adjust block size
register_limited.md Occupancy register ceiling < warps ceiling __launch_bounds__, split long functions, reuse registers
low_waves_per_sm.md waves_per_sm < 2 Grid too small; use persistent threads / merge launches
high_waves_per_sm.md waves_per_sm ≥ 40 Secondary signal, usually accompanies other issues

💾 Memory access / cache

File Trigger signal Fix direction
poor_coalescing.md L2 actual/ideal sector ratio ≥ 1.25 (coalescing < 80%) Reorder access pattern, transpose layout, smem staging
low_l1_hit.md l1tex_t_sector_hit_rate < 30% Increase temporal/spatial locality, tile working set
low_l2_hit.md lts_t_sector_hit_rate < 50% Change block-to-SM scheduling, shrink cross-block working set
smem_bank_conflicts.md Shared memory bank conflict count > 0 tile[TILE_DIM+1] padding, reorder bank access

⏳ Warp stall breakdown ("why SM is idle")

File Trigger signal (stall ratio) Meaning / fix
stall_long_scoreboard.md ≥ 0.3 Waiting on global mem; add memory-level parallelism / prefetch
stall_short_scoreboard.md ≥ 0.3 Waiting on smem / L1; bank conflicts / smem overload
stall_mio_throttle.md ≥ 0.3 MIO queue saturated; vectorize loads to reduce instruction count
stall_math_pipe.md ≥ 0.3 Single arithmetic pipeline saturated (FMA/SFU/XU); change algorithm mix
stall_barrier.md ≥ 0.2 bar.sync wait; uneven workload / reduce critical section
stall_wait.md ≥ 0.3 __syncthreads wait; reduce barriers / reduce divergence
stall_membar.md ≥ 0.15 Redundant fence; narrow fence scope
low_ipc.md IPC < 0.5 Composite symptom, usually downstream of several stalls above

Every skill file has a uniform structure: **Signal** / **Root causes** (prioritized) / **Fixes** / **Anti-patterns**. When Claude sees a rule fire, the full body is injected into the user block; skills that did not fire are still present in the system block catalog by name and Signal, allowing the model to "recall" which direction to explore.

Two-layer injection (in kernelking/knowledge/skills.py):

Layer Contents Where When updated
Catalog One-liner **Signal** per skill, assembled as "- poor_coalescing — L2 sector ratio > 4x" Cached in system block (cache_control: ephemeral) Stable within a run alongside skill files; essentially free
Full body (fired) Full text of skills matched by ncu rules this round (top 5 by severity, each ≤ 2000 chars) Per-round user message Recomputed each iteration from the new ncu report

Stuffing all 19 skills takes ~25k tokens; the current average is ~7k tokens, with the skill portion at 3-5k — yet the model still knows all skills exist via the catalog and can recall them by name.

Adding a new skill: Create skills/<id>.md (with a **Signal**: line), then add one declarative rule to kernelking/ncu/rules.py to trigger it. No changes to the main loop.

Note: This is our own hand-crafted on-demand mechanism, not the Anthropic Agent Skills API. Triggering is entirely driven by ncu rules, making it reproducible and cost-controllable. The trade-off is that skills for bottlenecks not flagged by ncu never get their full body expanded (but the catalog still lists them).


2. Prompts (what Claude receives each round)

A single Claude call's prompt is split into a system block (stable → served from prompt cache) and a user block (changes each round).

System block (only cache_write on the first round; all subsequent iterations are cache_read):

┌─ NO_ELIDE hygiene ──────────────────────────────────┐
│ Forbids "// ... rest unchanged" lazy elisions        │
│ Requires complete kernel code in output              │
├─ Skill catalog (all 19 one-liners) ─────────────────┤
│ - poor_coalescing — L2 sector ratio > 4x             │
│ - low_occupancy   — achieved_occupancy < 25%         │
│ - ... (17 more)                                      │
├─ Fired skill full bodies (top 5) ───────────────────┤
│ - e.g. complete fix list for [poor_coalescing]       │
├─ CUDA_OPTIMIZATION.md cross-kernel pattern library ─┤
│ shared-mem tiling / vectorized loads /               │
│ grid-stride loops / warp shuffle / ...               │
├─ Past run memory (unchanged within this run) ────────┤
│ - Small kernel: tail of memory/<kernel>.md           │
│ - Large kernel: Haiku-compressed                     │
│   Resolved / Pending / Avoid three-section           │
│   memory/<kernel>.summary.md                         │
└─ Handler-contributed prompt hints (if any) ──────────┘

User block (recomputed each round):

┌─ Current kernel source (baseline or last-optimized version)
├─ Host wrapper / Rust adapter context (from LOCAL_CONTEXT_FILES)
├─ ncu analysis results
│    - Bottleneck classification (memory_bound / compute_bound / latency_bound)
│    - Fired declarative rules (each: id + severity + action, one line)
│    - Key metric values (DRAM throughput, occupancy, warp cycles, ...)
├─ NVIDIA official recommendations
│    - OPT/INF/WRN three levels from `ncu --page details`, deduplicated
│    - Annotations for which recs cross-match local rule findings (✓)
├─ Current run's iters so far
│    - Only the segment under the current ## Run <ts> (however many there are)
│    - New REJECTED/ACCEPTED entries accumulate ~200-400 chars per round
├─ Workload calibration (only appears when generating harness)
│    - kConfigs table: grid/block/target_us/WORKLOAD BUDGET/OBSERVED scalars
│    - Role A vs Role B scalar classification constraints
└─ Task statement ("Here's the kernel, optimize it")

Memory partitioning (frozen snapshot): The memory tail from past runs or the Haiku-compressed summary is unchanged within the current run, so it moves into the system block to benefit from cache_read pricing; the current-run section that accumulates new records each round stays in the user block. Design inspired by hermes-agent's MEMORY.md "session-start frozen snapshot" design.

Hard constraint gates (in kernelking/claude/client.py + kernelking/optimize/apply.py):

  • FORBIDDEN_DIFF_PATTERNS: Claude's output is statically scanned to intercept lazy/illegal content like // ... existing code ..., // TODO: implement, empty function bodies, non-ASCII characters ( ). Matches trigger a retry.
  • Recovery prompt: When max_tokens truncation is detected, automatically re-prompts with "continue from where you left off".
  • API retry + exponential backoff: overloaded_error / rate_limit / network errors use exponential backoff retry without interrupting the entire run.

Typical token distribution (measured per iteration when running transpose_kernel):

Block Tokens Cached
NO_ELIDE + skill catalog + fired skills + pattern library + past-run memory ~4.5–6k cache_read (free)
ncu findings + NVIDIA recs ~1.5–2k recomputed each round
Kernel source + context + this-run iters ~500-1k recomputed each round
Output ~600-1.5k

Complete run (≈6 iters) cost ~$0.30–0.40 (after memory enters cache, ~20–25% cheaper than before).


3. Harness (how standalone.cu is generated)

Goal: Strip the target kernel from the Rust project and construct a .cu that compiles as a single nvcc file and behaves as close to the production workload as possible.

Physical structure (two clearly separated sections; the optimization loop only touches KERNEL SECTION):

// ===== KERNEL SECTION (modify this section to optimize) =====
<original kernel source after preprocessing + dependent struct / __device__ functions>
// ===== END KERNEL SECTION =====

// ===== HARNESS SECTION (do not modify) =====
struct HarnessConfig { const char* name; int n; int grid; int block; ... };
constexpr HarnessConfig kConfigs[] = {
  { "top1_fast", 4194304, 131072, 32, 402.5f, /* scalars */ },
  { "top1_slow", 4194304, 131072, 32, 1308.5f, /* scalars */ },
  ... (top-K configs from real-env ncu)
};
int main() {
  // per config: 3 warmup + 3 cudaEventRecord timing → print KK_TIMING
}
// ===== END HARNESS SECTION =====

HARNESS SECTION is generated once by Claude in step_generate_harness and is never changed again within a run. Every optimization loop iteration only rewrites KERNEL SECTION — identical workload, fair comparison.

Generation process (kernelking/harness/generate.py):

  1. Deterministic preprocessing (no LLM)

    • __attribute__((constant))__device__ (GCC syntax not recognized by nvcc)
    • __constant__ T__device__ T (avoids non-trivial constructor dynamic init restrictions)
  2. Deterministic signature extraction (no LLM)

    • Regex scans __launch_bounds__(BLOCK) → block size
    • Regex scans kernel_name<<<grid, block>>>(...) → grid expression / template parameters
    • Block geometry parsed from wrapper: reads dim3 dimBlock(bx, by[, bz]) or the second argument of <<<grid, block>>> in the host wrapper, resolves into a (bx, by, bz) triple — macros, constexpr int, and literals are all traced. Prompt states "block must be dim3(32, 8) = 256 threads/block" as a mandatory rule for Claude
    • Parses argument list → [(type, name), ...] tuples
    • Extracts dependent struct / constexpr / define bindings
  3. LLM generates HARNESS SECTION (Claude)

    • Prompt includes: signature, argument list, representative configs, WORKLOAD BUDGET (per-thread instruction / bytes, derived from real-machine ncu counters), OBSERVED scalars (exact values from real-machine instrumentation), and the Role A/B constraints below
  4. Block-geometry post-validation + retry (no LLM, kernelking/harness/generate.py:_validate_harness_block_dims)

    • After generation, scans every kernel_name<<<grid, block>>> in the harness source; resolves dim3(c.block_x, c.block_y) / dim3(...) literals / scalar fallback to struct HarnessConfig fields
    • Compares bx*by*bz from the wrapper against each config; mismatches trigger Claude to rewrite with feedback (up to HARNESS_BLOCK_CHECK_RETRIES=3 times), with the feedback explicitly listing which configs have wrong threads/block values and the expected value
    • Anti-pattern: when Claude sees only a scalar block field in kConfigs[], it tends to write dim3(c.block) — a 1D launch that drops y/z — which the post-validation catches immediately

Role A vs Role B constraints (the longest and most critical part of the harness prompt):

Each scalar argument must be classified:

Role Determination How to set
Role A — "active thread count" height in if (i >= height) return; Always = grid * block
Role B — "per-thread work" width in for (j = 0; j < width; ++j) The knob for adjusting target_us; back-calculate from WORKLOAD BUDGET or use OBSERVED value directly

Typical symptom of mixing them up: standalone shows 5 µs (nearly pure launch overhead) while real machine shows 10 ms — every thread early-exits immediately. This constraint emerged from repeated mistakes in practice.

Data priority (Claude uses data in this order):

  1. OBSERVED (exact values from real-machine Rust/C++ instrumentation) → use directly
  2. WORKLOAD BUDGET (per-thread inst/bytes back-calculated from ncu) → use when OBSERVED unavailable
  3. target_us (timing target from real-machine ncu) → final sanity check

Runtime (kernelking/harness/timing.py):

./standalone_bin                 → prints KK_TIMING name=... median_us=... (authoritative timing)
sudo ncu --set full ./standalone → .ncu-rep (metrics only, timing discarded)

_merge_harness_timing_into_metrics overwrites ncu's time column with the cudaEvent median, yielding a combination of "accurate timing + hardware metrics". Each iteration these two phases together take about 50 seconds on the remote (cudaEvent < 3s, ncu ~40s).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages