Ai assisted decode needs verify by jaewooMaeng · Pull Request #20 · Bammuri/mlsys26

jaewooMaeng · 2026-04-25T01:57:16Z

This is the latest version, but due to Modal issues, I can't verify it.

Keep the oracle-aligned decode control plane while preserving the first main-lane candidate that cleared full-54 keep authority. This freezes the B=1-only shared q/k staging path so later iterations can branch from a proven checkpoint.

Constraint: Submission path must remain self-contained and oracle-aligned while main-lane decode experiments continue
Rejected: Keep experimenting without freezing this state | too easy to lose the first full-gate win
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the current best main decode candidate until a later full-54 result clearly beats it
Tested: Quick pinned auto/native PASS; full-54 auto 0.014 ms PASS; full-54 native_only 0.014 ms PASS; architect approval
Not-tested: Full-54 rerun stability beyond the captured runs

The approved decode lane now treats PythonBuilder execution with runtime compilation of kernel.cu as the primary measured surface. This commit rewires config.toml to the Python entrypoint and adds the initial TVM-FFI wrapper so later exact-surface hardening can happen without changing the surface contract.

Constraint: Official builders do not support direct custom CUDA compile flags
Constraint: Submission path must remain self-contained and use destination-passing style
Rejected: Keep plain kernel.cu entry as primary | weaker control over runtime compile and arch forcing
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat decode_submit_entry.py plus kernel.cu as the exact measured decode surface; keep .omx artifacts out of code commits
Tested: scripts/pack_solution.py emits runtime language=python with entry decode_submit_entry.py::run and sources [kernel.cu, decode_submit_entry.py]
Not-tested: Runtime compilation/execution on GPU

The exact PythonBuilder surface now needs stronger runtime behavior than a one-shot temp build. This change forces the decode compile target to 10.0a, keys the loaded module by source digests so edited kernels rebuild cleanly, inspects the generated build directory for hard gencode proof, and keeps DPS outputs correct even when the provided buffers are non-contiguous.

Constraint: The primary decode surface must stay PythonBuilder -> runtime compile of kernel.cu
Constraint: sm_100a targeting must be explicit and auditable on the exact surface
Rejected: Keep temp build directories and a single global module | hides arch proof and makes source edits reuse stale modules
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat donor CuTe work as off-surface until it yields a concrete kernel.cu transplant; exact-surface evidence must come from decode_submit_entry.py + kernel.cu
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python, entry decode_submit_entry.py::run, DPS=true, sources [kernel.cu, decode_submit_entry.py]
Tested: helper probe forces TVM_FFI_CUDA_ARCH_LIST=10.0a and reports the exact gencode flag / CUDA_HOME resolution path
Not-tested: Successful runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
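The digest-keyed module cache described in this commit could look roughly like the sketch below. It is illustrative only: `source_key`, `load_module`, `_MODULE_CACHE`, and the `build` callback are hypothetical names standing in for whatever decode_submit_entry.py actually does, and the real compile step is not shown.

```python
import hashlib
from pathlib import Path

# Hypothetical cache: maps a source-digest key to a loaded module object.
_MODULE_CACHE = {}


def source_key(paths):
    """Return one hex digest covering the name and bytes of every source file."""
    h = hashlib.sha256()
    for p in sorted(Path(p) for p in paths):
        h.update(p.name.encode())
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def load_module(paths, build):
    """Build at most once per distinct source content.

    `build` is a stand-in for the real runtime-compile step (assumption);
    it receives the digest key so it can pick a stable build directory.
    """
    key = source_key(paths)
    if key not in _MODULE_CACHE:
        _MODULE_CACHE[key] = build(key)
    return _MODULE_CACHE[key]
```

Because the key depends on file contents rather than on a single global slot, editing kernel.cu yields a new key and therefore a fresh build, while unchanged sources reuse the cached module.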

The wrapper now restores the caller's arch env after probing, drops an unused artifact field, and upgrades sm_100a proof from logging-only to a hard invariant so the exact surface cannot silently continue with weaker codegen evidence.

Constraint: Exact decode results must come from the PythonBuilder runtime-compile surface, not a softer inferred build path
Constraint: sm_100a targeting has to be explicit and auditable on every serious exact-surface artifact
Rejected: Only log soft/missing proof and continue | too easy to accept ambiguous codegen on the measured path
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep exact-surface proof fail-closed unless there is a deliberate decision to widen accepted build evidence
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python with entry decode_submit_entry.py::run and DPS=true
Tested: helper probe restores TVM_FFI_CUDA_ARCH_LIST after temporary forcing and after failed runtime-compile attempts
Not-tested: Successful hard-proof runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
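Restoring the caller's arch environment after temporary forcing, including after a failed compile attempt, is commonly done with a context manager. The following is a minimal sketch of that pattern; `forced_arch` is a hypothetical helper name, and only the TVM_FFI_CUDA_ARCH_LIST variable and the 10.0a value come from the commit message.

```python
import os
from contextlib import contextmanager

ARCH_ENV = "TVM_FFI_CUDA_ARCH_LIST"


@contextmanager
def forced_arch(value="10.0a", name=ARCH_ENV):
    """Temporarily set the arch env var; restore the caller's state on exit.

    The `finally` clause runs on both success and failure, so a failed
    runtime-compile attempt inside the block cannot leak the forced value.
    """
    missing = object()  # sentinel: distinguishes "unset" from "set to ''"
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

A call site would wrap only the compile step, for example `with forced_arch(): module = build(...)`, where `build` is whatever performs the runtime compilation.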

The exact PythonBuilder decode wrapper was spending steady-state time rereading kernel.cu and decode_submit_entry.py on every invocation just to derive the cache key. This change memoizes the source identity from file metadata plus digests, keeps the runtime-compile surface unchanged, and preserves the existing sm100a proof path.

Constraint: The primary decode surface must remain PythonBuilder -> decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Modal-only verification is the current authority path
Rejected: Stop after proving the surface is slow | the hook requires continuing with fresh verification evidence
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future exact-surface optimization should first eliminate Python-side per-call overhead before touching kernel math
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 22.155 us from the previous ~111.929 us exact-surface result
Tested: Modal precise B48 4c38f0e4 improved to 26.834 us and B64 ef2cf980 improved to 30.187 us
Not-tested: Full-54 precise authority run after this optimization

After removing per-call file hashing, the next remaining exact-surface tax was still in Python dispatch. This change caches the loaded TVM-FFI kernel callable once per built artifact and only calls contiguous() when tensors are actually non-contiguous, reducing steady-state Python overhead without changing the exact surface contract.

Constraint: The exact decode surface must stay on decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Verification authority remains Modal-only under 3/100/5 style runs
Rejected: Reopen the torch-extension side path | representative Modal runs were slower and not a keep signal
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep optimizing Python-side overhead before revisiting deeper runtime surface changes
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 20.803 us, B48 4c38f0e4 to 25.040 us, B64 ef2cf980 to 29.938 us
Not-tested: Full-54 precise authority run after this optimization
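The conditional `contiguous()` call, together with the destination-passing (DPS) write-back for non-contiguous outputs mentioned in an earlier commit, might be sketched as follows. The helper names `as_contiguous` and `run_dps` are hypothetical; the code is duck-typed against the `is_contiguous()`, `contiguous()`, and `copy_()` methods that `torch.Tensor` provides, and `kernel` stands in for the cached TVM-FFI callable.

```python
def as_contiguous(t):
    """Return `t` itself when already contiguous, else a contiguous copy.

    Checking first avoids the dispatch cost of calling contiguous()
    unconditionally on the hot path.
    """
    return t if t.is_contiguous() else t.contiguous()


def run_dps(kernel, inputs, out):
    """Call `kernel(*inputs, out)` in destination-passing style.

    If `out` is non-contiguous, the kernel writes into a contiguous
    staging buffer and the result is copied back into `out`, so the
    caller's buffer always receives the result.
    """
    args = [as_contiguous(t) for t in inputs]
    if out.is_contiguous():
        kernel(*args, out)
    else:
        staging = out.contiguous()  # contiguous buffer, same shape and dtype
        kernel(*args, staging)
        out.copy_(staging)          # write the result back into the caller's buffer
    return out
```

In the common case where every tensor is already contiguous, this path performs no copies and no extra allocations.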

The exact PythonBuilder wrapper no longer needs to redo arch forcing, metadata checks, and cache lookup work once the module is built inside a worker process. This change pins the active build artifact after the first successful load so steady-state decode calls stay on the exact surface with less Python overhead.

Constraint: Exact-surface verification still depends on Modal runs, not local timing
Constraint: The wrapper must keep using runtime-compiled kernel.cu with explicit sm100a proof
Rejected: Leave _load_mod on the hot path for every invocation | representative Modal data showed the Python side was still dominating kernel time
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat worker-process steady-state behavior as the primary optimization surface once build identity is stable
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 13.938 us, B48 4c38f0e4 to 17.338 us, B64 ef2cf980 to 21.114 us
Not-tested: Full-54 precise authority run after this optimization
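Pinning the artifact after the first successful load amounts to taking the loader off the hot path. A minimal sketch of that idea follows; `PinnedKernel` and its `loader` argument are hypothetical, with `loader` standing in for the slow path the commit calls `_load_mod` (arch forcing, metadata checks, cache lookup).

```python
class PinnedKernel:
    """Run the slow loader once, then dispatch straight to the loaded callable."""

    def __init__(self, loader):
        self._loader = loader  # assumed slow path; returns the kernel callable
        self._fn = None        # pinned callable, set only after a successful load

    def __call__(self, *args):
        fn = self._fn
        if fn is None:
            # A failing loader raises before the assignment below, so a
            # broken build is never pinned and the next call retries.
            fn = self._loader()
            self._fn = fn
        return fn(*args)
```

After the first call, each invocation costs one attribute read and one `None` check before reaching the kernel, which is the kind of steady-state overhead reduction the timings above reflect.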

Fresh Modal evidence now shows the exact Python runtime-compile surface is floor-limited, while the onefile-dual plain-CUDA kernel family is the first historical candidate to beat the plain-CUDA baseline on a full54 authority run. This commit ports that candidate into the working branch by switching the active decode surface back to direct kernel.cu entry and installing the onefile-dual kernel body.

Constraint: .omx planning artifacts must stay out of code commits
Constraint: Current exact-surface branch evidence shows the Python runtime-compile surface is a hard blocker for the 1.2 us target
Rejected: Keep the exact Python runtime-compile lane as primary | fresh full54 authority stayed slower than plain-CUDA
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the new plain-CUDA baseline candidate; compare future ideas against its full54 authority result before keeping them
Tested: Modal representative 6-bucket run on workspace candidate (3/100/5)
Tested: Modal full54 authority on identical onefile-dual solution artifact (3/100/5) -> 13.177 us, PASSED 54/54
Not-tested: Fresh full54 authority rerun after porting this exact workspace copy

Fresh forced plain-CUDA full54 authority now shows the archived onefile kernel family consistently beating the dual-kernel workspace baseline when TVMFFIBuilder is pinned to sm100a. This narrows the active lane to the simpler single-kernel body while preserving the same submission surface and lets subsequent work optimize from the currently strongest valid baseline.

Constraint: Only code changes belong in this branch history; .omx artifacts stay uncommitted
Constraint: Valid decode comparisons are now limited to forced plain-CUDA sm100a and exact runtime-compile full54 authority
Rejected: Keep the dual-kernel 2d61d09 baseline as the active forced-sm100a lane | repeated full54 authority stayed slower than the onefile candidate
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Compare future forced-sm100a plain-CUDA ideas against this onefile baseline with full54 authority before promoting them
Tested: Forced plain-CUDA sm100a full54 authority on workspace-packed candidate (Modal ap-6hUwd4UdFxQUl3Ho0ins4V) -> 13.217 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on equivalent candidate JSON (Modal ap-9bA3DyKhFWySyjhtiodq8P) -> 13.823 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on previous baseline JSON (Modal ap-9il2CSKUc3RKyVxy2xoT1m) -> 14.180 us, PASSED 54/54
Not-tested: Exact runtime-compile full54 with this kernel body
Not-tested: NCU/proton profiling on the new forced-sm100a onefile baseline

The active decode lane is now only valid when plain-CUDA measurements run through TVMFFIBuilder with an explicit TVM_FFI_CUDA_ARCH_LIST=10.0a environment. This commit bakes that requirement into the default Modal benchmark harness so future authority runs use the same compile target without relying on ad hoc temporary scripts.

Constraint: Decode promotion evidence now requires forced-sm100a plain-CUDA or exact runtime-compile full54 runs
Constraint: .omx artifacts and unrelated copied worktree files must stay out of commits
Rejected: Keep the default runner on TORCH_CUDA_ARCH_LIST alone | fresh inspection proved TVMFFIBuilder fell back to compute_100/sm_100 without TVM_FFI_CUDA_ARCH_LIST
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat Modal full54 results gathered without TVM_FFI_CUDA_ARCH_LIST=10.0a as non-authoritative for the current decode lane
Tested: Build-ninja inspection without TVM_FFI_CUDA_ARCH_LIST (ap-GAL0f5N52Z1Mx1fYetsxXU) -> sm_100 only
Tested: Build-ninja inspection with TVM_FFI_CUDA_ARCH_LIST=10.0a (ap-alGDes2GFiamFTKYjLCm08) -> hard sm_100a proof
Tested: Modal representative rerun with committed runner env on e352b57 workspace (ap-6OYnqOzayksPX5hag34EPK) -> PASSED 3/3
Not-tested: Fresh committed-runner full54 authority after this exact harness commit
Not-tested: NCU/proton helper scripts still use temporary forced-sm100a overlays
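A build-ninja inspection of the kind referenced in the Tested lines could be approximated by scanning the generated build file for nvcc `-gencode` targets and failing closed when sm_100a is absent. This sketch assumes the flags appear in the standard nvcc form `-gencode=arch=compute_XX,code=sm_XX`; the function names are hypothetical and the real build file layout may differ.

```python
import re

# Matches nvcc gencode pairs such as "arch=compute_100a,code=sm_100a".
_GENCODE = re.compile(r"arch=compute_(\w+),code=(?:sm|compute)_(\w+)")


def gencode_targets(ninja_text):
    """Return the set of (virtual arch, code arch) pairs found in the text."""
    return set(_GENCODE.findall(ninja_text))


def require_sm100a(ninja_text):
    """Fail closed: raise unless an explicit compute_100a/sm_100a target exists."""
    targets = gencode_targets(ninja_text)
    if ("100a", "100a") not in targets:
        raise RuntimeError(
            f"no sm_100a gencode proof; found targets: {sorted(targets)}"
        )
    return targets
```

With this check, a build that silently fell back to compute_100/sm_100 raises instead of producing a result that looks authoritative.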

The current fastest submission-owned decode lane is the c721base minwrap surface, so this branch switches the active config back to the submit-decode style Python wrapper while keeping the proven onefile kernel body from the parent baseline. This preserves a branch-local lane for bear-safe evaluation without disturbing the faster plain-CUDA measurement branch.

Constraint: Only code files belong in this branch history; .omx artifacts and copied workspace clutter stay out
Constraint: Submission-owned sm100a forcing must happen inside the wrapper, not only in external Modal runner env
Rejected: Keep submit-safe candidate only as a /tmp solution artifact | hard to audit and easy to lose between sessions
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this branch as the submit-safe lane; compare future self-compile wrapper ideas against its full54 authority before promoting them
Tested: Existing temp full54 authority on the same c721base minwrap artifact (Modal ap-TPoYtWlwqbBKvxK0avOfOS) -> 13.901 us, PASSED 54/54
Not-tested: Fresh branch-backed full54 authority after this exact branch commit
Not-tested: 6-run statistical repeat still in progress under .omx/logs/submit-safe-c721base-3-100-6x6-20260419T062435Z

…o iter Bammuri#1

Iteration Bammuri#2 aimed for B5 (warp specialization + async memcpy + shared memory):
- Phase 1: B5 with cuda::pipeline → INCORRECT_NUMERICAL on all 54 workloads
- Phase 2: Corrected indexing → benchmark timeout (modal 240s+, inconclusive)
- Phase 3: Fallback B2 (double-buffering) → uninitialized prefetch risk

Decision: Suspend iteration Bammuri#2, revert to iter Bammuri#1 baseline (avg_latency=0.011415ms)
Lessons: Async pipeline/warp specialization more fragile than expected; prefetch rotation logic error-prone

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Signed-off-by: jaewooMaeng <jwmaeng@snu.ac.kr>

Bammuri and others added 30 commits April 15, 2026 14:48

ralph iter 0001 — avg_latency=0.012920ms

2916d3c

ralph iter 0002 — avg_latency=0.012920ms

5adb920

ralph iter 0003 — avg_latency=0.012920ms

6343808

ralph iter 0004 — avg_latency=0.012920ms

76dc603

ralph iter 0005 — avg_latency=0.012920ms

c8a2e56

ralph iter 0006 — avg_latency=0.012920ms

56d844c

ralph iter 0001 — avg_latency=0.012920ms

f2343ea

ralph iter 0002 — avg_latency=0.012920ms

c8afab6

ralph iter 0003 — avg_latency=0.012671ms

191506c

ralph iter 0004 — avg_latency=0.012671ms

8593df1

ralph-claude iter 0001 — avg_latency=0.011415ms

55c27da

ralph-claude iter 0002 — avg_latency=0.011415ms

d92429f

ralph-claude iter 0001 — avg_latency=0.011108ms

4250b14

ralph iter 0001 — avg_latency=0.015814ms

24861be

ralph iter 0002 — avg_latency=0.015814ms

1b73e64

ralph iter 0003 — avg_latency=0.015814ms

696c6c3

ralph iter 0004 — avg_latency=0.029683ms

a701981

ralph iter 0005 — avg_latency=0.025462ms

80e29dd

jaewooMaeng added 3 commits April 24, 2026 23:13

ralph iter 0006 — avg_latency=0.016806ms

1465dea

ralph iter 0007 — avg_latency=0.016806ms

0817649

25 morn

69f517b

Signed-off-by: jaewooMaeng <jwmaeng@snu.ac.kr>

jaewooMaeng wants to merge 33 commits into Bammuri:main from jaewooMaeng:ai-assisted-decode-needs-verify
