
Ai assisted decode needs verify #20

Open
jaewooMaeng wants to merge 33 commits into Bammuri:main from jaewooMaeng:ai-assisted-decode-needs-verify

Conversation

@jaewooMaeng

Collaborator

This is the latest version, but due to Modal issues I can't verify these commits.

Bammuri and others added 30 commits April 15, 2026 14:48
Keep the oracle-aligned decode control plane while preserving the first main-lane
candidate that cleared full-54 keep authority. This freezes the B=1-only shared
q/k staging path so later iterations can branch from a proven checkpoint.

Constraint: Submission path must remain self-contained and oracle-aligned while main-lane decode experiments continue
Rejected: Keep experimenting without freezing this state | too easy to lose the first full-gate win
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the current best main decode candidate until a later full-54 result clearly beats it
Tested: Quick pinned auto/native PASS; full-54 auto 0.014 ms PASS; full-54 native_only 0.014 ms PASS; architect approval
Not-tested: Full-54 rerun stability beyond the captured runs

The approved decode lane now treats PythonBuilder execution with runtime compilation of kernel.cu as the primary measured surface. This commit rewires config.toml to the Python entrypoint and adds the initial TVM-FFI wrapper so later exact-surface hardening can happen without changing the surface contract.

Constraint: Official builders do not support direct custom CUDA compile flags
Constraint: Submission path must remain self-contained and use destination-passing style
Rejected: Keep plain kernel.cu entry as primary | weaker control over runtime compile and arch forcing
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat decode_submit_entry.py plus kernel.cu as the exact measured decode surface; keep .omx artifacts out of code commits
Tested: scripts/pack_solution.py emits runtime language=python with entry decode_submit_entry.py::run and sources [kernel.cu, decode_submit_entry.py]
Not-tested: Runtime compilation/execution on GPU

The exact PythonBuilder surface now needs stronger runtime behavior than a one-shot temp build. This change forces the decode compile target to 10.0a, keys the loaded module by source digests so edited kernels rebuild cleanly, inspects the generated build directory for hard gencode proof, and keeps DPS outputs correct even when the provided buffers are non-contiguous.

Constraint: The primary decode surface must stay PythonBuilder -> runtime compile of kernel.cu
Constraint: sm_100a targeting must be explicit and auditable on the exact surface
Rejected: Keep temp build directories and a single global module | hides arch proof and makes source edits reuse stale modules
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat donor CuTe work as off-surface until it yields a concrete kernel.cu transplant; exact-surface evidence must come from decode_submit_entry.py + kernel.cu
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python, entry decode_submit_entry.py::run, DPS=true, sources [kernel.cu, decode_submit_entry.py]
Tested: helper probe forces TVM_FFI_CUDA_ARCH_LIST=10.0a and reports the exact gencode flag / CUDA_HOME resolution path
Not-tested: Successful runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
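
A minimal sketch of the digest-keyed module cache this commit describes, for readers without access to the diff. The names `_MODULES`, `source_digest`, `load_module`, and the `build` callback are hypothetical stand-ins, not the identifiers used in decode_submit_entry.py; `build` represents whatever performs the real runtime compile.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Sequence

# Hypothetical in-process cache: source digest -> loaded module object.
_MODULES: Dict[str, object] = {}


def source_digest(paths: Sequence[Path]) -> str:
    """Hash file names and contents so any source edit changes the key."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.name.encode())
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def load_module(paths: Sequence[Path], build: Callable[[str], object]) -> object:
    """Return the cached module for these sources, building on a key miss.

    `build` stands in for the real runtime-compile step (an assumption here);
    it receives the digest so it can use a digest-named build directory.
    """
    key = source_digest(paths)
    if key not in _MODULES:
        _MODULES[key] = build(key)
    return _MODULES[key]
```

Because the key depends on file contents, editing kernel.cu yields a new key and a fresh build instead of reusing a stale module, which is the behavior the "Rejected" trailer above argues for.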

The wrapper now restores the caller's arch env after probing, drops an unused artifact field, and upgrades sm_100a proof from logging-only to a hard invariant so the exact surface cannot silently continue with weaker codegen evidence.

Constraint: Exact decode results must come from the PythonBuilder runtime-compile surface, not a softer inferred build path
Constraint: sm_100a targeting has to be explicit and auditable on every serious exact-surface artifact
Rejected: Only log soft/missing proof and continue | too easy to accept ambiguous codegen on the measured path
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep exact-surface proof fail-closed unless there is a deliberate decision to widen accepted build evidence
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python with entry decode_submit_entry.py::run and DPS=true
Tested: helper probe restores TVM_FFI_CUDA_ARCH_LIST after temporary forcing and after failed runtime-compile attempts
Not-tested: Successful hard-proof runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
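
One way to get the "restore after probing and after failed compile attempts" behavior is a context manager with a `finally` block. This is a sketch under that assumption; `forced_arch` and `ARCH_ENV` are hypothetical names, and only the environment variable name and the `10.0a` value come from the commit text.

```python
import os
from contextlib import contextmanager
from typing import Iterator

ARCH_ENV = "TVM_FFI_CUDA_ARCH_LIST"  # variable name taken from the commit text


@contextmanager
def forced_arch(value: str = "10.0a") -> Iterator[None]:
    """Temporarily force the arch list, then restore the caller's value."""
    previous = os.environ.get(ARCH_ENV)  # None means the variable was unset
    os.environ[ARCH_ENV] = value
    try:
        yield
    finally:
        # Runs on success and on exceptions, e.g. a failed runtime compile.
        if previous is None:
            os.environ.pop(ARCH_ENV, None)
        else:
            os.environ[ARCH_ENV] = previous
```

Wrapping the compile call in `with forced_arch():` keeps the forced value scoped to the build, so a caller that had the variable unset sees it unset again afterwards.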

The exact PythonBuilder decode wrapper was spending steady-state time rereading kernel.cu and decode_submit_entry.py on every invocation just to derive the cache key. This change memoizes the source identity from file metadata plus digests, keeps the runtime-compile surface unchanged, and preserves the existing sm_100a proof path.

Constraint: The primary decode surface must remain PythonBuilder -> decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Modal-only verification is the current authority path
Rejected: Stop after proving the surface is slow | the hook requires continuing with fresh verification evidence
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future exact-surface optimization should first eliminate Python-side per-call overhead before touching kernel math
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 22.155 us from the previous ~111.929 us exact-surface result
Tested: Modal precise B48 4c38f0e4 improved to 26.834 us and B64 ef2cf980 improved to 30.187 us
Not-tested: Full-54 precise authority run after this optimization
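
The memoization described here can be sketched as a digest cache validated by `os.stat` metadata. This is an illustration, not the wrapper's code: `_DIGESTS` and `cached_digest` are hypothetical names, and using `(st_mtime_ns, st_size)` as the validity stamp is an assumption about what "file metadata" means in the commit.

```python
import hashlib
import os
from typing import Dict, Tuple

# path -> ((mtime_ns, size), sha256 hex digest)
_DIGESTS: Dict[str, Tuple[Tuple[int, int], str]] = {}


def cached_digest(path: str) -> str:
    """Return the file's SHA-256, rereading only when stat metadata changes."""
    st = os.stat(path)
    stamp = (st.st_mtime_ns, st.st_size)
    hit = _DIGESTS.get(path)
    if hit is not None and hit[0] == stamp:
        return hit[1]  # steady state: one stat() call, no file read
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    _DIGESTS[path] = (stamp, digest)
    return digest
```

The trade-off is that an edit which preserves both modification time and size would not be detected; for source files edited by hand that is unlikely, but it is a weaker guarantee than hashing on every call.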

After removing per-call file hashing, the next remaining exact-surface tax was still in Python dispatch. This change caches the loaded TVM-FFI kernel callable once per built artifact and only calls contiguous() when tensors are actually non-contiguous, reducing steady-state Python overhead without changing the exact surface contract.

Constraint: The exact decode surface must stay on decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Verification authority remains Modal-only under 3/100/5 style runs
Rejected: Reopen the torch-extension side path | representative Modal runs were slower and not a keep signal
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep optimizing Python-side overhead before revisiting deeper runtime surface changes
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 20.803 us, B48 4c38f0e4 to 25.040 us, B64 ef2cf980 to 29.938 us
Not-tested: Full-54 precise authority run after this optimization
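
The two dispatch savings in this commit can be sketched as follows. The helper names (`_KERNELS`, `get_kernel`, `as_contiguous`) and the `resolve` callback are hypothetical; the tensor helper is duck-typed so it assumes only the `is_contiguous()` and `contiguous()` methods that `torch.Tensor` provides.

```python
from typing import Any, Callable, Dict

# Hypothetical cache: artifact key -> kernel callable resolved from the module.
_KERNELS: Dict[str, Callable[..., Any]] = {}


def get_kernel(key: str, resolve: Callable[[], Callable[..., Any]]) -> Callable[..., Any]:
    """Resolve the kernel callable once per built artifact."""
    fn = _KERNELS.get(key)
    if fn is None:
        fn = _KERNELS[key] = resolve()
    return fn


def as_contiguous(t: Any) -> Any:
    """Copy only when the tensor is actually non-contiguous.

    Works with any object exposing is_contiguous()/contiguous(), such as a
    torch.Tensor.
    """
    return t if t.is_contiguous() else t.contiguous()
```

For destination-passing outputs, a contiguous copy is a different buffer from the one the caller supplied, so a wrapper taking this route would also need to copy the result back into the original output tensor.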

The exact PythonBuilder wrapper no longer needs to redo arch forcing, metadata checks, and cache lookup work once the module is built inside a worker process. This change pins the active build artifact after the first successful load so steady-state decode calls stay on the exact surface with less Python overhead.

Constraint: Exact-surface verification still depends on Modal runs, not local timing
Constraint: The wrapper must keep using runtime-compiled kernel.cu with explicit sm100a proof
Rejected: Leave _load_mod on the hot path for every invocation | representative Modal data showed the Python side was still dominating kernel time
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat worker-process steady-state behavior as the primary optimization surface once build identity is stable
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 13.938 us, B48 4c38f0e4 to 17.338 us, B64 ef2cf980 to 21.114 us
Not-tested: Full-54 precise authority run after this optimization
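
Pinning can be as small as one module-level variable checked before anything else. The sketch below assumes that shape; `_PINNED` and the `load` parameter are hypothetical, with `load` standing in for the `_load_mod` work that the "Rejected" trailer says used to run on every invocation.

```python
from typing import Any, Callable, Optional

_PINNED: Optional[Callable[..., Any]] = None  # set after the first successful load


def run(load: Callable[[], Callable[..., Any]], *args: Any) -> Any:
    """Call the kernel, doing the expensive load work only on the first call.

    `load` stands in for arch forcing, metadata checks, cache lookup and
    compilation (a hypothetical signature for illustration).
    """
    global _PINNED
    kernel = _PINNED
    if kernel is None:
        kernel = load()   # slow path: if this raises, nothing gets pinned
        _PINNED = kernel  # pin only after a successful load
    return kernel(*args)
```

The cost of this design is that source edits made while a worker process is alive are no longer picked up in that process, which matches the commit's stated assumption that build identity is stable by the time steady-state calls begin.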

Fresh Modal evidence now shows the exact Python runtime-compile surface is floor-limited, while the onefile-dual plain-CUDA kernel family is the first historical candidate to beat the plain-CUDA baseline on a full54 authority run. This commit ports that candidate into the working branch by switching the active decode surface back to direct kernel.cu entry and installing the onefile-dual kernel body.

Constraint: .omx planning artifacts must stay out of code commits
Constraint: Current exact-surface branch evidence shows the Python runtime-compile surface is a hard blocker for the 1.2 us target
Rejected: Keep the exact Python runtime-compile lane as primary | fresh full54 authority stayed slower than plain-CUDA
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the new plain-CUDA baseline candidate; compare future ideas against its full54 authority result before keeping them
Tested: Modal representative 6-bucket run on workspace candidate (3/100/5)
Tested: Modal full54 authority on identical onefile-dual solution artifact (3/100/5) -> 13.177 us, PASSED 54/54
Not-tested: Fresh full54 authority rerun after porting this exact workspace copy

Fresh forced plain-CUDA full54 authority now shows the archived onefile kernel family consistently beating the dual-kernel workspace baseline when TVMFFIBuilder is pinned to sm100a. This narrows the active lane to the simpler single-kernel body while preserving the same submission surface and lets subsequent work optimize from the currently strongest valid baseline.

Constraint: Only code changes belong in this branch history; .omx artifacts stay uncommitted
Constraint: Valid decode comparisons are now limited to forced plain-CUDA sm100a and exact runtime-compile full54 authority
Rejected: Keep the dual-kernel 2d61d09 baseline as the active forced-sm100a lane | repeated full54 authority stayed slower than the onefile candidate
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Compare future forced-sm100a plain-CUDA ideas against this onefile baseline with full54 authority before promoting them
Tested: Forced plain-CUDA sm100a full54 authority on workspace-packed candidate (Modal ap-6hUwd4UdFxQUl3Ho0ins4V) -> 13.217 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on equivalent candidate JSON (Modal ap-9bA3DyKhFWySyjhtiodq8P) -> 13.823 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on previous baseline JSON (Modal ap-9il2CSKUc3RKyVxy2xoT1m) -> 14.180 us, PASSED 54/54
Not-tested: Exact runtime-compile full54 with this kernel body
Not-tested: NCU/proton profiling on the new forced-sm100a onefile baseline

The active decode lane is now only valid when plain-CUDA measurements run through TVMFFIBuilder with an explicit TVM_FFI_CUDA_ARCH_LIST=10.0a environment. This commit bakes that requirement into the default Modal benchmark harness so future authority runs use the same compile target without relying on ad hoc temporary scripts.

Constraint: Decode promotion evidence now requires forced-sm100a plain-CUDA or exact runtime-compile full54 runs
Constraint: .omx artifacts and unrelated copied worktree files must stay out of commits
Rejected: Keep the default runner on TORCH_CUDA_ARCH_LIST alone | fresh inspection proved TVMFFIBuilder fell back to compute_100/sm_100 without TVM_FFI_CUDA_ARCH_LIST
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat Modal full54 results gathered without TVM_FFI_CUDA_ARCH_LIST=10.0a as non-authoritative for the current decode lane
Tested: Build-ninja inspection without TVM_FFI_CUDA_ARCH_LIST (ap-GAL0f5N52Z1Mx1fYetsxXU) -> sm_100 only
Tested: Build-ninja inspection with TVM_FFI_CUDA_ARCH_LIST=10.0a (ap-alGDes2GFiamFTKYjLCm08) -> hard sm_100a proof
Tested: Modal representative rerun with committed runner env on e352b57 workspace (ap-6OYnqOzayksPX5hag34EPK) -> PASSED 3/3
Not-tested: Fresh committed-runner full54 authority after this exact harness commit
Not-tested: NCU/proton helper scripts still use temporary forced-sm100a overlays
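
The build-ninja inspections listed above can be turned into a fail-closed check along these lines. This is a sketch with assumptions stated plainly: the file name `build.ninja`, the `-gencode arch=compute_X,code=sm_X` flag layout (standard nvcc syntax), and the helper name `require_sm100a` are not confirmed by this PR.

```python
import re
from pathlib import Path

# Matches e.g. "-gencode=arch=compute_100a,code=sm_100a" (either "=" or a
# space after -gencode) and captures the compute and sm suffixes.
_GENCODE = re.compile(r"-gencode[= ]arch=compute_(\w+),code=\[?sm_(\w+)\b")


def require_sm100a(build_dir: Path) -> None:
    """Fail closed unless build.ninja carries an sm_100a gencode flag."""
    ninja = build_dir / "build.ninja"  # assumed file name
    if not ninja.is_file():
        raise RuntimeError(f"no build.ninja in {build_dir}: arch proof missing")
    targets = {sm for _, sm in _GENCODE.findall(ninja.read_text())}
    if "100a" not in targets:
        found = sorted(targets) or "no gencode flags"
        raise RuntimeError(f"expected sm_100a, found {found}")
```

A build that fell back to compute_100/sm_100, as in the first inspection above, would raise here rather than produce a result that looks authoritative.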

The current fastest submission-owned decode lane is the c721base minwrap surface, so this branch switches the active config back to the submit-decode style Python wrapper while keeping the proven onefile kernel body from the parent baseline. This preserves a branch-local lane for bear-safe evaluation without disturbing the faster plain-CUDA measurement branch.

Constraint: Only code files belong in this branch history; .omx artifacts and copied workspace clutter stay out
Constraint: Submission-owned sm100a forcing must happen inside the wrapper, not only in external Modal runner env
Rejected: Keep submit-safe candidate only as a /tmp solution artifact | hard to audit and easy to lose between sessions
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this branch as the submit-safe lane; compare future self-compile wrapper ideas against its full54 authority before promoting them
Tested: Existing temp full54 authority on the same c721base minwrap artifact (Modal ap-TPoYtWlwqbBKvxK0avOfOS) -> 13.901 us, PASSED 54/54
Not-tested: Fresh branch-backed full54 authority after this exact branch commit
Not-tested: 6-run statistical repeat still in progress under .omx/logs/submit-safe-c721base-3-100-6x6-20260419T062435Z

…o iter #1

Iteration #2 aimed for B5 (warp specialization + async memcpy + shared memory):
- Phase 1: B5 with cuda::pipeline → INCORRECT_NUMERICAL on all 54 workloads
- Phase 2: Corrected indexing → benchmark timeout (modal 240s+, inconclusive)
- Phase 3: Fallback B2 (double-buffering) → uninitialized prefetch risk

Decision: Suspend iteration #2, revert to iter #1 baseline (avg_latency=0.011415ms)
Lessons: Async pipeline/warp specialization more fragile than expected; prefetch rotation logic error-prone

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
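
The "uninitialized prefetch risk" noted for the double-buffering fallback can be illustrated with a host-side model of the slot rotation. This is plain Python, not the CUDA kernel from this PR, and it is synchronous, so it leaves out the wait that an asynchronous copy would need before a slot is consumed; `double_buffered_sum` is a hypothetical name.

```python
from typing import List, Optional, Sequence


def double_buffered_sum(tiles: Sequence[Sequence[float]]) -> List[float]:
    """Host-side model of a two-slot prefetch loop (not CUDA code).

    Slot `i % 2` is consumed while slot `(i + 1) % 2` is filled. The prologue
    load of tile 0 is the step whose omission leaves the first read
    uninitialized.
    """
    if not tiles:
        return []
    slots: List[Optional[List[float]]] = [None, None]
    slots[0] = list(tiles[0])            # prologue: fill slot 0 before the loop
    out: List[float] = []
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):           # guard: never prefetch past the last tile
            slots[nxt] = list(tiles[i + 1])
        current = slots[cur]
        assert current is not None, "consumed a slot that was never filled"
        out.append(sum(current))
    return out
```

Removing the prologue line makes the first iteration trip the assertion, which is the host-side analogue of a kernel reading shared memory that no prefetch has written yet.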

2 participants