Ai assisted decode by jaewooMaeng · Pull Request #19 · Bammuri/mlsys26

jaewooMaeng · 2026-04-24T05:03:08Z

6 runs over 54 workloads (warmup_runs=3, iterations=100, num_trials=5):

avg latency: 11.735 µs (0.011735 ms)
avg latency: 11.360 µs (0.011360 ms)
avg latency: 11.095 µs (0.011095 ms)
avg latency: 15.899 µs (0.015899 ms)
avg latency: 15.648 µs (0.015648 ms)
avg latency: 15.953 µs (0.015953 ms)

Mean: 13.615 µs
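The six run averages fall into two groups (three near 11.4 µs and three near 15.8 µs), which a single mean does not show. A minimal sketch that recomputes the reported mean and the per-group means; the 13 µs split point is an assumption chosen only for illustration and is not part of the benchmark.

```python
from statistics import mean

# Per-run average latencies in µs, copied from the six runs listed above.
runs_us = [11.735, 11.360, 11.095, 15.899, 15.648, 15.953]

overall = mean(runs_us)

# Hypothetical split point that separates the two visible groups.
THRESHOLD_US = 13.0
fast = [r for r in runs_us if r < THRESHOLD_US]
slow = [r for r in runs_us if r >= THRESHOLD_US]

print(f"overall: {overall:.3f} us over {len(runs_us)} runs")
print(f"fast group: {mean(fast):.3f} us over {len(fast)} runs")
print(f"slow group: {mean(slow):.3f} us over {len(slow)} runs")
```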

This will be force-pushed to the submit-decode branch.

Keep the oracle-aligned decode control plane while preserving the first main-lane candidate that cleared full-54 keep authority. This freezes the B=1-only shared q/k staging path so later iterations can branch from a proven checkpoint.

Constraint: Submission path must remain self-contained and oracle-aligned while main-lane decode experiments continue
Rejected: Keep experimenting without freezing this state | too easy to lose the first full-gate win
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the current best main decode candidate until a later full-54 result clearly beats it
Tested: Quick pinned auto/native PASS; full-54 auto 0.014 ms PASS; full-54 native_only 0.014 ms PASS; architect approval
Not-tested: Full-54 rerun stability beyond the captured runs

The approved decode lane now treats PythonBuilder execution with runtime compilation of kernel.cu as the primary measured surface. This commit rewires config.toml to the Python entrypoint and adds the initial TVM-FFI wrapper so later exact-surface hardening can happen without changing the surface contract.

Constraint: Official builders do not support direct custom CUDA compile flags
Constraint: Submission path must remain self-contained and use destination-passing style
Rejected: Keep plain kernel.cu entry as primary | weaker control over runtime compile and arch forcing
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat decode_submit_entry.py plus kernel.cu as the exact measured decode surface; keep .omx artifacts out of code commits
Tested: scripts/pack_solution.py emits runtime language=python with entry decode_submit_entry.py::run and sources [kernel.cu, decode_submit_entry.py]
Not-tested: Runtime compilation/execution on GPU

The exact PythonBuilder surface now needs stronger runtime behavior than a one-shot temp build. This change forces the decode compile target to 10.0a, keys the loaded module by source digests so edited kernels rebuild cleanly, inspects the generated build directory for hard gencode proof, and keeps DPS outputs correct even when the provided buffers are non-contiguous.

Constraint: The primary decode surface must stay PythonBuilder -> runtime compile of kernel.cu
Constraint: sm_100a targeting must be explicit and auditable on the exact surface
Rejected: Keep temp build directories and a single global module | hides arch proof and makes source edits reuse stale modules
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat donor CuTe work as off-surface until it yields a concrete kernel.cu transplant; exact-surface evidence must come from decode_submit_entry.py + kernel.cu
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python, entry decode_submit_entry.py::run, DPS=true, sources [kernel.cu, decode_submit_entry.py]
Tested: helper probe forces TVM_FFI_CUDA_ARCH_LIST=10.0a and reports the exact gencode flag / CUDA_HOME resolution path
Not-tested: Successful runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
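Keying the loaded module by source digests can be sketched as follows. This is an illustration only, not the wrapper's actual code: `_MODULES`, `source_digest`, `load_module`, and the `build` callback are hypothetical names, and `build` stands in for whatever runtime-compile step the wrapper really uses.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable

# Hypothetical cache: digest key -> loaded module object.
_MODULES: Dict[str, object] = {}

def source_digest(paths: Iterable[Path]) -> str:
    """Hash file names, lengths, and contents so any edit changes the key."""
    h = hashlib.sha256()
    for p in sorted(paths):
        data = p.read_bytes()
        h.update(p.name.encode())
        h.update(b"\0")
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()

def load_module(paths: Iterable[Path], build: Callable[[str], object]) -> object:
    """Return the cached module for these sources, building on a miss."""
    key = source_digest(paths)
    if key not in _MODULES:
        # `build` is a placeholder for the real compile-and-load step.
        _MODULES[key] = build(key)
    return _MODULES[key]
```

Because the key depends on file contents, an edited kernel.cu maps to a new cache entry rather than reusing a stale module.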

The wrapper now restores the caller's arch env after probing, drops an unused artifact field, and upgrades sm_100a proof from logging-only to a hard invariant so the exact surface cannot silently continue with weaker codegen evidence.

Constraint: Exact decode results must come from the PythonBuilder runtime-compile surface, not a softer inferred build path
Constraint: sm_100a targeting has to be explicit and auditable on every serious exact-surface artifact
Rejected: Only log soft/missing proof and continue | too easy to accept ambiguous codegen on the measured path
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep exact-surface proof fail-closed unless there is a deliberate decision to widen accepted build evidence
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python with entry decode_submit_entry.py::run and DPS=true
Tested: helper probe restores TVM_FFI_CUDA_ARCH_LIST after temporary forcing and after failed runtime-compile attempts
Not-tested: Successful hard-proof runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
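Restoring the caller's arch environment after temporary forcing, including after a failed compile, is the kind of thing a context manager handles well. A minimal sketch, assuming the variable name TVM_FFI_CUDA_ARCH_LIST from the commit message; `forced_arch` is a hypothetical helper name, not the wrapper's real function.

```python
import os
from contextlib import contextmanager

ARCH_ENV = "TVM_FFI_CUDA_ARCH_LIST"

@contextmanager
def forced_arch(value: str = "10.0a"):
    """Temporarily set the arch env var, then restore the caller's state."""
    previous = os.environ.get(ARCH_ENV)  # None means the variable was unset
    os.environ[ARCH_ENV] = value
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises (e.g. a failed compile).
        if previous is None:
            os.environ.pop(ARCH_ENV, None)
        else:
            os.environ[ARCH_ENV] = previous
```

Usage would look like `with forced_arch(): compile_kernel()`, where `compile_kernel` is again a placeholder for the real build call.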

The exact PythonBuilder decode wrapper was spending steady-state time rereading kernel.cu and decode_submit_entry.py on every invocation just to derive the cache key. This change memoizes the source identity from file metadata plus digests, keeps the runtime-compile surface unchanged, and preserves the existing sm100a proof path.

Constraint: The primary decode surface must remain PythonBuilder -> decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Modal-only verification is the current authority path
Rejected: Stop after proving the surface is slow | the hook requires continuing with fresh verification evidence
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future exact-surface optimization should first eliminate Python-side per-call overhead before touching kernel math
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 22.155 us from the previous ~111.929 us exact-surface result
Tested: Modal precise B48 4c38f0e4 improved to 26.834 us and B64 ef2cf980 improved to 30.187 us
Not-tested: Full-54 precise authority run after this optimization
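Memoizing source identity from file metadata plus digests might look like the sketch below: a cheap `os.stat` call decides whether the expensive read-and-hash step can be skipped. The names `_DIGESTS` and `cached_digest` are hypothetical, and the choice of (mtime_ns, size) as the metadata key is an assumption; the commit does not say which fields it uses.

```python
import hashlib
import os
from typing import Dict, Tuple

# path -> ((mtime_ns, size), sha256 hex digest)
_DIGESTS: Dict[str, Tuple[Tuple[int, int], str]] = {}

def cached_digest(path: str) -> str:
    """Return the file's digest, rehashing only when its metadata changed."""
    st = os.stat(path)
    meta = (st.st_mtime_ns, st.st_size)
    hit = _DIGESTS.get(path)
    if hit is not None and hit[0] == meta:
        return hit[1]  # steady state: one stat call, no file read
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    _DIGESTS[path] = (meta, digest)
    return digest
```

One trade-off of this approach: an edit that leaves both the modification time and the size unchanged would not be detected.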

After removing per-call file hashing, the next remaining exact-surface tax was still in Python dispatch. This change caches the loaded TVM-FFI kernel callable once per built artifact and only calls contiguous() when tensors are actually non-contiguous, reducing steady-state Python overhead without changing the exact surface contract.

Constraint: The exact decode surface must stay on decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Verification authority remains Modal-only under 3/100/5 style runs
Rejected: Reopen the torch-extension side path | representative Modal runs were slower and not a keep signal
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep optimizing Python-side overhead before revisiting deeper runtime surface changes
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 20.803 us, B48 4c38f0e4 to 25.040 us, B64 ef2cf980 to 29.938 us
Not-tested: Full-54 precise authority run after this optimization
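The "only call contiguous() when needed" guard can be shown without a GPU or a tensor library. In this sketch `FakeTensor` is a stand-in that models only the `is_contiguous()` and `contiguous()` methods that PyTorch tensors expose; `as_contiguous` is a hypothetical helper name, and the real wrapper may structure the check differently.

```python
class FakeTensor:
    """Stand-in for a tensor; models only the two methods used below."""

    def __init__(self, contiguous: bool):
        self._contiguous = contiguous
        self.contiguous_calls = 0  # instrumentation for the example only

    def is_contiguous(self) -> bool:
        return self._contiguous

    def contiguous(self) -> "FakeTensor":
        self.contiguous_calls += 1
        return FakeTensor(True)

def as_contiguous(t):
    """Skip the contiguous() call when the tensor is already contiguous."""
    return t if t.is_contiguous() else t.contiguous()

dense = FakeTensor(contiguous=True)
strided = FakeTensor(contiguous=False)
same = as_contiguous(dense)      # returned as-is, no extra call
packed = as_contiguous(strided)  # one contiguous() call
```

On a microsecond-scale decode path the saving comes from avoiding an extra Python-level method call per tensor, not from avoiding a copy.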

The exact PythonBuilder wrapper no longer needs to redo arch forcing, metadata checks, and cache lookup work once the module is built inside a worker process. This change pins the active build artifact after the first successful load so steady-state decode calls stay on the exact surface with less Python overhead.

Constraint: Exact-surface verification still depends on Modal runs, not local timing
Constraint: The wrapper must keep using runtime-compiled kernel.cu with explicit sm100a proof
Rejected: Leave _load_mod on the hot path for every invocation | representative Modal data showed the Python side was still dominating kernel time
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat worker-process steady-state behavior as the primary optimization surface once build identity is stable
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 13.938 us, B48 4c38f0e4 to 17.338 us, B64 ef2cf980 to 21.114 us
Not-tested: Full-54 precise authority run after this optimization

Fresh Modal evidence now shows the exact Python runtime-compile surface is floor-limited, while the onefile-dual plain-CUDA kernel family is the first historical candidate to beat the plain-CUDA baseline on a full54 authority run. This commit ports that candidate into the working branch by switching the active decode surface back to direct kernel.cu entry and installing the onefile-dual kernel body.

Constraint: .omx planning artifacts must stay out of code commits
Constraint: Current exact-surface branch evidence shows the Python runtime-compile surface is a hard blocker for the 1.2 us target
Rejected: Keep the exact Python runtime-compile lane as primary | fresh full54 authority stayed slower than plain-CUDA
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the new plain-CUDA baseline candidate; compare future ideas against its full54 authority result before keeping them
Tested: Modal representative 6-bucket run on workspace candidate (3/100/5)
Tested: Modal full54 authority on identical onefile-dual solution artifact (3/100/5) -> 13.177 us, PASSED 54/54
Not-tested: Fresh full54 authority rerun after porting this exact workspace copy

Fresh forced plain-CUDA full54 authority now shows the archived onefile kernel family consistently beating the dual-kernel workspace baseline when TVMFFIBuilder is pinned to sm100a. This narrows the active lane to the simpler single-kernel body while preserving the same submission surface and lets subsequent work optimize from the currently strongest valid baseline.

Constraint: Only code changes belong in this branch history; .omx artifacts stay uncommitted
Constraint: Valid decode comparisons are now limited to forced plain-CUDA sm100a and exact runtime-compile full54 authority
Rejected: Keep the dual-kernel 2d61d09 baseline as the active forced-sm100a lane | repeated full54 authority stayed slower than the onefile candidate
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Compare future forced-sm100a plain-CUDA ideas against this onefile baseline with full54 authority before promoting them
Tested: Forced plain-CUDA sm100a full54 authority on workspace-packed candidate (Modal ap-6hUwd4UdFxQUl3Ho0ins4V) -> 13.217 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on equivalent candidate JSON (Modal ap-9bA3DyKhFWySyjhtiodq8P) -> 13.823 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on previous baseline JSON (Modal ap-9il2CSKUc3RKyVxy2xoT1m) -> 14.180 us, PASSED 54/54
Not-tested: Exact runtime-compile full54 with this kernel body
Not-tested: NCU/proton profiling on the new forced-sm100a onefile baseline

The active decode lane is now only valid when plain-CUDA measurements run through TVMFFIBuilder with an explicit TVM_FFI_CUDA_ARCH_LIST=10.0a environment. This commit bakes that requirement into the default Modal benchmark harness so future authority runs use the same compile target without relying on ad hoc temporary scripts.

Constraint: Decode promotion evidence now requires forced-sm100a plain-CUDA or exact runtime-compile full54 runs
Constraint: .omx artifacts and unrelated copied worktree files must stay out of commits
Rejected: Keep the default runner on TORCH_CUDA_ARCH_LIST alone | fresh inspection proved TVMFFIBuilder fell back to compute_100/sm_100 without TVM_FFI_CUDA_ARCH_LIST
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat Modal full54 results gathered without TVM_FFI_CUDA_ARCH_LIST=10.0a as non-authoritative for the current decode lane
Tested: Build-ninja inspection without TVM_FFI_CUDA_ARCH_LIST (ap-GAL0f5N52Z1Mx1fYetsxXU) -> sm_100 only
Tested: Build-ninja inspection with TVM_FFI_CUDA_ARCH_LIST=10.0a (ap-alGDes2GFiamFTKYjLCm08) -> hard sm_100a proof
Tested: Modal representative rerun with committed runner env on e352b57 workspace (ap-6OYnqOzayksPX5hag34EPK) -> PASSED 3/3
Not-tested: Fresh committed-runner full54 authority after this exact harness commit
Not-tested: NCU/proton helper scripts still use temporary forced-sm100a overlays
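The build-ninja inspection described in the Tested lines can be approximated by scanning the generated build text for nvcc `-gencode` flags and failing closed when anything other than 100a appears. This is a sketch, not the harness's real check: `require_sm100a` is a hypothetical name, and the exact flag spelling that TVMFFIBuilder writes is an assumption (the bracketed list form `code=[...]` is not handled here).

```python
import re

# Captures the arch and code suffixes of each nvcc -gencode flag, e.g.
# "-gencode=arch=compute_100a,code=sm_100a" -> ("100a", "100a").
_GENCODE = re.compile(
    r"-gencode[= ]arch=compute_(\w+),code=(?:sm|compute)_(\w+)"
)

def require_sm100a(build_text: str) -> None:
    """Raise unless every gencode flag found targets 100a (fail-closed)."""
    targets = _GENCODE.findall(build_text)
    if not targets:
        raise RuntimeError("no -gencode flag found; cannot prove target arch")
    bad = [t for t in targets if t != ("100a", "100a")]
    if bad:
        raise RuntimeError(f"non-sm_100a gencode targets: {bad}")
```

Raising on missing flags, rather than logging and continuing, matches the fail-closed directive in the earlier hard-invariant commit.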

The current fastest submission-owned decode lane is the c721base minwrap surface, so this branch switches the active config back to the submit-decode style Python wrapper while keeping the proven onefile kernel body from the parent baseline. This preserves a branch-local lane for bear-safe evaluation without disturbing the faster plain-CUDA measurement branch.

Constraint: Only code files belong in this branch history; .omx artifacts and copied workspace clutter stay out
Constraint: Submission-owned sm100a forcing must happen inside the wrapper, not only in external Modal runner env
Rejected: Keep submit-safe candidate only as a /tmp solution artifact | hard to audit and easy to lose between sessions
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this branch as the submit-safe lane; compare future self-compile wrapper ideas against its full54 authority before promoting them
Tested: Existing temp full54 authority on the same c721base minwrap artifact (Modal ap-TPoYtWlwqbBKvxK0avOfOS) -> 13.901 us, PASSED 54/54
Not-tested: Fresh branch-backed full54 authority after this exact branch commit
Not-tested: 6-run statistical repeat still in progress under .omx/logs/submit-safe-c721base-3-100-6x6-20260419T062435Z

…o iter Bammuri#1

Iteration Bammuri#2 aimed for B5 (warp specialization + async memcpy + shared memory):
- Phase 1: B5 with cuda::pipeline → INCORRECT_NUMERICAL on all 54 workloads
- Phase 2: Corrected indexing → benchmark timeout (modal 240s+, inconclusive)
- Phase 3: Fallback B2 (double-buffering) → uninitialized prefetch risk

Decision: Suspend iteration Bammuri#2, revert to iter Bammuri#1 baseline (avg_latency=0.011415ms)
Lessons: Async pipeline/warp specialization more fragile than expected; prefetch rotation logic error-prone

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

jaewooMaeng · 2026-04-24T13:21:24Z

=== NCU Duration per workload ===
UUID                                  axes           duration_us  fallback
901e5104-dccb-4c3f-ae13-ef4d31a4d456  batch_size=1   7.010        True
a5714b69-525c-4b95-bb7a-a0f9770c2f48  batch_size=1   7.170        True
3daa0974-293c-4414-b3c2-1f04368c1189  batch_size=1   6.940        True
22d5cef5-4f30-4f43-9d5f-0e9e95dc2201  batch_size=1   7.230        True
aed4bdd4-3139-4a1b-ae2f-aab8d4ba4090  batch_size=1   7.360        True
49a125e5-edf0-492e-a8b3-3676d14adaa3  batch_size=1   7.230        True
798635cf-d424-4343-a959-c96b0c0e81fb  batch_size=1   7.070        True
5716e24a-3f55-411a-bcd8-e6b677b1ca7e  batch_size=1   7.260        True
4c7df22f-70ef-4494-864f-6f10209ab0f3  batch_size=1   7.550        True
d0e91dea-aa1b-46c8-a67c-b2814f5a1725  batch_size=1   7.230        True
ec9d2340-6d13-40e4-a6fe-4483a1cacd0d  batch_size=4   8.100        True
9f238670-9a56-4ab9-94f9-555755f32205  batch_size=4   8.990        True
6d929bff-d051-4e1d-acbf-7d0cc13f6dc8  batch_size=4   8.160        True
494844dc-80e9-41e9-9fe2-3d2618fdef64  batch_size=4   8.190        True
26f760ab-c286-41bf-8c37-3bc5df4c98fc  batch_size=4   8.100        True
deecf42e-dfae-4061-936f-0af4d892c231  batch_size=4   8.100        True
9a92acbc-6104-48c6-8176-601feee30001  batch_size=4   9.310        True
fa90b213-3ad2-4a42-9173-97fbf9e2e809  batch_size=4   8.030        True
2640f1e3-4b02-4041-bdc1-59a28e0b9954  batch_size=8   9.090        True
7464a1d6-b0dd-4061-b15a-ebdabf47e351  batch_size=8   8.930        True
25d1f606-3e9e-49e6-a3f0-991092c7a845  batch_size=8   8.930        True
fe2e6584-17d0-4025-ad44-db9e507f6bed  batch_size=8   9.120        True
ca62112d-68e8-4eb7-a318-26c48d256e10  batch_size=8   9.180        True
61dfd334-971b-4a44-a8a8-f00950192745  batch_size=8   9.090        True
abec5d32-2409-4412-8683-472f3a091a54  batch_size=8   9.090        True
76eec66d-0ada-4c35-bf21-a07247ad7f05  batch_size=16  10.530       True
f0507025-1db0-48de-b3d6-4faad5632558  batch_size=16  10.530       True
d1508f6d-81b7-4caf-947b-b48f612a3061  batch_size=16  10.690       True
63d4cb66-d6f6-4ceb-8083-0c0fb4abece9  batch_size=16  10.690       True
ef6bb01d-1294-4a06-97b2-604ec944b4c0  batch_size=16  10.780       True
d0acf04a-5919-4bf3-95ca-94fbef5786c0  batch_size=16  11.040       True
ac83f1e4-982e-41a5-8c02-7ccc14c728d4  batch_size=16  10.530       True
a8c8beff-e414-4580-b2f7-e5b8f13bc269  batch_size=32  14.940       True
40940a4c-1348-416d-aea1-f3183548229b  batch_size=32  15.070       True
0915952b-5887-43aa-9807-a0980e7a78f6  batch_size=32  15.420       True
d38dde79-79f9-4c2d-8bbe-05ee1be48583  batch_size=32  15.330       True
388f79b1-13e9-44e6-b12e-c8197989a924  batch_size=32  14.880       True
80288532-6967-4b8e-b5bc-c31ce0a208d5  batch_size=32  14.780       True
773a4157-01fa-4637-90be-8604e0b40526  batch_size=32  14.850       True
53385c7f-393d-41db-aec8-5b9eb5bf35d1  batch_size=48  21.980       True
45e7697a-bae5-42bd-9f56-959c10ff681f  batch_size=48  22.370       True
34403e5b-8551-4d44-95b0-21532c4eb839  batch_size=48  23.940       True
b91cc4aa-048d-4e05-9c0c-8a153ec997f8  batch_size=48  20.700       True
4c38f0e4-aabf-4742-90b9-24231effd96f  batch_size=48  22.270       True
be91d78c-1114-4563-b0aa-e0682ccc09bb  batch_size=48  21.860       True
3803367a-f140-432a-9784-43fe5d5f64d0  batch_size=48  21.660       True
eaf0a285-447c-4432-8e68-d287acc3cb08  batch_size=64  31.650       True
f4fd2171-c869-440c-b199-f403e3c6b788  batch_size=64  31.100       True
b1931bc6-b8b6-44da-a83b-aa6cc4d4c275  batch_size=64  32.610       True
8d0c5789-550a-4216-a471-202b2655a4e8  batch_size=64  32.380       True
9eef9bc7-5972-437e-b488-853c6dacb470  batch_size=64  30.750       True
58571e49-29ee-4a53-bf1d-a7a2363e9db3  batch_size=64  31.940       True
ef2cf980-6977-49d1-a6f8-7247becf0273  batch_size=64  31.900       True
42963acb-f2f5-4ada-9205-3931cd26fa44  batch_size=64  31.520       True

Arithmetic mean over 54/54 workloads: 14.651 us
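The per-workload durations listed above can be grouped by batch size to see how much each group contributes to the unweighted 54-workload mean. A minimal sketch with the duration_us values copied from the NCU output; the grouping and variable names are illustrative only.

```python
from statistics import mean

# duration_us values from the NCU output above, grouped by batch_size.
durations_us = {
    1: [7.010, 7.170, 6.940, 7.230, 7.360, 7.230, 7.070, 7.260, 7.550, 7.230],
    4: [8.100, 8.990, 8.160, 8.190, 8.100, 8.100, 9.310, 8.030],
    8: [9.090, 8.930, 8.930, 9.120, 9.180, 9.090, 9.090],
    16: [10.530, 10.530, 10.690, 10.690, 10.780, 11.040, 10.530],
    32: [14.940, 15.070, 15.420, 15.330, 14.880, 14.780, 14.850],
    48: [21.980, 22.370, 23.940, 20.700, 22.270, 21.860, 21.660],
    64: [31.650, 31.100, 32.610, 32.380, 30.750, 31.940, 31.900, 31.520],
}

all_values = [d for group in durations_us.values() for d in group]
overall = mean(all_values)

for batch, group in durations_us.items():
    share = sum(group) / sum(all_values)  # fraction of the total duration
    print(f"batch_size={batch:<2} n={len(group)} "
          f"mean={mean(group):.3f} us share={share:.1%}")
print(f"overall n={len(all_values)} mean={overall:.3f} us")
```

The batch_size=48 and batch_size=64 groups together account for roughly half of the summed duration, so improvements at large batch sizes move the overall mean the most.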

Bammuri and others added 24 commits April 15, 2026 14:48

ralph iter 0001 — avg_latency=0.012920ms

2916d3c

ralph iter 0002 — avg_latency=0.012920ms

5adb920

ralph iter 0003 — avg_latency=0.012920ms

6343808

ralph iter 0004 — avg_latency=0.012920ms

76dc603

ralph iter 0005 — avg_latency=0.012920ms

c8a2e56

ralph iter 0006 — avg_latency=0.012920ms

56d844c

ralph iter 0001 — avg_latency=0.012920ms

f2343ea

ralph iter 0002 — avg_latency=0.012920ms

c8afab6

ralph iter 0003 — avg_latency=0.012671ms

191506c

ralph iter 0004 — avg_latency=0.012671ms

8593df1

ralph-claude iter 0001 — avg_latency=0.011415ms

55c27da

ralph-claude iter 0002 — avg_latency=0.011415ms

d92429f

jaewooMaeng wants to merge 24 commits into Bammuri:main from jaewooMaeng:ai-assisted-decode

2 participants
