Ai assisted decode #19

Open
jaewooMaeng wants to merge 24 commits into Bammuri:main from jaewooMaeng:ai-assisted-decode

@jaewooMaeng jaewooMaeng commented Apr 24, 2026

Collaborator

6 runs over 54 workloads (warmup_runs=3, iterations=100, num_trials=5):

  • avg latency: 11.735 µs (0.011735 ms)
  • avg latency: 11.360 µs (0.011360 ms)
  • avg latency: 11.095 µs (0.011095 ms)
  • avg latency: 15.899 µs (0.015899 ms)
  • avg latency: 15.648 µs (0.015648 ms)
  • avg latency: 15.953 µs (0.015953 ms)

Average: 13.615 µs
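
The reported average can be reproduced from the six per-run values above. The snippet below is a minimal sketch; the variable names are illustrative and not part of the benchmark harness. It also prints the spread, since the runs fall into two visible groups (roughly 11 µs and 16 µs).

```python
from statistics import mean

# Per-run average latencies in microseconds, copied from the six runs listed above.
run_latencies_us = [11.735, 11.360, 11.095, 15.899, 15.648, 15.953]

overall_us = mean(run_latencies_us)
# Difference between the slowest and the fastest run.
spread_us = max(run_latencies_us) - min(run_latencies_us)

print(f"mean={overall_us:.3f} us, spread={spread_us:.3f} us")
```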

This will be force-pushed to the submit-decode branch.

Bammuri and others added 24 commits April 15, 2026 14:48
Keep the oracle-aligned decode control plane while preserving the first main-lane
candidate that cleared full-54 keep authority. This freezes the B=1-only shared
q/k staging path so later iterations can branch from a proven checkpoint.

Constraint: Submission path must remain self-contained and oracle-aligned while main-lane decode experiments continue
Rejected: Keep experimenting without freezing this state | too easy to lose the first full-gate win
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the current best main decode candidate until a later full-54 result clearly beats it
Tested: Quick pinned auto/native PASS; full-54 auto 0.014 ms PASS; full-54 native_only 0.014 ms PASS; architect approval
Not-tested: Full-54 rerun stability beyond the captured runs

The approved decode lane now treats PythonBuilder execution with runtime compilation of kernel.cu as the primary measured surface. This commit rewires config.toml to the Python entrypoint and adds the initial TVM-FFI wrapper so later exact-surface hardening can happen without changing the surface contract.

Constraint: Official builders do not support direct custom CUDA compile flags
Constraint: Submission path must remain self-contained and use destination-passing style
Rejected: Keep plain kernel.cu entry as primary | weaker control over runtime compile and arch forcing
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat decode_submit_entry.py plus kernel.cu as the exact measured decode surface; keep .omx artifacts out of code commits
Tested: scripts/pack_solution.py emits runtime language=python with entry decode_submit_entry.py::run and sources [kernel.cu, decode_submit_entry.py]
Not-tested: Runtime compilation/execution on GPU

The exact PythonBuilder surface now needs stronger runtime behavior than a one-shot temp build. This change forces the decode compile target to 10.0a, keys the loaded module by source digests so edited kernels rebuild cleanly, inspects the generated build directory for hard gencode proof, and keeps DPS outputs correct even when the provided buffers are non-contiguous.

Constraint: The primary decode surface must stay PythonBuilder -> runtime compile of kernel.cu
Constraint: sm_100a targeting must be explicit and auditable on the exact surface
Rejected: Keep temp build directories and a single global module | hides arch proof and makes source edits reuse stale modules
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat donor CuTe work as off-surface until it yields a concrete kernel.cu transplant; exact-surface evidence must come from decode_submit_entry.py + kernel.cu
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python, entry decode_submit_entry.py::run, DPS=true, sources [kernel.cu, decode_submit_entry.py]
Tested: helper probe forces TVM_FFI_CUDA_ARCH_LIST=10.0a and reports the exact gencode flag / CUDA_HOME resolution path
Not-tested: Successful runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
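
As an illustration of keying the loaded module by source digests, here is a minimal sketch. The names `source_digest`, `load_module`, and the `build` callback are hypothetical and are not taken from decode_submit_entry.py; the real wrapper presumably hands the key to its TVM-FFI build step instead.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Sequence

# One loaded module per distinct source digest (hypothetical cache).
_MODULE_CACHE: Dict[str, object] = {}


def source_digest(paths: Sequence[Path]) -> str:
    """Hash file names and contents so any source edit yields a new key."""
    h = hashlib.sha256()
    for p in paths:
        h.update(p.name.encode())
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def load_module(paths: Sequence[Path], build: Callable[[str], object]) -> object:
    """Return the module for the current sources, building only on a new digest."""
    key = source_digest(paths)
    if key not in _MODULE_CACHE:
        # `build` stands in for the runtime compile step (assumption).
        _MODULE_CACHE[key] = build(key)
    return _MODULE_CACHE[key]
```

With this shape, an edited kernel.cu produces a different key, so a stale module is never reused for new source.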
The wrapper now restores the caller's arch env after probing, drops an unused artifact field, and upgrades sm_100a proof from logging-only to a hard invariant so the exact surface cannot silently continue with weaker codegen evidence.

Constraint: Exact decode results must come from the PythonBuilder runtime-compile surface, not a softer inferred build path
Constraint: sm_100a targeting has to be explicit and auditable on every serious exact-surface artifact
Rejected: Only log soft/missing proof and continue | too easy to accept ambiguous codegen on the measured path
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep exact-surface proof fail-closed unless there is a deliberate decision to widen accepted build evidence
Tested: python -m py_compile on decode_submit_entry.py and scripts/pack_solution.py
Tested: scripts/pack_solution.py still emits runtime language=python with entry decode_submit_entry.py::run and DPS=true
Tested: helper probe restores TVM_FFI_CUDA_ARCH_LIST after temporary forcing and after failed runtime-compile attempts
Not-tested: Successful hard-proof runtime compilation and GPU execution (local environment lacks visible CUDA_HOME/nvcc)
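
Restoring the caller's arch environment after probing, including after a failed compile attempt, can be sketched with a context manager. This is an assumption about the general shape, not the wrapper's actual code; `forced_env` is a hypothetical helper name.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_env(name: str, value: str):
    """Temporarily set an environment variable, then restore the caller's state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Runs on success and on exceptions, so a failed compile cannot leak the override.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Usage would look like `with forced_env("TVM_FFI_CUDA_ARCH_LIST", "10.0a"): ...` around the compile call.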
The exact PythonBuilder decode wrapper was spending steady-state time rereading kernel.cu and decode_submit_entry.py on every invocation just to derive the cache key. This change memoizes the source identity from file metadata plus digests, keeps the runtime-compile surface unchanged, and preserves the existing sm100a proof path.

Constraint: The primary decode surface must remain PythonBuilder -> decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Modal-only verification is the current authority path
Rejected: Stop after proving the surface is slow | the hook requires continuing with fresh verification evidence
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future exact-surface optimization should first eliminate Python-side per-call overhead before touching kernel math
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 22.155 us from the previous ~111.929 us exact-surface result
Tested: Modal precise B48 4c38f0e4 improved to 26.834 us and B64 ef2cf980 improved to 30.187 us
Not-tested: Full-54 precise authority run after this optimization
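
A rough sketch of memoizing source identity from file metadata plus digests follows. It assumes modification time and size are an acceptable staleness check; `source_identity` and `_IDENTITY_CACHE` are hypothetical names, not the wrapper's own.

```python
import hashlib
import os
from typing import Dict, Tuple

# path -> ((mtime_ns, size), sha256 hex digest)
_IDENTITY_CACHE: Dict[str, Tuple[Tuple[int, int], str]] = {}


def source_identity(path: str) -> str:
    """Return the file digest, rereading the file only when its metadata changes."""
    st = os.stat(path)
    stamp = (st.st_mtime_ns, st.st_size)
    cached = _IDENTITY_CACHE.get(path)
    if cached is not None and cached[0] == stamp:
        # Steady state: one stat() call, no file read, no hashing.
        return cached[1]
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    _IDENTITY_CACHE[path] = (stamp, digest)
    return digest
```

The trade-off is that an edit preserving both size and modification time would go unnoticed, which is why this is only a sketch of the idea.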
After removing per-call file hashing, the next remaining exact-surface tax was still in Python dispatch. This change caches the loaded TVM-FFI kernel callable once per built artifact and only calls contiguous() when tensors are actually non-contiguous, reducing steady-state Python overhead without changing the exact surface contract.

Constraint: The exact decode surface must stay on decode_submit_entry.py -> runtime compile of kernel.cu
Constraint: Verification authority remains Modal-only under 3/100/5 style runs
Rejected: Reopen the torch-extension side path | representative Modal runs were slower and not a keep signal
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep optimizing Python-side overhead before revisiting deeper runtime surface changes
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 20.803 us, B48 4c38f0e4 to 25.040 us, B64 ef2cf980 to 29.938 us
Not-tested: Full-54 precise authority run after this optimization
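
The "only call contiguous() when needed" idea, combined with destination-passing style (DPS), might look like the sketch below. It is duck-typed: it assumes tensor objects expose `is_contiguous()`, `contiguous()`, and `copy_()` as PyTorch tensors do. `as_contiguous` and `run_dps` are hypothetical names, and `kernel` stands in for the cached TVM-FFI callable.

```python
from typing import Any, Callable, Sequence


def as_contiguous(t: Any) -> Any:
    """Return t unchanged when it is already contiguous, else a contiguous copy."""
    return t if t.is_contiguous() else t.contiguous()


def run_dps(kernel: Callable[..., None], inputs: Sequence[Any], out: Any) -> None:
    """Call a DPS kernel, writing results into `out` even if `out` is strided."""
    dense_inputs = [as_contiguous(t) for t in inputs]
    if out.is_contiguous():
        # Fast path: the kernel writes straight into the caller's buffer.
        kernel(*dense_inputs, out)
    else:
        # Slow path: compute into a dense scratch buffer, then copy back.
        scratch = out.contiguous()
        kernel(*dense_inputs, scratch)
        out.copy_(scratch)
```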
The exact PythonBuilder wrapper no longer needs to redo arch forcing, metadata checks, and cache lookup work once the module is built inside a worker process. This change pins the active build artifact after the first successful load so steady-state decode calls stay on the exact surface with less Python overhead.

Constraint: Exact-surface verification still depends on Modal runs, not local timing
Constraint: The wrapper must keep using runtime-compiled kernel.cu with explicit sm100a proof
Rejected: Leave _load_mod on the hot path for every invocation | representative Modal data showed the Python side was still dominating kernel time
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat worker-process steady-state behavior as the primary optimization surface once build identity is stable
Tested: python -m py_compile solution/cuda/decode_submit_entry.py
Tested: Modal precise B1 901e5104 improved to 13.938 us, B48 4c38f0e4 to 17.338 us, B64 ef2cf980 to 21.114 us
Not-tested: Full-54 precise authority run after this optimization

Fresh Modal evidence now shows the exact Python runtime-compile surface is floor-limited, while the onefile-dual plain-CUDA kernel family is the first historical candidate to beat the plain-CUDA baseline on a full54 authority run. This commit ports that candidate into the working branch by switching the active decode surface back to direct kernel.cu entry and installing the onefile-dual kernel body.

Constraint: .omx planning artifacts must stay out of code commits
Constraint: Current exact-surface branch evidence shows the Python runtime-compile surface is a hard blocker for the 1.2 us target
Rejected: Keep the exact Python runtime-compile lane as primary | fresh full54 authority stayed slower than plain-CUDA
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Treat this commit as the new plain-CUDA baseline candidate; compare future ideas against its full54 authority result before keeping them
Tested: Modal representative 6-bucket run on workspace candidate (3/100/5)
Tested: Modal full54 authority on identical onefile-dual solution artifact (3/100/5) -> 13.177 us, PASSED 54/54
Not-tested: Fresh full54 authority rerun after porting this exact workspace copy

Fresh forced plain-CUDA full54 authority now shows the archived onefile kernel family consistently beating the dual-kernel workspace baseline when TVMFFIBuilder is pinned to sm100a. This narrows the active lane to the simpler single-kernel body while preserving the same submission surface and lets subsequent work optimize from the currently strongest valid baseline.

Constraint: Only code changes belong in this branch history; .omx artifacts stay uncommitted
Constraint: Valid decode comparisons are now limited to forced plain-CUDA sm100a and exact runtime-compile full54 authority
Rejected: Keep the dual-kernel 2d61d09 baseline as the active forced-sm100a lane | repeated full54 authority stayed slower than the onefile candidate
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Compare future forced-sm100a plain-CUDA ideas against this onefile baseline with full54 authority before promoting them
Tested: Forced plain-CUDA sm100a full54 authority on workspace-packed candidate (Modal ap-6hUwd4UdFxQUl3Ho0ins4V) -> 13.217 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on equivalent candidate JSON (Modal ap-9bA3DyKhFWySyjhtiodq8P) -> 13.823 us, PASSED 54/54
Tested: Forced plain-CUDA sm100a full54 authority on previous baseline JSON (Modal ap-9il2CSKUc3RKyVxy2xoT1m) -> 14.180 us, PASSED 54/54
Not-tested: Exact runtime-compile full54 with this kernel body
Not-tested: NCU/proton profiling on the new forced-sm100a onefile baseline

The active decode lane is now only valid when plain-CUDA measurements run through TVMFFIBuilder with an explicit TVM_FFI_CUDA_ARCH_LIST=10.0a environment. This commit bakes that requirement into the default Modal benchmark harness so future authority runs use the same compile target without relying on ad hoc temporary scripts.

Constraint: Decode promotion evidence now requires forced-sm100a plain-CUDA or exact runtime-compile full54 runs
Constraint: .omx artifacts and unrelated copied worktree files must stay out of commits
Rejected: Keep the default runner on TORCH_CUDA_ARCH_LIST alone | fresh inspection proved TVMFFIBuilder fell back to compute_100/sm_100 without TVM_FFI_CUDA_ARCH_LIST
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat Modal full54 results gathered without TVM_FFI_CUDA_ARCH_LIST=10.0a as non-authoritative for the current decode lane
Tested: Build-ninja inspection without TVM_FFI_CUDA_ARCH_LIST (ap-GAL0f5N52Z1Mx1fYetsxXU) -> sm_100 only
Tested: Build-ninja inspection with TVM_FFI_CUDA_ARCH_LIST=10.0a (ap-alGDes2GFiamFTKYjLCm08) -> hard sm_100a proof
Tested: Modal representative rerun with committed runner env on e352b57 workspace (ap-6OYnqOzayksPX5hag34EPK) -> PASSED 3/3
Not-tested: Fresh committed-runner full54 authority after this exact harness commit
Not-tested: NCU/proton helper scripts still use temporary forced-sm100a overlays
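
A fail-closed check of the generated build file could be sketched as follows. It assumes the build text carries nvcc-style `-gencode=arch=...,code=...` flags; the exact contents of the build.ninja emitted by TVMFFIBuilder are not shown in this PR, so the pattern and the helper names `gencode_targets` and `require_sm100a` are assumptions.

```python
import re
from typing import Set

# Matches the code= part of an nvcc gencode flag, e.g. code=sm_100a (assumed format).
_CODE_TARGET = re.compile(r"code=((?:sm|compute)_[0-9]+[a-z]?)")


def gencode_targets(build_text: str) -> Set[str]:
    """Collect every code= target mentioned in the build text."""
    return set(_CODE_TARGET.findall(build_text))


def require_sm100a(build_text: str) -> Set[str]:
    """Raise unless sm_100a appears; plain sm_100 is treated as insufficient proof."""
    targets = gencode_targets(build_text)
    if "sm_100a" not in targets:
        raise RuntimeError(f"no sm_100a gencode proof; found {sorted(targets)}")
    return targets
```

Raising instead of logging mirrors the fail-closed directive above: a build that silently fell back to sm_100 would stop the run rather than produce a non-authoritative measurement.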
The current fastest submission-owned decode lane is the c721base minwrap surface, so this branch switches the active config back to the submit-decode style Python wrapper while keeping the proven onefile kernel body from the parent baseline. This preserves a branch-local lane for bear-safe evaluation without disturbing the faster plain-CUDA measurement branch.

Constraint: Only code files belong in this branch history; .omx artifacts and copied workspace clutter stay out
Constraint: Submission-owned sm100a forcing must happen inside the wrapper, not only in external Modal runner env
Rejected: Keep submit-safe candidate only as a /tmp solution artifact | hard to audit and easy to lose between sessions
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this branch as the submit-safe lane; compare future self-compile wrapper ideas against its full54 authority before promoting them
Tested: Existing temp full54 authority on the same c721base minwrap artifact (Modal ap-TPoYtWlwqbBKvxK0avOfOS) -> 13.901 us, PASSED 54/54
Not-tested: Fresh branch-backed full54 authority after this exact branch commit
Not-tested: 6-run statistical repeat still in progress under .omx/logs/submit-safe-c721base-3-100-6x6-20260419T062435Z
…o iter #1

Iteration #2 aimed for B5 (warp specialization + async memcpy + shared memory):
- Phase 1: B5 with cuda::pipeline → INCORRECT_NUMERICAL on all 54 workloads
- Phase 2: Corrected indexing → benchmark timeout (modal 240s+, inconclusive)
- Phase 3: Fallback B2 (double-buffering) → uninitialized prefetch risk

Decision: Suspend iteration #2, revert to iter #1 baseline (avg_latency=0.011415ms)
Lessons: Async pipeline/warp specialization more fragile than expected; prefetch rotation logic error-prone

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@jaewooMaeng

Collaborator Author

=== NCU Duration per workload ===
UUID axes duration_us fallback

901e5104-dccb-4c3f-ae13-ef4d31a4d456 batch_size=1 7.010 True
a5714b69-525c-4b95-bb7a-a0f9770c2f48 batch_size=1 7.170 True
3daa0974-293c-4414-b3c2-1f04368c1189 batch_size=1 6.940 True
22d5cef5-4f30-4f43-9d5f-0e9e95dc2201 batch_size=1 7.230 True
aed4bdd4-3139-4a1b-ae2f-aab8d4ba4090 batch_size=1 7.360 True
49a125e5-edf0-492e-a8b3-3676d14adaa3 batch_size=1 7.230 True
798635cf-d424-4343-a959-c96b0c0e81fb batch_size=1 7.070 True
5716e24a-3f55-411a-bcd8-e6b677b1ca7e batch_size=1 7.260 True
4c7df22f-70ef-4494-864f-6f10209ab0f3 batch_size=1 7.550 True
d0e91dea-aa1b-46c8-a67c-b2814f5a1725 batch_size=1 7.230 True
ec9d2340-6d13-40e4-a6fe-4483a1cacd0d batch_size=4 8.100 True
9f238670-9a56-4ab9-94f9-555755f32205 batch_size=4 8.990 True
6d929bff-d051-4e1d-acbf-7d0cc13f6dc8 batch_size=4 8.160 True
494844dc-80e9-41e9-9fe2-3d2618fdef64 batch_size=4 8.190 True
26f760ab-c286-41bf-8c37-3bc5df4c98fc batch_size=4 8.100 True
deecf42e-dfae-4061-936f-0af4d892c231 batch_size=4 8.100 True
9a92acbc-6104-48c6-8176-601feee30001 batch_size=4 9.310 True
fa90b213-3ad2-4a42-9173-97fbf9e2e809 batch_size=4 8.030 True
2640f1e3-4b02-4041-bdc1-59a28e0b9954 batch_size=8 9.090 True
7464a1d6-b0dd-4061-b15a-ebdabf47e351 batch_size=8 8.930 True
25d1f606-3e9e-49e6-a3f0-991092c7a845 batch_size=8 8.930 True
fe2e6584-17d0-4025-ad44-db9e507f6bed batch_size=8 9.120 True
ca62112d-68e8-4eb7-a318-26c48d256e10 batch_size=8 9.180 True
61dfd334-971b-4a44-a8a8-f00950192745 batch_size=8 9.090 True
abec5d32-2409-4412-8683-472f3a091a54 batch_size=8 9.090 True
76eec66d-0ada-4c35-bf21-a07247ad7f05 batch_size=16 10.530 True
f0507025-1db0-48de-b3d6-4faad5632558 batch_size=16 10.530 True
d1508f6d-81b7-4caf-947b-b48f612a3061 batch_size=16 10.690 True
63d4cb66-d6f6-4ceb-8083-0c0fb4abece9 batch_size=16 10.690 True
ef6bb01d-1294-4a06-97b2-604ec944b4c0 batch_size=16 10.780 True
d0acf04a-5919-4bf3-95ca-94fbef5786c0 batch_size=16 11.040 True
ac83f1e4-982e-41a5-8c02-7ccc14c728d4 batch_size=16 10.530 True
a8c8beff-e414-4580-b2f7-e5b8f13bc269 batch_size=32 14.940 True
40940a4c-1348-416d-aea1-f3183548229b batch_size=32 15.070 True
0915952b-5887-43aa-9807-a0980e7a78f6 batch_size=32 15.420 True
d38dde79-79f9-4c2d-8bbe-05ee1be48583 batch_size=32 15.330 True
388f79b1-13e9-44e6-b12e-c8197989a924 batch_size=32 14.880 True
80288532-6967-4b8e-b5bc-c31ce0a208d5 batch_size=32 14.780 True
773a4157-01fa-4637-90be-8604e0b40526 batch_size=32 14.850 True
53385c7f-393d-41db-aec8-5b9eb5bf35d1 batch_size=48 21.980 True
45e7697a-bae5-42bd-9f56-959c10ff681f batch_size=48 22.370 True
34403e5b-8551-4d44-95b0-21532c4eb839 batch_size=48 23.940 True
b91cc4aa-048d-4e05-9c0c-8a153ec997f8 batch_size=48 20.700 True
4c38f0e4-aabf-4742-90b9-24231effd96f batch_size=48 22.270 True
be91d78c-1114-4563-b0aa-e0682ccc09bb batch_size=48 21.860 True
3803367a-f140-432a-9784-43fe5d5f64d0 batch_size=48 21.660 True
eaf0a285-447c-4432-8e68-d287acc3cb08 batch_size=64 31.650 True
f4fd2171-c869-440c-b199-f403e3c6b788 batch_size=64 31.100 True
b1931bc6-b8b6-44da-a83b-aa6cc4d4c275 batch_size=64 32.610 True
8d0c5789-550a-4216-a471-202b2655a4e8 batch_size=64 32.380 True
9eef9bc7-5972-437e-b488-853c6dacb470 batch_size=64 30.750 True
58571e49-29ee-4a53-bf1d-a7a2363e9db3 batch_size=64 31.940 True
ef2cf980-6977-49d1-a6f8-7247becf0273 batch_size=64 31.900 True
42963acb-f2f5-4ada-9205-3931cd26fa44 batch_size=64 31.520 True

Arithmetic mean over 54/54 workloads: 14.651 us
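
To see how duration scales with batch size, the rows above can be grouped by the `batch_size` axis. The parser below is a small sketch that assumes the four-column whitespace-separated layout shown in the table; `mean_by_batch` is an illustrative name, and the sample uses only four rows copied from the table.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict


def mean_by_batch(rows: str) -> Dict[int, float]:
    """Average duration_us per batch size from 'UUID axes duration_us fallback' rows."""
    groups = defaultdict(list)
    for line in rows.strip().splitlines():
        _uuid, axes, duration_us, _fallback = line.split()
        batch = int(axes.split("=", 1)[1])
        groups[batch].append(float(duration_us))
    return {batch: mean(values) for batch, values in sorted(groups.items())}


# Four rows copied from the table above (two per batch size).
sample = """
901e5104-dccb-4c3f-ae13-ef4d31a4d456 batch_size=1 7.010 True
a5714b69-525c-4b95-bb7a-a0f9770c2f48 batch_size=1 7.170 True
eaf0a285-447c-4432-8e68-d287acc3cb08 batch_size=64 31.650 True
f4fd2171-c869-440c-b199-f403e3c6b788 batch_size=64 31.100 True
"""

for batch, avg in mean_by_batch(sample).items():
    print(f"batch_size={batch}: {avg:.3f} us")
```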
