[Experiment] ROCm backend by NripeshN · Pull Request #2300 · ml-explore/mlx

NripeshN · 2025-06-16T21:43:44Z

Experiment with ROCm backend.

Install MLX with the ROCm backend using:

mkdir build && cd build
cmake -DMLX_BUILD_ROCM=ON \
      -DCMAKE_PREFIX_PATH=/opt/rocm \
      -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \
      ..
make -j$(nproc)

closes #2556

Inspired by @zcbenz

lin72h · 2025-06-17T07:07:21Z

What an unexpected and amazing surprise! I'm absolutely thrilled.

NripeshN · 2025-06-18T23:51:44Z

@awni
What do you think of this PR? Does this have the potential to be merged into main? I can turn this PR from experimental to WIP if so.

angeloskath · 2025-06-24T00:38:27Z

I think this is good to stay as an experiment branch for some time while we work on core and CUDA. I don't think we have the bandwidth to merge this for a few months at least. Sorry if this is disappointing, @NripeshN. I don't mean to discourage you from working on it.

akshat2602 · 2025-08-18T17:56:41Z

I would love to see the ROCm backend get more traction. The new AI series of processors by AMD has a similar advantage to Apple Silicon with unified memory, and getting MLX to run on those processors would be neat.

countradooku · 2026-01-04T20:27:49Z

Stole my idea :(

goniz · 2026-01-22T15:20:15Z

How is this even possible for such an awesome PR to be left like this?

Copilot

Pull request overview

This PR adds experimental ROCm backend support to MLX, enabling execution on AMD GPUs. The implementation mirrors the CUDA backend structure, providing HIP-based implementations of core operations, memory management, and device handling.

Changes:

- Added ROCm backend infrastructure with device management, memory allocation, and stream handling
- Implemented HIP kernels for unary, binary, ternary operations, reductions, normalization (softmax, layer_norm, rms_norm), RoPE, and sorting
- Updated build system (CMake) to support ROCm compilation with configurable GPU architectures

Reviewed changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| `CMakeLists.txt` | Added MLX_BUILD_ROCM option and ROCm library detection |
| `mlx/CMakeLists.txt` | Integrated ROCm backend build configuration |
| `mlx/device.cpp` | Added ROCm device availability checks |
| `mlx/backend/rocm/*.hip` | HIP kernel implementations for various operations |
| `mlx/backend/rocm/device.*` | ROCm device and stream management |
| `mlx/backend/rocm/allocator.*` | ROCm-specific memory allocator using HIP unified memory |
| `mlx/backend/rocm/worker.*` | Async task execution worker for stream synchronization |
| `mlx/backend/rocm/utils.*` | HIP utility functions and error handling |
| `mlx/backend/rocm/jit_module.*` | JIT compilation support using HIPRTC |
| `mlx/backend/rocm/device/*.hpp` | Device-side utility functions and type definitions |
| `mlx/backend/rocm/CMakeLists.txt` | ROCm backend build configuration |

goniz · 2026-01-24T17:42:45Z

👑👑👑

NripeshN · 2026-01-24T18:12:04Z

Can anyone run

CMAKE_ARGS="-DMLX_BUILD_ROCM=ON" pip install -e .
CMAKE_ARGS="-DMLX_BUILD_ROCM=ON -DMLX_ROCM_ARCHITECTURES={based on your GPU}" pip install -e .

Replace {based on your GPU} with your GPU architecture

You can run

rocm-smi

to get your GPU information
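
One way to fill in the architecture value is to read it from `rocminfo`, which lists each agent's name. The following is a minimal sketch, assuming the output contains names of the form `gfx1151`. The `gfx_targets` helper and the sample text are hypothetical, and joining several targets with `;` is assumed to work for `MLX_ROCM_ARCHITECTURES` the same way it does in the `CMAKE_HIP_ARCHITECTURES` example at the top of this PR.

```python
import re


def gfx_targets(rocminfo_text):
    """Return the unique gfx architecture names, in order of first appearance."""
    seen = []
    for name in re.findall(r"gfx[0-9a-f]+", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical excerpt. Real text would come from something like
# subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
sample = """
  Name:                    gfx1151
  Name:                    amdgcn-amd-amdhsa--gfx1151
"""

targets = gfx_targets(sample)
cmake_args = "-DMLX_BUILD_ROCM=ON -DMLX_ROCM_ARCHITECTURES=" + ";".join(targets)
print(f'CMAKE_ARGS="{cmake_args}" pip install -e .')
```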

goniz · 2026-01-24T18:49:41Z

I'm getting this CMake error:

CMAKE_ARGS="-DMLX_BUILD_ROCM=ON -DMLX_ROCM_ARCHITECTURES=gfx1151" pip install -e .

      -- Configuring done (4.8s)
      CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
      Please set them or make sure they are set and tested correctly in the CMake files:
      /home/goniz/Work/mlx/LAPACK_INCLUDE_DIRS
         used as include directory in directory /home/goniz/Work/mlx
      
      CMake Error in CMakeLists.txt:
        HIP_ARCHITECTURES is empty for target "mlx".
      
      
      CMake Error in CMakeLists.txt:
        HIP_ARCHITECTURES is empty for target "mlx".
      
      
      -- Generating done (0.0s)
      CMake Generate step failed.  Build files cannot be regenerated correctly.

Running on Strix Halo (gfx1151)

NripeshN · 2026-01-25T00:54:26Z

Could you retry with the latest push, please? (P.S. Keep your fingers crossed while it compiles; it worked for me on the 138th try.) 😅

goniz · 2026-01-25T02:18:36Z

  Created wheel for mlx: filename=mlx-0.30.4.dev20260125+cadf18c1-0.editable-cp314-cp314-linux_x86_64.whl size=4722 sha256=72c664adbfc4fb9ec317522a8d83b84f85d599d08bd691d7fec3abfdb6f3a5e9
  Stored in directory: /tmp/pip-ephem-wheel-cache-nt7w6bq0/wheels/8a/63/d1/d7d629a5ff73457822bb71aa527c083674bb19ca314735cd05
Successfully built mlx
Installing collected packages: mlx
Successfully installed mlx-0.30.4.dev20260125+cadf18c1

Now what can I test? 😍

goniz · 2026-01-25T02:21:36Z

I'm getting this:

ImportError: /home/goniz/Work/mlx/python/mlx/lib/libmlx.so: undefined symbol: _ZN3mlx4core11Convolution8eval_gpuERKSt6vectorINS0_5arrayESaIS3_EERS3_

NripeshN · 2026-01-26T04:32:13Z

I forgot to test the Python build, my bad. Can you try it now?

Unfortunately, I might not be able to help after it compiles; I don't have an AMD GPU to run tests 😔 I've tried replicating most things from CUDA, so hopefully it works.

goniz · 2026-01-26T05:21:32Z

Now fails on load with this:

>>> import mlx.core
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import mlx.core
ImportError: /home/goniz/Work/mlx/python/mlx/lib/libmlx.so: undefined symbol: hiprtcCompileProgram

goniz · 2026-01-26T05:23:08Z

> Unfortunately I might not be able to help after it compiles, I don't have an AMD GPU to run tests😔 I've tried replicating most things from cuda, so hopefully it works

Omg, I can't believe you did it without an AMD card 😱😱

NripeshN · 2026-01-26T11:40:25Z

> Now fails on load with this:

~~The latest push hopefully fixes the undefined symbol error~~ Found the issue, working on the fix😩

> Omg I don't believe you did it without AMD card 😱😱

Haha, Docker literally saves me and humbles me at the same time

goniz · 2026-01-26T11:50:55Z

goniz · 2026-01-26T11:53:40Z

I might have gotten overexcited:

NripeshN · 2026-01-26T11:56:16Z

Wait it works?😅

Ah, unfortunately, unless a magic fairy sends me a PC with an AMD GPU, I cannot help after this 😭 With RAM prices the way they are, I doubt the magic fairy has the funds either 🥲

goniz · 2026-01-26T11:58:34Z

Latest commit broke something:

NripeshN · 2026-01-26T12:03:00Z

Lemme try adding a fix for both the issues above actually. I had just made a stub implementation earlier.

NripeshN · 2026-01-26T12:18:14Z

@goniz give the last push a try, maybe. It might not work, but you will definitely not have the same error at least ☺️

goniz · 2026-01-26T12:20:48Z


mlx rocm-support ? ❯︎ python3 qwen3.py 
Fetching 9 files: 100%|██████| 9/9 [00:00<00:00, 201864.90it/s]
Download complete: : 0.00B [00:00, ?B/s]              ?, ?it/s]
==========
Traceback (most recent call last):
  File "/home/goniz/Work/mlx/qwen3.py", line 15, in <module>
    text = generate(model, tokenizer, prompt=prompt, verbose=True)
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 762, in generate
    for response in stream_generate(model, tokenizer, prompt, **kwargs):
                    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 699, in stream_generate
    for n, (token, logprobs, from_draft) in enumerate(token_generator):
                                            ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 689, in <genexpr>
    (token, logprobs, False) for token, logprobs in token_generator
                                                    ^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 432, in generate_step
    mx.eval([c.state for c in prompt_cache])
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Unsupported dtype for affine_dequantize

NripeshN · 2026-01-26T12:41:29Z

Might fix it(????)

goniz · 2026-01-26T12:43:06Z


mlx rocm-support ? ❯︎ python3 qwen3.py 
Fetching 9 files: 100%|███████| 9/9 [00:00<00:00, 28575.88it/s]
Download complete: : 0.00B [00:00, ?B/s]              ?, ?it/s]
==========
Traceback (most recent call last):
  File "/home/goniz/Work/mlx/qwen3.py", line 15, in <module>
    text = generate(model, tokenizer, prompt=prompt, verbose=True)
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 762, in generate
    for response in stream_generate(model, tokenizer, prompt, **kwargs):
                    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 699, in stream_generate
    for n, (token, logprobs, from_draft) in enumerate(token_generator):
                                            ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 689, in <genexpr>
    (token, logprobs, False) for token, logprobs in token_generator
                                                    ^^^^^^^^^^^^^^^
  File "/home/goniz/Work/mlx/venv/lib/python3.14/site-packages/mlx_lm/generate.py", line 432, in generate_step
    mx.eval([c.state for c in prompt_cache])
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: QuantizedMatmul has no ROCm implementation.

… resident

Follow-up to the device-binding fix. Three changes that keep the discrete GPU out of queue-wedge states:

- AtomicEvent: signal via hipLaunchHostFunc and wait by host poll instead of hipStreamWriteValue64/WaitValue64. The value ops require hipMallocSignalMemory and silently no-op on a plain pinned-host counter, so the GPU-side wait never observes the value and the queue spins forever (100% busy, 0 mem traffic).
- set_cache_limit: trim the reuse pool down to the new cap immediately (caller is at an idle point) instead of lazily on the next malloc, so the eviction's blocking hipFree doesn't fire mid-forward and wedge the queue.
- device(): set blocking-sync flags per device index, not behind a single global bool (which left device 1 unflagged if device 0 was touched first).
- Keep fine-grained (VRAM-resident, host-mappable) allocations on both the APU and discrete GPU; bump the discrete driver reserve 256MB -> 512MB.
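
The `device()` change described above can be illustrated with a small model. This is a sketch only, not the backend's code: the class names are hypothetical, and "flagging" a device stands in for setting its blocking-sync flags.

```python
class GlobalBoolInit:
    """Models the old behavior: one global bool guards all devices."""

    def __init__(self):
        self.initialized = False
        self.flagged = []

    def device(self, index):
        if not self.initialized:
            self.flagged.append(index)  # only the first device touched gets flags
            self.initialized = True


class PerDeviceInit:
    """Models the fix: track which device indices have been flagged."""

    def __init__(self):
        self.flagged = set()

    def device(self, index):
        if index not in self.flagged:
            self.flagged.add(index)  # every device gets flags on first touch


old, new = GlobalBoolInit(), PerDeviceInit()
for idx in (0, 1, 0):
    old.device(idx)
    new.device(idx)

print(old.flagged, sorted(new.flagged))
```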

The reduce-type kernel launch macro referenced data_offset, a structured binding, which C++17 forbids capturing. Copy it to a plain local first so the file compiles when rebuilt. No behavior change; the KV/None path is unaffected.

The dims_<D "RoPE without copy" path shared a donatable input's buffer and indexed it with a contiguous T*D head stride (strides[0]=mat_size) while taking the row/element strides from the input's actual layout. For a non-contiguous (transposed [B,H,T,D]) Q view that mix produces out-of-bounds addressing; the error grows with head index and T, so it stays in-bounds for short prompts but runs off the buffer at T=512, faulting the GPU command processor and wedging the queue on long (>512-token) prefills. Gate the in-place path on row_contiguous; fall back to copy_gpu otherwise (matches the CUDA backend, which always materializes a contiguous output here).
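
The stride mix-up can be reproduced with plain index arithmetic. The sketch below assumes batch size 1 and a buffer stored as [T, H, D] that is viewed as [H, T, D]; those layout details and the helper names are assumptions for illustration. It compares the largest element offset reached with the view's real strides against the offset reached when a contiguous `T*D` head stride is combined with the real row and element strides.

```python
def max_offset(strides, shape):
    """Largest element offset touched when indexing every element."""
    return sum((n - 1) * s for n, s in zip(shape, strides))


def overshoot(H, T, D):
    """Elements addressed past the end of the buffer by the mixed strides."""
    shape = (H, T, D)
    buffer_elems = H * T * D
    actual = (D, H * D, 1)     # real strides of the transposed view
    mixed = (T * D, H * D, 1)  # contiguous head stride + real row/elem strides
    assert max_offset(actual, shape) == buffer_elems - 1
    return max_offset(mixed, shape) - (buffer_elems - 1)


for T in (2, 64, 512):
    print(T, overshoot(H=8, T=T, D=128))
```

In this simplified model the overshoot is `D * (T - 1) * (H - 1)` elements, so it grows with both the sequence length and the head count. It does not model the real size of the allocation, so it says nothing about the exact prompt length at which a fault appears.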

RMSNorm::eval_gpu forced a contiguous_copy_gpu whenever the input rows were not tightly packed (stride[-2] != axis_size), e.g. the sliced per-head q/k norm where each head's vector sits in a wider stride. The kernel indexed rows as row*axis_size, so it could only handle packed input. Add a strided kernel that computes each row's base offset from the leading dims' shape/strides (last dim must be contiguous) and writes a packed output, and route the previously-copying case to it. The packed fast path is unchanged; only inputs that would otherwise be copied now run in place. On Qwen3 MoE q4 this removes ~490 copy launches over a 40-token run with identical output.
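
A host-side sketch of the strided indexing idea, assuming a flat buffer, a contiguous last dimension, and no learned weight (the real kernel also applies one). The function names are hypothetical. Each row's base offset is derived from the leading dimensions' shape and strides instead of `row * axis_size`, and the output is written packed.

```python
import math


def row_base(row, lead_shape, lead_strides):
    """Turn a flat row index into an element offset using the leading dims."""
    offset = 0
    for dim, stride in zip(reversed(lead_shape), reversed(lead_strides)):
        row, idx = divmod(row, dim)
        offset += idx * stride
    return offset


def rms_norm_strided(buf, lead_shape, lead_strides, axis_size, eps=1e-6):
    """Normalize each (possibly non-packed) row and return a packed list."""
    out = []
    for r in range(math.prod(lead_shape)):
        base = row_base(r, lead_shape, lead_strides)
        row = buf[base:base + axis_size]
        scale = 1.0 / math.sqrt(sum(x * x for x in row) / axis_size + eps)
        out.extend(x * scale for x in row)
    return out


# Two tokens of width 8; take two heads of size 2 from the first 4 columns.
buf = [float(i + 1) for i in range(16)]
out = rms_norm_strided(buf, lead_shape=(2, 2), lead_strides=(8, 2), axis_size=2)
print(out)
```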

The general gather kernel runs one thread per output element and redoes the full src/index stride decomposition (mod/div loops) for every element. For an axis-0 gather of a row-contiguous source (e.g. the MoE token reorder, gathering [N, hidden] rows), all elements of a row share the same source-row base, so this is pure integer-math overhead. Add a fast path (gather_rows_kernel) for that case: one block per output row, source-row base computed once, coalesced copy of the contiguous row. Gated to nidx==1, axis 0, row-contiguous src, ndim>=2, full-row slices, contiguous index; everything else still uses the general kernel.
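
The difference between the two paths can be sketched on the host. This is an illustration under simplifying assumptions (a 2-D row-contiguous source stored as a flat list, one index array, axis 0), with hypothetical function names. The general version redoes the index decomposition per element, while the row version computes the source-row base once per output row.

```python
def gather_general(src, indices, row_len):
    """One 'thread' per output element: decompose the flat index every time."""
    out = []
    for i in range(len(indices) * row_len):
        r, c = divmod(i, row_len)
        out.append(src[indices[r] * row_len + c])
    return out


def gather_rows(src, indices, row_len):
    """One 'block' per output row: compute the base once, copy the whole row."""
    out = []
    for idx in indices:
        base = idx * row_len
        out.extend(src[base:base + row_len])
    return out


src = list(range(12))  # 4 rows of 3 elements
indices = [2, 0, 3]
print(gather_rows(src, indices, row_len=3))
```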

- DynamicSliceUpdate (gpu/primitives.cpp): donate the input buffer when uniquely owned (contiguous, full) so a device-position slice_update writes IN PLACE instead of copying the whole buffer every call: O(1) preallocated KV updates with a stable address. Mirrors the existing SliceUpdate donation.
- CustomKernel output->input aliasing (fast.h, fast_primitives.h, rocm/metal/cuda custom_kernel.cpp): hip_kernel() takes an optional output_input_aliases map; an aliased output reuses the input's buffer in place. Lets a recurrent-state kernel (gated-delta SSM) write its new state into the same buffer it read, so a captured HIP graph's recurrence accumulates across replays. Honored on all three GPU backends; no-op when unset.
- indexing.hip: in-place device-scalar kernels (gpu_kv_pos_set/increment) and an in-place KV row-write (gpu_kv_row_write) for the device-position decode loop. Raw kernels (no host-constant upload) so the value survives graph capture/replay.

The unconditional per-GEMM fprintf(stderr) serialized the host thread on every matmul (prefill hot path). Gate it behind an env flag (off by default).

6-bit QMV ran on qmv_warp_shared at half-wave (block=16) occupancy because the generic tiled kernel needs integer pack_factor (32/6 isn't). New qmv_tiled_6bit_kernel gives 6-bit the tiled kernel's full Wave32 + column-tiling + LDS-X structure with byte-aligned 6-bit loads (K%64==0). +26% decode on gfx1151 (30.3 -> 38.3 tok/s), +2% on gfx1201. On by default; MLX_ROCM_QMV_6BIT_SLOW reverts to warp_shared.
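
The packing constraint can be checked with a little arithmetic: 32 is not divisible by 6, but four 6-bit values fill exactly three bytes, and 64 values fill exactly twelve 32-bit words. The sketch below is a host-side illustration with hypothetical helpers; the LSB-first bit order is an assumption and may not match the layout the kernel actually reads.

```python
def pack6(values):
    """Pack 6-bit integers, four at a time, into three bytes (LSB first)."""
    if len(values) % 4:
        raise ValueError("need a multiple of 4 values")
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        word = a | (b << 6) | (c << 12) | (d << 18)
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack6(data):
    """Inverse of pack6: read three bytes and extract four 6-bit values."""
    out = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "little")
        out.extend((word >> shift) & 0x3F for shift in (0, 6, 12, 18))
    return out


values = list(range(64))
packed = pack6(values)
print(len(packed), unpack6(packed) == values)
```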

MoE expert gather-QMV (gather_qmv_warp_shared) ran at half-wave occupancy like the dense path. gather_qmv_tiled_6bit_kernel mirrors qmv_tiled_6bit_kernel (full Wave32 + column-tiling + LDS-X, byte-aligned 6-bit loads) with the expert-index gather. Dense+ MoE together: +33% decode on gfx1151 (30.2 -> 40.2 tok/s). Default-on; MLX_ROCM_QMV_6BIT_SLOW reverts.

…apability table

- Route dequant prefill GEMM through hipBLASLt (all dtypes), eliminating the rocBLAS Tensile missing-kernel churn on gfx1201.
- Cache the selected hipBLASLt algorithm per (shape,dtype,transpose,device) so warm GEMMs skip AlgoGetHeuristic; recovers prefill parity with rocBLAS.
- Probe GEMM input-type support (bf16/fp8 e4m3/e5m2/int8) once per device at first use and print a capability table; select precision via enum instead of an arch-string match.
- Add hipblaslt_gemm_fp8_raw (e4m3 inputs, scale pointers, bf16 out, best-algo tuned) primitive for the gfx1201 fp8 path.
- Gate allocator slab hints (hipMemAdvise/prefetch) to integrated GPUs only.
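
The algorithm cache amounts to memoizing an expensive selection step on a tuple key. A minimal sketch, assuming a hypothetical `AlgoCache` class and a stand-in `fake_heuristic` function in place of the real hipBLASLt heuristic query (no hipBLASLt call is made here):

```python
class AlgoCache:
    """Memoize algorithm selection per (shape, dtype, transpose, device)."""

    def __init__(self, heuristic):
        self._heuristic = heuristic
        self._cache = {}

    def select(self, m, n, k, dtype, trans_a, trans_b, device):
        key = (m, n, k, dtype, trans_a, trans_b, device)
        if key not in self._cache:
            self._cache[key] = self._heuristic(*key)  # cold path only
        return self._cache[key]


calls = []


def fake_heuristic(*key):
    """Stand-in for the expensive query; records each invocation."""
    calls.append(key)
    return f"algo-{len(calls)}"


cache = AlgoCache(fake_heuristic)
first = cache.select(4096, 4096, 512, "bf16", False, True, 0)
warm = cache.select(4096, 4096, 512, "bf16", False, True, 0)
other = cache.select(4096, 4096, 512, "bf16", False, True, 1)
print(first, warm, other, len(calls))
```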

…e, bf16) Dequantize packed affine weights straight to e4m3 (no bf16 intermediate) and cast activations to e4m3, then run the projection GEMM on fp8 matrix cores via hipblaslt_gemm_fp8_raw, descaled back to bf16. Per-tensor weight scale is derived from quant-param endpoints (no full-weight pass). Capability-gated to devices with e4m3 kernels; bf16 path elsewhere. ~+20% warm prefill on gfx1201.

free() ran a blocking hipFree on the completion-worker thread when the reuse cache was full; on the APU's fine-grained unified memory that free waits on GPU completions the worker itself delivers — a self-deadlock that wedged decode under heavy async load (MTP speculative decode). Defer such frees to a pending list drained by malloc on the eval thread, where blocking is safe. Also size the integrated memory_limit_ to system RAM (the unified/GTT pool the allocations actually draw from) rather than the device VRAM figure, so the reuse pool never evicts mid-generation.
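
The deferral pattern can be sketched in a few lines. This is an illustration only: `DeferredFreeAllocator` and the `blocking_free` callback are hypothetical stand-ins, and draining the list outside the lock is a choice made for this sketch rather than something the commit message states.

```python
import threading


class DeferredFreeAllocator:
    """Frees requested on the worker thread are queued, not executed."""

    def __init__(self, blocking_free):
        self._blocking_free = blocking_free  # stand-in for a blocking hipFree
        self._lock = threading.Lock()
        self._pending = []

    def free_from_worker(self, buf):
        # Never block here: the worker delivers the completions a free waits on.
        with self._lock:
            self._pending.append(buf)

    def malloc(self, size):
        # Runs on the eval thread, where blocking is safe.
        with self._lock:
            pending, self._pending = self._pending, []
        for buf in pending:
            self._blocking_free(buf)
        return bytearray(size)


freed = []
alloc = DeferredFreeAllocator(freed.append)
alloc.free_from_worker("buf-a")
alloc.free_from_worker("buf-b")
freed_before_malloc = list(freed)
new_buf = alloc.malloc(16)
print(freed_before_malloc, freed)
```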

…adlock

clear_cache() freed every cached buffer with a blocking hipFree while holding the allocator mutex. On unified memory that free waits for outstanding GPU work whose completion the worker thread delivers, and the worker frees through the same mutex, so a long-prompt prefill (large cache + many in-flight frees) deadlocked with the GPU idle. Synchronize the device first so the frees have nothing to wait on, and release any deferred frees in the same pass.

…ync) Adopt the CUDA backend's stream-ordered allocation model. Primitive output buffers allocate from a per-device hipMemPool via malloc_async(size, encoder) on the encoder's stream, and free non-blocking via hipFreeAsync on that same stream so the frees retire in order behind the buffer's last use and the pool reclaims memory (a separate free stream never executes mid-forward, leaking VRAM). CPU access to pool buffers (device>=0, non-coherent) is served by the existing pinned host-shadow path. Wired malloc_async into every primitive that allocates an output, mirroring the CUDA backend: copy, binary_two, reductions, softmax, logsumexp, scan, norms, rope, random, arange, sort, indexing, attention (sdpa/flash/wmma), conv, distributed, quantized (qmm/gather/convert_fp8), matmul. The pool is always on where the device supports memory pools. Stream-less allocations (model load, KV, non-wired ops) stay on the unified path with deferred frees off the completion-worker thread. clear_cache trims the pool instead of blocking-freeing under handler pressure. Verified stable on gfx1151 (APU) and gfx1201 (R9700) across prefill, decode, and MTP: D1 297 pp/s / 47.8 tps, D0 247 pp/s / 42.1 tps; no wedge, no OOM.

…ture

During capture the async pipeline inflates the input buffer use_count, so can_donate fails and the update copies into a fresh buffer; the captured graph then reconstructs (frozen capture input + current row) every replay and loses accumulation (growing KV cache freezes -> repeated tokens). Force the in-place donation for a contiguous, fully-materialized buffer while a graph is being captured.

Add a mark-based rewind so per-token sampling allocations reuse [mark, ...) while the captured graph's deterministic buffer region [0, mark) stays reserved across replays.

- malloc_async routes through DecodeArena during capture (was emitting MemAlloc graph nodes that fail on the 2nd replay with 'invalid argument')
- DecodeArena: reserve 16384 descriptors so the descriptor vector never reallocates (returned RocmBuffer* point into it; realloc dangled them -> heap corruption)
- DecodeArena::reset_to(byte_mark, desc_mark) rewinds BOTH counters so the graph region stays reserved while per-token sampling reuses the tail
- is_hipblaslt_available() returns false during capture (force rocBLAS): a warm hipBLASLt handle still runs AlgoGetHeuristic/workspace hipMalloc that invalidates the capture

With these + the DynamicSliceUpdate donation fix, capture-once graph decode replays the full forward coherently on gfx1151.
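
The mark-and-rewind behavior can be modeled with a bump allocator. The sketch below is a simplified stand-in with a hypothetical `Arena` class: descriptors are `(offset, size)` tuples rather than buffer objects, and the fixed descriptor cap plays the role of the reserved descriptor storage.

```python
class Arena:
    """Bump allocator with a two-counter mark (bytes and descriptors)."""

    def __init__(self, capacity, max_descriptors):
        self.capacity = capacity
        self.max_descriptors = max_descriptors
        self.offset = 0
        self.descs = []

    def alloc(self, size):
        if self.offset + size > self.capacity:
            raise MemoryError("arena exhausted")
        if len(self.descs) >= self.max_descriptors:
            raise MemoryError("descriptor table full")
        desc = (self.offset, size)
        self.descs.append(desc)
        self.offset += size
        return desc

    def mark(self):
        return self.offset, len(self.descs)

    def reset_to(self, byte_mark, desc_mark):
        # Rewind both counters; everything below the mark stays reserved.
        self.offset = byte_mark
        del self.descs[desc_mark:]


arena = Arena(capacity=1024, max_descriptors=16)
g1, g2 = arena.alloc(256), arena.alloc(256)  # "graph" region [0, mark)
mark = arena.mark()
s1 = arena.alloc(64)                         # per-token allocation
arena.reset_to(*mark)
s2 = arena.alloc(64)                         # reuses the same tail offset
print(mark, s1, s2)
```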

…6 verified coherent

After a capture-once graph is built, set_paused(true) keeps the arena backing valid (captured-graph buffers stay at baked addresses) but routes per-token sampling allocations to the pool, so sampling can't clobber graph buffers and corrupt the next replay. Fixes replay token N+1 corruption from arena reset_to.

…[gated off] Foundation mirroring the CUDA backend: CommandEncoder gains add_kernel_node / add_kernel_node_raw, a build_graph_ accumulator, dependency tracking in set_input_array/set_output_array, needs_commit(), and commit() that builds the per-eval HIP graph, reuses the exec via hipGraphExecUpdate (LRU keyed on topology hash), and submits one hipGraphLaunch. eval.cpp wires needs_commit/ commit. hipBLASLt workspace pre-allocated so capture never hipMallocs. Gated behind MLX_USE_HIP_GRAPHS (default OFF) — default build is unchanged eager (verified coherent, 41 tok/s on gfx1151). The graphs-ON path currently uses a per-lambda stream-capture bridge in launch_kernel which DEADLOCKS on the first eval (library/alloc calls under capture) — to be replaced by real per-kernel migration to add_kernel_node (host-side node construction).

…des) Convert elementwise (unary/binary/binary_two/ternary), norms (rms_norm/ layer_norm), softmax/logsumexp, scan, arg_reduce, sort, rope, indexing (gather/scatter/slice_update/masked_scatter), random, and attention (sdpa/flash/flash_wmma) launch sites from launch_kernel(lambda) to encoder.add_kernel_node(&kernel, grid, block, smem, args...). Fix the add_kernel_node_ex param marshalling to strip const (gpu_ptr returns const for const inputs). Graphs-OFF (default) is unchanged immediate-launch and builds clean; sets up automatic per-eval graph batching when graphs-ON. Residual launch_kernel sites (memsets, rocprim sort path, copy/ subdir, KV helpers, JIT custom_kernel/compiled) still pending migration.

Wave 2: copy/ subdir, reduce/ subdir (row/col/all/init), quantized (affine_quantize, fp_quantize, convert_fp8), qmm.hip (~63 sites: qmv/qvm tiled+warp+gather, all bit/group/dtype combos), gemv.hip. Builds clean, graphs-OFF unchanged. Residual launch_kernel: copy/arg_reduce memsets (-> memset nodes), JIT custom_kernel/compiled, GEMM library (rocblas/ hipblaslt), gemv malloc fallback, rocprim sort.

…idge add_kernel_node_ex now copies arg VALUES into a heap pack kept alive through commit() (HIP graph nodes reference kernelParams until instantiate/exec-update, after which the pack is cleared) — fixes dangling kernelParams. The per-op micro-capture bridge in launch_kernel is now behind MLX_HIP_GRAPH_BRIDGE. graphs-OFF (default) unchanged.

Diagnostic: pure add_kernel_node kernel-node graphs launch correctly on this ROCm build (model-load evals pass). Remaining graphs-ON blockers are the non-kernel residuals only: library GEMM (aborts/crashes under graph) and the child-graph bridge nodes. graphs-OFF (default) unaffected.

…iagnosis

…lifetime graphs-ON (MLX_USE_HIP_GRAPHS, default OFF) now RUNS end-to-end on the ROCm 7.13 runtime (7.12 segfaulted hipGraphLaunch). launch_kernel graph-splits un-graphable residuals (JIT module kernels, GEMM, memsets): flush+launch the accumulated kernel-node graph, run the residual immediately on the same stream, start a fresh graph. hipBLASLt forced to rocBLAS in graph mode (its lazy init aborts under graph activity). kernelParams arg-packs freed at synchronize (exec references them through async launch). KNOWN WIP: graphs-ON output is incorrect (incomplete set_input/output_array dependency edges -> races) and slower than eager due to graph-split fragmentation. Default graphs-OFF unchanged (41 tok/s).
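
The graph-splitting behavior, and why it fragments the work, can be shown with a toy encoder. This is a sketch with hypothetical names that only records what would be submitted; it does not touch HIP.

```python
class SplittingEncoder:
    """Accumulates kernel nodes and splits the graph at residual ops."""

    def __init__(self):
        self.pending = []
        self.submitted = []

    def add_kernel_node(self, name):
        self.pending.append(name)

    def flush(self):
        if self.pending:
            self.submitted.append(("graph", tuple(self.pending)))
            self.pending = []

    def launch_residual(self, name):
        # Flush the accumulated graph, run the residual eagerly on the same
        # stream, then continue with a fresh graph.
        self.flush()
        self.submitted.append(("eager", name))


enc = SplittingEncoder()
for op in ("rms_norm", "rope", "GEMM", "softmax", "GEMM", "add"):
    if op == "GEMM":
        enc.launch_residual(op)
    else:
        enc.add_kernel_node(op)
enc.flush()
print(enc.submitted)
```

Six operations turn into five submissions here, which is the kind of fragmentation the commit message blames for graphs-ON being slower than eager.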

graphs-ON (default OFF): graph nodes serialized into a linear chain in submission order (matches eager stream order; robust vs incomplete set_input/output_array edges) and arg-packs freed at synchronize. Runs on the 7.13 runtime without crashing but output is still incorrect (an unisolated race) and slower than eager due to graph-split fragmentation. Default graphs-OFF eager unchanged (41 tok/s coherent).

Bisection of the graphs-ON correctness bug (all on 7.13 runtime, graphs-OFF default unaffected, eager 41 tok/s coherent):

- 1 node/graph is ALSO wrong -> not multi-node dependency/race.
- exec-cache keyed by node-type-only collided distinct kernel sequences -> hipGraphExecUpdate mis-reused execs -> garbage. Now key by func ptr + dims.
- fresh hipGraphInstantiate per commit + destroy-at-synchronize (no reuse) -> segfaults; ExecUpdate-reuse -> runs but garbage.

Both point to a deeper hipGraph instantiate/exec instability for this GDN+MoE workload on ROCm 7.13. graphs-ON still not correct; eager + 7.13 is the working path.
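
The exec-cache collision is easy to reproduce in miniature. In this sketch the node dictionaries, kernel names, and key functions are hypothetical: two different kernel sequences share a key when only node types are hashed, and stop colliding once the function identity and launch dimensions are part of the key.

```python
def key_by_type(nodes):
    """Old key: node types only."""
    return tuple(n["type"] for n in nodes)


def key_by_func_and_dims(nodes):
    """New key: node type plus function identity and launch dimensions."""
    return tuple((n["type"], n["func"], n["grid"], n["block"]) for n in nodes)


seq_a = [
    {"type": "kernel", "func": "rms_norm", "grid": (40,), "block": (256,)},
    {"type": "kernel", "func": "rope", "grid": (32,), "block": (128,)},
]
seq_b = [
    {"type": "kernel", "func": "softmax", "grid": (32,), "block": (256,)},
    {"type": "kernel", "func": "gather", "grid": (8,), "block": (64,)},
]

print(key_by_type(seq_a) == key_by_type(seq_b))
print(key_by_func_and_dims(seq_a) == key_by_func_and_dims(seq_b))
```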

…solated

Standalone repro proved hipGraphAddKernelNode + tuple-marshaling are correct on 7.13 (identical to hipLaunchKernel). Bisection of full-forward graphs-ON:

- BUG1 buffer lifetime: graph nodes execute at commit, but the allocator frees intermediates at eval time -> reused before the graph runs -> segfault. Deferring frees (graph_active) prevents the segfault but balloons memory.
- BUG2 computation: even with buffers kept alive/non-aliased, output is garbage -> a remaining error in the full multi-kernel forward not reproduced by the single-kernel repro. Needs per-kernel eager-vs-graph output bisection.

Default graphs-OFF eager unchanged (41 tok/s coherent on 7.12 and 7.13).

NripeshN changed the title ~~[Experiment] ROCm backend initial push~~ [Experiment] ROCm backend Jun 16, 2025

NripeshN mentioned this pull request Sep 12, 2025

Add ROCm Support for AMD GPUs #2556

Open

Copilot AI review requested due to automatic review settings January 24, 2026 17:08

Copilot started reviewing on behalf of NripeshN January 24, 2026 17:09 View session

Copilot AI reviewed Jan 24, 2026

Geramy added 30 commits June 17, 2026 09:56

rocm: gate per-call hipBLASLt GEMM trace behind MLX_ROCM_GEMM_DEBUG

db19eae

The unconditional per-GEMM fprintf(stderr) serialized the host thread on every matmul (prefill hot path). Gate it behind an env flag (off by default).

rocm: trim over-verbose comments to one-line descriptions (comment-only)

00d6024

rocm: DecodeArena reset_to(mark) for capture-once graph replay

63d445c

Add a mark-based rewind so per-token sampling allocations reuse [mark, ...) while the captured graph's deterministic buffer region [0, mark) stays reserved across replays.

rocm: MLX_NO_HIPBLASLT env to force rocBLAS (diagnostic); rocBLAS bf1…

f730214

…6 verified coherent

rocm: graph node-type histogram + dot dump (MLX_HIP_GRAPH_DUMP) for d…

0908d96

…iagnosis


12 participants
