[ROCm] Add HIP/ROCm backend support for AMD GPUs by jeffdaily · Pull Request #327 · brian-team/brian2cuda

jeffdaily · 2026-06-09T17:18:39Z

Summary

This adds a HIP/ROCm backend so Brian2CUDA runs on AMD GPUs. Brian2CUDA is a runtime code generator (it emits a standalone C++/CUDA project per simulation), so the port works at the generator level: a CUDA-to-HIP compatibility header maps the CUDA symbols the generated code uses to their HIP equivalents, and the device/makefile generation handles HIP-specific build requirements. The NVIDIA/CUDA path is unchanged.

The HIP backend is auto-detected when ROCm is present and CUDA is absent, or forced with USE_HIP=1. It's documented in the README.

What is implemented

brianlib/cuda_to_hip.h -- compatibility header mapping the CUDA / cuRAND symbols the generated kernels use to HIP / hipRAND, so the CUDA-spelled generated sources compile under hipcc unchanged.
brianlib/spikequeue.h -- a wave-serialized spin-lock for AMD: CDNA wavefronts lack NVIDIA's independent thread scheduling, so the standard atomicCAS spin-lock can deadlock when lanes in one wavefront contend; the AMD path serializes per wavefront.
device.py + templates/makefile_hip -- HIP backend detection and HIP makefile generation (-fgpu-rdc for cross-TU device-symbol linking).
Preferences -- the HIP backend honors prefs.devices.hip_standalone.hip_backend.* (gpu_arch, rocm_path, extra_compile_args_hipcc, gpu_heap_size, gpu_id), mirroring the existing cuda_backend knobs.
Windows -- hipcc.exe handling, GPU-arch from preferences (rocminfo is absent), a distutils compiler-flag bypass, --rocm-device-lib-path/-I so clang finds the ROCm device bitcode and headers, and runtime fixes (absolute exe path, copying the ROCm runtime DLLs beside the executable, ROCM_KPACK_PATH for rocRAND kernel packages).

Validation

GPU simulations were generated, compiled, and run, with results exercising the spike queue and synapses:

GPU	Arch	OS / ROCm
Instinct MI250X	gfx90a (CDNA2, wave64)	Linux, ROCm 7.2.1
Radeon Pro W7800	gfx1100 (RDNA3, wave32)	Linux, ROCm 7.2.1
Radeon RX 9070 XT	gfx1201 (RDNA4, wave32)	Windows, ROCm 7.14

Covered: a 100-neuron LIF group, a source/target group with a 1 ms synaptic delay (the spike-queue spinlock path), and a 200-neuron recurrent network (~12k synapses, ~6500 spikes). The CUDA/NVIDIA path is unchanged and not affected by this backend.

This work was authored with the assistance of Claude, an AI assistant by Anthropic.

This adds HIP/ROCm support to brian2cuda, enabling GPU-accelerated neural network simulations on AMD GPUs. A CUDA-to-HIP compatibility header maps CUDA symbols to HIP equivalents at compile time, while the device detection and makefile generation handle HIP-specific build requirements. Key changes: 1. brianlib/cuda_to_hip.h: compatibility header mapping CUDA symbols to HIP. 2. brianlib/spikequeue.h: wave-serialized spin-lock for AMD GPUs to avoid deadlock on wave64 architectures (CDNA GPUs lack Independent Thread Scheduling, so the standard atomicCAS spin-lock can deadlock when multiple lanes contend within the same wavefront). 3. device.py: HIP backend detection and HIP makefile generation. 4. templates/makefile_hip: HIP-specific makefile with -fgpu-rdc for cross-TU device symbol linking. The HIP backend is auto-detected when ROCm is present and CUDA is absent, or can be forced via USE_HIP=1. The README documents the ROCm/HIP build. On Windows, five further fixes are needed: the hipcc.exe suffix; reading the GPU arch from preferences before falling back to rocminfo (absent on Windows); bypassing distutils' get_compiler_and_args (which returns None on Windows); passing --rocm-device-lib-path / -I to hipcc so clang finds the ROCm device bitcode libraries and headers; and, at run time, an absolute main.exe path, copying the runtime DLLs beside the executable, and setting ROCM_KPACK_PATH so rocrand finds its kernel packages. This work was authored with the assistance of Claude, an AI assistant by Anthropic. Test Plan: Linux, AMD Instinct MI250X (gfx90a), ROCm 7.2.1: ``` export USE_HIP=1 python -c " from brian2 import * import brian2cuda set_device('cuda_standalone', build_on_run=False) G = NeuronGroup(100, 'dv/dt = -v/(10*ms) : 1', threshold='v>0.5', reset='v=0', method='linear') G.v = 'rand()' S = Synapses(G, G, on_pre='v_post += 0.1', delay=1*ms); S.connect(p=0.1) run(1*ms) device.build(directory='/tmp/brian2cuda_test', compile=True, run=True) " ``` Compiles and runs, exercising the spike queue with synapses. Also validated on gfx1100 (RDNA3, Linux) and gfx1201 (RDNA4, Windows): GPU simulations of LIF groups, synaptic-delay spike queues, and a recurrent network pass.

mstimberg · 2026-06-18T14:48:26Z

Hi @jeffdaily, many thanks for this PR (and apologies for the late reaction). I am afraid I won't be able to review this right away – not the least because I don't have readily access to a PC with an AMD card right now. Regarding the tests you ran, you state "GPU simulations were generated, compiled, and run, with results exercising the spike queue and synapses[...] Covered: a 100-neuron LIF group, a source/target group with a 1 ms synaptic delay (the spike-queue spinlock path), and a 200-neuron recurrent network (~12k synapses, ~6500 spikes). ". Did you run any of the existing examples for that and if yes, which one?

Please also note that we are in the process of clarifying our policy with respect to LLM-assisted contributions and I am hesitant to merge this before.

jeffdaily · 2026-06-18T18:37:11Z

Note: this reply was drafted by an AI assistant.

Hi @mstimberg, following up with concrete numbers. I've now run the bundled examples on the AMD hardware (gfx90a / MI250X, ROCm 7.2.1), each building, compiling and running through the HIP backend:

examples/cuba.py: 4000 neurons, 1 s -> 22,632 spikes (mean rate 5.66 Hz), the expected asynchronous-irregular regime. Running the identical model on cpp_standalone with the same seed gives 5.68 Hz (a 0.5% difference, consistent with the differing RNG streams), and the spike rasters are visually indistinguishable.
examples/cobahh.py: 4000 HH neurons, 1 s -> population rate ~30 Hz with the expected membrane traces.
examples/stdp.py: 10000 synapses, 100 s -> synaptic weights start uniform and become bimodal (clustered near 0 and g_max), the classic STDP result.
examples/mushroombody.py: 2500 neurons, 1 s -> projection-neuron, Kenyon-cell and extrinsic-neuron activity as expected.
examples/brunelhakim.py: 5000 neurons, runs and produces 1564 spikes.
The eight compartmental/*_cuda.py examples (bipolar cell, Hodgkin-Huxley, Rall, LFP, spike initiation, ...) all build and run.

Since these examples plot rather than assert on their outputs, I used the CUBA cpp_standalone vs cuda_standalone rate comparison above as a quantitative correctness check. A few of the plots produced on the AMD GPU:

CUBA (cuba.py) -- asynchronous-irregular firing

COBAHH (cobahh.py) -- Hodgkin-Huxley network, ~30 Hz population rate

STDP (stdp.py) -- weights evolve from uniform to bimodal

Mushroom body (mushroombody.py)

jeffdaily · 2026-06-18T18:48:04Z

Please also note that we are in the process of clarifying our policy with respect to LLM-assisted contributions and I am hesitant to merge this before.

Hi @mstimberg. This is the actual Jeff replying now. Which is weird to have to say. There's no emergency hurry to get this PR landed. I'm fine either waiting for your LLM-assisted contrib guidance to materialize or your trying to find some AMD hardware to try it out yourself. I simply wanted to prove that this could run on some diverse AMD hardware on both Linux and Windows. I'm using your project and many others like it to prove this point, that AMD hardware and our ROCm software is ready. I'm finding that one of the last hurdles is simply doing the work of these porting efforts.

I alone don't have enough time to do all these ports myself, which is why I'm using Claude to assist me. Rest assured, I'm doing the final human review before the PR is created. It's by far the slowest part of this process, me. Otherwise, I can just guide Claude. It's doing remarkably well after all the work I did to seed its knowledge of how to port CUDA to HIP, but it still gets things wrong, or makes some editorial decisions that I wouldn't have done and I have to correct it. Even considering my time spent correcting Claude, it's still far faster than I am. I hope you're able to accept this PR even though it was AI-assisted, knowing that a human reviewed it first and that we did run it on all the hardware mentioned in the PR body.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ROCm] Add HIP/ROCm backend support for AMD GPUs#327

[ROCm] Add HIP/ROCm backend support for AMD GPUs#327
jeffdaily wants to merge 1 commit into
brian-team:masterfrom
jeffdaily:moat-port

jeffdaily commented Jun 9, 2026

Uh oh!

mstimberg commented Jun 18, 2026

Uh oh!

jeffdaily commented Jun 18, 2026

Uh oh!

jeffdaily commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jeffdaily commented Jun 9, 2026

Summary

What is implemented

Validation

Uh oh!

mstimberg commented Jun 18, 2026

Uh oh!

jeffdaily commented Jun 18, 2026

Uh oh!

jeffdaily commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants