
[ROCm] Add HIP/ROCm backend support for AMD GPUs #327

Open
jeffdaily wants to merge 1 commit into brian-team:master from jeffdaily:moat-port

@jeffdaily


Summary

This adds a HIP/ROCm backend so Brian2CUDA runs on AMD GPUs. Brian2CUDA is a runtime code generator (it emits a standalone C++/CUDA project per simulation), so the port works at the generator level: a CUDA-to-HIP compatibility header maps the CUDA symbols the generated code uses to their HIP equivalents, and the device/makefile generation handles HIP-specific build requirements. The NVIDIA/CUDA path is unchanged.

The HIP backend is auto-detected when ROCm is present and CUDA is absent, or forced with USE_HIP=1. It's documented in the README.
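
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the code in device.py: the function name `use_hip_backend` is hypothetical, and treating "ROCm present" / "CUDA present" as "hipcc / nvcc found on PATH" is an assumption made for the sketch.

```python
import os
import shutil


def use_hip_backend(env=None, which=shutil.which):
    """Decide whether to select the HIP backend (illustrative sketch only).

    Assumption: toolchain presence is approximated by finding `hipcc` or
    `nvcc` on PATH; the real detection may use other signals.
    """
    env = os.environ if env is None else env
    # An explicit USE_HIP=1 forces the HIP backend regardless of what is installed.
    if env.get("USE_HIP") == "1":
        return True
    rocm_present = which("hipcc") is not None
    cuda_present = which("nvcc") is not None
    # Auto-detect: pick HIP only when ROCm is available and CUDA is not.
    return rocm_present and not cuda_present
```

Passing `env` and `which` as parameters keeps the rule testable on a machine that has neither toolchain installed.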

What is implemented

  • brianlib/cuda_to_hip.h -- compatibility header mapping the CUDA / cuRAND symbols the generated kernels use to HIP / hipRAND, so the CUDA-spelled generated sources compile under hipcc unchanged.
  • brianlib/spikequeue.h -- a wave-serialized spin-lock for AMD: CDNA wavefronts lack NVIDIA's independent thread scheduling, so the standard atomicCAS spin-lock can deadlock when lanes in one wavefront contend; the AMD path serializes per wavefront.
  • device.py + templates/makefile_hip -- HIP backend detection and HIP makefile generation (-fgpu-rdc for cross-TU device-symbol linking).
  • Preferences -- the HIP backend honors prefs.devices.hip_standalone.hip_backend.* (gpu_arch, rocm_path, extra_compile_args_hipcc, gpu_heap_size, gpu_id), mirroring the existing cuda_backend knobs.
  • Windows -- hipcc.exe handling, GPU-arch from preferences (rocminfo is absent), a distutils compiler-flag bypass, --rocm-device-lib-path/-I so clang finds the ROCm device bitcode and headers, and runtime fixes (absolute exe path, copying the ROCm runtime DLLs beside the executable, ROCM_KPACK_PATH for rocRAND kernel packages).
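
To make the build-flag side of the list above concrete, here is a hedged sketch of how preference values such as gpu_arch, rocm_path and extra_compile_args_hipcc might be turned into a hipcc command line. The helper `hipcc_flags` and its parameters are hypothetical and are not taken from the PR; the flags themselves (`-fgpu-rdc`, `--offload-arch`, `--rocm-path`, `--rocm-device-lib-path`) are existing clang/hipcc options.

```python
def hipcc_flags(gpu_arch, rocm_path=None, device_lib_path=None,
                extra_compile_args=()):
    """Assemble hipcc compile flags from preference-like values (sketch)."""
    # -fgpu-rdc enables relocatable device code, which is what lets device
    # symbols defined in one translation unit be referenced from another.
    flags = ["-fgpu-rdc", f"--offload-arch={gpu_arch}"]
    if rocm_path:
        flags.append(f"--rocm-path={rocm_path}")
    if device_lib_path:
        # Points clang at the ROCm device bitcode when it is not found by default.
        flags.append(f"--rocm-device-lib-path={device_lib_path}")
    flags.extend(extra_compile_args)
    return flags


# Example: flags for an MI250X (gfx90a) with one extra user-supplied option.
print(hipcc_flags("gfx90a", rocm_path="/opt/rocm", extra_compile_args=["-O3"]))
```

With relocatable device code, `-fgpu-rdc` generally has to be passed at the link step as well as at compile time, so a generated makefile would typically share this flag between both rules.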

Validation

GPU simulations were generated, compiled, and run, with results exercising the spike queue and synapses:

| GPU | Arch | OS / ROCm |
| --- | --- | --- |
| Instinct MI250X | gfx90a (CDNA2, wave64) | Linux, ROCm 7.2.1 |
| Radeon Pro W7800 | gfx1100 (RDNA3, wave32) | Linux, ROCm 7.2.1 |
| Radeon RX 9070 XT | gfx1201 (RDNA4, wave32) | Windows, ROCm 7.14 |

Covered: a 100-neuron LIF group, a source/target group with a 1 ms synaptic delay (the spike-queue spinlock path), and a 200-neuron recurrent network (~12k synapses, ~6500 spikes). The CUDA/NVIDIA path is unchanged and not affected by this backend.

This work was authored with the assistance of Claude, an AI assistant by Anthropic.

This adds HIP/ROCm support to brian2cuda, enabling GPU-accelerated neural network
simulations on AMD GPUs. A CUDA-to-HIP compatibility header maps CUDA symbols to
HIP equivalents at compile time, while the device detection and makefile
generation handle HIP-specific build requirements.

Key changes:

1. brianlib/cuda_to_hip.h: compatibility header mapping CUDA symbols to HIP.
2. brianlib/spikequeue.h: wave-serialized spin-lock for AMD GPUs to avoid
   deadlock on wave64 architectures (CDNA GPUs lack Independent Thread
   Scheduling, so the standard atomicCAS spin-lock can deadlock when multiple
   lanes contend within the same wavefront).
3. device.py: HIP backend detection and HIP makefile generation.
4. templates/makefile_hip: HIP-specific makefile with -fgpu-rdc for cross-TU
   device symbol linking.
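
The deadlock described in item 2 can be illustrated with a toy host-side model. This is a simplified simulation written for this description, not the HIP code in spikequeue.h: it assumes that a lane which wins the lock is masked off until the other lanes of its wavefront leave the spin loop, which is a deliberate simplification of real hardware scheduling.

```python
def naive_spinlock_lockstep(num_lanes, max_iterations=1000):
    """Toy model: all lanes of one wavefront spin on the same lock in lockstep.

    Returns True if every lane leaves the spin loop, False if lanes are
    still spinning after max_iterations (a stand-in for deadlock).
    """
    lock = 0
    spinning = list(range(num_lanes))
    for _ in range(max_iterations):
        still_spinning = []
        for lane in spinning:
            if lock == 0:
                lock = 1  # models atomicCAS(&lock, 0, 1) succeeding
            else:
                still_spinning.append(lane)
        spinning = still_spinning
        if not spinning:
            return True
        # Modelling assumption: the winning lane is masked off until the
        # losers leave the loop, so it never reaches the unlock.
    return False


def wave_serialized_lock(num_lanes, critical_section):
    """Toy model: lanes take the lock one at a time, so none spins on a peer."""
    lock = 0
    for lane in range(num_lanes):
        assert lock == 0  # the previous lane has already released the lock
        lock = 1
        critical_section(lane)
        lock = 0


order = []
wave_serialized_lock(64, order.append)
```

In this model a single lane always gets through the naive lock, while any wavefront with two or more contending lanes never does; the serialized variant completes for all 64 lanes because each lane releases the lock before the next one tries to take it.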

The HIP backend is auto-detected when ROCm is present and CUDA is absent, or can
be forced via USE_HIP=1. The README documents the ROCm/HIP build.

On Windows, five further fixes are needed: the hipcc.exe suffix; reading the GPU
arch from preferences before falling back to rocminfo (absent on Windows);
bypassing distutils' get_compiler_and_args (which returns None on Windows);
passing --rocm-device-lib-path / -I to hipcc so clang finds the ROCm device
bitcode libraries and headers; and, at run time, an absolute main.exe path,
copying the runtime DLLs beside the executable, and setting ROCM_KPACK_PATH so
rocrand finds its kernel packages.
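
The three run-time steps can be sketched as follows. The helper name `prepare_windows_runtime` and its directory parameters are hypothetical; only the absolute executable path, the DLL copy and the ROCM_KPACK_PATH variable come from the description above, and where the DLLs and kernel packages actually live is left to the caller.

```python
import os
import shutil
from pathlib import Path


def prepare_windows_runtime(exe, rocm_bin_dir, kpack_dir, env=None):
    """Return (absolute_exe_path, environment) for launching the simulation.

    Sketch only: rocm_bin_dir and kpack_dir are assumed inputs, not paths
    the PR is known to use.
    """
    env = dict(os.environ if env is None else env)
    # 1. Use an absolute path to the executable.
    exe_path = Path(exe).resolve()
    # 2. Copy the runtime DLLs next to the executable so the loader finds them.
    for dll in Path(rocm_bin_dir).glob("*.dll"):
        shutil.copy2(dll, exe_path.parent / dll.name)
    # 3. Tell rocRAND where its kernel packages are.
    env["ROCM_KPACK_PATH"] = str(kpack_dir)
    return exe_path, env
```

The returned pair could then be handed to something like `subprocess.run([str(exe_path)], env=env)`; copying `os.environ` rather than mutating it keeps the change local to the launched process.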

This work was authored with the assistance of Claude, an AI assistant by
Anthropic.

Test Plan:

Linux, AMD Instinct MI250X (gfx90a), ROCm 7.2.1:
```
export USE_HIP=1
python -c "
from brian2 import *
import brian2cuda
set_device('cuda_standalone', build_on_run=False)
G = NeuronGroup(100, 'dv/dt = -v/(10*ms) : 1', threshold='v>0.5', reset='v=0', method='linear')
G.v = 'rand()'
S = Synapses(G, G, on_pre='v_post += 0.1', delay=1*ms); S.connect(p=0.1)
run(1*ms)
device.build(directory='/tmp/brian2cuda_test', compile=True, run=True)
"
```
Compiles and runs, exercising the spike queue with synapses. Also validated on
gfx1100 (RDNA3, Linux) and gfx1201 (RDNA4, Windows): GPU simulations of LIF
groups, synaptic-delay spike queues, and a recurrent network pass.
@mstimberg

Member

Hi @jeffdaily, many thanks for this PR (and apologies for the late reaction). I am afraid I won't be able to review this right away, not least because I don't have ready access to a PC with an AMD card right now. Regarding the tests you ran, you state "GPU simulations were generated, compiled, and run, with results exercising the spike queue and synapses[...] Covered: a 100-neuron LIF group, a source/target group with a 1 ms synaptic delay (the spike-queue spinlock path), and a 200-neuron recurrent network (~12k synapses, ~6500 spikes)." Did you run any of the existing examples for that, and if so, which ones?

Please also note that we are in the process of clarifying our policy with respect to LLM-assisted contributions and I am hesitant to merge this before.

@jeffdaily

Author

Note: this reply was drafted by an AI assistant.

Hi @mstimberg, following up with concrete numbers. I've now run the bundled examples on the AMD hardware (gfx90a / MI250X, ROCm 7.2.1), each building, compiling and running through the HIP backend:

  • examples/cuba.py: 4000 neurons, 1 s -> 22,632 spikes (mean rate 5.66 Hz), the expected asynchronous-irregular regime. Running the identical model on cpp_standalone with the same seed gives 5.68 Hz (a 0.5% difference, consistent with the differing RNG streams), and the spike rasters are visually indistinguishable.
  • examples/cobahh.py: 4000 HH neurons, 1 s -> population rate ~30 Hz with the expected membrane traces.
  • examples/stdp.py: 10000 synapses, 100 s -> synaptic weights start uniform and become bimodal (clustered near 0 and g_max), the classic STDP result.
  • examples/mushroombody.py: 2500 neurons, 1 s -> projection-neuron, Kenyon-cell and extrinsic-neuron activity as expected.
  • examples/brunelhakim.py: 5000 neurons, runs and produces 1564 spikes.
  • The eight compartmental/*_cuda.py examples (bipolar cell, Hodgkin-Huxley, Rall, LFP, spike initiation, ...) all build and run.

Since these examples plot rather than assert on their outputs, I used the CUBA cpp_standalone vs cuda_standalone rate comparison above as a quantitative correctness check. A few of the plots produced on the AMD GPU:

CUBA (cuba.py) -- asynchronous-irregular firing
[Figure: CUBA raster on gfx90a]

COBAHH (cobahh.py) -- Hodgkin-Huxley network, ~30 Hz population rate
[Figure: COBAHH on gfx90a]

STDP (stdp.py) -- weights evolve from uniform to bimodal
[Figure: STDP weights on gfx90a]

Mushroom body (mushroombody.py)
[Figure: Mushroom body on gfx90a]
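
For reference, the CUBA rate comparison used as the quantitative check above amounts to the following arithmetic. The helper names and the 1% tolerance are assumptions chosen for this sketch; the spike count, neuron count, duration and the cpp_standalone rate are the figures quoted in this comment.

```python
def mean_rate_hz(n_spikes, n_neurons, duration_s):
    """Mean firing rate per neuron in Hz."""
    return n_spikes / (n_neurons * duration_s)


def relative_difference(a, b):
    """Symmetric relative difference between two positive rates."""
    return abs(a - b) / max(abs(a), abs(b))


# HIP run of cuba.py: 22,632 spikes from 4000 neurons over 1 s.
hip_rate = mean_rate_hz(22632, 4000, 1.0)
# Rate reported for the same model on cpp_standalone.
cpp_rate = 5.68

# Assumed tolerance: the two backends use different RNG streams, so only
# statistical agreement is expected, not identical spike trains.
assert relative_difference(hip_rate, cpp_rate) < 0.01
```

A check like this compares population statistics rather than individual spike times, which is the appropriate level of agreement when the random streams differ between backends.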

@jeffdaily

Author

> Please also note that we are in the process of clarifying our policy with respect to LLM-assisted contributions and I am hesitant to merge this before.

Hi @mstimberg. This is the actual Jeff replying now. Which is weird to have to say. There's no emergency hurry to get this PR landed. I'm fine either waiting for your LLM-assisted contrib guidance to materialize or your trying to find some AMD hardware to try it out yourself. I simply wanted to prove that this could run on some diverse AMD hardware on both Linux and Windows. I'm using your project and many others like it to prove this point, that AMD hardware and our ROCm software is ready. I'm finding that one of the last hurdles is simply doing the work of these porting efforts.

I alone don't have enough time to do all these ports myself, which is why I'm using Claude to assist me. Rest assured, I'm doing the final human review before the PR is created. It's by far the slowest part of this process, me. Otherwise, I can just guide Claude. It's doing remarkably well after all the work I did to seed its knowledge of how to port CUDA to HIP, but it still gets things wrong, or makes some editorial decisions that I wouldn't have made, and I have to correct it. Even considering my time spent correcting Claude, it's still far faster than I am. I hope you're able to accept this PR even though it was AI-assisted, knowing that a human reviewed it first and that we did run it on all the hardware mentioned in the PR body.
