[ROCm] Add AMD GPU support via HIP for tiny-vllm #2

Open: jeffdaily wants to merge 2 commits into jmaczan:main from jeffdaily:moat-port

This adds AMD GPU support to tiny-vllm through ROCm/HIP while leaving the existing NVIDIA/CUDA build unchanged.

The CUDA kernels and host code are reused as-is. A new src/cuda_to_hip.h compatibility header keeps the CUDA spellings in the source and aliases them to their HIP equivalents (runtime calls, cuBLAS -> hipBLAS, and the __nv_bfloat16 -> __hip_bfloat16 type) when building for AMD. The only kernel-source change is the warp-shuffle mask: HIP requires a 64-bit lane mask for __shfl_*_sync, so the hardcoded 0xffffffff becomes a WARP_FULL_MASK macro (0xffffffffffffffffULL on HIP, 0xffffffff on CUDA). The paged-attention reduction is wave-size agnostic, so the same source runs correctly on wave64 (gfx90a) and wave32 (gfx1100, gfx1201).
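The aliasing approach described above can be sketched as a small header. This is an illustrative guess at the shape of such a shim, not the contents of the PR's src/cuda_to_hip.h: the `USE_HIP` preprocessor symbol, the include guard name, and the particular set of aliased calls are assumptions chosen for the example. Only `WARP_FULL_MASK`, its two values, and the `__nv_bfloat16` -> `__hip_bfloat16` mapping are taken from the description.

```cpp
// Sketch of a cuda_to_hip.h-style compatibility shim (hypothetical contents).
#ifndef CUDA_TO_HIP_SKETCH_H
#define CUDA_TO_HIP_SKETCH_H

#if defined(USE_HIP)  // assumption: the AMD build defines this symbol
#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>
#include <hipblas/hipblas.h>

// Runtime API: the source keeps CUDA spellings, the preprocessor maps them to HIP.
#define cudaError_t            hipError_t
#define cudaSuccess            hipSuccess
#define cudaGetErrorString     hipGetErrorString
#define cudaMalloc             hipMalloc
#define cudaFree               hipFree
#define cudaMemcpy             hipMemcpy
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
#define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
#define cudaDeviceSynchronize  hipDeviceSynchronize

// cuBLAS -> hipBLAS (only a few handle-management names shown).
#define cublasHandle_t hipblasHandle_t
#define cublasCreate   hipblasCreate
#define cublasDestroy  hipblasDestroy

// bf16 element type.
#define __nv_bfloat16 __hip_bfloat16

// HIP's __shfl_*_sync functions take a 64-bit lane mask.
#define WARP_FULL_MASK 0xffffffffffffffffULL

#else
// CUDA build: every name above is native, so only the mask macro is defined.
#define WARP_FULL_MASK 0xffffffff
#endif

#endif  // CUDA_TO_HIP_SKETCH_H
```

With a header along these lines, a kernel would write `__shfl_xor_sync(WARP_FULL_MASK, value, offset)` once and get the correct mask width on either toolchain.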

CMakeLists.txt gains a USE_HIP option (default OFF). When OFF, the build is the existing CUDA configuration, unchanged. When ON, it enables the HIP language, compiles the sources with hipcc, and links hipBLAS. The GPU architecture is selected by the caller via CMAKE_HIP_ARCHITECTURES (it is not hardcoded):

    cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
    cmake --build build

The README's setup section documents the AMD build path alongside the existing NVIDIA instructions.

Validation

Built and exercised on real AMD GPUs -- gfx90a (MI250X), gfx1100 (Radeon Pro W7800), and gfx1201 (RX 9070 XT). On each, the HIP runtime, the bf16 embedding-gather kernel, the 64-bit-mask warp-shuffle reduction at 64 threads/block, and the hipBLAS bf16 GEMM all pass (the 64-bit mask fix confirmed on both wave64 and wave32).
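To illustrate why a 64-thread block can reduce correctly on both wave sizes, here is a host-side emulation of an xor-shuffle butterfly. It is a sketch, not the PR's kernel or test: `wave_butterfly_sum` and `block_sum` are hypothetical names, and the cross-wave step (adding lane 0 of each wave) is an assumed design. With 64 threads, wave64 hardware runs one wave and wave32 hardware runs two, and the emulation shows both paths producing the same total.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates an xor-shuffle butterfly inside each wave of `wave_size` lanes.
// Each step mirrors `v += __shfl_xor_sync(mask, v, offset)` on the device.
// After the loop, every lane holds the sum of its own wave.
std::vector<float> wave_butterfly_sum(std::vector<float> lanes, std::size_t wave_size) {
    assert(wave_size > 0 && (wave_size & (wave_size - 1)) == 0);  // power of two
    assert(lanes.size() % wave_size == 0);                        // whole waves only
    for (std::size_t offset = wave_size / 2; offset > 0; offset /= 2) {
        std::vector<float> next(lanes.size());
        for (std::size_t tid = 0; tid < lanes.size(); ++tid) {
            const std::size_t lane = tid % wave_size;
            const std::size_t wave_base = tid - lane;
            const std::size_t partner = wave_base + (lane ^ offset);
            next[tid] = lanes[tid] + lanes[partner];
        }
        lanes = next;
    }
    return lanes;
}

// Block-level sum: reduce within each wave, then add lane 0 of every wave.
float block_sum(const std::vector<float>& values, std::size_t wave_size) {
    const std::vector<float> reduced = wave_butterfly_sum(values, wave_size);
    float total = 0.0f;
    for (std::size_t base = 0; base < reduced.size(); base += wave_size) {
        total += reduced[base];
    }
    return total;
}
```

Calling `block_sum(values, 32)` and `block_sum(values, 64)` on the same 64 inputs should agree, which is the property the component test checks on real hardware.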

On gfx1100, full end-to-end inference was additionally validated: loading Llama 3.2 1B Instruct weights and running prefill+decode produces coherent, correct output (for example "What is 2+2?" -> 4 and "Capital of France?" -> Paris), exercising the complete path (embedding, 16 transformer layers with hipBLAS GEMMs, paged attention, SwiGLU MLP, lm_head). The CUDA build path is unchanged.

Authored with the assistance of Claude.

Test Plan:

    cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
    cmake --build build
    ./build/tiny-vllm   # with Llama 3.2 1B Instruct model.safetensors in CWD

On gfx1100 (ROCm 7.2.1) the four built-in prompts return correct answers. The
targeted GPU component tests (runtime, embedding-gather, 64-bit-mask shuffle,
hipBLAS bf16 GEMM) pass on gfx90a, gfx1100, and gfx1201.
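The component tests themselves are not shown in this PR. As one possible shape for the embedding-gather check, a host-side reference that a GPU result could be compared against might look like the following. Everything here is an assumption for illustration: the function names, the row-major `[vocab, hidden]` table layout, and the use of raw 16-bit storage with truncating float conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// bf16 stored as its raw 16 bits (the upper half of an IEEE-754 float).
using bf16_bits = std::uint16_t;

// Truncating conversion; exact for values that already fit in bf16.
inline bf16_bits float_to_bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return static_cast<bf16_bits>(u >> 16);
}

inline float bf16_to_float(bf16_bits b) {
    const std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Reference gather: copies row token_ids[t] of a row-major [vocab, hidden]
// table into row t of the output.
std::vector<bf16_bits> embedding_gather_ref(const std::vector<bf16_bits>& table,
                                            std::size_t hidden,
                                            const std::vector<std::int32_t>& token_ids) {
    assert(hidden > 0 && table.size() % hidden == 0);
    const std::size_t vocab = table.size() / hidden;
    std::vector<bf16_bits> out(token_ids.size() * hidden);
    for (std::size_t t = 0; t < token_ids.size(); ++t) {
        assert(token_ids[t] >= 0 && static_cast<std::size_t>(token_ids[t]) < vocab);
        const std::size_t row = static_cast<std::size_t>(token_ids[t]);
        for (std::size_t j = 0; j < hidden; ++j) {
            out[t * hidden + j] = table[row * hidden + j];
        }
    }
    return out;
}
```

Because a gather only copies bits, a device result can be compared against this reference for exact equality rather than within a tolerance.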

Authored with the assistance of Claude.

The documented HIP build did not tell CMake where ROCm is, so on a clean
install with /opt/rocm/bin not on PATH the configure fails to find the hip
and hipBLAS packages (hip_DIR-NOTFOUND). Add -DCMAKE_PREFIX_PATH=/opt/rocm
to the example so the command works as written.

Authored with the assistance of an AI coding agent.