[ROCm] Add AMD GPU support via HIP for tiny-vllm #2
Open
jeffdaily wants to merge 2 commits into
This adds AMD GPU support to tiny-vllm through ROCm/HIP while leaving the
existing NVIDIA/CUDA build unchanged.
The CUDA kernels and host code are reused as-is. A new src/cuda_to_hip.h
compatibility header keeps the CUDA spellings in the source and aliases them
to their HIP equivalents (runtime calls, cuBLAS -> hipBLAS, and the
__nv_bfloat16 -> __hip_bfloat16 type) when building for AMD. The only
kernel-source change is the warp-shuffle mask: HIP requires a 64-bit lane mask
for __shfl_*_sync, so the hardcoded 0xffffffff becomes a WARP_FULL_MASK macro
(0xffffffffffffffffULL on HIP, 0xffffffff on CUDA). The paged-attention
reduction is wave-size agnostic, so the same source runs correctly on wave64
(gfx90a) and wave32 (gfx1100, gfx1201).
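As a rough illustration of the approach described above (not the contents of src/cuda_to_hip.h itself), the sketch below shows the general shape such a header could take, followed by a host-side simulation of a shuffle-down reduction whose starting offset is derived from the wave width. The `USE_HIP` preprocessor guard, the particular set of aliased runtime calls, and the helper names `shfl_down_sim` and `wave_reduce_sum_sim` are assumptions made for this example; the PR's actual reduction may be structured differently.

```cpp
// Sketch only: names marked "assumed" are illustrative, not taken from the PR.
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(USE_HIP)  // assumed: the HIP build defines USE_HIP for the compiler
#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>
// Keep the CUDA spellings in the sources and map them to HIP equivalents.
#define cudaMalloc             hipMalloc
#define cudaFree               hipFree
#define cudaMemcpy             hipMemcpy
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
#define __nv_bfloat16          __hip_bfloat16
#define WARP_FULL_MASK         0xffffffffffffffffULL  // 64-bit lane mask on HIP
#else
#define WARP_FULL_MASK         0xffffffff             // 32-bit lane mask on CUDA
#endif

// Host-side stand-in for __shfl_down_sync: lane i reads lane i + delta and
// keeps its own value when the source lane falls outside the wave.
std::vector<float> shfl_down_sim(const std::vector<float>& lanes, std::size_t delta) {
    std::vector<float> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        const std::size_t src = i + delta;
        out[i] = (src < lanes.size()) ? lanes[src] : lanes[i];
    }
    return out;
}

// Tree reduction that starts at half the wave width instead of a hardcoded
// 16, so the same loop covers 32-lane and 64-lane waves. Lane 0 holds the sum.
float wave_reduce_sum_sim(std::vector<float> lanes) {
    const std::size_t width = lanes.size();
    assert(width > 0 && (width & (width - 1)) == 0);  // power-of-two width
    for (std::size_t offset = width / 2; offset > 0; offset /= 2) {
        const std::vector<float> shifted = shfl_down_sim(lanes, offset);
        for (std::size_t i = 0; i < width; ++i) {
            lanes[i] += shifted[i];
        }
    }
    return lanes[0];
}
```

Calling `wave_reduce_sum_sim` with 32 lanes and again with 64 lanes runs the same loop body in both cases; only the first offset differs. In device code the equivalent loop would typically take its starting offset from `warpSize` and pass `WARP_FULL_MASK` to `__shfl_down_sync`.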
CMakeLists.txt gains a USE_HIP option (default OFF). When OFF, the build is the
existing CUDA configuration, unchanged. When ON, it enables the HIP language,
compiles the sources with hipcc, and links hipBLAS. The GPU architecture is
selected by the caller via CMAKE_HIP_ARCHITECTURES (it is not hardcoded), e.g.:
cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build
The README's setup section documents the AMD build path alongside the existing
NVIDIA instructions.
Validation: built and exercised on real AMD GPUs -- gfx90a (MI250X), gfx1100
(Radeon Pro W7800), and gfx1201 (RX 9070 XT). On each, the HIP runtime, the
bf16 embedding-gather kernel, the 64-bit-mask warp-shuffle reduction at 64
threads/block, and the hipBLAS bf16 GEMM all pass (the 64-bit mask fix was
confirmed on both wave64 and wave32). On gfx1100, full end-to-end inference was
additionally validated: loading Llama 3.2 1B Instruct weights and running
prefill+decode produces coherent, correct output (for example "What is 2+2?"
-> 4 and "Capital of France?" -> Paris), exercising the complete path
(embedding, 16 transformer layers with hipBLAS GEMMs, paged attention, SwiGLU
MLP, lm_head). The CUDA build path is unchanged.
Test Plan:
cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build
./build/tiny-vllm # with Llama 3.2 1B Instruct model.safetensors in CWD
On gfx1100 (ROCm 7.2.1) the four built-in prompts return correct answers. The
targeted GPU component tests (runtime, embedding-gather, 64-bit-mask shuffle,
hipBLAS bf16 GEMM) pass on gfx90a, gfx1100, and gfx1201.
Authored with the assistance of Claude.
The documented HIP build did not tell CMake where ROCm is installed, so on a clean install where /opt/rocm/bin is not on PATH, configuration fails to find the hip and hipBLAS packages (hip_DIR-NOTFOUND). Add -DCMAKE_PREFIX_PATH=/opt/rocm to the example so the command works as written.
Authored with the assistance of an AI coding agent.