[ROCm] Add AMD GPU support via HIP for tiny-vllm by jeffdaily · Pull Request #2 · jmaczan/tiny-vllm

jeffdaily · 2026-06-17T13:46:14Z

This adds AMD GPU support to tiny-vllm through ROCm/HIP while leaving the existing NVIDIA/CUDA build unchanged.

The CUDA kernels and host code are reused as-is. A new src/cuda_to_hip.h compatibility header keeps the CUDA spellings in the source and aliases them to their HIP equivalents (runtime calls, cuBLAS -> hipBLAS, and the __nv_bfloat16 -> __hip_bfloat16 type) when building for AMD. The only kernel-source change is the warp-shuffle mask: HIP requires a 64-bit lane mask for __shfl_*_sync, so the hardcoded 0xffffffff becomes a WARP_FULL_MASK macro (0xffffffffffffffffULL on HIP, 0xffffffff on CUDA). The paged-attention reduction is wave-size agnostic, so the same source runs correctly on wave64 (gfx90a) and wave32 (gfx1100, gfx1201).
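
To make the mask change and the wave-size-agnostic claim concrete, here is a minimal host-side sketch in plain C++. It is not the code from this PR: the USE_HIP preprocessor guard, the shfl_xor helper, and the wave_reduce_sum function are all hypothetical names chosen for illustration, and the butterfly (XOR) reduction is one common way to write such a reduction, assumed here rather than taken from the kernel source. The point is that when the loop bound is derived from the wave width instead of a hardcoded 32, the same code reduces a 32-lane or a 64-lane wave.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the macro described above. The guard name used by
// the real header is not shown in the PR, so USE_HIP here is an assumption.
#if defined(USE_HIP)
#define WARP_FULL_MASK 0xffffffffffffffffULL  // 64-bit lane mask for HIP
#else
#define WARP_FULL_MASK 0xffffffffu            // 32-bit lane mask for CUDA
#endif

// Host emulation of an XOR shuffle: lane i reads the value held by lane (i ^ offset).
std::vector<float> shfl_xor(const std::vector<float>& lanes, std::size_t offset) {
    std::vector<float> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        out[i] = lanes[i ^ offset];
    }
    return out;
}

// Butterfly sum reduction over one wave. The starting offset comes from the
// wave width, so nothing here assumes 32 lanes. Returns the total from lane 0.
float wave_reduce_sum(std::vector<float> lanes) {
    const std::size_t width = lanes.size();
    assert(width > 0 && (width & (width - 1)) == 0);  // width must be a power of two
    for (std::size_t offset = width / 2; offset > 0; offset >>= 1) {
        const std::vector<float> other = shfl_xor(lanes, offset);
        for (std::size_t i = 0; i < width; ++i) {
            lanes[i] += other[i];
        }
    }
    return lanes[0];
}
```

Calling wave_reduce_sum on a vector of 32 ones and on a vector of 64 ones exercises the wave32 and wave64 cases with the same function. On a real GPU the inner loops disappear, because each lane performs its own shuffle and add in parallel.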

CMakeLists.txt gains a USE_HIP option (default OFF). When OFF, the build is the existing CUDA configuration, unchanged. When ON, it enables the HIP language, compiles the sources with hipcc, and links hipBLAS. The GPU architecture is selected by the caller via CMAKE_HIP_ARCHITECTURES (it is not hardcoded):

cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build

The README's setup section documents the AMD build path alongside the existing NVIDIA instructions.

Validation

Built and exercised on real AMD GPUs -- gfx90a (MI250X), gfx1100 (Radeon Pro W7800), and gfx1201 (RX 9070 XT). On each, the HIP runtime, the bf16 embedding-gather kernel, the 64-bit-mask warp-shuffle reduction at 64 threads/block, and the hipBLAS bf16 GEMM all pass (the 64-bit mask fix confirmed on both wave64 and wave32).
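
As a small illustration of what a 64-thread block means on each architecture, the following C++ sketch works through the arithmetic. The helper names (waves_per_block, wave_index, lane_index) are hypothetical and do not come from the PR. A 64-thread block is exactly one wavefront on wave64 hardware and two wavefronts on wave32 hardware, which is presumably why that block size exercises both cases.

```cpp
// Number of wavefronts needed to cover a block (ceiling division).
constexpr int waves_per_block(int threads_per_block, int wave_size) {
    return (threads_per_block + wave_size - 1) / wave_size;
}

// Which wavefront a given thread of the block belongs to.
constexpr int wave_index(int thread, int wave_size) {
    return thread / wave_size;
}

// The thread's lane position inside its wavefront.
constexpr int lane_index(int thread, int wave_size) {
    return thread % wave_size;
}

// 64 threads per block: one wave on wave64 (gfx90a), two on wave32 (gfx1100, gfx1201).
static_assert(waves_per_block(64, 64) == 1, "one wavefront on wave64");
static_assert(waves_per_block(64, 32) == 2, "two wavefronts on wave32");

// Thread 40 is lane 40 of wave 0 on wave64, but lane 8 of wave 1 on wave32.
static_assert(wave_index(40, 64) == 0 && lane_index(40, 64) == 40, "wave64 mapping");
static_assert(wave_index(40, 32) == 1 && lane_index(40, 32) == 8, "wave32 mapping");
```

Because the checks are static_assert, the block is verified at compile time and needs no GPU.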

On gfx1100, full end-to-end inference was additionally validated: loading Llama 3.2 1B Instruct weights and running prefill+decode produces coherent, correct output (for example "What is 2+2?" -> 4 and "Capital of France?" -> Paris), exercising the complete path (embedding, 16 transformer layers with hipBLAS GEMMs, paged attention, SwiGLU MLP, lm_head). The CUDA build path is unchanged.

Authored with the assistance of Claude.

Test Plan

cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build
./build/tiny-vllm  # with Llama 3.2 1B Instruct model.safetensors in CWD

On gfx1100 (ROCm 7.2.1) the four built-in prompts return correct answers. The targeted GPU component tests (runtime, embedding-gather, 64-bit-mask shuffle, hipBLAS bf16 GEMM) pass on gfx90a, gfx1100, and gfx1201.

Authored with the assistance of Claude.

The documented HIP build did not tell CMake where ROCm is installed, so on a clean install without /opt/rocm/bin on PATH, configuration fails to find the hip and hipBLAS packages (hip_DIR-NOTFOUND). This commit adds -DCMAKE_PREFIX_PATH=/opt/rocm to the example so the command works as written.

Authored with the assistance of an AI coding agent.

jeffdaily added 2 commits June 17, 2026 13:45

jeffdaily wants to merge 2 commits into jmaczan:main from jeffdaily:moat-port
