Run LLMs on your AMD GPU with full hardware acceleration. No more CPU-only inference!
AMD left millions of consumer GPU owners behind with no Windows AI support. This toolkit fixes that.
Tested on AMD Radeon RX 6700 XT (gfx1031) + Ryzen 9 7900X
This toolkit lets you run AI models (like Llama, Mistral, Qwen) on your AMD graphics card instead of your CPU. Result: 3-5x faster inference.
| GPU | Architecture | GFX Version |
|---|---|---|
| RX 6700 XT | RDNA2 | gfx1031 |
| RX 6800/6900 | RDNA2 | gfx1030 |
| RX 7900 XT/XTX | RDNA3 | gfx1100 |
| Other RDNA2/3 | Check your GPU | See AMD docs |
You need these installed BEFORE starting:
Download Visual Studio Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/
- Run installer
- Select "Desktop development with C++"
- Install
Download CMake: https://cmake.org/download/
- Get "Windows x64 Installer"
- During install, select "Add CMake to PATH"
Install Ninja (either option works):

```powershell
# Option A: Using winget
winget install Ninja-build.Ninja

# Option B: Using pip
pip install ninja
```

Download the AMD HIP SDK: https://www.amd.com/en/developer/resources/rocm-hub.html
- Look for "HIP SDK for Windows"
- Download version 5.7.x
- Install to default location: `C:\Program Files\AMD\ROCm\5.7`
- Restart your PC after installation
Download Python: https://www.python.org/downloads/
- During install, check "Add Python to PATH"
- Press `Win + X`
- Click "Windows Terminal" or "PowerShell"
- Navigate to this folder:

```powershell
cd "C:\path\to\AMD-AI-Toolkit"
```

Then run the rocBLAS fix:

```powershell
.\fix_rocblas_local.ps1
```

This copies the required GPU kernel files. You only need to do this once.
Next, build llama-cpp-python with AMD GPU support:

```powershell
.\rebuild_hip_gfx1031.ps1
```

This takes 5-15 minutes. You'll see lots of compiler output - that's normal.
If you get errors: Make sure Visual Studio Build Tools is installed and restart your terminal.
You need a GGUF model file. Download a small one to test:
```powershell
# Create models folder
mkdir models

# Download TinyLlama (600MB) - good for testing
Invoke-WebRequest -Uri "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" -OutFile "models\tinyllama.gguf"
```

Or manually download any GGUF model from HuggingFace and put it in the `models` folder.
Run the test script:

```powershell
.\run_hip_model.ps1 hip_example.py
```

You should see:
- "AMD HIP Environment Configured"
- Model loading with "ROCm" device
- Generated text output
- Speed in tokens/sec
If it works, your AMD GPU is now accelerating AI inference!
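For reference, here is a minimal sketch of the same kind of test written directly against llama-cpp-python. This is not the repo's `hip_example.py`; the model path and prompt are placeholders, and it assumes the HIP build from the step above:

```python
import time
from llama_cpp import Llama

# n_gpu_layers=-1 asks llama-cpp-python to offload every layer to the GPU.
# With the HIP build from this toolkit, the load log should mention a ROCm device.
llm = Llama(model_path="models/tinyllama.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
elapsed = time.perf_counter() - start

print(out["choices"][0]["text"])
print(f"{out['usage']['completion_tokens'] / elapsed:.1f} tok/s")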
| Model | Size | GPU Speed | CPU Speed |
|---|---|---|---|
| TinyLlama 1.1B | 600MB | ~3.0 tok/s | ~1.0 tok/s |
| Qwen 7B | 4.4GB | ~0.6 tok/s | ~0.2 tok/s |
- Download a `.gguf` file from HuggingFace
- Put it in the `models` folder
- Run:

```powershell
.\run_hip_model.ps1 hip_autoconfig.py models\your-model.gguf
```

If you have models in Ollama, you can extract them:
```powershell
# 1. Find your model's info (replace YOUR_MODEL with model name like "llama2" or "mistral")
Get-Content "$env:USERPROFILE\.ollama\models\manifests\registry.ollama.ai\library\YOUR_MODEL\latest"

# 2. Look for the "digest" of the largest layer (the model weights)
#    It looks like: sha256:abc123def456...

# 3. Copy that file (replace THE_DIGEST with actual digest):
Copy-Item "$env:USERPROFILE\.ollama\models\blobs\sha256-THE_DIGEST" "models\my-model.gguf"

# 4. Run it:
.\run_hip_model.ps1 hip_autoconfig.py models\my-model.gguf
```
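If you extract Ollama models often, a small script can do the digest lookup for you. A hedged sketch, not part of this toolkit: it assumes Ollama's default Windows layout, and the model name is just an example.

```python
import json
import os
import shutil

# Hypothetical helper: copy the largest layer (the GGUF weights) of an Ollama model
# into the toolkit's models folder. Adjust MODEL to your model's name.
MODEL = "mistral"
home = os.path.expanduser("~")
manifest_path = os.path.join(home, ".ollama", "models", "manifests",
                             "registry.ollama.ai", "library", MODEL, "latest")

with open(manifest_path, "r", encoding="utf-8") as f:
    manifest = json.load(f)

# The model weights are the biggest layer in the manifest.
weights = max(manifest["layers"], key=lambda layer: layer["size"])
blob = weights["digest"].replace("sha256:", "sha256-")
src = os.path.join(home, ".ollama", "models", "blobs", blob)

os.makedirs("models", exist_ok=True)
shutil.copyfile(src, os.path.join("models", f"{MODEL}.gguf"))
print(f"Copied {src} ({weights['size'] / 1e9:.1f} GB)")
```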
| File | What It Does |
|---|---|
| `run_hip_model.ps1` | Sets up GPU environment and runs Python scripts |
| `rebuild_hip_gfx1031.ps1` | Builds llama-cpp-python with AMD GPU support |
| `fix_rocblas_local.ps1` | Fixes AMD library paths (run once) |
| `hip_example.py` | Basic test - loads model and generates text |
| `hip_autoconfig.py` | Smart loader - auto-detects best GPU settings |
| `gpu_thread_helper.py` | Thread-safe wrapper for multi-threaded servers |
| `xena.py` | Security-focused AI assistant example |
| `tools\llama-quantize.exe` | Compress models to smaller sizes |
Download a model and put it in the models folder. See Step 4 above.
- Make sure Visual Studio Build Tools is installed
- Restart your terminal after installing
- Run from a fresh PowerShell window
The `run_hip_model.ps1` script already sets the required environment variables, but if you're running manually:
```python
import os

# Use the first GPU and force ggml's quantized matrix-multiplication (MMQ) kernels
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["GGML_CUDA_FORCE_MMQ"] = "1"
```

Run `.\fix_rocblas_local.ps1` again.
- Check that AMD HIP SDK 5.7 is installed
- Restart PC after HIP SDK installation
- Rebuild with `.\rebuild_hip_gfx1031.ps1`
Edit `run_hip_model.ps1` and `rebuild_hip_gfx1031.ps1`:
- RX 6800/6900: Change `gfx1031` to `gfx1030`
- RX 7900: Change `gfx1031` to `gfx1100` and `10.3.0` to `11.0.0`
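The `10.3.0` / `11.0.0` strings follow the format of ROCm's `HSA_OVERRIDE_GFX_VERSION` environment variable. Assuming that is what the scripts set (an assumption, not confirmed here), you can also experiment with the override directly from Python before loading a model:

```python
import os

# Assumption: these overrides mirror what the toolkit's scripts configure.
# "11.0.0" corresponds to gfx1100 (RDNA3); use "10.3.0" for gfx1030-class RDNA2 cards.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # first GPU

# Import llama_cpp only after the environment is set so the override takes effect.
from llama_cpp import Llama
llm = Llama(model_path="models/tinyllama.gguf", n_gpu_layers=-1)
```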
Important: If you're building a web server that handles multiple requests, you'll hit a thread issue.
rocBLAS (AMD's GPU math library) initializes its context per-thread. Web frameworks like FastAPI/uvicorn handle requests in different threads, so GPU calls from those threads fail with:
```
GGML_ASSERT: ggml/src/ggml-cuda/ggml-cuda.cu:340: !"MUL_MAT failed, shared object initialization failed"
```
Route ALL GPU operations through a single dedicated thread. We've included a helper:
```python
# In your server code
from gpu_thread_helper import gpu_generate

# Instead of:
# response = llm.generate(prompt)

# Use:
response = gpu_generate(llm, prompt, max_tokens=500)
```

See `gpu_thread_helper.py` for the full implementation. Drop it into your project and wrap your generate calls.
```python
from concurrent.futures import ThreadPoolExecutor
import threading

# A single dedicated worker thread owns all GPU work, so rocBLAS initializes exactly once.
_gpu_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="GPU")
_gpu_lock = threading.Lock()

def gpu_generate(model, prompt, **kwargs):
    """Thread-safe GPU generation wrapper."""
    def _generate():
        with _gpu_lock:
            return model(prompt, **kwargs)
    return _gpu_executor.submit(_generate).result(timeout=600)
```

This ensures rocBLAS initializes once and all GPU work happens in that same thread.
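To make the pattern concrete, here is a sketch of how the wrapper could sit inside a FastAPI app. The route, request model, and model path are illustrative assumptions, not part of this toolkit, and it requires `fastapi` and `uvicorn` to be installed separately:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

from gpu_thread_helper import gpu_generate

app = FastAPI()
# Load the model once at startup; every request reuses this instance.
llm = Llama(model_path="models/tinyllama.gguf", n_gpu_layers=-1)

class Prompt(BaseModel):
    text: str
    max_tokens: int = 256

@app.post("/generate")
def generate(req: Prompt):
    # Whatever uvicorn worker thread handles the request, the actual GPU call
    # is funneled through the single dedicated GPU thread.
    out = gpu_generate(llm, req.text, max_tokens=req.max_tokens)
    return {"completion": out["choices"][0]["text"]}
```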
Make models smaller (uses less VRAM, slightly lower quality):
```powershell
.\tools\llama-quantize.exe models\big-model.gguf models\smaller-model.gguf q4_k_m
```

Quantization types (from largest to smallest):
- `f16` - Full precision (largest)
- `q8_0` - 8-bit (good quality)
- `q5_k_m` - 5-bit (balanced)
- `q4_k_m` - 4-bit (recommended)
- `q4_0` - 4-bit (smallest)
Note: You can only go DOWN in size, not up. Can't turn a q4 model into q8.
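As a rough rule of thumb, weight size is about parameter count x bits-per-weight / 8; real GGUF files run slightly larger because of quantization metadata, and the KV cache needs VRAM on top. A back-of-envelope sketch for a 7B-parameter model (the numbers are estimates, not measurements from this toolkit):

```python
# Approximate weight sizes for a 7B-parameter model at different quantizations.
params = 7e9
for name, bits in [("f16", 16), ("q8_0", 8), ("q5_k_m", 5), ("q4_k_m", 4)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
```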
- brknsoul/ROCmLibs - Tensile libraries
- likelovewant - Multi-arch support
- llama.cpp community
MIT License © 2026 Torsova LLC
Torsova LLC - R&D
- GitHub Issues: Bug reports & features
- Email: 413mrm@gmail.com
Don't go around the mountain. Don't go over it. Go through it.
---