Standalone GGUF read/write, byte-exact quantization, and CUDA-accelerated row kernels for C++, Python, NumPy, Torch, and CUDA.
libgguf vendors and adapts GGUF/GGML quantization kernels from llama.cpp into a reusable standalone library and toolkit. The goal is to make GGUF infrastructure available directly to conversion tools and downstream projects without requiring a two-stage route through llama.cpp binaries or partial Python/Torch-only implementations.
The repository currently contains native GGUF row kernels, Python bindings, NumPy and Torch backends, an optional CUDA Torch extension, safetensors-to-GGUF conversion paths, public lightweight GGUF reading/inspection and structural validation tools, benchmark tools, and tensor planning policy for real image-model conversion workflows. A fuller writer API is planned; today, GGUF writing logic is present primarily through converter paths.
| Field | Value |
|---|---|
| Status | active development |
| License | Apache-2.0 |
| Python | >=3.10 |
| Version | 0.1.0 |
| CUDA | optional, experimental, broad qtype coverage |
- Standalone native C++ GGUF quantization and dequantization library.
- Python bindings for native CPU row kernels.
- Extended NumPy GGUF quantization/dequantization backend.
- Extended Torch GGUF quantization/dequantization backend.
- Optional CUDA quantization and dequantization kernels exposed through a Torch extension.
- Native low-memory safetensors-to-GGUF conversion executable.
- Experimental/internal Python conversion helper API for safetensors/ckpt workflows.
- Experimental public GGUF reader API for metadata, tensor descriptors, tensor iteration, raw tensor byte reads, and structural validation.
- Deterministic policy-based tensor planning for real image-model GGUF conversion.
- Benchmark suite for native, Torch, and CUDA paths.
- Planned fuller GGUF writer API.
- Byte-exact quantization/dequantization against the native CPU reference path where supported.
- Broad CUDA quantization and dequantization qtype coverage.
- Stack-free near-roofline CUDA dequantization across tested qtypes.
- Very fast CUDA quantization for Q/K/TQ/MXFP4/NVFP4 families, with IQ kernels improved and still the active optimization frontier.
- SIMD/threaded native CPU backend.
- Low-memory native converter path for safetensors-to-GGUF conversion.
- Multiple backend implementations for parity testing and integration.
- Lightweight GGUF reader API for metadata, tensor descriptors, and raw tensor bytes.
| Backend | Purpose | Status |
|---|---|---|
| native C++ CPU | Reference row quant/dequant kernels, SIMD/threaded CPU paths, shared library, C ABI | active |
| Python bindings | libgguf row APIs and native converter bridge | active |
| libgguf_numpy | NumPy quant/dequant implementation for parity testing and integration | active |
| libgguf_torch | Torch-native quant/dequant implementation for parity testing and integration | active |
| libgguf_cuda | Optional Torch CUDA extension with direct quant/dequant kernels | experimental |
| libgguf_quantize_gguf | Low-memory C++ safetensors-to-GGUF conversion executable | active, Q/K-focused |
| Python conversion helper | Import-level helper over native bindings and safetensors/ckpt loading | experimental/internal |
Editable development install:
python -m pip install -e .
Python conversion helper dependencies:
python -m pip install -e ".[quantize]"
CUDA extension dependencies:
python -m pip install -e ".[cuda]"
Core dependency: numpy. Optional extras: cuda, quantize, and test.
The build backend is scikit-build-core. Native builds require CMake >=3.18 and C++17. CUDA kernel builds require nvcc and the CUDA toolkit; the optional Torch CUDA extension additionally requires importable torch and Torch CMake metadata.
Useful CMake options:
- LIBGGUF_CPU_BACKEND=REF|SSE2|SSE4_1|AVX2: native CPU row backend to compile, default REF.
- LIBGGUF_BUILD_CUDA_KERNELS=AUTO|ON|OFF: optional CUDA kernel targets, including the Torch extension when Torch is available, default AUTO.
- LIBGGUF_BUILD_TOOLS=ON: build native command-line tools, default ON.
- LIBGGUF_BUILD_BENCHMARKS=OFF: build native benchmark binaries, default OFF.
Native Python row kernels:
import numpy as np
import libgguf
x = np.random.default_rng(0).normal(size=(4, 4096)).astype(np.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K
q = libgguf.quantize_rows(x, qtype)
y = libgguf.dequantize_rows(q, qtype, n_per_row=4096)
Experimental CUDA Torch extension:
import torch
import libgguf
import libgguf.libgguf_cuda as gguf_cuda
rows, width = 4, 4096
tensor_cuda = torch.randn(rows, width, device="cuda", dtype=torch.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K
q = gguf_cuda.quantize(tensor_cuda, int(qtype))
y = gguf_cuda.dequantize(q, int(qtype), rows, width, torch.float16)
Python entry points:
- gguf-inspect: GGUF metadata and tensor descriptor inspection.
- gguf-validate: structural GGUF validation without reading tensor payload bytes.
- gguf-compare: GGUF tensor descriptor comparison with optional metadata and payload byte checks.
Native executable:
libgguf_quantize_gguf: low-memory C++ safetensors-to-GGUF converter. The native executable is currently Q/K-focused; non-Q/K quantization families are not supported by this executable yet.
Common conversion shape:
libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf
libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf --scratch-bytes 33554432
The Python conversion helper API remains experimental/internal and requires the quantize extra when used directly. The old Python conversion wrapper modules are retired; use libgguf_quantize_gguf for command-line conversion.
See docs/cli.md for implemented options.
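For scripted conversion, the two invocations above can be assembled from Python. The following is a minimal sketch: `build_convert_command` is a hypothetical helper, not part of libgguf, and it uses only the flags shown above. Note that 33554432 bytes is 32 MiB.

```python
import shlex


def build_convert_command(src, dst, qtype, scratch_bytes=None):
    """Assemble an argv list for libgguf_quantize_gguf (hypothetical helper)."""
    cmd = ["libgguf_quantize_gguf", "--src", src, "--qtype", qtype, "--dst", dst]
    if scratch_bytes is not None:
        # Only flags that appear in this README are used here.
        cmd += ["--scratch-bytes", str(scratch_bytes)]
    return cmd


cmd = build_convert_command(
    "model.safetensors",
    "model-Q4_K_M.gguf",
    "Q4_K_M",
    scratch_bytes=32 * 1024**2,  # 32 MiB == 33554432 bytes
)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True), assuming the executable is on PATH.
```

Passing an argv list rather than a shell string avoids quoting problems when model paths contain spaces.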
The experimental public reader API opens GGUF files without reading tensor payloads until requested:
import libgguf
info = libgguf.open_gguf("model.gguf")
for tensor in info.iter_tensors():
    print(tensor.name, tensor.shape, tensor.qtype)
raw = info.read_tensor_bytes(info.tensors[0], offset=0, size=128)
open_gguf, inspect_gguf, and read_gguf_header currently share the same lightweight implementation. See docs/python-api.md.
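To illustrate why a reader can inspect a file without touching tensor payloads, here is a sketch of parsing the fixed-size GGUF header with only the standard library. It is not libgguf's implementation: `read_header` is a hypothetical name, and the sketch assumes the little-endian layout used by GGUF version 2 and later (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count).

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor count, metadata KV count


def read_header(stream):
    """Parse the fixed-size GGUF header; no tensor payload bytes are read."""
    raw = stream.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file too short for a GGUF header")
    magic, version, n_tensors, n_kv = HEADER.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


# Synthetic header for illustration: version 3, 2 tensors, 5 metadata entries.
fake = io.BytesIO(HEADER.pack(GGUF_MAGIC, 3, 2, 5))
header = read_header(fake)
print(header)
```

Metadata entries and tensor descriptors follow the header; tensor data comes last, which is what makes lazy payload reads possible.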
Conversion uses deterministic tensor planning, not magic. Current policies are:
- uniform: quantize eligible 2D weight tensors uniformly.
- comfy: use architecture-aware skip and high-precision patterns similar to image-model GGUF conversion workflows.
- dynamic: build on comfy with deterministic tensor-role and layer-position promotion logic, including ongoing investigation of Unsloth Dynamic-like behavior.
All policies support tensor overrides plus include/exclude patterns. See docs/policy.md.
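The idea behind a uniform policy with overrides and include/exclude patterns can be sketched in a few lines. This is an illustration only: `plan_uniform`, the precedence order (override, then exclude, then include), the `.weight` suffix check, and the F16 fallback are all assumptions made for the example, not libgguf's actual rules. See docs/policy.md for the real behavior.

```python
from fnmatch import fnmatch


def plan_uniform(tensors, qtype, include=("*",), exclude=(), overrides=None):
    """Return {name: qtype} for a toy 'uniform' policy (hypothetical helper).

    tensors maps tensor name -> shape tuple.
    Assumed precedence: override > exclude > include > fallback to F16.
    """
    overrides = overrides or {}
    plan = {}
    for name in sorted(tensors):  # sorted so the plan is deterministic
        shape = tensors[name]
        if name in overrides:
            plan[name] = overrides[name]
        elif any(fnmatch(name, pat) for pat in exclude):
            plan[name] = "F16"
        elif (
            len(shape) == 2
            and name.endswith(".weight")
            and any(fnmatch(name, pat) for pat in include)
        ):
            plan[name] = qtype
        else:
            plan[name] = "F16"  # non-2D or non-weight tensors stay unquantized
    return plan


tensors = {
    "blocks.0.attn.weight": (4096, 4096),
    "blocks.0.attn.bias": (4096,),
    "blocks.0.norm.weight": (4096,),
    "final.proj.weight": (4096, 64),
}
plan = plan_uniform(
    tensors, "Q4_K", exclude=("final.*",), overrides={"blocks.0.attn.bias": "F32"}
)
for name, chosen in plan.items():
    print(name, chosen)
```

Because the plan is a pure function of tensor names, shapes, and patterns, the same inputs always yield the same plan.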
The public enum and row APIs cover these storage and quantization families:
- Q1_0
- Q4_0, Q4_1
- Q5_0, Q5_1
- Q8_0
- Q2_K, Q3_K, Q4_K, Q5_K, Q6_K
- IQ1_S, IQ1_M
- IQ2_XXS, IQ2_XS, IQ2_S
- IQ3_XXS, IQ3_S
- IQ4_NL, IQ4_XS
- TQ1_0, TQ2_0
- MXFP4, NVFP4
- F32, F16, BF16 storage
Exact support varies by backend and converter path. See docs/support-matrix.md.
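Each block-quantized type stores a fixed number of bytes per block of elements, so the encoded size of a row follows directly from its width. The sketch below covers a small subset of types using the upstream GGML block layouts. `BLOCK_LAYOUT` and `encoded_row_bytes` are hypothetical names for this example, not libgguf APIs.

```python
# (elements per block, bytes per block), per the upstream GGML block layouts.
BLOCK_LAYOUT = {
    "Q4_0": (32, 18),    # fp16 scale + 16 bytes of 4-bit values
    "Q8_0": (32, 34),    # fp16 scale + 32 int8 values
    "Q4_K": (256, 144),  # super-block of 256 elements
    "Q6_K": (256, 210),
}


def encoded_row_bytes(qtype, n_per_row):
    """Bytes needed to store one row of n_per_row elements in the given qtype."""
    block_elems, block_bytes = BLOCK_LAYOUT[qtype]
    if n_per_row % block_elems:
        raise ValueError(f"{qtype} rows must be a multiple of {block_elems} elements")
    return n_per_row // block_elems * block_bytes


for name in BLOCK_LAYOUT:
    print(name, encoded_row_bytes(name, 4096))
```

The divisibility check is also why some tensors cannot be quantized to a K-family type at all: their row width is not a multiple of 256.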
Benchmarks are representative development results on an RTX 3090, not universal performance claims. For shape 11008x4096, recent CUDA dequantization results show tested qtypes running stack-free at roughly 0.23-0.28 ms, around 778-817 GB/s, with low register counts and about 65x-98x speedup versus the CPU default path for the sampled qtypes.
Representative CUDA dequant rows:
| qtype | ms | GB/s | speedup vs CPU default |
|---|---|---|---|
| Q1_0 | 0.233 | 799.6 | 93.3x |
| Q8_0 | 0.279 | 817.1 | 65.5x |
| Q4_K | 0.254 | 811.1 | 78.8x |
| Q5_K | 0.259 | 814.8 | 79.5x |
| Q6_K | 0.267 | 814.5 | 75.6x |
| IQ2_XS | 0.246 | 786.6 | 98.3x |
| IQ4_XS | 0.254 | 803.6 | 72.5x |
| TQ1_0 | 0.237 | 802.4 | 81.6x |
| TQ2_0 | 0.239 | 802.9 | 87.9x |
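As a sanity check on how time and throughput relate, the Q4_K row can be roughly reproduced with simple arithmetic. This sketch assumes that throughput counts encoded bytes read plus decoded bytes written, that the output is float32, and that 1 GB is 10^9 bytes. Those are assumptions inferred from the numbers, not statements from the benchmark tooling; docs/benchmarks.md is the authoritative description.

```python
ROWS, WIDTH = 11008, 4096
Q4_K_BLOCK_ELEMS, Q4_K_BLOCK_BYTES = 256, 144  # upstream GGML Q4_K block layout


def dequant_gbps(ms, out_bytes_per_elem=4):
    """Effective GB/s, assuming (encoded bytes read + decoded bytes written) / time."""
    n = ROWS * WIDTH
    bytes_read = n // Q4_K_BLOCK_ELEMS * Q4_K_BLOCK_BYTES
    bytes_written = n * out_bytes_per_elem  # float32 output is an assumption
    return (bytes_read + bytes_written) / (ms * 1e-3) / 1e9


estimate = dequant_gbps(0.254)
print(f"{estimate:.1f} GB/s estimated vs 811.1 GB/s reported")
```

The estimate lands within about one percent of the reported figure; the remaining gap is consistent with the millisecond column being rounded to three decimals.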
CUDA quantization is strong for the Q/K/TQ/MXFP4/NVFP4 families. IQ quant kernels are exact on the checked rows and have improved significantly, but they remain the active optimization frontier.
See docs/benchmarks.md for detailed tables and metrics.
The native CPU path is the reference path. CUDA, NumPy, and Torch implementations are tested for byte exactness where supported: same input, qtype, and shape should produce identical encoded bytes. Dequantization checks compare decoded output for a fixed destination dtype. Frozen golden fixtures supplement generated CPU-reference checks.
See docs/correctness.md.
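The byte-exactness idea can be shown with a self-contained parity check between two independent encoders. The sketch below uses a simplified Q8_0-style encoder (fp16 scale followed by 32 int8 values per block) modeled on the upstream reference layout and written in NumPy. It is not libgguf's test harness, and all function names are hypothetical.

```python
import numpy as np

QK8_0 = 32  # elements per Q8_0 block


def _round_half_away(v):
    # C roundf semantics: halves round away from zero (np.round rounds to even).
    return np.sign(v) * np.floor(np.abs(v) + 0.5)


def q8_0_vectorized(x):
    """Encode float32 data as Q8_0-style blocks using whole-array operations."""
    blocks = x.reshape(-1, QK8_0)
    d = np.abs(blocks).max(axis=1, keepdims=True) / np.float32(127)
    inv = np.divide(np.float32(1), d, out=np.zeros_like(d), where=d != 0)
    q = _round_half_away(blocks * inv).astype(np.int8)
    packed = np.concatenate(
        [d.astype(np.float16).view(np.uint8), q.view(np.uint8)], axis=1
    )
    return packed.tobytes()


def q8_0_blockwise(x):
    """Same encoding, one block at a time, as an independent implementation."""
    out = bytearray()
    for block in x.reshape(-1, QK8_0):
        d = np.float32(np.abs(block).max()) / np.float32(127)
        inv = np.float32(0) if d == 0 else np.float32(1) / d
        q = _round_half_away(block * inv).astype(np.int8)
        out += np.float16(d).tobytes() + q.tobytes()
    return bytes(out)


def assert_byte_exact(reference, candidate, x):
    """Raise if the two encoders disagree on any byte; return the encoded size."""
    ref, got = reference(x), candidate(x)
    if len(ref) != len(got):
        raise AssertionError(f"length mismatch: {len(ref)} vs {len(got)}")
    for offset, (a, b) in enumerate(zip(ref, got)):
        if a != b:
            raise AssertionError(f"encoded bytes differ at offset {offset}")
    return len(ref)


x = np.random.default_rng(0).normal(size=(4, 4096)).astype(np.float32)
n_bytes = assert_byte_exact(q8_0_blockwise, q8_0_vectorized, x)
print(n_bytes, "bytes compared")
```

Comparing raw bytes rather than decoded values catches differences in rounding mode, scale precision, and block layout that a tolerance-based comparison would hide.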
libgguf is not an official llama.cpp project. It adapts GGUF/GGML reference behavior into a standalone infrastructure library and keeps compatibility as an engineering target where applicable.
- llama.cpp and gguf-py are the upstream GGUF/GGML ecosystem references for format behavior, constants, Python writer/reader patterns, and reference quantization behavior.
- ComfyUI-GGUF is the existing community ComfyUI GGUF inference/custom-node integration. libgguf may replace or support parts of that stack with reusable native, Python, Torch, and CUDA backend infrastructure.
- ComfyUI-GGUF tools show the current conversion workflow that routes through Python tooling plus patched llama.cpp quantization. libgguf's native conversion executable and Python import-level helper APIs aim to make that flow more direct and reusable.
- Diffusers GGUF docs describe current Diffusers GGUF loading through from_single_file model classes, low-memory torch.uint8 storage, dynamic dequantization during forward, and optional CUDA kernels through the kernels package. Diffusers is a potential optional backend/integration target for libgguf, not currently claimed as supported here.
- Public model repositories such as city96/FLUX.1-dev-gguf are useful real-world compatibility targets for conversion and inference testing.
- Unsloth Dynamic GGUF is relevant policy background for tensor-level qtype decisions. libgguf's dynamic policy is deterministic planning work inspired by this class of approach, not a claim of matching Unsloth results.
See docs/ecosystem.md for the fuller reference map.
- Fuller GGUF writer API.
- Deeper GGUF validator coverage.
- Source dtype GPU input path for F16/BF16.
- Broader frozen exactness coverage.
- Broader native converter CUDA qtype coverage beyond the current Q/K-focused set.
- Converter-level quality and compatibility sweeps for more image-model architectures.
- CUDA IQ quant polish.
- Packaging and wheels.
- Diffusers optional backend/integration exploration.
- ComfyUI-GGUF backend/tooling support or replacement exploration.
GGUF format behavior and quantization kernels are intended to stay compatible with llama.cpp/GGML/GGUF reference behavior where applicable. The NumPy backend extends gguf-py-style implementations, and the Torch backend extends ComfyUI-GGUF-style native Torch implementations. libgguf keeps those ideas in a standalone infrastructure package with native C++ and CUDA paths.
The vendored scalar quant/dequant reference kernels are pinned and validated against llama.cpp commit dbe9c0c8ce65354c372f5d4ab507e5424a755e9f; see docs/development.md for the validation command.
Apache-2.0. Vendored or adapted code provenance should be documented in the relevant source files and expanded where appropriate.