libgguf

Standalone GGUF read/write, byte-exact quantization, and CUDA-accelerated row kernels for C++, Python, NumPy, Torch, and CUDA.

libgguf vendors and adapts GGUF/GGML quantization kernels from llama.cpp into a reusable standalone library and toolkit. The goal is to make GGUF infrastructure available directly to conversion tools and downstream projects without requiring a two-stage route through llama.cpp binaries or partial Python/Torch-only implementations.

The repository currently contains native GGUF row kernels, Python bindings, NumPy and Torch backends, an optional CUDA Torch extension, safetensors-to-GGUF conversion paths, lightweight public tools for GGUF reading, inspection, and structural validation, benchmark tools, and a tensor planning policy for real image-model conversion workflows. A fuller writer API is planned; today, GGUF writing logic lives primarily in the converter paths.

Status

| Field | Value |
| --- | --- |
| Status | active development |
| License | Apache-2.0 |
| Python | >=3.10 |
| Version | 0.1.0 |
| CUDA | optional, experimental, broad qtype coverage |

Features

Standalone native C++ GGUF quantization and dequantization library.
Python bindings for native CPU row kernels.
Extended NumPy GGUF quantization/dequantization backend.
Extended Torch GGUF quantization/dequantization backend.
Optional CUDA quantization and dequantization kernels exposed through a Torch extension.
Native low-memory safetensors-to-GGUF conversion executable.
Experimental/internal Python conversion helper API for safetensors/ckpt workflows.
Experimental public GGUF reader API for metadata, tensor descriptors, tensor iteration, raw tensor byte reads, and structural validation.
Deterministic policy-based tensor planning for real image-model GGUF conversion.
Benchmark suite for native, Torch, and CUDA paths.
Planned fuller GGUF writer API.

Why libgguf

Byte-exact quantization/dequantization against the native CPU reference path where supported.
Broad CUDA quantization and dequantization qtype coverage.
Stack-free near-roofline CUDA dequantization across tested qtypes.
Very fast CUDA quantization for the Q/K/TQ/MXFP4/NVFP4 families; IQ kernels have improved but remain the active optimization frontier.
SIMD/threaded native CPU backend.
Low-memory native converter path for safetensors-to-GGUF conversion.
Multiple backend implementations for parity testing and integration.
Lightweight GGUF reader API for metadata, tensor descriptors, and raw tensor bytes.

Backends

| Backend | Purpose | Status |
| --- | --- | --- |
| native C++ CPU | Reference row quant/dequant kernels, SIMD/threaded CPU paths, shared library, C ABI | active |
| Python bindings | `libgguf` row APIs and native converter bridge | active |
| `libgguf_numpy` | NumPy quant/dequant implementation for parity testing and integration | active |
| `libgguf_torch` | Torch-native quant/dequant implementation for parity testing and integration | active |
| `libgguf_cuda` | Optional Torch CUDA extension with direct quant/dequant kernels | experimental |
| `libgguf_quantize_gguf` | Low-memory C++ safetensors-to-GGUF conversion executable | active, Q/K-focused |
| Python conversion helper | Import-level helper over native bindings and safetensors/ckpt loading | experimental/internal |

Installation

Editable development install:

python -m pip install -e .

Python conversion helper dependencies:

python -m pip install -e ".[quantize]"

CUDA extension dependencies:

python -m pip install -e ".[cuda]"

Core dependency: numpy. Optional extras: cuda, quantize, and test.

The build backend is scikit-build-core. Native builds require CMake >=3.18 and C++17. CUDA kernel builds require nvcc and the CUDA toolkit; the optional Torch CUDA extension additionally requires importable torch and Torch CMake metadata.

Useful CMake options:

LIBGGUF_CPU_BACKEND=REF|SSE2|SSE4_1|AVX2: native CPU row backend to compile, default REF.
LIBGGUF_BUILD_CUDA_KERNELS=AUTO|ON|OFF: optional CUDA kernel targets, including the Torch extension when Torch is available, default AUTO.
LIBGGUF_BUILD_TOOLS=ON: build native command-line tools, default ON.
LIBGGUF_BUILD_BENCHMARKS=OFF: build native benchmark binaries, default OFF.

Quick Start

Native Python row kernels:

import numpy as np
import libgguf

x = np.random.default_rng(0).normal(size=(4, 4096)).astype(np.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K

q = libgguf.quantize_rows(x, qtype)
y = libgguf.dequantize_rows(q, qtype, n_per_row=4096)

Experimental CUDA Torch extension:

import torch
import libgguf
import libgguf.libgguf_cuda as gguf_cuda

rows, width = 4, 4096
tensor_cuda = torch.randn(rows, width, device="cuda", dtype=torch.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K

q = gguf_cuda.quantize(tensor_cuda, int(qtype))
y = gguf_cuda.dequantize(q, int(qtype), rows, width, torch.float16)
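
To make the idea of a row kernel concrete, here is a minimal pure-Python sketch of the Q8_0 block layout as defined by ggml: each block of 32 values is stored as one fp16 scale followed by 32 signed bytes. It is an illustration only, not libgguf's kernel. The function names are hypothetical, and because it computes in float64 rather than float32 it is not claimed to be byte-exact against the native reference path.

```python
import math
import struct

QK8_0 = 32                # elements per Q8_0 block (ggml layout)
BLOCK_BYTES = 2 + QK8_0   # fp16 scale + 32 int8 quants = 34 bytes


def _round_half_away(x: float) -> int:
    """Round to nearest, ties away from zero (C roundf semantics)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_row_q8_0(row):
    """Encode one row of floats; len(row) must be a multiple of 32."""
    if len(row) % QK8_0:
        raise ValueError("row length must be a multiple of 32")
    out = bytearray()
    for start in range(0, len(row), QK8_0):
        block = row[start:start + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0                    # per-block scale
        inv = 1.0 / d if d else 0.0
        quants = [_round_half_away(v * inv) for v in block]
        out += struct.pack("<e32b", d, *quants)   # 'e' = IEEE fp16
    return bytes(out)


def dequantize_row_q8_0(data):
    """Decode bytes produced by quantize_row_q8_0 back to floats."""
    values = []
    for offset in range(0, len(data), BLOCK_BYTES):
        d, *quants = struct.unpack_from("<e32b", data, offset)
        values.extend(d * q for q in quants)
    return values


row = [math.sin(i / 7.0) for i in range(64)]
encoded = quantize_row_q8_0(row)
decoded = dequantize_row_q8_0(encoded)
max_err = max(abs(a - b) for a, b in zip(row, decoded))
print(len(encoded), max_err)
```

Two blocks of 32 values encode to 68 bytes, and the round-trip error stays near half the per-block scale.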

CLI Tools

Python entry points:

gguf-inspect: GGUF metadata and tensor descriptor inspection.
gguf-validate: structural GGUF validation without reading tensor payload bytes.
gguf-compare: GGUF tensor descriptor comparison with optional metadata and payload byte checks.

Native executable:

libgguf_quantize_gguf: low-memory C++ safetensors-to-GGUF converter. It is currently Q/K-focused; other quantization families are not yet supported by this executable.

Typical conversion invocations:

libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf
libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf --scratch-bytes 33554432

The Python conversion helper API remains experimental/internal and requires the quantize extra when used directly. The old Python conversion wrapper modules are retired; use libgguf_quantize_gguf for command-line conversion.

See docs/cli.md for implemented options.
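
When driving the converter from a script, it can help to build the argument list in one place. The sketch below uses only the flags shown in the invocations above; `build_convert_command` is a hypothetical helper, not part of libgguf, and it assumes `libgguf_quantize_gguf` is on `PATH`.

```python
import shlex


def build_convert_command(src, dst, qtype, scratch_bytes=None):
    """Return the argv list for one libgguf_quantize_gguf conversion."""
    cmd = ["libgguf_quantize_gguf", "--src", str(src),
           "--qtype", qtype, "--dst", str(dst)]
    if scratch_bytes is not None:
        cmd += ["--scratch-bytes", str(int(scratch_bytes))]
    return cmd


cmd = build_convert_command(
    "model.safetensors", "model-Q4_K_M.gguf", "Q4_K_M",
    scratch_bytes=32 * 1024 * 1024,   # 33554432 bytes, i.e. 32 MiB
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.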

GGUF Reader

The experimental public reader API opens GGUF files without reading tensor payloads until requested:

import libgguf

info = libgguf.open_gguf("model.gguf")
for tensor in info.iter_tensors():
    print(tensor.name, tensor.shape, tensor.qtype)

raw = info.read_tensor_bytes(info.tensors[0], offset=0, size=128)

open_gguf, inspect_gguf, and read_gguf_header currently share the same lightweight implementation. See docs/python-api.md.

Quantization Policy

Conversion uses deterministic tensor planning, not magic. Current policies are:

uniform: quantize eligible 2D weight tensors uniformly.
comfy: use architecture-aware skip and high-precision patterns similar to image-model GGUF conversion workflows.
dynamic: build on comfy with deterministic tensor-role and layer-position promotion logic, including ongoing investigation of Unsloth Dynamic-like behavior.

All policies support tensor overrides plus include/exclude patterns. See docs/policy.md.
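
As a rough illustration of what a uniform-style plan with overrides and include/exclude patterns could look like, here is a small standard-library sketch. It is not libgguf's planner: `plan_tensors`, the glob pattern syntax, the exact-name override keys, and the `F16` fallback are all assumptions made for the example. The actual rules are described in docs/policy.md.

```python
from fnmatch import fnmatchcase


def plan_tensors(tensors, default_qtype, include=("*",), exclude=(),
                 overrides=None, keep_qtype="F16"):
    """Map each (name, shape) to a qtype; output order follows input order."""
    overrides = overrides or {}
    plan = {}
    for name, shape in tensors:
        if name in overrides:
            plan[name] = overrides[name]            # explicit override wins
        elif (len(shape) == 2
              and any(fnmatchcase(name, p) for p in include)
              and not any(fnmatchcase(name, p) for p in exclude)):
            plan[name] = default_qtype              # eligible 2D weight
        else:
            plan[name] = keep_qtype                 # left in high precision
    return plan


tensors = [
    ("blocks.0.attn.qkv.weight", (3072, 1024)),
    ("blocks.0.attn.qkv.bias", (3072,)),
    ("blocks.0.norm.weight", (1024,)),
    ("final_layer.linear.weight", (64, 1024)),
]
plan = plan_tensors(
    tensors, "Q4_K",
    exclude=("final_layer.*",),
    overrides={"blocks.0.norm.weight": "F32"},
)
for name, qtype in plan.items():
    print(f"{name}: {qtype}")
```

Because the plan depends only on tensor names, shapes, and the supplied patterns, rerunning it on the same inputs gives the same result.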

Supported Qtypes

The public enum and row APIs cover these storage and quantization families:

Q1_0
Q4_0, Q4_1
Q5_0, Q5_1
Q8_0
Q2_K, Q3_K, Q4_K, Q5_K, Q6_K
IQ1_S, IQ1_M
IQ2_XXS, IQ2_XS, IQ2_S
IQ3_XXS, IQ3_S
IQ4_NL, IQ4_XS
TQ1_0, TQ2_0
MXFP4, NVFP4
F32, F16, BF16 storage

Exact support varies by backend and converter path. See docs/support-matrix.md.
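
Each block-quantized qtype packs a fixed number of elements into a fixed number of bytes, so the encoded size of a row follows directly from the row width. The sketch below uses the block geometry from ggml's public block definitions for a subset of the qtypes listed above; the table and `encoded_row_bytes` are illustrative and not a libgguf API.

```python
# qtype -> (elements per block, bytes per block), per ggml's block definitions
BLOCK_GEOMETRY = {
    "Q4_0": (32, 18), "Q4_1": (32, 20),
    "Q5_0": (32, 22), "Q5_1": (32, 24),
    "Q8_0": (32, 34),
    "Q2_K": (256, 84), "Q3_K": (256, 110), "Q4_K": (256, 144),
    "Q5_K": (256, 176), "Q6_K": (256, 210),
    "F16": (1, 2), "BF16": (1, 2), "F32": (1, 4),
}


def encoded_row_bytes(qtype, n_per_row):
    """Bytes needed to store one row of n_per_row elements."""
    block_elems, block_bytes = BLOCK_GEOMETRY[qtype]
    if n_per_row % block_elems:
        raise ValueError(f"{qtype} rows must be a multiple of {block_elems} elements")
    return n_per_row // block_elems * block_bytes


def bits_per_weight(qtype):
    block_elems, block_bytes = BLOCK_GEOMETRY[qtype]
    return block_bytes * 8 / block_elems


print(encoded_row_bytes("Q4_K", 4096))       # 2304
print(4 * encoded_row_bytes("Q4_K", 4096))   # 9216 for a (4, 4096) tensor
print(bits_per_weight("Q4_K"))               # 4.5
```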

Benchmarks

Benchmarks are representative development results on an RTX 3090, not universal performance claims. For shape 11008x4096, recent CUDA dequantization results show tested qtypes running stack-free at roughly 0.23-0.28 ms, around 778-817 GB/s, with low register counts and about 65x-98x speedup versus the CPU default path for the sampled qtypes.

Representative CUDA dequant rows:

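| qtype | time (ms) | throughput (GB/s) | speedup vs CPU default |
| --- | --- | --- | --- |
| `Q1_0` | 0.233 | 799.6 | 93.3x |
| `Q8_0` | 0.279 | 817.1 | 65.5x |
| `Q4_K` | 0.254 | 811.1 | 78.8x |
| `Q5_K` | 0.259 | 814.8 | 79.5x |
| `Q6_K` | 0.267 | 814.5 | 75.6x |
| `IQ2_XS` | 0.246 | 786.6 | 98.3x |
| `IQ4_XS` | 0.254 | 803.6 | 72.5x |
| `TQ1_0` | 0.237 | 802.4 | 81.6x |
| `TQ2_0` | 0.239 | 802.9 | 87.9x |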

CUDA quantization is strong for the Q/K/TQ/MXFP4/NVFP4 families. IQ quantization kernels are exact on checked rows and have improved significantly, but they remain the active optimization frontier.

See docs/benchmarks.md for detailed tables and metrics.
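
This section does not define how GB/s is computed. One reading that lands within about 1% of the listed values is total memory traffic, meaning encoded input bytes plus float32 output bytes, divided by kernel time. That definition is an inference from the numbers, not something stated here, so treat the sketch below as a sanity check under that assumption and defer to docs/benchmarks.md.

```python
def effective_gb_per_s(rows, cols, block_elems, block_bytes, out_bytes_per_elem, ms):
    """Assumed metric: (encoded bytes read + decoded bytes written) / time."""
    n = rows * cols
    encoded = n // block_elems * block_bytes
    decoded = n * out_bytes_per_elem
    return (encoded + decoded) / (ms * 1e-3) / 1e9


# Shape 11008x4096, float32 output (assumed), timings from the Q8_0 and Q4_K rows.
q8 = effective_gb_per_s(11008, 4096, 32, 34, 4, 0.279)
q4k = effective_gb_per_s(11008, 4096, 256, 144, 4, 0.254)
print(f"Q8_0: {q8:.1f} GB/s, Q4_K: {q4k:.1f} GB/s")
```

Under these assumptions the results are about 818 and 810 GB/s, close to the listed 817.1 and 811.1; the small gap is consistent with the timings being rounded to three decimals.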

Correctness

The native CPU path is the reference path. CUDA, NumPy, and Torch implementations are tested for byte exactness where supported: same input, qtype, and shape should produce identical encoded bytes. Dequantization checks compare decoded output for a fixed destination dtype. Frozen golden fixtures supplement generated CPU-reference checks.

See docs/correctness.md.
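
A byte-exactness check reduces to comparing two encoded buffers and, on failure, reporting where they diverge. The helper below is a hypothetical sketch of that comparison, not libgguf's test harness; `block_bytes` is the encoded size of one quantization block (for example 34 for Q8_0), so the report can point at the offending block.

```python
def first_mismatch(reference, candidate, block_bytes):
    """Return None if the buffers match, else (block_index, byte_offset)."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i // block_bytes, i
    if len(reference) != len(candidate):
        # One buffer is a strict prefix of the other.
        i = min(len(reference), len(candidate))
        return i // block_bytes, i
    return None


reference = bytes(range(68))      # stand-in for CPU-reference output (2 Q8_0 blocks)
candidate = bytearray(reference)  # stand-in for another backend's output
candidate[40] ^= 0x01             # flip one bit in the second block

print(first_mismatch(reference, reference, 34))         # None
print(first_mismatch(reference, bytes(candidate), 34))  # (1, 40)
```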

Ecosystem Context

libgguf is not an official llama.cpp project. It adapts GGUF/GGML reference behavior into a standalone infrastructure library and keeps compatibility as an engineering target where applicable.

llama.cpp and gguf-py are the upstream GGUF/GGML ecosystem references for format behavior, constants, Python writer/reader patterns, and reference quantization behavior.
ComfyUI-GGUF is the existing community ComfyUI GGUF inference/custom-node integration. libgguf may replace or support parts of that stack with reusable native, Python, Torch, and CUDA backend infrastructure.
ComfyUI-GGUF tools show the current conversion workflow that routes through Python tooling plus patched llama.cpp quantization. libgguf's native conversion executable and Python import-level helper APIs aim to make that flow more direct and reusable.
Diffusers GGUF docs describe current Diffusers GGUF loading through from_single_file model classes, low-memory torch.uint8 storage, dynamic dequantization during forward, and optional CUDA kernels through the kernels package. Diffusers is a potential optional backend/integration target for libgguf, not currently claimed as supported here.
Public model repositories such as city96/FLUX.1-dev-gguf are useful real-world compatibility targets for conversion and inference testing.
Unsloth Dynamic GGUF is relevant policy background for tensor-level qtype decisions. libgguf's dynamic policy is deterministic planning work inspired by this class of approach, not a claim of matching Unsloth results.

See docs/ecosystem.md for the fuller reference map.

Roadmap

Fuller GGUF writer API.
Deeper GGUF validator coverage.
Source dtype GPU input path for F16/BF16.
Broader frozen exactness coverage.
Broader native converter CUDA qtype coverage beyond the current Q/K-focused set.
Converter-level quality and compatibility sweeps for more image-model architectures.
CUDA IQ quant polish.
Packaging and wheels.
Diffusers optional backend/integration exploration.
ComfyUI-GGUF backend/tooling support or replacement exploration.

Relationship To Upstream Projects

GGUF format behavior and quantization kernels are intended to stay compatible with llama.cpp/GGML/GGUF reference behavior where applicable. The NumPy backend extends gguf-py-style implementations, and the Torch backend extends ComfyUI-GGUF-style native Torch implementations. libgguf keeps those ideas in a standalone infrastructure package with native C++ and CUDA paths.

The vendored scalar quant/dequant reference kernels are pinned and validated against llama.cpp commit dbe9c0c8ce65354c372f5d4ab507e5424a755e9f; see docs/development.md for the validation command.

License

Apache-2.0. Vendored or adapted code provenance should be documented in the relevant source files and expanded where appropriate.
