Skip to content

hlky/libgguf

Repository files navigation

libgguf

Standalone GGUF read/write, byte-exact quantization, and CUDA-accelerated row kernels for C++, Python, NumPy, Torch, and CUDA.

libgguf vendors and adapts GGUF/GGML quantization kernels from llama.cpp into a reusable standalone library and toolkit. The goal is to make GGUF infrastructure available directly to conversion tools and downstream projects without requiring a two-stage route through llama.cpp binaries or partial Python/Torch-only implementations.

The repository currently contains native GGUF row kernels, Python bindings, NumPy and Torch backends, an optional CUDA Torch extension, safetensors-to-GGUF conversion paths, public lightweight GGUF reading/inspection and structural validation tools, benchmark tools, and tensor planning policy for real image-model conversion workflows. A fuller writer API is planned; today, GGUF writing logic is present primarily through converter paths.

Status

Field Value
Status active development
License Apache-2.0
Python >=3.10
Version 0.1.0
CUDA optional, experimental, broad qtype coverage

Features

  • Standalone native C++ GGUF quantization and dequantization library.
  • Python bindings for native CPU row kernels.
  • Extended NumPy GGUF quantization/dequantization backend.
  • Extended Torch GGUF quantization/dequantization backend.
  • Optional CUDA quantization and dequantization kernels exposed through a Torch extension.
  • Native low-memory safetensors-to-GGUF conversion executable.
  • Experimental/internal Python conversion helper API for safetensors/ckpt workflows.
  • Experimental public GGUF reader API for metadata, tensor descriptors, tensor iteration, raw tensor byte reads, and structural validation.
  • Deterministic policy-based tensor planning for real image-model GGUF conversion.
  • Benchmark suite for native, Torch, and CUDA paths.
  • Planned fuller GGUF writer API.

Why libgguf

  • Byte-exact quantization/dequantization against the native CPU reference path where supported.
  • Broad CUDA quantization and dequantization qtype coverage.
  • Stack-free near-roofline CUDA dequantization across tested qtypes.
  • Very fast CUDA quantization for Q/K/TQ/MXFP4/NVFP4 families, with IQ kernels improved and still the active optimization frontier.
  • SIMD/threaded native CPU backend.
  • Low-memory native converter path for safetensors-to-GGUF conversion.
  • Multiple backend implementations for parity testing and integration.
  • Lightweight GGUF reader API for metadata, tensor descriptors, and raw tensor bytes.

Backends

Backend Purpose Status
native C++ CPU Reference row quant/dequant kernels, SIMD/threaded CPU paths, shared library, C ABI active
Python bindings libgguf row APIs and native converter bridge active
libgguf_numpy NumPy quant/dequant implementation for parity testing and integration active
libgguf_torch Torch-native quant/dequant implementation for parity testing and integration active
libgguf_cuda Optional Torch CUDA extension with direct quant/dequant kernels experimental
libgguf_quantize_gguf Low-memory C++ safetensors-to-GGUF conversion executable active, Q/K-focused
Python conversion helper Import-level helper over native bindings and safetensors/ckpt loading experimental/internal

Installation

Editable development install:

python -m pip install -e .

Python conversion helper dependencies:

python -m pip install -e ".[quantize]"

CUDA extension dependencies:

python -m pip install -e ".[cuda]"

Core dependency: numpy. Optional extras: cuda, quantize, and test.

The build backend is scikit-build-core. Native builds require CMake >=3.18 and C++17. CUDA kernel builds require nvcc and the CUDA toolkit; the optional Torch CUDA extension additionally requires importable torch and Torch CMake metadata.

Useful CMake options:

  • LIBGGUF_CPU_BACKEND=REF|SSE2|SSE4_1|AVX2: native CPU row backend to compile, default REF.
  • LIBGGUF_BUILD_CUDA_KERNELS=AUTO|ON|OFF: optional CUDA kernel targets, including the Torch extension when Torch is available, default AUTO.
  • LIBGGUF_BUILD_TOOLS=ON: build native command-line tools, default ON.
  • LIBGGUF_BUILD_BENCHMARKS=OFF: build native benchmark binaries, default OFF.

Quick Start

Native Python row kernels:

import numpy as np
import libgguf

x = np.random.default_rng(0).normal(size=(4, 4096)).astype(np.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K

q = libgguf.quantize_rows(x, qtype)
y = libgguf.dequantize_rows(q, qtype, n_per_row=4096)

Experimental CUDA Torch extension:

import torch
import libgguf
import libgguf.libgguf_cuda as gguf_cuda

rows, width = 4, 4096
tensor_cuda = torch.randn(rows, width, device="cuda", dtype=torch.float32)
qtype = libgguf.GGMLQuantizationType.Q4_K

q = gguf_cuda.quantize(tensor_cuda, int(qtype))
y = gguf_cuda.dequantize(q, int(qtype), rows, width, torch.float16)

CLI Tools

Python entry points:

  • gguf-inspect: GGUF metadata and tensor descriptor inspection.
  • gguf-validate: structural GGUF validation without reading tensor payload bytes.
  • gguf-compare: GGUF tensor descriptor comparison with optional metadata and payload byte checks.

Native executable:

  • libgguf_quantize_gguf: low-memory C++ safetensors-to-GGUF converter. The native executable is currently Q/K-focused; non-Q/K quantization families are not supported by this executable yet.

Common conversion shape:

libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf
libgguf_quantize_gguf --src model.safetensors --qtype Q4_K_M --dst model-Q4_K_M.gguf --scratch-bytes 33554432

The Python conversion helper API remains experimental/internal and requires the quantize extra when used directly. The old Python conversion wrapper modules are retired; use libgguf_quantize_gguf for command-line conversion.

See docs/cli.md for implemented options.

GGUF Reader

The experimental public reader API opens GGUF files without reading tensor payloads until requested:

import libgguf

info = libgguf.open_gguf("model.gguf")
for tensor in info.iter_tensors():
    print(tensor.name, tensor.shape, tensor.qtype)

raw = info.read_tensor_bytes(info.tensors[0], offset=0, size=128)

open_gguf, inspect_gguf, and read_gguf_header currently share the same lightweight implementation. See docs/python-api.md.

Quantization Policy

Conversion uses deterministic tensor planning, not magic. Current policies are:

  • uniform: quantize eligible 2D weight tensors uniformly.
  • comfy: use architecture-aware skip and high-precision patterns similar to image-model GGUF conversion workflows.
  • dynamic: build on comfy with deterministic tensor-role and layer-position promotion logic, including ongoing investigation of Unsloth Dynamic-like behavior.

All policies support tensor overrides plus include/exclude patterns. See docs/policy.md.

Supported Qtypes

The public enum and row APIs cover these storage and quantization families:

  • Q1_0
  • Q4_0, Q4_1
  • Q5_0, Q5_1
  • Q8_0
  • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K
  • IQ1_S, IQ1_M
  • IQ2_XXS, IQ2_XS, IQ2_S
  • IQ3_XXS, IQ3_S
  • IQ4_NL, IQ4_XS
  • TQ1_0, TQ2_0
  • MXFP4, NVFP4
  • F32, F16, BF16 storage

Exact support varies by backend and converter path. See docs/support-matrix.md.

Benchmarks

Benchmarks are representative development results on an RTX 3090, not universal performance claims. For shape 11008x4096, recent CUDA dequantization results show tested qtypes running stack-free at roughly 0.23-0.28 ms, around 778-817 GB/s, with low register counts and about 65x-98x speedup versus the CPU default path for the sampled qtypes.

Representative CUDA dequant rows:

qtype ms GB/s speedup vs CPU default
Q1_0 0.233 799.6 93.3x
Q8_0 0.279 817.1 65.5x
Q4_K 0.254 811.1 78.8x
Q5_K 0.259 814.8 79.5x
Q6_K 0.267 814.5 75.6x
IQ2_XS 0.246 786.6 98.3x
IQ4_XS 0.254 803.6 72.5x
TQ1_0 0.237 802.4 81.6x
TQ2_0 0.239 802.9 87.9x

CUDA quantization is strong for Q/K/TQ/MX/NV families, with IQ kernels improved significantly and still the active optimization frontier. IQ quant kernels are exact on checked rows and continue to be optimized.

See docs/benchmarks.md for detailed tables and metrics.

Correctness

The native CPU path is the reference path. CUDA, NumPy, and Torch implementations are tested for byte exactness where supported: same input, qtype, and shape should produce identical encoded bytes. Dequantization checks compare decoded output for a fixed destination dtype. Frozen golden fixtures supplement generated CPU-reference checks.

See docs/correctness.md.

Ecosystem Context

libgguf is not an official llama.cpp project. It adapts GGUF/GGML reference behavior into a standalone infrastructure library and keeps compatibility as an engineering target where applicable.

  • llama.cpp and gguf-py are the upstream GGUF/GGML ecosystem references for format behavior, constants, Python writer/reader patterns, and reference quantization behavior.
  • ComfyUI-GGUF is the existing community ComfyUI GGUF inference/custom-node integration. libgguf may replace or support parts of that stack with reusable native, Python, Torch, and CUDA backend infrastructure.
  • ComfyUI-GGUF tools show the current conversion workflow that routes through Python tooling plus patched llama.cpp quantization. libgguf's native conversion executable and Python import-level helper APIs aim to make that flow more direct and reusable.
  • Diffusers GGUF docs describe current Diffusers GGUF loading through from_single_file model classes, low-memory torch.uint8 storage, dynamic dequantization during forward, and optional CUDA kernels through the kernels package. Diffusers is a potential optional backend/integration target for libgguf, not currently claimed as supported here.
  • Public model repositories such as city96/FLUX.1-dev-gguf are useful real-world compatibility targets for conversion and inference testing.
  • Unsloth Dynamic GGUF is relevant policy background for tensor-level qtype decisions. libgguf's dynamic policy is deterministic planning work inspired by this class of approach, not a claim of matching Unsloth results.

See docs/ecosystem.md for the fuller reference map.

Roadmap

  • Fuller GGUF writer API.
  • Deeper GGUF validator coverage.
  • Source dtype GPU input path for F16/BF16.
  • Broader frozen exactness coverage.
  • Broader native converter CUDA qtype coverage beyond the current Q/K-focused set.
  • Converter-level quality and compatibility sweeps for more image-model architectures.
  • CUDA IQ quant polish.
  • Packaging and wheels.
  • Diffusers optional backend/integration exploration.
  • ComfyUI-GGUF backend/tooling support or replacement exploration.

Relationship To Upstream Projects

GGUF format behavior and quantization kernels are intended to stay compatible with llama.cpp/GGML/GGUF reference behavior where applicable. The NumPy backend extends gguf-py-style implementations, and the Torch backend extends ComfyUI-GGUF-style native Torch implementations. libgguf keeps those ideas in a standalone infrastructure package with native C++ and CUDA paths.

The vendored scalar quant/dequant reference kernels are pinned and validated against llama.cpp commit dbe9c0c8ce65354c372f5d4ab507e5424a755e9f; see docs/development.md for the validation command.

License

Apache-2.0. Vendored or adapted code provenance should be documented in the relevant source files and expanded where appropriate.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors