feat: Adding Support for SD.Next Quantization Engine (SDNQ) (Flux1&Flux2klein4B/9B&Z-Image) #9228

Open
Pfannkuchensack wants to merge 20 commits into invoke-ai:main from Pfannkuchensack:feature/svd-quantization

@Pfannkuchensack
Collaborator

Summary

Adds support for SDNQ (SD.Next Quantization) as a new quantization format in InvokeAI, enabling memory-efficient inference for large models on consumer GPUs.

What's included:

  • New sdnq quantization backend (invokeai/backend/quantization/sdnq/) with SDNQTensor, dequant utils, and safetensors loaders (incl. multi-shard support)
  • Model config + loader support for SDNQ-quantized:
    • FLUX.1 transformers (with BFL ↔ diffusers norm_out scale/shift fix)
    • FLUX.2 Klein 4B/9B transformers (incl. dynamic mixed-precision Klein pipelines)
    • Z-Image full ZImagePipeline diffusers folders (all submodels dispatched via SDNQ loader)
    • T5 and Qwen3 text encoders
  • Config discriminator: SDNQ-quantized diffusers folders are now correctly identified as SDNQ instead of plain diffusers (avoids crashes when reading packed uint8 weights as bf16)
  • Loader treats SDNQ ZImagePipeline / Flux2KleinPipeline folders as main_is_diffusers so submodels auto-extract (no separate VAE/Qwen3 source required)
  • Frontend: new SDNQ model format badge, schema/types regeneration, readiness updates, Klein FE combobox now accepts SDNQ pipeline configs
  • Starter models entries + user-facing docs at docs/src/content/docs/configuration/sdnq-quantization.mdx
  • Tests: tests/backend/quantization/sdnq/ covering tensor dequant + loader behavior; custom-modules tests extended

Why: SDNQ enables running FLUX, FLUX.2, and Z-Image on lower-VRAM GPUs by loading pre-quantized weight folders directly, without runtime conversion overhead.

Related Issues / Discussions

Closes #8789

QA Instructions

  1. Install an SDNQ-quantized model folder for each supported architecture and verify identification:
    • FLUX (BFL + diffusers variants)
    • FLUX.2 dev + FLUX.2 Klein (dynamic mixed-precision)
    • Z-Image full pipeline
    • T5 / Qwen3 encoders (standalone + bundled in pipelines)
  2. In the Model Manager, confirm the model is tagged with the SDNQ format badge.
  3. Run a generation with each model and verify:
    • Submodels auto-extract from the pipeline folder (no extra VAE/text-encoder sources needed)
    • Multi-shard diffusion_pytorch_model-*-of-*.safetensors files merge correctly (Klein 9B, FLUX.2 dev)
    • No crashes from bf16 reads on packed uint8 weights
  4. Verify FLUX output quality is unchanged (regression check for the BFL norm_out scale/shift swap).
  5. Run the new tests: uv run --extra cuda pytest tests/backend/quantization/sdnq/.

Merge Plan

Needs Testing

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration — n/a
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Add support for loading SDNQ-quantized models with on-the-fly CPU
dequantization, similar to existing GGUF support.

New features:
- SDNQTensor class with __torch_dispatch__ for automatic dequantization
- Support for symmetric/asymmetric int8/uint8/fp8 quantization
- Optional SVD correction (low-rank approximation)
- Model loaders for Flux and Z-Image SDNQ models
- Automatic format detection via weight+scale key pairs
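
The detection by weight+scale key pairs can be sketched roughly as follows. This is only an illustration: the function names and the assumption that a quantized layer appears as sibling `<prefix>.weight` and `<prefix>.scale` keys are hypothetical, and the PR's actual detection code may check additional keys.

```python
def find_quantized_prefixes(keys):
    """Return module prefixes that have both a '.weight' and a '.scale' key.

    Assumption (not taken from the PR): a quantized layer stores its scale
    under '<prefix>.scale' next to '<prefix>.weight'.
    """
    key_set = set(keys)
    prefixes = []
    for key in sorted(key_set):
        if key.endswith(".weight"):
            prefix = key[: -len(".weight")]
            if f"{prefix}.scale" in key_set:
                prefixes.append(prefix)
    return prefixes


def looks_like_sdnq(keys):
    """Heuristic: at least one weight key has a sibling scale key."""
    return len(find_quantized_prefixes(keys)) > 0
```

A plain checkpoint that only carries `weight`/`bias` keys yields no prefixes, so this heuristic would not flag it.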

New files:
- invokeai/backend/quantization/sdnq/ (core module)
- tests/backend/quantization/sdnq/ (unit tests)

Modified files:
- taxonomy.py: Add ModelFormat.SDNQQuantized
- configs/main.py: Add Main_SDNQ_FLUX_Config, Main_SDNQ_ZImage_Config
- configs/factory.py: Register SDNQ configs
- model_loaders/flux.py: Add FluxSDNQCheckpointModel
- model_loaders/z_image.py: Add ZImageSDNQCheckpointModel

- Add uint4 per-group quantization with packed weight unpacking
- Handle 1D flattened weights (reshape to 2D before unpacking)
- Support SDNQ diffusers format for FLUX transformer and T5
- Add SDNQ VAE loading with AutoencoderKL
- Add diagnostic logging for debugging dequantization
- Fix bit order in uint4 unpacking (lower, upper)
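
To illustrate the "(lower, upper)" bit order, here is a minimal dependency-free sketch of unpacking two 4-bit values from each packed byte, emitting the low nibble first. The function name is hypothetical and the real implementation operates on tensors rather than Python bytes.

```python
def unpack_uint4_sketch(packed):
    """Unpack each byte into two 4-bit values: lower nibble, then upper nibble."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)         # lower 4 bits come first
        values.append((byte >> 4) & 0x0F)  # upper 4 bits come second
    return values
```

Swapping the two `append` calls would give the (upper, lower) order, which scrambles adjacent weights in pairs.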
…tion

The test was checking `(weight - zero_point) * scale`, but SDNQ
(Disty0/sdnq) defines asymmetric dequantization as
`zero_point + weight * scale` (via torch.addcmul), where zero_point
is a post-scale bias rather than a pre-scale integer offset. The
implementation already follows this convention; only the test
expectation was wrong.
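
The difference between the two conventions can be shown with a small pure-Python sketch. The function names are hypothetical; `torch.addcmul(input, tensor1, tensor2)` computes `input + tensor1 * tensor2` elementwise (with the default `value=1`), which is what the first function mirrors.

```python
def dequantize_post_scale_bias(weight, scale, zero_point):
    """SDNQ-style convention as described above: zero_point + weight * scale."""
    return [zero_point + w * scale for w in weight]


def dequantize_pre_scale_offset(weight, scale, zero_point):
    """The convention the old test assumed: (weight - zero_point) * scale."""
    return [(w - zero_point) * scale for w in weight]
```

For the same inputs the two formulas generally disagree, which is why a test written against the second convention fails on a correct implementation of the first.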

feat(sdnq): support sidecar LoRA application on SDNQ-quantized layers

Bring SDNQ to feature parity with GGUF in the sidecar patching path
so LoRA, LoKr, DoRA, FullLayer, and FluxControl patches apply
correctly to SDNQ-quantized Linear and Conv2d modules. Without this,
the sidecar aggregate replaced the SDNQTensor weight with a meta
tensor and patches silently produced wrong results.

- Add SDNQTensor branch in CustomModuleMixin._aggregate_patch_parameters
  mirroring the GGMLTensor branch.
- Extend the (GGMLTensor) dtype-cast exclusion to also cover
  SDNQTensor in CustomLinear, CustomConv2d, CustomInvokeLinearNF4,
  and CustomInvokeLinear8bitLt.
- Add `linear_with_sdnq_quantized_tensor` and `linear_sdnq_quantized`
  fixtures so the existing custom-module test matrix exercises SDNQ
  alongside GGUF, BnB-8bit, and NF4.

Add T5Encoder_SDNQ_Config for diffusers-style T5 bundles whose
text_encoder_2/ folder holds SDNQ-quantized safetensors (detected
via quantization_config.json's quant_method or via the SDNQ-style
weight+scale key pairs). Add T5EncoderSDNQLoader that materializes
the T5EncoderModel on meta, then loads the SDNQ state dict, and
re-shares the embed_tokens/shared weight per HuggingFace's tied-
weight convention.

Add Main_SDNQ_Flux2_Config covering Klein 4B/9B and their Base
variants (detected via _get_flux2_variant on the dequantized
SDNQTensor shapes plus the existing filename heuristic), and
Flux2SDNQCheckpointModel that loads diffusers-layout SDNQ FLUX.2
checkpoints straight into Flux2Transformer2DModel. Architecture
(num_layers, hidden_size, attention head count, guidance presence)
is detected from state-dict shapes the same way the fp16 loader
does, since SDNQTensor.shape reports the dequantized shape.

BFL-layout SDNQ FLUX.2 checkpoints are not supported here — that
would require an SDNQTensor-aware port of the
_convert_flux2_bfl_to_diffusers fuse logic.

Add Main_SDNQ_Diffusers_ZImage_Config so a complete SDNQ ZImagePipeline
folder (model_index.json + transformer/ + text_encoder/ + tokenizer/ +
vae/) is recognised on install and its submodels are wired up. Extend
ZImageSDNQCheckpointModel to load the transformer from the subfolder
using ZImageTransformer2DModel.from_config() so non-default architecture
parameters (e.g. axes_lens [1536,512,512] in newer Z-Image Turbo SDNQ
exports) are honoured instead of the single-file path's hardcoded
[1024,512,512].

Verified end-to-end against Tongyi-MAI/Z-Image-Turbo-SDNQ-uint4-svd-r32:
269 quantized + 252 regular tensors load into a 6.15B-param model with
0 missing / 0 unexpected keys.

T5Encoder_SDNQ_Config originally only looked for text_encoder_2/
as a subfolder of mod.path, which works for standalone T5 bundles
but misses the case where a parent FluxPipeline / similar config
registers its T5 submodel with path_or_prefix pointing straight at
the text_encoder_2 folder. Allow both layouts in both the config's
detection logic and T5EncoderSDNQLoader's te_dir resolution.

Verified end-to-end with Disty0/FLUX.1-schnell-SDNQ-uint4-svd-r32.
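
Accepting both layouts could look roughly like the sketch below. The helper name `resolve_t5_dir` is hypothetical and only illustrates the idea; the PR's actual `te_dir` resolution may perform further checks.

```python
from pathlib import Path


def resolve_t5_dir(path: Path) -> Path:
    """Return the folder that holds the T5 weights.

    Layout 1: `path` is a bundle containing a text_encoder_2/ subfolder.
    Layout 2: `path` already points straight at the text_encoder_2 folder.
    """
    nested = path / "text_encoder_2"
    if nested.is_dir():
        return nested
    return path
```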

The diffusers→BFL state-dict converter renamed norm_out.linear.{weight,bias}
to final_layer.adaLN_modulation.1.{weight,bias} but did not swap the
two halves along dim 0. diffusers' AdaLayerNormContinuous packs the
linear output as (scale, shift); BFL's LastLayer packs as (shift, scale).
Without the swap, the final adaLN modulation runs with scale and shift
permuted, which produces structured-but-very-noisy output for every
pixel. Reuse the same pattern the FLUX.2 converter applies for the
analogous adaLN_modulation key.
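
The swap itself amounts to exchanging the two halves of the weight (and bias) along the first dimension. A minimal sketch, using a list of rows in place of a tensor and a hypothetical function name:

```python
def swap_halves_dim0(rows):
    """Reorder (scale, shift) rows into (shift, scale) rows, or vice versa."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows along dim 0")
    half = len(rows) // 2
    # The second half moves to the front, the first half to the back.
    return rows[half:] + rows[:half]
```

With tensors, the same effect can be obtained with `weight.chunk(2, dim=0)` followed by `torch.cat` in the opposite order. Applying the swap twice returns the original layout, so it must be applied exactly once during conversion.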

ZImageSDNQCheckpointModel only handled the Transformer submodel, so
attempts to use an SDNQ ZImagePipeline as the "Qwen3 & VAE source
model" (which triggers loads for TextEncoder / Tokenizer / VAE)
crashed with "Only Transformer submodels are currently supported".
Add per-submodel handlers that load text_encoder/ via sdnq_sd_loader
into an empty Qwen3ForCausalLM (re-sharing lm_head with embed_tokens
when tied), tokenizer/ via AutoTokenizer, and vae/ via
AutoencoderKL.from_pretrained. The single-file SDNQ checkpoint path
keeps its transformer-only behaviour but now raises a clearer error
when asked for a different submodel.

Add support for SDNQ-quantized Flux2KleinPipeline folders, which mix
uint4 and int5 dtypes across layers (chosen dynamically by SDNQ during
quantization to stay under a per-group loss budget).

Core changes:
- Add INT5_ASYM quantization type + unpack_uint5 + dequantize_int5_per_group.
  Sign-extension matches Disty0/sdnq's unpack_int convention (raw 0..31 - 16).
  zero_point is optional (dynamic-mixed sometimes emits scale-only int5 tensors).
- _infer_quantization_type now takes a per_tensor_dtype override; the loader
  builds an inverted map from quantization_config.json's modules_dtype_dict.
- _get_original_shape uses the packed weight size as the authoritative source
  for in_features, fixing a bug where Klein 4B's group_size=64 layers were
  misread as group_size=128 (the previous fallback).
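
As a rough illustration of the int5 path, the sketch below unpacks 5-bit values and applies the "raw 0..31 - 16" sign convention. The bit layout is an assumption made for this example (a little-endian bitstream in which value i occupies bits 5i to 5i+4); the real SDNQ packing may differ. All function names are hypothetical, and the optional `zero_point` is applied as a post-scale bias, following the asymmetric convention described earlier.

```python
def unpack_uint5_sketch(packed, count):
    """Read `count` 5-bit values from a little-endian bitstream (assumed layout)."""
    if count * 5 > len(packed) * 8:
        raise ValueError("not enough packed bytes for the requested count")
    bits = int.from_bytes(packed, "little")
    return [(bits >> (5 * i)) & 0x1F for i in range(count)]


def to_signed_int5(raw_values):
    """Map raw 0..31 values to -16..15 by subtracting 16."""
    return [v - 16 for v in raw_values]


def dequantize_int5_sketch(raw_values, scale, zero_point=None):
    """Scale the signed values; add zero_point only when one is present."""
    out = [v * scale for v in to_signed_int5(raw_values)]
    if zero_point is not None:
        out = [zero_point + x for x in out]
    return out
```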

Pipeline integration:
- Add Main_SDNQ_Diffusers_Flux2_Config matching Flux2Pipeline /
  Flux2KleinPipeline folders with quantized transformer.
- Flux2SDNQCheckpointModel now dispatches all pipeline submodels:
  transformer (Flux2Transformer2DModel.from_config + sdnq state dict),
  text_encoder (Qwen3ForCausalLM SDNQ + lm_head/embed_tokens tie),
  tokenizer (AutoTokenizer), vae (AutoencoderKLFlux2 / AutoencoderKL).
- Extend flux2_klein_model_loader._validate_diffusers_format and the
  isFlux2DiffusersMainModelConfig FE filter to also accept SDNQ pipeline
  configs (when submodels is populated).

Verified against Disty0/FLUX.2-klein-4B-SDNQ-4bit-dynamic: 98 uint4 +
2 int5 tensors load into a 3.88B-param Flux2Transformer2DModel with
0 missing / 0 unexpected keys; both dequant paths produce reasonable
zero-centred weight distributions.
  Main_Diffusers_Flux2_Config so identification routes them to the
  SDNQ configs instead. Without this both configs accept the folder
  and the plain diffusers loader wins, then crashes when reading
  packed uint8 weights as bf16.
  diffusion_pytorch_model-{00001,00002}-of-00002.safetensors and FLUX.2
  dev's sharded transformer both load. Detect cross-shard key collisions
  as a corruption signal.
  "main_is_diffusers" in z_image_model_loader and flux2_klein_model_loader
  so the auto-extract-submodels branch handles them. Without this the
  loader demanded a separate VAE/Qwen3 source even though the SDNQ
  pipeline carries those submodels itself.
- Drop the ui_model_format=Diffusers hint on Klein's qwen3_source_model
  field so the FE combobox can also show SDNQ pipeline configs (the FE
  filter already accepts them).
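
The multi-shard merge with cross-shard collision detection mentioned above can be sketched as follows. Plain dicts stand in for the per-shard state dicts, and the function name is hypothetical; the actual loader reads safetensors files.

```python
def merge_shards_sketch(shards):
    """Merge per-shard state dicts, treating a repeated key as corruption.

    `shards` maps a shard file name to that shard's {key: tensor} dict.
    """
    merged = {}
    owner = {}  # remembers which shard first provided each key
    for shard_name, tensors in shards.items():
        for key, value in tensors.items():
            if key in merged:
                raise ValueError(
                    f"key {key!r} appears in both {owner[key]} and {shard_name}"
                )
            merged[key] = value
            owner[key] = shard_name
    return merged
```

Raising instead of silently overwriting keeps a damaged or mismatched shard set from loading with the wrong weights.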

Loading the Klein 4B SDNQ pipeline as the main model errored with
"No Qwen3 Encoder selected" in the UI even though the pipeline carries
its own Qwen3 + VAE submodels, and the Model Manager showed no format
badge at all on SDNQ models.

- flux2_klein_model_loader now treats SDNQ-with-submodels as
  main_is_diffusers, so the auto-extract-submodels branch handles SDNQ
  pipelines exactly like plain diffusers. Drop the
  ui_model_format=Diffusers hint on qwen3_source_model so the combobox
  can also show SDNQ pipeline configs.
- readiness.ts no longer demands a standalone VAE/Qwen3 for FLUX.2
  Klein when the main model is itself a pipeline (diffusers or
  SDNQ-with-submodels). Without this the Invoke button stayed disabled
  with "Non-diffusers FLUX.2 Klein models require a standalone Qwen3
  Encoder" even when the SDNQ pipeline could self-source everything.
- Register sdnq_quantized in zModelFormat, the manually-edited OpenAPI
  schema, ModelFormatBadge, and MODEL_FORMAT_TO_LONG_NAME so SDNQ
  models render an "sdnq" badge instead of an empty placeholder.

- 4 new starter models covering all SDNQ pipelines verified
  end-to-end in this branch: FLUX.1 schnell, Z-Image Turbo,
  FLUX.2 Klein 4B (dynamic mixed), FLUX.2 Klein 9B (dynamic
  mixed + SVD). Each entry is self-contained (no separate
  encoder/VAE dependencies because the SDNQ pipeline folder
  bundles them).
- New /configuration/sdnq-quantization/ page: support matrix,
  VRAM footprints, install steps (Starter Models + HF + Folder),
  LoRA compatibility notes, SDNQ-vs-SVDQuant/Nunchaku
  disambiguation, comparison with GGUF/NF4/FP8, troubleshooting.
- Cross-link from fp8-storage.mdx's "no-op on quantized" caution.

github-actions bot added the python, invocations, backend, frontend, python-tests, and docs labels on May 24, 2026.

Z-Image and Qwen3 SDNQ configs were missing `variant` (and `cpu_only`
on Qwen3) fields that exist on the other variants of the same union,
breaking TypeScript narrowing on the FE.

- Main_SDNQ_ZImage_Config: add variant (default Turbo)
- Main_SDNQ_Diffusers_ZImage_Config: add variant, detect from
  scheduler_config.json shift value
- Qwen3Encoder_SDNQ_Config: add cpu_only + variant, detect from
  embed_tokens shape
- Qwen3Encoder_SDNQ_Folder_Config: add cpu_only + variant, detect
  from config.json hidden_size
- Regenerate FE schema.ts

Discriminator tags are unchanged since variant has no default.

Development

Successfully merging this pull request may close these issues.

[enhancement]: Support for SD.Next Quantizer