feat: Adding Support for SD.Next Quantization Engine (SDNQ) (Flux1&Flux2klein4B/9B&Z-Image) #9228
Open
Pfannkuchensack wants to merge 20 commits into
Add support for loading SDNQ-quantized models with on-the-fly CPU dequantization, similar to existing GGUF support.

New features:
- SDNQTensor class with __torch_dispatch__ for automatic dequantization
- Support for symmetric/asymmetric int8/uint8/fp8 quantization
- Optional SVD correction (low-rank approximation)
- Model loaders for Flux and Z-Image SDNQ models
- Automatic format detection via weight+scale key pairs

New files:
- invokeai/backend/quantization/sdnq/ (core module)
- tests/backend/quantization/sdnq/ (unit tests)

Modified files:
- taxonomy.py: Add ModelFormat.SDNQQuantized
- configs/main.py: Add Main_SDNQ_FLUX_Config, Main_SDNQ_ZImage_Config
- configs/factory.py: Register SDNQ configs
- model_loaders/flux.py: Add FluxSDNQCheckpointModel
- model_loaders/z_image.py: Add ZImageSDNQCheckpointModel
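The "weight+scale key pairs" detection mentioned above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `looks_like_sdnq` and the exact `.weight` / `.scale` key suffixes are assumptions made for the example.

```python
from typing import Iterable


def looks_like_sdnq(keys: Iterable[str]) -> bool:
    """Return True if any layer has both a `.weight` and a `.scale` entry.

    Hypothetical helper: the real key naming may differ.
    """
    key_set = set(keys)
    for key in key_set:
        if key.endswith(".weight"):
            prefix = key[: -len(".weight")]
            # A quantized layer is assumed to ship a scale next to its weight.
            if f"{prefix}.scale" in key_set:
                return True
    return False
```

A plain fp16/bf16 checkpoint has `.weight` and `.bias` keys but no `.scale` companion, so under this assumption it would not be classified as SDNQ.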
- Add uint4 per-group quantization with packed weight unpacking
- Handle 1D flattened weights (reshape to 2D before unpacking)
- Support SDNQ diffusers format for FLUX transformer and T5
- Add SDNQ VAE loading with AutoencoderKL
- Add diagnostic logging for debugging dequantization
- Fix bit order in uint4 unpacking (lower, upper)
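To make the "(lower, upper)" bit order concrete, here is a small pure-Python sketch of unpacking two 4-bit values per byte with the lower nibble first. The function name `unpack_uint4` is hypothetical, and the sketch works on raw bytes rather than tensors.

```python
from typing import List


def unpack_uint4(packed: bytes) -> List[int]:
    """Split each byte into two 4-bit values, lower nibble first."""
    values: List[int] = []
    for byte in packed:
        values.append(byte & 0x0F)         # lower nibble comes first
        values.append((byte >> 4) & 0x0F)  # upper nibble second
    return values
```

For example, the byte `0x21` unpacks to `[1, 2]`. Reversing the order would interleave neighbouring weights incorrectly, which is the kind of error a bit-order fix addresses.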
…tion The test was checking `(weight - zero_point) * scale`, but SDNQ (Disty0/sdnq) defines asymmetric dequantization as `zero_point + weight * scale` (via torch.addcmul), where zero_point is a post-scale bias rather than a pre-scale integer offset. The implementation already follows this convention; only the test expectation was wrong.
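The difference between the two conventions is easy to see with plain numbers. The sketch below uses Python lists instead of tensors, and both function names are hypothetical; only the two formulas come from the commit message.

```python
from typing import List, Sequence


def dequantize_asymmetric(weight: Sequence[int], scale: float, zero_point: float) -> List[float]:
    """SDNQ convention: zero_point + weight * scale (zero_point is a post-scale bias)."""
    return [zero_point + w * scale for w in weight]


def dequantize_pre_scale_offset(weight: Sequence[int], scale: float, zero_point: float) -> List[float]:
    """Convention the old test assumed: (weight - zero_point) * scale."""
    return [(w - zero_point) * scale for w in weight]
```

With `weight=[0, 2, 4]`, `scale=0.5`, and `zero_point=-1.0`, the first function returns `[-1.0, 0.0, 1.0]` while the second returns `[0.5, 1.5, 2.5]`, so a test written against the second formula fails even when the implementation is correct.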
feat(sdnq): support sidecar LoRA application on SDNQ-quantized layers

Bring SDNQ to feature parity with GGUF in the sidecar patching path so LoRA, LoKr, DoRA, FullLayer, and FluxControl patches apply correctly to SDNQ-quantized Linear and Conv2d modules. Without this, the sidecar aggregate replaced the SDNQTensor weight with a meta tensor and patches silently produced wrong results.

- Add SDNQTensor branch in CustomModuleMixin._aggregate_patch_parameters mirroring the GGMLTensor branch.
- Extend the (GGMLTensor) dtype-cast exclusion to also cover SDNQTensor in CustomLinear, CustomConv2d, CustomInvokeLinearNF4, and CustomInvokeLinear8bitLt.
- Add `linear_with_sdnq_quantized_tensor` and `linear_sdnq_quantized` fixtures so the existing custom-module test matrix exercises SDNQ alongside GGUF, BnB-8bit, and NF4.
Add T5Encoder_SDNQ_Config for diffusers-style T5 bundles whose text_encoder_2/ folder holds SDNQ-quantized safetensors (detected via quantization_config.json's quant_method or via the SDNQ-style weight+scale key pairs). Add T5EncoderSDNQLoader, which materializes the T5EncoderModel on meta, loads the SDNQ state dict, and re-shares the embed_tokens/shared weight per HuggingFace's tied-weight convention.
Add Main_SDNQ_Flux2_Config covering Klein 4B/9B and their Base variants (detected via _get_flux2_variant on the dequantized SDNQTensor shapes plus the existing filename heuristic), and Flux2SDNQCheckpointModel that loads diffusers-layout SDNQ FLUX.2 checkpoints straight into Flux2Transformer2DModel. Architecture (num_layers, hidden_size, attention head count, guidance presence) is detected from state-dict shapes the same way the fp16 loader does, since SDNQTensor.shape reports the dequantized shape. BFL-layout SDNQ FLUX.2 checkpoints are not supported here — that would require an SDNQTensor-aware port of the _convert_flux2_bfl_to_diffusers fuse logic.
Add Main_SDNQ_Diffusers_ZImage_Config so a complete SDNQ ZImagePipeline folder (model_index.json + transformer/ + text_encoder/ + tokenizer/ + vae/) is recognised on install and its submodels are wired up. Extend ZImageSDNQCheckpointModel to load the transformer from the subfolder using ZImageTransformer2DModel.from_config() so non-default architecture parameters (e.g. axes_lens [1536,512,512] in newer Z-Image Turbo SDNQ exports) are honoured instead of the single-file path's hardcoded [1024,512,512]. Verified end-to-end against Tongyi-MAI/Z-Image-Turbo-SDNQ-uint4-svd-r32: 269 quantized + 252 regular tensors load into a 6.15B-param model with 0 missing / 0 unexpected keys.
T5Encoder_SDNQ_Config originally only looked for text_encoder_2/ as a subfolder of mod.path, which works for standalone T5 bundles but misses the case where a parent FluxPipeline / similar config registers its T5 submodel with path_or_prefix pointing straight at the text_encoder_2 folder. Allow both layouts in both the config's detection logic and T5EncoderSDNQLoader's te_dir resolution. Verified end-to-end with Disty0/FLUX.1-schnell-SDNQ-uint4-svd-r32.
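Accepting both layouts could look roughly like the sketch below. The function name `resolve_te_dir` and the rule "the path itself is named `text_encoder_2`" are assumptions for illustration; the actual loader may identify the folder differently (for example by the files it contains).

```python
from pathlib import Path


def resolve_te_dir(model_path: Path, subfolder: str = "text_encoder_2") -> Path:
    """Return the encoder folder whether model_path is its parent or the folder itself."""
    nested = model_path / subfolder
    if nested.is_dir():
        # Standalone bundle: <model_path>/text_encoder_2/
        return nested
    if model_path.name == subfolder and model_path.is_dir():
        # A parent pipeline config pointed directly at the subfolder.
        return model_path
    raise FileNotFoundError(f"No {subfolder}/ folder found at or under {model_path}")
```

Checking the nested layout first keeps the original standalone-bundle behaviour unchanged and only adds the direct-path case as a fallback.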
The diffusers→BFL state-dict converter renamed norm_out.linear.{weight,bias}
to final_layer.adaLN_modulation.1.{weight,bias} but did not swap the
two halves along dim 0. diffusers' AdaLayerNormContinuous packs the
linear output as (scale, shift); BFL's LastLayer packs as (shift, scale).
Without the swap, the final adaLN modulation runs with scale and shift
permuted, which produces structured-but-very-noisy output for every
pixel. Reuse the same pattern the FLUX.2 converter applies for the
analogous adaLN_modulation key.
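The swap itself is a reorder of the two halves along dim 0. The sketch below shows it on plain Python sequences (a list of rows for a weight, a flat list for a bias); the function name `swap_scale_shift` is hypothetical. On tensors the same reorder would typically be a `chunk(2, dim=0)` followed by a `torch.cat` in reversed order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def swap_scale_shift(rows: Sequence[T]) -> List[T]:
    """Swap the two halves along dim 0: (scale, shift) -> (shift, scale)."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of entries (two equal halves)")
    half = len(rows) // 2
    # Second half moves to the front, first half to the back.
    return list(rows[half:]) + list(rows[:half])
```

Applying the function twice returns the original order, which is a convenient sanity check when converting in both directions.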
ZImageSDNQCheckpointModel only handled the Transformer submodel, so attempts to use an SDNQ ZImagePipeline as the "Qwen3 & VAE source model" (which triggers loads for TextEncoder / Tokenizer / VAE) crashed with "Only Transformer submodels are currently supported". Add per-submodel handlers that load text_encoder/ via sdnq_sd_loader into an empty Qwen3ForCausalLM (re-sharing lm_head with embed_tokens when tied), tokenizer/ via AutoTokenizer, and vae/ via AutoencoderKL.from_pretrained. The single-file SDNQ checkpoint path keeps its transformer-only behaviour but now raises a clearer error when asked for a different submodel.
Add support for SDNQ-quantized Flux2KleinPipeline folders, which mix uint4 and int5 dtypes across layers (chosen dynamically by SDNQ during quantization to stay under a per-group loss budget).

Core changes:
- Add INT5_ASYM quantization type + unpack_uint5 + dequantize_int5_per_group. Sign-extension matches Disty0/sdnq's unpack_int convention (raw 0..31 - 16). zero_point is optional (dynamic-mixed sometimes emits scale-only int5 tensors).
- _infer_quantization_type now takes a per_tensor_dtype override; the loader builds an inverted map from quantization_config.json's modules_dtype_dict.
- _get_original_shape uses the packed weight size as the authoritative source for in_features, fixing a bug where Klein 4B's group_size=64 layers were misread as group_size=128 (the previous fallback).

Pipeline integration:
- Add Main_SDNQ_Diffusers_Flux2_Config matching Flux2Pipeline / Flux2KleinPipeline folders with quantized transformer.
- Flux2SDNQCheckpointModel now dispatches all pipeline submodels: transformer (Flux2Transformer2DModel.from_config + sdnq state dict), text_encoder (Qwen3ForCausalLM SDNQ + lm_head/embed_tokens tie), tokenizer (AutoTokenizer), vae (AutoencoderKLFlux2 / AutoencoderKL).
- Extend flux2_klein_model_loader._validate_diffusers_format and the isFlux2DiffusersMainModelConfig FE filter to also accept SDNQ pipeline configs (when submodels is populated).

Verified against Disty0/FLUX.2-klein-4B-SDNQ-4bit-dynamic: 98 uint4 + 2 int5 tensors load into a 3.88B-param Flux2Transformer2DModel with 0 missing / 0 unexpected keys; both dequant paths produce reasonable zero-centred weight distributions.
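The int5 sign-extension (raw 0..31 minus 16) and the optional zero_point can be illustrated with a small sketch. It assumes the 5-bit codes are already unpacked into integers, and that zero_point, when present, is added after scaling as in the asymmetric convention described earlier in this PR; the function name `dequantize_int5_group` is hypothetical and the bit-level packing is not shown.

```python
from typing import List, Optional, Sequence


def dequantize_int5_group(
    raw: Sequence[int], scale: float, zero_point: Optional[float] = None
) -> List[float]:
    """Dequantize one group of already-unpacked 5-bit codes (each in 0..31)."""
    out: List[float] = []
    for code in raw:
        if not 0 <= code <= 31:
            raise ValueError(f"not a 5-bit code: {code}")
        signed = code - 16          # map 0..31 onto -16..15
        value = signed * scale
        if zero_point is not None:  # scale-only tensors skip this step
            value += zero_point     # assumed post-scale bias
        out.append(value)
    return out
```

With `scale=0.5` and no zero_point, the codes `[0, 16, 31]` map to `[-8.0, 0.0, 7.5]`, so code 16 sits at zero and the range is slightly asymmetric.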
Main_Diffusers_Flux2_Config so identification routes them to the SDNQ configs instead. Without this both configs accept the folder and the plain diffusers loader wins, then crashes when reading packed uint8 weights as bf16.
diffusion_pytorch_model-{00001,00002}-of-00002.safetensors and FLUX.2
dev's sharded transformer both load. Detect cross-shard key collisions
as a corruption signal.
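Merging shards while treating a repeated key as corruption can be sketched with plain dictionaries. The function name `merge_shards` is hypothetical, and each shard is assumed to be already loaded into a key-to-tensor mapping.

```python
from typing import Dict, Iterable, Mapping, TypeVar

T = TypeVar("T")


def merge_shards(shards: Iterable[Mapping[str, T]]) -> Dict[str, T]:
    """Merge per-shard state dicts, refusing any key that appears in two shards."""
    merged: Dict[str, T] = {}
    for index, shard in enumerate(shards):
        collisions = merged.keys() & shard.keys()
        if collisions:
            raise ValueError(
                f"shard {index} repeats keys from an earlier shard: {sorted(collisions)}"
            )
        merged.update(shard)
    return merged
```

Raising instead of letting the later shard overwrite the earlier one avoids silently loading a checkpoint whose shards were mixed up or duplicated.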
"main_is_diffusers" in z_image_model_loader and flux2_klein_model_loader so the auto-extract-submodels branch handles them. Without this the loader demanded a separate VAE/Qwen3 source even though the SDNQ pipeline carries those submodels itself. - Drop the ui_model_format=Diffusers hint on Klein's qwen3_source_model field so the FE combobox can also show SDNQ pipeline configs (the FE filter already accepts them).
Loading the Klein 4B SDNQ pipeline as the main model errored with "No Qwen3 Encoder selected" in the UI even though the pipeline carries its own Qwen3 + VAE submodels, and the Model Manager showed no format badge at all on SDNQ models.

- flux2_klein_model_loader now treats SDNQ-with-submodels as main_is_diffusers, so the auto-extract-submodels branch handles SDNQ pipelines exactly like plain diffusers. Drop the ui_model_format=Diffusers hint on qwen3_source_model so the combobox can also show SDNQ pipeline configs.
- readiness.ts no longer demands a standalone VAE/Qwen3 for FLUX.2 Klein when the main model is itself a pipeline (diffusers or SDNQ-with-submodels). Without this the Invoke button stayed disabled with "Non-diffusers FLUX.2 Klein models require a standalone Qwen3 Encoder" even when the SDNQ pipeline could self-source everything.
- Register sdnq_quantized in zModelFormat, the manually-edited OpenAPI schema, ModelFormatBadge, and MODEL_FORMAT_TO_LONG_NAME so SDNQ models render an "sdnq" badge instead of an empty placeholder.
- 4 new starter models covering all SDNQ pipelines verified end-to-end in this branch: FLUX.1 schnell, Z-Image Turbo, FLUX.2 Klein 4B (dynamic mixed), FLUX.2 Klein 9B (dynamic mixed + SVD). Each entry is self-contained (no separate encoder/VAE dependencies because the SDNQ pipeline folder bundles them).
- New /configuration/sdnq-quantization/ page: support matrix, VRAM footprints, install steps (Starter Models + HF + Folder), LoRA compatibility notes, SDNQ-vs-SVDQuant/Nunchaku disambiguation, comparison with GGUF/NF4/FP8, troubleshooting.
- Cross-link from fp8-storage.mdx's "no-op on quantized" caution.
Z-Image and Qwen3 SDNQ configs were missing `variant` (and `cpu_only` on Qwen3) fields that exist on the other variants of the same union, breaking TypeScript narrowing on the FE.

- Main_SDNQ_ZImage_Config: add variant (default Turbo)
- Main_SDNQ_Diffusers_ZImage_Config: add variant, detect from scheduler_config.json shift value
- Qwen3Encoder_SDNQ_Config: add cpu_only + variant, detect from embed_tokens shape
- Qwen3Encoder_SDNQ_Folder_Config: add cpu_only + variant, detect from config.json hidden_size
- Regenerate FE schema.ts

Discriminator tags are unchanged since variant has no default.
Summary
Adds support for SDNQ (SD.Next Quantization) as a new quantization format in InvokeAI, enabling memory-efficient inference for large models on consumer GPUs.
What's included:
- `sdnq` quantization backend (`invokeai/backend/quantization/sdnq/`) with `SDNQTensor`, dequant utils, and safetensors loaders (incl. multi-shard support)
- … `norm_out` scale/shift fix)
- `ZImagePipeline` diffusers folders (all submodels dispatched via SDNQ loader)
- `ZImagePipeline` / `Flux2KleinPipeline` folders as `main_is_diffusers` so submodels auto-extract (no separate VAE/Qwen3 source required)
- `SDNQ` model format badge, schema/types regeneration, readiness updates, Klein FE combobox now accepts SDNQ pipeline configs
- `docs/src/content/docs/configuration/sdnq-quantization.mdx`
- `tests/backend/quantization/sdnq/` covering tensor dequant + loader behavior; custom-modules tests extended

Why: SDNQ enables running FLUX, FLUX.2, and Z-Image on lower-VRAM GPUs by loading pre-quantized weight folders directly, without runtime conversion overhead.
Related Issues / Discussions
Closes #8789
QA Instructions
- `diffusion_pytorch_model-*-of-*.safetensors` files merge correctly (Klein 9B, FLUX.2 dev)
- … `bf16` reads on packed `uint8` weights
- … `norm_out` scale/shift swap).
- `uv run --extra cuda pytest tests/backend/quantization/sdnq/`.

Merge Plan
Needs Testing
Checklist
What's New copy (if doing a release after this PR)