[DRAFT] Zaya 1 Draft Support #2
- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant that's already used by other SSM-based architectures (mamba, jamba, etc.).
- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this a semantically correct replacement that reduces custom constants.
- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input, making this a semantically reasonable replacement.
- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate, making this a reasonable replacement for reducing custom constants.
- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant for both weight and bias tensors, differentiated by suffix.
- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant for both weight and bias tensors, differentiated by suffix.
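The "same constant, different suffix" idea from the commits above can be sketched as follows. This is only an illustration: `tensor_name` is a hypothetical helper, and the base-name pattern `blk.{bid}.ffn_gate` is assumed here rather than taken from this PR's diff.

```python
# Hypothetical helper: illustrates the "<base>.<suffix>" naming idea only.
# It is not the llama.cpp or gguf-py API.
def tensor_name(base: str, layer: int, suffix: str) -> str:
    """Format a per-layer tensor name such as 'blk.3.ffn_gate.weight'."""
    return f"{base.format(bid=layer)}.{suffix}"


FFN_GATE = "blk.{bid}.ffn_gate"  # assumed base-name pattern

# One constant serves both tensors; only the suffix differs.
weight_name = tensor_name(FFN_GATE, 3, "weight")
bias_name = tensor_name(FFN_GATE, 3, "bias")

print(weight_name)
print(bias_name)
```

With this scheme a separate `*_B` constant per bias tensor is unnecessary, which is what lets the commits drop the dedicated bias constants.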
Okay, regarding the warning that appears during quantization: this warning is normal, and suppressing it would require changes that affect all models. So it's best to keep the warning; it doesn't impact inference.
Remove hardcoded 256 value for router MLP hidden size and read it from the GGUF expert_feed_forward_length metadata key instead. The converter now writes zaya_mlp_expansion from config.json.
val_proj1 and val_proj2 output dimension should be latent_k_dim / 2 (n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently both are equal for ZAYA1-8B (n_head_kv=2), but this would break for any other n_head_kv configuration.
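A small arithmetic sketch of why the two dimensions only coincide for n_head_kv=2. It assumes n_embd_k = n_embd_head * n_head_kv (the usual grouped-KV width); the head size of 128 is a made-up value for illustration, not read from the model config.

```python
def val_proj_out_dim(n_embd_head: int, n_head_kv: int) -> int:
    """Output width of val_proj1/val_proj2 as latent_k_dim / 2."""
    # Assumption: the latent K width is head size times number of KV heads.
    n_embd_k = n_embd_head * n_head_kv
    return n_embd_k // 2


head_dim = 128  # hypothetical head size, for illustration only

for n_head_kv in (1, 2, 4, 8):
    dim = val_proj_out_dim(head_dim, n_head_kv)
    # The old code used n_embd_head directly; it matches only when n_head_kv == 2.
    print(n_head_kv, dim, dim == head_dim)
```

Only the n_head_kv=2 row reports a match, so sizing the tensors by n_embd_head would produce the wrong shape for any other KV head count.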
Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv, and RWKV time_mix tensors. These small conv weights (d_conv=2) are not divisible by quant block sizes (32), causing Q8_0 failures.
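The divisibility problem can be illustrated with a minimal check. The block size of 32 and d_conv=2 come from the commit message; `can_block_quantize` is a hypothetical helper, not a ggml function, and 4096 is just an example row length.

```python
QK8_0 = 32  # Q8_0 block size, as stated in the commit message


def can_block_quantize(row_len: int, block_size: int = QK8_0) -> bool:
    """A row can be block-quantized only if it splits into whole blocks."""
    return row_len % block_size == 0


d_conv = 2  # conv kernel width from the commit message

print(can_block_quantize(d_conv))  # too short for a single 32-element block
print(can_block_quantize(4096))    # example of a row length that divides evenly
```

Tensors that fail this check are kept in a floating-point type, which is the pattern the commit follows.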
ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw and cca_conv_grp to F16 before convolution to support quantized models (Q4, Q8). CUDA/SYCL backends are unaffected since their im2col implementation only reads kernel dimensions, not data.
ROCm and Vulkan backends require contiguous tensors for im2col and mul_mat operations. Add ggml_cont after ggml_cast for conv kernels and after ggml_concat for hs_d to ensure compatibility across all backends. CUDA was unaffected since it handles non-contiguous tensors more permissively.
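To show what "contiguous" means here, the sketch below models a tensor by its extents and byte strides, innermost dimension first, in the style of ggml's ne/nb arrays. It is a simplified model for non-quantized types, and both helpers are hypothetical rather than ggml functions.

```python
def row_major_strides(ne, type_size):
    """Byte strides of a densely packed tensor, innermost dimension first."""
    nb = [type_size]
    for n in ne[:-1]:
        nb.append(nb[-1] * n)
    return nb


def is_contiguous(ne, nb, type_size):
    """True when the strides match a dense layout for these extents."""
    return list(nb) == row_major_strides(ne, type_size)


F32_SIZE = 4

# A dense 4x3 F32 tensor.
ne = [4, 3]
nb = row_major_strides(ne, F32_SIZE)

# A permuted view swaps extents and strides without moving any data.
ne_view, nb_view = [ne[1], ne[0]], [nb[1], nb[0]]

# Making the view contiguous rewrites the data with dense strides.
nb_cont = row_major_strides(ne_view, F32_SIZE)

print(is_contiguous(ne, nb, F32_SIZE))
print(is_contiguous(ne_view, nb_view, F32_SIZE))
print(is_contiguous(ne_view, nb_cont, F32_SIZE))
```

The permuted view fails the check until its data is copied into a dense layout, which is the role ggml_cont plays in the commit.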
Tested this PR: llama-cli is working, but it fails with llama-server.
Thanks for the catch @Ramachandrajoshi. I'll fix that later. In the meantime, you can use this command instead: `llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf --parallel 1`. That should work.
- Add ggml_cont(prev_hs) for non-contiguous tensor view (n_seqs > 1)
- Replace ggml_conv_1d_dw with ggml_ssm_conv for proper batch support
- Cast conv kernel to F32 and permute output shape

ggml_conv_1d_dw does not support n_seqs > 1 (assert b->ne[3] == 1). Use ggml_ssm_conv which is designed for SSM models with batching.
@nanduruganesh Can you try again with the correction? It should be fixed by now :)
Thanks, with |
The model's config.json reports vocab_size=262272 but the actual tokenizer only has 262147 tokens. The 125 extra entries are padding in PyTorch's embed_tokens.weight matrix that don't correspond to any real tokens. Use the pre-computed _tokenizer_vocab_size to write the correct vocab size in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
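The numbers in this commit can be checked with a short sketch. The two sizes come from the commit message; `gguf_vocab_size` is a hypothetical helper that mirrors the idea of preferring the tokenizer's count, not the converter's actual code.

```python
config_vocab_size = 262272     # vocab_size from config.json
tokenizer_vocab_size = 262147  # tokens the tokenizer actually defines

# Rows in embed_tokens.weight that map to no real token.
padding_rows = config_vocab_size - tokenizer_vocab_size


def gguf_vocab_size(config_size: int, tokenizer_size: int) -> int:
    """Pick the vocab size to write to GGUF metadata (illustrative only)."""
    if tokenizer_size > config_size:
        # The embedding matrix would be too small for the tokenizer.
        raise ValueError("tokenizer has more tokens than embedding rows")
    return tokenizer_size


print(padding_rows)
print(gguf_vocab_size(config_vocab_size, tokenizer_vocab_size))
```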
Add detailed inline comments mapping each C++ code section to the corresponding zaya.py and cca.py Python lines, including code snippets for direct comparison.
zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via (self.layer_number != zaya_first_layer). Add il != 1 guard to match.
05ec4f4 to 2b0c8c8
… _FP32EmbeddingMethod
Correct line reference from zaya.py L387-389 to L459-469, and add note explaining why excluding the skip expert from gate_probs is correct (bias=-1.0 makes it effectively never selected at inference with topk=1).
- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)
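The masking behaviour described above can be sketched in plain Python. Both functions are hypothetical stand-ins for the graph code, operating on lists instead of ggml tensors and on a single sequence.

```python
def build_cca_mask(n_tokens: int) -> list[float]:
    """All-ones mask: the ubatch carries no padding info, so nothing is masked."""
    return [1.0] * n_tokens


def apply_cca_mask(hidden, mask, n_seq_tokens: int):
    """Scale each token's hidden vector by its mask value, during prefill only."""
    if n_seq_tokens <= 1:
        # Single-token decode: the mask is skipped entirely.
        return hidden
    return [[x * m for x in row] for row, m in zip(hidden, mask)]


hidden = [[0.5, -1.0], [2.0, 3.0], [4.0, 0.25]]  # 3 tokens, width 2
mask = build_cca_mask(len(hidden))

print(apply_cca_mask(hidden, mask, n_seq_tokens=3))
print(apply_cca_mask(hidden, [1.0, 0.0, 1.0], n_seq_tokens=3))
```

With the all-ones mask the hidden states pass through unchanged; the second call shows what a real padding mask would do if that information were available.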
Match Python reference which casts hidden_states and residual to float32 before ggml_add in both per-layer and final residual paths. zaya.py ref: L900, L1387, L1701
This reverts commit f1bd772.
- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
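The two type rules in this commit can be modelled as small functions. The strings below stand in for ggml type enums. The first rule follows the ternary quoted in the commit. The second assumes that F16 and F32 are the only kernel types passed through unchanged, which the commit message implies but does not state.

```python
F32, F16, BF16, Q4_0 = "f32", "f16", "bf16", "q4_0"  # stand-ins for ggml types


def im2col_dst_type(kernel_type: str) -> str:
    """Mirror of: a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32."""
    return F16 if kernel_type == F16 else F32


def promoted_kernel_type(kernel_type: str) -> str:
    """Promote unsupported kernel types to F32 (assumed supported set: F16, F32)."""
    return kernel_type if kernel_type in (F16, F32) else F32


for t in (F32, F16, BF16, Q4_0):
    print(t, im2col_dst_type(t), promoted_kernel_type(t))
```

Under these rules a BF16 or quantized kernel ends up as an F32 kernel with an F32 im2col buffer, so the following mul_mat is F32xF32 as the commit describes.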
This is a safety guard matching self.layer_number != zaya_first_layer in the original implementation. No behavioral change for correctly converted models since the tensor is already nullptr for layer 1.
The model config has residual_in_fp32=true. Cast both residual branches to float32 to align with the python reference.
@nanduruganesh I've created a new pull request here, so you can make changes and I can make changes too.
So here, I've refactored the code to use the existing constants while keeping it functional. I used Open Code to make the changes. It will likely replace PR #1.
I'm reposting the message you originally posted below for anyone who wants to try the PR
Quickstart
Output:
Todo: