Enable separate-xclbin dispatch for FusedMLIROperator for debugging #117
Draft
andrej wants to merge 4 commits into
(fusion.py portion of upstream commit daf9162; operator-specific test changes omitted since those operators are not on devel)
Each MLIROperator subclass used by llama decode now exposes a reference() instance method callable with the operator's input tensors (shaped per get_arg_spec()), returning the output tensor. Covered: ElementwiseAdd, ElementwiseMul, SiLU, RMSNorm (weighted), GEMV (optionally batched), GEMM (b_col_maj/c_col_maj), Softmax, Transpose, Repeat, RoPE (method_type=0). Used by the new 'reference' and 'compare' fusion dispatch modes.
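As a rough illustration of the reference() contract described above, here is a minimal sketch in plain Python. The class names `ElementwiseAddSketch` and `RMSNormSketch` are hypothetical stand-ins, not the PR's classes, and lists of floats replace the torch tensors the real operators take; the `eps` default is likewise an assumption.

```python
import math


class ElementwiseAddSketch:
    """Hypothetical stand-in for an operator exposing reference()."""

    def __init__(self, size):
        self.size = size

    def reference(self, a, b):
        # Inputs are expected to match the operator's configured size.
        assert len(a) == len(b) == self.size
        return [x + y for x, y in zip(a, b)]


class RMSNormSketch:
    """Hypothetical weighted RMSNorm reference; eps value is an assumption."""

    def __init__(self, size, eps=1e-5):
        self.size = size
        self.eps = eps

    def reference(self, x, weight):
        assert len(x) == len(weight) == self.size
        mean_sq = sum(v * v for v in x) / self.size
        scale = 1.0 / math.sqrt(mean_sq + self.eps)
        # Normalize by the root mean square, then apply the per-element weight.
        return [v * scale * w for v, w in zip(x, weight)]
```

In both cases reference() is an instance method that takes only the operator's inputs and returns the output, so a dispatcher can call `op.reference(*inputs)` without knowing which operator it holds.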
- 'reference': pure-CPU evaluation of the runlist; each step calls op.reference(*inputs) on host-side torch.bfloat16 buffers. No NPU compilation or dispatch.
- 'compare': runs the separate-xclbin NPU pipeline (Phoenix path) and, after each step, runs op.reference() on the NPU-produced inputs and logs per-step max_abs / mean_abs / max_rel deviations. Because the reference is re-seeded from the NPU's actual inputs every step, each comparison reflects only the current operator's error (no accumulation).

New callables: FusedReferenceCallable, FusedCompareCallable (subclass of FusedXclbinCallable).
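The per-step re-seeding in 'compare' mode can be sketched as follows. This is an illustrative model only, not the FusedCompareCallable implementation: `compare_runlist`, `deviations`, `DoubleSketch`, and `fake_device_run` are hypothetical names, the runlist layout is assumed, the NPU dispatch is replaced by a fake that adds a fixed error of 0.5, and the exact max_rel formula used by the PR is not stated, so the one below is an assumption.

```python
def deviations(actual, expected, eps=1e-12):
    """Return (max_abs, mean_abs, max_rel); the max_rel formula is an assumption."""
    abs_err = [abs(a - e) for a, e in zip(actual, expected)]
    max_abs = max(abs_err)
    mean_abs = sum(abs_err) / len(abs_err)
    max_rel = max(d / (abs(e) + eps) for d, e in zip(abs_err, expected))
    return max_abs, mean_abs, max_rel


def compare_runlist(runlist, buffers, device_run):
    """runlist: list of (op, input_names, output_name) tuples (assumed layout)."""
    report = []
    for op, in_names, out_name in runlist:
        inputs = [buffers[name] for name in in_names]  # device-produced inputs
        device_out = device_run(op, inputs)
        ref_out = op.reference(*inputs)  # reference re-seeded from the same inputs
        report.append((type(op).__name__, deviations(device_out, ref_out)))
        buffers[out_name] = device_out  # the next step consumes the device result
    return report


class DoubleSketch:
    """Hypothetical operator: multiplies every element by two."""

    def reference(self, x):
        return [2.0 * v for v in x]


def fake_device_run(op, inputs):
    # Stand-in for an NPU dispatch: the exact result plus a fixed error of 0.5.
    return [v + 0.5 for v in op.reference(*inputs)]


buffers = {"x": [1.0, 2.0]}
runlist = [(DoubleSketch(), ["x"], "t"), (DoubleSketch(), ["t"], "y")]
report = compare_runlist(runlist, buffers, fake_device_run)
for name, (max_abs, mean_abs, max_rel) in report:
    print(f"{name}: max_abs={max_abs} mean_abs={mean_abs} max_rel={max_rel:.4f}")
```

Both steps report a max_abs of 0.5, even though the final buffer [5.5, 9.5] is 1.5 away from the error-free chain [4.0, 8.0]. That is the point of re-seeding: each reported deviation is attributable to a single operator rather than to everything upstream of it.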
CI Test Results
c05f5c9 (2026_05_12_21_59_06) IRON - CI Summary
Examples
iron/applications/llama_3.2_1b
Small
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/mha
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Krackan - Small
IRON
Tested on
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/mha
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Trends: IRON Trends
iron/operators/axpy
test_axpy[input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_8-tile_size_256-scalar_factor_3.0]
iron/operators/dequant
test_dequant[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32]
test_dequant[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32]
test_dequant[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-group_size_32]
test_dequant[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-group_size_32]
iron/operators/elementwise_add
test_elementwise_add[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_add[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_add[input_length_2048-num_aie_columns_4-tile_size_512]
test_elementwise_add[input_length_2048-num_aie_columns_8-tile_size_256]
iron/operators/elementwise_mul
test_elementwise_mul[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_mul[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_mul[input_length_2048-num_aie_columns_4-tile_size_512]
test_elementwise_mul[input_length_2048-num_aie_columns_8-tile_size_256]
iron/operators/gelu
test_gelu[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_gelu[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
test_gelu[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256]
test_gelu[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128]
iron/operators/gemm
test_gemm[M_1792-K_896-N_1152-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_64-k_32-n_48-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_8-b_col_maj_True-c_col_maj_True-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1]
test_gemm[M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4]
test_gemm[M_896-K_1792-N_640-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_32-k_64-n_80-trace_size_0-partition_N_1]
iron/operators/gemv
test_gemv[M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128]
test_gemv[M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048]
test_gemv[M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024]
test_gemv[M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512]
test_gemv[M_2048-K_8192-num_aie_columns_8-tile_size_input_1-tile_size_output_256]
test_gemv[M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_8-tile_size_input_4-tile_size_output_1024]
iron/operators/layer_norm
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
test_layer_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256]
test_layer_norm[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128]
iron/operators/mem_copy
test_mem_copy[input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048]
test_mem_copy[input_length_2048-num_cores_16-num_channels_2-bypass_False-tile_size_128]
test_mem_copy[input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_8-num_channels_1-bypass_False-tile_size_256]
test_mem_copy[input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256]
iron/operators/mha
test_mha[seq_len_16384-dim_64-num_heads_1-num_pipelines_8-num_kv_heads_0]
iron/operators/rms_norm
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-weighted_False]
iron/operators/rope
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_8-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_8-method_type_0]
iron/operators/softmax
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512]
iron/operators/swiglu_decode
test_swiglu_decode[embedding_dim_1024-hidden_dim_3584]
test_swiglu_decode[embedding_dim_2048-hidden_dim_2048]
iron/operators/swiglu_prefill
test_swiglu_prefill[seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False]
iron/operators/transpose
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8]
Krackan - Examples
IRON
Tested on
iron/applications/llama_3.2_1b
Trends: IRON Trends
iron/applications/llama_3.2_1b
test_llama_3_2_1b[llama_3.2_1b_prompt_1024_tokens_1]
test_llama_3_2_1b[llama_3.2_1b_prompt_1024_tokens_40]
test_llama_3_2_1b[llama_3.2_1b_prompt_13_tokens_1]
test_llama_3_2_1b[llama_3.2_1b_prompt_13_tokens_40]
Phoenix - Small
IRON
Tested on
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Trends: IRON Trends
iron/operators/axpy
test_axpy[input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0]
iron/operators/dequant
test_dequant[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32]
test_dequant[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32]
iron/operators/elementwise_add
test_elementwise_add[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_add[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_add[input_length_2048-num_aie_columns_4-tile_size_512]
iron/operators/elementwise_mul
test_elementwise_mul[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_mul[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_mul[input_length_2048-num_aie_columns_4-tile_size_512]
iron/operators/gelu
test_gelu[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_gelu[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
iron/operators/gemm
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1]
test_gemm[M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4]
iron/operators/gemv
test_gemv[M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128]
test_gemv[M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048]
test_gemv[M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024]
test_gemv[M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512]
test_gemv[M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024]
iron/operators/layer_norm
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
iron/operators/mem_copy
test_mem_copy[input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048]
test_mem_copy[input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256]
iron/operators/rms_norm
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False]
iron/operators/rope
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0]
iron/operators/softmax
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512]
iron/operators/swiglu_decode
test_swiglu_decode[embedding_dim_1024-hidden_dim_3584]
test_swiglu_decode[embedding_dim_2048-hidden_dim_2048]
iron/operators/swiglu_prefill
test_swiglu_prefill[seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False]
iron/operators/transpose
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8]
Phoenix - Examples
IRON
Tested on
Trends: IRON Trends
While single-dispatch operators improve performance, they make it hard to debug when something goes wrong. This adds several modes to the FusedMLIROperator so that it can dispatch layer by layer, inspect outputs after each layer, and compare against a reference for troubleshooting.