CMSIS-NN Graph Repair Tool

CMSIS-NN is Arm's embedded neural network library that provides highly optimized integer and DSP-accelerated kernels for efficient inference on resource-constrained microcontrollers.

A deterministic, static ONNX graph repair tool that automatically zero-pads weight and bias tensors to satisfy CMSIS-NN fast-path alignment constraints, without requiring retraining. The transformation preserves original model behavior while reshaping internal dimensions to fully unlock optimized kernel routing on ARM Cortex-M microcontrollers.

Motivation

CMSIS-NN selects optimized kernels at runtime via wrapper functions that inspect tensor dimensions before dispatching. A model exported with channels that do not satisfy alignment requirements silently falls through to scalar fallback kernels — degrading inference throughput on Cortex-M devices with no warning at compile time.

Existing toolchains do not address this gap. The standard workflow is to retrain with aligned channel counts. This tool performs the alignment as a post-export graph transformation, preserving model semantics while unlocking fast-path dispatch.

Novelty

Constraint knowledge base built from CMSIS-NN wrapper source — every alignment requirement is traced to an exact routing condition in the C wrapper. knowledge_base/constraints.py derives constraints directly from the CMSIS-NN implementation — nothing is assumed or heuristic.
Union-Find propagation of ONNX op semantics — dimension variables are coupled by the shape compatibility rules of each operation type, so padding one tensor automatically identifies all tensors that must move together — crucial for stability and unchanged behaviour.
Three-class violation taxonomy — FREE, COUPLED, and LOCKED violations are distinguished before any patch is attempted, giving a clear audit trail of what was changed and why.
Reshape safety detection — tensors that feed into Reshape nodes with hardcoded shape constants are locked rather than silently patched, preventing shape mismatches downstream.

Core Structure

cmsis_nn_lab/
├── pipeline.py               # padding pipeline entry point
├── make_test_model.py        # primary MobileNet-style ONNX generator
├── report.py                 # structured violation reports
├── models/                   # generated ONNX + JSON (make tool-model)
├── build/                    # firmware ELF/BIN/objects + qemu.log
├── build/reports/            # generated analysis reports
├── docs/                     # static docs (EXECUTION_GUIDE.md, etc.)
├── scripts/                  # QEMU/dispatch report parsers
├── scripts/experimental/     # optional alternate model generator
├── src/                      # Cortex-M4 harness (QEMU)
├── knowledge_base/
├── graph/
├── analysis/
└── transforms/

Start-to-end commands: see docs/EXECUTION_GUIDE.md.

Each layer depends only on the layer below it. transforms calls analysis, analysis calls graph, knowledge_base is read-only data with no dependencies.

Pipeline

load model
    ↓
assign dim variables
    ↓
propagate constraints
    ↓
classify violations
    ↓
check feasibility
    ↓
build padding plan
    ↓
detect Reshape locks
    ↓
apply padding
    ↓
validate

Dispatch-Aware Logic Updates

Recent improvements make the pipeline route-aware rather than alignment-only:

Constraint applicability gating: channel constraints are now evaluated only when the wrapper preconditions make them relevant for that node (e.g., Conv 1xN gating on H==1, kH==1, etc.).
Coupled-target resolution by union-find root: coupled violations now resolve to one target per true coupling group.
Robust report layer: pipeline execution no longer depends on external ad-hoc printing; structured reports are available through report.py.
Gemm bias padding correctness: Gemm output-feature dimension is inferred with transB semantics before bias padding.

Constraint Knowledge Base

Constraints are derived directly from CMSIS-NN wrapper routing conditions, not from benchmarking or heuristics.

Example:

# arm_convolve_wrapper_s8.c BRANCH B  →  arm_convolve_1_x_n_s8
#   elif (input_dims->h == 1
#         && dilation.w == 1
#         && filter_dims->h == 1
#         && (conv_params->stride.w * input_dims->c) % 4 == 0
#         && input_dims->c == filter_dims->c)
#
# For stride_w == 1  → input_channels % 4 == 0
# alignment=4 satisfies all stride values.

Every constraint entry records:

dimension index
required alignment
whether the dimension is patchable
which kernel fast-path becomes available
the exact CMSIS-NN source condition.

Union-Find Propagation

Dimension variables are symbolic names assigned to each tensor axis. The propagator walks every node and unions variables that must remain equal according to ONNX operator semantics.

Examples:

Conv input.channels == weight.in_channels weight.out_channels == output.channels
Add all inputs and outputs must match on every axis
Concat tensors match on all axes except the concat axis
Reshape / Flatten output dimensions are locked because they are defined by constants.

Union-find ensures that if two variables must be equal through a chain of operations, they are automatically grouped and patched together.

Violation Classification

Classification	Condition	Action
FREE	group size == 1	patch directly
COUPLED	group size > 1	patch all members together
LOCKED	graph I/O or reshape constrained	report and skip

Example Output

MobileNet-style inverted residual block:

Node       | Op            | Constraint      | Current | Target | Status
------------------------------------------------------------------------
PW1        | Conv          | output_channels | 6       | 8      | PATCHED
DW2        | DepthwiseConv | channels        | 6       | 8      | PATCHED
PW2        | Conv          | input_channels  | 6       | 8      | PATCHED
PW2        | Conv          | output_channels | 6       | 8      | PATCHED
Expand     | Conv          | input_channels  | 6       | 8      | PATCHED
Expand     | Conv          | output_channels | 10      | 12     | PATCHED
StemConv   | Conv          | input_channels  | 3       | 4      | LOCKED

Host Validation: Behavioural Consistency

Before deploying repaired models to embedded hardware, it is critical to verify that the graph transformation preserves the original model behaviour.

The repair process modifies internal tensor dimensions through channel padding. Validation ensures that these structural changes do not alter:

model execution correctness
operator behaviour
output tensor shapes
runtime stability.

Validation Command

python benchmark.py test_model_mixed.onnx test_model_mixed_patched.onnx

Node-Level Runtime Comparison

Node                                 Before (us)   After (us)      Delta
------------------------------------------------------------------------
Conv/ConvB_kernel_time                         7            7         +0
Conv/DW_A_kernel_time                          9            9         +0
Conv/PW_A_kernel_time                          7            7         +0
Conv/PW_B_kernel_time                          7            7         +0
Conv/ProjConv_kernel_time                      7            7         +0
Conv/StemConv_kernel_time                     11           10         -1
Flatten/Flat1_kernel_time                      4            4         +0
FusedConv/MidA_kernel_time                     9            9         +0
Gemm/Gemm1_kernel_time                         5            5         +0
GlobalAveragePool/GAP_kernel_time              6            6         +0

Observations

Per-node runtime differences fall within measurement noise (−1 μs to +0 μs).

This confirms that the transformation:

preserves operator execution behaviour
does not introduce runtime instability
does not change output tensor dimensions
does not introduce meaningful host-side overhead.

Significance

Structural graph transformations must preserve both functional correctness and interface stability.

This validation demonstrates that:

the repaired model remains fully functional,
the transformation does not alter inference semantics, and
the external model interface remains unchanged.

The repaired graph can therefore be safely deployed to embedded inference environments where CMSIS-NN optimized kernels will provide the expected performance improvements.

Usage

Single edit point: change architecture/channel widths in make_test_model.py, then run the rest from cmsis_nn_lab/:

cd cmsis_nn_lab
make full-flow          # ONNX → patch → harness codegen → cross-compile → QEMU → reports
make padding-sweep      # padding policy comparison (optional)
make ci-sanity          # full-flow + integrity checks

make full-flow regenerates the QEMU harness from your mixed/patched ONNX pair (tool-gen-harness). You do not edit src/main.c case tables by hand.

Artifacts:

Path	Description
`models/test_model_mixed.onnx`	misaligned test model
`models/test_model_mixed_patched.onnx`	padded model
`models/test_model_mixed_report.json`	violation report
`build/reports/onnx_pair_dispatch_report.txt`	ONNX dispatch diff
`build/reports/kernel_selection_report.md`	runtime kernel selection
`build/reports/qemu_research_report.md`	compute/memory analysis

Full step-by-step commands: docs/EXECUTION_GUIDE.md.

Observed effects (MobileNetV1-mini, auto harness)

With the default make_test_model.py graph, padding repairs 14 channel violations and QEMU exercises every Conv node (stem, six depthwise, six pointwise) using shapes codegen'd from ONNX.

Typical outcomes for this architecture (see build/reports/qemu_research_report.md):

Layer family	Route change after patch?	What padding still improves
`Conv_Stem` (3×3)	Usually no (stays `arm_convolve_s8`)	MAC + weight footprint; DSP tail removal
`DW_*` (3×3 depthwise)	Usually no (3×3 fast path)	Channel-aligned depthwise packing
`PW_*` (1×1, stride 1)	Usually no (already `arm_convolve_1x1_s8_fast`)	MAC + param bytes; matmul tile efficiency

To force wrapper route flips in QEMU, add layers that sit on dispatch boundaries (e.g. 1×N H==1 convs or misaligned 1×1 stride≠1) in make_test_model.py — the harness and reports will follow automatically.

How this is documented and verified:

ONNX-side predicted wrapper diffs are captured in build/reports/onnx_pair_dispatch_report.txt.
Runtime kernel-call deltas are captured in build/reports/kernel_selection_report.md.
Measurements come from Cortex-M4 cross-compiled firmware linked against CMSIS-NN and executed in QEMU.
Selection evidence is from wrapped symbols (--wrap=...), not only static ONNX analysis.

CI validation

make ci-sanity          # full-flow + basic gates
make ci-sanity-strict   # also requires kernel-selection changes in QEMU report

Requirements

onnx>=1.13.0
onnxruntime>=1.14.0
numpy>=1.23.0
protobuf>=3.20.0

Python ≥ 3.10 required.

Scope

The tool supports typical CNN architectures deployed on Cortex-M devices:

convolution layers
depthwise convolution
pointwise convolution
residual additions
concatenation
fully connected layers.

Transformer and RNN architectures are outside scope.

The framework is backend-agnostic and could be extended to other kernel selection systems that depend on tensor alignment constraints.

Attribution

This project derives alignment constraint definitions from the CMSIS-NN project.

See the NOTICE file for full attribution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CMSIS-NN Graph Repair Tool

Motivation

Novelty

Core Structure

Pipeline

Dispatch-Aware Logic Updates

Constraint Knowledge Base

Union-Find Propagation

Violation Classification

Example Output

Host Validation: Behavioural Consistency

Validation Command

Node-Level Runtime Comparison

Observations

Significance

Usage

Observed effects (MobileNetV1-mini, auto harness)

CI validation

Requirements

Scope

Attribution

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
CMSIS-NN/Source		CMSIS-NN/Source
analysis		analysis
docs		docs
graph		graph
knowledge_base		knowledge_base
model_profiles		model_profiles
models		models
scripts		scripts
src		src
transforms		transforms
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
linker.ld		linker.ld
make_test_model.py		make_test_model.py
pipeline.py		pipeline.py
report.py		report.py

Folders and files

Latest commit

History

Repository files navigation

CMSIS-NN Graph Repair Tool

Motivation

Novelty

Core Structure

Pipeline

Dispatch-Aware Logic Updates

Constraint Knowledge Base

Union-Find Propagation

Violation Classification

Example Output

Host Validation: Behavioural Consistency

Validation Command

Node-Level Runtime Comparison

Observations

Significance

Usage

Observed effects (MobileNetV1-mini, auto harness)

CI validation

Requirements

Scope

Attribution

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages