The component that appears in every sparse accelerator paper but is never released.
RTL implementation of a bitmap-driven sparse MAC pipeline in SystemVerilog, synthesized to SKY130A 130nm standard cells via Yosys. Activation sparsity stimulus derived from real ResNet-50 inference using PyTorch forward hooks.
In ReLU-activated networks like ResNet-50, 40–70% of activations are exactly zero after each ReLU layer. Standard dense MAC pipelines multiply these zeros anyway — wasting cycles and energy on operations that contribute nothing to the result. This project implements the hardware mechanism that skips them: a bitmap-encoded sparse index decoder feeding a zero-skipping multiply-accumulate pipeline.
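The saving can be sketched in software before looking at the hardware. The snippet below is a minimal illustration, not the project's golden model: the function names and the example vector are made up here, and it only counts multiplies rather than modelling cycles.

```python
def dense_dot(acts, weights):
    """Multiply every pair, zeros included. Returns (result, multiplies)."""
    total = 0
    for a, w in zip(acts, weights):
        total += a * w
    return total, len(acts)


def zero_skip_dot(acts, weights):
    """Skip pairs whose activation is zero. Returns (result, multiplies)."""
    total, multiplies = 0, 0
    for a, w in zip(acts, weights):
        if a == 0:
            continue  # a zero activation contributes nothing to the sum
        total += a * w
        multiplies += 1
    return total, multiplies


# Hypothetical 16-element activation vector with 12 zeros (75% sparsity).
acts = [0, 3, 0, 0, 7, 0, 0, 0, 2, 0, 0, 0, 0, 5, 0, 0]
weights = list(range(1, 17))

dense_result, dense_ops = dense_dot(acts, weights)
sparse_result, sparse_ops = zero_skip_dot(acts, weights)
print(dense_result, dense_ops)    # 129 16
print(sparse_result, sparse_ops)  # 129 4
```

Both paths return the same dot product, but the zero-skipping path issues 4 multiplies instead of 16.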
    bitmap_i [15:0]
           │
           ▼
    ┌──────────────┐     addr_o [3:0]     ┌─────────────┐
    │ popcount     │──────────────────────▶ Weight SRAM │──┐
    │ priority_enc │                      │ (16 x 16b)  │  │
    │ decoder      │──fetch_cnt [9:0]─────▶ Activ. SRAM │  │
    └──────────────┘                      │ (1024 x 16b)│  │
           │ done_o                       └─────────────┘  │
           │                                     │         │
           │                            ┌────────▼──────┐  │
           │                            │ Stage 1: Reg  │  │
           │                            │ Stage 2: MUL  │◀─┘
           │                            │ Accumulator   │
           │                            └───────┬───────┘
           └────────────────────────────────────▼
                                         result_o [31:0]
                                             done_o
| Module | Type | Function |
|---|---|---|
| `popcount.sv` | Combinational | Counts set bits in the 16-bit bitmap to determine the non-zero count |
| `priority_enc.sv` | Combinational | Isolate-lowest-bit trick: extracts the position of the next non-zero |
| `decoder.sv` | Sequential | Bitmap register plus load/emit/done state machine; drives SRAM addresses |
| `sram_model.sv` | Sequential | Synchronous single-port SRAM with 1-cycle read latency |
| `mac_pipe.sv` | Sequential | Top-level: integrates decoder, both SRAMs, 2-stage MAC pipeline |
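The decoder's load/emit/done behaviour can be approximated by a short software model. This is a sketch under assumptions, not a transcription of `decoder.sv`: it assumes bit 0 of the bitmap maps to element 0, that `addr_o` is the bit position of each non-zero, and that `fetch_cnt` increments by one per emitted non-zero starting from the supplied offset. The function name `decode` is hypothetical.

```python
def decode(bitmap, act_offset=0):
    """Yield (addr, fetch_cnt) for each set bit, lowest bit first.

    Assumptions: bit i of the bitmap marks element i as non-zero,
    addr selects the weight, fetch_cnt selects the packed activation.
    """
    assert 0 <= bitmap < (1 << 16), "bitmap must fit in 16 bits"
    remaining = bitmap        # "load": latch the bitmap
    fetch_cnt = act_offset    # start of this word's packed activations
    while remaining:          # "emit": one non-zero per iteration
        lowest = remaining & -remaining   # isolate the lowest set bit
        addr = lowest.bit_length() - 1    # one-hot value to bit position
        yield addr, fetch_cnt
        remaining &= remaining - 1        # clear the bit just emitted
        fetch_cnt += 1
    # "done": falling out of the loop corresponds to asserting done_o


pairs = list(decode(0b1000_0000_0001_0010, act_offset=5))
print(pairs)  # [(1, 5), (4, 6), (15, 7)]
```

The number of emitted pairs equals the popcount of the bitmap, so an all-zero word finishes without emitting anything.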
| Metric | Value |
|---|---|
| Decoder area (SKY130A) | 1,594 μm² |
| Full mac_pipe area (SKY130A) | 15,175 μm² |
| Decoder overhead | 10.5% of total area (1,594 / 15,175 μm²) |
| Critical path | < 5 ns (no violations at 20/10/5 ns targets) |
| Implied max frequency | > 200 MHz on SKY130A 130nm |
| Bitmap width | 16 bits |
| Sparsity range tested | 0% to 93.8% |
| Bitmap combinations verified | 65,536 (exhaustive) |
| Sequential cell ratio | 16.3% registers / 83.7% combinational |
sparse_mac/
├── rtl/ # Synthesisable SystemVerilog source
│ ├── popcount.sv
│ ├── priority_enc.sv
│ ├── decoder.sv
│ ├── sram_model.sv
│ └── mac_pipe.sv
├── tb/ # Self-checking testbenches
│ ├── tb_popcount.sv
│ ├── tb_priority_enc.sv
│ ├── tb_decoder.sv
│ ├── tb_mac_pipe_debug.sv
│ └── tb_mac_pipe_multi.sv
├── scripts/ # Golden reference and stimulus generation
│ ├── gen_golden.py
│ └── gen_golden_multi.py
├── data/ # Stimulus files (generated — do not hand-edit)
│ ├── bitmap.mem
│ ├── values.mem
│ ├── weights.mem
│ ├── golden.txt
│ ├── golden_multi.txt
│ └── offsets_multi.txt
├── synth/ # Synthesis scripts and results
│ ├── synth_all.tcl
│ ├── mac_pipe.sdc
│ ├── run_synth.sh
│ └── results/ # Netlists per module per clock target
└── results/ # Area summary, sparsity CSV, waveforms
- Icarus Verilog `>= 11.0` (simulation)
- Yosys `>= 0.23` (synthesis)
- SKY130A PDK (standard cell library)
- Python `>= 3.9` with `torch`, `numpy` (golden reference generation)
# Single bitmap (ResNet-50 layer3 — highest sparsity)
python3 scripts/gen_golden.py
# Multi-bitmap (8 words, 0–93.8% sparsity)
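python3 scripts/gen_golden_multi.py

# Create the simulation output directory first (sim/ is not listed in the tree above)
mkdir -p sim
# Decoder exhaustive test (all 65,536 bitmaps)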
iverilog -g2012 -o sim/tb_decoder \
tb/tb_decoder.sv rtl/decoder.sv rtl/popcount.sv rtl/priority_enc.sv \
&& vvp sim/tb_decoder
# MAC pipeline single-bitmap test
iverilog -g2012 -o sim/tb_mac_pipe_debug \
tb/tb_mac_pipe_debug.sv rtl/mac_pipe.sv rtl/decoder.sv \
rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
&& vvp sim/tb_mac_pipe_debug
# MAC pipeline multi-bitmap test (8 sparsity levels)
iverilog -g2012 -o sim/tb_mac_pipe_multi \
tb/tb_mac_pipe_multi.sv rtl/mac_pipe.sv rtl/decoder.sv \
rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
&& vvp sim/tb_mac_pipe_multi

All testbenches are self-checking: a passing run prints PASS per test and exits 0.
# Edit PDK_ROOT in synth/synth_all.tcl to match your install path, then:
bash synth/run_synth.sh

Synthesises popcount, decoder, and mac_pipe at 20/10/5 ns clock targets. Prints area summary table to stdout. Netlists written to synth/results/.
Why bitmap over CSC? Compressed sparse column encoding is more memory-efficient at large dimensions but requires a sequential scan to find the next non-zero. Bitmap encoding supports O(1) non-zero count via popcount and O(1) next-index via isolate-lowest-bit (bitmap & -bitmap, followed by bitmap & (bitmap - 1) to clear that bit before the next lookup), making it faster for the 16-element vectors targeted here.
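The two bit tricks are easy to check in software against a plain sequential scan. The sketch below mirrors the spirit of the exhaustive decoder test by sweeping all 65,536 bitmaps; the helper names are hypothetical and the code is a reference check, not the testbench itself.

```python
def popcount16(b):
    """Number of set bits in a 16-bit word."""
    return bin(b & 0xFFFF).count("1")


def next_index_trick(b):
    """Index of the lowest set bit via b & -b, or -1 if b is zero."""
    lowest = b & -b                 # isolates the lowest set bit
    return lowest.bit_length() - 1  # 0 has bit_length 0, giving -1


def next_index_scan(b):
    """Same result found by scanning bit by bit."""
    for i in range(16):
        if (b >> i) & 1:
            return i
    return -1


def check_all():
    """Compare trick and scan on every 16-bit bitmap; return count checked."""
    checked = 0
    for b in range(1 << 16):
        assert next_index_trick(b) == next_index_scan(b)
        # Clearing the lowest bit repeatedly must take popcount steps.
        steps, rest = 0, b
        while rest:
            rest &= rest - 1        # clears the lowest set bit
            steps += 1
        assert steps == popcount16(b)
        checked += 1
    return checked


print(check_all())  # 65536
```

In hardware both operations are combinational, which is where the constant-time claim comes from; the Python loop in `next_index_scan` stands in for the sequential search a scan-based format would need.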
The SRAM offset problem. Multi-bitmap operation requires each bitmap word to index into a different slice of the packed activation SRAM. The act_offset_i port solves this by loading fetch_cnt with an absolute start address per word rather than resetting it to zero, which avoids out-of-bounds reads that return zeros and silently corrupt the dot product.
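One plausible way to derive the per-word start addresses is an exclusive prefix sum over the popcounts of the preceding bitmap words. The sketch below assumes the activation SRAM stores only non-zero values, packed back to back in word order; the function name `word_offsets` and the sample data are hypothetical, and the actual contents of `offsets_multi.txt` may be produced differently.

```python
def word_offsets(bitmaps):
    """Start address of each word's slice in the packed activation array."""
    offsets = []
    running = 0
    for b in bitmaps:
        offsets.append(running)          # this word starts where the last ended
        running += bin(b).count("1")     # advance by this word's non-zero count
    return offsets


# Three hypothetical bitmap words with 2, 2, and 1 non-zeros.
bitmaps = [0b0101, 0b0011, 0b1000]
packed = [10, 20, 30, 40, 50]            # non-zero activations, word by word

offsets = word_offsets(bitmaps)
print(offsets)  # [0, 2, 4]

# Correct slice for word 1 versus the slice read if the counter restarts at 0.
count = bin(bitmaps[1]).count("1")
correct = packed[offsets[1]:offsets[1] + count]
wrong = packed[0:count]
print(correct, wrong)  # [30, 40] [10, 20]
```

Restarting the counter at zero for every word reads the first word's activations again, so the second dot product is computed on the wrong data without any visible error.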
Why integer arithmetic? The golden reference and RTL both use 16-bit integer multiply rather than float16. This enables bit-exact software/hardware co-verification without floating-point rounding discrepancies. The tradeoff is that real inference requires proper float16 or INT8 quantisation — noted as a future extension.
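A bit-exact reference for this kind of datapath fits in a few lines. The model below is a sketch under stated assumptions: operands are treated as unsigned 16-bit values and the accumulator wraps modulo 2^32 to match the 32-bit `result_o` width. The actual signedness and overflow behaviour of the RTL are not stated here, and `mac_reference` is a hypothetical name rather than a function from `gen_golden.py`.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mac_reference(values, weights):
    """Integer multiply-accumulate with a wrapping 32-bit accumulator.

    Assumption: unsigned 16-bit operands, accumulator wraps modulo 2**32.
    """
    acc = 0
    for v, w in zip(values, weights):
        product = (v & MASK16) * (w & MASK16)  # fits in 32 bits
        acc = (acc + product) & MASK32         # wrap like a 32-bit register
    return acc


print(mac_reference([3, 7], [2, 5]))                 # 41
print(hex(mac_reference([0xFFFF] * 2, [0xFFFF] * 2)))  # 0xfffc0002
```

Because integer addition modulo 2^32 is associative, the result does not depend on accumulation order, so the hardware and the script can be compared value for value.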
| Module | Cells | Area (μm²) |
|---|---|---|
| `popcount` | — | 422.9 |
| `priority_enc` | — | 162.7 |
| `decoder` (total with submodules) | — | 1,594.0 |
| `mac_pipe` @ 20 ns | — | 15,175.0 |
| `mac_pipe` @ 10 ns | — | 15,198.0 |
| `mac_pipe` @ 5 ns | — | 15,198.0 |
The area-frequency curve is flat across all three targets, with no timing violations at 200 MHz (the 5 ns target). The critical path is therefore inferred to be under 5 ns.
Activation tensors captured from torchvision.models.resnet50 (pretrained, ImageNet) using register_forward_hook on all ReLU layers. Per-layer sparsity ranges from 2.4% (layer1) to 93.8% (layer3). The 8 bitmaps used in the multi-testbench were selected to uniformly sample this range.
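Once a ReLU output has been captured, each 16-element slice has to be turned into a bitmap word plus its packed non-zero values. The sketch below shows one way this could look. It assumes bit i of the bitmap corresponds to element i, and `pack_activations` is a hypothetical helper rather than code taken from `gen_golden.py`.

```python
def pack_activations(acts):
    """Return (bitmap, packed_values, sparsity) for a 16-element slice.

    Assumption: bit i of the bitmap is set when acts[i] is non-zero.
    """
    assert len(acts) == 16, "expected a 16-element slice"
    bitmap = 0
    values = []
    for i, a in enumerate(acts):
        if a != 0:
            bitmap |= 1 << i   # mark element i as non-zero
            values.append(a)   # keep non-zeros in element order
    sparsity = 1 - len(values) / 16
    return bitmap, values, sparsity


# Hypothetical slice with a single non-zero at index 3.
acts = [0, 0, 0, 9] + [0] * 12
bitmap, values, sparsity = pack_activations(acts)
print(f"{bitmap:04x}", values, sparsity)  # 0008 [9] 0.9375
```

With a 16-bit bitmap the sparsity of a single word moves in steps of 1/16, so the 93.8% figure corresponds to a word with 15 zeros and one non-zero.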
- Han et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
- Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017
- Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep CNNs, ISSCC 2016
- Extend bitmap width to 64/128 bits to match real tensor dimensions
- Add weight sparsity (two-sided sparse MAC matching SCNN architecture)
- Replace integer multiply with INT8 fixed-point with rounding mode control
- Run full OpenROAD place-and-route for real WNS and post-route power numbers
- Integrate as a custom sparse-load instruction extension on a RISC-V core
MIT