SPARCITY — Bitmap-Encoded Sparse Index Decoder and Zero-Skipping MAC Pipeline

A component that appears in most sparse accelerator papers but is rarely released as open RTL.

RTL implementation of a bitmap-driven sparse MAC pipeline in SystemVerilog, synthesised to SKY130A 130 nm standard cells with Yosys. The activation-sparsity stimulus is derived from real ResNet-50 inference, captured with PyTorch forward hooks.

Motivation

In ReLU-activated networks such as ResNet-50, a large share of activations (often 40–70%) is exactly zero after a ReLU layer; the per-layer figures measured for this project range from 2.4% to 93.8%. A standard dense MAC pipeline multiplies these zeros anyway, wasting cycles and energy on operations that contribute nothing to the result. This project implements the hardware mechanism that skips them: a bitmap-encoded sparse index decoder feeding a zero-skipping multiply-accumulate pipeline.

Architecture

bitmap_i [15:0]
     │
     ▼
┌──────────────┐     addr_o [3:0]     ┌─────────────┐
│   popcount   │──────────────────────▶ Weight SRAM  │──┐
│  priority_enc│                       │  (16 x 16b) │  │
│   decoder    │──fetch_cnt [9:0]────▶ Activ. SRAM  │  │
└──────────────┘                       │ (1024 x 16b)│  │
     │ done_o                          └─────────────┘  │
     │                                        │          │
     │                               ┌────────▼──────┐  │
     │                               │  Stage 1: Reg │  │
     │                               │  Stage 2: MUL │◀─┘
     │                               │  Accumulator  │
     │                               └───────┬───────┘
     └───────────────────────────────────────▼
                                       result_o [31:0]
                                       done_o

Module hierarchy

Module	Type	Function
`popcount.sv`	Combinational	Counts set bits in 16-bit bitmap — determines non-zero count
`priority_enc.sv`	Combinational	Isolate-lowest-bit trick — extracts position of next non-zero
`decoder.sv`	Sequential	Bitmap register + load/emit/done state machine — drives SRAM addresses
`sram_model.sv`	Sequential	Synchronous single-port SRAM with 1-cycle read latency
`mac_pipe.sv`	Sequential	Top-level: integrates decoder, both SRAMs, 2-stage MAC pipeline

Key Results

Metric	Value
Decoder area (SKY130A)	1,594 μm²
Full mac_pipe area (SKY130A)	15,175 μm²
Decoder overhead	10.5% of total area (1,594 / 15,175 μm²)
Critical path	< 5 ns (no violations at 20/10/5 ns targets)
Implied max frequency	> 200 MHz on SKY130A 130nm
Bitmap width	16 bits
Sparsity range tested	0% to 93.8%
Bitmap combinations verified	65,536 (exhaustive)
Sequential cell ratio	16.3% registers / 83.7% combinational

Repository Structure

sparse_mac/
├── rtl/                  # Synthesisable SystemVerilog source
│   ├── popcount.sv
│   ├── priority_enc.sv
│   ├── decoder.sv
│   ├── sram_model.sv
│   └── mac_pipe.sv
├── tb/                   # Self-checking testbenches
│   ├── tb_popcount.sv
│   ├── tb_priority_enc.sv
│   ├── tb_decoder.sv
│   ├── tb_mac_pipe_debug.sv
│   └── tb_mac_pipe_multi.sv
├── scripts/              # Golden reference and stimulus generation
│   ├── gen_golden.py
│   └── gen_golden_multi.py
├── data/                 # Stimulus files (generated — do not hand-edit)
│   ├── bitmap.mem
│   ├── values.mem
│   ├── weights.mem
│   ├── golden.txt
│   ├── golden_multi.txt
│   └── offsets_multi.txt
├── synth/                # Synthesis scripts and results
│   ├── synth_all.tcl
│   ├── mac_pipe.sdc
│   ├── run_synth.sh
│   └── results/          # Netlists per module per clock target
└── results/              # Area summary, sparsity CSV, waveforms

How to Run

Dependencies

Icarus Verilog >= 11.0 — simulation
Yosys >= 0.23 — synthesis
SKY130A PDK — standard cell library
Python >= 3.9 with torch, numpy — golden reference generation

1 — Generate stimulus and golden reference

# Single bitmap (ResNet-50 layer3 — highest sparsity)
python3 scripts/gen_golden.py

# Multi-bitmap (8 words, 0–93.8% sparsity)
python3 scripts/gen_golden_multi.py
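
As a rough illustration of what the stimulus step has to produce, the sketch below encodes one 16-element activation vector as a bitmap plus a packed list of non-zero values, then formats words as hex lines. The helper names `encode_word` and `to_mem_lines`, the LSB-first mapping (bit i marks element i), and the one-hex-word-per-line layout are assumptions made for this example; the actual scripts and `.mem` files may differ.

```python
WIDTH = 16          # bitmap width used in this project
MASK16 = 0xFFFF


def encode_word(activations):
    """Return (bitmap, packed_values) for one 16-element integer vector.

    Assumption: bit i of the bitmap is set when activations[i] is non-zero,
    and non-zero values are packed in ascending index order.
    """
    if len(activations) != WIDTH:
        raise ValueError("expected exactly 16 activations")
    bitmap = 0
    packed = []
    for i, value in enumerate(activations):
        if value != 0:
            bitmap |= 1 << i
            packed.append(value & MASK16)
    return bitmap, packed


def to_mem_lines(words):
    """Format 16-bit words as 4-digit hex strings, one per line (assumed layout)."""
    return [format(w & MASK16, "04x") for w in words]


if __name__ == "__main__":
    acts = [0, 3, 0, 0, 7, 0, 0, 1, 0, 0, 5, 0, 0, 0, 0, 0]
    bitmap, packed = encode_word(acts)
    print(to_mem_lines([bitmap]), packed)
```

Only the four non-zero values are stored, so the activation memory holds 4 entries for this word instead of 16.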

2 — Run simulation

# Decoder exhaustive test (all 65,536 bitmaps)
iverilog -g2012 -o sim/tb_decoder \
  tb/tb_decoder.sv rtl/decoder.sv rtl/popcount.sv rtl/priority_enc.sv \
  && vvp sim/tb_decoder

# MAC pipeline single-bitmap test
iverilog -g2012 -o sim/tb_mac_pipe_debug \
  tb/tb_mac_pipe_debug.sv rtl/mac_pipe.sv rtl/decoder.sv \
  rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
  && vvp sim/tb_mac_pipe_debug

# MAC pipeline multi-bitmap test (8 sparsity levels)
iverilog -g2012 -o sim/tb_mac_pipe_multi \
  tb/tb_mac_pipe_multi.sv rtl/mac_pipe.sv rtl/decoder.sv \
  rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
  && vvp sim/tb_mac_pipe_multi

All testbenches are self-checking — a passing run prints PASS per test and exits 0.

3 — Run synthesis (requires Yosys + SKY130A PDK)

# Edit PDK_ROOT in synth/synth_all.tcl to match your install path, then:
bash synth/run_synth.sh

Synthesises popcount, decoder, and mac_pipe at 20/10/5 ns clock targets. Prints area summary table to stdout. Netlists written to synth/results/.

Design Notes

Why bitmap over CSC? Compressed sparse column (CSC) encoding is more memory-efficient at large dimensions but requires a sequential scan to find the next non-zero. Bitmap encoding supports an O(1) non-zero count via popcount and O(1) next-index extraction via isolate-lowest-bit (bitmap & -bitmap); the consumed bit can then be cleared with bitmap & (bitmap - 1). This makes it faster for the 16-element vectors targeted here.
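
The two bit operations can be modelled in a few lines of Python. This is a behavioural sketch only, not a transcription of `popcount.sv` or `priority_enc.sv`; the function names are made up for the example. Isolating the lowest set bit uses `b & -b`, and clearing it afterwards uses `b & (b - 1)`.

```python
MASK16 = 0xFFFF  # 16-bit bitmap


def popcount16(bitmap):
    """Number of set bits, i.e. the non-zero count for this word."""
    return bin(bitmap & MASK16).count("1")


def isolate_lowest(bitmap):
    """One-hot mask of the lowest set bit (0 for an empty bitmap)."""
    return bitmap & (-bitmap & MASK16)


def clear_lowest(bitmap):
    """Bitmap with its lowest set bit removed."""
    return bitmap & (bitmap - 1) & MASK16


def nonzero_indices(bitmap):
    """Positions of set bits in ascending order, one per loop iteration."""
    indices = []
    remaining = bitmap & MASK16
    while remaining:
        one_hot = isolate_lowest(remaining)
        indices.append(one_hot.bit_length() - 1)  # one-hot mask -> bit position
        remaining = clear_lowest(remaining)
    return indices


if __name__ == "__main__":
    bm = 0b0000_0100_1001_0010
    print(popcount16(bm), nonzero_indices(bm))  # 4 [1, 4, 7, 10]
```

The loop runs once per set bit, so a word with 4 non-zeros takes 4 iterations regardless of where the bits sit.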

The SRAM offset problem. In multi-bitmap operation, each bitmap word must index into a different slice of the packed activation SRAM. The act_offset_i port handles this by loading fetch_cnt with an absolute start address per word rather than resetting it to zero. This avoids out-of-bounds reads that return zeros and silently corrupt the dot product.
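
One plausible way to derive the per-word start addresses is a running sum of popcounts, sketched below. It assumes the non-zero activations of consecutive words are stored back to back in the activation SRAM; `word_offsets` and `fetch_addresses` are hypothetical helpers, and the real contents of `offsets_multi.txt` may be computed differently.

```python
def popcount16(bitmap):
    """Number of non-zero elements described by a 16-bit bitmap."""
    return bin(bitmap & 0xFFFF).count("1")


def word_offsets(bitmaps):
    """Start address of each word's slice in a contiguously packed SRAM."""
    offsets = []
    running = 0
    for bm in bitmaps:
        offsets.append(running)
        running += popcount16(bm)
    return offsets


def fetch_addresses(bitmap, offset):
    """Absolute SRAM addresses read while processing one bitmap word."""
    return list(range(offset, offset + popcount16(bitmap)))


if __name__ == "__main__":
    bitmaps = [0x0492, 0x0000, 0xFFFF, 0x0001]
    offsets = word_offsets(bitmaps)
    print(offsets)  # [0, 4, 4, 20]
    for bm, off in zip(bitmaps, offsets):
        print(hex(bm), fetch_addresses(bm, off))
```

An all-zero word consumes no addresses, so it shares its offset with the following word.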

Why integer arithmetic? The golden reference and RTL both use 16-bit integer multiply rather than float16. This enables bit-exact software/hardware co-verification without floating-point rounding discrepancies. The tradeoff is that real inference requires proper float16 or INT8 quantisation — noted as a future extension.
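
A bit-exact integer reference can be very small, which is the point of the choice. The sketch below assumes unsigned 16-bit operands and a 32-bit accumulator that wraps on overflow (matching the width of `result_o`); signedness and overflow behaviour are assumptions here, and the function names are illustrative rather than taken from `gen_golden.py`.

```python
MASK32 = 0xFFFFFFFF


def dense_mac(activations, weights):
    """Reference dot product over all 16 elements, zeros included."""
    acc = 0
    for a, w in zip(activations, weights):
        acc = (acc + (a & 0xFFFF) * (w & 0xFFFF)) & MASK32
    return acc


def sparse_mac(bitmap, packed, weights):
    """Zero-skipping dot product: one multiply per set bit in the bitmap."""
    acc = 0
    fetch = 0                      # index into the packed non-zero values
    remaining = bitmap & 0xFFFF
    while remaining:
        idx = (remaining & -remaining).bit_length() - 1  # next non-zero position
        acc = (acc + packed[fetch] * (weights[idx] & 0xFFFF)) & MASK32
        fetch += 1
        remaining &= remaining - 1                       # consume that bit
    return acc


if __name__ == "__main__":
    acts = [0, 3, 0, 0, 7, 0, 0, 1, 0, 0, 5, 0, 0, 0, 0, 0]
    weights = list(range(1, 17))
    bitmap, packed = 0x0492, [3, 7, 1, 5]
    print(dense_mac(acts, weights), sparse_mac(bitmap, packed, weights))  # 104 104
```

Because every operation is an integer add or multiply with a fixed wrap, the dense and sparse results must match exactly, with no tolerance needed in the comparison.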

Synthesis Results (SKY130A TT 25C 1V8)

Module	Cells	Area (μm²)
`popcount`	—	422.9
`priority_enc`	—	162.7
`decoder` (total with submodules)	—	1,594.0
`mac_pipe` @ 20 ns	—	15,175.0
`mac_pipe` @ 10 ns	—	15,198.0
`mac_pipe` @ 5 ns	—	15,198.0

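Area is essentially flat across all three clock targets (15,175 to 15,198 μm², about 0.15% growth), and no timing violations were reported at the 5 ns (200 MHz) target, so the critical path is inferred to be under 5 ns. These are synthesis-level figures; post-route timing is not yet available.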
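
The frequency and area figures follow from simple arithmetic on the table above, shown here as a small check. The area values are copied from the table; the helper names are made up for the example.

```python
# mac_pipe area in square micrometres, keyed by clock target in ns (from the table)
AREA_UM2 = {20: 15175.0, 10: 15198.0, 5: 15198.0}


def mhz_from_period_ns(period_ns):
    """Clock frequency in MHz for a given period in nanoseconds."""
    return 1000.0 / period_ns


def area_growth_percent(base_ns=20, tight_ns=5):
    """Relative area increase when tightening the clock target."""
    base = AREA_UM2[base_ns]
    return (AREA_UM2[tight_ns] - base) / base * 100.0


if __name__ == "__main__":
    for period in sorted(AREA_UM2, reverse=True):
        print(period, "ns ->", mhz_from_period_ns(period), "MHz")
    print("area growth 20 ns -> 5 ns: %.2f%%" % area_growth_percent())
```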

Sparsity Data Source

Activation tensors captured from torchvision.models.resnet50 (pretrained, ImageNet) using register_forward_hook on all ReLU layers. Per-layer sparsity ranges from 2.4% (layer1) to 93.8% (layer3). The 8 bitmaps used in the multi-testbench were selected to uniformly sample this range.
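
With a 16-bit bitmap, the sparsity of a single word can only take 17 distinct values, in steps of 1/16. The sketch below computes them; it is an illustration of the arithmetic, not code from the repository. The 93.8% upper figure is consistent with 15 zeros out of 16 (0.9375), the sparsest word that still contains a non-zero.

```python
WIDTH = 16


def bitmap_sparsity(bitmap, width=WIDTH):
    """Fraction of zero elements in the vector described by one bitmap word."""
    nonzero = bin(bitmap & ((1 << width) - 1)).count("1")
    return 1.0 - nonzero / width


def achievable_levels(width=WIDTH):
    """All per-word sparsity values a bitmap of this width can represent."""
    return [zeros / width for zeros in range(width + 1)]


if __name__ == "__main__":
    print(bitmap_sparsity(0xFFFF))   # dense word: 0.0
    print(bitmap_sparsity(0x0001))   # one non-zero: 0.9375
    print(len(achievable_levels()))  # 17
```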

References

Han et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017
Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, ISSCC 2016

Future Work

Extend bitmap width to 64/128 bits to match real tensor dimensions
Add weight sparsity (two-sided sparse MAC matching SCNN architecture)
Replace integer multiply with INT8 fixed-point with rounding mode control
Run full OpenROAD place-and-route for real WNS and post-route power numbers
Integrate as a custom sparse-load instruction extension on a RISC-V core

License

MIT
