Raa-23/SPARCE_MAC

SPARCITY — Bitmap-Encoded Sparse Index Decoder and Zero-Skipping MAC Pipeline

The component that appears in every sparse accelerator paper but is never released.

RTL implementation of a bitmap-driven sparse MAC pipeline in SystemVerilog, synthesized to SKY130A 130nm standard cells via Yosys. Activation sparsity stimulus derived from real ResNet-50 inference using PyTorch forward hooks.


Motivation

In ReLU-activated networks like ResNet-50, roughly 40–70% of activations are typically exactly zero after a ReLU layer. Standard dense MAC pipelines multiply these zeros anyway, wasting cycles and energy on operations that contribute nothing to the result. This project implements the hardware mechanism that skips them: a bitmap-encoded sparse index decoder feeding a zero-skipping multiply-accumulate pipeline.


Architecture

bitmap_i [15:0]
     │
     ▼
┌──────────────┐     addr_o [3:0]     ┌─────────────┐
│   popcount   │──────────────────────▶ Weight SRAM  │──┐
│  priority_enc│                       │  (16 x 16b) │  │
│   decoder    │──fetch_cnt [9:0]────▶ Activ. SRAM  │  │
└──────────────┘                       │ (1024 x 16b)│  │
     │ done_o                          └─────────────┘  │
     │                                        │          │
     │                               ┌────────▼──────┐  │
     │                               │  Stage 1: Reg │  │
     │                               │  Stage 2: MUL │◀─┘
     │                               │  Accumulator  │
     │                               └───────┬───────┘
     └───────────────────────────────────────▼
                                       result_o [31:0]
                                       done_o

Module hierarchy

| Module          | Type          | Function                                                                   |
|-----------------|---------------|----------------------------------------------------------------------------|
| popcount.sv     | Combinational | Counts set bits in 16-bit bitmap; determines non-zero count                |
| priority_enc.sv | Combinational | Isolate-lowest-bit trick; extracts position of next non-zero               |
| decoder.sv      | Sequential    | Bitmap register + load/emit/done state machine; drives SRAM addresses      |
| sram_model.sv   | Sequential    | Synchronous single-port SRAM with 1-cycle read latency                     |
| mac_pipe.sv     | Sequential    | Top-level: integrates decoder, both SRAMs, 2-stage MAC pipeline            |
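
The data flow through these modules can be approximated in software. The Python sketch below is a behavioural model under stated assumptions, not a transcription of the RTL: it assumes that bit i of the bitmap marks element i as non-zero, that non-zero activations are packed in ascending bit order, and that the weight address equals the bit position. The function names are hypothetical.

```python
def decode_addresses(bitmap: int) -> list[int]:
    """Positions of set bits, lowest first (what the decoder is assumed to emit)."""
    return [i for i in range(16) if (bitmap >> i) & 1]


def sparse_dot(bitmap: int, packed_acts: list[int], weights: list[int]) -> int:
    """Zero-skipping dot product: one multiply per set bit in the bitmap."""
    acc = 0
    for fetch_cnt, addr in enumerate(decode_addresses(bitmap)):
        # fetch_cnt walks the packed activations; addr selects the weight.
        acc += packed_acts[fetch_cnt] * weights[addr]
    return acc


def dense_dot(acts: list[int], weights: list[int]) -> int:
    """Reference: multiply all 16 elements, zeros included."""
    return sum(a * w for a, w in zip(acts, weights))


# Example vector with three non-zeros at positions 1, 4 and 10.
acts = [0, 3, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]
weights = list(range(1, 17))
bitmap = sum(1 << i for i, a in enumerate(acts) if a != 0)
packed = [a for a in acts if a != 0]

assert sparse_dot(bitmap, packed, weights) == dense_dot(acts, weights)
print(len(packed), "multiplies instead of", len(acts))
```

In this example the sparse path performs 3 multiplies where the dense path performs 16, and both produce the same sum.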

Key Results

| Metric                       | Value                                              |
|------------------------------|----------------------------------------------------|
| Decoder area (SKY130A)       | 1,594 μm²                                          |
| Full mac_pipe area (SKY130A) | 15,175 μm²                                         |
| Decoder overhead             | 10.5% of total area (1,594 / 15,175 μm²)           |
| Critical path                | < 5 ns (no violations at 20/10/5 ns targets)       |
| Implied max frequency        | > 200 MHz on SKY130A 130nm                         |
| Bitmap width                 | 16 bits                                            |
| Sparsity range tested        | 0% to 93.8%                                        |
| Bitmap combinations verified | 65,536 (exhaustive)                                |
| Sequential cell ratio        | 16.3% registers / 83.7% combinational              |

Repository Structure

sparse_mac/
├── rtl/                  # Synthesisable SystemVerilog source
│   ├── popcount.sv
│   ├── priority_enc.sv
│   ├── decoder.sv
│   ├── sram_model.sv
│   └── mac_pipe.sv
├── tb/                   # Self-checking testbenches
│   ├── tb_popcount.sv
│   ├── tb_priority_enc.sv
│   ├── tb_decoder.sv
│   ├── tb_mac_pipe_debug.sv
│   └── tb_mac_pipe_multi.sv
├── scripts/              # Golden reference and stimulus generation
│   ├── gen_golden.py
│   └── gen_golden_multi.py
├── data/                 # Stimulus files (generated — do not hand-edit)
│   ├── bitmap.mem
│   ├── values.mem
│   ├── weights.mem
│   ├── golden.txt
│   ├── golden_multi.txt
│   └── offsets_multi.txt
├── synth/                # Synthesis scripts and results
│   ├── synth_all.tcl
│   ├── mac_pipe.sdc
│   ├── run_synth.sh
│   └── results/          # Netlists per module per clock target
└── results/              # Area summary, sparsity CSV, waveforms

How to Run

Dependencies

  • Icarus Verilog >= 11.0 — simulation
  • Yosys >= 0.23 — synthesis
  • SKY130A PDK — standard cell library
  • Python >= 3.9 with torch, numpy — golden reference generation

1 — Generate stimulus and golden reference

# Single bitmap (ResNet-50 layer3 — highest sparsity)
python3 scripts/gen_golden.py

# Multi-bitmap (8 words, 0–93.8% sparsity)
python3 scripts/gen_golden_multi.py

2 — Run simulation

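# Create the simulation output directory if it does not already exist
# (sim/ is not listed in the repository structure above)
mkdir -p sim

# Decoder exhaustive test (all 65,536 bitmaps)
iverilog -g2012 -o sim/tb_decoder \
  tb/tb_decoder.sv rtl/decoder.sv rtl/popcount.sv rtl/priority_enc.sv \
  && vvp sim/tb_decoder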

# MAC pipeline single-bitmap test
iverilog -g2012 -o sim/tb_mac_pipe_debug \
  tb/tb_mac_pipe_debug.sv rtl/mac_pipe.sv rtl/decoder.sv \
  rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
  && vvp sim/tb_mac_pipe_debug

# MAC pipeline multi-bitmap test (8 sparsity levels)
iverilog -g2012 -o sim/tb_mac_pipe_multi \
  tb/tb_mac_pipe_multi.sv rtl/mac_pipe.sv rtl/decoder.sv \
  rtl/popcount.sv rtl/priority_enc.sv rtl/sram_model.sv \
  && vvp sim/tb_mac_pipe_multi

All testbenches are self-checking — a passing run prints PASS per test and exits 0.

3 — Run synthesis (requires Yosys + SKY130A PDK)

# Edit PDK_ROOT in synth/synth_all.tcl to match your install path, then:
bash synth/run_synth.sh

Synthesises popcount, decoder, and mac_pipe at 20/10/5 ns clock targets. Prints area summary table to stdout. Netlists written to synth/results/.


Design Notes

Why bitmap over CSC? Compressed sparse column (CSC) encoding is more memory-efficient at large dimensions but requires a sequential scan to find the next non-zero. Bitmap encoding supports an O(1) non-zero count via popcount and an O(1) next-index via isolate-lowest-bit (bitmap & -bitmap), after which bitmap & (bitmap - 1) clears that bit for the next iteration. This makes it faster for the 16-element vectors targeted here.
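
The two bit operations are easy to confuse, so a small Python sketch may help. It is an illustration of the general technique on 16-bit values, not the contents of priority_enc.sv; the helper names are hypothetical, and the negation is written as a 16-bit two's complement (~x + 1) to mirror fixed-width hardware arithmetic.

```python
MASK16 = 0xFFFF


def isolate_lowest(bitmap: int) -> int:
    """One-hot mask of the lowest set bit: x & -x in 16-bit two's complement."""
    return bitmap & ((~bitmap + 1) & MASK16)


def clear_lowest(bitmap: int) -> int:
    """Bitmap with its lowest set bit removed: x & (x - 1)."""
    return bitmap & (bitmap - 1) & MASK16


def next_index(bitmap: int) -> int:
    """Binary position of the lowest set bit; bitmap must be non-zero."""
    return isolate_lowest(bitmap).bit_length() - 1


bitmap = 0b0000_0100_0001_0010   # non-zeros at positions 1, 4 and 10
positions = []
while bitmap:                    # no set bits left means the word is done
    positions.append(next_index(bitmap))
    bitmap = clear_lowest(bitmap)
print(positions)  # [1, 4, 10]
```

Each loop iteration corresponds to one emitted index, so the number of iterations equals the popcount of the bitmap rather than its width.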

The SRAM offset problem. Multi-bitmap operation requires each bitmap word to index into a different slice of the packed activation SRAM. The act_offset_i port solves this by setting fetch_cnt to an absolute start address per word rather than resetting it to zero. This avoids out-of-bounds reads that return zeros and silently corrupt the dot product.
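
One way to picture the per-word start address is as a running sum of popcounts. The Python sketch below assumes that the non-zero activations of successive bitmap words are packed back to back in the activation SRAM; that layout, and the helper names, are assumptions made for illustration rather than details taken from the RTL or the stimulus scripts.

```python
def popcount16(bitmap: int) -> int:
    """Number of set bits in a 16-bit bitmap."""
    return bin(bitmap & 0xFFFF).count("1")


def word_offsets(bitmaps: list[int]) -> list[int]:
    """Absolute start address of each word's slice (exclusive prefix sum)."""
    offsets = []
    base = 0
    for bm in bitmaps:
        offsets.append(base)
        base += popcount16(bm)   # next slice starts after this word's non-zeros
    return offsets


def fetch_addresses(bitmap: int, act_offset: int) -> list[int]:
    """Activation addresses read for one word, starting at act_offset."""
    return [act_offset + k for k in range(popcount16(bitmap))]


bitmaps = [0xFFFF, 0x0412, 0x0001, 0x0000]
offsets = word_offsets(bitmaps)
print(offsets)                                   # [0, 16, 19, 20]
print(fetch_addresses(bitmaps[1], offsets[1]))   # [16, 17, 18]
print(fetch_addresses(bitmaps[1], 0))            # [0, 1, 2]
```

The last line shows what happens in this model when the start address is reset to zero: the second word reads addresses that belong to the first word's slice.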

Why integer arithmetic? The golden reference and RTL both use 16-bit integer multiply rather than float16. This enables bit-exact software/hardware co-verification without floating-point rounding discrepancies. The tradeoff is that real inference requires proper float16 or INT8 quantisation — noted as a future extension.
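
Bit-exact checking is straightforward to model in Python because its integers do not overflow and can be masked to a fixed width. The sketch below assumes signed 16-bit operands and a 32-bit accumulator that wraps on overflow (matching the width of result_o); the signedness and the wrap behaviour are assumptions, and the helper names are hypothetical.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_step(acc: int, a: int, w: int) -> int:
    """One multiply-accumulate on raw bit patterns, wrapped to 32 bits."""
    product = to_signed(a, 16) * to_signed(w, 16)
    return (acc + product) & 0xFFFFFFFF


acc = 0
for a, w in [(3, 2), (0xFFFF, 5), (2, 11)]:   # 0xFFFF is -1 as signed 16-bit
    acc = mac_step(acc, a, w)
print(hex(acc), to_signed(acc, 32))  # 0x17 23
```

Because every intermediate value is an exact integer, a mismatch against the simulated result points to a real functional difference rather than a rounding artefact.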


Synthesis Results (SKY130A TT 25C 1V8)

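| Module                          | Cells | Area (μm²) |
|---------------------------------|-------|------------|
| popcount                        |       | 422.9      |
| priority_enc                    |       | 162.7      |
| decoder (total with submodules) |       | 1,594.0    |
| mac_pipe @ 20 ns                |       | 15,175.0   |
| mac_pipe @ 10 ns                |       | 15,198.0   |
| mac_pipe @ 5 ns                 |       | 15,198.0   |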

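The area-frequency curve is essentially flat across all three targets: mac_pipe grows by only 23 μm² (about 0.15%) between the 20 ns target and the 10 ns and 5 ns targets, and no timing violations are reported at 200 MHz. The critical path is therefore inferred to be shorter than 5 ns.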


Sparsity Data Source

Activation tensors captured from torchvision.models.resnet50 (pretrained, ImageNet) using register_forward_hook on all ReLU layers. Per-layer sparsity ranges from 2.4% (layer1) to 93.8% (layer3). The 8 bitmaps used in the multi-testbench were selected to uniformly sample this range.
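
Once a 16-element slice of ReLU output has been captured, turning it into a bitmap plus packed values is a short step. The pure-Python sketch below shows one plausible encoding, assuming bit i of the bitmap corresponds to element i; it does not reproduce gen_golden.py, it leaves out the hook-based capture itself, and the function names are hypothetical.

```python
def encode_bitmap(acts: list[int]) -> tuple[int, list[int]]:
    """Return (16-bit bitmap, packed non-zero values) for a 16-element vector."""
    if len(acts) != 16:
        raise ValueError("expected exactly 16 activations")
    bitmap = 0
    packed = []
    for i, a in enumerate(acts):
        if a != 0:
            bitmap |= 1 << i      # mark element i as non-zero
            packed.append(a)      # keep values in ascending bit order
    return bitmap, packed


def sparsity(bitmap: int) -> float:
    """Fraction of the 16 elements that are zero."""
    return (16 - bin(bitmap & 0xFFFF).count("1")) / 16


relu_out = [0] * 15 + [7]         # 15 of 16 activations are zero
bitmap, packed = encode_bitmap(relu_out)
print(f"{bitmap:04x}", packed, sparsity(bitmap))
```

With a 16-bit bitmap, sparsity moves in steps of 1/16, so a word with a single non-zero has a sparsity of 0.9375, which is consistent with the 93.8% upper figure quoted above.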


References

  • Han et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
  • Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017
  • Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, ISSCC 2016

Future Work

  • Extend bitmap width to 64/128 bits to match real tensor dimensions
  • Add weight sparsity (two-sided sparse MAC matching SCNN architecture)
  • Replace integer multiply with INT8 fixed-point with rounding mode control
  • Run full OpenROAD place-and-route for real WNS and post-route power numbers
  • Integrate as a custom sparse-load instruction extension on a RISC-V core

License

MIT
