A high-performance, 4-lane SIMD (Single Instruction, Multiple Data) vector processor core designed for high-density, low-latency data streaming workloads (e.g., Edge-AI inference and DSP filtering primitives). The design adopts an industry-standard hybrid-precision topology—utilizing packed INT8 streaming inputs to maximize bus utilization and save silicon area, while accumulating into deep, sign-extended INT32 registers to provide a massive headroom safety net.
The entire accelerator core is structurally decoupled via active Valid/Ready handshake protocols and validated using an automated C++ co-simulation framework.
-
4-Lane SIMD Parallelism: Processes an entire 4-component vector block simultaneously every single clock cycle, multiplying throughput by
$4\times$ compared to traditional scalar engines. -
Industrial Hybrid Datapath: Emulates the exact multi-precision mathematical topologies utilized by commercial compute blocks—mirroring the ARM Neon
SDOTextension and NVIDIA Tensor Coredp4ahardware primitives ($4 \times \text{INT8 Input} \rightarrow \text{INT32 Accumulate}$ ). -
Dynamic Flow Control & Backpressure: Implements a strict Valid/Ready Handshake Protocol. High-density reduction logic (
!(|lane_fifo_full)) monitors internal queues to apply upstream backpressure, while automatically injecting execution bubbles to freeze pipeline registers during data dry-ups. -
Elastic Buffer Input Queues: Features 8 parallel synchronous FIFOs (
$8 \times 16$ -deep) acting as localized ring-buffers to cleanly decouple asynchronous host data sourcing from core computational execution. -
Active Saturation Guarding: Outfitted with dedicated, anti-wrapping overflow logic that dynamically clips and locks the 32-bit signed accumulators at max positive (
32'h7FFFFFFF) or max negative (32'h80000000) industrial boundaries if limits are breached. - Automated Co-Simulation Environment: Fully verified using an object-oriented C++ simulation testbench via Verilator, featuring automated scoreboard validation, temporal queue time-alignment, and cycle-accurate tracking.
The microarchitecture splits the wide bus inputs across independent, hardware-isolated compute cells and manages the timing of the execution timeline through a multi-stage control path.
32-bit Packed Bus (4x INT8)
┌─────────────────────────────────┐
▼ ▼
vec_din_a vec_din_b
(Bits [31:0]) (Bits [31:0])
│ │
┌───────┴───────┐ ┌───────┴───────┐
▼ Slicing Loop ▼ ▼ Slicing Loop ▼
[L3][L2][L1][L0] Bytes [L3][L2][L1][L0] Bytes
│ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌───────────────┐ ┌───────────────┐
│ 4x Sync FIFOs │ │ 4x Sync FIFOs │
└───────┬───────┘ └───────┬───────┘
│ (4x 8-bit) │ (4x 8-bit)
▼ ▼
=====================================================
STRUCTURAL COMPUTATION LAYER (4x Parallel MAC Lanes)
=====================================================
┌───────────────────────────────────────────────────┐
│ Lane 0: [8x8 Mult] ──► [16-bit Reg] ──► [32-bit Acc] │
│ Lane 1: [8x8 Mult] ──► [16-bit Reg] ──► [32-bit Acc] │
│ Lane 2: [8x8 Mult] ──► [16-bit Reg] ──► [32-bit Acc] │
│ Lane 3: [8x8 Mult] ──► [16-bit Reg] ──► [32-bit Acc] │
└───────────────────────┬───────────────────────────┘
│
▼
vec_acc_out
128-bit Packed Bus (4x INT32)
Because the architecture decouples memory fetching from complex arithmetic to prevent critical-path timing violations, transactions ripple through a 3-cycle timeline:
| Latency Milestone | Path Segment | Operational Behavior |
|---|---|---|
| Cycle N + 0 | Input Buffer Ingestion | Handshake passes (v_in && r_ready); packed bytes are written into the synchronous lane FIFOs. |
| Cycle N + 1 | Stage 1: Signed Multiply | Data pops from FIFOs; lane multipliers compute an mult_reg. |
| Cycle N + 2 | Stage 2: Saturated Accumulate | The product is sign-extended to acc_reg. Results retire out of the wide 128-bit vec_acc_out bus as v_out asserts. |
Rather than relying on manual waveform inspection, this IP core is signed off via an automated Dual-Path C++ Scoreboard Framework running inside Verilator.
┌────────────────────────┐
│ Random Vector Generator│
└───────────┬────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ C++ Reference Model │ │ Verilated Hardware RTL │
│ (The "Truth") │ │ (SIMD Core IP) │
└────────────┬────────────┘ └────────────┬────────────┘
│ │
[Push Golden Vector] │
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ Temporal Delay Queue │ │ RTL Retirement Monitor │
│ (Buffers 3-Cycle Delay) │ │ (Detects v_out) │
└────────────┬────────────┘ └────────────┬────────────┘
│ │
[Pop Expected] │
└───────────────┬───────────────┘
▼
┌─────────────────────────┐
│ Digital Comparator │ ──► [100% Bit-Accurate
│ Assertion Scoreboard │ Specification Match]
└─────────────────────────┘
- The Software Path: An object-oriented C++ reference class acts as the golden mathematical specification. It reads incoming vector sequences, computes ideal saturating dot-product outputs instantly, and keeps track of expected states.
- Temporal Alignment: Because the RTL core contains an explicit 3-cycle hardware delay (FIFO ingestion + 2-stage MAC pipeline), the testbench feeds golden expectations into a software delay line (
std::queue). - The Scoreboard Judge: When the hardware retires a calculation and drives the output valid pin (
v_out = 1), the monitor pops the matching vector from the software queue and executes automatic, bit-accurate assertions (assert()) across all 4 lanes simultaneously.
==================================================
🚀 STARTING CO-SIMULATION REGRESSION TESTING
==================================================
[Cycle 2] Checking SIMD Lanes:
↳ Lane 0 -> Expected: 1242 | RTL Read: 1242 MATCH
↳ Lane 1 -> Expected: -412 | RTL Read: -412 MATCH
↳ Lane 2 -> Expected: 8765 | RTL Read: 8765 MATCH
↳ Lane 3 -> Expected: -31004 | RTL Read: -31004 MATCH
[Cycle 3] Checking SIMD Lanes:
↳ Lane 0 -> Expected: 3122 | RTL Read: 3122 MATCH
↳ Lane 1 -> Expected: -1980 | RTL Read: -1980 MATCH
↳ Lane 2 -> Expected: 2147483647 | RTL Read: 2147483647 MATCH [Saturation Locked]
↳ Lane 3 -> Expected: -45812 | RTL Read: -45812 MATCH
Each vector lane dynamically evaluates a streaming vector dot-product equation independently:
Ensure your development workstation contains a modern C++ compiler toolchain (GCC/Clang), GNU Make, and Verilator.
On macOS (via Homebrew):
brew install verilator gtkwave
To compile the underlying hardware tree blocks, instantiate the C++ verification infrastructure, and run the automated regression sweep, execute the Makefile target from the root directory:
make clean && make
To visually audit the cycle-by-cycle signal ripples inside the internal hardware cells yourself, load the exported trace file into your wave environment:
gtkwave sim/waveform.vcd