feat(designs): add PE_INT implementation flow#58
Conversation
Sync PE_INT into pycircuit designs with English docs/spec, pycc-relative build defaults, and WSL-ready regression scripts. Keep generated artifacts excluded from commit except deliverable generated Verilog.
Baseline the refreshed PE_INT PyCircuit design, generated RTL, model, RTL test environment, testcase suite, and supporting debug skills after full regression passed.
Consistent To English documentation Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Tighten generated RTL contracts and PE_INT implementation structure so the deliverables match review expectations for reset style, naming, Wallace reduction, and output state ownership.
Restore the PE_INT PyCircuit, generated RTL, and verification files to a coherent L=4 flow after local edits drifted out of sync.
Regenerate PE_INT with width-specific Wallace reductions and remove unused-signal suppression so Verilator can catch real dead logic.
Record optimizer topology status, regression evidence, and pre-push review gates so PE_INT closure is traceable beyond functional PASS. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep PE_INT on public PyCircuit APIs, record the latest regression evidence, and add reviewer/scope guardrails for future PE_INT changes. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Warning Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting |
|
/gemini review
|
There was a problem hiding this comment.
Code Review
This pull request introduces the PE_INT design, a fixed-point/integer vector MAC unit, implemented using the PyCircuit framework. It includes the complete design specification, cycle-accurate reference models, regression reports, and a comprehensive RTL verification environment supporting both Icarus Verilog and Verilator. Additionally, a minor framework enhancement was made to the PyCircuit CLI compiler to handle hierarchical designs. Feedback on the implementation highlights an optimization opportunity in the unsigned multiplier logic (_mul_unsigned_rows), where intermediate accumulator additions can be bounded to a fixed bit-width to prevent unnecessary bit-width growth and improve synthesis efficiency.
| acc = zero | ||
| for bit_idx in range(rhs_w.width): | ||
| row = wire_of(zext(lhs_w, width) << bit_idx)[0:width] | ||
| acc = acc + wire_of(rhs_w[bit_idx : bit_idx + 1]).select(row, zero) | ||
| return wire_of(acc)[0:width] |
There was a problem hiding this comment.
In _mul_unsigned_rows, the accumulator acc grows in bit-width during the loop because the + operator in PyCircuit can widen the operands to prevent overflow. This leads to wider intermediate additions (e.g., adding 16-bit to 17-bit, then 18-bit, etc.) and less efficient generated RTL. Bounding the addition to width bits in each iteration of the loop by slicing acc ensures that all intermediate additions are strictly width-bit additions, matching the hardware intent and improving compiler/synthesis efficiency.
| acc = zero | |
| for bit_idx in range(rhs_w.width): | |
| row = wire_of(zext(lhs_w, width) << bit_idx)[0:width] | |
| acc = acc + wire_of(rhs_w[bit_idx : bit_idx + 1]).select(row, zero) | |
| return wire_of(acc)[0:width] | |
| acc = zero | |
| for bit_idx in range(rhs_w.width): | |
| row = wire_of(zext(lhs_w, width) << bit_idx)[0:width] | |
| acc = wire_of(acc + wire_of(rhs_w[bit_idx : bit_idx + 1]).select(row, zero))[0:width] | |
| return acc |
There was a problem hiding this comment.
Addressed in ad07aa3.
Changes made:
- Updated the PE_INT design-spec flow to derive datapath widths from functional scenarios first, then map them to shared resources.
- Updated
docs/design_spec.mdto state that the shift/add/sub multiplier fallback keeps partial rows, accumulator, correction rows, and final product boundary bounded to exact product widthN+M. - Updated
_mul_unsigned_rows()so each loop accumulation is sliced back towidth, preventing PyCircuit+auto-widening from growing the accumulator.
Validation:
- Model unittest: PASS
- PyCircuit build: PASS
- PyCircuit TB: PASS
- RTL iverilog regression: 9 cases PASS
- RTL verilator
-Wallregression: 9 cases PASS - No
UNUSEDSIGNAL/ warning / error observed in regression logs - Internal codereviewer gate: PASS
There was a problem hiding this comment.
Thank you for the update, HLEE80. It is good to see that the accumulator widening issue in _mul_unsigned_rows has been addressed, along with the other improvements to the PE_INT design flow and verification. Since you have requested a review, I have examined the provided changes. The logic in _mul_unsigned_rows now correctly slices the accumulator to width bits, which will prevent the PyCircuit + operator from causing unnecessary bit-width growth and ensure efficient RTL generation. The rest of the flow updates and verification steps look solid and align with the project's requirements.
| acc = zero | |
| for bit_idx in range(rhs_w.width): | |
| row = wire_of(zext(lhs_w, width) << bit_idx)[0:width] | |
| acc = acc + wire_of(rhs_w[bit_idx : bit_idx + 1]).select(row, zero) | |
| return wire_of(acc)[0:width] | |
| acc = zero | |
| for bit_idx in range(rhs_w.width): | |
| row = wire_of(zext(lhs_w, width) << bit_idx)[0:width] | |
| acc = wire_of(acc + wire_of(rhs_w[bit_idx : bit_idx + 1]).select(row, zero))[0:width] | |
| return acc |
Define scenario-derived width contracts in the PE_INT flow and design spec, then bound the shift/add multiplier fallback accumulator to exact product width. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Test plan
python3 model/test_pe_int.pyunder WSL with PE_INT/PyCircuit PYTHONPATH.python3 python/build.py --target both --out-dir build/pe_int --jobs 8 --pyc-tb-vectors 8under WSL../build/pe_int/cpp_build/build/pyc_tb.bash sim/run_all_wsl.shcovering 9 RTL cases with bothiverilogandverilator -Wall.codereviewergate returnedPASS_WITH_NOTES; notes are limited to acceptedrst_nframework limitation and deferred Booth topology pending synthesis evidence.Notes
rst_nasync assertion / sync release mismatch is tracked as a known PyCircuit framework limitation.