Remove NumPy benchmark dependency; fix benchmark honesty and where uint8 masks #1
Merged
…nt8 masks

NumPy removal:
- Delete the NumPy benchmark suite (bench/numpy/) and comparison tooling
  (bench/compare.py, bench/graph/); run.sh's numpy/compare/plot stages are
  gone, so `./run.sh bench` is now numc-only, emitting bench/numc/results.csv.
- Drop "vs NumPy" framing from README, ROADMAP, and bench/README; reword
  kernel comments to keep the rationale without the NumPy name.
  externals/numpy/ stays as a study reference.

Benchmark honesty (bench/numc/bench.c):
- Spread per-element input data (was constant) so data-dependent ops
  (comparisons, max/min, clip, argmax/argmin, where) aren't flattered by
  perfect branch prediction; nonzero divisors avoid integer div-by-zero.
- where uses a 50/50 condition mask.
- Reset in-place exp/log inputs each iteration (were compounding to inf/NaN).
- Report the minimum per-iteration time consistently across all categories.
- Add an L1->DRAM cache sweep so throughput's cache dependence is explicit.

numc_where uint8 condition masks:
- Comparisons emit uint8; where now accepts a uint8 cond over any value dtype
  (the natural comparison-mask pattern) via a uint8-cond kernel variant and
  dispatch table, fixing the previously dead uint8 path in _check_ternary.
- Add tests: uint8-mask where (contiguous, comparison-driven, and strided).
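For readers unfamiliar with the benchmark-honesty points in the commit message, here is a minimal sketch of three of them: varied nonzero inputs, resetting an in-place op's input outside the timed region, and reporting the minimum per-iteration time. It is not the code in `bench/numc/bench.c`; every name (`now_ns`, `fill_varied`, `square_plus_one_inplace`, `bench_min_ns`) is hypothetical, and the in-place op is a stand-in that, like a compounding `exp`, overflows to `inf` within a few iterations if its input is not reset.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Fill x with varied values in [1, 2] from a small LCG, so data-dependent
 * ops see mixed inputs and every value is a safe nonzero divisor. */
static void fill_varied(float *x, size_t n, uint32_t seed) {
    uint32_t s = seed;
    for (size_t i = 0; i < n; i++) {
        s = s * 1664525u + 1013904223u;
        x[i] = 1.0f + (float)(s >> 8) / 16777216.0f; /* top 24 bits / 2^24 */
    }
}

/* Stand-in in-place op (assumption, not a numc kernel): x = x*x + 1.
 * Applied repeatedly without a reset it grows to inf. */
static void square_plus_one_inplace(float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] * x[i] + 1.0f;
    }
}

/* Time an in-place op over `iters` runs and return the minimum time.
 * The work buffer is restored from `pristine` before each run, and the
 * restore is excluded from the timed region. */
static double bench_min_ns(void (*op)(float *, size_t), float *work,
                           const float *pristine, size_t n, int iters) {
    assert(iters > 0);
    double best = DBL_MAX;
    for (int it = 0; it < iters; it++) {
        memcpy(work, pristine, n * sizeof *work); /* reset, untimed */
        double t0 = now_ns();
        op(work, n);
        double dt = now_ns() - t0;
        if (dt < best) {
            best = dt;
        }
    }
    return best;
}
```

Taking the minimum rather than the mean is a common choice for microbenchmarks because scheduler and interrupt noise only ever adds time; whether numc's harness has further reasons for it is not stated in the commit message.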
Three related pieces of work. No behavioral change to library math except the `numc_where` fix (#3).

## 1. Remove NumPy as a benchmark dependency

- Delete the NumPy benchmark suite (`bench/numpy/`) and the comparison/charting tooling (`bench/compare.py`, `bench/graph/`). `run.sh`'s numpy-run, compare, and plot stages are gone; `./run.sh bench` is now numc-only, emitting `bench/numc/results.csv`.
- Drop "vs NumPy" framing from `README`, `ROADMAP`, and `bench/README`; reword kernel comments to keep the rationale without the NumPy name.
- `externals/numpy/` is kept as a study reference (and its CLAUDE/AGENTS reference pointer stays).

## 2. Benchmark honesty (`bench/numc/bench.c`)

The previous harness flattered several ops. Fixes:

- Spread per-element input data (was constant) so that comparisons, `max`/`min`, `clip`, `argmax`/`argmin`, and `where` aren't perfectly predicted. Nonzero divisors avoid integer divide-by-zero.
- `where` uses a 50/50 condition mask (was all-true).
- In-place `exp`/`log` inputs are reset each iteration; they were compounding to `inf`/`NaN` within a few iterations and timing a different code path.
- Report the minimum per-iteration time consistently across all categories.
- Add a `cache` category: an `add` sweep from L1 into DRAM so throughput's cache dependence is explicit (the fixed-1M numbers are largely L3-resident).

## 3. `numc_where` accepts a uint8 condition mask

Comparisons emit `uint8`, but `where` previously required `cond` to match the value dtype, so the natural `where(numc_gt(...), a, b, out)` pattern was rejected (and the `uint8` carve-out in `_check_ternary` was dead code).

- Add a `uint8`-cond kernel variant + dispatch table; `where` now accepts `cond` of either the value dtype or `uint8` over any value dtype.
- Add tests: `uint8`-mask `where` (contiguous, comparison-driven, and strided).

## Verification

- `./run.sh test` → 49/49 pass (incl. 3 new `where` tests).
- The `cache` sweep shows float32 throughput falling ~5.7× from L3-resident to DRAM-bound.
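For reviewers who want the comparison-mask pattern spelled out, here is a minimal sketch of a `uint8`-condition select over `float` values, with element strides so one loop covers both the contiguous and strided cases. It does not use numc's real API or kernel layout: `gt_f32` and `where_u8_f32` are hypothetical names, and treating any nonzero mask byte as true is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparison kernel: out[i] = (a[i] > b[i]) as a uint8 0/1. */
static void gt_f32(const float *a, const float *b, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(a[i] > b[i]);
    }
}

/* Hypothetical uint8-cond where kernel for float32 values.
 * Strides are in elements, not bytes; a stride of 1 means contiguous.
 * Assumption: any nonzero mask byte selects from `a`. */
static void where_u8_f32(const uint8_t *cond, ptrdiff_t sc,
                         const float *a, ptrdiff_t sa,
                         const float *b, ptrdiff_t sb,
                         float *out, ptrdiff_t so, size_t n) {
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++) {
        out[i * so] = cond[i * sc] ? a[i * sa] : b[i * sb];
    }
}
```

With `a = {1, 5, 3, 8}` and `b = {4, 2, 3, 6}`, `gt_f32` produces the mask `{0, 1, 0, 1}` and `where_u8_f32` with all strides 1 yields `{4, 5, 3, 8}`, the elementwise maximum. The mask stays one byte per element whatever the value dtype, which is why a separate `uint8`-cond variant per value dtype is a natural fit for a dispatch table.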