
perf(vm): pass callback args by slice to skip per-call Vec allocs 🩻 #154

Merged
timfennis merged 1 commit into master from perf/callback-slice-args
May 24, 2026

@timfennis
Owner

Summary

  • dispatch_vec_call / dispatch_vec_call_dynamic previously allocated a fresh Vec<Value> per broadcast element via std::mem::replace(&mut elem_args, Vec::with_capacity(args)). They now reuse a single elem_args buffer via clear().
  • All 16 stdlib HOF call sites (map, filter, sort_by, reduce, max_by_key, …) previously did comp.call(vec![…]) — one heap alloc per element. They now build a stack array and pass &[…].
  • Three callsites (filter, find, by_key) where the slice was &[x.clone()] switched to std::slice::from_ref(&x) per clippy's suggestion, eliminating a real Rc::clone + drop per element on object-heavy iterables.
  • Signatures updated: Vm::call_callback, Vm::call_function, VmCallable::call all take &[Value]. The native dispatch path was already using &args; the closure path now does self.stack.extend(args.iter().cloned()). For numeric tuples (the main vec-dispatch case), elements are Int/Float and clone is a bitwise copy — no extra Rc work.
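A minimal sketch of the buffer-reuse pattern from the first bullet, combined with a callback that takes its arguments by slice. The `Value` enum, `add_callback`, and `broadcast` here are hypothetical stand-ins written for illustration; they are not the VM's actual types or functions.

```rust
// Hypothetical stand-in for the VM's value type; the real `Value` is richer.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
}

// Hypothetical callback that takes its arguments by slice, mirroring the
// `&[Value]` signatures described above. It adds two numbers of the same kind.
fn add_callback(args: &[Value]) -> Value {
    match args {
        [Value::Int(a), Value::Int(b)] => Value::Int(a + b),
        [Value::Float(a), Value::Float(b)] => Value::Float(a + b),
        _ => panic!("expected two numbers of the same kind"),
    }
}

// Element-wise broadcast of `callback` over two tuples.
// A single `elem_args` buffer is allocated up front and reused via `clear()`,
// rather than handing a freshly allocated Vec to every call.
fn broadcast(left: &[Value], right: &[Value], callback: fn(&[Value]) -> Value) -> Vec<Value> {
    let mut elem_args: Vec<Value> = Vec::with_capacity(2);
    let mut out = Vec::with_capacity(left.len());
    for (l, r) in left.iter().zip(right) {
        elem_args.clear(); // drops the old elements but keeps the capacity
        elem_args.push(l.clone()); // cheap for Int/Float: no heap data to copy
        elem_args.push(r.clone());
        out.push(callback(&elem_args)); // the callee only borrows the buffer
    }
    out
}
```

Because the callee borrows the buffer instead of taking ownership, the caller can keep it alive across iterations, which is what makes the `clear()` reuse possible.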

Benchmarks

| Bench | Baseline | Patched | Ratio |
|---|---|---|---|
| vec_hot_loop (200k Tuple+Tuple) | 41.0 ± 3.4 ms | 39.7 ± 3.1 ms | 1.03× ±0.12 |
| hof_pipeline (filter/map/reduce, 100k items) | 34.7 ± 2.9 ms | 35.2 ± 2.3 ms | no change |

Bench needle didn't move meaningfully — the allocator caches these small Vecs well — but the dispatch loop and HOF callsites read cleaner and the from_ref change has measurable upside for non-trivial element types. The advent-of-brian comparison run will be added as a comment.

Test plan

  • cargo fmt
  • cargo clippy --all-targets — no new warnings introduced
  • cargo test — all 718 tests pass

In `dispatch_vec_call` and `dispatch_vec_call_dynamic`, every broadcast
element used `std::mem::replace(&mut elem_args, Vec::with_capacity(args))`
to hand a fresh `Vec<Value>` to `call_callback` — N allocations per outer
call. Likewise every stdlib HOF callsite did `comp.call(vec![…])`, one
heap allocation per element of `map`/`filter`/`sort_by`/`reduce`/etc.
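To make the HOF callsite change concrete, here is a rough before/after sketch using a reduce-style helper. It uses plain `i64` arguments and the hypothetical names `reduce_vec`, `reduce_slice`, `sum_owned`, and `sum_args`; the real callsites go through `comp.call` with `Value` arguments.

```rust
// Hypothetical callback that takes ownership of its arguments (old shape).
fn sum_owned(args: Vec<i64>) -> i64 {
    args.iter().sum()
}

// Hypothetical callback that borrows its arguments (new shape).
fn sum_args(args: &[i64]) -> i64 {
    args.iter().sum()
}

// Before: `vec![…]` performs one heap allocation per element.
fn reduce_vec(items: &[i64], init: i64, f: &dyn Fn(Vec<i64>) -> i64) -> i64 {
    items.iter().fold(init, |acc, &x| f(vec![acc, x]))
}

// After: the arguments live in a stack array and are passed as a slice.
fn reduce_slice(items: &[i64], init: i64, f: &dyn Fn(&[i64]) -> i64) -> i64 {
    items.iter().fold(init, |acc, &x| f(&[acc, x]))
}
```

Both helpers compute the same result; only the way the argument list is materialized differs.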

`call_callback` and `VmCallable::call` now take `&[Value]`. The native
path was already passing `&args`; the closure path becomes
`extend(args.iter().cloned())`. The vec dispatch loops reuse a single
`elem_args` buffer via `clear()`. Stdlib HOFs build stack arrays. Three
`vec![x.clone()]` sites that clippy flagged switch to
`std::slice::from_ref(&x)`, eliminating a real Rc bump+drop per element
on object-heavy iterables.
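The effect of `std::slice::from_ref` on reference counts can be observed with a small sketch. The helper names below are hypothetical and `Rc<String>` stands in for an object-like value; the point is only that building a one-element array from a clone bumps the refcount, while borrowing the value as a one-element slice does not.

```rust
use std::rc::Rc;
use std::slice;

// Hypothetical callback taking a slice; it reports the strong count it sees.
fn observed_count(args: &[Rc<String>]) -> usize {
    Rc::strong_count(&args[0])
}

// Builds a one-element array from a clone: one refcount bump, then a drop
// when the temporary array goes out of scope.
fn count_with_clone(x: &Rc<String>) -> usize {
    observed_count(&[Rc::clone(x)])
}

// Reinterprets `&T` as a one-element `&[T]`: no clone, no refcount traffic.
fn count_with_from_ref(x: &Rc<String>) -> usize {
    observed_count(slice::from_ref(x))
}
```

For a value with a single owner, the cloning variant observes a strong count of 2 during the call and the `from_ref` variant observes 1.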

`vec_hot_loop` and `hof_pipeline` benches show no meaningful movement
(the allocator caches these small Vecs well), but the dispatch loops and
HOF callsites read cleaner and the slice-from-ref change has measurable
upside for non-trivial element types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timfennis timfennis merged commit 7c2a345 into master May 24, 2026
1 check passed
@timfennis timfennis deleted the perf/callback-slice-args branch May 24, 2026 13:04
@timfennis
Owner Author

advent-of-brian comparison (23 programs)

Aggregate, 5 full-suite runs

| | run 1 | run 2 | run 3 | run 4 | run 5 | mean |
|---|---|---|---|---|---|---|
| baseline | 4355 | 4547 | 4367 | 4314 | 4294 | 4375 ms |
| patched | 4339 | 4408 | 4330 | 4340 | 4291 | 4342 ms |

0.77% delta on the aggregate — within noise. Most programs are short enough that process startup dominates.
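For reference, the aggregate figures can be reproduced from the five runs listed above. The sketch below uses hypothetical helper names and assumes the delta is computed relative to the baseline mean, before rounding the means to whole milliseconds.

```rust
// Full-suite run times in milliseconds, copied from the runs above.
const BASELINE_MS: [f64; 5] = [4355.0, 4547.0, 4367.0, 4314.0, 4294.0];
const PATCHED_MS: [f64; 5] = [4339.0, 4408.0, 4330.0, 4340.0, 4291.0];

// Arithmetic mean of a set of run times.
fn mean(runs: &[f64]) -> f64 {
    runs.iter().sum::<f64>() / runs.len() as f64
}

// Improvement of `patched` over `baseline`, as a percentage of the baseline.
fn percent_delta(baseline: f64, patched: f64) -> f64 {
    (baseline - patched) / baseline * 100.0
}
```

With these inputs, `mean` gives 4375.4 ms and 4341.6 ms, and `percent_delta` on those means comes out at roughly 0.77%.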

Per-program (hyperfine, 8 runs each, the 4 longest)

| Program | Baseline | Patched | Ratio |
|---|---|---|---|
| 2025/08/part1.ndc | 530.8 ± 5.5 ms | 495.0 ± 5.5 ms | 1.07× faster |
| 2025/08/part2.ndc | 803.1 ± 14.5 ms | 764.5 ± 7.4 ms | 1.05× faster |
| 2025/04/part2.ndc | 362.1 ± 3.4 ms | 364.3 ± 5.8 ms | flat (±1%) |
| 2025/09/part2.ndc | 2042 ± 11 ms | 2055 ± 9 ms | flat (-1%) |

Day-08 wins are real (5-7%, well outside σ) and reproducible. Both programs lean heavily on vec dispatch ((left - right) * (left - right) on int tuples) chained with HOFs (.map, .enumerate, .combinations, .sum, .product, .sort) — exactly the paths this PR targets.

Day-04 and day-09 are flat, which is what we'd expect for programs that don't lean on the affected paths.
