Benchmarks proving macOS clonefile + concurrency-cap perf win #1
Open
erneestoc wants to merge 1 commit into
…le branch
Two criterion benches under nativelink-util/benches/ and three additional
security tests in fs_util.rs that together prove (a) the clonefile fast
path on macOS is faster than per-file hardlinks and (b) the
download_to_directory concurrency cap is not a regression.
Results on Apple M4 Max / APFS (full table in BENCHMARKS.md):
- hardlink_directory_tree, 1978 files / 466 MB: 150 ms vs 590 ms = 3.93x
- hardlink_directory_tree, 635 files / 180 MB: 49 ms vs 181 ms = 3.70x
- download_to_directory_concurrency, n=1978: 893 ms vs 887 ms = 1.01x
(perf-neutral on single-process, as expected)
The treatment-vs-baseline comparison uses a new #[doc(hidden)]
hardlink_directory_tree_perfile helper that forces the per-file path so
both can be measured on the same host.
Security tests added (all green):
- test_clonefile_preserves_internal_symlinks (symlinks inside the tree
are cloned as symlinks, not followed)
- test_clonefile_nofollow_on_top_level_symlink_src (top-level symlink
src is cloned as a symlink, not followed)
- test_dst_under_file_parent_errors_cleanly (clean error, no half-
materialized tree on bad dst parent)
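The property the first two tests assert (a symlink is copied as a symlink rather than followed) can be checked with the standard library alone. The following is only a minimal sketch on a Unix host, not the tests in fs_util.rs: `is_preserved_symlink` is a hypothetical helper introduced here for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not from fs_util.rs): returns true when `dst` is
/// itself a symlink whose target string equals that of `src`, i.e. the link
/// was reproduced as a link rather than followed.
/// Returns an error if `src` is not a symlink or either path is missing.
pub fn is_preserved_symlink(src: &Path, dst: &Path) -> io::Result<bool> {
    // symlink_metadata inspects the link itself; metadata would follow it.
    let dst_is_link = fs::symlink_metadata(dst)?.file_type().is_symlink();
    if !dst_is_link {
        return Ok(false);
    }
    // Compare the stored link targets, not the files they resolve to.
    Ok(fs::read_link(src)? == fs::read_link(dst)?)
}
```

A copy made with `std::fs::copy` follows the link and produces a regular file, so this check returns false for it; that is the failure mode the nofollow tests guard against.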
59daf65 to
9e357d8
erneestoc added a commit that referenced this pull request on May 22, 2026
Reverts the bounded-concurrency construct on this POC branch. After #1 makes construct metadata-only, intra-tree parallelism is bounded by APFS metadata serialization; on a busy worker (inter-action concurrency already saturates the box) a 64-wide spawn_blocking fan-out risks oversubscription and stealing cycles from the compiles. Keeping #1/#2/TraceMachina#5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why this branch exists
Our CI Bazel builds on the macOS Nativelink RBE workers spend more wall time materializing action inputs than actually running the compile. From a real CI run:
Input-side phases total ~7,400s vs ~4,900s for actual compiles, so the worker spends about 1.5x as much wall time setting up inputs as running the compiles.
What's being shipped
The base branch (ec/macos-clonefile-optimizations) extracts four focused commits from TraceMachina/nativelink#2243, a 156-commit PR we did NOT want to merge wholesale because it's entangled with unrelated refactors (Moka, Redis, QUIC, scheduler):
- clonefile(2) fast path in hardlink_directory_tree (macOS only)
- CloneMethod telemetry so production logs can confirm the fast path is engaged
- concurrency cap on download_to_directory (defends against APFS metadata-lock thrash)
- directory_cache fix for the ~30% of cache constructions that fail on empty marker files
Linux / Windows paths are untouched.
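To make the idea of a concurrency cap on download_to_directory concrete, here is a minimal sketch of bounded fan-out using only the standard library. It is an assumption-laden stand-in, not the code in the base branch: `for_each_bounded` is a hypothetical name, and it uses scoped OS threads pulling from a shared index rather than whatever async mechanism the real change uses.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical helper: runs `job` once per item with at most `cap`
/// jobs in flight at any moment (one job per worker thread).
pub fn for_each_bounded<T, F>(items: &[T], cap: usize, job: F)
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    // Never spawn more workers than there are items; treat cap 0 as 1.
    let workers = cap.max(1).min(items.len());
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                match items.get(i) {
                    Some(item) => job(item),
                    None => break, // all items claimed
                }
            });
        }
    });
    // thread::scope joins every worker before returning.
}
```

Calling `for_each_bounded(&files, 64, |f| materialize(f))`, where `materialize` is whatever per-file work applies, keeps at most 64 filesystem operations outstanding regardless of how many files the tree contains.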
What this PR adds
Benchmarks and security tests proving the base branch is safe and fast. Two criterion benches under nativelink-util/benches/ and three new fs_util tests. Full numbers + acceptance verdicts in nativelink-util/BENCHMARKS.md.
Performance (Apple M4 Max, APFS)
hardlink_directory_tree on shapes that mirror real SwiftCompile inputs: small_flat, pcm_cluster, medium_flat, large_flat.
The 64-cap on download_to_directory is performance-neutral (0.98x–1.01x) across all input counts on a single process. That proves "no regression," which is the important security claim for that change.
Caveat (worth knowing before merge)
The original PR claimed ~10x on the largest shapes. We see ~4x because hardlink_directory_tree does an O(N) set_readwrite_recursive chmod walk after the O(1) clonefile (the cache stores files 0o555/0o444 and actions need 0o755/0o644 to drop outputs). 4x is still a 440ms-per-action saving on the p95 SwiftCompile shape (~6 min/build across 814 such actions). Parallelizing or replacing the chmod walk is a known follow-up that should unlock the remaining 2-3x.
Security tests
10/10 fs_util tests green, including the 3 new ones:
/etc/passwd)
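For readers unfamiliar with why the chmod walk mentioned in the caveat is O(N), the sketch below shows the general shape of such a pass on a Unix host. It is not the real set_readwrite_recursive: `make_tree_writable` is a hypothetical name, and the mode mapping (0o555 to 0o755, 0o444 to 0o644, decided by the execute bit) is an assumption based on the modes quoted in the caveat.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Hypothetical sketch: recursively makes a read-only tree owner-writable.
/// Returns how many entries had their mode set. Every entry costs one
/// metadata read and one metadata write, hence O(N) in the entry count.
pub fn make_tree_writable(root: &Path) -> io::Result<u64> {
    let meta = fs::symlink_metadata(root)?;
    let file_type = meta.file_type();
    if file_type.is_symlink() {
        // set_permissions would follow the link, so symlinks are skipped.
        return Ok(0);
    }
    let mut touched = 1;
    if file_type.is_dir() {
        // Directories become 0o755 so actions can create outputs in them.
        fs::set_permissions(root, fs::Permissions::from_mode(0o755))?;
        for entry in fs::read_dir(root)? {
            touched += make_tree_writable(&entry?.path())?;
        }
    } else {
        // Keep the execute bit: executable files get 0o755, others 0o644.
        let executable = meta.permissions().mode() & 0o111 != 0;
        let mode = if executable { 0o755 } else { 0o644 };
        fs::set_permissions(root, fs::Permissions::from_mode(mode))?;
    }
    Ok(touched)
}
```

Because each entry needs its own permission update, this pass scales with file count even when the preceding clone is a single call, which is consistent with the gap between the ~10x claim and the ~4x measured.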