Benchmarks proving macOS clonefile + concurrency-cap perf win by erneestoc · Pull Request #1 · erneestoc/nativelink

erneestoc · 2026-05-16T06:59:56Z

Why this branch exists

Our CI Bazel builds on the macOS Nativelink RBE workers spend more wall time materializing action inputs than actually running the compile. From a real CI run:

Phase	Total	p50	p95
Remote execution file fetching (inputs)	7,313s	1.45s	6.24s
Remote execution process wall time (your code)	4,947s	520ms	6.2s
Remote execution upload time (outputs)	1,840s
Remote execution setup (worker sandbox)	59s

Input-side phases (file fetching plus sandbox setup) total ~7,400s vs ~4,900s for the actual compiles, so the worker spends roughly 1.5x as much wall time setting up inputs as it does running the actions.

What's being shipped

The base branch (ec/macos-clonefile-optimizations) extracts four focused commits from TraceMachina/nativelink#2243 — a 156-commit PR we did NOT want to merge wholesale because it's entangled with unrelated refactors (Moka, Redis, QUIC, scheduler):

APFS clonefile(2) fast path in hardlink_directory_tree (macOS only)
CloneMethod telemetry so production logs can confirm the fast path is engaged
A 64-cap on concurrent hardlink futures in download_to_directory (defends against APFS metadata-lock thrash)
Zero-byte file short-circuit in directory_cache (fixes the ~30% of cache constructions that fail on empty marker files)
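
The zero-byte short-circuit in the last item can be pictured roughly as follows. This is a minimal sketch, not the code in directory_cache: the function name `materialize_file`, its parameters, and the `fetch` callback (standing in for a CAS read) are all hypothetical and chosen only for illustration.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Materializes one cached file at `dst`.
///
/// `fetch` is a hypothetical stand-in for whatever produces the bytes
/// (a CAS read in a real worker). Returns `true` when the zero-byte
/// short-circuit was taken and `fetch` was never called.
pub fn materialize_file<F>(dst: &Path, size_bytes: u64, fetch: F) -> io::Result<bool>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    if size_bytes == 0 {
        // Empty marker file: there is nothing to fetch, so just create it.
        File::create(dst)?;
        return Ok(true);
    }
    let bytes = fetch()?;
    let mut file = File::create(dst)?;
    file.write_all(&bytes)?;
    Ok(false)
}
```

The point of the early return is that an empty file never goes through the fetch path at all, so a fetch path that mishandles empty blobs cannot fail the whole cache construction.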

Linux / Windows paths are untouched.
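
One way to keep the non-macOS paths untouched is to gate the clonefile call behind `cfg(target_os = "macos")` and report which method ran. The sketch below is an assumption about the shape of such a dispatch, not the code in hardlink_directory_tree: the function names, the `CloneMethod` variant names, and the fall-back-on-any-error policy are illustrative. The `clonefile` declaration and the `CLONE_NOFOLLOW` value follow macOS's `<sys/clonefile.h>` and are written by hand only to keep the example dependency-free.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Which strategy materialized the tree (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethod {
    Clonefile,
    PerFileHardlink,
}

#[cfg(target_os = "macos")]
fn try_clonefile(src: &Path, dst: &Path) -> io::Result<()> {
    use std::ffi::CString;
    use std::os::raw::{c_char, c_int};
    use std::os::unix::ffi::OsStrExt;

    unsafe extern "C" {
        fn clonefile(src: *const c_char, dst: *const c_char, flags: u32) -> c_int;
    }
    const CLONE_NOFOLLOW: u32 = 0x0001; // do not follow `src` if it is a symlink

    let to_c = |p: &Path| {
        CString::new(p.as_os_str().as_bytes())
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))
    };
    let (c_src, c_dst) = (to_c(src)?, to_c(dst)?);
    // SAFETY: both pointers are valid NUL-terminated strings for the call.
    let rc = unsafe { clonefile(c_src.as_ptr(), c_dst.as_ptr(), CLONE_NOFOLLOW) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

#[cfg(not(target_os = "macos"))]
fn try_clonefile(_src: &Path, _dst: &Path) -> io::Result<()> {
    Err(io::Error::new(io::ErrorKind::Unsupported, "clonefile is macOS-only"))
}

/// Portable path: recreate directories and hard-link every other entry.
fn hardlink_per_file(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_per_file(&entry.path(), &target)?;
        } else {
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Tries the clone fast path first and reports which method succeeded.
pub fn materialize_tree(src: &Path, dst: &Path) -> io::Result<CloneMethod> {
    if try_clonefile(src, dst).is_ok() {
        return Ok(CloneMethod::Clonefile);
    }
    hardlink_per_file(src, dst)?;
    Ok(CloneMethod::PerFileHardlink)
}
```

Returning the method to the caller is what makes telemetry possible: the caller can log it per action and confirm from production logs that the fast path is actually being taken.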

What this PR adds

Benchmarks and security tests proving the base branch is safe and fast. Two criterion benches under nativelink-util/benches/ and three new fs_util tests. Full numbers + acceptance verdicts in nativelink-util/BENCHMARKS.md.

Performance (Apple M4 Max, APFS)

hardlink_directory_tree on shapes that mirror real SwiftCompile inputs:

shape	files	treatment (clonefile)	baseline (per-file hardlinks)	speedup
`small_flat`	64	4.43 ms	17.7 ms	4.00x
`pcm_cluster`	219	15.23 ms	61.3 ms	4.03x
`medium_flat`	635	49.03 ms	181 ms	3.70x
`large_flat`	1,978	150.18 ms	590 ms	3.93x

The 64-cap on download_to_directory is performance-neutral (0.98x–1.01x) across all input counts in a single process. That demonstrates "no regression," which is the important safety claim for that change.
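
For readers unfamiliar with the pattern, a concurrency cap means "never more than N jobs in flight at once." The PR caps futures; the sketch below swaps in scoped OS threads pulling from a shared index so that it stays dependency-free. `run_capped` and its parameters are hypothetical names, and the peak counter exists only to make the bound observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `job` once per item with at most `cap` jobs in flight.
/// Returns the highest number of jobs observed running at the same time.
pub fn run_capped<T, F>(items: &[T], cap: usize, job: F) -> usize
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    let next = AtomicUsize::new(0);
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    // Never start more workers than the cap (or than there are items).
    let workers = cap.max(1).min(items.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next unprocessed index; stop when none are left.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let item = match items.get(i) {
                    Some(item) => item,
                    None => break,
                };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                job(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
    });
    peak.load(Ordering::SeqCst)
}
```

Calling `run_capped(&paths, 64, |p| { /* link one file */ })` processes every path while the returned peak stays at or below 64, which is the property the cap is meant to guarantee.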

Caveat (worth knowing before merge)

The original PR claimed ~10x on the largest shapes. We see ~4x because hardlink_directory_tree does an O(N) set_readwrite_recursive chmod walk after the O(1) clonefile (the cache stores files 0o555/0o444 and actions need 0o755/0o644 to drop outputs). 4x is still a 440ms-per-action saving on the p95 SwiftCompile shape (~6 min/build across 814 such actions). Parallelizing or replacing the chmod walk is a known follow-up that should unlock the remaining 2-3x.
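
To make the O(N) cost concrete, here is a minimal Unix-only sketch of a recursive chmod walk. It is not the project's set_readwrite_recursive: the name `set_readwrite_walk`, the rule "directories and executables get 0o755, everything else 0o644", and the decision to skip symlinks are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Recursively makes a materialized tree writable: directories and
/// executable files become 0o755, other files 0o644. Symlinks are skipped.
/// Returns the number of entries whose mode was set, i.e. the N in O(N).
pub fn set_readwrite_walk(root: &Path) -> io::Result<usize> {
    let meta = fs::symlink_metadata(root)?;
    if meta.file_type().is_symlink() {
        return Ok(0);
    }
    let mut touched = 1;
    if meta.is_dir() {
        // Make the directory writable and searchable before descending.
        fs::set_permissions(root, fs::Permissions::from_mode(0o755))?;
        for entry in fs::read_dir(root)? {
            touched += set_readwrite_walk(&entry?.path())?;
        }
    } else {
        let executable = meta.permissions().mode() & 0o111 != 0;
        let mode = if executable { 0o755 } else { 0o644 };
        fs::set_permissions(root, fs::Permissions::from_mode(mode))?;
    }
    Ok(touched)
}
```

Every entry costs at least one metadata read and one permission write, so a single clone of the whole tree is still followed by work proportional to the file count. That is consistent with the measured speedup landing well below the originally claimed figure.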

Security tests

10/10 fs_util tests green, including the 3 new ones:

internal symlinks survive clone as symlinks (CLONE_NOFOLLOW only covers the top-level src, so this test confirms that a malicious CAS symlink inside the tree can't be silently followed to /etc/passwd)
top-level symlink src is itself cloned as a symlink, not dereferenced
bad dst-parent error leaves no half-materialized tree
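
The symlink properties above can be checked mechanically by comparing the source and destination trees. The helper below is a hypothetical sketch of such a check, not one of the fs_util tests: `symlinks_preserved` is an illustrative name, and it only verifies that every symlink under `src` is still a symlink with the same target under `dst`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) when every symlink under `src` (including `src` itself)
/// is still a symlink with the same target at the matching path under `dst`.
pub fn symlinks_preserved(src: &Path, dst: &Path) -> io::Result<bool> {
    let src_type = fs::symlink_metadata(src)?.file_type();
    if src_type.is_symlink() {
        // A dereferenced copy would show up here as a regular file or directory.
        let dst_is_link = fs::symlink_metadata(dst)?.file_type().is_symlink();
        return Ok(dst_is_link && fs::read_link(src)? == fs::read_link(dst)?);
    }
    if src_type.is_dir() {
        for entry in fs::read_dir(src)? {
            let name = entry?.file_name();
            if !symlinks_preserved(&src.join(&name), &dst.join(&name))? {
                return Ok(false);
            }
        }
    }
    Ok(true)
}
```

A test built on this would place a symlink to /etc/passwd inside a source tree, materialize the tree, and assert that the helper returns true. A materializer that followed the link would leave a regular file behind and fail the check.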

…le branch

Two criterion benches under nativelink-util/benches/ and three additional security tests in fs_util.rs that together prove (a) the clonefile fast path on macOS is faster than per-file hardlinks and (b) the download_to_directory concurrency cap is not a regression.

Results on Apple M4 Max / APFS (full table in BENCHMARKS.md):
- hardlink_directory_tree, 1978 files / 466 MB: 150 ms vs 590 ms = 3.93x
- hardlink_directory_tree, 635 files / 180 MB: 49 ms vs 181 ms = 3.70x
- download_to_directory_concurrency, n=1978: 893 ms vs 887 ms = 1.01x (perf-neutral on single-process, as expected)

The treatment-vs-baseline comparison uses a new #[doc(hidden)] hardlink_directory_tree_perfile helper that forces the per-file path so both can be measured on the same host.

Security tests added (all green):
- test_clonefile_preserves_internal_symlinks (symlinks inside the tree are cloned as symlinks, not followed)
- test_clonefile_nofollow_on_top_level_symlink_src (top-level symlink src is cloned as a symlink, not followed)
- test_dst_under_file_parent_errors_cleanly (clean error, no half-materialized tree on bad dst parent)

Reverts the bounded-concurrency construct on this POC branch. After #1 makes construct metadata-only, intra-tree parallelism is bounded by APFS metadata serialization; on a busy worker (inter-action concurrency already saturates the box) 64-wide spawn_blocking fan-out risks oversubscription and stealing cycles from the compiles. Keeping #1/#2/TraceMachina#5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced May 16, 2026

macOS: APFS clonefile fast path + concurrency cap + zero-byte fix for Bazel input materialization #2

Closed

macOS: APFS clonefile fast path + concurrency cap + zero-byte fix for Bazel input materialization TraceMachina/nativelink#2338

Merged

erneestoc force-pushed the ec/macos-clonefile-optimizations-benchmarks branch 2 times, most recently from 59daf65 to 9e357d8 Compare May 16, 2026 19:26

erneestoc wants to merge 1 commit into ec/macos-clonefile-optimizations from ec/macos-clonefile-optimizations-benchmarks
