Benchmarks proving macOS clonefile + concurrency-cap perf win #1

Open
erneestoc wants to merge 1 commit into ec/macos-clonefile-optimizations from ec/macos-clonefile-optimizations-benchmarks

@erneestoc
Owner

Why this branch exists

Our CI Bazel builds on the macOS Nativelink RBE workers spend more wall time materializing action inputs than actually running the compile. From a real CI run:

| Phase | Total | p50 | p95 |
|---|---|---|---|
| Remote execution file fetching (inputs) | 7,313s | 1.45s | 6.24s |
| Remote execution process wall time (your code) | 4,947s | 520ms | 6.2s |
| Remote execution upload time (outputs) | 1,840s | | |
| Remote execution setup (worker sandbox) | 59s | | |

Input-side phases (file fetching plus sandbox setup) total ~7,400s vs ~4,900s for the actual compiles. The worker is spending about 1.5x as much wall time setting up inputs as it spends running the actions.
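For reference, the ratio can be reproduced from the table totals. This is only a sketch of the arithmetic; the constant and function names are made up for illustration.

```rust
/// Phase totals in seconds, copied from the CI table above.
const INPUT_FETCH_S: f64 = 7313.0;
const SANDBOX_SETUP_S: f64 = 59.0;
const PROCESS_WALL_S: f64 = 4947.0;

/// Input-side wall time (file fetching + sandbox setup) divided by
/// process wall time. With the totals above this comes out near 1.49.
pub fn input_to_compile_ratio() -> f64 {
    (INPUT_FETCH_S + SANDBOX_SETUP_S) / PROCESS_WALL_S
}
```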

What's being shipped

The base branch (ec/macos-clonefile-optimizations) extracts four focused commits from TraceMachina/nativelink#2243 — a 156-commit PR we did NOT want to merge wholesale because it's entangled with unrelated refactors (Moka, Redis, QUIC, scheduler):

  1. APFS clonefile(2) fast path in hardlink_directory_tree (macOS only)
  2. CloneMethod telemetry so production logs can confirm the fast path is engaged
  3. A 64-cap on concurrent hardlink futures in download_to_directory (defends against APFS metadata-lock thrash)
  4. Zero-byte file short-circuit in directory_cache (fixes the ~30% of cache constructions that fail on empty marker files)

Linux / Windows paths are untouched.
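To make the shape of the fast path concrete, here is a minimal sketch of "try clonefile(2), otherwise fall back to per-file hardlinks." It is not the code from the base branch: `CloneMethodSketch`, `try_clonefile`, `hardlink_tree_per_file`, and `materialize_tree` are hypothetical names, and falling back on any clonefile error is an assumption made to keep the example short. Only `clonefile` and `CLONE_NOFOLLOW` are real macOS APIs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Which strategy materialized the tree (illustrative stand-in for the
/// CloneMethod telemetry; variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethodSketch {
    Clonefile,
    PerFileHardlink,
}

#[cfg(target_os = "macos")]
fn try_clonefile(src: &Path, dst: &Path) -> io::Result<()> {
    use std::ffi::CString;
    use std::os::raw::{c_char, c_int};
    use std::os::unix::ffi::OsStrExt;

    // From <sys/clonefile.h>. On edition 2024 write `unsafe extern "C"`.
    extern "C" {
        fn clonefile(src: *const c_char, dst: *const c_char, flags: u32) -> c_int;
    }
    const CLONE_NOFOLLOW: u32 = 0x0001;

    let src_c = CString::new(src.as_os_str().as_bytes())?;
    let dst_c = CString::new(dst.as_os_str().as_bytes())?;
    // SAFETY: both pointers are valid NUL-terminated strings for the call.
    let rc = unsafe { clonefile(src_c.as_ptr(), dst_c.as_ptr(), CLONE_NOFOLLOW) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

#[cfg(not(target_os = "macos"))]
fn try_clonefile(_src: &Path, _dst: &Path) -> io::Result<()> {
    Err(io::Error::new(io::ErrorKind::Unsupported, "clonefile is macOS-only"))
}

/// Slow path: recreate directories and hardlink every non-directory entry.
fn hardlink_tree_per_file(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_tree_per_file(&entry.path(), &target)?;
        } else {
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Tries the single-call clone first and reports which path was taken.
pub fn materialize_tree(src: &Path, dst: &Path) -> io::Result<CloneMethodSketch> {
    match try_clonefile(src, dst) {
        Ok(()) => Ok(CloneMethodSketch::Clonefile),
        Err(_) => {
            hardlink_tree_per_file(src, dst)?;
            Ok(CloneMethodSketch::PerFileHardlink)
        }
    }
}
```

Returning the method from the call is one way to feed telemetry like item 2: the caller can log the value and confirm in production that the fast path is actually engaged.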

What this PR adds

Benchmarks and security tests proving the base branch is safe and fast. Two criterion benches under nativelink-util/benches/ and three new fs_util tests. Full numbers + acceptance verdicts in nativelink-util/BENCHMARKS.md.

Performance (Apple M4 Max, APFS)

hardlink_directory_tree on shapes that mirror real SwiftCompile inputs:

| shape | files | treatment | baseline (per-file hardlinks) | speedup |
|---|---|---|---|---|
| small_flat | 64 | 4.43 ms | 17.7 ms | 4.00x |
| pcm_cluster | 219 | 15.23 ms | 61.3 ms | 4.03x |
| medium_flat | 635 | 49.03 ms | 181 ms | 3.70x |
| large_flat | 1,978 | 150.18 ms | 590 ms | 3.93x |
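The speedup column is baseline time divided by treatment time. The sketch below recomputes it from the rounded figures in the table, so results can differ from the published column by about 0.01 (the published values were presumably computed from unrounded measurements). The names are illustrative only.

```rust
/// (shape, treatment_ms, baseline_ms) copied from the table above.
const ROWS: [(&str, f64, f64); 4] = [
    ("small_flat", 4.43, 17.7),
    ("pcm_cluster", 15.23, 61.3),
    ("medium_flat", 49.03, 181.0),
    ("large_flat", 150.18, 590.0),
];

/// Speedup of the treatment over the baseline: values above 1.0 mean
/// the treatment is faster.
pub fn speedup(treatment_ms: f64, baseline_ms: f64) -> f64 {
    baseline_ms / treatment_ms
}

/// Recomputes the speedup for every row of the table.
pub fn speedups() -> Vec<(&'static str, f64)> {
    ROWS.iter()
        .map(|&(shape, treatment, baseline)| (shape, speedup(treatment, baseline)))
        .collect()
}
```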

The 64-cap on download_to_directory is performance-neutral (0.98x–1.01x) across all input counts in a single process. That demonstrates "no regression," which is the important safety claim for that change.
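As an illustration of what a concurrency cap does, here is a thread-based analogue. The base branch caps async hardlink futures, which is not reproduced here; this sketch swaps in scoped OS threads pulling from a shared counter so it runs on the standard library alone. `run_capped` is a hypothetical helper, not a nativelink function.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `job` once per item with at most `cap` jobs in flight and
/// returns the peak number of jobs observed running at the same time.
pub fn run_capped<T: Sync, F: Fn(&T) + Sync>(items: &[T], cap: usize, job: F) -> usize {
    let next = AtomicUsize::new(0); // index of the next unclaimed item
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    // Never spawn more workers than the cap or than there are items.
    let workers = cap.max(1).min(items.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(item) = items.get(i) else { break };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                job(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
    });
    peak.load(Ordering::SeqCst)
}
```

Calling `run_capped(&paths, 64, |p| { /* link one file */ })` bounds the fan-out at 64 regardless of how many inputs the action has, which is the property the cap is meant to provide.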

Caveat (worth knowing before merge)

The original PR claimed ~10x on the largest shapes. We see ~4x because hardlink_directory_tree does an O(N) set_readwrite_recursive chmod walk after the O(1) clonefile (the cache stores entries as 0o555/0o444, and actions need 0o755/0o644 to write their outputs). 4x is still a ~440 ms per-action saving on the p95 SwiftCompile shape (~6 min/build across 814 such actions). Parallelizing or replacing the chmod walk is a known follow-up that should unlock the remaining 2-3x.
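To show why the walk is O(N), here is a rough sketch of a recursive permission fix-up. It assumes that adding the owner-write bit is all that is needed (0o555 becomes 0o755, 0o444 becomes 0o644). The real set_readwrite_recursive may do something different, and `set_readwrite_recursive_sketch` is a made-up name. Unix only.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Adds the owner-write bit to every file and directory under `path`
/// and returns how many entries were changed. One metadata call and one
/// chmod per entry is what makes this walk linear in the tree size.
pub fn set_readwrite_recursive_sketch(path: &Path) -> io::Result<u64> {
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        // set_permissions would follow the link, so leave symlinks alone.
        return Ok(0);
    }
    let mode = meta.permissions().mode() & 0o7777;
    fs::set_permissions(path, fs::Permissions::from_mode(mode | 0o200))?;

    let mut visited = 1;
    if meta.is_dir() {
        for entry in fs::read_dir(path)? {
            visited += set_readwrite_recursive_sketch(&entry?.path())?;
        }
    }
    Ok(visited)
}
```

For the large_flat shape this means roughly 2,000 chmod calls after a single clonefile call, which is consistent with the walk dominating the remaining time.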

Security tests

10/10 fs_util tests green, including the 3 new ones:

  • internal symlinks survive clone as symlinks (CLONE_NOFOLLOW only applies to the top-level src, so this confirms that a malicious CAS symlink inside the tree can't be silently followed to /etc/passwd)
  • top-level symlink src is itself cloned as a symlink, not dereferenced
  • bad dst-parent error leaves no half-materialized tree
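The core check behind the two symlink tests can be sketched as a small helper. This is an assumption about how such a test might assert "still a symlink", not the code in fs_util.rs, and `is_symlink_to` is a hypothetical name.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true when `path` is itself a symlink whose stored target is
/// exactly `expected_target`. symlink_metadata and read_link inspect the
/// link rather than following it, so a dereferenced copy returns false.
pub fn is_symlink_to(path: &Path, expected_target: &Path) -> io::Result<bool> {
    let meta = fs::symlink_metadata(path)?;
    if !meta.file_type().is_symlink() {
        return Ok(false);
    }
    Ok(fs::read_link(path)?.as_path() == expected_target)
}
```

A test would build a source tree containing a link to something like /etc/passwd, materialize it, and then assert `is_symlink_to(&dst.join("link"), Path::new("/etc/passwd"))` on the destination.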

…le branch

Two criterion benches under nativelink-util/benches/ and three additional
security tests in fs_util.rs that together prove (a) the clonefile fast
path on macOS is faster than per-file hardlinks and (b) the
download_to_directory concurrency cap is not a regression.

Results on Apple M4 Max / APFS (full table in BENCHMARKS.md):
  - hardlink_directory_tree, 1978 files / 466 MB: 150 ms vs 590 ms = 3.93x
  - hardlink_directory_tree, 635 files / 180 MB:   49 ms vs 181 ms = 3.70x
  - download_to_directory_concurrency, n=1978:    893 ms vs 887 ms = 1.01x
    (perf-neutral on single-process, as expected)

The treatment-vs-baseline comparison uses a new #[doc(hidden)]
hardlink_directory_tree_perfile helper that forces the per-file path so
both can be measured on the same host.

Security tests added (all green):
  - test_clonefile_preserves_internal_symlinks (symlinks inside the tree
    are cloned as symlinks, not followed)
  - test_clonefile_nofollow_on_top_level_symlink_src (top-level symlink
    src is cloned as a symlink, not followed)
  - test_dst_under_file_parent_errors_cleanly (clean error, no half-
    materialized tree on bad dst parent)
@erneestoc force-pushed the ec/macos-clonefile-optimizations-benchmarks branch 2 times, most recently from 59daf65 to 9e357d8, on May 16, 2026 19:26
erneestoc added a commit that referenced this pull request on May 22, 2026
Reverts the bounded-concurrency construct on this POC branch. After #1 makes construct metadata-only, intra-tree parallelism is bounded by APFS metadata serialization; on a busy worker (inter-action concurrency already saturates the box) 64-wide spawn_blocking fan-out risks oversubscription and stealing cycles from the compiles. Keeping #1/#2/TraceMachina#5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>