macOS: APFS clonefile fast path + concurrency cap + zero-byte fix for Bazel input materialization #2

Closed
erneestoc wants to merge 40 commits into main from ec/macos-clonefile-optimizations
@erneestoc
Owner

Summary

Four small, atomic commits that make macOS RBE workers materialize Bazel
action input trees dramatically faster on APFS. Extracted from
TraceMachina/nativelink#2243
as the safest, most independently-verifiable subset.

| Commit | Scope |
| --- | --- |
| `b3b0cd3f` | `fs_util`: APFS `clonefile(2)` fast path in `hardlink_directory_tree` (with `set_readwrite_recursive` so cloned trees stay writable). Falls through to per-file `hard_link` on Linux/Windows. |
| `ab49f162` | `DirectoryCache`: `clonefile_hits` / `hardlink_hits` `AtomicU64` telemetry so production can confirm the fast path is actually firing. |
| `1ddce0fc` | `running_actions_manager::download_to_directory`: cap concurrent `hard_link` calls at 64 via `stream::buffer_unordered` (was unbounded `FuturesUnordered`). |
| `8051ca9e` | `DirectoryCache`: don't crash when the CAS has no entry for the zero-byte digest; short-circuit and create an empty file. |

Why

Bazel exec-log analysis on our worker fleet shows that file fetching
dominates wall time on macOS RBE workers:

| Metric | Value |
| --- | --- |
| Sum of action durations (process wall time) | 4,947 s |
| Sum of file fetching durations | 7,313 s |
| Critical path | 1,074 s |
| Total wall time | 1,213 s |

Summed file-fetching time (7,313 s) exceeds summed process wall time
(4,947 s): the worker spends more time materializing input trees than
running actions. The dominant shape is SwiftCompile (635 input files /
183 MB mean, 1,978 files / 466 MB at p95), exactly the shape APFS
`clonefile(2)` was designed for.

PR TraceMachina#2243 on upstream is large and changes many unrelated things. We're
pulling out just the four pieces that (a) directly attack the
file-fetching bottleneck on macOS and (b) can be A/B tested
independently with the telemetry in ab49f162.

Benchmark proof

See #1 — adds two criterion
benches in `nativelink-util` that reproduce the wins on a real APFS
volume.

Headline numbers on Apple M4 Max / APFS:

  • `hardlink_directory_tree` (b3b0cd3) — 3.6×–4.0× faster across
    all five SwiftCompile-shaped trees. p95 shape (1,978 files / 466 MB)
    drops from 590 ms → 150 ms per action. Scaled by 814 such actions
    per CI build, that's a ~6 minute upper-bound saving per build from
    this single optimization.
  • `download_to_directory` 64-cap (1ddce0f) — performance-neutral
    on a single-process microbench (1.01× vs unbounded at 1,978 files).
    Confirms the cap is not a regression; multi-action contention wins
    will show up in production telemetry.
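
The ~6 minute figure in the first bullet follows from simple arithmetic on the numbers quoted there. A minimal sketch of that calculation; the function names are hypothetical and the inputs are the p95 timings and action count from this description:

```rust
/// Upper-bound saving per build in milliseconds, assuming every one of
/// `actions` saves the full (before - after) difference.
pub fn upper_bound_saving_ms(before_ms: u64, after_ms: u64, actions: u64) -> u64 {
    before_ms.saturating_sub(after_ms) * actions
}

/// Speedup factor for a single action.
pub fn speedup(before_ms: u64, after_ms: u64) -> f64 {
    before_ms as f64 / after_ms as f64
}
```

With `before_ms = 590`, `after_ms = 150` and `actions = 814`, the saving is 358,160 ms (about 5.97 minutes) and the speedup is about 3.93×, which sits inside the reported 3.6×–4.0× range.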

Caveat: 4× not 10×

The handoff predicted ≥10× based on upstream PR TraceMachina#2243's reported wins.
We see a stable ~4×, bounded by the O(N) `set_readwrite_recursive`
chmod walk that runs after the O(1) `clonefile(2)`. Diagnosis and
follow-up (single top-level chmod + lazy per-file) documented in the
bench PR. 4× on the dominant shape already moves the needle hard —
shipping now, optimizing the walk later.
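
To make the O(N) cost concrete, here is a rough sketch of what a recursive permission walk looks like. It is not the code from `fs_util`; `chmod_walk_sketch` is a hypothetical name, the 0o755/0o644 modes are taken from the commit message, and the returned count exists only to show that every entry is visited once.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Makes `root` and everything below it writable (dirs 0o755, files 0o644).
/// Returns the number of entries whose mode was set, i.e. one chmod per
/// entry, which is why the walk is linear in tree size.
/// Hypothetical sketch: symlinks are skipped rather than followed.
pub fn chmod_walk_sketch(root: &Path) -> io::Result<u64> {
    let meta = fs::symlink_metadata(root)?;
    if meta.file_type().is_symlink() {
        return Ok(0);
    }
    if meta.is_dir() {
        fs::set_permissions(root, fs::Permissions::from_mode(0o755))?;
        let mut touched = 1;
        for entry in fs::read_dir(root)? {
            touched += chmod_walk_sketch(&entry?.path())?;
        }
        Ok(touched)
    } else {
        fs::set_permissions(root, fs::Permissions::from_mode(0o644))?;
        Ok(1)
    }
}
```

For the p95 shape (1,978 files) this means roughly two thousand `chmod` calls after a single `clonefile(2)` call, which is consistent with the walk, not the clone, bounding the speedup.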

Telemetry & rollback

`ab49f162` adds `DirectoryCache::stats()` with `clonefile_hits` and
`hardlink_hits` counters so we can confirm in production:

  • macOS workers should show `clonefile_hits` climbing and
    `hardlink_hits` near zero
  • Linux workers should show the inverse (clonefile path is
    macOS-gated at compile time)

If anything looks wrong, the four commits revert independently — each
is its own atomic change with its own test coverage.

Security

10 `fs_util` tests cover the clonefile path (COW isolation, symlink
handling under `CLONE_NOFOLLOW`, error paths, permission transitions
from `0o555` → `0o755`). 2 `DirectoryCache` tests cover the
telemetry counters and the zero-byte short-circuit. All green on macOS
arm64 and Linux x86_64. Details in the bench PR.

What's deliberately NOT included

PR TraceMachina#2243 also touches `has_with_results` chunking and BFS-parallel
`mkdir` — those are deferred because they require their own bench
infrastructure and aren't on the file-fetching critical path. Easy to
pull in as a follow-up if we want them.

🤖 Generated with Claude Code

palfrey and others added 30 commits April 27, 2026 12:55
* Update dependency rules_python to v2

* Use local python version, not rules_python downloaded one

* Update dependency rules_python to v2

* Use local python version, not rules_python downloaded one

* Commit module.bazel.lock

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
* Add use_legacy_resource_names option to GrpcSpec

Some older backends (e.g. Buildbarn) do not understand the modern
ByteStream resource name format that includes the digest function:

  {instance}/blobs/{digest_function}/{hash}/{size}

They expect the original format without the digest function component:

  {instance}/blobs/{hash}/{size}

Add a `use_legacy_resource_names` boolean to GrpcSpec (default false)
that, when enabled, omits the digest function from ByteStream resource
names for both reads and writes. This fixes InvalidArgument errors when
proxying to such backends.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update nativelink-config/src/stores.rs

* Add testing for legacy resource names

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
Harry Potter would like the clipboard buffer
* Set arg0 for process.

In TraceMachina#2237 the process path was canonicalised to work around the Rust stdlib instability when using current_dir. However, this breaks RBE where the program path changes the compiler behaviour.

Update the builder to ensure that the arg0 remains relative even when the process path is resolved.

* Get full path to sh

---------

Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
…TraceMachina#2288)

* Forward client headers and OTEL trace context to upstream gRPC stores

Adds two complementary mechanisms to GrpcStore for propagating headers
to upstream remote caches (e.g. Buildbarn):

1. `headers` (GrpcSpec): static key/value pairs attached to every
   outgoing request, useful for fixed auth tokens.

2. `forward_headers` (GrpcSpec): header names to forward from the
   inbound client request. OtlpMiddleware captures all ASCII-valued
   headers from the client into a ClientHeaders value stored in the
   task context; enrich_request reads them back and injects whichever
   names are listed here. This enables JWT pass-through so build
   clients can authenticate directly with upstream caches.

Additionally, every outgoing request now has the current W3C trace
context (traceparent/tracestate) injected via the OpenTelemetry
propagator, fixing distributed trace continuity across NativeLink
instances and into upstream services.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
* Interval for keepalives

* Upgrade tokio to 1.52.2

* Actually spawn the keep alive
* Migrate to rules_rs (hermeticbuild)

* update bazel lockfile

* hermetic llvm
…GREGATE (TraceMachina#2298)

* fast_slow_store: only bound followers' wait, never the leader's populate

* fast_slow_store: never pass caller's writer into follower closures

* Add tests for leader follower split

* redis_store: lightweight check_health using PING instead of full I/O

* Add more tests

* Fix tokio time timeout declaration
* Add teardown step to check attic push
* Use patched attic for watch-store fixes
* Up the timeout so complete rebuilds can work
…re (TraceMachina#2322)

* execution_server: pre-validate CAS blobs and return PreconditionFailure

* Add Error Type to handle Not Found Context
palfrey and others added 10 commits May 13, 2026 13:28
* Update the SECURITY.md

* Discord -> Slack
* Split rbe-toolchain into multiple tests
* Disable slow remove packages step

On macOS, try APFS clonefile(2) before falling back to the existing
per-file hardlink walk. clonefile is O(1) in tree size and uses
copy-on-write, so subtree-cache hits no longer scale with input count.

After a successful clone the destination is made writable (0o755/0o644)
because the clone inherits the source's permissions and cached subtrees
are 0o555/0o444. The COW semantics of clonefile mean writes to the
destination do not affect the source, so this is safe.

On EXDEV (cross-volume), ENOTSUP, or any other errno, we log at debug
and fall through to hardlink_directory_tree_recursive. Linux and Windows
paths are unchanged.
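
The try-then-fall-through shape described above can be sketched as follows. This is an illustration under assumptions, not the code in `fs_util`: `clone_or_hardlink`, `try_clonefile` and `hardlink_tree` are hypothetical names, the FFI declaration mirrors the macOS `clonefile(2)` prototype, and the permission fix-up and debug logging are left out. It also assumes `dst` does not exist yet.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethod {
    Clonefile,
    Hardlink,
}

#[cfg(target_os = "macos")]
fn try_clonefile(src: &Path, dst: &Path) -> io::Result<()> {
    use std::ffi::CString;
    use std::os::raw::{c_char, c_int};
    use std::os::unix::ffi::OsStrExt;

    unsafe extern "C" {
        fn clonefile(src: *const c_char, dst: *const c_char, flags: u32) -> c_int;
    }
    // Value of CLONE_NOFOLLOW from <sys/clonefile.h>.
    const CLONE_NOFOLLOW: u32 = 0x0001;

    let c_src = CString::new(src.as_os_str().as_bytes())?;
    let c_dst = CString::new(dst.as_os_str().as_bytes())?;
    let rc = unsafe { clonefile(c_src.as_ptr(), c_dst.as_ptr(), CLONE_NOFOLLOW) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

#[cfg(not(target_os = "macos"))]
fn try_clonefile(_src: &Path, _dst: &Path) -> io::Result<()> {
    // The fast path is compiled out on other platforms.
    Err(io::Error::new(io::ErrorKind::Unsupported, "clonefile is macOS-only"))
}

/// Per-file fallback: recreate directories, hard-link everything else.
fn hardlink_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_tree(&entry.path(), &target)?;
        } else {
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Tries the whole-tree clone first; on any error (EXDEV, ENOTSUP, ...)
/// falls through to the per-file walk and reports which path was taken.
pub fn clone_or_hardlink(src: &Path, dst: &Path) -> io::Result<CloneMethod> {
    match try_clonefile(src, dst) {
        Ok(()) => Ok(CloneMethod::Clonefile),
        Err(_) => {
            hardlink_tree(src, dst)?;
            Ok(CloneMethod::Hardlink)
        }
    }
}
```

Treating every errno as "fall through" keeps the fast path purely opportunistic: a cross-volume cache or a non-APFS filesystem degrades to the old behavior instead of failing the action.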

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit 13fcc0c).

hardlink_directory_tree now returns CloneMethod (Clonefile | Hardlink) so
callers can see which kernel path was taken. DirectoryCache records the
result in two atomic counters and exposes them via CacheStats. Without
this telemetry, a silent fall-through from clonefile to per-file
hardlinks (e.g. cache and workspace on different volumes, or APFS
clonefile failing for any other reason) would be invisible.

On Linux/Windows hardlink_directory_tree always returns Hardlink — no
behavioral change. The new CacheStats fields default to zero on those
platforms.
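
A minimal sketch of the counter pattern described here, assuming a reduced `CacheStats` with only the two new fields. `MethodCounters` and `record` are hypothetical names; the real `DirectoryCache` and `CacheStats` may be shaped differently.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethod {
    Clonefile,
    Hardlink,
}

/// Point-in-time snapshot handed to callers (reduced to the two new fields).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct CacheStats {
    pub clonefile_hits: u64,
    pub hardlink_hits: u64,
}

/// Hypothetical holder for the two atomic counters; both start at zero.
#[derive(Debug, Default)]
pub struct MethodCounters {
    clonefile_hits: AtomicU64,
    hardlink_hits: AtomicU64,
}

impl MethodCounters {
    /// Bumps the counter matching the path the tree copy actually took.
    pub fn record(&self, method: CloneMethod) {
        let counter = match method {
            CloneMethod::Clonefile => &self.clonefile_hits,
            CloneMethod::Hardlink => &self.hardlink_hits,
        };
        // Relaxed is enough here: the counts are independent statistics
        // and do not order any other memory accesses.
        counter.fetch_add(1, Ordering::Relaxed);
    }

    pub fn stats(&self) -> CacheStats {
        CacheStats {
            clonefile_hits: self.clonefile_hits.load(Ordering::Relaxed),
            hardlink_hits: self.hardlink_hits.load(Ordering::Relaxed),
        }
    }
}
```

On a macOS worker where the fast path fires, repeated `record(CloneMethod::Clonefile)` calls would leave `hardlink_hits` at zero; a non-zero value there is the signal that a silent fall-through is happening.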

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit 13fcc0c).

Previously download_to_directory pushed every file-hardlink, subdir
recursion, and symlink future onto an unbounded FuturesUnordered, then
drained it. On macOS this produced thousands of parallel hardlink(2)
calls fighting APFS's per-volume metadata lock — the observed exec-log
shape was ~4 ms per input file at scale, consistent with serialized
metadata mutations plus tokio scheduling overhead.

This commit gates each directory level to at most 64 in-flight futures
via stream::buffer_unordered(64). 64 is well above the inflection point
on any modern Linux filesystem, so Linux is unaffected beyond replacing
tokio scheduling overhead with simpler stream polling.

Scope notes (vs PR TraceMachina#2243 ee85fdc):
- The chunked has_with_results sub-change does not apply directly: the
  current code calls populate_fast_store per-digest, not a batched
  has_with_results.
- Level-parallel BFS mkdir is not applied here; the recursion structure
  is unchanged. The 64-cap is per recursive call, not global. Deep trees
  can therefore still have 64 * depth in-flight futures. A full flatten
  pass is a follow-up.
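
The commit itself uses `stream::buffer_unordered(64)` from the `futures` crate. The sketch below is not that code: it is a std-only, thread-based analogue of the same idea (never more than `cap` operations in flight for one level), written so it runs without external crates. `run_capped` is a hypothetical name, and the peak counter exists only to make the bound observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `op` over every item with at most `cap` calls in flight at once.
/// Returns the peak number of calls that were running simultaneously.
pub fn run_capped<T, F>(items: Vec<T>, cap: usize, op: F) -> usize
where
    T: Send,
    F: Fn(T) + Sync,
{
    let queue = Mutex::new(items.into_iter());
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|s| {
        // `cap` workers pull from one shared queue, so the worker count
        // is the hard upper bound on concurrent `op` calls.
        for _ in 0..cap.max(1) {
            s.spawn(|| loop {
                // The lock is released at the end of this statement,
                // before `op` runs.
                let next = queue.lock().unwrap().next();
                let Some(item) = next else { break };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                op(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
    });

    peak.load(Ordering::SeqCst)
}
```

Because the cap applies per call, nesting one capped call inside another (as the recursion over subdirectories does) multiplies the bound by the nesting depth, which is the `64 * depth` caveat noted in the scope notes.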

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit ee85fdc),
narrowed to fit the current code shape.

FilesystemStore (and several other CAS backends) refuse to store
zero-byte blobs, so a get_part_unchunked for the zero-byte digest
(af1349b9... / e3b0c442...) returns NotFound. Bazel input trees
routinely contain empty marker/config files (.linksearchpaths, empty
.env, .toml, etc.), so without this fix a single such file in any
directory causes the entire DirectoryCache construction to fail —
roughly 30% of cache attempts per PR TraceMachina#2243.

Short-circuit create_file: if the digest is the zero-byte digest, write
b"" to disk directly and never consult the CAS.

Cross-platform correctness fix.
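
A rough sketch of the short-circuit, under stated assumptions: `Digest` here is a hypothetical stand-in with just a hash and a size, the CAS lookup is modelled as a caller-supplied closure, and "is the zero-byte digest" is approximated by `size_bytes == 0`. The real `create_file` signature is not shown in this PR text.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for a CAS digest; only the size matters here.
pub struct Digest {
    pub hash: String,
    pub size_bytes: u64,
}

/// Writes the blob for `digest` to `path`. For the zero-byte digest the
/// CAS is never consulted: an empty file is written directly.
pub fn create_file<F>(path: &Path, digest: &Digest, fetch: F) -> io::Result<()>
where
    F: FnOnce(&Digest) -> io::Result<Vec<u8>>,
{
    if digest.size_bytes == 0 {
        // Stores that refuse zero-byte blobs would return NotFound here,
        // so skip the lookup entirely.
        return fs::write(path, b"");
    }
    let bytes = fetch(digest)?;
    fs::write(path, bytes)
}
```

Passing a `fetch` closure that always returns `NotFound` still succeeds for an empty file, which is the failure mode the commit removes.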

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit d198902).

@erneestoc closed this May 16, 2026
erneestoc added a commit that referenced this pull request May 22, 2026
Reverts the bounded-concurrency construct on this POC branch. After #1 makes construct metadata-only, intra-tree parallelism is bounded by APFS metadata serialization; on a busy worker (inter-action concurrency already saturates the box) 64-wide spawn_blocking fan-out risks oversubscription and stealing cycles from the compiles. Keeping #1/#2/TraceMachina#5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>