macOS: APFS clonefile fast path + concurrency cap + zero-byte fix for Bazel input materialization #2
Closed
erneestoc wants to merge 40 commits into
* Update dependency rules_python to v2
* Use local python version, not rules_python downloaded one
* Commit module.bazel.lock
---------
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
* Add use_legacy_resource_names option to GrpcSpec
Some older backends (e.g. Buildbarn) do not understand the modern
ByteStream resource name format that includes the digest function:
{instance}/blobs/{digest_function}/{hash}/{size}
They expect the original format without the digest function component:
{instance}/blobs/{hash}/{size}
Add a `use_legacy_resource_names` boolean to GrpcSpec (default false)
that, when enabled, omits the digest function from ByteStream resource
names for both reads and writes. This fixes InvalidArgument errors when
proxying to such backends.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update nativelink-config/src/stores.rs
* Add testing for legacy resource names
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
* Set arg0 for process.
In TraceMachina#2237 the process path was canonicalised to work around the Rust stdlib instability when using current_dir. However, this breaks RBE where the program path changes the compiler behaviour. Update the builder to ensure that the arg0 remains relative even when the process path is resolved.
* Get full path to sh
---------
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
…TraceMachina#2288)
* Forward client headers and OTEL trace context to upstream gRPC stores
Adds two complementary mechanisms to GrpcStore for propagating headers to upstream remote caches (e.g. Buildbarn):
1. `headers` (GrpcSpec): static key/value pairs attached to every outgoing request, useful for fixed auth tokens.
2. `forward_headers` (GrpcSpec): header names to forward from the inbound client request. OtlpMiddleware captures all ASCII-valued headers from the client into a ClientHeaders value stored in the task context; enrich_request reads them back and injects whichever names are listed here. This enables JWT pass-through so build clients can authenticate directly with upstream caches.
Additionally, every outgoing request now has the current W3C trace context (traceparent/tracestate) injected via the OpenTelemetry propagator, fixing distributed trace continuity across NativeLink instances and into upstream services.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
* Interval for keepalives
* Upgrade tokio to 1.52.2
* Actually spawn the keep alive

* Migrate to rules_rs (hermeticbuild)
* update bazel lockfile
* hermetic llvm

…GREGATE (TraceMachina#2298)
* fast_slow_store: only bound followers' wait, never the leader's populate
* fast_slow_store: never pass caller's writer into follower closures
* Add tests for leader follower split
* redis_store: lightweight check_health using PING instead of full I/O
* Add more tests
* Fix tokio time timeout declaration

* Add teardown step to check attic push
* Use patched attic for watch-store fixes
* Up the timeout so complete rebuilds can work

…re (TraceMachina#2322)
* execution_server: pre-validate CAS blobs and return PreconditionFailure
* Add Error Type to handle Not Found Context

* Update the SECURITY.md
* Discord -> Slack

* Split rbe-toolchain into multiple tests
* Disable slow remove packages step

On macOS, try APFS clonefile(2) before falling back to the existing per-file hardlink walk. clonefile is O(1) in tree size and uses copy-on-write, so subtree-cache hits no longer scale with input count.

After a successful clone the destination is made writable (0o755/0o644) because the clone inherits the source's permissions and cached subtrees are 0o555/0o444. The COW semantics of clonefile mean writes to the destination do not affect the source, so this is safe.

On EXDEV (cross-volume), ENOTSUP, or any other errno, we log at debug and fall through to hardlink_directory_tree_recursive. Linux and Windows paths are unchanged.

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit 13fcc0c).
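
The clone-then-fall-back flow described in this commit message can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's code: `materialize_tree`, `try_clonefile`, and `hardlink_tree` are hypothetical names, the `clonefile` declaration is an assumed FFI binding for the macOS system call, and chmodding every file to 0o644 is a simplification.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Which kernel path materialized the tree (enum name taken from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethod {
    Clonefile,
    Hardlink,
}

#[cfg(target_os = "macos")]
fn try_clonefile(src: &Path, dst: &Path) -> io::Result<()> {
    use std::ffi::CString;
    use std::os::raw::{c_char, c_int};
    use std::os::unix::ffi::OsStrExt;

    // Assumed binding for clonefile(2); the symbol lives in libSystem.
    unsafe extern "C" {
        fn clonefile(src: *const c_char, dst: *const c_char, flags: u32) -> c_int;
    }
    const CLONE_NOFOLLOW: u32 = 0x0001;

    let src = CString::new(src.as_os_str().as_bytes())?;
    let dst = CString::new(dst.as_os_str().as_bytes())?;
    // SAFETY: both pointers are valid NUL-terminated strings for the call.
    let rc = unsafe { clonefile(src.as_ptr(), dst.as_ptr(), CLONE_NOFOLLOW) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

#[cfg(not(target_os = "macos"))]
fn try_clonefile(_src: &Path, _dst: &Path) -> io::Result<()> {
    // No clonefile outside macOS: report an error so the caller falls back.
    Err(io::Error::new(io::ErrorKind::Unsupported, "clonefile is macOS-only"))
}

/// Make a cloned tree writable: 0o755 for directories, 0o644 for files.
#[cfg(target_os = "macos")]
fn set_readwrite_recursive(path: &Path) -> io::Result<()> {
    use std::os::unix::fs::PermissionsExt;
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        return Ok(()); // leave symlinks alone
    }
    if meta.is_dir() {
        fs::set_permissions(path, fs::Permissions::from_mode(0o755))?;
        for entry in fs::read_dir(path)? {
            set_readwrite_recursive(&entry?.path())?;
        }
    } else {
        fs::set_permissions(path, fs::Permissions::from_mode(0o644))?;
    }
    Ok(())
}

/// Per-file fallback: recreate directories, hard-link everything else.
fn hardlink_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_tree(&entry.path(), &target)?;
        } else {
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Try the clone fast path first; on any error fall through to hardlinks.
pub fn materialize_tree(src: &Path, dst: &Path) -> io::Result<CloneMethod> {
    match try_clonefile(src, dst) {
        Ok(()) => {
            #[cfg(target_os = "macos")]
            set_readwrite_recursive(dst)?;
            Ok(CloneMethod::Clonefile)
        }
        Err(err) => {
            eprintln!("clonefile unavailable ({}); using hardlinks", err);
            hardlink_tree(src, dst)?;
            Ok(CloneMethod::Hardlink)
        }
    }
}
```

Returning the method taken, rather than `()`, is what lets a caller count how often the fast path actually fires.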
hardlink_directory_tree now returns CloneMethod (Clonefile | Hardlink) so callers can see which kernel path was taken. DirectoryCache records the result in two atomic counters and exposes them via CacheStats.

Without this telemetry, a silent fall-through from clonefile to per-file hardlinks (e.g. cache and workspace on different volumes, or APFS clonefile failing for any other reason) would be invisible.

On Linux/Windows hardlink_directory_tree always returns Hardlink, so there is no behavioral change. The new CacheStats fields default to zero on those platforms.

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit 13fcc0c).
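
A minimal sketch of the two-counter telemetry this commit message describes, assuming relaxed atomics are sufficient for monotonic hit counts. `MaterializeCounters` and its methods are hypothetical names; only `CloneMethod`, `CacheStats`, `clonefile_hits`, and `hardlink_hits` come from the PR text, and the real field layout may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Which kernel path materialized a tree (names from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloneMethod {
    Clonefile,
    Hardlink,
}

/// Point-in-time snapshot of the counters (shape assumed for illustration).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct CacheStats {
    pub clonefile_hits: u64,
    pub hardlink_hits: u64,
}

/// Hypothetical holder for the two atomic counters.
#[derive(Default)]
pub struct MaterializeCounters {
    clonefile_hits: AtomicU64,
    hardlink_hits: AtomicU64,
}

impl MaterializeCounters {
    /// Record which path one materialization took.
    pub fn record(&self, method: CloneMethod) {
        let counter = match method {
            CloneMethod::Clonefile => &self.clonefile_hits,
            CloneMethod::Hardlink => &self.hardlink_hits,
        };
        // Relaxed suffices: the counters are independent and only ever grow.
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Snapshot both counters for reporting.
    pub fn stats(&self) -> CacheStats {
        CacheStats {
            clonefile_hits: self.clonefile_hits.load(Ordering::Relaxed),
            hardlink_hits: self.hardlink_hits.load(Ordering::Relaxed),
        }
    }
}
```

On a platform where the fallback always runs, `clonefile_hits` simply stays at its default of zero.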
Previously download_to_directory pushed every file-hardlink, subdir recursion, and symlink future onto an unbounded FuturesUnordered, then drained it. On macOS this produced thousands of parallel hardlink(2) calls fighting APFS's per-volume metadata lock; the observed exec-log shape was ~4 ms per input file at scale, consistent with serialized metadata mutations plus tokio scheduling overhead.

This commit gates each directory level to at most 64 in-flight futures via stream::buffer_unordered(64). 64 is well above the inflection point on any modern Linux filesystem, so Linux is unaffected beyond replacing tokio scheduling overhead with simpler stream polling.

Scope notes (vs PR TraceMachina#2243 ee85fdc):
- The chunked has_with_results sub-change does not apply directly: the current code calls populate_fast_store per-digest, not a batched has_with_results.
- Level-parallel BFS mkdir is not applied here; the recursion structure is unchanged.

The 64-cap is per recursive call, not global. Deep trees can therefore still have 64 * depth in-flight futures. A full flatten pass is a follow-up.

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit ee85fdc), narrowed to fit the current code shape.
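
The idea of capping in-flight work can be illustrated without an async runtime. The sketch below is a std-only analogue using scoped threads; it is not the PR's code, which uses `stream::buffer_unordered(64)` from the futures crate. `run_bounded` is a hypothetical helper that also reports the peak concurrency it observed, so the cap can be checked.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `job` over every item with at most `cap` jobs in flight at once.
/// Returns the highest number of simultaneously running jobs observed.
/// A `cap` of zero is treated as one so the work still completes.
pub fn run_bounded<T, F>(items: &[T], cap: usize, job: F) -> usize
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed item
    let in_flight = AtomicUsize::new(0); // jobs currently running
    let peak = AtomicUsize::new(0); // high-water mark of `in_flight`
    let workers = cap.min(items.len()).max(1);

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next item; stop once the slice is exhausted.
                let index = next.fetch_add(1, Ordering::Relaxed);
                let item = match items.get(index) {
                    Some(item) => item,
                    None => break,
                };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                job(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
    });

    peak.load(Ordering::SeqCst)
}
```

Because only `workers` threads exist, the number of concurrent jobs can never exceed the cap, which is the same guarantee the bounded stream gives per directory level.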
FilesystemStore (and several other CAS backends) refuse to store zero-byte blobs, so a get_part_unchunked for the zero-byte digest (af1349b9... / e3b0c449...) returns NotFound. Bazel input trees routinely contain empty marker/config files (.linksearchpaths, empty .env, .toml, etc.), so without this fix a single such file in any directory causes the entire DirectoryCache construction to fail: roughly 30% of cache attempts per PR TraceMachina#2243.

Short-circuit create_file: if the digest is the zero-byte digest, write b"" to disk directly and never consult the CAS.

Cross-platform correctness fix. Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit d198902).
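
The short-circuit can be sketched as below. This is a hedged illustration only: the `Digest` struct, the `fetch_from_cas` callback, and this `create_file` signature are hypothetical stand-ins, and the sketch detects the zero-byte digest by its size alone rather than by comparing hashes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical digest type: content hash plus blob size in bytes.
pub struct Digest {
    pub hash: String,
    pub size_bytes: u64,
}

/// Write the blob for `digest` to `path`.
/// `fetch_from_cas` stands in for the real CAS read and may fail with NotFound.
pub fn create_file<F>(path: &Path, digest: &Digest, fetch_from_cas: F) -> io::Result<()>
where
    F: FnOnce(&Digest) -> io::Result<Vec<u8>>,
{
    if digest.size_bytes == 0 {
        // Zero-byte blob: the content is known, so never consult the CAS.
        return fs::write(path, b"");
    }
    let bytes = fetch_from_cas(digest)?;
    fs::write(path, bytes)
}
```

With this shape, a backend that refuses to store empty blobs can no longer fail the whole directory build on a single empty marker file.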
erneestoc added a commit that referenced this pull request on May 22, 2026
Reverts the bounded-concurrency construct on this POC branch. After #1 makes construct metadata-only, intra-tree parallelism is bounded by APFS metadata serialization; on a busy worker (inter-action concurrency already saturates the box) 64-wide spawn_blocking fan-out risks oversubscription and stealing cycles from the compiles. Keeping #1/#2/TraceMachina#5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Four small, atomic commits that make macOS RBE workers materialize Bazel
action input trees dramatically faster on APFS. Extracted from
TraceMachina/nativelink#2243
as the safest, most independently-verifiable subset.
* `b3b0cd3f` `fs_util`: APFS `clonefile(2)` fast path in `hardlink_directory_tree` (with `set_readwrite_recursive` so cloned trees stay writable). Falls through to per-file `hard_link` on Linux/Windows.
* `ab49f162` `DirectoryCache`: `clonefile_hits` / `hardlink_hits` AtomicU64 telemetry so production can confirm the fast path is actually firing.
* `1ddce0fc` `running_actions_manager::download_to_directory`: cap concurrent `hard_link` calls at 64 via `stream::buffer_unordered` (was unbounded `FuturesUnordered`).
* `8051ca9e` `DirectoryCache`: don't crash when the CAS has no entry for the zero-byte digest; short-circuit and create an empty file.
Why
Bazel exec-log analysis on our worker fleet shows that file fetching
dominates wall time on macOS RBE workers:
File-fetching wall time exceeds all process wall time combined: the
worker is spending more time materializing input trees than running
actions. The dominant shape is SwiftCompile (635 input files / 183 MB
mean, 1,978 files / 466 MB at p95), exactly the shape APFS
`clonefile(2)` was designed for.

PR TraceMachina#2243 on upstream is large and changes many unrelated things. We're
pulling out just the four pieces that (a) directly attack the
file-fetching bottleneck on macOS and (b) can be A/B tested
independently with the telemetry in
`ab49f162`.
Benchmark proof
See #1 — adds two criterion
benches in `nativelink-util` that reproduce the wins on a real APFS
volume.
Headline numbers on Apple M4 Max / APFS:
all five SwiftCompile-shaped trees. p95 shape (1,978 files / 466 MB)
drops from 590 ms → 150 ms per action. Scaled by 814 such actions
per CI build, that's a ~6 minute upper-bound saving per build from
this single optimization.
on a single-process microbench (1.01× vs unbounded at 1,978 files).
Confirms the cap is not a regression; multi-action contention wins
will show up in production telemetry.
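
As a sanity check on the upper-bound saving quoted for the p95 shape, the arithmetic works out as follows, assuming every one of the 814 actions sees the full p95 improvement:

```latex
\begin{aligned}
\Delta t &= 590\ \text{ms} - 150\ \text{ms} = 440\ \text{ms per action} \\
814 \times 440\ \text{ms} &= 358{,}160\ \text{ms} \approx 358\ \text{s} \approx 6.0\ \text{min} \\
\text{speedup} &= 590 / 150 \approx 3.9\times
\end{aligned}
```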
Caveat: 4× not 10×
The handoff predicted ≥10× based on upstream PR TraceMachina#2243's reported wins.
We see a stable ~4×, bounded by the O(N) `set_readwrite_recursive`
chmod walk that runs after the O(1) `clonefile(2)`. Diagnosis and
follow-up (single top-level chmod + lazy per-file) documented in the
bench PR. 4× on the dominant shape already moves the needle hard —
shipping now, optimizing the walk later.
Telemetry & rollback
`ab49f162` adds `DirectoryCache::stats()` with `clonefile_hits` and
`hardlink_hits` counters so we can confirm in production:
`hardlink_hits` near zero
macOS-gated at compile time)
If anything looks wrong, the four commits revert independently — each
is its own atomic change with its own test coverage.
Test coverage
10 `fs_util` tests cover the clonefile path (COW isolation, symlink
handling under `CLONE_NOFOLLOW`, error paths, permission transitions
from `0o555` → `0o755`). 2 `DirectoryCache` tests cover the
telemetry counters and the zero-byte short-circuit. All green on macOS
arm64 and Linux x86_64. Details in the bench PR.
What's deliberately NOT included
PR TraceMachina#2243 also touches `has_with_results` chunking and BFS-parallel
`mkdir` — those are deferred because they require their own bench
infrastructure and aren't on the file-fetching critical path. Easy to
pull in as a follow-up if we want them.
🤖 Generated with Claude Code