Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243
Open
rejuvenile wants to merge 310 commits into
Open
Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243rejuvenile wants to merge 310 commits into
rejuvenile wants to merge 310 commits into
Conversation
inner_store returned self, preventing callers (like LocalWorker) from downcasting through the chain to find FastSlowStore. Delegate to inner store instead — optimized_for override is independent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EvictingMap: warn! for items evicted within 120s of insertion (age + size in log), debug! for older items. Helps diagnose Bazel "lost inputs" errors. Worker: append .local to bare hostnames for mDNS resolution so the server can connect to worker CAS endpoints for peer blob sharing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
upload_results writes to fast_store() only (FilesystemStore), deferring remote CAS upload to spawn_upload_to_remote. But that function only collected tree_digest blobs, not the individual file blobs inside output directory trees (dep-graph.bin, query-cache.bin, etc). This caused "Missing digest" / "lost inputs" errors when Bazel tried to download action outputs that were never pushed to the server. Fix: decode each Tree proto from fast_store and extract all file digests for inclusion in the background upload. Also add success/fail counters and tree file count to upload logging for diagnostics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers now race a peer fetch (via locality map) in parallel with the server fetch when peers are known. Whichever responds first wins; the loser is cancelled. This gives LAN peers a chance to win when they're closer/faster. Server-side behavior is unchanged — IS_WORKER_REQUEST detection ensures the sequential path (with redirect generation) is used for server-side requests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit raced on both server and worker sides because IS_WORKER_REQUEST isn't set for Bazel client requests to the server. Add an explicit race_peers flag (default false) that only workers enable, preventing the server from wastefully racing against its own workers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AS, register output digests - Parallelize server-side GetTree BFS (FuturesUnordered per tree level) - GrpcStore: report LazyExistenceOnSync for CAS stores (skip FindMissingBlobs before get_part) - WORKER_BACKLOG 8→64 to reduce backpressure during burst patterns - Worker peer CAS connections 4→64 - Include tree digests in BlobsAvailableNotification from worker - Register output digests from ExecuteResult in server locality map - Fix existence_store_test: yield for async eviction callbacks - Fix bytestream_server_test: tonic Status format change Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…check, parallel mkdir - Bound hardlink phase to 64 concurrent tasks (was unbounded 4000+) - Split has_with_results into 500-key chunks to release Mutex between batches - Level-parallel BFS for directory creation (siblings concurrent, parents first) - Log CAS server exit errors in local_worker Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log requested/missing counts at info level, and list missing digests at debug level. Needed to diagnose "Lost inputs" build failures where blobs exist on disk but Bazel reports them missing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tribution Uses WorkerProxyStore locality map to route batch blob reads to peers that already have the data, falling back to server for unknown digests. All peer and server batches execute in parallel via join_all, eliminating the previous server-only bottleneck where 10 workers competed for the same blobs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…concurrency to 64 - Round-robin digest assignment across peers that have the blob, preventing hotspots when one peer has most blobs - Retry path is now best-effort: individual failures are logged and skipped instead of aborting the entire batch operation - BYTESTREAM_CONCURRENCY increased from 16 to 64 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cpu_load_pct (load_avg_1m / num_cpus * 100) piggybacked on KeepAliveRequest (~2.5s), BlobsAvailableNotification (~500ms), and ExecuteComplete (per action). The scheduler stores this per-worker and prefers lightly-loaded workers when selecting candidates: - LRU/MRU fallback path: picks lightest-loaded viable worker - Locality scoring tiebreaker: when scores are within 10%, lower CPU load wins before timestamp - Workers reporting 0 (unknown/old) are sorted last among known loads Backward compatible: old workers send 0 (proto default), treated as unknown and sorted last. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs in inner_get_tree: 1. FuturesUnordered returned directories in completion order, not BFS discovery order, making paging tokens nondeterministic. Fixed by collecting into a HashMap and iterating in original order. 2. page_size=0 (no paging) triggered `len >= 0` which is always true, breaking after the first BFS level. Fixed by treating 0 as MAX. 3. When page was filled mid-level, remaining unprocessed items were dropped, producing empty next_page_token. Fixed by copying remaining items back to the deque front. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e fast path - GetTree BFS validation: use position assignment instead of re-hashing directory protos (fixes 77% validation failure rate with Java-serialized protos) - EvictingMap::get_many(): batch retrieval in single lock acquisition - FilesystemStore::get_file_entries_batch(): batch file entry lookup - Hardlink pre-fetch: pre-fetch entries in batches of 500 before concurrent hardlink loop, eliminating per-file EvictingMap lock contention - Blob eviction race fix: eager pre-read of small blobs (<=1MiB) before background upload to prevent eviction race in spawn_upload_to_remote - Directory cache: use download_to_directory for cache-miss construction instead of serial per-file RPCs (2.5s -> 50-200ms) - Combined set_readonly_recursive + calculate_directory_size into single walk - GetTree BFS dedup logging: per-level timing, dedup stats, slow level warnings - Input fetch logging: tree resolution, materialization, hardlink stats, blob fetch throughput, slow operation warnings - CPU load logging downgraded to debug level (worker + server) - Load-aware selection logging downgraded to debug level - Fix mkdir_depth_levels log field (was using dirs_created instead of depth) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Streaming pipeline: producer fetches missing blobs in batches of 64, consumer hardlinks concurrently (64-wide) as blobs arrive via mpsc channel. Both overlap via futures::future::join for minimum wall-clock time. - Raise MAX_PEER_HINTS from 1000 to 16384 to cover large actions. - FastSlowStore: per-leg timing (fast_ms, slow_ms, slower_leg). - FilesystemStore: per-phase timing (temp create, write, emplace) >50ms. - EvictingMap: warn! on lock contention >1ms with operation name. - StallGuard: with_context() for dynamic digest/size in stall dumps. - DirectoryCache: comprehensive hit/miss/eviction/timing logging. - MemoryStore config: 16GB→32GB per tier (64GB total) in server configs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlinks as blobs arrive Three-future pipeline: fetcher (all blobs at once, bounded at 128), producer (sends files to channel as blobs land via Notify), consumer (hardlinks at 64 concurrency). Eliminates serial per-batch round-trips that bottlenecked at 47-60 MB/s despite 10GbE capacity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…info - directory_cache: set_readonly_and_calculate_size was setting all files to 0o444, stripping the execute bit from executables. Now preserves execute permission (0o555 for executables, 0o444 for non-executables). This caused EPERM failures for cargo build scripts (runner, ring, etc). - api_worker_scheduler: promote "Load-aware worker selection" log from debug! to info! so CPU-load scheduling decisions are visible in production logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tests verifying load-aware worker selection: - cpu_load_update_worker_load_stores_correctly - cpu_load_lightest_loaded_worker_gets_picked - cpu_load_unknown_zero_sorted_last - cpu_load_falls_back_to_lru_when_no_load_data Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cached input_root_digest values to the scheduler via BlobsAvailableNotification. Scheduler gives highest routing priority to workers with a directory cache hit for the action's input_root_digest. Fix EPERM: set_readonly_and_calculate_size now strips write bits only (& 0o555) instead of guessing executable status. Also removes the skip-when-0o555 chmod optimization in hardlink_and_set_metadata which was unsafe because concurrent hardlinks sharing CAS inodes can corrupt file permissions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On cache MISS, resolve the full merkle tree and check if any subtrees are already cached from other root entries. Cached subtrees are reused via directory symlinks (APFS-compatible), skipping download of already- materialized portions. BFS traversal ensures maximum (top-down) subtree matching. - Store .merkle_tree_meta alongside each cached directory entry - In-memory subtree_index maps every directory digest to its disk path - Rebuild subtree index from disk metadata on startup - Clean up subtree index entries on eviction - Made resolve_directory_tree public for cache access - 6 new tests for merkle metadata, subtree index, and cache reload Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS requires write permission on the source directory for rename(2), unlike POSIX/Linux which only checks the parent. Temporarily restore 0o755 on the temp dir before rename, then lock down to 0o555 after. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er subtree scoring Workers report cached subtree digests via delta encoding (added/removed) instead of full snapshots every 500ms. First notification sends full snapshot, subsequent ones send only changes. Scheduler scores workers by both root directory cache hits and subtree cache hits. Fix CAS inode corruption: always provide explicit unix_mode (0o444 for non-executable, 0o555 for executable) to prevent concurrent hardlinks from corrupting shared inode permissions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… not just directory count When no worker has an exact root match, the scheduler now scores workers by the total file bytes under their cached subtree digests. A worker caching a subtree with 10GB of files scores higher than one with 100 bytes. The tree resolver computes per-subtree byte totals via bottom-up aggregation during BFS resolution, cached alongside file digests. Scoring tiers: exact root match > weighted subtree coverage > blob locality > LRU. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e cache Bazel actions create output directories inside the input tree. Directory symlinks to cached subtrees caused EPERM because either: - Cached dirs were 0o555 (can't mkdir inside), or - If made writable, actions would mutate the cache Fix: use hardlink_directory_tree for subtree cache hits — creates fresh writable directories and hardlinks only the files. Cache integrity is preserved (0o555 dirs, read-only files) since actions never access the cache directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ach path component When create_dir_all fails for an output directory, walk up the path and log the mode, is_dir/is_file/is_symlink status of each component to identify whether the failure is due to a read-only parent, a file blocking a directory, or a symlink to a read-only cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlink The Bazel input tree contains SymlinkNodes (e.g., bazel-out) that, when recreated as symlinks in the work directory, point to the read-only directory cache (0o555 directories). create_dir_all then fails with EPERM when trying to create output directories through these symlinks. Fix: hardlink_directory_tree_recursive now resolves directory symlinks to real directories with fresh writable permissions, while preserving file symlinks and dangling/looping symlinks as-is. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…opies The Bazel input tree contains SymlinkNodes (e.g., bazel-out -> .) that point to directories in the read-only directory cache (0o555). When create_dir_all tries to create output directories through these symlinks, it fails with EPERM because the resolved target directories are read-only. Previous approach (resolving ALL directory symlinks in hardlink_directory_tree) caused infinite recursion for self-referential symlinks like bazel-out -> . New approach: prepare_output_directories now handles this surgically: 1. Fast path: try create_dir_all (usually works) 2. On failure, walk the output path component by component 3. For each symlink that resolves to a read-only directory: - Replace the symlink with a real writable directory - Create absolute symlinks to all entries in the original target - Skip self-referential entries (e.g., bazel-out pointing to itself) 4. For read-only work dirs (0o555): chmod writable 5. Retry create_dir_all after each fix This preserves access to all input tree files while making the specific output path writable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_dir_all create_dir_all succeeds when the output directory already exists (it's part of the input tree), even if the directory is read-only (0o555) through a symlink chain to the cached directory. Then rustc fails with Permission denied when writing output files (.d, binary) into it. Fix: after create_dir_all succeeds, check if the parent directory is actually writable (mode & 0o200). If not, fall through to the slow path that replaces symlinks-to-read-only-dirs with writable shallow copies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- set_readonly_recursive: preserve execute bits (& !0o222 instead of hardcoded 0o444) so cached shell scripts remain executable - prepare_output_directories: serialize slow-path symlink replacement with a tokio Mutex to prevent concurrent EEXIST/ENOENT races - Cargo.toml: lto="thin", codegen-units=16 for faster release builds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old cache entries had files at 0o444 (no execute bit) from before the set_readonly_and_calculate_size fix. Since the cache persists on disk across restarts, these stale entries kept serving non-executable files like cargo_build_script_runner and cc_wrapper.sh. Add CACHE_FORMAT_VERSION (currently 2) with a version file check on startup. When the version is missing or stale, all entries are cleared and the version file is written. This ensures format changes (like permission semantics) automatically invalidate old entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
18 integration tests covering: insert/get/remove, max_bytes/max_count eviction, TTL expiration, pin/unpin/pin_cap, replaced items, startup insert, sizes_for_keys, range queries, concurrent stress, callbacks. VerifyStore: documented that batch_get_part_unchunked bypasses hash verification. Audited all store wrappers — completeness_checking and dedup correctly fall through to default (no bypass). Expiry investigation: Moka's Expiry trait does NOT help with size-based eviction ordering (TTL and size eviction are independent). Corrected doc: window deque is unused in Moka 0.12, entries go to MainProbation. Current mitigation (insert_startup skips freq bump, FIFO ordering in MainProbation) correctly preserves atime-based eviction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MemoryStore: direct evicting_map.get_many() + BytesWrapper::to_contiguous(). Zero buf_channel allocations, zero async task pairs. Single-chunk blobs are zero-copy (Arc bump). ~50us for 500 keys vs ~2-5ms before. FilesystemStore: FuturesUnordered for parallel I/O but skips buf_channel. Each task does evicting_map.get() → read_file_entry_bytes() → Bytes. Preserves FD semaphore, stale-entry cleanup, length cap. 256MiB safety bound for unlimited reads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dual_transport code was passing ALL endpoints (including use_http3=true) to the TCP ConnectionManager. This caused TCP connection attempts to the QUIC-only UDP port (50072), generating persistent ConnectionRefused errors and log floods on all workers. Fix: filter out use_http3 endpoints when building tcp_endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree: MokaEvictingMap cache (512MiB, 10K entries, 5min TTL) for assembled tree results. 54 concurrent identical GetTree calls → 1 BFS + 53 cache hits. Eliminates ~15,900 redundant directory lookups. ExistenceCacheStore: batch_get_part_unchunked now uses insert_many() instead of 552 sequential insert() calls. Reduces ~2200 Moka ops + 552 run_pending_tasks() to ~552 inserts + 1 maintenance pass. get_many(): documented why sequential is correct (Moka lock-free reads at 100ns each, parallelism overhead exceeds benefit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree coalescing: DashMap + watch channel prevents thundering herd. 54 concurrent same-root calls → 1 BFS + 53 waiters. TOCTOU race fixed with Entry API (single lock scope for check+register). Subtree cache (level 2): per-directory MokaEvictingMap (256MiB, 50K, 5min TTL). BFS checks subtree cache before store fetch. Different roots with 90% overlap → only ~10% dirs fetched from store. Arc<Vec<Directory>>: CachedTree wraps directories in Arc for cheap cache clones within moka. Response construction still deep-clones (required by protobuf ownership). insert_many: uses insert_batch (defers run_pending_tasks) instead of insert_inner (called it per-item). Now 1 maintenance pass per batch instead of N+1. CLAUDE.md: code review requirement before committing. Level 3 Tree proto lookup deferred (no root→tree digest mapping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 integration tests: tree_cache_hit, tree_cache_miss_different_root, subtree_cache_overlap, coalescing_concurrent, coalescing_leader_failure, paginated_bypasses_cache, subtree_cache_deduplication, next_page_token. Arc optimization: BFS result moved into Arc (zero-copy), cache gets Arc clone (refcount bump), response gets one deep clone. Eliminates transient double materialization (~5000 heap allocations saved). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4727b44 to
bdb015a
Compare
FindMissingBlobs (120K/min), ByteStream read/write completed, mirror blob streamed, AC read/write, BlobsAvailable registration, streaming populate, connection creation — all downgraded from info! to debug!. With release_max_level_info these are compiled out in release builds, reducing log volume ~60x under load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator
|
/build-image |
Collaborator
|
Warning: until we have the source code for the |
Contributor
|
@MarcusSorealheis isn't that just part of Linux? (src, also, does that mean it'd only work on Linux though?) |
Collaborator
|
@erneestoc that's exactly right. We will have to think very carefully about that aspect of this PR. It is only one optimization. There are many others. |
This was referenced May 16, 2026
MarcusSorealheis
pushed a commit
that referenced
this pull request
May 17, 2026
* worker: set QoS USER_INITIATED on macOS for P-core preference
Apple Silicon's XNU scheduler will park UTILITY/BACKGROUND threads on
efficiency cores. Single-thread-bursty workloads (swift-frontend,
clang) typical in iOS RBE builds can run 2x-3x slower on an E-core,
so tag the worker process with QOS_CLASS_USER_INITIATED to bias
scheduling toward P-cores.
The setter runs in three places:
- Main thread before tokio runtime creation so worker threads
inherit the class via pthread QoS inheritance.
- tokio Builder::on_thread_start hook as belt-and-suspenders for
any thread (e.g. blocking pool) that misses inheritance.
- Top of LocalWorker::run for the same reason.
Implementation uses libc's pthread_set_qos_class_self_np binding;
the new `nativelink_worker::qos` module is compile-gated so non-macOS
targets emit no call and pull in no symbol. A round-trip test on
macOS verifies the kernel accepted the class change.
Ported from upstream commit 0fce813 (TraceMachina/nativelink #2243).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* worker: test tokio worker thread inherits USER_INITIATED QoS
The QoS scheme in PR #2243 hinges on tokio worker threads actually
seeing QOS_CLASS_USER_INITIATED at task-runtime; without an end-to-end
test the on_thread_start hook could silently regress (e.g. if the hook
ran on the wrong thread or the kernel rejected the class) and the
worker would quietly fall back to E-core scheduling.
Adds a macOS-only test that builds a fresh multi-threaded tokio runtime
with the same on_thread_start hook used in main, spawns a task to force
execution on a worker thread, and reads back the class with
pthread_get_qos_class_np. Also refactors the existing single-thread test
to share a helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* qos: justify disallowed_methods escape on tokio_worker_threads_inherit test
The on_thread_start inheritance test must construct a custom-built
runtime via `Builder::new_multi_thread()` and drive it with `block_on`
— that is the unit under test. No `nativelink-util::task` wrapper
exposes a custom-built runtime with a thread-start hook, so the
disallowed_methods lint cannot be addressed at the root cause.
Use `#[expect(clippy::disallowed_methods, reason = ...)]` per the
modern Rust 2024 idiom (fails if the lint stops firing, with a
reviewer-visible justification) rather than a silent `#[allow]`.
Mirrors the same justified escape already used in
src/bin/nativelink.rs::main.
* qos: split set_user_initiated into cfg-gated fn / const fn
The previous single-definition `pub fn set_user_initiated() -> bool` had
a `#[cfg(target_os = "macos")]` block that called libc and a
`#[cfg(not(...))]` block that returned `true`. On Linux CI clippy sees
only the trivial `true` arm and fires `missing_const_for_fn`, failing
ubuntu, asan, Bazel Dev/ubuntu, and every dependent rbe-* job. This did
not reproduce on macOS because the macOS arm calls libc, which is not
const-eligible, so clippy stays silent.
Split into two cfg-gated definitions: the macOS impl stays a regular
`pub fn` because `libc::pthread_set_qos_class_self_np` is not const; the
non-macOS impl becomes `pub const fn` returning `true`. Call sites are
unchanged, both arms still return `bool`, and the existing
`qos::macos_tests::*` continue to apply since they were already gated on
`#[cfg(target_os = "macos")]`. Doc comments are now split per arm and
specialised to each platform's actual behaviour.
Splitting (rather than `#[allow(missing_const_for_fn)]` on a single
function) is the right fix because the lint is accurate for the
non-macOS arm in isolation; suppressing it would hide a legitimate
const-fn opportunity and mask future bugs on whichever platform clippy
runs against.
* ci: retrigger after GitHub 502 fetching rules_kotlin tarball
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MarcusSorealheis
added a commit
that referenced
this pull request
May 17, 2026
…low writes (#2343) * fast_slow_store: make has_with_results aware of fast store and in-flight slow writes `FastSlowStore::has_with_results` previously only consulted the slow store, which caused two known bugs: 1. Fast-only writes (and the brief window between fast-store insert and slow-store write start) were invisible — callers got NotFound for blobs that were locally present, triggering redundant fetches. 2. Concurrent writers racing on the same digest could not see each other's in-flight slow-store writes, so the second writer re-uploaded what the first had nearly finished pushing. The fix layers three consultations in order: slow store first (authoritative for downstream consumers), then a `in_flight_slow_writes: Mutex<HashMap>` tracking active slow writes, then fast store as a final fallback for fast-only / pre-slow-write hits. In-flight tracking uses a cancel-safe RAII guard (`InFlightSlowWriteGuard`) registered at the start of each `update`/`update_with_whole_file` path that writes to the slow store, so a cancelled write future correctly removes itself from the map on Drop. Tests cover all four reachable cases: slow-hit short-circuit, fast-only hit, in-flight slow write visibility, and noop-slow-store fall-through. Provenance: equivalent to upstream commits f69aaf8 and 2d770d9 from TraceMachina/nativelink PR #2243, ported atomically to current main. * fast_slow_store_test: add cancel-safety and mixed-key has_with_results tests The previous commit added tests for the four reachable cases of the fast/slow/in-flight has() lookup, but two correctness properties were asserted only indirectly: 1. Cancel safety of `InFlightSlowWriteGuard`. The fix's central claim is that aborting an in-progress update() removes the key from the in_flight_slow_writes map via Drop. The previous in-flight test only verified the happy-path (writer completes normally) and the fast-store fallback masked any leak. 2. Per-key independence in batched has_with_results. With multiple keys spanning different storage tiers, an off-by-one in the missing_indices fallback or an over-broad in-flight match would silently corrupt results without any existing test noticing. New tests: * dropping_update_future_cleans_up_in_flight_entry: uses a gated-slow-store with NoopStore as fast (so the fast-store fallback cannot mask a leak), spawns a writer, waits on a oneshot started signal, aborts the writer mid-update, then asserts has() returns None — proving the guard's Drop ran. * has_with_results_handles_mixed_key_sources: builds a request with four keys — slow-only, in-flight, fast-only, and missing — and asserts each result independently matches its source. Catches index-mapping regressions in the batched fallback path. Both tests are deterministic (oneshot channels, no sleeps). Verified that reverting the fix in fast_slow_store.rs causes both new tests plus the existing has_sees_in_flight_slow_writes and fast_store_only_value_is_reported_by_has to fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fmt: apply rustfmt to nativelink-store Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fast_slow_store_test: address clippy items_after_statements and cast_possible_truncation Hoist the `MapBackedSlow` test helper (struct, StoreDriver impl, and `default_health_status_indicator!` macro invocation) out of `has_with_results_handles_mixed_key_sources` to module scope so that items are declared before statements, satisfying `clippy::items_after_statements`. Replace `as usize` casts of the per-key `u64` sizes with `usize::try_from(...).unwrap()` to satisfy `clippy::cast_possible_truncation`, matching the existing conversion pattern elsewhere in this test file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger after GitHub 502 fetching hermetic_cc_toolchain tarball --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
8 tasks
palfrey
added a commit
that referenced
this pull request
May 19, 2026
… Bazel input materialization (#2338) * fs_util: add clonefile(2) fast path to hardlink_directory_tree On macOS, try APFS clonefile(2) before falling back to the existing per-file hardlink walk. clonefile is O(1) in tree size and uses copy-on-write, so subtree-cache hits no longer scale with input count. After a successful clone the destination is made writable (0o755/0o644) because the clone inherits the source's permissions and cached subtrees are 0o555/0o444. The COW semantics of clonefile mean writes to the destination do not affect the source, so this is safe. On EXDEV (cross-volume), ENOTSUP, or any other errno, we log at debug and fall through to hardlink_directory_tree_recursive. Linux and Windows paths are unchanged. Extracted from TraceMachina/nativelink PR #2243 (commit 13fcc0c). * fs_util,directory_cache: track CloneMethod per materialization hardlink_directory_tree now returns CloneMethod (Clonefile | Hardlink) so callers can see which kernel path was taken. DirectoryCache records the result in two atomic counters and exposes them via CacheStats. Without this telemetry, a silent fall-through from clonefile to per-file hardlinks (e.g. cache and workspace on different volumes, or APFS clonefile failing for any other reason) would be invisible. On Linux/Windows hardlink_directory_tree always returns Hardlink — no behavioral change. The new CacheStats fields default to zero on those platforms. Extracted from TraceMachina/nativelink PR #2243 (commit 13fcc0c). * running_actions_manager: bound download_to_directory concurrency to 64 Previously download_to_directory pushed every file-hardlink, subdir recursion, and symlink future onto an unbounded FuturesUnordered, then drained it. On macOS this produced thousands of parallel hardlink(2) calls fighting APFS's per-volume metadata lock — the observed exec-log shape was ~4 ms per input file at scale, consistent with serialized metadata mutations plus tokio scheduling overhead. This commit gates each directory level to at most 64 in-flight futures via stream::buffer_unordered(64). 64 is well above the inflection point on any modern Linux filesystem, so Linux is unaffected beyond replacing tokio scheduling overhead with simpler stream polling. Scope notes (vs PR #2243 ee85fdc): - The chunked has_with_results sub-change does not apply directly: the current code calls populate_fast_store per-digest, not a batched has_with_results. - Level-parallel BFS mkdir is not applied here; the recursion structure is unchanged. The 64-cap is per recursive call, not global. Deep trees can therefore still have 64 * depth in-flight futures. A full flatten pass is a follow-up. Extracted from TraceMachina/nativelink PR #2243 (commit ee85fdc), narrowed to fit the current code shape. * directory_cache: short-circuit zero-byte files FilesystemStore (and several other CAS backends) refuse to store zero-byte blobs, so a get_part_unchunked for the zero-byte digest (af1349b9... / e3b0c449...) returns NotFound. Bazel input trees routinely contain empty marker/config files (.linksearchpaths, empty .env, .toml, etc.), so without this fix a single such file in any directory causes the entire DirectoryCache construction to fail — roughly 30% of cache attempts per PR #2243. Short-circuit create_file: if the digest is the zero-byte digest, write b"" to disk directly and never consult the CAS. Cross-platform correctness fix. Extracted from TraceMachina/nativelink PR #2243 (commit d198902). --------- Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
erneestoc
added a commit
to erneestoc/nativelink
that referenced
this pull request
May 19, 2026
Port of TraceMachina/nativelink PR TraceMachina#2243 commit a47774d. Root cause: when DirectoryCache evicts an entry, the cleanup path calls `set_readwrite_recursive` on the cached tree before `remove_dir_all`. That helper chmods every entry — including files — to 0o755/0o644. Files in a cached entry are hardlinked into in-flight action workspaces (via `hardlink_directory_tree` in `get_or_create`) and ultimately share an inode with the underlying `FilesystemStore` CAS blob (via `fs::hard_link` in `download_to_directory`). Chmoding the cached-side file therefore silently mutates the shared inode's mode for every other in-flight action holding a hardlink to the same blob. Production symptom: EACCES on exec for `cc_wrapper.sh`. The CAS mode of 0o555 (r-xr-xr-x) gets clobbered to 0o644 (rw-r--r--), dropping the +x bit while an unrelated action is mid-exec. Fix: introduce `set_dir_writable_recursive` which only chmods directories, never files. On unix, write permission on the parent directory is sufficient to unlink files inside; the files' own modes are irrelevant for unlinking. Switch the eviction cleanup path in `DirectoryCache::evict_lru` to the new helper. Empirically verified: a regression test that hardlinks a 0o555 file into a cached tree and runs the cleanup helper FAILS on pre-fix code (file mode mutated to 0o644) and PASSES on post-fix code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
MarcusSorealheis
pushed a commit
that referenced
this pull request
May 19, 2026
#2347) Port of TraceMachina/nativelink PR #2243 commit a47774d. Root cause: when DirectoryCache evicts an entry, the cleanup path calls `set_readwrite_recursive` on the cached tree before `remove_dir_all`. That helper chmods every entry — including files — to 0o755/0o644. Files in a cached entry are hardlinked into in-flight action workspaces (via `hardlink_directory_tree` in `get_or_create`) and ultimately share an inode with the underlying `FilesystemStore` CAS blob (via `fs::hard_link` in `download_to_directory`). Chmoding the cached-side file therefore silently mutates the shared inode's mode for every other in-flight action holding a hardlink to the same blob. Production symptom: EACCES on exec for `cc_wrapper.sh`. The CAS mode of 0o555 (r-xr-xr-x) gets clobbered to 0o644 (rw-r--r--), dropping the +x bit while an unrelated action is mid-exec. Fix: introduce `set_dir_writable_recursive` which only chmods directories, never files. On unix, write permission on the parent directory is sufficient to unlink files inside; the files' own modes are irrelevant for unlinking. Switch the eviction cleanup path in `DirectoryCache::evict_lru` to the new helper. Empirically verified: a regression test that hardlinks a 0o555 file into a cached tree and runs the cleanup helper FAILS on pre-fix code (file mode mutated to 0o644) and PASSES on post-fix code. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erneestoc
added a commit
to erneestoc/nativelink
that referenced
this pull request
May 19, 2026
Previously download_to_directory pushed every file's fetch+hardlink as
one fused future into a single FuturesUnordered. Each future ran the
fetch (populate_fast_store) and then the hardlink sequentially within
the same task. With per-future store-side serialization the fetch leg
becomes the bottleneck for SwiftCompile-style workloads (1978+ input
files at p95), where blob-fetch wall time dominates total action time.
This commit restructures download_to_directory into a two-phase
pipeline driven by try_join3:
1. Fetcher launches ALL missing-blob populate_fast_store calls up
front, bounded at DOWNLOAD_TO_DIRECTORY_FETCH_CONCURRENCY = 128
via stream::try_for_each_concurrent. Digests are deduplicated so a
blob referenced by N files is only fetched once. As each blob
arrives, every file index that needs it is sent through an mpsc
channel.
2. Hardlinker drains the channel and performs hardlink + chmod +
set_mtime, bounded at DOWNLOAD_TO_DIRECTORY_HARDLINK_CONCURRENCY
= 64 (the same cap that PR TraceMachina#2338 added to compose with APFS
metadata-lock behavior on macOS). Hardlinking starts as soon as
the first blob arrives - it does not wait for the fetcher to
finish.
3. Subdirectory recursion and symlink creation run in a separate
FuturesUnordered alongside the file pipeline.
Net win on a Mac-Mini RBE worker: fetch wall time becomes
max(per-blob latency) bounded by the 128-cap, not sum(per-blob latency)
serialized inside one future.
Concurrency caps compose: fetch (128) is network/store-bound where extra
in-flight requests are cheap; hardlink (64) is metadata-bound where
extra in-flight syscalls fight APFS's per-volume lock on macOS. The
two phases run independently with their own bound, so on cache-hot
runs only the hardlink leg is active and the fetch cap is irrelevant.
Tests:
- download_to_directory_fetches_blobs_concurrently: a per-blob 75 ms
sleep on the slow store. Serial fetch of 20 blobs would be 1500 ms;
concurrent fetch finishes well under 500 ms, and max_in_flight is
observed to reach >=4 (would be 1 if serial).
- download_to_directory_streams_hardlinks_while_fetching: 30 blobs
with 50 ms per-blob latency. Serial fetch+hardlink would be 1500 ms;
pipelined version finishes under 750 ms. Catches regression to a
fetch-all-then-hardlink-all two-phase shape that drops overlap.
Both tests use a minimal DelayedStore harness (StoreDriver wrapping
MemoryStore) added inline. This pulls in async-trait as a dev-dependency
(version 0.1.88, matching the rest of the workspace).
Scope notes (vs PR TraceMachina#2243 a8db953):
- The upstream commit depends on infrastructure introduced earlier in
the PR that is NOT being ported here: batch_read_small_blobs,
populate_fast_store_unchecked, get_file_entries_batch,
BATCH_READ_MAX_BLOB_SIZE, BatchReadBlobs / ByteStream partitioning.
These hooks are entangled with the peer-to-peer infrastructure and
proto changes we explicitly excluded. The upstream commit was
reduced to its core idea: launch all fetches upfront, hardlink as
blobs arrive.
- Upstream uses a Notify + shared HashSet<DigestInfo> "fetched_set"
to coordinate fetcher and producer. Here we use a simpler mpsc
channel: the fetcher sends file indices directly as their blob
becomes available. There is no separate producer step because there
is no batch-vs-bytestream partitioning to defer to a second pass.
- Follow-on upstream commits 7b093e8 and b64944a are excluded
per task scope - they depend on peer hints and proto changes.
Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit a8db953),
adapted to the current code shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
palfrey
pushed a commit
that referenced
this pull request
May 20, 2026
…ro-digest files) (#2346) * Resolve absolute symlinks in output upload instead of uploading raw targets When the worker's work directory is populated via DirectoryCache, output paths can be absolute symlinks pointing into the cache directory (/var/.../directory_cache/...). The previous output collection code uploaded these as raw SymlinkInfo with absolute targets that are meaningless on the Bazel client, causing "No such file or directory" errors when the client materialised the action result. Detect absolute symlinks in output paths and resolve them: upload directory contents as Tree protos and file contents as regular files. Relative symlinks (intentionally created by the action) are still preserved as symlinks. Updates upload_dir_and_symlink_test to use a relative symlink (the previous /dev/null absolute symlink is now resolved, breaking the old assertion) and adds a new upload_absolute_symlink_resolves_contents regression test that verifies an out-of-tree payload reachable via an absolute symlink lands in CAS as a file. Ported from upstream 0807153 (PR #2243). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix zero-digest files missing from worker execution directories FilesystemStore::get_file_entry_for_digest previously returned a synthetic FileEntry for zero-digest blobs pointing to a content_path that never exists on disk (FilesystemStore deliberately never persists zero-byte files). Downstream worker output-materialisation code that took the prefetched hardlink path would try to hard_link from this non-existent source, fail, and either silently produce a missing file or fall back without enough context to diagnose the failure. This is the OUTPUT-materialisation companion to PR #2338's DirectoryCache zero-byte fix, which lives on the input-fetch path. The two changes are complementary: #2338 covers the directory cache short-circuit; this covers the FilesystemStore API and the running_actions_manager fallback. Changes: - FilesystemStore::get_file_entry_for_digest now returns Code::NotFound for zero digests instead of a synthetic FileEntry, forcing callers to materialise empty files via fs::create_file rather than hard_linking from a phantom source. - running_actions_manager::download_to_directory adds err_tip context on the zero-digest fs::create_file / write_all so a failure here is attributable to the empty-file path. - Updates the existing get_file_entry_for_zero_digest test to assert the new NotFound behaviour and renames it to get_file_entry_for_zero_digest_returns_not_found. - Adds download_to_directory_zero_digest_empty_file_test that exercises an empty file declared inside an input directory and verifies it materialises on disk with zero bytes. This test intentionally does NOT collide with PR #2338's DirectoryCache zero-byte test (which validates a different short-circuit path). Ported from upstream 19d7e20 (PR #2243). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Close test coverage gaps for ported output-upload fixes Audit of the two ported commits (e57b675b, b1dc3a1c) flagged two correctness-relevant gaps. Both are now closed with strict assertions so the underlying bugs cannot silently regress. 1. upload_absolute_symlink_to_directory_uploads_tree The original e57b675b regression test covers absolute-symlink-to-file but not absolute-symlink-to-directory, which exercises an entirely separate code path (upload_directory + Tree proto serialisation). The new test creates an out-of-tree directory containing inner.txt, absolute-symlinks it into the work dir, runs the action, and: - asserts output_directory_symlinks and output_file_symlinks are BOTH empty (the symlink must NOT be preserved), - asserts output_folders contains exactly one entry at the symlink path, proving Tree-upload happened, - walks the uploaded Tree, locates inner.txt, fetches its blob from CAS and verifies the content. This proves the directory was actually traversed and uploaded — not stubbed. 2. download_to_directory_zero_digest_empty_file_test (strengthened) Extended the existing single-file test to cover: - a second sibling zero-digest file (proves the path is not a single-file fluke), - a zero-digest file nested inside a subdirectory (proves the recursive download_to_directory caller also handles NotFound from get_file_entry_for_digest — not just the top-level invocation), - strict per-file assertions: is_file, !is_symlink, len == 0, and a read-back that confirms the file is actually empty rather than a phantom dirent. Verdict on the dropped checks: - Dangling absolute symlink: source returns OutputType::None which is the correct graceful behaviour, but writing a test for it requires fabricating a broken symlink mid-action and is comparatively low value; left uncovered intentionally. - File mode on zero-digest materialisation: the worker uses fs::create_file with no explicit mode, so the result follows umask and would be brittle across CI environments. Length + type assertions are the stronger invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * running_actions_manager_test: use /dev/null in upload_dir_and_symlink_test Previous iterations of this PR sidestepped `/dev/null` by swapping the absolute-symlink fixture for either a relative symlink (lost coverage) or an out-of-tree real empty file (added scaffolding to dodge a corner case we should just test directly). Restore the original `ln -s /dev/null empty_sym` fixture. The post-fix worker code resolves the absolute symlink via `fs::metadata` (follows), sees a character device (`is_dir() == false`), and falls into the "upload as file" branch. `upload_file` opens `/dev/null` and reads to EOF — `/dev/null`'s contract is that reads return 0 bytes immediately — producing the canonical sha256 empty digest (e3b0c44…) with size 0. That's well-defined, harmless behavior and a real-world Bazel pattern (rules sometimes use `ln -sf /dev/null x` to create empty outputs). Test now locks in this contract directly, no scaffolding required. Drops the `external_root` / `external_file` setup since it was only needed to avoid `/dev/null`. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
erneestoc
added a commit
to erneestoc/nativelink
that referenced
this pull request
May 21, 2026
Previously download_to_directory pushed every file's fetch+hardlink as
one fused future into a single FuturesUnordered. Each future ran the
fetch (populate_fast_store) and then the hardlink sequentially within
the same task. With per-future store-side serialization the fetch leg
becomes the bottleneck for SwiftCompile-style workloads (1978+ input
files at p95), where blob-fetch wall time dominates total action time.
This commit restructures download_to_directory into a two-phase
pipeline driven by try_join3:
1. Fetcher launches ALL missing-blob populate_fast_store calls up
front, bounded at DOWNLOAD_TO_DIRECTORY_FETCH_CONCURRENCY = 128
via stream::try_for_each_concurrent. Digests are deduplicated so a
blob referenced by N files is only fetched once. As each blob
arrives, every file index that needs it is sent through an mpsc
channel.
2. Hardlinker drains the channel and performs hardlink + chmod +
set_mtime, bounded at DOWNLOAD_TO_DIRECTORY_HARDLINK_CONCURRENCY
= 64 (the same cap that PR TraceMachina#2338 added to compose with APFS
metadata-lock behavior on macOS). Hardlinking starts as soon as
the first blob arrives - it does not wait for the fetcher to
finish.
3. Subdirectory recursion and symlink creation run in a separate
FuturesUnordered alongside the file pipeline.
Net win on a Mac-Mini RBE worker: fetch wall time becomes
max(per-blob latency) bounded by the 128-cap, not sum(per-blob latency)
serialized inside one future.
Concurrency caps compose: fetch (128) is network/store-bound where extra
in-flight requests are cheap; hardlink (64) is metadata-bound where
extra in-flight syscalls fight APFS's per-volume lock on macOS. The
two phases run independently with their own bound, so on cache-hot
runs only the hardlink leg is active and the fetch cap is irrelevant.
Tests:
- download_to_directory_fetches_blobs_concurrently: a per-blob 75 ms
sleep on the slow store. Serial fetch of 20 blobs would be 1500 ms;
concurrent fetch finishes well under 500 ms, and max_in_flight is
observed to reach >=4 (would be 1 if serial).
- download_to_directory_streams_hardlinks_while_fetching: 30 blobs
with 50 ms per-blob latency. Serial fetch+hardlink would be 1500 ms;
pipelined version finishes under 750 ms. Catches regression to a
fetch-all-then-hardlink-all two-phase shape that drops overlap.
Both tests use a minimal DelayedStore harness (StoreDriver wrapping
MemoryStore) added inline. This pulls in async-trait as a dev-dependency
(version 0.1.88, matching the rest of the workspace).
Scope notes (vs PR TraceMachina#2243 a8db953):
- The upstream commit depends on infrastructure introduced earlier in
the PR that is NOT being ported here: batch_read_small_blobs,
populate_fast_store_unchecked, get_file_entries_batch,
BATCH_READ_MAX_BLOB_SIZE, BatchReadBlobs / ByteStream partitioning.
These hooks are entangled with the peer-to-peer infrastructure and
proto changes we explicitly excluded. The upstream commit was
reduced to its core idea: launch all fetches upfront, hardlink as
blobs arrive.
- Upstream uses a Notify + shared HashSet<DigestInfo> "fetched_set"
to coordinate fetcher and producer. Here we use a simpler mpsc
channel: the fetcher sends file indices directly as their blob
becomes available. There is no separate producer step because there
is no batch-vs-bytestream partitioning to defer to a second pass.
- Follow-on upstream commits 7b093e8 and b64944a are excluded
per task scope - they depend on peer hints and proto changes.
Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit a8db953),
adapted to the current code shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Comprehensive production hardening from running NativeLink at scale with 10 Mac Mini workers and a ZFS-backed server. 156 commits across the following areas:
Store reliability & I/O performance
MokaEvictingMap — lock-free TinyLFU cache (NEW)
Redis pipelining (NEW)
Batch store reads (NEW)
GetTree caching & coalescing (NEW)
Graceful shutdown (NEW)
Streaming blob fixes (NEW)
gRPC protocol & transport
Worker improvements
Observability
QUIC/HTTP3 transport
Test plan
cargo check --bin nativelink --features quic,pprofcompiles clean🤖 Generated with Claude Code
This change is