Skip to content

Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243

Open
rejuvenile wants to merge 310 commits into
TraceMachina:mainfrom
rejuvenile:pr/all-changes
Open

Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243
rejuvenile wants to merge 310 commits into
TraceMachina:mainfrom
rejuvenile:pr/all-changes

Conversation

@rejuvenile
Copy link
Copy Markdown

@rejuvenile rejuvenile commented Mar 24, 2026

Summary

Comprehensive production hardening from running NativeLink at scale with 10 Mac Mini workers and a ZFS-backed server. 156 commits across the following areas:

Store reliability & I/O performance

  • Fix store pipeline races causing Bazel "Lost inputs" errors
  • Fix ExistenceCacheStore false positives and LRU promotion race
  • Fix BatchUpdateBlobs missing results for duplicate digests
  • Retry hardlinks on cache eviction, hold file path read-lock during hardlink
  • Fire-and-forget eviction unrefs + CAS dedup in FilesystemStore
  • POSIX_FADV_SEQUENTIAL for kernel readahead, rayon-based blake3 hashing
  • Fix LRU eviction ordering at startup (sort by atime, not directory order)
  • Fix zero-digest file handling in FilesystemStore and worker pipeline

MokaEvictingMap — lock-free TinyLFU cache (NEW)

  • Replace Mutex+LRU EvictingMap with moka::sync::Cache (lock-free reads)
  • TinyLFU admission with frequency bump on insert (prevents single-access rejection)
  • Pinning via DashMap with 120s auto-timeout, 25% cap
  • Bounded eviction channel (4096) with inline fallback
  • Startup atime ordering preserved via insert_startup (no frequency bump)
  • KB-scaled weigher for items up to 4TB
  • All 4 stores migrated: ExistenceCacheStore, MemoryStore, FilesystemStore, MemoryAwaitedActionDb
  • Old EvictingMap/ShardedEvictingMap removed (-2,177 lines)
  • 18 integration tests covering full API
  • Production result: lock contention 391 (406ms worst) → 0

Redis pipelining (NEW)

  • has_with_results: batch STRLEN+EXISTS into single pipeline (N keys → 1 round-trip)
  • batch_get_part_unchunked: pipelined GETRANGE+EXISTS across store chain
  • BatchReadBlobs: delegates to batch pipeline instead of N individual reads
  • RENAME+PUBLISH pipelined (saves 1 RTT per write with pub_sub)
  • Pipeline chunking (5000 max) prevents unbounded response buffering
  • CrossSlot fallback for Redis cluster mode
  • Production result: Redis slow commands 10,400 → 0

Batch store reads (NEW)

  • batch_get_part_unchunked trait method across full store chain
  • MemoryStore override: direct get_many() + BytesWrapper::to_contiguous(), zero buf_channel
  • FilesystemStore override: parallel FuturesUnordered without buf_channel overhead
  • FastSlowStore: fast-first with slow fallback for NotFound
  • SizePartitioningStore: concurrent lower/upper partition batches
  • Production result: GetTree BFS 100-600ms → sub-millisecond (warm cache)

GetTree caching & coalescing (NEW)

  • Tree result cache: MokaEvictingMap (512MiB, 10K entries, 5min TTL)
  • Subtree directory cache: per-directory MokaEvictingMap (256MiB, 50K, 5min TTL)
  • Thundering herd coalescing: watch channel prevents redundant BFS (54 calls → 1 BFS + 53 waiters)
  • Arc<Vec> for zero-copy cache sharing
  • insert_many batch fix: 1 run_pending_tasks instead of N+1
  • 8 integration tests for caching, coalescing, subtree overlap
  • Production result: slow BFS levels 642 → 0, tree cache hit 6µs vs 26ms BFS

Graceful shutdown (NEW)

  • Server: HTTP/2 GOAWAY drain via hyper_util GracefulShutdown, 30s per-listener timeout
  • Server: QUIC serve_with_shutdown on accept_stop signal
  • Worker: CAS TCP server serve_with_shutdown with cas_shutdown_tx watch channel
  • SIGTERM sequence: stop accept → drain (35s) → flush writes (30s) → shutdown schedulers (20s) → exit
  • Systemd: TimeoutStopSec=90, KillMode=mixed
  • Production result: clean shutdown, no more stale AC entries after restart

Streaming blob fixes (NEW)

  • Fix errored in-flight entries poisoning all reads for that digest
  • has_error() check falls through to CAS store on poisoned entries
  • Immediate cleanup of poisoned InFlightBlobMap entries

gRPC protocol & transport

  • Fix 4MB default message size limit (configurable per-listener)
  • Fix ByteStream resume returning wrong error code
  • Add gRPC status detail propagation for FAILED_PRECONDITION
  • Parallel chunked ByteStream reads for large blobs
  • HTTP/2 send buffer and flow control tuning
  • Fix dual_transport: skip QUIC endpoints in TCP ConnectionManager

Worker improvements

  • DirectoryCache: spawn_blocking for filesystem ops, parallel blob downloads
  • Fix subtree race condition with download fallback
  • Worker proxy store with blob mirroring and locality-aware scheduling
  • Targeted prefetch of missing blobs at action dispatch
  • BlobsInStableStorage lifecycle for durability
  • Fix worker reconnect on scheduler eviction / server restart

Observability

  • Runtime watchdog + TCP keepalive
  • Store operation stall detector with thread dumps
  • pprof auto-capture on CPU threshold
  • Comprehensive write path and eviction logging
  • EvictingMap lock instrumentation
  • Non-blocking stdout logging via tracing-appender

QUIC/HTTP3 transport

  • QUIC connection pool with SO_REUSEPORT
  • BBR congestion control, tuned ACK delay
  • TLS support for worker-to-server and server-to-worker connections
  • Dual TCP+QUIC transport with intelligent routing

Test plan

  • 18 MokaEvictingMap integration tests (insert/get/remove, eviction, TTL, pinning, callbacks, concurrent stress)
  • 8 GetTree cache tests (tree_cache_hit, subtree_overlap, coalescing_concurrent, leader_failure, paginated_bypass)
  • Existing CAS server tests pass
  • cargo check --bin nativelink --features quic,pprof compiles clean
  • Production deployment: 10 Mac Mini workers, 48GB MemoryStore, 800GB FilesystemStore
  • macOS benchmark: 39s → 11s (3.5x faster)

🤖 Generated with Claude Code


This change is Reviewable

rejuvenile and others added 30 commits March 23, 2026 09:39
inner_store returned self, preventing callers (like LocalWorker) from
downcasting through the chain to find FastSlowStore. Delegate to inner
store instead — optimized_for override is independent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EvictingMap: warn! for items evicted within 120s of insertion (age + size
in log), debug! for older items. Helps diagnose Bazel "lost inputs" errors.

Worker: append .local to bare hostnames for mDNS resolution so the server
can connect to worker CAS endpoints for peer blob sharing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
upload_results writes to fast_store() only (FilesystemStore), deferring
remote CAS upload to spawn_upload_to_remote. But that function only
collected tree_digest blobs, not the individual file blobs inside output
directory trees (dep-graph.bin, query-cache.bin, etc). This caused
"Missing digest" / "lost inputs" errors when Bazel tried to download
action outputs that were never pushed to the server.

Fix: decode each Tree proto from fast_store and extract all file digests
for inclusion in the background upload. Also add success/fail counters
and tree file count to upload logging for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers now race a peer fetch (via locality map) in parallel with the server
fetch when peers are known. Whichever responds first wins; the loser is
cancelled. This gives LAN peers a chance to win when they're closer/faster.

Server-side behavior is unchanged — IS_WORKER_REQUEST detection ensures the
sequential path (with redirect generation) is used for server-side requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit raced on both server and worker sides because
IS_WORKER_REQUEST isn't set for Bazel client requests to the server.
Add an explicit race_peers flag (default false) that only workers
enable, preventing the server from wastefully racing against its
own workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AS, register output digests

- Parallelize server-side GetTree BFS (FuturesUnordered per tree level)
- GrpcStore: report LazyExistenceOnSync for CAS stores (skip FindMissingBlobs before get_part)
- WORKER_BACKLOG 8→64 to reduce backpressure during burst patterns
- Worker peer CAS connections 4→64
- Include tree digests in BlobsAvailableNotification from worker
- Register output digests from ExecuteResult in server locality map
- Fix existence_store_test: yield for async eviction callbacks
- Fix bytestream_server_test: tonic Status format change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…check, parallel mkdir

- Bound hardlink phase to 64 concurrent tasks (was unbounded 4000+)
- Split has_with_results into 500-key chunks to release Mutex between batches
- Level-parallel BFS for directory creation (siblings concurrent, parents first)
- Log CAS server exit errors in local_worker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log requested/missing counts at info level, and list missing digests at
debug level. Needed to diagnose "Lost inputs" build failures where blobs
exist on disk but Bazel reports them missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tribution

Uses WorkerProxyStore locality map to route batch blob reads to peers that
already have the data, falling back to server for unknown digests. All peer
and server batches execute in parallel via join_all, eliminating the previous
server-only bottleneck where 10 workers competed for the same blobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…concurrency to 64

- Round-robin digest assignment across peers that have the blob, preventing
  hotspots when one peer has most blobs
- Retry path is now best-effort: individual failures are logged and skipped
  instead of aborting the entire batch operation
- BYTESTREAM_CONCURRENCY increased from 16 to 64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cpu_load_pct (load_avg_1m / num_cpus * 100) piggybacked
on KeepAliveRequest (~2.5s), BlobsAvailableNotification (~500ms), and
ExecuteComplete (per action). The scheduler stores this per-worker and
prefers lightly-loaded workers when selecting candidates:

- LRU/MRU fallback path: picks lightest-loaded viable worker
- Locality scoring tiebreaker: when scores are within 10%, lower CPU
  load wins before timestamp
- Workers reporting 0 (unknown/old) are sorted last among known loads

Backward compatible: old workers send 0 (proto default), treated as
unknown and sorted last.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs in inner_get_tree:

1. FuturesUnordered returned directories in completion order, not BFS
   discovery order, making paging tokens nondeterministic. Fixed by
   collecting into a HashMap and iterating in original order.

2. page_size=0 (no paging) triggered `len >= 0` which is always true,
   breaking after the first BFS level. Fixed by treating 0 as MAX.

3. When page was filled mid-level, remaining unprocessed items were
   dropped, producing empty next_page_token. Fixed by copying remaining
   items back to the deque front.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e fast path

- GetTree BFS validation: use position assignment instead of re-hashing
  directory protos (fixes 77% validation failure rate with Java-serialized protos)
- EvictingMap::get_many(): batch retrieval in single lock acquisition
- FilesystemStore::get_file_entries_batch(): batch file entry lookup
- Hardlink pre-fetch: pre-fetch entries in batches of 500 before concurrent
  hardlink loop, eliminating per-file EvictingMap lock contention
- Blob eviction race fix: eager pre-read of small blobs (<=1MiB) before
  background upload to prevent eviction race in spawn_upload_to_remote
- Directory cache: use download_to_directory for cache-miss construction
  instead of serial per-file RPCs (2.5s -> 50-200ms)
- Combined set_readonly_recursive + calculate_directory_size into single walk
- GetTree BFS dedup logging: per-level timing, dedup stats, slow level warnings
- Input fetch logging: tree resolution, materialization, hardlink stats,
  blob fetch throughput, slow operation warnings
- CPU load logging downgraded to debug level (worker + server)
- Load-aware selection logging downgraded to debug level
- Fix mkdir_depth_levels log field (was using dirs_created instead of depth)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Streaming pipeline: producer fetches missing blobs in batches of 64,
  consumer hardlinks concurrently (64-wide) as blobs arrive via mpsc channel.
  Both overlap via futures::future::join for minimum wall-clock time.
- Raise MAX_PEER_HINTS from 1000 to 16384 to cover large actions.
- FastSlowStore: per-leg timing (fast_ms, slow_ms, slower_leg).
- FilesystemStore: per-phase timing (temp create, write, emplace) >50ms.
- EvictingMap: warn! on lock contention >1ms with operation name.
- StallGuard: with_context() for dynamic digest/size in stall dumps.
- DirectoryCache: comprehensive hit/miss/eviction/timing logging.
- MemoryStore config: 16GB→32GB per tier (64GB total) in server configs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlinks as blobs arrive

Three-future pipeline: fetcher (all blobs at once, bounded at 128), producer
(sends files to channel as blobs land via Notify), consumer (hardlinks at 64
concurrency). Eliminates serial per-batch round-trips that bottlenecked at
47-60 MB/s despite 10GbE capacity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…info

- directory_cache: set_readonly_and_calculate_size was setting all files
  to 0o444, stripping the execute bit from executables. Now preserves
  execute permission (0o555 for executables, 0o444 for non-executables).
  This caused EPERM failures for cargo build scripts (runner, ring, etc).

- api_worker_scheduler: promote "Load-aware worker selection" log from
  debug! to info! so CPU-load scheduling decisions are visible in
  production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tests verifying load-aware worker selection:
- cpu_load_update_worker_load_stores_correctly
- cpu_load_lightest_loaded_worker_gets_picked
- cpu_load_unknown_zero_sorted_last
- cpu_load_falls_back_to_lru_when_no_load_data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cached input_root_digest values to the scheduler via
BlobsAvailableNotification. Scheduler gives highest routing priority
to workers with a directory cache hit for the action's input_root_digest.

Fix EPERM: set_readonly_and_calculate_size now strips write bits only
(& 0o555) instead of guessing executable status. Also removes the
skip-when-0o555 chmod optimization in hardlink_and_set_metadata which
was unsafe because concurrent hardlinks sharing CAS inodes can corrupt
file permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On cache MISS, resolve the full merkle tree and check if any subtrees
are already cached from other root entries. Cached subtrees are reused
via directory symlinks (APFS-compatible), skipping download of already-
materialized portions. BFS traversal ensures maximum (top-down) subtree
matching.

- Store .merkle_tree_meta alongside each cached directory entry
- In-memory subtree_index maps every directory digest to its disk path
- Rebuild subtree index from disk metadata on startup
- Clean up subtree index entries on eviction
- Made resolve_directory_tree public for cache access
- 6 new tests for merkle metadata, subtree index, and cache reload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS requires write permission on the source directory for rename(2),
unlike POSIX/Linux which only checks the parent. Temporarily restore
0o755 on the temp dir before rename, then lock down to 0o555 after.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er subtree scoring

Workers report cached subtree digests via delta encoding (added/removed)
instead of full snapshots every 500ms. First notification sends full
snapshot, subsequent ones send only changes. Scheduler scores workers
by both root directory cache hits and subtree cache hits.

Fix CAS inode corruption: always provide explicit unix_mode (0o444 for
non-executable, 0o555 for executable) to prevent concurrent hardlinks
from corrupting shared inode permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… not just directory count

When no worker has an exact root match, the scheduler now scores workers by
the total file bytes under their cached subtree digests. A worker caching a
subtree with 10GB of files scores higher than one with 100 bytes. The tree
resolver computes per-subtree byte totals via bottom-up aggregation during
BFS resolution, cached alongside file digests.

Scoring tiers: exact root match > weighted subtree coverage > blob locality > LRU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e cache

Bazel actions create output directories inside the input tree. Directory
symlinks to cached subtrees caused EPERM because either:
- Cached dirs were 0o555 (can't mkdir inside), or
- If made writable, actions would mutate the cache

Fix: use hardlink_directory_tree for subtree cache hits — creates fresh
writable directories and hardlinks only the files. Cache integrity is
preserved (0o555 dirs, read-only files) since actions never access the
cache directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ach path component

When create_dir_all fails for an output directory, walk up the path and
log the mode, is_dir/is_file/is_symlink status of each component to
identify whether the failure is due to a read-only parent, a file
blocking a directory, or a symlink to a read-only cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlink

The Bazel input tree contains SymlinkNodes (e.g., bazel-out) that, when
recreated as symlinks in the work directory, point to the read-only
directory cache (0o555 directories). create_dir_all then fails with
EPERM when trying to create output directories through these symlinks.

Fix: hardlink_directory_tree_recursive now resolves directory symlinks
to real directories with fresh writable permissions, while preserving
file symlinks and dangling/looping symlinks as-is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…opies

The Bazel input tree contains SymlinkNodes (e.g., bazel-out -> .) that
point to directories in the read-only directory cache (0o555). When
create_dir_all tries to create output directories through these symlinks,
it fails with EPERM because the resolved target directories are read-only.

Previous approach (resolving ALL directory symlinks in hardlink_directory_tree)
caused infinite recursion for self-referential symlinks like bazel-out -> .

New approach: prepare_output_directories now handles this surgically:
1. Fast path: try create_dir_all (usually works)
2. On failure, walk the output path component by component
3. For each symlink that resolves to a read-only directory:
   - Replace the symlink with a real writable directory
   - Create absolute symlinks to all entries in the original target
   - Skip self-referential entries (e.g., bazel-out pointing to itself)
4. For read-only work dirs (0o555): chmod writable
5. Retry create_dir_all after each fix

This preserves access to all input tree files while making the specific
output path writable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_dir_all

create_dir_all succeeds when the output directory already exists (it's
part of the input tree), even if the directory is read-only (0o555)
through a symlink chain to the cached directory. Then rustc fails with
Permission denied when writing output files (.d, binary) into it.

Fix: after create_dir_all succeeds, check if the parent directory is
actually writable (mode & 0o200). If not, fall through to the slow path
that replaces symlinks-to-read-only-dirs with writable shallow copies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- set_readonly_recursive: preserve execute bits (& !0o222 instead of
  hardcoded 0o444) so cached shell scripts remain executable
- prepare_output_directories: serialize slow-path symlink replacement
  with a tokio Mutex to prevent concurrent EEXIST/ENOENT races
- Cargo.toml: lto="thin", codegen-units=16 for faster release builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old cache entries had files at 0o444 (no execute bit) from before the
set_readonly_and_calculate_size fix. Since the cache persists on disk
across restarts, these stale entries kept serving non-executable files
like cargo_build_script_runner and cc_wrapper.sh.

Add CACHE_FORMAT_VERSION (currently 2) with a version file check on
startup. When the version is missing or stale, all entries are cleared
and the version file is written. This ensures format changes (like
permission semantics) automatically invalidate old entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rejuvenile and others added 6 commits April 14, 2026 21:52
18 integration tests covering: insert/get/remove, max_bytes/max_count
eviction, TTL expiration, pin/unpin/pin_cap, replaced items, startup
insert, sizes_for_keys, range queries, concurrent stress, callbacks.

VerifyStore: documented that batch_get_part_unchunked bypasses hash
verification. Audited all store wrappers — completeness_checking and
dedup correctly fall through to default (no bypass).

Expiry investigation: Moka's Expiry trait does NOT help with size-based
eviction ordering (TTL and size eviction are independent). Corrected
doc: window deque is unused in Moka 0.12, entries go to MainProbation.
Current mitigation (insert_startup skips freq bump, FIFO ordering in
MainProbation) correctly preserves atime-based eviction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MemoryStore: direct evicting_map.get_many() + BytesWrapper::to_contiguous().
Zero buf_channel allocations, zero async task pairs. Single-chunk blobs
are zero-copy (Arc bump). ~50us for 500 keys vs ~2-5ms before.

FilesystemStore: FuturesUnordered for parallel I/O but skips buf_channel.
Each task does evicting_map.get() → read_file_entry_bytes() → Bytes.
Preserves FD semaphore, stale-entry cleanup, length cap. 256MiB safety
bound for unlimited reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dual_transport code was passing ALL endpoints (including
use_http3=true) to the TCP ConnectionManager. This caused TCP
connection attempts to the QUIC-only UDP port (50072), generating
persistent ConnectionRefused errors and log floods on all workers.

Fix: filter out use_http3 endpoints when building tcp_endpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree: MokaEvictingMap cache (512MiB, 10K entries, 5min TTL) for
assembled tree results. 54 concurrent identical GetTree calls → 1 BFS
+ 53 cache hits. Eliminates ~15,900 redundant directory lookups.

ExistenceCacheStore: batch_get_part_unchunked now uses insert_many()
instead of 552 sequential insert() calls. Reduces ~2200 Moka ops +
552 run_pending_tasks() to ~552 inserts + 1 maintenance pass.

get_many(): documented why sequential is correct (Moka lock-free
reads at 100ns each, parallelism overhead exceeds benefit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree coalescing: DashMap + watch channel prevents thundering herd.
54 concurrent same-root calls → 1 BFS + 53 waiters. TOCTOU race
fixed with Entry API (single lock scope for check+register).

Subtree cache (level 2): per-directory MokaEvictingMap (256MiB, 50K,
5min TTL). BFS checks subtree cache before store fetch. Different
roots with 90% overlap → only ~10% dirs fetched from store.

Arc<Vec<Directory>>: CachedTree wraps directories in Arc for cheap
cache clones within moka. Response construction still deep-clones
(required by protobuf ownership).

insert_many: uses insert_batch (defers run_pending_tasks) instead
of insert_inner (called it per-item). Now 1 maintenance pass per
batch instead of N+1.

CLAUDE.md: code review requirement before committing.

Level 3 Tree proto lookup deferred (no root→tree digest mapping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 integration tests: tree_cache_hit, tree_cache_miss_different_root,
subtree_cache_overlap, coalescing_concurrent, coalescing_leader_failure,
paginated_bypasses_cache, subtree_cache_deduplication, next_page_token.

Arc optimization: BFS result moved into Arc (zero-copy), cache gets
Arc clone (refcount bump), response gets one deep clone. Eliminates
transient double materialization (~5000 heap allocations saved).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FindMissingBlobs (120K/min), ByteStream read/write completed,
mirror blob streamed, AC read/write, BlobsAvailable registration,
streaming populate, connection creation — all downgraded from
info! to debug!. With release_max_level_info these are compiled
out in release builds, reducing log volume ~60x under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MarcusSorealheis MarcusSorealheis mentioned this pull request May 2, 2026
8 tasks
@MarcusSorealheis
Copy link
Copy Markdown
Collaborator

/build-image

@MarcusSorealheis
Copy link
Copy Markdown
Collaborator

Warning: until we have the source code for the io_uring dep, it is risky to run this code.

@erneestoc
Copy link
Copy Markdown
Contributor

@MarcusSorealheis isn't that just part of Linux? (src, also, does that mean it'd only work on Linux though?)

@MarcusSorealheis
Copy link
Copy Markdown
Collaborator

MarcusSorealheis commented May 14, 2026

@erneestoc that's exactly right. We will have to think very carefully about that aspect of this PR. It is only one optimization. There are many others.

MarcusSorealheis pushed a commit that referenced this pull request May 17, 2026
* worker: set QoS USER_INITIATED on macOS for P-core preference

Apple Silicon's XNU scheduler will park UTILITY/BACKGROUND threads on
efficiency cores. Single-thread-bursty workloads (swift-frontend,
clang) typical in iOS RBE builds can run 2x-3x slower on an E-core,
so tag the worker process with QOS_CLASS_USER_INITIATED to bias
scheduling toward P-cores.

The setter runs in three places:
  - Main thread before tokio runtime creation so worker threads
    inherit the class via pthread QoS inheritance.
  - tokio Builder::on_thread_start hook as belt-and-suspenders for
    any thread (e.g. blocking pool) that misses inheritance.
  - Top of LocalWorker::run for the same reason.

Implementation uses libc's pthread_set_qos_class_self_np binding;
the new `nativelink_worker::qos` module is compile-gated so non-macOS
targets emit no call and pull in no symbol. A round-trip test on
macOS verifies the kernel accepted the class change.

Ported from upstream commit 0fce813 (TraceMachina/nativelink #2243).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* worker: test tokio worker thread inherits USER_INITIATED QoS

The QoS scheme in PR #2243 hinges on tokio worker threads actually
seeing QOS_CLASS_USER_INITIATED at task-runtime; without an end-to-end
test the on_thread_start hook could silently regress (e.g. if the hook
ran on the wrong thread or the kernel rejected the class) and the
worker would quietly fall back to E-core scheduling.

Adds a macOS-only test that builds a fresh multi-threaded tokio runtime
with the same on_thread_start hook used in main, spawns a task to force
execution on a worker thread, and reads back the class with
pthread_get_qos_class_np. Also refactors the existing single-thread test
to share a helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* qos: justify disallowed_methods escape on tokio_worker_threads_inherit test

The on_thread_start inheritance test must construct a custom-built
runtime via `Builder::new_multi_thread()` and drive it with `block_on`
— that is the unit under test. No `nativelink-util::task` wrapper
exposes a custom-built runtime with a thread-start hook, so the
disallowed_methods lint cannot be addressed at the root cause.

Use `#[expect(clippy::disallowed_methods, reason = ...)]` per the
modern Rust 2024 idiom (fails if the lint stops firing, with a
reviewer-visible justification) rather than a silent `#[allow]`.

Mirrors the same justified escape already used in
src/bin/nativelink.rs::main.

* qos: split set_user_initiated into cfg-gated fn / const fn

The previous single-definition `pub fn set_user_initiated() -> bool` had
a `#[cfg(target_os = "macos")]` block that called libc and a
`#[cfg(not(...))]` block that returned `true`. On Linux CI clippy sees
only the trivial `true` arm and fires `missing_const_for_fn`, failing
ubuntu, asan, Bazel Dev/ubuntu, and every dependent rbe-* job. This did
not reproduce on macOS because the macOS arm calls libc, which is not
const-eligible, so clippy stays silent.

Split into two cfg-gated definitions: the macOS impl stays a regular
`pub fn` because `libc::pthread_set_qos_class_self_np` is not const; the
non-macOS impl becomes `pub const fn` returning `true`. Call sites are
unchanged, both arms still return `bool`, and the existing
`qos::macos_tests::*` continue to apply since they were already gated on
`#[cfg(target_os = "macos")]`. Doc comments are now split per arm and
specialised to each platform's actual behaviour.

Splitting (rather than `#[allow(missing_const_for_fn)]` on a single
function) is the right fix because the lint is accurate for the
non-macOS arm in isolation; suppressing it would hide a legitimate
const-fn opportunity and mask future bugs on whichever platform clippy
runs against.

* ci: retrigger after GitHub 502 fetching rules_kotlin tarball

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MarcusSorealheis added a commit that referenced this pull request May 17, 2026
…low writes (#2343)

* fast_slow_store: make has_with_results aware of fast store and in-flight slow writes

`FastSlowStore::has_with_results` previously only consulted the slow store,
which caused two known bugs:

  1. Fast-only writes (and the brief window between fast-store insert and
     slow-store write start) were invisible — callers got NotFound for blobs
     that were locally present, triggering redundant fetches.

  2. Concurrent writers racing on the same digest could not see each other's
     in-flight slow-store writes, so the second writer re-uploaded what the
     first had nearly finished pushing.

The fix layers three consultations in order: slow store first (authoritative
for downstream consumers), then a `in_flight_slow_writes: Mutex<HashMap>`
tracking active slow writes, then fast store as a final fallback for
fast-only / pre-slow-write hits.

In-flight tracking uses a cancel-safe RAII guard (`InFlightSlowWriteGuard`)
registered at the start of each `update`/`update_with_whole_file` path that
writes to the slow store, so a cancelled write future correctly removes
itself from the map on Drop.

Tests cover all four reachable cases: slow-hit short-circuit, fast-only
hit, in-flight slow write visibility, and noop-slow-store fall-through.

Provenance: equivalent to upstream commits f69aaf8 and 2d770d9 from
TraceMachina/nativelink PR #2243, ported atomically to current main.

* fast_slow_store_test: add cancel-safety and mixed-key has_with_results tests

The previous commit added tests for the four reachable cases of the
fast/slow/in-flight has() lookup, but two correctness properties were
asserted only indirectly:

  1. Cancel safety of `InFlightSlowWriteGuard`. The fix's central claim
     is that aborting an in-progress update() removes the key from the
     in_flight_slow_writes map via Drop. The previous in-flight test
     only verified the happy-path (writer completes normally) and the
     fast-store fallback masked any leak.

  2. Per-key independence in batched has_with_results. With multiple
     keys spanning different storage tiers, an off-by-one in the
     missing_indices fallback or an over-broad in-flight match would
     silently corrupt results without any existing test noticing.

New tests:

* dropping_update_future_cleans_up_in_flight_entry: uses a
  gated-slow-store with NoopStore as fast (so the fast-store fallback
  cannot mask a leak), spawns a writer, waits on a oneshot started
  signal, aborts the writer mid-update, then asserts has() returns
  None — proving the guard's Drop ran.

* has_with_results_handles_mixed_key_sources: builds a request with
  four keys — slow-only, in-flight, fast-only, and missing — and
  asserts each result independently matches its source. Catches
  index-mapping regressions in the batched fallback path.

Both tests are deterministic (oneshot channels, no sleeps). Verified
that reverting the fix in fast_slow_store.rs causes both new tests
plus the existing has_sees_in_flight_slow_writes and
fast_store_only_value_is_reported_by_has to fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fmt: apply rustfmt to nativelink-store

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fast_slow_store_test: address clippy items_after_statements and cast_possible_truncation

Hoist the `MapBackedSlow` test helper (struct, StoreDriver impl, and
`default_health_status_indicator!` macro invocation) out of
`has_with_results_handles_mixed_key_sources` to module scope so that
items are declared before statements, satisfying
`clippy::items_after_statements`.

Replace `as usize` casts of the per-key `u64` sizes with
`usize::try_from(...).unwrap()` to satisfy
`clippy::cast_possible_truncation`, matching the existing conversion
pattern elsewhere in this test file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger after GitHub 502 fetching hermetic_cc_toolchain tarball

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
palfrey added a commit that referenced this pull request May 19, 2026
… Bazel input materialization (#2338)

* fs_util: add clonefile(2) fast path to hardlink_directory_tree

On macOS, try APFS clonefile(2) before falling back to the existing
per-file hardlink walk. clonefile is O(1) in tree size and uses
copy-on-write, so subtree-cache hits no longer scale with input count.

After a successful clone the destination is made writable (0o755/0o644)
because the clone inherits the source's permissions and cached subtrees
are 0o555/0o444. The COW semantics of clonefile mean writes to the
destination do not affect the source, so this is safe.

On EXDEV (cross-volume), ENOTSUP, or any other errno, we log at debug
and fall through to hardlink_directory_tree_recursive. Linux and Windows
paths are unchanged.

Extracted from TraceMachina/nativelink PR #2243 (commit 13fcc0c).

* fs_util,directory_cache: track CloneMethod per materialization

hardlink_directory_tree now returns CloneMethod (Clonefile | Hardlink) so
callers can see which kernel path was taken. DirectoryCache records the
result in two atomic counters and exposes them via CacheStats. Without
this telemetry, a silent fall-through from clonefile to per-file
hardlinks (e.g. cache and workspace on different volumes, or APFS
clonefile failing for any other reason) would be invisible.

On Linux/Windows hardlink_directory_tree always returns Hardlink — no
behavioral change. The new CacheStats fields default to zero on those
platforms.

Extracted from TraceMachina/nativelink PR #2243 (commit 13fcc0c).

* running_actions_manager: bound download_to_directory concurrency to 64

Previously download_to_directory pushed every file-hardlink, subdir
recursion, and symlink future onto an unbounded FuturesUnordered, then
drained it. On macOS this produced thousands of parallel hardlink(2)
calls fighting APFS's per-volume metadata lock — the observed exec-log
shape was ~4 ms per input file at scale, consistent with serialized
metadata mutations plus tokio scheduling overhead.

This commit gates each directory level to at most 64 in-flight futures
via stream::buffer_unordered(64). 64 is well above the inflection point
on any modern Linux filesystem, so Linux is unaffected beyond replacing
tokio scheduling overhead with simpler stream polling.

Scope notes (vs PR #2243 ee85fdc):
- The chunked has_with_results sub-change does not apply directly: the
  current code calls populate_fast_store per-digest, not a batched
  has_with_results.
- Level-parallel BFS mkdir is not applied here; the recursion structure
  is unchanged. The 64-cap is per recursive call, not global. Deep trees
  can therefore still have 64 * depth in-flight futures. A full flatten
  pass is a follow-up.

Extracted from TraceMachina/nativelink PR #2243 (commit ee85fdc),
narrowed to fit the current code shape.

* directory_cache: short-circuit zero-byte files

FilesystemStore (and several other CAS backends) refuse to store
zero-byte blobs, so a get_part_unchunked for the zero-byte digest
(af1349b9... / e3b0c449...) returns NotFound. Bazel input trees
routinely contain empty marker/config files (.linksearchpaths, empty
.env, .toml, etc.), so without this fix a single such file in any
directory causes the entire DirectoryCache construction to fail —
roughly 30% of cache attempts per PR #2243.

Short-circuit create_file: if the digest is the zero-byte digest, write
b"" to disk directly and never consult the CAS.

Cross-platform correctness fix.

Extracted from TraceMachina/nativelink PR #2243 (commit d198902).

---------

Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
erneestoc added a commit to erneestoc/nativelink that referenced this pull request May 19, 2026
Port of TraceMachina/nativelink PR TraceMachina#2243 commit a47774d.

Root cause: when DirectoryCache evicts an entry, the cleanup path calls
`set_readwrite_recursive` on the cached tree before `remove_dir_all`. That
helper chmods every entry — including files — to 0o755/0o644. Files in
a cached entry are hardlinked into in-flight action workspaces (via
`hardlink_directory_tree` in `get_or_create`) and ultimately share an
inode with the underlying `FilesystemStore` CAS blob (via `fs::hard_link`
in `download_to_directory`). Chmoding the cached-side file therefore
silently mutates the shared inode's mode for every other in-flight
action holding a hardlink to the same blob.

Production symptom: EACCES on exec for `cc_wrapper.sh`. The CAS mode of
0o555 (r-xr-xr-x) gets clobbered to 0o644 (rw-r--r--), dropping the +x
bit while an unrelated action is mid-exec.

Fix: introduce `set_dir_writable_recursive` which only chmods
directories, never files. On unix, write permission on the parent
directory is sufficient to unlink files inside; the files' own modes
are irrelevant for unlinking. Switch the eviction cleanup path in
`DirectoryCache::evict_lru` to the new helper.

Empirically verified: a regression test that hardlinks a 0o555 file
into a cached tree and runs the cleanup helper FAILS on pre-fix code
(file mode mutated to 0o644) and PASSES on post-fix code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MarcusSorealheis pushed a commit that referenced this pull request May 19, 2026
#2347)

Port of TraceMachina/nativelink PR #2243 commit a47774d.

Root cause: when DirectoryCache evicts an entry, the cleanup path calls
`set_readwrite_recursive` on the cached tree before `remove_dir_all`. That
helper chmods every entry — including files — to 0o755/0o644. Files in
a cached entry are hardlinked into in-flight action workspaces (via
`hardlink_directory_tree` in `get_or_create`) and ultimately share an
inode with the underlying `FilesystemStore` CAS blob (via `fs::hard_link`
in `download_to_directory`). Chmoding the cached-side file therefore
silently mutates the shared inode's mode for every other in-flight
action holding a hardlink to the same blob.

Production symptom: EACCES on exec for `cc_wrapper.sh`. The CAS mode of
0o555 (r-xr-xr-x) gets clobbered to 0o644 (rw-r--r--), dropping the +x
bit while an unrelated action is mid-exec.

Fix: introduce `set_dir_writable_recursive` which only chmods
directories, never files. On unix, write permission on the parent
directory is sufficient to unlink files inside; the files' own modes
are irrelevant for unlinking. Switch the eviction cleanup path in
`DirectoryCache::evict_lru` to the new helper.

Empirically verified: a regression test that hardlinks a 0o555 file
into a cached tree and runs the cleanup helper FAILS on pre-fix code
(file mode mutated to 0o644) and PASSES on post-fix code.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erneestoc added a commit to erneestoc/nativelink that referenced this pull request May 19, 2026
Previously download_to_directory pushed every file's fetch+hardlink as
one fused future into a single FuturesUnordered. Each future ran the
fetch (populate_fast_store) and then the hardlink sequentially within
the same task. With per-future store-side serialization the fetch leg
becomes the bottleneck for SwiftCompile-style workloads (1978+ input
files at p95), where blob-fetch wall time dominates total action time.

This commit restructures download_to_directory into a two-phase
pipeline driven by try_join3:

  1. Fetcher launches ALL missing-blob populate_fast_store calls up
     front, bounded at DOWNLOAD_TO_DIRECTORY_FETCH_CONCURRENCY = 128
     via stream::try_for_each_concurrent. Digests are deduplicated so a
     blob referenced by N files is only fetched once. As each blob
     arrives, every file index that needs it is sent through an mpsc
     channel.

  2. Hardlinker drains the channel and performs hardlink + chmod +
     set_mtime, bounded at DOWNLOAD_TO_DIRECTORY_HARDLINK_CONCURRENCY
     = 64 (the same cap that PR TraceMachina#2338 added to compose with APFS
     metadata-lock behavior on macOS). Hardlinking starts as soon as
     the first blob arrives - it does not wait for the fetcher to
     finish.

  3. Subdirectory recursion and symlink creation run in a separate
     FuturesUnordered alongside the file pipeline.

Net win on a Mac-Mini RBE worker: fetch wall time becomes
max(per-blob latency) bounded by the 128-cap, not sum(per-blob latency)
serialized inside one future.

Concurrency caps compose: fetch (128) is network/store-bound where extra
in-flight requests are cheap; hardlink (64) is metadata-bound where
extra in-flight syscalls fight APFS's per-volume lock on macOS. The
two phases run independently with their own bound, so on cache-hot
runs only the hardlink leg is active and the fetch cap is irrelevant.

Tests:

- download_to_directory_fetches_blobs_concurrently: a per-blob 75 ms
  sleep on the slow store. Serial fetch of 20 blobs would be 1500 ms;
  concurrent fetch finishes well under 500 ms, and max_in_flight is
  observed to reach >=4 (would be 1 if serial).

- download_to_directory_streams_hardlinks_while_fetching: 30 blobs
  with 50 ms per-blob latency. Serial fetch+hardlink would be 1500 ms;
  pipelined version finishes under 750 ms. Catches regression to a
  fetch-all-then-hardlink-all two-phase shape that drops overlap.

Both tests use a minimal DelayedStore harness (StoreDriver wrapping
MemoryStore) added inline. This pulls in async-trait as a dev-dependency
(version 0.1.88, matching the rest of the workspace).

Scope notes (vs PR TraceMachina#2243 a8db953):

- The upstream commit depends on infrastructure introduced earlier in
  the PR that is NOT being ported here: batch_read_small_blobs,
  populate_fast_store_unchecked, get_file_entries_batch,
  BATCH_READ_MAX_BLOB_SIZE, BatchReadBlobs / ByteStream partitioning.
  These hooks are entangled with the peer-to-peer infrastructure and
  proto changes we explicitly excluded. The upstream commit was
  reduced to its core idea: launch all fetches upfront, hardlink as
  blobs arrive.

- Upstream uses a Notify + shared HashSet<DigestInfo> "fetched_set"
  to coordinate fetcher and producer. Here we use a simpler mpsc
  channel: the fetcher sends file indices directly as their blob
  becomes available. There is no separate producer step because there
  is no batch-vs-bytestream partitioning to defer to a second pass.

- Follow-on upstream commits 7b093e8 and b64944a are excluded
  per task scope - they depend on peer hints and proto changes.

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit a8db953),
adapted to the current code shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
palfrey pushed a commit that referenced this pull request May 20, 2026
…ro-digest files) (#2346)

* Resolve absolute symlinks in output upload instead of uploading raw targets

When the worker's work directory is populated via DirectoryCache, output
paths can be absolute symlinks pointing into the cache directory
(/var/.../directory_cache/...). The previous output collection code
uploaded these as raw SymlinkInfo with absolute targets that are
meaningless on the Bazel client, causing "No such file or directory"
errors when the client materialised the action result.

Detect absolute symlinks in output paths and resolve them: upload
directory contents as Tree protos and file contents as regular files.
Relative symlinks (intentionally created by the action) are still
preserved as symlinks.

Updates upload_dir_and_symlink_test to use a relative symlink (the
previous /dev/null absolute symlink is now resolved, breaking the old
assertion) and adds a new upload_absolute_symlink_resolves_contents
regression test that verifies an out-of-tree payload reachable via an
absolute symlink lands in CAS as a file.

Ported from upstream
0807153 (PR
#2243).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix zero-digest files missing from worker execution directories

FilesystemStore::get_file_entry_for_digest previously returned a
synthetic FileEntry for zero-digest blobs pointing to a content_path
that never exists on disk (FilesystemStore deliberately never persists
zero-byte files). Downstream worker output-materialisation code that
took the prefetched hardlink path would try to hard_link from this
non-existent source, fail, and either silently produce a missing file
or fall back without enough context to diagnose the failure.

This is the OUTPUT-materialisation companion to PR #2338's DirectoryCache
zero-byte fix, which lives on the input-fetch path. The two changes are
complementary: #2338 covers the directory cache short-circuit; this
covers the FilesystemStore API and the running_actions_manager fallback.

Changes:
- FilesystemStore::get_file_entry_for_digest now returns
  Code::NotFound for zero digests instead of a synthetic FileEntry,
  forcing callers to materialise empty files via fs::create_file rather
  than hard_linking from a phantom source.
- running_actions_manager::download_to_directory adds err_tip context
  on the zero-digest fs::create_file / write_all so a failure here is
  attributable to the empty-file path.
- Updates the existing get_file_entry_for_zero_digest test to assert
  the new NotFound behaviour and renames it to
  get_file_entry_for_zero_digest_returns_not_found.
- Adds download_to_directory_zero_digest_empty_file_test that
  exercises an empty file declared inside an input directory and
  verifies it materialises on disk with zero bytes. This test
  intentionally does NOT collide with PR #2338's DirectoryCache
  zero-byte test (which validates a different short-circuit path).

Ported from upstream
19d7e20 (PR
#2243).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Close test coverage gaps for ported output-upload fixes

Audit of the two ported commits (e57b675b, b1dc3a1c) flagged two
correctness-relevant gaps. Both are now closed with strict assertions so
the underlying bugs cannot silently regress.

1. upload_absolute_symlink_to_directory_uploads_tree
   The original e57b675b regression test covers absolute-symlink-to-file
   but not absolute-symlink-to-directory, which exercises an entirely
   separate code path (upload_directory + Tree proto serialisation). The
   new test creates an out-of-tree directory containing inner.txt,
   absolute-symlinks it into the work dir, runs the action, and:
   - asserts output_directory_symlinks and output_file_symlinks are
     BOTH empty (the symlink must NOT be preserved),
   - asserts output_folders contains exactly one entry at the symlink
     path, proving Tree-upload happened,
   - walks the uploaded Tree, locates inner.txt, fetches its blob from
     CAS and verifies the content. This proves the directory was
     actually traversed and uploaded — not stubbed.

2. download_to_directory_zero_digest_empty_file_test (strengthened)
   Extended the existing single-file test to cover:
   - a second sibling zero-digest file (proves the path is not a
     single-file fluke),
   - a zero-digest file nested inside a subdirectory (proves the
     recursive download_to_directory caller also handles NotFound from
     get_file_entry_for_digest — not just the top-level invocation),
   - strict per-file assertions: is_file, !is_symlink, len == 0, and
     a read-back that confirms the file is actually empty rather than
     a phantom dirent.

Verdict on the dropped checks:
- Dangling absolute symlink: source returns OutputType::None which is
  the correct graceful behaviour, but writing a test for it requires
  fabricating a broken symlink mid-action and is comparatively low
  value; left uncovered intentionally.
- File mode on zero-digest materialisation: the worker uses
  fs::create_file with no explicit mode, so the result follows umask
  and would be brittle across CI environments. Length + type
  assertions are the stronger invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* running_actions_manager_test: use /dev/null in upload_dir_and_symlink_test

Previous iterations of this PR sidestepped `/dev/null` by swapping the
absolute-symlink fixture for either a relative symlink (lost coverage)
or an out-of-tree real empty file (added scaffolding to dodge a corner
case we should just test directly).

Restore the original `ln -s /dev/null empty_sym` fixture. The post-fix
worker code resolves the absolute symlink via `fs::metadata` (follows),
sees a character device (`is_dir() == false`), and falls into the
"upload as file" branch. `upload_file` opens `/dev/null` and reads to
EOF — `/dev/null`'s contract is that reads return 0 bytes immediately
— producing the canonical sha256 empty digest (e3b0c44…) with size 0.

That's well-defined, harmless behavior and a real-world Bazel pattern
(rules sometimes use `ln -sf /dev/null x` to create empty outputs).
Test now locks in this contract directly, no scaffolding required.

Drops the `external_root` / `external_file` setup since it was only
needed to avoid `/dev/null`.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
erneestoc added a commit to erneestoc/nativelink that referenced this pull request May 21, 2026
Previously download_to_directory pushed every file's fetch+hardlink as
one fused future into a single FuturesUnordered. Each future ran the
fetch (populate_fast_store) and then the hardlink sequentially within
the same task. With per-future store-side serialization the fetch leg
becomes the bottleneck for SwiftCompile-style workloads (1978+ input
files at p95), where blob-fetch wall time dominates total action time.

This commit restructures download_to_directory into a two-phase
pipeline driven by try_join3:

  1. Fetcher launches ALL missing-blob populate_fast_store calls up
     front, bounded at DOWNLOAD_TO_DIRECTORY_FETCH_CONCURRENCY = 128
     via stream::try_for_each_concurrent. Digests are deduplicated so a
     blob referenced by N files is only fetched once. As each blob
     arrives, every file index that needs it is sent through an mpsc
     channel.

  2. Hardlinker drains the channel and performs hardlink + chmod +
     set_mtime, bounded at DOWNLOAD_TO_DIRECTORY_HARDLINK_CONCURRENCY
     = 64 (the same cap that PR TraceMachina#2338 added to compose with APFS
     metadata-lock behavior on macOS). Hardlinking starts as soon as
     the first blob arrives - it does not wait for the fetcher to
     finish.

  3. Subdirectory recursion and symlink creation run in a separate
     FuturesUnordered alongside the file pipeline.

Net win on a Mac-Mini RBE worker: fetch wall time becomes
max(per-blob latency) bounded by the 128-cap, not sum(per-blob latency)
serialized inside one future.

Concurrency caps compose: fetch (128) is network/store-bound where extra
in-flight requests are cheap; hardlink (64) is metadata-bound where
extra in-flight syscalls fight APFS's per-volume lock on macOS. The
two phases run independently with their own bound, so on cache-hot
runs only the hardlink leg is active and the fetch cap is irrelevant.

Tests:

- download_to_directory_fetches_blobs_concurrently: a per-blob 75 ms
  sleep on the slow store. Serial fetch of 20 blobs would be 1500 ms;
  concurrent fetch finishes well under 500 ms, and max_in_flight is
  observed to reach >=4 (would be 1 if serial).

- download_to_directory_streams_hardlinks_while_fetching: 30 blobs
  with 50 ms per-blob latency. Serial fetch+hardlink would be 1500 ms;
  pipelined version finishes under 750 ms. Catches regression to a
  fetch-all-then-hardlink-all two-phase shape that drops overlap.

Both tests use a minimal DelayedStore harness (StoreDriver wrapping
MemoryStore) added inline. This pulls in async-trait as a dev-dependency
(version 0.1.88, matching the rest of the workspace).

Scope notes (vs PR TraceMachina#2243 a8db953):

- The upstream commit depends on infrastructure introduced earlier in
  the PR that is NOT being ported here: batch_read_small_blobs,
  populate_fast_store_unchecked, get_file_entries_batch,
  BATCH_READ_MAX_BLOB_SIZE, BatchReadBlobs / ByteStream partitioning.
  These hooks are entangled with the peer-to-peer infrastructure and
  proto changes we explicitly excluded. The upstream commit was
  reduced to its core idea: launch all fetches upfront, hardlink as
  blobs arrive.

- Upstream uses a Notify + shared HashSet<DigestInfo> "fetched_set"
  to coordinate fetcher and producer. Here we use a simpler mpsc
  channel: the fetcher sends file indices directly as their blob
  becomes available. There is no separate producer step because there
  is no batch-vs-bytestream partitioning to defer to a second pass.

- Follow-on upstream commits 7b093e8 and b64944a are excluded
  per task scope - they depend on peer hints and proto changes.

Extracted from TraceMachina/nativelink PR TraceMachina#2243 (commit a8db953),
adapted to the current code shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants