worker: faster directory-cache materialization — hardlink CAS blobs, drop redundant walks, narrow lock#2359
Open
erneestoc wants to merge 4 commits into
Open
Conversation
`DirectoryCache::construct_directory` previously materialized every file by
fetching the whole blob into RAM (`get_part_unchunked`) and writing a full
copy (`fs::write`). For a cache that exists to avoid re-fetching from the
CAS, this is the dominant cost on a miss.
Switch the cache-entry file build to hardlink the FilesystemStore CAS blob
directly into the cache entry — zero-copy, metadata-only — exactly the way
`download_to_directory` already does on the fallback path:
`populate_fast_store` then `get_file_entry_for_digest` /
`get_file_path_locked` / `fs::hard_link`.
Correctness:
* A hardlinked CAS blob shares its inode with the CAS store and every
other action that hardlinked the same blob, so it must never be
chmod'd (the inode-corruption bug PR TraceMachina#2347 fixed). Executable files
(`FileNode.is_executable`) therefore get their own private inode via
fetch+write and are chmod'd 0o555 on that unshared copy — never
hardlinked.
* When the blob is not locally hardlinkable (the fast tier is not a
FilesystemStore, or the blob is absent / evicted from it), the file
falls back to fetch+write rather than failing the build.
* Zero-byte files keep their existing direct-write special case.
* The post-construction lockdown switches from `set_readonly_recursive`
(which chmods files, and would corrupt the shared CAS inode) to
`set_dir_writable_recursive`, which only touches directories.
`DirectoryCache::new` now takes the worker's `Arc<FastSlowStore>` so it can
reach `populate_fast_store` and downcast the fast tier to FilesystemStore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After `construct_directory`, the cache-miss path walked the materialized tree twice more: `calculate_directory_size` (an `fs::metadata` per file) to compute the LRU size, and a recursive permission pass to normalize directory modes. Both are now folded into construction itself. Size: `construct_directory` returns the total tree size, accumulated from `FileNode.digest.size_bytes` in the `Directory` protos it already decodes. This is also more correct than the old filesystem walk — it counts each file once by its CAS size and never follows symlinks into possibly-shared or external targets. Symlinks contribute nothing. Directory mode: each cache-entry directory is chmod'd 0o755 the moment it is created (`create_dir_writable`), umask-independent. The directory is writable while it is populated and that is its stable final mode, so the separate post-construction `set_dir_writable_recursive` walk is gone. Cache-entry files are still never chmod'd here — they may be CAS-blob hardlinks (OPT #1) and mutating their mode would corrupt the shared inode. Reconciliation with PR TraceMachina#2357: that PR reworks `set_readonly_recursive` so the recursive walk leaves dirs 0o755 / files 0o555. This commit removes the directory-cache build's dependence on any such recursive walk entirely — modes are set at creation. Whichever lands second, the rebase is a straight delete of the now-unused call site; there is no semantic conflict because both converge on 0o755 directories, and TraceMachina#2357's file handling is irrelevant here since the cache build no longer touches file modes at all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache write lock was held across syscall-heavy I/O, serializing every
concurrent `get_or_create`:
* On the cache-hit paths, `cache.write()` was held across the whole
`hardlink_directory_tree` (clonefile / per-file hardlink) materialization.
* `evict_lru` ran `set_dir_writable_recursive` + `remove_dir_all` on the
evicted tree under the write lock during a cache miss.
Lock narrowing:
* `acquire_entry` / `release_entry` take the write lock only to bump and
drop a `ref_count` pin and snapshot the entry path; the
`hardlink_directory_tree` materialization runs fully unlocked. The pin is
what makes this safe — `evict_lru` never selects an entry with
`ref_count > 0`, so the cache tree cannot be deleted mid-hardlink. The
newly constructed entry is likewise inserted pre-pinned (`ref_count: 1`)
and unpinned only after its destination hardlink completes; otherwise a
concurrent miss for an unrelated digest could evict the brand-new entry
(its `last_access` is recent but it is the only unpinned one) while this
caller is still hardlinking from it.
* `evict_if_needed` / `evict_lru` are now pure in-memory: they select
victims and remove them from the map under the lock, returning the
victim paths. `dispatch_evictions` then performs the chmod + removal on
a `background_spawn` task, off the lock.
Single-flight: the existing per-digest construction mutex already ensures a
digest is constructed once while N callers wait; this commit additionally
unmaps the per-digest mutex (`forget_construction_lock`) once construction
finishes so `construction_locks` no longer grows unbounded over the worker's
lifetime. Unmapping is race-free: a waiter has already cloned the `Arc<Mutex>`
before blocking, and a late arrival that creates a fresh mutex still re-checks
the cache, finds the entry, and takes the fast hardlink path — never a
redundant construct.
`ref_count` / `CachedDirectoryMetadata` semantics are unchanged; the
hit/miss return contract is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MarcusSorealheis
approved these changes
May 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Three atomic commits that cut the cost of
DirectoryCachematerialization onthe worker — the path that dominates input-tree materialization wall time.
Part of the input-materialization performance work (related to #2243), kept as
a small standalone change.
On a
DirectoryCachemiss,construct_directoryrebuilds the entire inputtree: for every file it reads the whole blob into memory (
get_part_unchunked)and writes a full byte copy (
fs::write), then walks the finished tree twicemore (
set_readonly_recursive,calculate_directory_size). The worker'sfallback path,
download_to_directory, already materializes files byhardlinking the CAS blob — this brings the directory-cache path in line and
removes the redundant walks.
f23dffa7— hardlink CAS blobs.create_filehardlinks theFilesystemStoreCAS blob directly into the cache entry(
FileEntry::get_file_path_locked+fs::hard_link), mirroringdownload_to_directory. Eliminates the read-into-memory and the byte copy forevery non-executable file present in the local fast tier. Executable files get
a private inode (fetch + write + chmod) — a hardlinked CAS blob must never be
chmod'd, that mutates the shared inode (the invariant fixed in #2347). Blobsnot present in the filesystem tier fall back to fetch + write.
DirectoryCache::newnow takesArc<FastSlowStore>(was a genericStore) toreach the typed
FilesystemStorepath APIs.9425d4c8— drop two redundant walks. Entry size is accumulated fromFileNode.digest.size_bytesduring construction (removes thecalculate_directory_sizefilesystem walk); file/directory modes are set atcreation time (removes the separate
set_readonly_recursivepost-walk).33d5b1ba— narrow the lock + single-flight. The cacheRwLockwriteguard is held only for in-memory map mutations; the clone to the destination
runs unlocked, with the entry pinned via
ref_countso a concurrent callercannot evict it mid-materialize. Eviction's filesystem I/O is moved off the
lock onto a background task. Concurrent callers for the same digest now
construct it once and share the result (single-flight).
Materialized output is byte-identical to before.
Type of change
not work as expected)
How Has This Been Tested?
bazel test //nativelink-worker/... //nativelink-util/... //nativelink-store/...— 51/51 pass (the build also runs the clippy and rustfmt aspects).
New tests in
directory_cache.rs/tests/directory_cache_test.rscover: ahardlinked file shares the CAS inode; an executable gets a private inode and
leaves the CAS blob's mode untouched; materialized content round-trips
byte-identical; size accounting from digest sums, including nested trees;
concurrent single-flight (16 simultaneous callers for one digest → exactly 1
construction, 15 hits); cache-entry directory modes.
Beyond the test suite, this was exercised in a real-world application build:
input-materialization time dropped roughly 2x on a clean build with a cold
cache. That is a real-workload measurement rather than a synthetic
microbenchmark — measured clean-build/cold-cache; it should carry over to
real-world incremental builds as well.
Checklist
bazel test //...passes locallygit amendsee some docsThis change is