Skip to content

worker: faster directory-cache materialization — hardlink CAS blobs, drop redundant walks, narrow lock#2359

Open
erneestoc wants to merge 4 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-get-or-create-optimize
Open

worker: faster directory-cache materialization — hardlink CAS blobs, drop redundant walks, narrow lock#2359
erneestoc wants to merge 4 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-get-or-create-optimize

Conversation

@erneestoc
Copy link
Copy Markdown
Contributor

@erneestoc erneestoc commented May 22, 2026

Description

Three atomic commits that cut the cost of DirectoryCache materialization on
the worker — the path that dominates input-tree materialization wall time.
Part of the input-materialization performance work (related to #2243), kept as
a small standalone change.

On a DirectoryCache miss, construct_directory rebuilds the entire input
tree: for every file it reads the whole blob into memory (get_part_unchunked)
and writes a full byte copy (fs::write), then walks the finished tree twice
more (set_readonly_recursive, calculate_directory_size). The worker's
fallback path, download_to_directory, already materializes files by
hardlinking the CAS blob — this brings the directory-cache path in line and
removes the redundant walks.

f23dffa7 — hardlink CAS blobs. create_file hardlinks the
FilesystemStore CAS blob directly into the cache entry
(FileEntry::get_file_path_locked + fs::hard_link), mirroring
download_to_directory. Eliminates the read-into-memory and the byte copy for
every non-executable file present in the local fast tier. Executable files get
a private inode (fetch + write + chmod) — a hardlinked CAS blob must never be
chmod'd, that mutates the shared inode (the invariant fixed in #2347). Blobs
not present in the filesystem tier fall back to fetch + write.
DirectoryCache::new now takes Arc<FastSlowStore> (was a generic Store) to
reach the typed FilesystemStore path APIs.

9425d4c8 — drop two redundant walks. Entry size is accumulated from
FileNode.digest.size_bytes during construction (removes the
calculate_directory_size filesystem walk); file/directory modes are set at
creation time (removes the separate set_readonly_recursive post-walk).

33d5b1ba — narrow the lock + single-flight. The cache RwLock write
guard is held only for in-memory map mutations; the clone to the destination
runs unlocked, with the entry pinned via ref_count so a concurrent caller
cannot evict it mid-materialize. Eviction's filesystem I/O is moved off the
lock onto a background task. Concurrent callers for the same digest now
construct it once and share the result (single-flight).

Materialized output is byte-identical to before.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

bazel test //nativelink-worker/... //nativelink-util/... //nativelink-store/...
— 51/51 pass (the build also runs the clippy and rustfmt aspects).

New tests in directory_cache.rs / tests/directory_cache_test.rs cover: a
hardlinked file shares the CAS inode; an executable gets a private inode and
leaves the CAS blob's mode untouched; materialized content round-trips
byte-identical; size accounting from digest sums, including nested trees;
concurrent single-flight (16 simultaneous callers for one digest → exactly 1
construction, 15 hits); cache-entry directory modes.

Beyond the test suite, this was exercised in a real-world application build:
input-materialization time dropped roughly 2x on a clean build with a cold
cache. That is a real-workload measurement rather than a synthetic
microbenchmark — measured clean-build/cold-cache; it should carry over to
real-world incremental builds as well.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

erneestoc and others added 3 commits May 22, 2026 00:06
`DirectoryCache::construct_directory` previously materialized every file by
fetching the whole blob into RAM (`get_part_unchunked`) and writing a full
copy (`fs::write`). For a cache that exists to avoid re-fetching from the
CAS, this is the dominant cost on a miss.

Switch the cache-entry file build to hardlink the FilesystemStore CAS blob
directly into the cache entry — zero-copy, metadata-only — exactly the way
`download_to_directory` already does on the fallback path:
`populate_fast_store` then `get_file_entry_for_digest` /
`get_file_path_locked` / `fs::hard_link`.

Correctness:
  * A hardlinked CAS blob shares its inode with the CAS store and every
    other action that hardlinked the same blob, so it must never be
    chmod'd (the inode-corruption bug PR TraceMachina#2347 fixed). Executable files
    (`FileNode.is_executable`) therefore get their own private inode via
    fetch+write and are chmod'd 0o555 on that unshared copy — never
    hardlinked.
  * When the blob is not locally hardlinkable (the fast tier is not a
    FilesystemStore, or the blob is absent / evicted from it), the file
    falls back to fetch+write rather than failing the build.
  * Zero-byte files keep their existing direct-write special case.
  * The post-construction lockdown switches from `set_readonly_recursive`
    (which chmods files, and would corrupt the shared CAS inode) to
    `set_dir_writable_recursive`, which only touches directories.

`DirectoryCache::new` now takes the worker's `Arc<FastSlowStore>` so it can
reach `populate_fast_store` and downcast the fast tier to FilesystemStore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After `construct_directory`, the cache-miss path walked the materialized
tree twice more: `calculate_directory_size` (an `fs::metadata` per file)
to compute the LRU size, and a recursive permission pass to normalize
directory modes. Both are now folded into construction itself.

Size: `construct_directory` returns the total tree size, accumulated from
`FileNode.digest.size_bytes` in the `Directory` protos it already decodes.
This is also more correct than the old filesystem walk — it counts each
file once by its CAS size and never follows symlinks into possibly-shared
or external targets. Symlinks contribute nothing.

Directory mode: each cache-entry directory is chmod'd 0o755 the moment it
is created (`create_dir_writable`), umask-independent. The directory is
writable while it is populated and that is its stable final mode, so the
separate post-construction `set_dir_writable_recursive` walk is gone.

Cache-entry files are still never chmod'd here — they may be CAS-blob
hardlinks (OPT #1) and mutating their mode would corrupt the shared inode.

Reconciliation with PR TraceMachina#2357: that PR reworks `set_readonly_recursive` so
the recursive walk leaves dirs 0o755 / files 0o555. This commit removes
the directory-cache build's dependence on any such recursive walk
entirely — modes are set at creation. Whichever lands second, the rebase
is a straight delete of the now-unused call site; there is no semantic
conflict because both converge on 0o755 directories, and TraceMachina#2357's file
handling is irrelevant here since the cache build no longer touches file
modes at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache write lock was held across syscall-heavy I/O, serializing every
concurrent `get_or_create`:

  * On the cache-hit paths, `cache.write()` was held across the whole
    `hardlink_directory_tree` (clonefile / per-file hardlink) materialization.
  * `evict_lru` ran `set_dir_writable_recursive` + `remove_dir_all` on the
    evicted tree under the write lock during a cache miss.

Lock narrowing:
  * `acquire_entry` / `release_entry` take the write lock only to bump and
    drop a `ref_count` pin and snapshot the entry path; the
    `hardlink_directory_tree` materialization runs fully unlocked. The pin is
    what makes this safe — `evict_lru` never selects an entry with
    `ref_count > 0`, so the cache tree cannot be deleted mid-hardlink. The
    newly constructed entry is likewise inserted pre-pinned (`ref_count: 1`)
    and unpinned only after its destination hardlink completes; otherwise a
    concurrent miss for an unrelated digest could evict the brand-new entry
    (its `last_access` is recent but it is the only unpinned one) while this
    caller is still hardlinking from it.
  * `evict_if_needed` / `evict_lru` are now pure in-memory: they select
    victims and remove them from the map under the lock, returning the
    victim paths. `dispatch_evictions` then performs the chmod + removal on
    a `background_spawn` task, off the lock.

Single-flight: the existing per-digest construction mutex already ensures a
digest is constructed once while N callers wait; this commit additionally
unmaps the per-digest mutex (`forget_construction_lock`) once construction
finishes so `construction_locks` no longer grows unbounded over the worker's
lifetime. Unmapping is race-free: a waiter has already cloned the `Arc<Mutex>`
before blocking, and a late arrival that creates a fresh mutex still re-checks
the cache, finds the entry, and takes the fast hardlink path — never a
redundant construct.

`ref_count` / `CachedDirectoryMetadata` semantics are unchanged; the
hit/miss return contract is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarcusSorealheis MarcusSorealheis requested a review from amankrx May 23, 2026 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants