Skip to content

filesystem_store: fix emplace_file race when two writers target the same key#2341

Closed
erneestoc wants to merge 3 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-emplace-race-upstream
Closed

filesystem_store: fix emplace_file race when two writers target the same key#2341
erneestoc wants to merge 3 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-emplace-race-upstream

Conversation

@erneestoc
Copy link
Copy Markdown
Contributor

@erneestoc erneestoc commented May 17, 2026

What

When two threads write the same blob concurrently, emplace_file() can leave one writer pointing at a deleted temp file, causing an ENOENT on rename.

The race in plain language

emplace_file() is the function that takes a freshly-written temp file and atomically moves it to its digest-keyed home in the filesystem CAS, registering it in the LRU evicting map.

  1. Thread A writes blob X → calls emplace_file() → inserts entry A into the evicting map.
  2. Thread B (also writing the same blob X) → calls emplace_file() → its insert replaces entry A with entry B. Replacement triggers unref() on entry A, which deletes A's temp file.
  3. Thread A resumes after acquiring the write lock and checks evicting_map.get(&key).is_none() — but the key is present (entry B is there now), so the check passes.
  4. Thread A proceeds to rename(temp_file → final_path) — and gets ENOENT because B's unref already deleted A's temp file.

The bug is in the existence check: it asks "is some entry present under this key?" when it needs to ask "is my entry still the one in the map?".

The fix

Replace the key-presence check with an Arc::ptr_eq check against our own entry. If the map's entry for this key isn't the same Arc we inserted, treat it the same as eviction — bail out without renaming.

let still_ours = match evicting_map.get(&key).await {
    Some(map_entry) => Arc::ptr_eq(&map_entry, &entry),
    None => false,
};
if !still_ours {
    info!(%key, "Got eviction or replacement while emplacing, dropping");
    return Ok(());
}

This is the same pattern already used downstream in the error-handling path at line 884 (Arc::<Fe>::ptr_eq), so it's not introducing a new idiom — it's making the success path consistent with the existing failure path.

Symptom this prevents

Rare ENOENT failures during CAS writes under parallel-action load on a single worker. Easy to miss because:

  • It only fires when two writes for the same digest race on the same worker.
  • The failure surfaces as "file not found" right after a "file written" log line, which looks like an external race or a corrupted store rather than an internal race.
  • It scales with concurrent actions per worker, so it stays dormant on lightly-loaded workers and shows up under load.

Provenance

Cherry-picked from PR #2243 (commit e020168). Isolated and rebased onto current main so it can land independently of the larger PR.

Test plan

  • cargo build -p nativelink-store — clean.
  • cargo test -p nativelink-store --lib — green.
  • cargo clippy -p nativelink-store --lib -- -D warnings — clean.
  • cargo fmt -p nativelink-store — clean.
  • No new test added — the existing concurrency-focused tests live in tests/filesystem_store_test.rs with substantial scaffolding; reproducing the race deterministically requires injecting an ordering hook. Happy to add one if reviewers want — open to suggestions on the right shape.

🤖 Generated with Claude Code


This change is Reviewable

erneestoc and others added 3 commits May 16, 2026 18:05
… same key

When two threads write the same blob concurrently, both call
emplace_file(). The second insert replaces the first entry in the
evicting map, triggering unref() which deletes the first thread's
temp file. The first thread then checks if the key exists (it does —
the replacement entry), proceeds to rename its deleted temp file,
and gets ENOENT.

Fix: use Arc::ptr_eq to verify our specific entry is still in the
map, not just that the key exists. This matches the same pattern
already used in the error-handling path at line 966.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cherry-picked from upstream nativelink PR TraceMachina#2243 commit e020168.
…to_directory to fix reader-side emplace race

The writer-side fix in the parent commit (filesystem_store: Fix
emplace_file race) addresses concurrent writes for the same key.
With that fix in place, Buildstream CI on PR TraceMachina#2341 surfaced a
sibling READER-side race in download_to_directory: the per-file
hard_link runs OUTSIDE the FileEntry's read lock, so a concurrent
writer's unref() (triggered when an insert() replaces an existing
entry) can rename the on-disk file out of content/<digest>
between the reader capturing the path and the hard_link syscall
executing. The reader then sees ENOENT and surfaces the
misleading "file was likely evicted from cache" error.

The pre-existing comment in emplace_file makes the contract
explicit: "Since we hold a write lock still anyone that gets our
new FileEntry (which has not yet been placed on disk) will not be
able to read the file's contents until we release the lock." The
reader code violated that contract.

Fix: move the fs::hard_link call INSIDE the get_file_path_locked
closure so the per-FileEntry read lock is held for the duration
of the syscall. The writer's unref() can no longer interleave
because its write().await is blocked until the reader's hard_link
completes.

Deadlock concern (TODO TraceMachina#2051 was left near this site as a flag
for "deadlock with large number of files"): the lock here is a
per-FileEntry read lock, not a global lock. Multiple concurrent
hard_link calls against DIFFERENT digests do not contend, and
multiple readers of the SAME digest share the read lock. The only
contention is reader-vs-writer on the same digest, which is
exactly the synchronization needed for correctness. The outer
concurrency cap on download_to_directory is governed by
fs::hard_link's open-file semaphore (nativelink_util::fs), not by
this RwLock. The TODO is updated rather than removed to flag a
revisit if the file-handle semaphore proves insufficient under
very large fan-outs.

Test: download_to_directory_holds_lock_across_hard_link
deterministically reproduces the race by draining
OPEN_FILE_SEMAPHORE (the same trick upload_with_single_permit
uses) to pin the reader at hard_link, polling try_write() on the
FileEntry's encoded_file_path to detect that the reader has
reached get_file_path_locked, then spawning an "evictor" that
takes the same write lock and renames the file (mimicking what
FilesystemStore::unref() does under the same lock). With the
pre-fix code, the evictor wins the lock race and the reader's
hard_link returns NotFound. With the post-fix code, the
evictor's write().await is blocked by the reader's held read
lock until the reader's hard_link completes atomically against
the rename.

Verified locally: reverting just the running_actions_manager.rs
change (git stash) makes the test fail with NotFound; restoring
the fix makes it pass consistently (5/5 runs in ~0.11s each).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oser-entry emplace race

The first reader-side fix (commit a2f4f4a) wrapped `fs::hard_link`
inside `FileEntry::get_file_path_locked` so the per-entry read lock
was held across the syscall. Buildstream CI on PR TraceMachina#2341 STILL
surfaced the same ENOENT "file was likely evicted from cache"
failure, hitting the action's max-retry budget (4 > 3) and
cancelling jobs that referenced files like `usr/bin/perl5.26.1`
across multiple digests.

The first fix protected the WINNER-entry case: a reader holding
`Arc<entry_A>` cannot have entry_A's content file renamed out
mid-hard_link, because a concurrent `unref(entry_A)` (which takes
the same RwLock as a writer) must wait for the reader's read lock.

It did NOT protect the LOSER-entry case:

  1. Reader R calls `get_file_entry_for_digest(d)` and the
     evicting map returns `Arc<entry_A>` (the entry currently
     under d in the map).
  2. Concurrent writer B finishes an emplace for the SAME key.
     `EvictingMap::insert(B)` displaces A, calling `unref(A)`.
     `unref(A)` takes A's write lock and renames A's file from
     `content_path/<d>` to A's own temp path. A's file is now
     gone from the content path.
  3. R then calls `get_file_path_locked(A)`. The read lock now
     correctly serializes with the (already-completed) unref —
     but R's captured path points at A, and A's file is gone.
     `fs::hard_link` returns ENOENT. The CAS still has the digest
     under entry_B's content path; B is in the map; B's file is
     on disk. Re-fetching the entry from the map returns B and
     a second hard_link attempt against B's path succeeds.

Fix: bounded retry inside `download_to_directory`. On Code::NotFound
the reader sleeps for a 10ms backoff (giving any racing writer's
`emplace_file` background spawn time to finish renaming the temp
into the content path), re-fetches the entry from the map (which
now returns the winning writer's entry), and retries the hard_link.
Capped at HARDLINK_MAX_RETRIES = 3 so that genuine eviction-pressure
ENOENT (no writer racing, the digest truly is gone) cannot spin —
the existing "max_bytes too small" guidance is preserved on the
post-budget error path.

Retry budget choice: 3 attempts (= 1 original + 2 retries).
Production traces show at most one displacement per concurrent
write cycle for a given digest; a single retry resolves the
documented race. Two retries gives one extra slot of headroom for
the rare case where a third writer enters the cycle between
attempts. Going higher risks masking the real eviction-pressure
case the original error message was designed to surface.

Test: download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink

  * Pre-populates digest with entry_A in the evicting_map at
    content_path/<d>.
  * Constructs a synthetic entry_B pointing at the same content
    path (via test-only constructors on `SharedContext` and
    `EncodedFilePath` added in this commit) and inserts it under
    the same key. The map's real `insert` calls `LenEntry::unref(A)`
    which renames A's file out — same code path as a real writer's
    displacement.
  * Spawns the reader (`download_to_directory`). Spawns a
    restorer task that sleeps 2ms then writes fresh bytes back
    into content_path/<d> (mimicking writer B's emplace having
    completed its rename).
  * Without the retry the reader's single hard_link runs against
    the still-missing content path and fails with NotFound. With
    the retry, attempt 1 fails, the loop sleeps 10ms (during
    which the restorer's write lands), attempt 2 finds the file
    and the hard_link succeeds.

Empirical FAIL-at-HEAD~1 / PASS-at-HEAD proof:

  * Locally toggled HARDLINK_MAX_RETRIES from 3 to 1 (effectively
    no retry); the new test failed deterministically with:
      "Error { code: NotFound, messages: [\"No such file or
      directory (os error 2)\", \"Could not make hardlink ... after
      1 attempts ...\"] }"
  * Restored HARDLINK_MAX_RETRIES = 3; the new test passes 5/5
    runs in ~0.03s each, and the full
    running_actions_manager_test binary passes all 32 tests
    (the prior 31 + the new one).

Re: TODO TraceMachina#2051 (deadlock with large number of files) — unchanged
from the analysis in the parent commit. The per-FileEntry read
lock is per-digest, not global; concurrency for download_to_directory
remains governed by `fs::hard_link`'s open-file semaphore.

Scaffolding additions in nativelink-store are doc(hidden) and
*_for_test-named:
  * `FsEvictingMap` type alias made pub so external tests can
    name the evicting_map return type.
  * `SharedContext::new_for_test(temp_path, content_path)`.
  * `EncodedFilePath::new_content_for_test(shared_ctx, key)`.
  * `FilesystemStore::evicting_map_for_test()`.
None are used by production code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erneestoc erneestoc closed this May 17, 2026
@erneestoc erneestoc deleted the ec/pr2243-emplace-race-upstream branch May 17, 2026 04:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant