filesystem_store: fix emplace_file race when two writers target the same key by erneestoc · Pull Request #2341 · TraceMachina/nativelink

erneestoc · 2026-05-17T01:06:19Z

What

When two threads write the same blob concurrently, emplace_file() can leave one writer pointing at a deleted temp file, causing an ENOENT on rename.

The race in plain language

emplace_file() is the function that takes a freshly-written temp file and atomically moves it to its digest-keyed home in the filesystem CAS, registering it in the LRU evicting map.

Thread A writes blob X → calls emplace_file() → inserts entry A into the evicting map.
Thread B (also writing the same blob X) → calls emplace_file() → its insert replaces entry A with entry B. Replacement triggers unref() on entry A, which deletes A's temp file.
Thread A resumes after acquiring the write lock and checks evicting_map.get(&key).is_none(). The key is present (entry B is there now), so is_none() returns false and A does not take the early-return path.
Thread A proceeds to rename(temp_file → final_path) — and gets ENOENT because B's unref already deleted A's temp file.

The bug is in the existence check: it asks "is some entry present under this key?" when it needs to ask "is my entry still the one in the map?".
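The difference between the two questions can be replayed in a few lines. This is a minimal sketch, not the store's code: it assumes a plain HashMap in place of the evicting map, and `Entry`, `key_present`, `still_ours`, and `replay_race` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the store's file entry; only identity matters here.
struct Entry {
    temp_path: String,
}

/// The buggy question: "is some entry present under this key?"
fn key_present(map: &HashMap<String, Arc<Entry>>, key: &str) -> bool {
    map.get(key).is_some()
}

/// The intended question: "is my entry still the one in the map?"
fn still_ours(map: &HashMap<String, Arc<Entry>>, key: &str, mine: &Arc<Entry>) -> bool {
    match map.get(key) {
        Some(current) => Arc::ptr_eq(current, mine),
        None => false,
    }
}

/// Replays the interleaving: A inserts, B replaces, then A runs both checks.
/// Returns (key_present, still_ours) as seen by writer A.
fn replay_race() -> (bool, bool) {
    let mut map = HashMap::new();
    let key = "blob-x".to_string();

    let entry_a = Arc::new(Entry { temp_path: "tmp/a".into() });
    map.insert(key.clone(), Arc::clone(&entry_a));

    // Writer B replaces A's entry under the same key. In the real store this
    // displacement is what triggers unref() on A's entry.
    let entry_b = Arc::new(Entry { temp_path: "tmp/b".into() });
    let displaced = map.insert(key.clone(), Arc::clone(&entry_b));
    assert_eq!(displaced.map(|e| e.temp_path.clone()), Some("tmp/a".to_string()));

    (key_present(&map, &key), still_ours(&map, &key, &entry_a))
}
```

After the replacement, the presence check still reports true for writer A while the identity check reports false, which is the case the old code could not tell apart from "nothing happened".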

The fix

Replace the key-presence check with an Arc::ptr_eq check against our own entry. If the map's entry for this key isn't the same Arc we inserted, treat it the same as eviction — bail out without renaming.

let still_ours = match evicting_map.get(&key).await {
    Some(map_entry) => Arc::ptr_eq(&map_entry, &entry),
    None => false,
};
if !still_ours {
    info!(%key, "Got eviction or replacement while emplacing, dropping");
    return Ok(());
}

This is the same pattern already used downstream in the error-handling path at line 884 (Arc::<Fe>::ptr_eq), so it's not introducing a new idiom — it's making the success path consistent with the existing failure path.

Symptom this prevents

Rare ENOENT failures during CAS writes under parallel-action load on a single worker. Easy to miss because:

It only fires when two writes for the same digest race on the same worker.
The failure surfaces as "file not found" right after a "file written" log line, which looks like an external race or a corrupted store rather than an internal race.
It scales with concurrent actions per worker, so it stays dormant on lightly-loaded workers and shows up under load.

Provenance

Cherry-picked from PR #2243 (commit e020168). Isolated and rebased onto current main so it can land independently of the larger PR.

Test plan

cargo build -p nativelink-store — clean.
cargo test -p nativelink-store --lib — green.
cargo clippy -p nativelink-store --lib -- -D warnings — clean.
cargo fmt -p nativelink-store — clean.
No new test added — the existing concurrency-focused tests live in tests/filesystem_store_test.rs with substantial scaffolding; reproducing the race deterministically requires injecting an ordering hook. Happy to add one if reviewers want — open to suggestions on the right shape.
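One possible shape for such an ordering hook, offered only as a starting point for discussion: the emplace step takes a closure that runs between the insert and the ownership check, so a test can force the competing insert at exactly that point. Everything here is hypothetical (`Entry`, `Map`, `Outcome`, `emplace_with_hook`, `replay_with_competing_writer`); it uses a std Mutex and HashMap rather than the store's async evicting map.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a file entry; the field only labels the writer.
#[allow(dead_code)]
struct Entry {
    writer: &'static str,
}

/// Hypothetical stand-in for the evicting map.
type Map = Mutex<HashMap<String, Arc<Entry>>>;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// The entry was still ours, so the rename would proceed.
    Renamed,
    /// The entry was evicted or replaced, so the rename is skipped.
    Dropped,
}

/// Inserts an entry, runs the test-supplied hook, then applies the
/// identity check before deciding whether to rename.
fn emplace_with_hook(map: &Map, key: &str, between_insert_and_check: impl FnOnce()) -> Outcome {
    let entry = Arc::new(Entry { writer: "A" });
    map.lock().unwrap().insert(key.to_string(), Arc::clone(&entry));

    // Ordering hook: the lock is not held here, so the hook may mutate the map.
    between_insert_and_check();

    let still_ours = map
        .lock()
        .unwrap()
        .get(key)
        .map_or(false, |current| Arc::ptr_eq(current, &entry));
    if still_ours { Outcome::Renamed } else { Outcome::Dropped }
}

/// Forces writer B's insert to land between A's insert and A's check.
fn replay_with_competing_writer() -> Outcome {
    let map: Map = Mutex::new(HashMap::new());
    emplace_with_hook(&map, "blob-x", || {
        map.lock()
            .unwrap()
            .insert("blob-x".to_string(), Arc::new(Entry { writer: "B" }));
    })
}
```

With a no-op hook the same function returns `Outcome::Renamed`, so one helper covers both the raced and the uncontended path.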

🤖 Generated with Claude Code

… same key

When two threads write the same blob concurrently, both call emplace_file(). The second insert replaces the first entry in the evicting map, triggering unref() which deletes the first thread's temp file. The first thread then checks if the key exists (it does: the replacement entry), proceeds to rename its deleted temp file, and gets ENOENT.

Fix: use Arc::ptr_eq to verify our specific entry is still in the map, not just that the key exists. This matches the same pattern already used in the error-handling path at line 966.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cherry-picked from upstream nativelink PR TraceMachina#2243 commit e020168.

…to_directory to fix reader-side emplace race

The writer-side fix in the parent commit (filesystem_store: Fix emplace_file race) addresses concurrent writes for the same key. With that fix in place, Buildstream CI on PR TraceMachina#2341 surfaced a sibling READER-side race in download_to_directory: the per-file hard_link runs OUTSIDE the FileEntry's read lock, so a concurrent writer's unref() (triggered when an insert() replaces an existing entry) can rename the on-disk file out of content/<digest> between the reader capturing the path and the hard_link syscall executing. The reader then sees ENOENT and surfaces the misleading "file was likely evicted from cache" error.

The pre-existing comment in emplace_file makes the contract explicit: "Since we hold a write lock still anyone that gets our new FileEntry (which has not yet been placed on disk) will not be able to read the file's contents until we release the lock." The reader code violated that contract.

Fix: move the fs::hard_link call INSIDE the get_file_path_locked closure so the per-FileEntry read lock is held for the duration of the syscall. The writer's unref() can no longer interleave because its write().await is blocked until the reader's hard_link completes.

Deadlock concern (TODO TraceMachina#2051 was left near this site as a flag for "deadlock with large number of files"): the lock here is a per-FileEntry read lock, not a global lock. Multiple concurrent hard_link calls against DIFFERENT digests do not contend, and multiple readers of the SAME digest share the read lock. The only contention is reader-vs-writer on the same digest, which is exactly the synchronization needed for correctness. The outer concurrency cap on download_to_directory is governed by fs::hard_link's open-file semaphore (nativelink_util::fs), not by this RwLock. The TODO is updated rather than removed to flag a revisit if the file-handle semaphore proves insufficient under very large fan-outs.

Test: download_to_directory_holds_lock_across_hard_link deterministically reproduces the race by draining OPEN_FILE_SEMAPHORE (the same trick upload_with_single_permit uses) to pin the reader at hard_link, polling try_write() on the FileEntry's encoded_file_path to detect that the reader has reached get_file_path_locked, then spawning an "evictor" that takes the same write lock and renames the file (mimicking what FilesystemStore::unref() does under the same lock). With the pre-fix code, the evictor wins the lock race and the reader's hard_link returns NotFound. With the post-fix code, the evictor's write().await is blocked by the reader's held read lock until the reader's hard_link completes atomically against the rename.

Verified locally: reverting just the running_actions_manager.rs change (git stash) makes the test fail with NotFound; restoring the fix makes it pass consistently (5/5 runs in ~0.11s each).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
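The idea of holding the per-entry read lock across the hard_link syscall can be sketched with the standard library alone. This is an illustration under assumptions, not the nativelink code: it uses a blocking std::sync::RwLock rather than an async lock, and `FileEntry`, `with_path_locked`, `move_out`, and `link_under_lock` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::RwLock;

/// Hypothetical entry: the lock guards the path the content currently lives at.
struct FileEntry {
    path: RwLock<PathBuf>,
}

impl FileEntry {
    /// Runs `f` while the read lock is held, so the path cannot be renamed
    /// away until `f` returns.
    fn with_path_locked<T>(&self, f: impl FnOnce(&Path) -> io::Result<T>) -> io::Result<T> {
        let guard = self.path.read().expect("lock poisoned");
        f(guard.as_path())
    }

    /// Writer side (stand-in for what an unref-style move does): takes the
    /// write lock, renames the file, and records the new location.
    fn move_out(&self, new_path: PathBuf) -> io::Result<()> {
        let mut guard = self.path.write().expect("lock poisoned");
        fs::rename(guard.as_path(), &new_path)?;
        *guard = new_path;
        Ok(())
    }
}

/// Reader side: the hard_link call happens inside the locked closure,
/// not after the path has been copied out of it.
fn link_under_lock(entry: &FileEntry, dest: &Path) -> io::Result<()> {
    entry.with_path_locked(|src| fs::hard_link(src, dest))
}
```

Because `move_out` needs the write lock, it waits for any reader that is inside `with_path_locked`, so the link and the rename never interleave on the same entry.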

…oser-entry emplace race

The first reader-side fix (commit a2f4f4a) wrapped `fs::hard_link` inside `FileEntry::get_file_path_locked` so the per-entry read lock was held across the syscall. Buildstream CI on PR TraceMachina#2341 STILL surfaced the same ENOENT "file was likely evicted from cache" failure, hitting the action's max-retry budget (4 > 3) and cancelling jobs that referenced files like `usr/bin/perl5.26.1` across multiple digests.

The first fix protected the WINNER-entry case: a reader holding `Arc<entry_A>` cannot have entry_A's content file renamed out mid-hard_link, because a concurrent `unref(entry_A)` (which takes the same RwLock as a writer) must wait for the reader's read lock. It did NOT protect the LOSER-entry case:

1. Reader R calls `get_file_entry_for_digest(d)` and the evicting map returns `Arc<entry_A>` (the entry currently under d in the map).
2. Concurrent writer B finishes an emplace for the SAME key. `EvictingMap::insert(B)` displaces A, calling `unref(A)`. `unref(A)` takes A's write lock and renames A's file from `content_path/<d>` to A's own temp path. A's file is now gone from the content path.
3. R then calls `get_file_path_locked(A)`. The read lock now correctly serializes with the (already-completed) unref, but R's captured path points at A, and A's file is gone. `fs::hard_link` returns ENOENT.

The CAS still has the digest under entry_B's content path; B is in the map; B's file is on disk. Re-fetching the entry from the map returns B and a second hard_link attempt against B's path succeeds.

Fix: bounded retry inside `download_to_directory`. On Code::NotFound the reader sleeps for a 10ms backoff (giving any racing writer's `emplace_file` background spawn time to finish renaming the temp into the content path), re-fetches the entry from the map (which now returns the winning writer's entry), and retries the hard_link. Capped at HARDLINK_MAX_RETRIES = 3 so that genuine eviction-pressure ENOENT (no writer racing, the digest truly is gone) cannot spin; the existing "max_bytes too small" guidance is preserved on the post-budget error path.

Retry budget choice: 3 attempts (= 1 original + 2 retries). Production traces show at most one displacement per concurrent write cycle for a given digest; a single retry resolves the documented race. Two retries gives one extra slot of headroom for the rare case where a third writer enters the cycle between attempts. Going higher risks masking the real eviction-pressure case the original error message was designed to surface.

Test: download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink

* Pre-populates digest with entry_A in the evicting_map at content_path/<d>.
* Constructs a synthetic entry_B pointing at the same content path (via test-only constructors on `SharedContext` and `EncodedFilePath` added in this commit) and inserts it under the same key. The map's real `insert` calls `LenEntry::unref(A)` which renames A's file out, the same code path as a real writer's displacement.
* Spawns the reader (`download_to_directory`). Spawns a restorer task that sleeps 2ms then writes fresh bytes back into content_path/<d> (mimicking writer B's emplace having completed its rename).
* Without the retry the reader's single hard_link runs against the still-missing content path and fails with NotFound. With the retry, attempt 1 fails, the loop sleeps 10ms (during which the restorer's write lands), attempt 2 finds the file and the hard_link succeeds.

Empirical FAIL-at-HEAD~1 / PASS-at-HEAD proof:

* Locally toggled HARDLINK_MAX_RETRIES from 3 to 1 (effectively no retry); the new test failed deterministically with: "Error { code: NotFound, messages: [\"No such file or directory (os error 2)\", \"Could not make hardlink ... after 1 attempts ...\"] }"
* Restored HARDLINK_MAX_RETRIES = 3; the new test passes 5/5 runs in ~0.03s each, and the full running_actions_manager_test binary passes all 32 tests (the prior 31 + the new one).

Re: TODO TraceMachina#2051 (deadlock with large number of files): unchanged from the analysis in the parent commit. The per-FileEntry read lock is per-digest, not global; concurrency for download_to_directory remains governed by `fs::hard_link`'s open-file semaphore.

Scaffolding additions in nativelink-store are doc(hidden) and *_for_test-named:

* `FsEvictingMap` type alias made pub so external tests can name the evicting_map return type.
* `SharedContext::new_for_test(temp_path, content_path)`.
* `EncodedFilePath::new_content_for_test(shared_ctx, key)`.
* `FilesystemStore::evicting_map_for_test()`.

None are used by production code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
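The bounded retry described in this commit message can be sketched as a small loop. This is a synchronous approximation under stated assumptions, not the code from the commit: `hard_link_with_retry` and `RETRY_BACKOFF` are hypothetical names, the closure stands in for "re-fetch the entry from the map and attempt the hard_link", and io::ErrorKind::NotFound stands in for Code::NotFound. The budget of 3 and the 10ms backoff are taken from the message above.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Total attempts allowed (1 original + 2 retries), as stated in the commit message.
const HARDLINK_MAX_RETRIES: u32 = 3;
/// Backoff between attempts; the 10ms value comes from the commit message.
const RETRY_BACKOFF: Duration = Duration::from_millis(10);

/// Calls `attempt_link` until it succeeds, fails with something other than
/// NotFound, or the attempt budget is spent. Returns the attempt number
/// that succeeded.
fn hard_link_with_retry<F>(mut attempt_link: F) -> io::Result<u32>
where
    F: FnMut() -> io::Result<()>,
{
    let mut attempt = 1;
    loop {
        match attempt_link() {
            Ok(()) => return Ok(attempt),
            // Only NotFound is treated as a possible displacement race.
            Err(e) if e.kind() == io::ErrorKind::NotFound && attempt < HARDLINK_MAX_RETRIES => {
                thread::sleep(RETRY_BACKOFF);
                attempt += 1;
            }
            // Budget spent, or an unrelated error: surface it unchanged.
            Err(e) => return Err(e),
        }
    }
}
```

Keeping the budget small means a digest that is genuinely gone still fails after a bounded delay (two backoffs in this sketch) instead of spinning.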

erneestoc and others added 3 commits May 16, 2026 18:05

erneestoc closed this May 17, 2026

erneestoc deleted the ec/pr2243-emplace-race-upstream branch May 17, 2026 04:59

erneestoc wants to merge 3 commits into TraceMachina:main from erneestoc:ec/pr2243-emplace-race-upstream

erneestoc commented May 17, 2026 • edited by MarcusSorealheis