filesystem_store: fix emplace_file race when two writers target the same key#2341
Closed
erneestoc wants to merge 3 commits into
Closed
filesystem_store: fix emplace_file race when two writers target the same key#2341erneestoc wants to merge 3 commits into
erneestoc wants to merge 3 commits into
Conversation
… same key When two threads write the same blob concurrently, both call emplace_file(). The second insert replaces the first entry in the evicting map, triggering unref() which deletes the first thread's temp file. The first thread then checks if the key exists (it does — the replacement entry), proceeds to rename its deleted temp file, and gets ENOENT. Fix: use Arc::ptr_eq to verify our specific entry is still in the map, not just that the key exists. This matches the same pattern already used in the error-handling path at line 966. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Cherry-picked from upstream nativelink PR TraceMachina#2243 commit e020168.
…to_directory to fix reader-side emplace race The writer-side fix in the parent commit (filesystem_store: Fix emplace_file race) addresses concurrent writes for the same key. With that fix in place, Buildstream CI on PR TraceMachina#2341 surfaced a sibling READER-side race in download_to_directory: the per-file hard_link runs OUTSIDE the FileEntry's read lock, so a concurrent writer's unref() (triggered when an insert() replaces an existing entry) can rename the on-disk file out of content/<digest> between the reader capturing the path and the hard_link syscall executing. The reader then sees ENOENT and surfaces the misleading "file was likely evicted from cache" error. The pre-existing comment in emplace_file makes the contract explicit: "Since we hold a write lock still anyone that gets our new FileEntry (which has not yet been placed on disk) will not be able to read the file's contents until we release the lock." The reader code violated that contract. Fix: move the fs::hard_link call INSIDE the get_file_path_locked closure so the per-FileEntry read lock is held for the duration of the syscall. The writer's unref() can no longer interleave because its write().await is blocked until the reader's hard_link completes. Deadlock concern (TODO TraceMachina#2051 was left near this site as a flag for "deadlock with large number of files"): the lock here is a per-FileEntry read lock, not a global lock. Multiple concurrent hard_link calls against DIFFERENT digests do not contend, and multiple readers of the SAME digest share the read lock. The only contention is reader-vs-writer on the same digest, which is exactly the synchronization needed for correctness. The outer concurrency cap on download_to_directory is governed by fs::hard_link's open-file semaphore (nativelink_util::fs), not by this RwLock. The TODO is updated rather than removed to flag a revisit if the file-handle semaphore proves insufficient under very large fan-outs. Test: download_to_directory_holds_lock_across_hard_link deterministically reproduces the race by draining OPEN_FILE_SEMAPHORE (the same trick upload_with_single_permit uses) to pin the reader at hard_link, polling try_write() on the FileEntry's encoded_file_path to detect that the reader has reached get_file_path_locked, then spawning an "evictor" that takes the same write lock and renames the file (mimicking what FilesystemStore::unref() does under the same lock). With the pre-fix code, the evictor wins the lock race and the reader's hard_link returns NotFound. With the post-fix code, the evictor's write().await is blocked by the reader's held read lock until the reader's hard_link completes atomically against the rename. Verified locally: reverting just the running_actions_manager.rs change (git stash) makes the test fail with NotFound; restoring the fix makes it pass consistently (5/5 runs in ~0.11s each). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oser-entry emplace race The first reader-side fix (commit a2f4f4a) wrapped `fs::hard_link` inside `FileEntry::get_file_path_locked` so the per-entry read lock was held across the syscall. Buildstream CI on PR TraceMachina#2341 STILL surfaced the same ENOENT "file was likely evicted from cache" failure, hitting the action's max-retry budget (4 > 3) and cancelling jobs that referenced files like `usr/bin/perl5.26.1` across multiple digests. The first fix protected the WINNER-entry case: a reader holding `Arc<entry_A>` cannot have entry_A's content file renamed out mid-hard_link, because a concurrent `unref(entry_A)` (which takes the same RwLock as a writer) must wait for the reader's read lock. It did NOT protect the LOSER-entry case: 1. Reader R calls `get_file_entry_for_digest(d)` and the evicting map returns `Arc<entry_A>` (the entry currently under d in the map). 2. Concurrent writer B finishes an emplace for the SAME key. `EvictingMap::insert(B)` displaces A, calling `unref(A)`. `unref(A)` takes A's write lock and renames A's file from `content_path/<d>` to A's own temp path. A's file is now gone from the content path. 3. R then calls `get_file_path_locked(A)`. The read lock now correctly serializes with the (already-completed) unref — but R's captured path points at A, and A's file is gone. `fs::hard_link` returns ENOENT. The CAS still has the digest under entry_B's content path; B is in the map; B's file is on disk. Re-fetching the entry from the map returns B and a second hard_link attempt against B's path succeeds. Fix: bounded retry inside `download_to_directory`. On Code::NotFound the reader sleeps for a 10ms backoff (giving any racing writer's `emplace_file` background spawn time to finish renaming the temp into the content path), re-fetches the entry from the map (which now returns the winning writer's entry), and retries the hard_link. Capped at HARDLINK_MAX_RETRIES = 3 so that genuine eviction-pressure ENOENT (no writer racing, the digest truly is gone) cannot spin — the existing "max_bytes too small" guidance is preserved on the post-budget error path. Retry budget choice: 3 attempts (= 1 original + 2 retries). Production traces show at most one displacement per concurrent write cycle for a given digest; a single retry resolves the documented race. Two retries gives one extra slot of headroom for the rare case where a third writer enters the cycle between attempts. Going higher risks masking the real eviction-pressure case the original error message was designed to surface. Test: download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink * Pre-populates digest with entry_A in the evicting_map at content_path/<d>. * Constructs a synthetic entry_B pointing at the same content path (via test-only constructors on `SharedContext` and `EncodedFilePath` added in this commit) and inserts it under the same key. The map's real `insert` calls `LenEntry::unref(A)` which renames A's file out — same code path as a real writer's displacement. * Spawns the reader (`download_to_directory`). Spawns a restorer task that sleeps 2ms then writes fresh bytes back into content_path/<d> (mimicking writer B's emplace having completed its rename). * Without the retry the reader's single hard_link runs against the still-missing content path and fails with NotFound. With the retry, attempt 1 fails, the loop sleeps 10ms (during which the restorer's write lands), attempt 2 finds the file and the hard_link succeeds. Empirical FAIL-at-HEAD~1 / PASS-at-HEAD proof: * Locally toggled HARDLINK_MAX_RETRIES from 3 to 1 (effectively no retry); the new test failed deterministically with: "Error { code: NotFound, messages: [\"No such file or directory (os error 2)\", \"Could not make hardlink ... after 1 attempts ...\"] }" * Restored HARDLINK_MAX_RETRIES = 3; the new test passes 5/5 runs in ~0.03s each, and the full running_actions_manager_test binary passes all 32 tests (the prior 31 + the new one). Re: TODO TraceMachina#2051 (deadlock with large number of files) — unchanged from the analysis in the parent commit. The per-FileEntry read lock is per-digest, not global; concurrency for download_to_directory remains governed by `fs::hard_link`'s open-file semaphore. Scaffolding additions in nativelink-store are doc(hidden) and *_for_test-named: * `FsEvictingMap` type alias made pub so external tests can name the evicting_map return type. * `SharedContext::new_for_test(temp_path, content_path)`. * `EncodedFilePath::new_content_for_test(shared_ctx, key)`. * `FilesystemStore::evicting_map_for_test()`. None are used by production code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
When two threads write the same blob concurrently,
emplace_file()can leave one writer pointing at a deleted temp file, causing anENOENTon rename.The race in plain language
emplace_file()is the function that takes a freshly-written temp file and atomically moves it to its digest-keyed home in the filesystem CAS, registering it in the LRU evicting map.X→ callsemplace_file()→ inserts entry A into the evicting map.X) → callsemplace_file()→ its insert replaces entry A with entry B. Replacement triggersunref()on entry A, which deletes A's temp file.evicting_map.get(&key).is_none()— but the key is present (entry B is there now), so the check passes.rename(temp_file → final_path)— and gets ENOENT because B's unref already deleted A's temp file.The bug is in the existence check: it asks "is some entry present under this key?" when it needs to ask "is my entry still the one in the map?".
The fix
Replace the key-presence check with an
Arc::ptr_eqcheck against our own entry. If the map's entry for this key isn't the sameArcwe inserted, treat it the same as eviction — bail out without renaming.This is the same pattern already used downstream in the error-handling path at line 884 (
Arc::<Fe>::ptr_eq), so it's not introducing a new idiom — it's making the success path consistent with the existing failure path.Symptom this prevents
Rare
ENOENTfailures during CAS writes under parallel-action load on a single worker. Easy to miss because:Provenance
Cherry-picked from PR #2243 (commit e020168). Isolated and rebased onto current
mainso it can land independently of the larger PR.Test plan
cargo build -p nativelink-store— clean.cargo test -p nativelink-store --lib— green.cargo clippy -p nativelink-store --lib -- -D warnings— clean.cargo fmt -p nativelink-store— clean.tests/filesystem_store_test.rswith substantial scaffolding; reproducing the race deterministically requires injecting an ordering hook. Happy to add one if reviewers want — open to suggestions on the right shape.🤖 Generated with Claude Code
This change is