Conversation
We should never call reset() on a value (such as vRes) than can be seen by another thread. This was causing random failures about 'partially applied built-in function' etc. (cherry picked from commit 839aec2)
Resolves NixOS#15420 Signed-off-by: Lisanna Dettwyler <lisanna.dettwyler@gmail.com>
…ToPath This is totally equivalent on unix, because the path is guaranteed to be absolute (there's a check right before the call to rootPath) and absPath is equivalent to canonPath when the path is absolute and resolveSymlinks is false (as in our case). There are a couple of reasons we want to do this: * Portability. Windows should always use unix-style paths in the evaluator. * Performance. We don't want to pay the overhead of constructing a std::filesystem::path.
Windows build can evaluate flakes
Fix "due not" → "do not" typo, and replace the incomplete sentence "Different stores that disagree." with a complete, accurate explanation.
We can reuse the asynchronous graph traversal logic from computeClosure to traverse the "references" edges asynchronously to avoid relying on the narinfo disk cache. queryMissing will get the same treatment. This is limited to the binary cache store implementation so that we don't needlessly spawn coroutines when the underlying queryPathInfo is synchronous.
Coroutines are cheap and the FileTransfer already implements its own rate limiting. This is needed just to limit the maximum number of coroutines in flight.
Make queryMissing, querySubstitutablePaths, topoSortPaths async
Fix compatibility with lowdown 3
In 08887ca I neglected the case of interrupts and got bitten by destruction order. fd gets closed before FdSink destructor gets a chance to run. This doesn't matter in the successful code path, but does get hit during interrupts and leads to a bunch of annoying ignored error messages: error (ignored): write of 16384 bytes: Bad file descriptor error (ignored): write of 16384 bytes: Bad file descriptor
…in-destructor libutil/fs-sink: Flush FdSink before closing the file descriptor
Make even more parts of the evaluator thread-safe
This wasn't handling Interrupted, since it's a subclass of BaseError, not Error. Since this is called from handleExceptions() while printing exceptions, the uncaught exception here was causing Nix to crash with a call to std::unexpected().
This hardens Nix against uncaught exceptions (like Interrupted) while printing exceptions.
My SNAFU from initially implementing computeClosure and callbackToAwaitable. * computeClosure must ensure that all callbacks have been invoked before destroying the event loop. That wasn't the case when we hit an early error. * Callbacks must not take (shared) ownership of the shared_ptr when the completion has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl thread: 0x00007f64d2b99e4e _ZNSt23_Sp_counted_ptr_inplaceIN5boost4asio6detail23strand_executor_service11strand_implENS1_17execution_context9allocatorIvEELN9__gnu_cxx12_Lock_policyE2EE10_M_destroyEv (libnixstore.so.2.35.0 + 0x2bae4e) 0x00007f64d2c1b254 _ZN5boost4asio9execution6detail22shared_target_executor4implINS0_6strandINS0_15any_io_executorEEEED0Ev (libnixstore.so.2.35.0 + 0x33c254) 0x00007f64d2c13508 _ZNSt17_Function_handlerIFvSt6futureIN3nix3refIKNS1_13ValidPathInfoEEEEEZZNS1_19callbackToAwaitableIS5_ZZNS1_L32querySubstitutablePathInfosAsyncERNS1_5StoreERKSt3mapINS1_9StorePathESt8optionalINS1_14ContentA> 0x00007f64d2c8a23b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackINS0_3refIKNS0_13ValidPathInfoEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x3ab23b) 0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa) 0x00007f64d2c8b9aa _ZNSt17_Function_handlerIFvSt6futureISt10shared_ptrIKN3nix13ValidPathInfoEEEEZNS2_5Store13queryPathInfoERKNS2_9StorePathENS2_8CallbackINS2_3refIS4_EEEEEUlS6_E_E10_M_managerERSt9_Any_dataRKSI_St18_Manager_ope> 0x00007f64d2a7196b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt10shared_ptrIKNS0_13ValidPathInfoEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x19296b) 0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa) 0x00007f64d2a5b5c2 _ZNSt17_Function_handlerIFvSt6futureISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEZN3nix16BinaryCacheStore21queryPathInfoUncachedERKNSB_9StorePathENSB_8CallbackISt10shared_ptrIKNSB_13Va> 0x00007f64d2b9961b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x2ba61b) 0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa) 0x00007f64d2b9132a _ZNSt17_Function_handlerIFvSt6futureIN3nix18FileTransferResultEEEZNS1_20HttpBinaryCacheStore7getFileERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_8CallbackISt8optionalISB_EEEEUlS3_E_E10_M_manage> 0x00007f64d2b38f54 _ZN3nix16curlFileTransfer12TransferItemD2Ev (libnixstore.so.2.35.0 + 0x259f54) 0x00007f64d2a4d252 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE24_M_release_last_use_coldEv (libnixstore.so.2.35.0 + 0x16e252) 0x00007f64d2b47555 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3nix16curlFileTransferC4ERKNS3_20FileTransferSettingsEEUlvE_EEEEE6_M_runEv (libnixstore.so.2.35.0 + 0x268555) 0x00007f64d203ffa4 execute_native_thread_routine (libstdc++.so.6 + 0xf2fa4) 0x00007f64d1db2d53 start_thread (libc.so.6 + 0x9dd53) 0x00007f64d1e3a63c __clone3 (libc.so.6 + 0x12563c)
So apparently this had a very high overhead with the loop constantly enqueueing more work into the executor, which wastes cpu cycles needlessly.
libutil: throw Error instead of asserting in `AbsolutePath` constructors
During the initial implementation I messed up here and also included the whole closure. The invariant of topoSortPaths is that it doesn't include more stuff than what's present in the starting set. The code is updated to reflect that.
File transfer retries now use AWS-style "full jitter" exponential backoff, treat HTTP 503 as rate-limited (same longer delay as 429), and honor the Retry-After response header. New nix.conf settings: - filetransfer-retry-attempts (was download-attempts, old name aliased) - filetransfer-retry-delay (100ms): base for transient errors - filetransfer-retry-delay-rate-limited (5000ms): base for 429/503 - filetransfer-retry-max-delay (60000ms): ceiling on backoff growth - filetransfer-retry-jitter (true): enable full jitter Per-substituter overrides are available as store URL parameters (retry-delay, retry-delay-rate-limited, retry-max-delay, retry-attempts), e.g. s3://bucket?retry-attempts=8. The override docstrings link to the corresponding nix.conf settings via @docroot@. Implementation notes: - computeRetryDelayMs + RetryDelayParams + clampedExponential + saturateMs live in filetransfer-impl.hh, included only by filetransfer.cc and unit tests, keeping <random> out of the public header - clampedExponential widens to uint64_t for the intermediate shift to avoid uint32_t overflow - Retry-After parsing uses std::chrono (seconds->ms conversion is overflow-safe since milliseconds::rep is >=45 bits); saturateMs narrows back to uint32_t once at the boundary - retryAfterMs is consumed via std::exchange so it applies to at most one retry attempt (otherwise a transport-level failure after a 503 would reuse the stale server-provided floor) - the retry-guard conditional is a canRetry() IIFE for readability; the Range+Content-Encoding interaction is documented inline - HttpStatus enum struct replaces magic status numbers; std::to_underlying is used where a raw long is required This addresses thundering-herd scenarios where many CI jobs hit the same S3 prefix and receive 503 SlowDown; previously the retry window for 503 was only ~4 seconds. Closes NixOS#15023 Part of NixOS#15419
Table-driven parameterized suite (36 cases) covering exponential growth, ceiling interaction, Retry-After floor semantics, shift clamp boundary at attempt 32/33, and integer-extreme inputs. Plus 5 jitter loop tests for stochastic bounds.
runCommand derivation spinning up a Python HTTP server that returns 503 with Retry-After, verifying that nix retries with the expected backoff timing. Registered in hydra.nix.
libutil: Fix computeClosure and callbackToAwaitable lifetime issues
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
WIP, not finished yet.
Context