Skip to content

Sync with upstream 2.35#530

Draft
edolstra wants to merge 395 commits into
mainfrom
sync-2.35
Draft

Sync with upstream 2.35#530
edolstra wants to merge 395 commits into
mainfrom
sync-2.35

Conversation

@edolstra

@edolstra edolstra commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Motivation

WIP, not finished yet.

Context

xokdvium and others added 30 commits March 14, 2026 15:09
We should never call reset() on a value (such as vRes) than can be
seen by another thread.

This was causing random failures about 'partially applied built-in
function' etc.

(cherry picked from commit 839aec2)
Resolves NixOS#15420

Signed-off-by: Lisanna Dettwyler <lisanna.dettwyler@gmail.com>
…ToPath

This is totally equivalent on unix, because the path is guaranteed to be absolute
(there's a check right before the call to rootPath) and absPath is equivalent to
canonPath when the path is absolute and resolveSymlinks is false (as in our case).
There are a couple of reasons we want to do this:

* Portability. Windows should always use unix-style paths in the evaluator.
* Performance. We don't want to pay the overhead of constructing a std::filesystem::path.
Fix "due not" → "do not" typo, and replace the incomplete sentence
"Different stores that disagree." with a complete, accurate explanation.
We can reuse the asynchronous graph traversal logic from computeClosure
to traverse the "references" edges asynchronously to avoid relying on the
narinfo disk cache. queryMissing will get the same treatment. This is limited
to the binary cache store implementation so that we don't needlessly spawn
coroutines when the underlying queryPathInfo is synchronous.
Coroutines are cheap and the FileTransfer already implements
its own rate limiting. This is needed just to limit the maximum
number of coroutines in flight.
Make queryMissing, querySubstitutablePaths, topoSortPaths async
In 08887ca I neglected the case of
interrupts and got bitten by destruction order. fd gets closed before
FdSink destructor gets a chance to run. This doesn't matter in the successful
code path, but does get hit during interrupts and leads to a bunch of annoying
ignored error messages:

error (ignored): write of 16384 bytes: Bad file descriptor
error (ignored): write of 16384 bytes: Bad file descriptor
…in-destructor

libutil/fs-sink: Flush FdSink before closing the file descriptor
Make even more parts of the evaluator thread-safe
This wasn't handling Interrupted, since it's a subclass of BaseError,
not Error. Since this is called from handleExceptions() while printing
exceptions, the uncaught exception here was causing Nix to crash with
a call to std::unexpected().
This hardens Nix against uncaught exceptions (like Interrupted) while
printing exceptions.
My SNAFU from initially implementing computeClosure and callbackToAwaitable.

* computeClosure must ensure that all callbacks have been invoked before destroying
  the event loop. That wasn't the case when we hit an early error.

* Callbacks must not take (shared) ownership of the shared_ptr when the completion
  has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl
  thread:

0x00007f64d2b99e4e _ZNSt23_Sp_counted_ptr_inplaceIN5boost4asio6detail23strand_executor_service11strand_implENS1_17execution_context9allocatorIvEELN9__gnu_cxx12_Lock_policyE2EE10_M_destroyEv (libnixstore.so.2.35.0 + 0x2bae4e)
0x00007f64d2c1b254 _ZN5boost4asio9execution6detail22shared_target_executor4implINS0_6strandINS0_15any_io_executorEEEED0Ev (libnixstore.so.2.35.0 + 0x33c254)
0x00007f64d2c13508 _ZNSt17_Function_handlerIFvSt6futureIN3nix3refIKNS1_13ValidPathInfoEEEEEZZNS1_19callbackToAwaitableIS5_ZZNS1_L32querySubstitutablePathInfosAsyncERNS1_5StoreERKSt3mapINS1_9StorePathESt8optionalINS1_14ContentA>
0x00007f64d2c8a23b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackINS0_3refIKNS0_13ValidPathInfoEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x3ab23b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2c8b9aa _ZNSt17_Function_handlerIFvSt6futureISt10shared_ptrIKN3nix13ValidPathInfoEEEEZNS2_5Store13queryPathInfoERKNS2_9StorePathENS2_8CallbackINS2_3refIS4_EEEEEUlS6_E_E10_M_managerERSt9_Any_dataRKSI_St18_Manager_ope>
0x00007f64d2a7196b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt10shared_ptrIKNS0_13ValidPathInfoEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x19296b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2a5b5c2 _ZNSt17_Function_handlerIFvSt6futureISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEZN3nix16BinaryCacheStore21queryPathInfoUncachedERKNSB_9StorePathENSB_8CallbackISt10shared_ptrIKNSB_13Va>
0x00007f64d2b9961b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x2ba61b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2b9132a _ZNSt17_Function_handlerIFvSt6futureIN3nix18FileTransferResultEEEZNS1_20HttpBinaryCacheStore7getFileERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_8CallbackISt8optionalISB_EEEEUlS3_E_E10_M_manage>
0x00007f64d2b38f54 _ZN3nix16curlFileTransfer12TransferItemD2Ev (libnixstore.so.2.35.0 + 0x259f54)
0x00007f64d2a4d252 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE24_M_release_last_use_coldEv (libnixstore.so.2.35.0 + 0x16e252)
0x00007f64d2b47555 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3nix16curlFileTransferC4ERKNS3_20FileTransferSettingsEEUlvE_EEEEE6_M_runEv (libnixstore.so.2.35.0 + 0x268555)
0x00007f64d203ffa4 execute_native_thread_routine (libstdc++.so.6 + 0xf2fa4)
0x00007f64d1db2d53 start_thread (libc.so.6 + 0x9dd53)
0x00007f64d1e3a63c __clone3 (libc.so.6 + 0x12563c)
So apparently this had a very high overhead with the loop constantly
enqueueing more work into the executor, which wastes cpu cycles needlessly.
libutil: throw Error instead of asserting in `AbsolutePath` constructors
During the initial implementation I messed up here and also
included the whole closure. The invariant of topoSortPaths
is that it doesn't include more stuff than what's present in
the starting set. The code is updated to reflect that.
File transfer retries now use AWS-style "full jitter" exponential
backoff, treat HTTP 503 as rate-limited (same longer delay as 429), and
honor the Retry-After response header.

New nix.conf settings:
  - filetransfer-retry-attempts (was download-attempts, old name aliased)
  - filetransfer-retry-delay (100ms): base for transient errors
  - filetransfer-retry-delay-rate-limited (5000ms): base for 429/503
  - filetransfer-retry-max-delay (60000ms): ceiling on backoff growth
  - filetransfer-retry-jitter (true): enable full jitter

Per-substituter overrides are available as store URL parameters
(retry-delay, retry-delay-rate-limited, retry-max-delay, retry-attempts),
e.g. s3://bucket?retry-attempts=8. The override docstrings link to the
corresponding nix.conf settings via @docroot@.

Implementation notes:
  - computeRetryDelayMs + RetryDelayParams + clampedExponential +
    saturateMs live in filetransfer-impl.hh, included only by
    filetransfer.cc and unit tests, keeping <random> out of the
    public header
  - clampedExponential widens to uint64_t for the intermediate shift
    to avoid uint32_t overflow
  - Retry-After parsing uses std::chrono (seconds->ms conversion is
    overflow-safe since milliseconds::rep is >=45 bits); saturateMs
    narrows back to uint32_t once at the boundary
  - retryAfterMs is consumed via std::exchange so it applies to at
    most one retry attempt (otherwise a transport-level failure after
    a 503 would reuse the stale server-provided floor)
  - the retry-guard conditional is a canRetry() IIFE for readability;
    the Range+Content-Encoding interaction is documented inline
  - HttpStatus enum struct replaces magic status numbers;
    std::to_underlying is used where a raw long is required

This addresses thundering-herd scenarios where many CI jobs hit the
same S3 prefix and receive 503 SlowDown; previously the retry window
for 503 was only ~4 seconds.

Closes NixOS#15023
Part of NixOS#15419
Table-driven parameterized suite (36 cases) covering exponential growth,
ceiling interaction, Retry-After floor semantics, shift clamp boundary
at attempt 32/33, and integer-extreme inputs. Plus 5 jitter loop tests
for stochastic bounds.
runCommand derivation spinning up a Python HTTP server that returns 503
with Retry-After, verifying that nix retries with the expected backoff
timing. Registered in hydra.nix.
libutil: Fix computeClosure and callbackToAwaitable lifetime issues
edolstra added 28 commits July 2, 2026 19:52
@edolstra edolstra added the flake-regression-test Run the flake regressions test suite on this PR label Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b02625e-1c50-4fbe-bb0a-a4af92ba348e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync-2.35

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flake-regression-test Run the flake regressions test suite on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.