Skip to content

fix(retry-budget): close 7 audit findings (0.8.3)#34

Merged
lesnik512 merged 9 commits into
mainfrom
fix/retry-budget-cluster
Jun 8, 2026
Merged

fix(retry-budget): close 7 audit findings (0.8.3)#34
lesnik512 merged 9 commits into
mainfrom
fix/retry-budget-cluster

Conversation

@lesnik512

@lesnik512 lesnik512 commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Closes 7 of the 35 deep-audit findings — the RetryBudget cross-cutting cluster plus 2 adjacent retry-surface nits. See planning/specs/2026-06-08-retry-budget-cluster-design.md for the design and planning/releases/0.8.3.md for the user-facing release notes.

Behavioral changes (read these)

  • RetryBudget deposits once per request, not per attempt. Tighter retry pacing under load.
  • RetryBudget ceiling uses math.ceil. No more silent off-by-one against percent_can_retry.
  • Retry-After > max_delay raises with a PEP 678 note instead of silently capping.
  • RuntimeError → TransportError via is_closed. Message-independent mapping.
  • Streaming-body refusal note scoped to where streaming is actually the blocker.

Audit findings closed

# Severity File:line Closed by
1 Low retry.py:105 + :236 deposit() hoist
2 Low retry.py:189 + :320 Retry-After give-up
3 Low budget.py:67 math.ceil ceiling
4 Low test_budget_props.py:52 test rewrite (paired with #3)
5 Nit test_retry_props.py:60 budget-exhaustion property test
6 Nit retry.py:111 + :242/:249 streaming-note scoping
7 Nit client.py:135 + :852 is_closed check

Test plan

  • All retry + budget + client + lifecycle tests pass after each commit
  • just lint-ci green after each commit
  • Reviewer manually scans the per-commit messages for any behavioral surprise

Release

Tag 0.8.3 from the merge SHA after this PR lands.

🤖 Generated with Claude Code

lesnik512 and others added 9 commits June 8, 2026 11:04
Pairs the 3 RetryBudget-cluster code fixes (deposit per-request hoist,
math.ceil ceiling, Retry-After give-up when > max_delay) with the 2 chunk-3
test rewrites that should have caught them, plus the 2 adjacent retry nits
(streaming-note scoping, RuntimeError → TransportError via is_closed).

Decisions: deposit() hoists above the loop, math.ceil ceiling, Retry-After
> max_delay raises the underlying StatusError with PEP 678 note rather than
silently capping. All behavioral changes documented in 0.8.3 release notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each task pairs a failing test (or test rewrite that pins the buggy
behavior) with the production fix that satisfies it. Task 1 = math.ceil
+ budget property test rewrite; Task 2 = deposit() per-request hoist;
Task 3 = Retry-After give-up (rewrites the existing capped test);
Task 4 = budget-exhaustion property test; Task 5 = streaming-note
scoping; Task 6 = is_closed mapping; Task 7 = release notes + PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`RetryBudget.try_withdraw`'s ceiling computed `int(deposits * percent) + floor`,
truncating so 4 deposits at percent_can_retry=0.2 yielded `int(0.8) = 0` —
the configured percentage was never reached at deposit counts below
`1/percent_can_retry`. Replace with math.ceil so the threshold is hit at the
first deposit-count where it is mathematically expressible.

The Hypothesis property test mirrored the same int() formula, so it could
not detect the off-by-one. Rewrite it to use math.ceil for the expected
ceiling and tighten `permitted <= ceiling` to `permitted == ceiling` so it
discriminates between the two roundings going forward.

Also update test_budget.py::test_deposit_after_exhaustion_does_not_immediately_unblock,
which was written to document the old truncation behaviour; the test is renamed
and rewritten to document the correct ceil semantics.

Closes audit Low findings (budget.py:67 truncation, test_budget_props.py:52
formula-mirror) from planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`AsyncRetry.__call__` and `Retry.__call__` deposited a token inside the
per-attempt for-loop. A request that retried twice contributed three
deposits and two withdrawals — inflating the budget's deposits-to-
withdrawals ratio by `(attempts-1)/attempts` and letting through more
retries than `percent_can_retry` should permit.

Move `self.budget.deposit()` immediately above the loop in both worlds so
the Finagle contract holds: deposits count original requests; withdrawals
count retries.

Behavioral impact: tighter retry pacing under load — users with active
retry traffic see the budget refuse retries earlier than before. This is
the documented contract; the previous behavior was the bug.

Closes audit Low finding (retry.py:105 + :236) from
planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously when `respect_retry_after=True` and the server sent
`Retry-After: 120` while the client had `max_delay=5`, AsyncRetry/Retry
silently clamped to 5s and retried — almost certainly hitting the same
503/429 and burning an attempt. The option name implied the header was
honored.

New behavior: when the header value exceeds max_delay, re-raise the last
observed StatusError with a PEP 678 note
`httpware: Retry-After (Ns) exceeded max_delay (Ms); giving up`. Same
treatment in both AsyncRetry and Retry. `respect_retry_after=False` and
raising `max_delay` to accommodate the header are the opt-outs.

Replaces the existing `test_retry_after_capped_at_max_delay` (which
pinned the old silent-cap) with two tests: the give-up assertion at
Retry-After > max_delay, and the boundary case at Retry-After == max_delay
which still retries. Sync mirrors added. Also updates
`test_retry_after_seconds_honored` (sync) which implicitly tested
clamping and now tests the within-budget path instead.

docs/resilience.md's respect_retry_after row describes the new give-up
behavior.

Closes audit Low finding (retry.py:189 + :320) from
planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing AsyncRetry property tests build budgets with
`min_retries_per_sec=1000.0` — floor of 60_000 retries permitted
unconditionally — so the budget-exhaustion path is never exercised by
Hypothesis. Add a focused property test on RetryBudget directly that
configures `min_retries_per_sec=0.0` and asserts: exactly ceiling
permits succeed, then the next one fails.

Pairs with the math.ceil fix in commit 1 of this branch — the same
ceiling formula is now the regression guard for both production and tests.

Closes audit Nit finding (test_retry_props.py:60) from
planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The early-out branch for method ineligibility AND non-retryable status
also attached the streaming-body refusal note whenever `retryable_status
and request.extensions.get(STREAMING_BODY_MARKER)` — telling users their
stream couldn't replay when the actual reason for not retrying was method
exclusion (e.g. POST not in retry_methods). The same misrouting existed
on the NetworkError/TimeoutError arm.

Drop the misleading add_note() inside the early-out branches in both
AsyncRetry and Retry. The note still fires at the existing dedicated
streaming-refusal site (the `# ---- retryable failure path` block) where
streaming is the actual blocker.

Also update existing tests that exercised POST (method-ineligible) to use
PUT (method-eligible) so streaming IS the actual reason for refusal, and
flip the two non-idempotent sync tests to assert the note is NOT attached.

Closes audit Nit finding (retry.py:111 + :242/:249) from
planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…string

Both AsyncClient._terminal and Client._terminal mapped RuntimeError to
TransportError by checking `"closed" in str(exc)`. This conflates two
unrelated conditions: any RuntimeError whose message happens to contain
"closed" (from any transport layer, plugin, or future httpx2 wording
change) was mis-classified as TransportError; conversely, if httpx2 reworded
its closed-client error to "shut down" or "disposed", the mapping would
silently break.

Replace with `self._httpx2_client.is_closed` — the same public attribute
already used elsewhere in client.py for the borrowed-client teardown
guards (lines 774, 784, 870, 880). Message-independent.

Tests pin both directions: RuntimeError raised while the client is open
propagates as-is (even when its message contains "closed"); RuntimeError
raised after close (real httpx2 case) maps to TransportError.

Closes audit Nit finding (client.py:135 + :852) from
planning/audit/2026-06-07-deep-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… robustness

Three behavioral changes (deposit per-request, ceiling math.ceil,
Retry-After give-up), one mapping cleanup (RuntimeError → TransportError
via is_closed), one diagnostic fix (streaming-note scoping), plus two
non-user-visible test improvements. Closes 7 audit findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 self-assigned this Jun 8, 2026
@lesnik512 lesnik512 merged commit 171d893 into main Jun 8, 2026
5 checks passed
@lesnik512 lesnik512 deleted the fix/retry-budget-cluster branch June 8, 2026 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant