
M18.1: anvil_ssh::retry module — RetryPolicy + classifier + run loop (FR-81, FR-82, FR-83) #28

Merged: UnbreakableMJ merged 1 commit into main from feature/m18-1-retry-module on May 4, 2026

@UnbreakableMJ


Summary

First slice of M18 (PRD §5.8.7 — connection retry / backoff / timeouts). Adds the library-side scaffold for FR-81 / FR-82 / FR-83 — M18.2 plumbs this into AnvilConfig + session.rs::connect; M18.4 wires the Gitway CLI flags + retry_attempts JSON envelope.

New anvil_ssh::retry module

  • RetryPolicy { attempts, base, factor, cap, max_window, connect_timeout } with builder setters. Defaults per PRD: 3 attempts, 250 ms base, ×2 factor, 8 s cap, 30 s max_window, no connect_timeout.
  • Disposition { Retry, Fatal } + classify(err) (FR-82) — auth/host-key/no-key/key-encrypted → Fatal; transient I/O kinds (ConnectionRefused, TimedOut, HostUnreachable, NetworkUnreachable, NotFound (DNS), AddrNotAvailable) → Retry; everything else (russh protocol, other I/O kinds) → Fatal.
  • RetryAttempt { attempt, reason, elapsed } — per-failure history for FR-83.
  • async fn run(policy, op) -> Result<(T, Vec<RetryAttempt>), AnvilError>: drives the loop with jittered exponential backoff (min(base * factor^(n-1), cap) + uniform_jitter([0, base/2])), jitter sourced from OsRng. Bails once max_window is exhausted rather than starting another attempt. Emits tracing::warn! at the new CAT_RETRY category per failed attempt.
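
The backoff formula in the last bullet can be illustrated with a small std-only sketch. This is not the code in src/retry.rs: the `Backoff` struct, its integer `factor`, and the `jitter_unit` parameter (a caller-supplied draw in [0, 1] standing in for OsRng) are assumptions made so the arithmetic is deterministic and easy to check.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the backoff-related fields of `RetryPolicy`.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    pub base: Duration,
    pub factor: u32, // assumption: the real field may not be an integer
    pub cap: Duration,
}

impl Default for Backoff {
    /// Defaults quoted in the PR description: 250 ms base, x2 factor, 8 s cap.
    fn default() -> Self {
        Self {
            base: Duration::from_millis(250),
            factor: 2,
            cap: Duration::from_secs(8),
        }
    }
}

impl Backoff {
    /// Delay after failed attempt `n` (1-based):
    /// min(base * factor^(n-1), cap) + jitter, with jitter in [0, base/2].
    /// `jitter_unit` is a uniform draw in [0.0, 1.0] supplied by the caller.
    pub fn delay(&self, n: u32, jitter_unit: f64) -> Duration {
        let exponent = n.saturating_sub(1);
        // Saturate instead of overflowing for large attempt numbers.
        let multiplier = self.factor.checked_pow(exponent).unwrap_or(u32::MAX);
        let raw = self.base.checked_mul(multiplier).unwrap_or(Duration::MAX);
        let capped = raw.min(self.cap);
        let jitter = (self.base / 2).mul_f64(jitter_unit.clamp(0.0, 1.0));
        capped + jitter
    }
}
```

With the default values and zero jitter, the delays after attempts 1, 2 and 3 are 250 ms, 500 ms and 1 s, and the curve reaches the 8 s cap at attempt 6. Note that under this formula the jitter is added after the cap, so a delay can exceed `cap` by up to `base/2`.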

run is timeout-agnostic — the per-attempt tokio::time::timeout wrap lives at the M18.2 call site so the same loop is reusable for non-network operations.
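
To make the shape of a timeout-agnostic loop concrete, here is a blocking, std-only analogue. It is a sketch under stated assumptions, not the async `run` from this PR: `run_blocking`, `Attempt`, and the closure parameters for classification and delay are hypothetical names, and the exact max_window check (elapsed time plus the next delay) is a guess at the intended behaviour.

```rust
use std::fmt::Display;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Retry,
    Fatal,
}

/// Hypothetical counterpart of `RetryAttempt`: one entry per failed attempt.
#[derive(Debug)]
pub struct Attempt {
    pub attempt: u32,
    pub reason: String,
    pub elapsed: Duration,
}

/// Blocking retry driver. It never applies a timeout itself; `op` is free
/// to wrap its own work in one, which is the point of keeping the loop agnostic.
pub fn run_blocking<T, E: Display>(
    attempts: u32,
    max_window: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    classify: impl Fn(&E) -> Disposition,
    delay: impl Fn(u32) -> Duration,
) -> Result<(T, Vec<Attempt>), E> {
    let start = Instant::now();
    let mut history = Vec::new();
    let mut n = 1;
    loop {
        let err = match op() {
            Ok(value) => return Ok((value, history)),
            Err(e) => e,
        };
        history.push(Attempt {
            attempt: n,
            reason: err.to_string(),
            elapsed: start.elapsed(),
        });
        let wait = delay(n);
        // Assumption: give up if the next sleep would overrun the window.
        let out_of_budget = n >= attempts || start.elapsed() + wait > max_window;
        if classify(&err) == Disposition::Fatal || out_of_budget {
            return Err(err);
        }
        std::thread::sleep(wait);
        n += 1;
    }
}
```

In this sketch the history is only returned on success, mirroring the `Result<(T, Vec<RetryAttempt>), AnvilError>` signature in the commit message. How the real module exposes history after a failure is not covered here.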

Supporting additions

  • log::CAT_RETRY = "anvil_ssh::retry" appended to CATEGORIES.
  • AnvilError::io_kind() — returns Option<std::io::ErrorKind> for the Io variant.
  • AnvilError::is_transient() — single-call predicate wrapping retry::classify.
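
How `io_kind()`, `classify` and `is_transient()` fit together can be sketched with a stand-in error type. `SketchError` and its variants are hypothetical and only mirror the names mentioned in this PR; the real `AnvilError` is larger. The sketch also leaves out HostUnreachable and NetworkUnreachable, which the PR classifies as Retry, because those `std::io::ErrorKind` variants need a recent Rust toolchain.

```rust
use std::io;

/// Hypothetical stand-in for `AnvilError`.
#[derive(Debug)]
pub enum SketchError {
    AuthenticationFailed,
    HostKeyMismatch,
    NoKeyFound,
    KeyEncrypted,
    Io(io::Error),
    Protocol(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Retry,
    Fatal,
}

impl SketchError {
    /// Underlying I/O kind for the `Io` variant, `None` for everything else.
    pub fn io_kind(&self) -> Option<io::ErrorKind> {
        match self {
            Self::Io(e) => Some(e.kind()),
            _ => None,
        }
    }

    /// Single-call predicate wrapping the classifier.
    pub fn is_transient(&self) -> bool {
        matches!(classify(self), Disposition::Retry)
    }
}

/// Only a small allow-list of I/O kinds is retryable; auth, host-key,
/// key and protocol errors all fall through to `Fatal`.
pub fn classify(err: &SketchError) -> Disposition {
    match err.io_kind() {
        Some(
            io::ErrorKind::ConnectionRefused
            | io::ErrorKind::TimedOut
            | io::ErrorKind::NotFound
            | io::ErrorKind::AddrNotAvailable,
        ) => Disposition::Retry,
        _ => Disposition::Fatal,
    }
}
```

Defaulting to `Fatal` means a new error variant never becomes retryable by accident; it has to be added to the allow-list deliberately.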

Tests (15 unit tests)

Default policy shape, builder chainability, classifier matrix (7 cases), run loop (success-first / bail-on-fatal / retry-record-history / exhaust-count), backoff curve (exponential growth + cap enforcement + 1000-draw jitter window).

Public API: purely additive. Version bump to 0.9.0 lands in M18.3. HTTP 429/503 detection from FR-82 is documented as out of scope: Anvil speaks raw SSH.

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features --locked -- -D warnings
  • cargo test --lib --tests --locked — all green (15 new retry::tests::*, plus the existing M11–M17 set)
  • CI matrix on this PR

🤖 Generated with Claude Code

…(FR-81, FR-82, FR-83 lib side)

Adds the library-side scaffold M18 needs to honour ConnectTimeout +
ConnectionAttempts from `~/.ssh/config` (FR-80, M18.2), classify
transient vs fatal errors (FR-82), drive an exponential-backoff
retry loop with jitter (FR-81), and capture per-attempt history for
FR-83's `gitway --test --json` envelope.

src/retry.rs (new, ~430 lines + ~190 lines of tests):

- pub struct RetryPolicy { attempts, base, factor, cap, max_window,
  connect_timeout } with builder-style setters.  Default values
  per PRD: 3 attempts, 250 ms base, x2 factor, 8 s cap, 30 s max
  window, no connect_timeout.

- pub enum Disposition { Retry, Fatal }
  pub fn classify(err: &AnvilError) -> Disposition (FR-82):
  * AuthenticationFailed / HostKeyMismatch / NoKeyFound /
    KeyEncrypted -> Fatal
  * Io kind in {ConnectionRefused, TimedOut, HostUnreachable,
    NetworkUnreachable, NotFound (DNS NXDOMAIN), AddrNotAvailable}
    -> Retry
  * Everything else (russh protocol, signing, signature-invalid,
    other Io kinds) -> Fatal
  HTTP 429/503 detection from FR-82's defensive wording is out of
  scope: Anvil speaks raw SSH; HTTP statuses only surface in
  ProxyCommand subprocess output that Anvil doesn't parse.

- pub struct RetryAttempt { attempt: u32, reason: String,
  elapsed: Duration } captures per-failure history for FR-83.

- pub async fn run<F, Fut, T>(policy, op) -> Result<(T, Vec<RetryAttempt>), AnvilError>
  drives the loop with jittered exponential backoff:
    delay_n = min(base * factor^(n-1), cap) + uniform_jitter([0, base/2])
  Jitter sourced from rand_core::OsRng (already in deps from M19's
  prepend_revoked).
  Bails on max_window before starting another attempt.
  Emits tracing::warn! at CAT_RETRY per failed attempt with
  attempt / reason / elapsed_ms / disposition fields.

- run is timeout-agnostic: the per-attempt tokio::time::timeout
  wrap lives at the call site (M18.2's session.rs::connect) so
  the same loop driver can be reused for non-network operations.

- 15 unit tests covering: default-policy values, builder
  chainability, classifier matrix (auth-fatal / host-key-fatal /
  no-key-fatal / io-connection-refused-retry / io-timed-out-retry /
  io-not-found-retry / io-permission-denied-fatal), run loop
  (success-first-try / bail-on-fatal / retry-and-record-history /
  exhaust-attempt-count), backoff curve (exponential growth, cap
  enforcement, 1000-draw jitter window).

src/log.rs:
- New pub const CAT_RETRY = "anvil_ssh::retry"; appended to
  CATEGORIES.
- categories_slice_matches_individual_constants test updated to
  include the new constant.

src/error.rs:
- New pub fn AnvilError::io_kind(&self) -> Option<std::io::ErrorKind>
  returns the underlying io::Error::kind() for the Io variant.
  Used by retry::classify; also useful for downstream consumers
  inspecting failure categories.
- New pub fn AnvilError::is_transient(&self) -> bool returning
  matches!(retry::classify(self), Disposition::Retry).  Surfaces
  the classifier as a single-call predicate for log-aggregation
  pipelines and CLI error paths.

src/lib.rs:
- pub mod retry;

Public API: pure additive.  Version bump to 0.9.0 lands in M18.3.

Plan: M18.1 of anvil-gitway-milestone-plan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@UnbreakableMJ UnbreakableMJ merged commit b143f5e into main May 4, 2026
5 checks passed
@UnbreakableMJ UnbreakableMJ deleted the feature/m18-1-retry-module branch May 4, 2026 20:55
UnbreakableMJ added a commit that referenced this pull request May 4, 2026
…t) (#30)

Final Anvil-side slice of M18.  Bumps anvil-ssh from 0.8.0 to 0.9.0
to publish the M18.1 + M18.2 work as a single crates.io release.
The Gitway-side CLI flags + retry_attempts JSON envelope (M18.4 +
M18.5) land against this 0.9.0; the M18.X PRD doc PR closes the
milestone with Gitway v1.0.0-rc.9.

Cargo.toml:
- version "0.8.0" -> "0.9.0"

Cargo.lock:
- regenerated locally; reflects the 0.9.0 version.

CHANGELOG.md:
- 0.9.0 entry covering the new anvil_ssh::retry module
  (RetryPolicy, classify, run, RetryAttempt), the new
  CAT_RETRY tracing category, the AnvilError::io_kind +
  is_transient predicates, the three new AnvilConfig fields +
  builder setters, the apply_ssh_config consumption of
  ConnectTimeout / ConnectionAttempts, the AnvilSession::connect
  retry+timeout wrap, and the AnvilSession::retry_history
  accessor.  Documents the proxy/jump scope-narrowing and the
  HTTP 429/503 out-of-scope decision.

Stacked after PRs #28 (M18.1, merged) and #29 (M18.2, merged).

Plan: M18.3 of anvil-gitway-milestone-plan.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>