Add graceful degradation when dependent services are unavailable #57

@geoffjay

Context

agentd services communicate via HTTP REST calls (e.g., ask → notify, orchestrator → wrap). Currently, if a downstream service is unavailable, requests fail with connection errors. There's no retry logic, circuit breaking, or graceful fallback behavior in the service layer.

Proposal

Implement resilient inter-service communication with retries, circuit breakers, and fallback behavior.

Acceptance Criteria

  • Add retry logic with exponential backoff to all inter-service HTTP clients (notification_client.rs, client.rs in wrap/orchestrator)
  • Implement circuit breaker pattern — after N consecutive failures, stop attempting calls for a cooldown period
  • Services continue operating in degraded mode when dependencies are down (e.g., ask still answers questions but logs that notification delivery failed)
  • Health endpoint reports dependency status (e.g., /health returns { "status": "degraded", "dependencies": { "notify": "unavailable" } })
  • Startup does not fail if optional dependencies are unreachable — log warning and continue
  • Add configurable timeout, retry count, and circuit breaker thresholds
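The circuit breaker and health-reporting criteria above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's implementation: `BreakerConfig`, `CircuitBreaker`, and their fields and methods are hypothetical names, and the `"available"` status string is an assumption (only `"unavailable"` appears in the example above). The caller passes the current time in explicitly so the behavior can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical configuration; names and values are illustrative only.
#[derive(Debug, Clone)]
pub struct BreakerConfig {
    /// N: consecutive failures before the breaker opens.
    pub failure_threshold: u32,
    /// How long calls are skipped once the breaker is open.
    pub cooldown: Duration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
}

#[derive(Debug)]
pub struct CircuitBreaker {
    config: BreakerConfig,
    state: State,
}

impl CircuitBreaker {
    pub fn new(config: BreakerConfig) -> Self {
        Self { config, state: State::Closed { consecutive_failures: 0 } }
    }

    /// Returns true if a call may be attempted at `now`.
    /// While open, calls are refused until the cooldown has elapsed;
    /// after that a trial call is allowed through.
    pub fn allow(&self, now: Instant) -> bool {
        match self.state {
            State::Closed { .. } => true,
            State::Open { since } => {
                now.saturating_duration_since(since) >= self.config.cooldown
            }
        }
    }

    /// A successful call closes the breaker and resets the failure count.
    pub fn record_success(&mut self) {
        self.state = State::Closed { consecutive_failures: 0 };
    }

    /// A failed call increments the count, or (re)opens the breaker once
    /// the threshold is reached or a trial call fails.
    pub fn record_failure(&mut self, now: Instant) {
        self.state = match self.state {
            State::Closed { consecutive_failures }
                if consecutive_failures + 1 < self.config.failure_threshold =>
            {
                State::Closed { consecutive_failures: consecutive_failures + 1 }
            }
            _ => State::Open { since: now },
        };
    }

    /// Status string a /health handler could report for this dependency.
    pub fn status(&self) -> &'static str {
        match self.state {
            State::Closed { .. } => "available",
            State::Open { .. } => "unavailable",
        }
    }
}
```

A caller would check `allow` before each request, skip the call and continue in degraded mode when it returns false, and report `"degraded"` from `/health` whenever any dependency's `status()` is `"unavailable"`.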

Relevant Files

  • crates/ask/src/notification_client.rs — calls notify service
  • crates/orchestrator/src/manager.rs — calls wrap service
  • crates/cli/src/client.rs — calls all services
  • crates/notify/src/client.rs — HTTP client base
  • crates/wrap/src/client.rs — HTTP client base

Notes

Dependencies

Benefits from: #49 (shared HTTP client would be the natural place to add retry/circuit-breaker logic, avoiding duplication across service clients)
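As an illustration of what retry logic in a shared client might look like, here is a minimal sketch of a generic retry helper with exponential backoff. Everything in it is an assumption rather than existing agentd code: `RetryPolicy`, `delay_for`, and `retry_with_backoff` are hypothetical names, and the helper is synchronous for brevity, so an async client would await a timer instead of calling the injected `sleep` function.

```rust
use std::time::Duration;

/// Hypothetical retry settings; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct RetryPolicy {
    /// Number of retries after the initial attempt.
    pub max_retries: u32,
    /// Delay before the first retry.
    pub base_delay: Duration,
    /// Upper bound on any single delay.
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Delay before retry number `attempt` (0-based):
    /// base_delay * 2^attempt, capped at max_delay.
    pub fn delay_for(&self, attempt: u32) -> Duration {
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base_delay
            .checked_mul(factor)
            .unwrap_or(self.max_delay)
            .min(self.max_delay)
    }
}

/// Runs `op` until it succeeds or the retry budget is exhausted.
/// `sleep` is injected so callers (and tests) control how waiting happens.
pub fn retry_with_backoff<T, E>(
    policy: &RetryPolicy,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= policy.max_retries => return Err(err),
            Err(_) => {
                sleep(policy.delay_for(attempt));
                attempt += 1;
            }
        }
    }
}
```

In a blocking context, `std::thread::sleep` can be passed as the `sleep` argument. Adding random jitter to each delay is a common refinement that is left out here to keep the sketch short.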

Related: #17 (API authentication), #7 (integration tests for cross-service communication)

Labels

  • architecture (Cross-service architectural design or review)
  • complexity:large (Large scope: 200+ lines, multiple files)
  • enhancement (New feature or request)
  • needs-tests (Area needs dedicated test coverage)
  • triaged (Issue has been triaged, ready for planning or implementation)
