Skip to content

feat: implement mid-workflow Agamemnon health checks (#161)#265

Open
mvillmow wants to merge 1 commit into
mainfrom
161-auto-impl
Open

feat: implement mid-workflow Agamemnon health checks (#161)#265
mvillmow wants to merge 1 commit into
mainfrom
161-auto-impl

Conversation

@mvillmow

@mvillmow mvillmow commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds out-of-band connectivity monitoring during workflow execution to detect when Agamemnon becomes unresponsive, distinguishing backend failures from slow task execution.

  • AgamemnonClient.ping() — single-shot liveness probe via GET /v1/agents
  • _heartbeat background task — probes every 15s, raises WorkflowConnectivityError after 2 consecutive failures
  • state.connectivity_failed flag — set before raising so _teardown knows to clean up even under on_completion policy (fixes resource leak)
  • Extended _teardown predicate — on_completion now tears down on connectivity failure, not just task success
  • 12 test cases covering ping() semantics, threshold logic, lifecycle, and critical teardown policy regression

Test plan

  • All 12 new health check tests pass
  • All 48 existing executor tests still pass
  • ruff check and ruff format pass
  • Commit is cryptographically signed

Closes #161

🤖 Generated with Claude Code

@mvillmow mvillmow left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, complete impl of #161 health checks; on_completion teardown leak fixed and tested. One minor: redundant except tuple in ping().

Comment thread src/telemachy/agamemnon_client.py Outdated
Add out-of-band connectivity monitoring during workflow execution to detect
when Agamemnon becomes unresponsive, distinguishing backend failures from
slow task execution. Implements:

- AgamemnonClient.ping(): single-shot liveness probe via GET /v1/agents
  (bypasses retry logic to serve as immediate detection signal)
- _heartbeat task in WorkflowExecutor: background task that probes every
  HEALTHCHECK_INTERVAL_SECONDS and sets connectivity_lost event after
  HEALTHCHECK_FAILURE_THRESHOLD consecutive failures
- WorkflowConnectivityError: typed exception raised when threshold exceeded
- state.connectivity_failed flag: set before raising to signal teardown
- Extended _teardown policy gate: on_completion now also tears down when
  connectivity_failed=True (fixes resource-leak regression on default policy)
- New settings in config.py:
  * HEALTHCHECK_INTERVAL_SECONDS (default 15)
  * HEALTHCHECK_FAILURE_THRESHOLD (default 2)
  * HEALTHCHECK_TIMEOUT_SECONDS (default 5)
- 12 test cases covering ping() semantics, threshold logic, lifecycle,
  and teardown policy interactions

Fixes the resource-leak regression where on_completion (the default) would
not tear down agents/teams if Agamemnon failed mid-workflow.

Closes #161

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MAJOR] §9: No health check or Agamemnon connectivity verification mid-workflow

1 participant