fix(#164): Close agent provisioning orphan window#267

Open
mvillmow wants to merge 4 commits into main from 164-auto-impl

mvillmow commented Jun 4, 2026

Contributor

Summary

This PR fixes a partial-provisioning failure that could leave orphaned agents untracked by teardown. When create_agent succeeded but wake_agent failed, the agent ID was lost, so teardown never saw the agent.

The fix passes a shared id_map to _provision_one_agent, which records each agent ID as soon as create_agent returns and before wake_agent runs. Teardown therefore sees every successfully created agent, even on partial failure.
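The ordering described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FakeClient`, its method signatures, and the `client` and `name` parameters are assumptions made for the example; only `_provision_one_agent`, `id_map`, `create_agent`, and `wake_agent` are names taken from the description.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for the real agent API client."""

    async def create_agent(self, name: str) -> str:
        return f"id-{name}"

    async def wake_agent(self, agent_id: str) -> None:
        # Simulate the failure mode from issue #164.
        raise RuntimeError(f"wake failed for {agent_id}")


async def _provision_one_agent(
    client: FakeClient, name: str, id_map: dict[str, str]
) -> None:
    agent_id = await client.create_agent(name)
    # Record the ID before the next await so a wake failure cannot lose it.
    id_map[name] = agent_id
    await client.wake_agent(agent_id)


async def main() -> dict[str, str]:
    id_map: dict[str, str] = {}
    try:
        await _provision_one_agent(FakeClient(), "worker", id_map)
    except RuntimeError:
        pass  # Provisioning failed, but the created agent is still tracked.
    return id_map


created = asyncio.run(main())
print(created)
```

Because the write to `id_map` happens before `wake_agent` is awaited, the caller still holds the ID when the exception propagates, which is what teardown needs.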

Changes

  • _provision_agents refactor: Alias state.created_agents to id_map up front and pass id_map to each _provision_one_agent call. The dict is populated incrementally as agents are created, not after all coroutines complete.

  • _provision_one_agent signature: Add id_map: dict[str, str] parameter and record the agent ID immediately after create_agent returns, before wake_agent is awaited.

  • Schema validation: Add validate_unique_agent_names to WorkflowSpec to prevent silent overwrites from duplicate agent names, which could cause the first agent to be orphaned.

  • Deterministic regression tests: Two new tests in TestErrorPaths:

    • test_wake_agent_failure_still_records_created_agent_for_teardown: Single-agent case where wake_agent fails after create_agent succeeds.
    • test_partial_provisioning_tracks_all_completed_creates_before_failure: Multi-agent case with mixed outcomes, using max_concurrent_provisioning=1 to ensure deterministic call ordering.

  • Dry-run fix: Update the dry-run path to also populate id_map so state is consistent regardless of execution mode.

  • Code cleanup: Remove redundant state.created_agents = await self._provision_agents(...) assignment (now handled by alias inside _provision_agents).
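For the schema validation item, a uniqueness check along these lines would reject duplicate names before any agent is created. This sketch assumes plain dataclasses and a hypothetical `AgentSpec` with a `name` field; the real `WorkflowSpec` may use a different schema library and a different error type.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical agent entry; only the name matters for this check."""

    name: str


@dataclass
class WorkflowSpec:
    agents: list[AgentSpec] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.validate_unique_agent_names()

    def validate_unique_agent_names(self) -> None:
        # A name-keyed id_map would silently overwrite the first agent's ID
        # if two agents shared a name, so fail early instead.
        counts = Counter(agent.name for agent in self.agents)
        duplicates = sorted(name for name, n in counts.items() if n > 1)
        if duplicates:
            raise ValueError(f"duplicate agent names: {duplicates}")
```

Constructing `WorkflowSpec(agents=[AgentSpec("a"), AgentSpec("a")])` raises `ValueError` in this sketch, while distinct names pass.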

Test Results

All 51 tests pass, including the two new regression tests, which cover the single-agent and partial fan-out scenarios.
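The partial fan-out scenario can be reproduced in isolation with a scripted fake. The sketch below is an assumption-heavy illustration rather than the test added in this PR: `State`, `ScriptedClient`, `provision_agents`, the use of `asyncio.Semaphore`, and `return_exceptions=True` are all hypothetical choices; only `created_agents`, `id_map`, and `max_concurrent_provisioning` come from the description.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical run state; only created_agents mirrors the PR text."""

    created_agents: dict[str, str] = field(default_factory=dict)


class ScriptedClient:
    """Hypothetical fake whose failures are chosen per agent name."""

    def __init__(self, fail_create: set[str], fail_wake: set[str]) -> None:
        self.fail_create = fail_create
        self.fail_wake = fail_wake
        self.calls: list[str] = []

    async def create_agent(self, name: str) -> str:
        self.calls.append(f"create:{name}")
        await asyncio.sleep(0)  # Force a real suspension point.
        if name in self.fail_create:
            raise RuntimeError(f"create failed for {name}")
        return f"id-{name}"

    async def wake_agent(self, name: str) -> None:
        self.calls.append(f"wake:{name}")
        await asyncio.sleep(0)
        if name in self.fail_wake:
            raise RuntimeError(f"wake failed for {name}")


async def provision_agents(client, names, state, max_concurrent_provisioning=1):
    id_map = state.created_agents  # Alias, so teardown sees partial progress.
    limit = asyncio.Semaphore(max_concurrent_provisioning)

    async def one(name: str) -> None:
        async with limit:
            # The entry is written only if create_agent returns normally.
            id_map[name] = await client.create_agent(name)
            await client.wake_agent(name)

    results = await asyncio.gather(*(one(n) for n in names), return_exceptions=True)
    # Return the first failure, if any, in input order.
    return next((r for r in results if isinstance(r, BaseException)), None)


def run_scenario():
    client = ScriptedClient(fail_create={"c"}, fail_wake={"b"})
    state = State()
    first_error = asyncio.run(provision_agents(client, ["a", "b", "c"], state))
    return client.calls, state.created_agents, first_error
```

With a concurrency limit of 1, each agent finishes its create and wake steps before the next one starts, so the recorded call order is stable and the assertions do not depend on scheduler timing.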

Closes #164

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

mvillmow and others added 4 commits June 28, 2026 09:46
Record agent IDs the moment create_agent returns (before wake_agent runs)
so a wake failure does not leave agents untracked by teardown. The fix
adds an id_map parameter to _provision_one_agent and mutates id_map at
the earliest awaitable boundary, preventing orphans when wake_agent
raises after create_agent succeeds.

Changes:
- Alias state.created_agents to id_map up front in _provision_agents
  so teardown sees every successfully created agent even on partial
  failure. Simplify gather result processing with `next()`.

- Update _provision_one_agent signature to accept and populate id_map
  immediately after create_agent returns, before wake_agent runs (#164)

- Add validate_unique_agent_names to WorkflowSpec to prevent silent
  overwrites if duplicate agent names appear in the workflow

- Add deterministic regression tests: wake_agent failure case and
  partial create_agent fan-out with max_concurrent_provisioning=1

- Remove redundant state.created_agents assignment (now handled inside
  _provision_agents via the alias)

Test coverage: 51 passing, including two new acceptance criteria for
the orphan window fix.

Closes #164

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
No follow-ups identified within strict scope. The fix for the agent
provisioning orphan window is complete, tested, and ready for review.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[MINOR] §9: Partial agent provisioning failure leaves orphaned agents
