fix(#164): Close agent provisioning orphan window#267
Open
mvillmow wants to merge 4 commits into
Open
Conversation
Record agent IDs the moment create_agent returns (before wake_agent runs) so a wake failure does not leave agents untracked by teardown. The fix adds _provision_one_agent(id_map) parameter and mutates id_map at the earliest awaitable boundary, preventing orphans when wake_agent raises after create_agent succeeds. Changes: - Alias state.created_agents to id_map up front in _provision_agents so teardown sees every successfully-created agent even on partial failure. Simplify gather result processing with `next()` - Update _provision_one_agent signature to accept and populate id_map immediately after create_agent returns, before wake_agent runs (#164) - Add validate_unique_agent_names to WorkflowSpec to prevent silent overwrites if duplicate agent names appear in the workflow - Add deterministic regression tests: wake_agent failure case and partial create_agent fan-out with max_concurrent_provisioning=1 - Remove redundant state.created_agents assignment (now handled inside _provision_agents via the alias) Test coverage: 51 passing, including two new acceptance criteria for the orphan window fix. Closes #164 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
No follow-ups identified within strict scope. The fix for the agent provisioning orphan window is complete, tested, and ready for review. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes the partial agent provisioning failure that could leave orphaned agents untracked by teardown. The issue occurs when
create_agentsucceeds butwake_agentfails—the agent ID would be lost and teardown would miss it.The fix records agent IDs the moment
create_agentreturns (beforewake_agentruns) by passing a sharedid_mapto_provision_one_agentand recording each ID immediately after creation. This ensures teardown always sees every successfully-created agent, even on partial failure.Changes
_provision_agents refactor: Alias
state.created_agentstoid_mapup front and passid_mapto each_provision_one_agentcall. The dict is populated incrementally as agents are created, not after all coroutines complete._provision_one_agent signature: Add
id_map: dict[str, str]parameter and record the agent ID immediately aftercreate_agentreturns, beforewake_agentis awaited.Schema validation: Add
validate_unique_agent_namestoWorkflowSpecto prevent silent overwrites from duplicate agent names, which could cause the first agent to be orphaned.Deterministic regression tests: Two new tests in
TestErrorPaths:test_wake_agent_failure_still_records_created_agent_for_teardown: Single-agent case wherewake_agentfails aftercreate_agentsucceedstest_partial_provisioning_tracks_all_completed_creates_before_failure: Multi-agent case with mixed outcomes, usingmax_concurrent_provisioning=1to ensure deterministic call orderingDry-run fix: Update dry-run path to also populate
id_mapso state is consistent regardless of execution mode.Code cleanup: Remove redundant
state.created_agents = await self._provision_agents(...)assignment (now handled by alias inside_provision_agents).Test Results
All 51 tests pass, including the two new regression tests. The fix is proven correct under both single-agent and partial-fan-out scenarios.
Closes #164
Co-Authored-By: Claude Haiku 4.5 noreply@anthropic.com