feat(aqe): support executor failure in AdaptiveExecutionGraph#1601
Draft
jja725 wants to merge 14 commits into apache:main from
Conversation
Spec for issue apache#1359 task "support executor failure" — covers the AdaptivePlanner state-sync gap that prevents re-running stages from accepting update_exchange_locations after rollback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review fix: code blocks now match the Migration section (pub(crate) for exec wrappers, pub(super) for the planner method). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task-by-task TDD plan covering:
- per-exec reset_locations_on_lost_executor (ExchangeExec, AdaptiveDatafusionExec)
- AdaptivePlanner::reset_on_lost_executor + collect_affected_stages
- wire-in at AdaptiveExecutionGraph::reset_stages_internal
- four ported executor-failure tests
- final clippy/fmt verification
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clears resolved shuffle_partitions when any location references the lost executor; returns the stage_id so the planner can restore cache entries downstream. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors ExchangeExec; the final-stage wrapper can also carry resolved shuffle metadata when the root stage produces shuffled output. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks the live plan tree, clears resolved shuffle metadata on ExchangeExec / AdaptiveDatafusionExec nodes that reference the lost executor, restores runnable_stage_cache / runnable_stage_output entries for affected stages, and re-runs replan_stages. Without this the AdaptiveExecutionGraph rolls back stages but the planner's plan tree still treats them as resolved, and re-running stages can't accept update_exchange_locations. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reset_stages_internal now syncs planner state after rolling back graph-level stages on executor loss; also drop the stale "does not cover executor failure" doc line. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds test_aqe_aggregation_plan / test_aqe_join_plan that build AdaptiveExecutionGraph from a SQL-shaped plan, mirroring the static graph helpers. Registers an empty executor_failure module to be filled in by subsequent tasks. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies that when an executor is lost mid-stage, the graph rolls back affected stages, the planner restores its cache, and the job completes successfully on a surviving executor. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original spec assumed AQE's reset_stages_internal could rely on the
existing stage.inputs walk to detect lost-executor data. That walk is
a no-op for AQE (create_resolved_stage initialises inputs to an empty
HashMap). Update the contract:
- AdaptivePlanner::reset_on_lost_executor now returns the set of
stage_ids whose ExchangeExec / AdaptiveDatafusionExec outputs were
on the lost executor.
- AdaptiveExecutionGraph::reset_stages_internal uses that set to:
1. Reset task_infos and transition matching Successful stages back
to Running so they re-execute.
2. Drop any Resolved/Running stages whose embedded plan reads from
an affected stage (their ShuffleReaderExec entries hold stale
locations). The planner regenerates them via actionable_stages
once the upstream reruns complete.
Add the dependency walker plan_reads_from_any used by step 2.
Also adds test_reset_resolved_stage_executor_lost which covers the
case where both leaf stages have completed before the executor is
lost.
Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies that:
1. A late task status arriving after reset_stages_on_lost_executor doesn't corrupt graph state.
2. A second reset call for the same executor is a no-op (idempotent).
3. The job still completes after the reset/late-status churn.
Also softens update_task_status to warn (not error) when a task status arrives for a stage that's been dropped during recovery — this is now expected when the dependent-stage drop in reset_stages_internal removes a Running stage that still has in-flight tasks.
Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies that a failed task status arriving long after the executor was declared lost does not corrupt graph state — the rerun continues on the surviving executor. Refs: docs/superpowers/specs/2026-04-27-aqe-executor-failure-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse a nested if (clippy::collapsible_if) and apply rustfmt across the new and modified files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements the unchecked "support executor failure" task from the AQE epic #1359. Brings `AdaptiveExecutionGraph` to parity with `StaticExecutionGraph`'s executor-loss recovery so re-running stages and rolled-back stages work end-to-end under AQE.

Problem
`AdaptiveExecutionGraph::reset_stages_on_lost_executor` was structurally a copy of the static-graph version, but it was a no-op in practice for AQE because:
- `create_resolved_stage` initialises `stage.inputs` to an empty `HashMap` (intentional: partition locations live in the planner's plan tree under `ExchangeExec.shuffle_partitions`, not in `stage.inputs`). The static-graph rollback walk thus found nothing.
- `AdaptivePlanner`'s side state (`runnable_stage_cache`, `runnable_stage_output`) was never told about lost executors, so a re-running successful stage failed with "Can't find active stage to update stage outputs".
- `ExchangeExec` / `AdaptiveDatafusionExec` in the live plan tree retained `shuffle_partitions = Some(...)` even after their owning stage rolled back, so `find_runnable_exchanges` would skip them.

Changes
- `aqe/execution_plan.rs`: new `pub(crate) fn reset_locations_on_lost_executor(&self, executor_id) -> Option<usize>` on both `ExchangeExec` and `AdaptiveDatafusionExec`. Clears `shuffle_partitions` back to `None` if any location matches the lost executor; returns the affected stage_id.
- `aqe/planner.rs`: new `pub(super) fn reset_on_lost_executor(&mut self, executor_id) -> Result<HashSet<usize>>` on `AdaptivePlanner`. Walks the live plan tree, calls the per-exec reset, restores `runnable_stage_cache` / `runnable_stage_output` for affected stages, re-runs `replan_stages()`, and returns the set of affected stage_ids.
- `aqe/mod.rs`: `reset_stages_internal` now uses the planner's affected set to:
  1. Reset task_infos and transition matching `Successful` stages back to `Running` so they re-execute.
  2. Drop `Resolved`/`Running` stages whose embedded plan reads from an affected stage (their `ShuffleReaderExec` entries hold stale partition locations). The planner regenerates them via `actionable_stages` once upstream reruns complete.
- `aqe/mod.rs`: `update_task_status` now warns (instead of erroring) when a task status arrives for a stage that's no longer in `self.stages`. This is expected after the dependent-stage drop above.
- `aqe/test/executor_failure.rs`: new tests
  - `test_reset_completed_stage_executor_lost`
  - `test_reset_resolved_stage_executor_lost`
  - `test_task_update_after_reset_stage` (incl. idempotency check)
  - `test_long_delayed_failed_task_after_executor_lost`
- `test_aqe_aggregation_plan` / `test_aqe_join_plan` helpers in `aqe/test/mod.rs` that build an `AdaptiveExecutionGraph` from a SQL-shaped plan, mirroring the static-graph builders.
- Drop the `/// - it does not cover executor failure` line from the `AdaptiveExecutionGraph` docstring.

Test Plan
- `cargo test -p ballista-scheduler`: 76 passed, 0 failed, 1 ignored (pre-existing).
- `cargo clippy -p ballista-scheduler --tests`: clean.
- `cargo fmt -p ballista-scheduler -- --check`: clean.

Out of Scope
Refs: #1359
Design and implementation plan committed under
`docs/superpowers/specs/` and `docs/superpowers/plans/`.