execution_graph: preserve unrun dirty work on dispatch error#91
Conversation
Planning drains and clears the entire in-scope dirty set up front to build the schedule, but InlineDispatcher executes fail-fast: when a node trapped or errored, every node scheduled after it lost its (already drained) dirty mark. A retry then planned from an empty set and reported "nothing to do" while the pending work was silently dropped. On error, re-mark the failed node and the unrun tail dirty (remark_scheduled_dirty) so the work is recoverable on the next run. Applies to both run_all (global) and run_node (closure) paths. Add three regression tests and document the fail-fast recovery contract on run_all/run_node.
| if let Err(e) = graph.execute_scheduled_node(node) { | ||
| // Fail-fast: this node errored and `to_run[i + 1..]` never ran. Re-mark them so | ||
| // their drained dirty state is not silently lost (see `dispatch`). | ||
| graph.remark_scheduled_dirty(&to_run[i..]); |
There was a problem hiding this comment.
But the error gets returned here, the report that has been accumulated so far is dropped ... so the Report handling here could use improvement (but maybe as a follow up?)
There was a problem hiding this comment.
Agreed that this should be taken as a follow-up. Surfacing the partial report on error is an API decision (richer return, or an error variant carrying the partial RunDetailReport, etc).
Also it predates this PR: the old ?-based loop dropped it too.
For now I've left the behavior and added an inline note at the drop site.
| // were cleared when the plan was drained, so re-mark them to keep that pending | ||
| // work recoverable on the next run instead of silently dropping it. | ||
| graph.remark_scheduled_dirty(&to_run[i..]); | ||
| return Err(e); |
There was a problem hiding this comment.
Also, because of this error path, we aren't restoring the buffer for to_run back into the scratch workspace. Might not be that serious a thing, but maybe this is a sign we need a better API or design here overall.
There was a problem hiding this comment.
True! I made it so the dispatcher now hands the buffer back to scratch itself via a reclaim_schedule_buffer hook on every exit path, rather than returning it for the caller to stash on the Ok arm only. Dispatcher is pub(crate), so no public API change - clean on that end.
The InlineDispatcher returned the drained schedule buffer for the caller to stash back into scratch, but only on the Ok arm — on the fail-fast error path the buffer was dropped, so the next planning pass reallocated. Have the dispatcher reclaim the buffer itself via a new reclaim_schedule_buffer hook on every exit path (success and error). dispatch now returns Result<(), GraphError> and dispatch_with_report returns Result<RunDetailReport, GraphError> — the buffer no longer rides on the return type and can't be lost on error. Dispatcher is pub(crate), so no public API change.
waywardmonkeys
left a comment
There was a problem hiding this comment.
Rebase it and you can land it after that (via squash).
plan_all/plan_within_dependencies_ofdrain (and clear) the entire in-scope dirty set before any node runs.InlineDispatcheris fail-fast, so when a node traps or errors mid-pass, the dirty marks of every node scheduled after it are already gone and never re-instated. The caller fixes the fault, retries, and getsOk(executed_nodes: 0): unrelated pending work is silently lost.Fix
On error, the dispatcher re-marks the failed node and the unrun tail dirty via
remark_scheduled_dirty, so a subsequent run re-attempts them. Still fail-fast, only the recovery of dropped work changes. Covers bothrun_all(global scope) andrun_node(dependency-closure scope).Tests
Three regression tests (verified to fail without the fix):
run_nodeclosure sibling survives a trap within the closure