execution_graph: preserve unrun dirty work on dispatch error #91

Merged
no-materials merged 3 commits into
forest-rs:main from
no-materials:graph/fix/preserve-unrun-dirty-on-dispatch-error
Jun 3, 2026

@no-materials
Contributor

plan_all / plan_within_dependencies_of drain (and clear) the entire in-scope dirty set before any node runs. InlineDispatcher is fail-fast, so when a node traps or errors mid-pass, the dirty marks of every node scheduled after it are already gone and never re-instated. The caller fixes the fault, retries, and gets Ok(executed_nodes: 0): unrelated pending work is silently lost.

Fix

On error, the dispatcher re-marks the failed node and the unrun tail dirty via remark_scheduled_dirty, so a subsequent run re-attempts them. Execution is still fail-fast; only the recovery of dropped work changes. The fix covers both run_all (global scope) and run_node (dependency-closure scope).
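
To make the failure mode and the fix concrete, here is a minimal toy model, not the crate's actual code. `ToyGraph`, its fields, and the `remark_on_error` flag are hypothetical names invented for illustration; only the drain-up-front planning, the fail-fast loop, and the re-marking of `to_run[i..]` mirror what this PR describes.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the real graph (hypothetical): node ids are plain
/// `usize`s, and a node "traps" if its id is listed in `failing`.
#[derive(Default)]
struct ToyGraph {
    dirty: BTreeSet<usize>,
    failing: BTreeSet<usize>,
}

impl ToyGraph {
    /// Drain (and clear) the whole dirty set up front to build the schedule.
    fn plan_all(&mut self) -> Vec<usize> {
        std::mem::take(&mut self.dirty).into_iter().collect()
    }

    /// Run one node; fails for ids listed in `failing`.
    fn execute(&self, node: usize) -> Result<(), String> {
        if self.failing.contains(&node) {
            Err(format!("node {node} trapped"))
        } else {
            Ok(())
        }
    }

    /// Fail-fast run returning the number of executed nodes. With
    /// `remark_on_error`, the failed node and the unrun tail are marked
    /// dirty again before the error is returned.
    fn run_all(&mut self, remark_on_error: bool) -> Result<usize, String> {
        let to_run = self.plan_all();
        for (i, &node) in to_run.iter().enumerate() {
            if let Err(e) = self.execute(node) {
                if remark_on_error {
                    self.dirty.extend(to_run[i..].iter().copied());
                }
                return Err(e);
            }
        }
        Ok(to_run.len())
    }
}
```

With nodes 1, 2, 3 dirty and node 2 failing, `run_all(false)` returns an error and leaves the dirty set empty, so a retry after clearing the fault reports zero executed nodes. With `run_all(true)`, nodes 2 and 3 are dirty again after the error and the retry executes both, while node 1 (which already ran) is not re-marked.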

Tests

Three regression tests (verified to fail without the fix):

  • independent node survives a sibling's trap and recovers on re-run
  • nodes that already executed before the trap are not re-marked
  • run_node closure sibling survives a trap within the closure

  Planning drains and clears the entire in-scope dirty set up front to
  build the schedule, but InlineDispatcher executes fail-fast: when a node
  trapped or errored, every node scheduled after it lost its (already
  drained) dirty mark. A retry then planned from an empty set and reported
  "nothing to do" while the pending work was silently dropped.

  On error, re-mark the failed node and the unrun tail dirty
  (remark_scheduled_dirty) so the work is recoverable on the next run.
  Applies to both run_all (global) and run_node (closure) paths.

  Add three regression tests and document the fail-fast recovery contract
  on run_all/run_node.
if let Err(e) = graph.execute_scheduled_node(node) {
    // Fail-fast: this node errored and `to_run[i + 1..]` never ran. Re-mark them so
    // their drained dirty state is not silently lost (see `dispatch`).
    graph.remark_scheduled_dirty(&to_run[i..]);
Contributor

But when the error gets returned here, the report that has been accumulated so far is dropped, so the Report handling here could use improvement (but maybe as a follow-up?)

Contributor Author

Agreed that this should be taken as a follow-up. Surfacing the partial report on error is an API decision (a richer return type, or an error variant carrying the partial RunDetailReport, etc.).

Also it predates this PR: the old ?-based loop dropped it too.

For now I've left the behavior and added an inline note at the drop site.
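
For reference, one of the options mentioned above (an error that carries the partial report) could look roughly like the sketch below. This is not a proposal for the actual API: `DispatchFailure`, the simplified `RunDetailReport` with a single `executed_nodes` field, the `String` error cause, and the closure-based `dispatch_with_report` signature are all assumptions made to keep the example self-contained.

```rust
/// Hypothetical, simplified stand-in for the real `RunDetailReport`.
#[derive(Debug, Default, Clone, PartialEq)]
struct RunDetailReport {
    executed_nodes: Vec<usize>,
}

/// Hypothetical error type pairing the underlying cause with whatever was
/// recorded before the failure.
#[derive(Debug, PartialEq)]
struct DispatchFailure {
    cause: String,
    partial: RunDetailReport,
}

/// Fail-fast dispatch that hands the partial report back on the error path
/// instead of dropping it.
fn dispatch_with_report(
    to_run: &[usize],
    mut execute: impl FnMut(usize) -> Result<(), String>,
) -> Result<RunDetailReport, DispatchFailure> {
    let mut report = RunDetailReport::default();
    for &node in to_run {
        match execute(node) {
            Ok(()) => report.executed_nodes.push(node),
            // Move the accumulated report into the error rather than losing it.
            Err(cause) => return Err(DispatchFailure { cause, partial: report }),
        }
    }
    Ok(report)
}
```

A caller can then inspect `err.partial.executed_nodes` to see which nodes completed before the trap. The trade-off is a larger error type on every failure path, which is part of why this is left as a separate API decision.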

// were cleared when the plan was drained, so re-mark them to keep that pending
// work recoverable on the next run instead of silently dropping it.
graph.remark_scheduled_dirty(&to_run[i..]);
return Err(e);
Contributor

Also, because of this error path, we aren't restoring the buffer for to_run back into the scratch workspace. Might not be that serious a thing, but maybe this is a sign we need a better API or design here overall.

Contributor Author

True! I made it so the dispatcher now hands the buffer back to scratch itself via a reclaim_schedule_buffer hook on every exit path, rather than returning it for the caller to stash on the Ok arm only. Dispatcher is pub(crate), so no public API change - clean on that end.

  The InlineDispatcher returned the drained schedule buffer for the caller
  to stash back into scratch, but only on the Ok arm — on the fail-fast
  error path the buffer was dropped, so the next planning pass reallocated.

  Have the dispatcher reclaim the buffer itself via a new
  reclaim_schedule_buffer hook on every exit path (success and error).
  dispatch now returns Result<(), GraphError> and dispatch_with_report
  returns Result<RunDetailReport, GraphError> — the buffer no longer rides
  on the return type and can't be lost on error. Dispatcher is pub(crate),
  so no public API change.
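
The shape of that change can be sketched as follows, under stated assumptions: `Scratch`, `take_schedule_buffer`, the `usize` node ids, the `String` error, and the signature given to `reclaim_schedule_buffer` here are hypothetical simplifications, not the crate's real types. The point illustrated is only that the outcome is computed first and the buffer is reclaimed once, on a path shared by success and failure.

```rust
/// Hypothetical scratch workspace that keeps the schedule allocation alive
/// between planning passes.
#[derive(Default)]
struct Scratch {
    schedule: Vec<usize>,
}

impl Scratch {
    /// Hand the buffer out, leaving an empty `Vec` behind.
    fn take_schedule_buffer(&mut self) -> Vec<usize> {
        std::mem::take(&mut self.schedule)
    }

    /// Take the buffer back, cleared but with its capacity intact.
    fn reclaim_schedule_buffer(&mut self, mut buf: Vec<usize>) {
        buf.clear();
        self.schedule = buf;
    }
}

/// Fail-fast dispatch that owns the schedule buffer and returns it to
/// scratch itself, so the return type no longer carries the buffer.
fn dispatch(
    scratch: &mut Scratch,
    to_run: Vec<usize>,
    mut execute: impl FnMut(usize) -> Result<(), String>,
) -> Result<(), String> {
    // `try_for_each` stops at the first error, matching fail-fast behavior.
    let result = to_run.iter().try_for_each(|&node| execute(node));
    // Single exit path: reclaim happens whether `result` is Ok or Err.
    scratch.reclaim_schedule_buffer(to_run);
    result
}
```

Because the buffer is moved into `dispatch` and reclaimed before returning, an early error cannot drop the allocation, and the next planning pass reuses its capacity.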
Contributor

@waywardmonkeys waywardmonkeys left a comment

Rebase it and you can land it after that (via squash).

@no-materials no-materials merged commit 88e434e into forest-rs:main Jun 3, 2026
15 checks passed
@no-materials no-materials deleted the graph/fix/preserve-unrun-dirty-on-dispatch-error branch June 4, 2026 06:05