Summary
- What broke: DAG workflows lose completed node state on the second resume, and may report "completed with no successful nodes" when the first node fails
- When it started: Unknown; bug exists in current codebase
- Severity: major — blocks multi-resume workflows and causes confusing error messages
Steps to Reproduce
Bug #1: Multi-Resume State Loss (Primary Bug)
- Run a DAG workflow with nodes A, B, C where A and B complete successfully, but C fails
- Resume the workflow — A and B are correctly skipped as already completed
- Resume the workflow a second time
- Observe that A and B are re-executed (they should be skipped)
Database evidence:
- Run `6ba79ec14bb2492f999fd3149c4ac227`: contains 12 `node_skipped_prior_success` events
- These events are NOT found by `getCompletedDagNodeOutputs()` on the next resume
Bug #2: Fresh Run Failure (Secondary Issue)
- Run a DAG workflow (e.g., `archon-hyperframes-generate`) where the first node fails (e.g., missing CLI)
- All downstream nodes are skipped due to `trigger_rule: all_success`
- Workflow completes with error: "DAG workflow 'X' completed with no successful nodes"
Database evidence:
- Run `324eb860b5709b9d7488d8a8f9a9181d`: failed at 2026-05-01 17:12:05
- Run `62e72b9c2adc9f88216e970ed0a39efd`: failed at 2026-05-01 17:09:34
Expected vs Actual
Expected (Multi-Resume)
- First resume: `getCompletedDagNodeOutputs(run1_id)` returns A, B as completed; A and B are skipped
- Second resume: `getCompletedDagNodeOutputs(run2_id)` should still return A, B as completed (inherited from run1)
- Workflow preserves completed node state across all resume cycles
Actual (Multi-Resume)
- First resume: A and B are correctly skipped, but `node_skipped_prior_success` events are emitted (NOT `node_completed`)
- Second resume: `getCompletedDagNodeOutputs(run2_id)` returns empty for A, B (it only queries `node_completed` events)
- A and B are re-executed, may fail, resulting in 0 completed nodes
Expected (Fresh Run with Failure)
- Clear error indicating which node failed and why
- Reasonable workflow state (e.g., `failed` with details)
Actual (Fresh Run with Failure)
- Error message: "DAG workflow 'X' completed with no successful nodes"
- 0 `node_completed`, 1 `node_failed`, 6 `node_skipped` events
- Confusing error that suggests a workflow completion issue rather than a node failure
User Flow
```
User                      Archon                                    Database
────                      ──────                                    ────────
runs workflow ──────────▶ executes nodes A, B
                          C fails
                          emits node_completed(A, B)
                          emits node_failed(C)
                          marks run as failed ────────────────────▶ stored
sees "C failed" ◀────────
resumes workflow ───────▶ getCompletedDagNodeOutputs ─────────────▶ queries node_completed
                          returns {A, B} ◀──────────────────────────
                          skips A, B
                          C completes
                          emits node_skipped_prior_success(A, B)
                          emits node_completed(C)
                          marks run as completed ─────────────────▶ stored
resumes again ──────────▶ getCompletedDagNodeOutputs ─────────────▶ BUG: query misses
                          returns {C} only; A, B missing             node_skipped_prior_success
                          re-executes A, B
                          [X] A or B may fail
                          0 completed nodes ──▶ "no successful nodes" error
```
Environment
- Platform: All (workflow engine bug)
- Database: SQLite / PostgreSQL (both affected)
- Running in worktree: Yes (reproduction performed in worktree)
- OS: macOS (arm64), but affects all platforms
- Archon: Current codebase at commit `4631b8e0`
Logs
Database Query Results (Multi-Resume Evidence)
```sql
-- Run with node_skipped_prior_success events
SELECT event_type, COUNT(*) FROM remote_agent_workflow_events
WHERE workflow_run_id = '6ba79ec14bb2492f999fd3149c4ac227'
GROUP BY event_type;
```
Results:
- `node_skipped_prior_success`: 12 events
- `node_completed`: N events (only nodes newly completed in this run)
Database Query Results (Fresh Run Failure)
```sql
-- Failed run with no successful nodes
SELECT event_type, COUNT(*) FROM remote_agent_workflow_events
WHERE workflow_run_id = '324eb860b5709b9d7488d8a8f9a9181d'
GROUP BY event_type;
```
Results:
- `node_completed`: 0
- `node_failed`: 1
- `node_skipped`: 6
Relevant Code Locations
Bug Location #1: packages/core/src/db/workflow-events.ts:122-152
```ts
export async function getCompletedDagNodeOutputs(workflowRunId: string): Promise<Map<string, string>> {
  const result = await pool.query<{...}>(
    `SELECT step_name, data FROM remote_agent_workflow_events
     WHERE workflow_run_id = $1 AND event_type = 'node_completed' -- BUG: only queries 'node_completed'
     ORDER BY created_at ASC`,
    [workflowRunId]
  );
  // … builds and returns the Map from result.rows (elided)
}
```
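To illustrate why the second resume comes back empty, here is a minimal self-contained sketch (plain TypeScript, no database; row shapes are simplified assumptions, not the actual schema) showing that a lookup filtered to `node_completed` drops rows recorded as `node_skipped_prior_success`:

```typescript
// Illustrative event rows as they might look after the first resume.
type EventRow = { event_type: string; step_name: string; data: string };

const rowsAfterFirstResume: EventRow[] = [
  { event_type: 'node_skipped_prior_success', step_name: 'A', data: '{"output":"a"}' },
  { event_type: 'node_skipped_prior_success', step_name: 'B', data: '{"output":"b"}' },
  { event_type: 'node_completed', step_name: 'C', data: '{"output":"c"}' },
];

// Mirrors the buggy query: only 'node_completed' rows are considered.
function completedOutputs(rows: EventRow[]): Map<string, string> {
  return new Map(
    rows
      .filter((r) => r.event_type === 'node_completed')
      .map((r) => [r.step_name, r.data] as [string, string])
  );
}

const outputs = completedOutputs(rowsAfterFirstResume);
console.log([...outputs.keys()]); // A and B are missing; only C survives
```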
Bug Location #2: packages/workflows/src/dag-executor.ts:2563
```ts
// Emits the wrong event type when skipping previously completed nodes
deps.store.createWorkflowEvent({
  workflow_run_id: workflowRun.id,
  event_type: 'node_skipped_prior_success', // BUG: not found by getCompletedDagNodeOutputs()
  step_name: node.id,
  data: { reason: 'prior_success' },
})
```
Completion Check: packages/workflows/src/dag-executor.ts:3049-3091
```ts
// Correctly counts only 'completed' state nodes
const nodeCounts = { completed: 0, failed: 0, skipped: 0, total: workflow.nodes.length };
for (const o of nodeOutputs.values()) {
  if (o.state === 'completed') nodeCounts.completed++;
  else if (o.state === 'failed') nodeCounts.failed++;
  else if (o.state === 'skipped') nodeCounts.skipped++;
}
if (nodeCounts.completed === 0) {
  // Fails with "no successful nodes" error
}
```
Impact
- Affected workflows/commands: All DAG workflows with resume capability
- Reproduction rate: Always (for multi-resume scenario), Intermittent (for fresh run failure)
- Workaround available: No — multi-resume workflows are broken
- Data loss risk: No — workflow runs are preserved in the database
Scope
- Package(s) likely involved: `workflows`, `core`
- Modules:
  - `workflows:dag-executor` (skip logic, completion check)
  - `core:db/workflow-events` (`getCompletedDagNodeOutputs` query)
Potential Fixes
- Option A: Modify `getCompletedDagNodeOutputs` to also query `node_skipped_prior_success` events and extract the preserved output from the event data
- Option B: Emit `node_completed` (not `node_skipped_prior_success`) when skipping due to prior success
- Option C: Store the original run ID in `workflow_runs` and always query the original run for completed nodes (chain to the first run)
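A minimal sketch of Option A's merge logic, shown as a pure function over already-fetched rows rather than the actual `pg` query. Note the assumption (hypothetical, flagged in the comments): the `node_skipped_prior_success` event's `data` would need to carry the preserved output forward — today it stores only `{ reason: 'prior_success' }`, so the emitter would also have to copy the output into the event.

```typescript
type EventRow = { event_type: string; step_name: string; data: string };

// Option A sketch: treat both event types as "this node's output is known".
// ASSUMPTION: node_skipped_prior_success rows carry the preserved output in
// `data` (the real payload currently stores only { reason: 'prior_success' }).
function getCompletedDagNodeOutputsFromRows(rows: EventRow[]): Map<string, string> {
  const out = new Map<string, string>();
  for (const r of rows) {
    if (r.event_type === 'node_completed' || r.event_type === 'node_skipped_prior_success') {
      out.set(r.step_name, r.data); // later rows win, matching ORDER BY created_at ASC
    }
  }
  return out;
}
```

On the SQL side this corresponds to widening the filter to `event_type IN ('node_completed', 'node_skipped_prior_success')`.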
Additional fix for secondary issue: Improve error messaging when workflow fails due to first node failure; distinguish between "no nodes completed" and "workflow failed due to node X"
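For the secondary issue, a hedged sketch of what a node-specific failure message could look like (function and field names are illustrative, not the existing API):

```typescript
type NodeOutcome = { state: 'completed' | 'failed' | 'skipped'; error?: string };

// Sketch: distinguish "workflow failed due to node X" from the generic
// "completed with no successful nodes" message.
function describeDagFailure(workflowName: string, outcomes: Map<string, NodeOutcome>): string {
  const failed = [...outcomes.entries()].filter(([, o]) => o.state === 'failed');
  const completed = [...outcomes.values()].filter((o) => o.state === 'completed').length;
  if (completed === 0 && failed.length > 0) {
    const [name, o] = failed[0];
    const detail = o.error ? ` (${o.error})` : '';
    return `DAG workflow '${workflowName}' failed: node '${name}' failed${detail}; downstream nodes were skipped`;
  }
  return `DAG workflow '${workflowName}' completed with no successful nodes`;
}
```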