Summary
- What broke: DAG workflows lose completed node state on the second resume, and may report "completed with no successful nodes" when the first node fails
- When it started: Unknown; bug exists in current codebase
- Severity: major — blocks multi-resume workflows and causes confusing error messages
Steps to Reproduce
Bug #1: Multi-Resume State Loss (Primary Bug)
- Run a DAG workflow with nodes A, B, C where A and B complete successfully, but C fails
- Resume the workflow — A and B are correctly skipped as already completed
- Resume the workflow a second time
- Observe that A and B are re-executed (they should be skipped)
Database evidence:
- Run `6ba79ec14bb2492f999fd3149c4ac227`: contains 12 `node_skipped_prior_success` events
- These events are NOT found by `getCompletedDagNodeOutputs()` on the next resume
Bug #2: Fresh Run Failure (Secondary Issue)
- Run a DAG workflow (e.g., `archon-hyperframes-generate`) where the first node fails (e.g., missing CLI)
- All downstream nodes are skipped due to `trigger_rule: all_success`
- Workflow completes with error: "DAG workflow 'X' completed with no successful nodes"
Database evidence:
- Run `324eb860b5709b9d7488d8a8f9a9181d`: failed at 2026-05-01 17:12:05
- Run `62e72b9c2adc9f88216e970ed0a39efd`: failed at 2026-05-01 17:09:34
Expected vs Actual
Expected (Multi-Resume)
- First resume: `getCompletedDagNodeOutputs(run1_id)` returns A, B as completed; A and B are skipped
- Second resume: `getCompletedDagNodeOutputs(run2_id)` should still return A, B as completed (inherited from run1)
- Workflow preserves completed node state across all resume cycles
Actual (Multi-Resume)
- First resume: A and B are correctly skipped, but `node_skipped_prior_success` events are emitted (NOT `node_completed`)
- Second resume: `getCompletedDagNodeOutputs(run2_id)` returns empty for A, B (it only queries `node_completed` events)
- A and B are re-executed, may fail, resulting in 0 completed nodes
Expected (Fresh Run with Failure)
- Clear error indicating which node failed and why
- Reasonable workflow state (e.g., `failed` with details)
Actual (Fresh Run with Failure)
- Error message: "DAG workflow 'X' completed with no successful nodes"
- 0 `node_completed`, 1 `node_failed`, 6 `node_skipped` events
- Confusing error that suggests a workflow completion issue rather than a node failure
User Flow
```
User                      Archon                                    Database
────                      ──────                                    ────────
runs workflow ──────────▶ executes nodes A, B
                          C fails
                          emits node_completed(A, B)
                          emits node_failed(C)
                          marks run as failed ────────────────────▶ stored
sees "C failed" ◀────────
resumes workflow ───────▶ getCompletedDagNodeOutputs ─────────────▶ queries node_completed
                          returns {A, B} ◀──────────────────────────
                          skips A, B
                          C completes
                          emits node_skipped_prior_success(A, B)
                          emits node_completed(C)
                          marks run as completed ─────────────────▶ stored
resumes again ──────────▶ getCompletedDagNodeOutputs ─────────────▶ BUG: query misses
                          returns {C} only; A, B missing             node_skipped_prior_success
                          re-executes A, B
                          [X] A or B may fail
                          0 completed nodes ──▶ "no successful nodes" error
```
Environment
- Platform: All (workflow engine bug)
- Database: SQLite / PostgreSQL (both affected)
- Running in worktree: Yes (reproduction performed in worktree)
- OS: macOS (arm64), but affects all platforms
- Archon: Current codebase at commit `4631b8e0`
Logs
Database Query Results (Multi-Resume Evidence)
```sql
-- Run with node_skipped_prior_success events
SELECT event_type, COUNT(*) FROM remote_agent_workflow_events
WHERE workflow_run_id = '6ba79ec14bb2492f999fd3149c4ac227'
GROUP BY event_type;
```
Results:
- `node_skipped_prior_success`: 12 events
- `node_completed`: N events (only nodes newly completed in this run)
Database Query Results (Fresh Run Failure)
```sql
-- Failed run with no successful nodes
SELECT event_type, COUNT(*) FROM remote_agent_workflow_events
WHERE workflow_run_id = '324eb860b5709b9d7488d8a8f9a9181d'
GROUP BY event_type;
```
Results:
- `node_completed`: 0
- `node_failed`: 1
- `node_skipped`: 6
Relevant Code Locations
Bug Location #1: packages/core/src/db/workflow-events.ts:122-152
```ts
export async function getCompletedDagNodeOutputs(workflowRunId: string): Promise<Map<string, string>> {
  const result = await pool.query<{...}>(
    `SELECT step_name, data FROM remote_agent_workflow_events
     WHERE workflow_run_id = $1 AND event_type = 'node_completed' -- BUG: only queries 'node_completed'
     ORDER BY created_at ASC`,
    [workflowRunId]
  );
  // … builds and returns the Map from result.rows (elided)
}
```
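To illustrate why the second resume comes back empty, here is a minimal self-contained sketch (plain TypeScript, no database; row shapes are simplified assumptions, not the actual schema) showing that a lookup filtered to `node_completed` drops rows recorded as `node_skipped_prior_success`:

```typescript
// Illustrative event rows as they might look after the first resume.
type EventRow = { event_type: string; step_name: string; data: string };

const rowsAfterFirstResume: EventRow[] = [
  { event_type: 'node_skipped_prior_success', step_name: 'A', data: '{"output":"a"}' },
  { event_type: 'node_skipped_prior_success', step_name: 'B', data: '{"output":"b"}' },
  { event_type: 'node_completed', step_name: 'C', data: '{"output":"c"}' },
];

// Mirrors the buggy query: only 'node_completed' rows are considered.
function completedOutputs(rows: EventRow[]): Map<string, string> {
  return new Map(
    rows
      .filter((r) => r.event_type === 'node_completed')
      .map((r) => [r.step_name, r.data] as [string, string])
  );
}

const outputs = completedOutputs(rowsAfterFirstResume);
console.log([...outputs.keys()]); // A and B are missing; only C survives
```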
Bug Location #2: packages/workflows/src/dag-executor.ts:2563
```ts
// Emits the wrong event type when skipping previously completed nodes
deps.store.createWorkflowEvent({
  workflow_run_id: workflowRun.id,
  event_type: 'node_skipped_prior_success', // BUG: not found by getCompletedDagNodeOutputs()
  step_name: node.id,
  data: { reason: 'prior_success' },
})
```
Completion Check: packages/workflows/src/dag-executor.ts:3049-3091
```ts
// Correctly counts only 'completed' state nodes
const nodeCounts = { completed: 0, failed: 0, skipped: 0, total: workflow.nodes.length };
for (const o of nodeOutputs.values()) {
  if (o.state === 'completed') nodeCounts.completed++;
  else if (o.state === 'failed') nodeCounts.failed++;
  else if (o.state === 'skipped') nodeCounts.skipped++;
}
if (nodeCounts.completed === 0) {
  // Fails with "no successful nodes" error
}
```
Impact
- Affected workflows/commands: All DAG workflows with resume capability
- Reproduction rate: Always (for multi-resume scenario), Intermittent (for fresh run failure)
- Workaround available: No — multi-resume workflows are broken
- Data loss risk: No — workflow runs are preserved in the database
Scope
- Package(s) likely involved: `workflows`, `core`
- Modules:
  - `workflows:dag-executor` (skip logic, completion check)
  - `core:db/workflow-events` (`getCompletedDagNodeOutputs` query)
Potential Fixes
- Option A: Modify `getCompletedDagNodeOutputs` to also query `node_skipped_prior_success` events and extract the preserved output from the event data
- Option B: Emit `node_completed` (not `node_skipped_prior_success`) when skipping due to prior success
- Option C: Store the original run ID in `workflow_runs` and always query the original run for completed nodes (chain to the first run)
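A minimal sketch of Option A's merge logic, shown as a pure function over already-fetched rows rather than the actual `pg` query. Note the assumption (hypothetical, flagged in the comments): the `node_skipped_prior_success` event's `data` would need to carry the preserved output forward — today it stores only `{ reason: 'prior_success' }`, so the emitter would also have to copy the output into the event.

```typescript
type EventRow = { event_type: string; step_name: string; data: string };

// Option A sketch: treat both event types as "this node's output is known".
// ASSUMPTION: node_skipped_prior_success rows carry the preserved output in
// `data` (the real payload currently stores only { reason: 'prior_success' }).
function getCompletedDagNodeOutputsFromRows(rows: EventRow[]): Map<string, string> {
  const out = new Map<string, string>();
  for (const r of rows) {
    if (r.event_type === 'node_completed' || r.event_type === 'node_skipped_prior_success') {
      out.set(r.step_name, r.data); // later rows win, matching ORDER BY created_at ASC
    }
  }
  return out;
}
```

On the SQL side this corresponds to widening the filter to `event_type IN ('node_completed', 'node_skipped_prior_success')`.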
Additional fix for secondary issue: Improve error messaging when workflow fails due to first node failure; distinguish between "no nodes completed" and "workflow failed due to node X"
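For the secondary issue, a hedged sketch of what a node-specific failure message could look like (function and field names are illustrative, not the existing API):

```typescript
type NodeOutcome = { state: 'completed' | 'failed' | 'skipped'; error?: string };

// Sketch: distinguish "workflow failed due to node X" from the generic
// "completed with no successful nodes" message.
function describeDagFailure(workflowName: string, outcomes: Map<string, NodeOutcome>): string {
  const failed = [...outcomes.entries()].filter(([, o]) => o.state === 'failed');
  const completed = [...outcomes.values()].filter((o) => o.state === 'completed').length;
  if (completed === 0 && failed.length > 0) {
    const [name, o] = failed[0];
    const detail = o.error ? ` (${o.error})` : '';
    return `DAG workflow '${workflowName}' failed: node '${name}' failed${detail}; downstream nodes were skipped`;
  }
  return `DAG workflow '${workflowName}' completed with no successful nodes`;
}
```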