
/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549

@ztech-gthb

Summary

  • What broke: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message ("task A"). The new prompt is discarded with no UI/log indication. The user sees a positive completion report on task A and is confused why task B never happened. Compounding: /workflow abandon rejects failed runs as "already terminal", so users hit by this cannot easily escape.
  • When it started (if known): introduced in PR #914 (fix: foreground resume for interactive workflows + chat auto-resume), which added findResumableRunByParentConversation with status IN ('failed', 'paused'). The 'failed' clause was scoped to support manual /workflow resume <id>; using it for automatic resume on a fresh /workflow run produces the silent-hijack behavior.
  • Severity: major (silent data loss / silent intent loss; trust-corroding)
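
The effect of the status IN ('failed', 'paused') clause can be reproduced in isolation. Below is a minimal sketch against an in-memory SQLite database. The table name comes from this report, but the reduced column set and the find_resumable helper are assumptions for illustration, not Archon's actual schema or query.

```python
import sqlite3

# Assumed, simplified schema: only the columns this report mentions.
SCHEMA = """
CREATE TABLE remote_agent_workflow_runs (
    id TEXT PRIMARY KEY,
    workflow_name TEXT NOT NULL,
    parent_conversation_id TEXT NOT NULL,
    status TEXT NOT NULL,
    user_message TEXT NOT NULL,
    working_path TEXT NOT NULL
)
"""


def find_resumable(conn, workflow_name, conversation_id, statuses):
    """Hypothetical stand-in for findResumableRunByParentConversation."""
    placeholders = ", ".join("?" for _ in statuses)
    return conn.execute(
        "SELECT id, user_message, working_path "
        "FROM remote_agent_workflow_runs "
        "WHERE workflow_name = ? AND parent_conversation_id = ? "
        f"AND status IN ({placeholders}) LIMIT 1",
        (workflow_name, conversation_id, *statuses),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO remote_agent_workflow_runs VALUES (?, ?, ?, ?, ?, ?)",
    ("run-A", "X", "chat-1", "failed", "task A", "worktrees/thread-run-A"),
)

# Reported behavior: the failed run matches, so a fresh request for
# "task B" would be steered into run-A's worktree and message.
hijacked = find_resumable(conn, "X", "chat-1", ("failed", "paused"))

# With 'failed' left out of the automatic lookup, nothing matches
# and a fresh run can be dispatched.
fresh = find_resumable(conn, "X", "chat-1", ("paused",))

print(hijacked)
print(fresh)
```

The only difference between the two lookups is the status list, which is the part of the query this report questions.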

Steps to Reproduce

  1. Pick any workflow whose first node materializes the user_message into $ARTIFACTS_DIR/.X files (most non-trivial workflows do this — e.g. parse-args style scripts).
  2. Run it with input that fails an early step:
    /workflow run my-workflow "input that fails parse"
    
  3. Observe: the run is marked failed in remote_agent_workflow_runs, with working_path = .../worktrees/archon/thread-<old-id>/.
  4. In the same chat conversation, run it with new input:
    /workflow run my-workflow "completely different task"
    
  5. Observe in server logs:
    module=command-handler   args="completely different task"   ← what the user typed
    module=orchestrator-agent msg=orchestrator.foreground_resume_detected
                              resumableRunId=<old-id>
                              workingPath=…/thread-<old-id>/
    
  6. The workflow runs again, in the same sub-worktree, with the previous run's user_message (preserved in $ARTIFACTS_DIR/.X files). Step 4's input is never used.
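
One plausible mechanism behind step 6, sketched as a toy model: the first node writes the user's message to an artifact file, and on resume that node is treated as already completed, so later nodes read the old file. The node functions and the executor below are hypothetical stand-ins, not Archon code; .X is the placeholder file name used in this report.

```python
import tempfile
from pathlib import Path


def parse_args_node(artifacts_dir: Path, user_message: str) -> None:
    """First node: persist the user's message for later nodes (assumed behavior)."""
    (artifacts_dir / ".X").write_text(user_message)


def downstream_node(artifacts_dir: Path) -> str:
    """Later node: reads the persisted file, never the live user message."""
    return (artifacts_dir / ".X").read_text()


def execute(artifacts_dir: Path, user_message: str, completed: set) -> str:
    """Toy executor: nodes listed in `completed` are skipped, as on a resume."""
    if "parse-args" not in completed:
        parse_args_node(artifacts_dir, user_message)
    return downstream_node(artifacts_dir)


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    # Run A: parse-args completes; a later failure is not modeled here.
    first = execute(artifacts, "task A", completed=set())
    # Second request resumes run A: parse-args is skipped, so "task B"
    # is never written and the downstream node still sees "task A".
    resumed = execute(artifacts, "task B", completed={"parse-args"})

print(first)
print(resumed)
```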

Expected vs Actual

  • Expected: a fresh /workflow run with new args dispatches a fresh run in a fresh worktree with the new args. The prior failed run remains as an audit-trail row but does not steer execution. If the user wants to continue the failed run from where it stopped, they explicitly type /workflow resume <id>.
  • Actual: the orchestrator silently picks up any failed | paused resumable run for the same (workflow_name, parent_conversation_id), calls executeWorkflow with the failed run's working_path, and the workflow re-reads stale state from disk. The new args travel through the call as userMessage but are discarded by parse-args/script-style early nodes.
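
The expected behavior can be written down as a small decision function. This is only a sketch of the rule described above; plan_dispatch, FailedRun, and the returned dictionary shape are hypothetical names, not part of the codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailedRun:
    id: str
    working_path: str


def plan_dispatch(command: str, failed_run: Optional[FailedRun],
                  resume_id: Optional[str] = None) -> dict:
    """Decide what to execute; all names here are hypothetical."""
    if command == "run":
        # Fresh request: new worktree, new args. The failed row is left
        # untouched as an audit-trail record and does not steer execution.
        return {"action": "fresh", "working_path": None}
    if command == "resume":
        # Explicit opt-in: continue the named run where it stopped.
        if failed_run is None or failed_run.id != resume_id:
            raise ValueError(f"no resumable run with id {resume_id!r}")
        return {"action": "resume", "working_path": failed_run.working_path}
    raise ValueError(f"unknown command {command!r}")


run_a = FailedRun(id="run-A", working_path="worktrees/thread-run-A")
fresh = plan_dispatch("run", run_a)
resumed = plan_dispatch("resume", run_a, resume_id="run-A")
```

The point of the sketch is that the failed run's working_path is only ever reachable through the explicit resume branch.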

User Flow

User                              Archon                              DB
────                              ──────                              ──
runs /workflow run X "A" ───────▶ findResumable... → null
                                  dispatch fresh                  ───▶ run-A row → status='failed'
                                  (parse-args fails on input "A")

runs /workflow run X "B" ───────▶ findResumable... → run-A
                                  [X] auto-resume in run-A's worktree
                                      with run-A's persisted state
                                  executeWorkflow(
                                      working_path=thread-run-A,
                                      userMessage="B"
                                      ↑ scripts ignore: they read
                                        $ARTIFACTS_DIR/.X from run-A
                                  )                              ───▶ run-A re-executed,
                                                                       still on task A
sees positive report ◀────────── task-A success report
   "I asked for B"                (still no idea task B was hijacked)

The [X] is where intent silently disappears.

Environment

  • Platform: Web (orchestrator agent path)
  • Database: SQLite (PostgreSQL has the same SQL, same behavior)
  • Running in worktree? Yes (workflow sub-worktrees)
  • OS: macOS host with Linux container; not OS-specific

Logs

{"level":30,"module":"command-handler","workflow":"ztech-marimo-edit",
 "args":"fortigapminder.marimo.py Remove redundant local tomllib re-imports
         from cells 4, 7, 12 and 13",                       ← user's correct args
 "msg":"cmd.workflow_starting"}

{"level":30,"module":"orchestrator-agent",
 "workflowName":"ztech-marimo-edit",
 "resumableRunId":"92d86ea89fd6808c5f6534b4ef34acbc",       ← prior failed run
 "workingPath":"/.archon/.../worktrees/archon/thread-85a590f9",
 "msg":"orchestrator.foreground_resume_detected"}

{"level":30,"module":"workflow.dag-executor",
 "priorCompletedCount":5,
 "msg":"dag.workflow_resume_prepopulated"}                  ← old state restored

{"level":50,"module":"workflow.dag-executor","exitCode":1,
 "stderrTail":"ERROR: First argument must be a notebook path ending in .py
              [...] INPUT (arg $1)='Edit the notebook at fortigapminder...'",
                                          ↑ THE OLD reformulated user_message,
                                            not the new args
 "msg":"dag_node_failed"}

The fresh /workflow run typed the correct path-prefixed args, but the resumed run reads the old natural-language reformulation from .edit-description artifact persisted by run-A.
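
To check whether other requests were affected, the log pattern above can be searched for mechanically. The sketch below assumes one JSON object per line and uses only the field names visible in the excerpt (note that command-handler logs workflow while orchestrator-agent logs workflowName). The find_hijacked_requests helper and the shortened sample args are illustrative.

```python
import json


def find_hijacked_requests(log_lines):
    """Pair each cmd.workflow_starting with a later foreground resume
    of the same workflow, which is the signature shown in this report."""
    pending = {}  # workflow name -> args the user typed
    hits = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(event, dict):
            continue
        msg = event.get("msg")
        if msg == "cmd.workflow_starting":
            pending[event.get("workflow")] = event.get("args")
        elif msg == "orchestrator.foreground_resume_detected":
            name = event.get("workflowName")
            if name in pending:
                hits.append({
                    "workflow": name,
                    "typed_args": pending.pop(name),
                    "resumed_run": event.get("resumableRunId"),
                })
    return hits


# Sample lines modeled on the excerpt above; args shortened for brevity.
sample = [
    '{"level":30,"module":"command-handler","workflow":"ztech-marimo-edit",'
    '"args":"fortigapminder.marimo.py Remove redundant imports",'
    '"msg":"cmd.workflow_starting"}',
    '{"level":30,"module":"orchestrator-agent","workflowName":"ztech-marimo-edit",'
    '"resumableRunId":"92d86ea89fd6808c5f6534b4ef34acbc",'
    '"msg":"orchestrator.foreground_resume_detected"}',
]
hits = find_hijacked_requests(sample)
print(hits)
```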

Impact

  • Affected workflows/commands: any workflow whose first node materializes user_message into $ARTIFACTS_DIR/.X files (most non-trivial DAG workflows), for example archon-fix-issue, archon-feature-development, and custom user workflows.
  • Reproduction rate: Always — deterministic given the SQL match (failed run + same workflow + same conversation).
  • Workaround available: none before this PR. /workflow abandon rejected failed runs as terminal, and /workflow resume <id> re-ran the same stale state, so the only way out was direct DB manipulation (UPDATE remote_agent_workflow_runs SET status='cancelled' WHERE id=...).
  • Data loss risk: Yes — silent intent loss. The user's request is discarded with no log/UI indication.
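
For SQLite deployments, the direct DB manipulation mentioned above could be wrapped so that it only touches rows that are actually failed. This is a sketch under assumptions: the table name and status values come from this report, while cancel_failed_run and the two-column demo schema are illustrative. Back up the database before trying anything like it.

```python
import sqlite3


def cancel_failed_run(conn, run_id):
    """Mark one failed run as cancelled; returns the number of rows changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE remote_agent_workflow_runs SET status = 'cancelled' "
            "WHERE id = ? AND status = 'failed'",
            (run_id,),
        )
    return cur.rowcount


# Demo on an in-memory database with a reduced, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE remote_agent_workflow_runs "
    "(id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO remote_agent_workflow_runs VALUES ('run-A', 'failed')")

changed = cancel_failed_run(conn, "run-A")
status = conn.execute(
    "SELECT status FROM remote_agent_workflow_runs WHERE id = 'run-A'"
).fetchone()[0]
print(changed, status)
```

The extra status = 'failed' condition makes the update a no-op for runs in any other state, so a mistyped or stale id cannot cancel a live run.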

Scope

  • Package(s): core
  • Module: core:orchestrator (dispatch logic), core:db (findResumableRunByParentConversation), core:operations (abandonWorkflow)

Metadata

Labels

  • P1: High priority - Address soon, next in queue
  • area: cli: CLI commands and interface
  • bug: Something is broken
