
Pi/Minimax SDK errors cascade under concurrent load — needs throttling, better classification, and richer error surface #1569

@Wirasm

Description

Summary

  • What broke: When many Pi/Minimax sessions run concurrently — typical during multi-PR review batches and any workflow that fans out parallel AI nodes — sessions hit dag.node_sdk_error_result and abort. Heavy code-review prompts fail more often than light aspects (test-coverage, comment-quality, error-handling). Two distinct failure modes observed.
  • When it started: Surfaced today (2026-05-04) running 11 maintainer-review-pr workflows in parallel (~100 Pi sessions in flight). Likely existed before — earlier sessions had partial failures we attributed to flake.
  • Severity: major — review batches return mostly empty findings; users either re-run sequentially (slow, expensive) or merge without deep review.
  • Related: #1561 ("Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream") fixed the symptom (zombie workflow runs after these errors). This issue is about the cause — preventing the errors in the first place.

Two failure modes observed

Fast cascade (1–12s)

All aspects on a PR crash near-simultaneously, in single-digit seconds. Looks like rate-limit-and-give-up.

Example — PR #1548 review (today's batch):

nodeId=code-review     errorSubtype=error  durationMs=1869   stopReason=error
nodeId=test-coverage   errorSubtype=error  durationMs=2260   stopReason=error
nodeId=comment-quality errorSubtype=error  durationMs=2261   stopReason=error
nodeId=error-handling  errorSubtype=error  durationMs=2327   stopReason=error
nodeId=docs-impact     errorSubtype=error  durationMs=2327   stopReason=error

5 aspects, all error, all in <2.5s. PR #1529 same pattern.

Slow cascade (long-running session eventually errors)

A Pi session runs long, then errors. Looks like context overflow, hung tool call, or a stalled stream that the SDK eventually surfaces.

Example — repo-triage-minimax's closed-dedup-check:

durationMs=1885830  errorSubtype=error  stopReason=error

31 minutes before erroring. The node used the Claude-only 'agents:' feature, which Pi can't run; Pi accepted the prompt anyway, ran for 31 minutes, then errored.

Evidence — today's 11-PR maintainer-review batch

PR      Aspects ran/scheduled   Pattern
#1523   4/4                     clean (early in batch)
#1525   3/4                     code-review crashed
#1554   1/4                     3 aspects crashed
#1557   3/5                     code-review, docs-impact crashed
#1533   1/4                     3 aspects crashed
#1551   4/5                     test-coverage crashed
#1553   1/5                     4 aspects crashed
#1555   0/1                     sole scheduled aspect crashed
#1529   0/5                     all crashed
#1548   0/5                     all crashed

Pattern: success rate degrades roughly with batch saturation. Code-review (heaviest prompt) crashes most often. Lighter aspects crash when the system is most loaded. The first PRs in the batch fared best; the later ones crashed almost completely.

By contrast, when a single Pi/Minimax workflow ran alone today (maintainer-standup-minimax, repo-triage-minimax, isolated maintainer-review-pr), they completed cleanly. The error pattern is concurrency-driven, not per-prompt-driven.

Steps to Reproduce

  1. Launch ~10+ maintainer-review-pr workflows in parallel against open PRs:
    for pr in 1523 1525 1529 1533 1548 1551 1552 1553 1554 1555 1557; do
      archon workflow run maintainer-review-pr "review PR #$pr" &
    done
  2. Watch logs for dag.node_sdk_error_result — most heavy-prompt nodes will fail in the back half of the batch.
  3. The first ~3–4 reviews complete OK; the rest degrade as concurrent Pi-session count rises.

What we know about the SDK error

Today's logs surface very little upstream context:

{
  "nodeId": "...",
  "errorSubtype": "error",
  "sessionId": "019df...",
  "stopReason": "error",
  "durationMs": ...,
  "msg": "dag.node_sdk_error_result"
}

errorSubtype: "error" is the catch-all bucket. We don't know whether Pi sent a 429, a TLS reset, an upstream Anthropic 529, an internal Pi assertion, or a hung-stream timeout. Without the underlying error message, we can't classify these accurately as transient/fatal, which is why the existing transient-retry path doesn't trigger.

Relevant code: packages/workflows/src/dag-executor.ts:902-916 swallows the underlying SDK error message into a generic "SDK returned error" — only errorSubtype survives. The original msg.errors array (line 909) is logged but not propagated to the user-facing error message.

Suggested fixes (multi-tier)

1. Surface the underlying error (low effort, high diagnostic value)

In dag-executor.ts:914-916:

// Currently:
throw new Error(`Node '${node.id}' failed: SDK returned ${subtype}${errorsDetail}`);
// errorsDetail is built from msg.errors but Pi rarely populates it.

Pi's adapter (packages/providers/src/community/pi/event-bridge.ts) maps Pi events into Archon's MessageChunk stream. When Pi rejects with a status error, a network error, or an SDK assertion, that error string should reach the result chunk's errors[]. Currently it goes into a system chunk that we log and discard. Plumb it through; a rough sketch follows.
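A minimal sketch of the intended plumbing, not the actual event-bridge code: the type and field names below (PiErrorEvent, ResultChunk, pendingErrors) are placeholders, and the only claim is that whatever Pi reports should land in errors[] on the result chunk instead of a discarded system chunk.

// Hypothetical shapes: PiErrorEvent and ResultChunk stand in for the real
// event-bridge types, used only to show where the error text should flow.
interface PiErrorEvent {
  message: string;          // e.g. "429 Too Many Requests", TLS reset, SDK assertion
  code?: string | number;
}

interface ResultChunk {
  type: 'result';
  subtype: string;          // what dag-executor.ts reads as errorSubtype
  errors: string[];         // what errorsDetail is built from
}

// Collect error text as Pi events arrive instead of logging and dropping it.
const pendingErrors: string[] = [];

function onPiError(event: PiErrorEvent): void {
  pendingErrors.push(
    event.code !== undefined ? `${event.code}: ${event.message}` : event.message,
  );
}

// When the session ends in an error, emit the accumulated detail on the result
// chunk so the generic "SDK returned error" finally carries the underlying cause.
function buildErrorResult(): ResultChunk {
  return { type: 'result', subtype: 'error', errors: [...pendingErrors] };
}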

2. Pi-side concurrency throttle (medium effort)

The Pi adapter is the right place for this — a semaphore that caps concurrent session.prompt() calls. Configurable via .archon/config.yaml:

assistants:
  pi:
    maxConcurrent: 4   # default that doesn't saturate Pi/Minimax

Fallback behavior when the limit is saturated: queue, don't error. Workflow nodes that would otherwise spawn 5 simultaneous Pi calls instead serialize through the semaphore.

This is a Pi-specific feature, not a workflow-engine feature — Claude SDK does its own throttling at the SDK layer; Pi doesn't.
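A minimal sketch of the throttle, assuming maxConcurrent is read from .archon/config.yaml; the Semaphore class and withPiSlot() wrapper are hypothetical names, not existing Archon APIs.

// Hypothetical semaphore for the Pi adapter: caps concurrent session.prompt()
// calls and queues the rest instead of erroring.
class Semaphore {
  private readonly waiters: Array<() => void> = [];
  private active = 0;

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // At the limit: wait for a release before taking a slot.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
    this.active++;
  }

  release(): void {
    this.active--;
    this.waiters.shift()?.();
  }
}

// Default of 4 matches the suggested assistants.pi.maxConcurrent value above.
const piSlots = new Semaphore(4);

// Every Pi prompt goes through the semaphore, so a DAG layer that fans out
// 5 aspects keeps at most 4 Pi sessions in flight and queues the rest.
async function withPiSlot<T>(fn: () => Promise<T>): Promise<T> {
  await piSlots.acquire();
  try {
    return await fn();
  } finally {
    piSlots.release();
  }
}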

3. Better transient-error classification (medium effort)

packages/workflows/src/executor-shared.ts (isTransientNodeError) currently checks for keywords like ECONNRESET, 429, etc. on the surfaced error string. Once #1 is in place, common Pi-side transients (rate-limit, upstream-timeout) will become detectable and trigger the existing retry-with-backoff path.

Today, every Pi SDK error gets the generic "SDK returned error" message, which doesn't match any transient pattern → no retry → hard fail.
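For illustration, the kind of matching that becomes possible once the real error string is surfaced; the pattern list is a guess at common Pi/Minimax transients, not the current contents of executor-shared.ts.

// Illustrative transient patterns; the real isTransientNodeError list may differ.
const TRANSIENT_PATTERNS: RegExp[] = [
  /ECONNRESET/i,
  /ETIMEDOUT/i,
  /\b429\b/,            // rate limit
  /\b529\b/,            // upstream overloaded
  /rate.?limit/i,
  /upstream.?timeout/i,
];

function looksTransient(message: string): boolean {
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(message));
}

// With fix 1 in place, an error like "429: rate limit exceeded" matches and
// triggers the existing retry-with-backoff path; the generic "SDK returned
// error" string never will.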

4. Per-PR review-pr workflow throttling (low effort, narrow fix)

The maintainer-review-pr workflow fans out 5 aspects in one DAG layer. For Pi-provider runs, the workflow could be rewritten to run aspects sequentially (or in pairs). This is a YAML-only change with no code impact — but it only fixes this one workflow, not the underlying issue.

Recommendation order

  1. Fix #1 (surface the underlying error) first — even if we did nothing else, surfacing the real error makes everything easier to diagnose. ~30 min of work.
  2. Then #3 (better transient-error classification) — once errors are real, transient-retry starts working. Many of today's failures would have self-recovered.
  3. Then #2 (Pi-side concurrency throttle) — the proper architectural fix, but more work.
  4. Skip #4 (per-workflow throttling) — narrow workaround that masks the real problem.

Adjacent bug (separate issue, mentioning here for cross-reference)

PR #1552's review during the same batch hit a different failure mode: condition_json_parse_failed on the gate's output. The gate produced JSON wrapped in markdown fences (```json\n{...}\n```), and the condition-evaluator's $gate.output.verdict == 'review' check couldn't parse the field because $gate.output came through as a string, not a parsed object. tryParseStructuredOutput (in event-bridge.ts) handles fence-stripping correctly when extracting the structured output from Pi, but the same logic isn't applied when downstream conditions reference the field. This warrants its own issue — not covered by any of the fixes above.
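To make the mechanism concrete, a hedged sketch of the missing step; stripFences and parseGateOutput are hypothetical helpers, not the existing tryParseStructuredOutput code.

// Hypothetical fence-stripping before condition evaluation, so that
// $gate.output.verdict resolves against a parsed object instead of a string.
function stripFences(raw: string): string {
  return raw
    .replace(/^\s*```(?:json)?\s*\r?\n?/, '')
    .replace(/\r?\n?```\s*$/, '');
}

function parseGateOutput(raw: string): Record<string, unknown> | string {
  try {
    return JSON.parse(stripFences(raw)) as Record<string, unknown>;
  } catch {
    return raw; // fall back to the raw string, which is what happens today
  }
}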

Environment

  • Platform: CLI
  • Database: SQLite
  • Provider: pi / minimax/MiniMax-M2.7
  • OS: macOS 25.3.0 (Darwin), Bun runtime

Impact

  • Affected workflows: maintainer-review-pr, repo-triage-minimax, any DAG with parallel Pi nodes — primarily review and triage flows that fan out per-PR / per-issue work.
  • Reproduction rate: Always when running 10+ concurrent Pi-review workflows. Intermittent at lower concurrency.
  • Workaround: Run sequentially (1 at a time) or in batches of 2–3.
  • Data loss risk: No (#1561 ensures the DB stays consistent). Just lost work — review aspects produce no findings, so the user merges blind or re-reviews manually.

Scope

  • Package(s) likely involved: providers (community/pi), workflows
  • Modules: providers:community/pi:event-bridge, providers:community/pi:provider, workflows:dag-executor (lines 902–916), workflows:executor-shared:isTransientNodeError
