
Pi/Minimax SDK errors cascade under concurrent load — needs throttling, better classification, and richer error surface #1569

@Wirasm

Description

Summary

  • What broke: When many Pi/Minimax sessions run concurrently — typical during multi-PR review batches and any workflow that fans out parallel AI nodes — sessions hit dag.node_sdk_error_result and abort. Heavy code-review prompts fail more often than light aspects (test-coverage, comment-quality, error-handling). Two distinct failure modes observed.
  • When it started: Surfaced today (2026-05-04) running 11 maintainer-review-pr workflows in parallel (~100 Pi sessions in flight). Likely existed before — earlier sessions had partial failures we attributed to flake.
  • Severity: major — review batches return mostly empty findings; users either re-run sequentially (slow, expensive) or merge without deep review.
  • Related: #1561 ("Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream") fixed the symptom (zombie workflow runs after these errors). This issue is about the cause — preventing the errors in the first place.

Two failure modes observed

Fast cascade (1–12s)

All aspects on a PR crash near-simultaneously, in single-digit seconds. Looks like rate-limit-and-give-up.

Example — PR #1548 review (today's batch):

nodeId=code-review     errorSubtype=error  durationMs=1869   stopReason=error
nodeId=test-coverage   errorSubtype=error  durationMs=2260   stopReason=error
nodeId=comment-quality errorSubtype=error  durationMs=2261   stopReason=error
nodeId=error-handling  errorSubtype=error  durationMs=2327   stopReason=error
nodeId=docs-impact     errorSubtype=error  durationMs=2327   stopReason=error

5 aspects, all error, all in <2.5s. PR #1529 same pattern.

Slow cascade (long-running session eventually errors)

A Pi session runs long, then errors. Looks like context overflow, hung tool call, or a stalled stream that the SDK eventually surfaces.

Example — repo-triage-minimax's closed-dedup-check:

durationMs=1885830  errorSubtype=error  stopReason=error

31 minutes before erroring. The node used the Claude-only 'agents:' feature, which Pi can't run; Pi accepted the prompt anyway, ran for 31 minutes, then errored.

Evidence — today's 11-PR maintainer-review batch

PR      Aspects ran/scheduled   Pattern
#1523   4/4                     clean (early in batch)
#1525   3/4                     code-review crashed
#1554   1/4                     3 aspects crashed
#1557   3/5                     code-review, docs-impact crashed
#1533   1/4                     3 aspects crashed
#1551   4/5                     test-coverage crashed
#1553   1/5                     4 aspects crashed
#1555   0/1                     sole scheduled aspect crashed
#1529   0/5                     all crashed
#1548   0/5                     all crashed

Pattern: success rate degrades roughly with batch saturation. Code-review (heaviest prompt) crashes most often. Lighter aspects crash when the system is most loaded. The first PRs in the batch fared best; the later ones crashed almost completely.

By contrast, when a single Pi/Minimax workflow ran alone today (maintainer-standup-minimax, repo-triage-minimax, isolated maintainer-review-pr), they completed cleanly. The error pattern is concurrency-driven, not per-prompt-driven.

Steps to Reproduce

  1. Launch ~10+ maintainer-review-pr workflows in parallel against open PRs:
    for pr in 1523 1525 1529 1533 1548 1551 1552 1553 1554 1555 1557; do
      archon workflow run maintainer-review-pr "review PR #$pr" &
    done
  2. Watch logs for dag.node_sdk_error_result — most heavy-prompt nodes will fail in the back half of the batch.
  3. The first ~3–4 reviews complete OK; the rest degrade as concurrent Pi-session count rises.

What we know about the SDK error

Today's logs surface very little upstream context:

{
  "nodeId": "...",
  "errorSubtype": "error",
  "sessionId": "019df...",
  "stopReason": "error",
  "durationMs": ...,
  "msg": "dag.node_sdk_error_result"
}

errorSubtype: "error" is the catch-all bucket. We don't know whether Pi sent a 429, a TLS reset, an upstream Anthropic 529, an internal Pi assertion, or a hung-stream timeout. Without the underlying error message, we can't classify these accurately as transient/fatal, which is why the existing transient-retry path doesn't trigger.

Relevant code: packages/workflows/src/dag-executor.ts:902-916 swallows the underlying SDK error message into a generic "SDK returned error" — only errorSubtype survives. The original msg.errors array (line 909) is logged but not propagated to the user-facing error message.

Suggested fixes (multi-tier)

1. Surface the underlying error (low effort, high diagnostic value)

In dag-executor.ts:914-916:

// Currently:
throw new Error(`Node '${node.id}' failed: SDK returned ${subtype}${errorsDetail}`);
// errorsDetail is built from msg.errors but Pi rarely populates it.

Pi's adapter (packages/providers/src/community/pi/event-bridge.ts) maps Pi events into Archon's MessageChunk stream. When Pi rejects with a status error, a network error, or an SDK assertion, that error string should reach the result chunk's errors[]. Currently it goes into a system chunk that we log and discard. Plumb it through; a rough sketch follows.
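A minimal sketch of the intended plumbing, not the actual event-bridge code: the type and field names below (PiErrorEvent, ResultChunk, pendingErrors) are placeholders, and the only claim is that whatever Pi reports should land in errors[] on the result chunk instead of a discarded system chunk.

// Hypothetical shapes: PiErrorEvent and ResultChunk stand in for the real
// event-bridge types, used only to show where the error text should flow.
interface PiErrorEvent {
  message: string;          // e.g. "429 Too Many Requests", TLS reset, SDK assertion
  code?: string | number;
}

interface ResultChunk {
  type: 'result';
  subtype: string;          // what dag-executor.ts reads as errorSubtype
  errors: string[];         // what errorsDetail is built from
}

// Collect error text as Pi events arrive instead of logging and dropping it.
const pendingErrors: string[] = [];

function onPiError(event: PiErrorEvent): void {
  pendingErrors.push(
    event.code !== undefined ? `${event.code}: ${event.message}` : event.message,
  );
}

// When the session ends in an error, emit the accumulated detail on the result
// chunk so the generic "SDK returned error" finally carries the underlying cause.
function buildErrorResult(): ResultChunk {
  return { type: 'result', subtype: 'error', errors: [...pendingErrors] };
}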

2. Pi-side concurrency throttle (medium effort)

The Pi adapter is the right place for this — a semaphore that caps concurrent session.prompt() calls. Configurable via .archon/config.yaml:

assistants:
  pi:
    maxConcurrent: 4   # default that doesn't saturate Pi/Minimax

Fallback behavior when the limit is saturated: queue, don't error. Workflow nodes that would otherwise spawn 5 simultaneous Pi calls instead serialize through the semaphore.

This is a Pi-specific feature, not a workflow-engine feature — Claude SDK does its own throttling at the SDK layer; Pi doesn't.
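A minimal sketch of the throttle, assuming maxConcurrent is read from .archon/config.yaml; the Semaphore class and withPiSlot() wrapper are hypothetical names, not existing Archon APIs.

// Hypothetical semaphore for the Pi adapter: caps concurrent session.prompt()
// calls and queues the rest instead of erroring.
class Semaphore {
  private readonly waiters: Array<() => void> = [];
  private active = 0;

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // At the limit: wait for a release before taking a slot.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
    this.active++;
  }

  release(): void {
    this.active--;
    this.waiters.shift()?.();
  }
}

// Default of 4 matches the suggested assistants.pi.maxConcurrent value above.
const piSlots = new Semaphore(4);

// Every Pi prompt goes through the semaphore, so a DAG layer that fans out
// 5 aspects keeps at most 4 Pi sessions in flight and queues the rest.
async function withPiSlot<T>(fn: () => Promise<T>): Promise<T> {
  await piSlots.acquire();
  try {
    return await fn();
  } finally {
    piSlots.release();
  }
}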

3. Better transient-error classification (medium effort)

packages/workflows/src/executor-shared.ts (isTransientNodeError) currently checks for keywords like ECONNRESET, 429, etc. on the surfaced error string. Once #1 is in place, common Pi-side transients (rate-limit, upstream-timeout) will become detectable and trigger the existing retry-with-backoff path.

Today, every Pi SDK error gets the generic "SDK returned error" message, which doesn't match any transient pattern → no retry → hard fail.
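For illustration, the kind of matching that becomes possible once the real error string is surfaced; the pattern list is a guess at common Pi/Minimax transients, not the current contents of executor-shared.ts.

// Illustrative transient patterns; the real isTransientNodeError list may differ.
const TRANSIENT_PATTERNS: RegExp[] = [
  /ECONNRESET/i,
  /ETIMEDOUT/i,
  /\b429\b/,            // rate limit
  /\b529\b/,            // upstream overloaded
  /rate.?limit/i,
  /upstream.?timeout/i,
];

function looksTransient(message: string): boolean {
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(message));
}

// With fix 1 in place, an error like "429: rate limit exceeded" matches and
// triggers the existing retry-with-backoff path; the generic "SDK returned
// error" string never will.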

4. Per-PR review-pr workflow throttling (low effort, narrow fix)

The maintainer-review-pr workflow fans out 5 aspects in one DAG layer. For Pi-provider runs, the workflow could be rewritten to run aspects sequentially (or in pairs). This is a YAML-only change with no code impact — but it only fixes this one workflow, not the underlying issue.

Recommendation order

  1. Fix #1 (surface the underlying error) first — even if we did nothing else, surfacing the real error makes everything easier to diagnose. ~30 min of work.
  2. Then #3 (better transient-error classification) — once errors are real, transient-retry starts working. Many of today's failures would have self-recovered.
  3. Then #2 (Pi-side concurrency throttle) — the proper architectural fix, but more work.
  4. Skip #4 (per-workflow throttling) — narrow workaround that masks the real problem.

Adjacent bug (separate issue, mentioning here for cross-reference)

PR #1552's review during the same batch hit a different failure mode: condition_json_parse_failed on the gate's output. The gate produced JSON wrapped in markdown fences (```json\n{...}\n```), and the condition-evaluator's $gate.output.verdict == 'review' check couldn't parse the field because $gate.output came through as a string, not a parsed object. tryParseStructuredOutput (in event-bridge.ts) handles fence-stripping correctly when extracting the structured output from Pi, but the same logic isn't applied when downstream conditions reference the field. This warrants its own issue — not covered by any of the fixes above.
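To make the mechanism concrete, a hedged sketch of the missing step; stripFences and parseGateOutput are hypothetical helpers, not the existing tryParseStructuredOutput code.

// Hypothetical fence-stripping before condition evaluation, so that
// $gate.output.verdict resolves against a parsed object instead of a string.
function stripFences(raw: string): string {
  return raw
    .replace(/^\s*```(?:json)?\s*\r?\n?/, '')
    .replace(/\r?\n?```\s*$/, '');
}

function parseGateOutput(raw: string): Record<string, unknown> | string {
  try {
    return JSON.parse(stripFences(raw)) as Record<string, unknown>;
  } catch {
    return raw; // fall back to the raw string, which is what happens today
  }
}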

Environment

  • Platform: CLI
  • Database: SQLite
  • Provider: pi / minimax/MiniMax-M2.7
  • OS: macOS 25.3.0 (Darwin), Bun runtime

Impact

  • Affected workflows: maintainer-review-pr, repo-triage-minimax, any DAG with parallel Pi nodes — primarily review and triage flows that fan out per-PR / per-issue work.
  • Reproduction rate: Always when running 10+ concurrent Pi-review workflows. Intermittent at lower concurrency.
  • Workaround: Run sequentially (1 at a time) or in batches of 2–3.
  • Data loss risk: No (#1561 ensures the DB stays consistent). Just lost work — review aspects produce no findings, so the user merges blind or re-reviews manually.

Scope

  • Package(s) likely involved: providers (community/pi), workflows
  • Modules: providers:community/pi:event-bridge, providers:community/pi:provider, workflows:dag-executor (lines 902–916), workflows:executor-shared:isTransientNodeError
