
feat(instance-ai): Capture orchestrator plans into a builder eval dataset#29785

Draft
mutdmour wants to merge 1 commit into master from evals-workflows-plans



mutdmour (Contributor) commented May 5, 2026

Summary

Adds eval:capture-plans, a new CLI that drives a live n8n instance with each parent dataset prompt, captures the reconciled PlannedTask[] at the submit-plan boundary via SSE, and writes one row per build-workflow task to a target LangSmith dataset.

The point: the existing pairwise eval feeds vague Notion prompts to the builder, mixing orchestrator quality with builder quality. This new dataset isolates builder quality given a clean handover — the eval input is the same task spec that production hands to the builder via dispatchPlannedTask.

Resulting rows are shape-compatible with the existing pairwise CLI:

  • inputs.prompt = task.spec (fed straight into buildInProcess)
  • inputs.evals.{dos,donts} = inherited from the parent
  • metadata.derivedId = ${parentExampleId}/${slug(workflowName)} for stable, idempotent re-syncs
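The row shape above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `PlannedTask`, `ParentExample`, and `BuilderRow` field names, the string types for `dos`/`donts`, and the `slug` rule (lowercase, runs of non-alphanumerics collapsed to hyphens) are all assumptions.

```typescript
// Hypothetical shapes for illustration; the real types live in the PR's source.
interface PlannedTask {
  kind: string; // e.g. "build-workflow"
  workflowName: string;
  spec: string;
}

interface ParentExample {
  id: string;
  evals: { dos: string; donts: string };
}

interface BuilderRow {
  inputs: { prompt: string; evals: { dos: string; donts: string } };
  metadata: { derivedId: string; parentExampleId: string };
}

// Assumed slug rule: lowercase, collapse non-alphanumeric runs into one hyphen,
// then trim leading and trailing hyphens.
function slug(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// One row per build-workflow task; other task kinds are dropped.
function toBuilderRows(parent: ParentExample, tasks: PlannedTask[]): BuilderRow[] {
  return tasks
    .filter((task) => task.kind === "build-workflow")
    .map((task) => ({
      inputs: { prompt: task.spec, evals: parent.evals },
      metadata: {
        derivedId: `${parent.id}/${slug(task.workflowName)}`,
        parentExampleId: parent.id,
      },
    }));
}
```

Because `derivedId` depends only on the parent example id and the workflow name, re-capturing the same plan yields the same key even if the spec text changes.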

Resulting dataset on LangSmith: instance-ai-builder-from-plans (91 rows, captured from 75 of 77 parent examples — 2 skipped where the planner answered conversationally without submitting a plan).

How to test

# 1. Start n8n locally with instance-ai (or use any running dev instance)
# 2. From the repo root:
pnpm --filter @n8n/instance-ai eval:capture-plans \
  --base-url http://localhost:5678 \
  --target-dataset instance-ai-builder-from-plans-test \
  --max-examples 2 --verbose

# 3. Run the existing pairwise eval against the new dataset:
pnpm --filter @n8n/instance-ai eval:pairwise \
  --dataset instance-ai-builder-from-plans \
  --max-examples 5

Re-running the capture is idempotent — rows match by derivedId and update only when the spec content drifts.
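A sketch of what matching by derivedId could look like, assuming each row is reduced to its `derivedId` and prompt text. The `DerivedRow` and `SyncPlan` names and the three-bucket result are hypothetical; the actual diff in builder-from-plans-sync.ts may compare more fields.

```typescript
// Hypothetical, simplified row: only the key and the spec text are compared.
interface DerivedRow {
  derivedId: string;
  prompt: string;
}

interface SyncPlan {
  toCreate: DerivedRow[];
  toUpdate: DerivedRow[];
  unchanged: DerivedRow[];
}

// Classify freshly captured rows against what the target dataset already holds.
function planSync(existing: DerivedRow[], captured: DerivedRow[]): SyncPlan {
  const existingById = new Map(existing.map((row) => [row.derivedId, row] as const));
  const plan: SyncPlan = { toCreate: [], toUpdate: [], unchanged: [] };

  for (const row of captured) {
    const previous = existingById.get(row.derivedId);
    if (previous === undefined) {
      plan.toCreate.push(row); // new derivedId: create
    } else if (previous.prompt !== row.prompt) {
      plan.toUpdate.push(row); // same derivedId, drifted spec: update
    } else {
      plan.unchanged.push(row); // identical: no write needed
    }
  }
  return plan;
}
```

Running the capture twice with identical output would leave every row in `unchanged`, which is the idempotency property described above.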

Files

  • evaluations/harness/plan-capture.ts — pure capture helper. Drives SSE, reconstructs PlanningBlueprint from add-plan-item events, reads reconciled planItems from the latest tasks-update, declines plan-review, cancels the run.
  • evaluations/langsmith/builder-from-plans-sync.ts — sync orchestrator mirroring dataset-sync.ts. Captures concurrently, flattens to one row per build-workflow task, diffs against existing rows.
  • evaluations/cli/capture-plans.ts — CLI entry. Defaults: --parent-dataset notion-pairwise-workflows, --target-dataset instance-ai-builder-from-plans, concurrency 4, timeout 180s.
  • evaluations/__tests__/plan-capture.test.ts — 5 unit tests covering happy path, ask-user auto-resolve, timeout, no-plan-emitted, malformed SSE.
  • package.json: adds the eval:capture-plans script entry.
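For reviewers unfamiliar with the capture flow, the event handling described for plan-capture.ts can be pictured as a small reducer over SSE `data:` lines. The event payload shapes below are assumptions made for illustration (only the event names add-plan-item and tasks-update come from this PR), and skipping malformed lines is an assumed policy, not necessarily what the helper does.

```typescript
// Hypothetical event shapes; the real SSE payloads may differ.
type CaptureEvent =
  | { type: "add-plan-item"; item: { id: string; title: string } }
  | { type: "tasks-update"; planItems: { id: string; spec: string }[] };

interface CaptureState {
  blueprintItems: { id: string; title: string }[];
  planItems: { id: string; spec: string }[];
}

// Parse one SSE line; returns undefined for non-data lines and malformed JSON.
function parseSseData(line: string): CaptureEvent | undefined {
  if (!line.startsWith("data:")) return undefined;
  try {
    return JSON.parse(line.slice("data:".length).trim()) as CaptureEvent;
  } catch {
    return undefined; // assumed policy: skip malformed payloads
  }
}

// Accumulate blueprint items and keep only the most recent reconciled planItems.
function reduceCapture(lines: string[]): CaptureState {
  const state: CaptureState = { blueprintItems: [], planItems: [] };
  for (const line of lines) {
    const event = parseSseData(line);
    if (!event) continue;
    if (event.type === "add-plan-item") {
      state.blueprintItems.push(event.item);
    } else if (event.type === "tasks-update") {
      state.planItems = event.planItems; // latest update wins
    }
  }
  return state;
}
```

Keeping the reducer free of network code is one way a "pure capture helper" could be unit tested against recorded event streams.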

Related Linear tickets, Github issues, and Community forum posts

n/a (internal eval tooling)

Review / Merge checklist

  • I have seen this code, I have run this code, and I take responsibility for this code.
  • PR title and summary are descriptive.
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with backport tag (not a hotfix).

🤖 PR Summary generated by AI

…aset

New `eval:capture-plans` CLI drives a live n8n instance with each
parent dataset prompt, captures the reconciled `PlannedTask[]` at the
`submit-plan` boundary via SSE, and writes one row per `build-workflow`
task to a target LangSmith dataset.

Resulting rows are shape-compatible with the existing pairwise CLI:
`inputs.prompt` carries the task spec, `inputs.evals.{dos,donts}` are
inherited from the parent. Metadata records the originating blueprint,
parent provenance, and a stable `derivedId` so re-runs match existing
rows by `${parentExampleId}/${slug(workflowName)}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot (Contributor) commented May 5, 2026

! PR exceeds size limit (1,125 lines added)

This PR adds 1,125 lines, exceeding the 1,000-line limit (test files excluded).

Large PRs are harder to review and increase the risk of bugs going unnoticed. Please consider:

  • Breaking this into smaller, logically separate PRs
  • Moving unrelated changes to a follow-up PR

If the size is genuinely justified (e.g. generated code, large migrations, test fixtures), a maintainer can override by commenting /size-limit-override and then pushing a new commit or re-running this check.

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


n8n-assistant Bot added the "n8n team" label (Authored by the n8n team) May 5, 2026

github-actions Bot (Contributor) commented May 5, 2026

Instance AI Workflow Eval Results

14/14 built | 3 run(s) | pass@3: 77% | pass^3: 50% | iterations: 57% / 57% / 69%

Workflow | Build | pass@3 | pass^3
Every hour, fetch all records from an Airtable table. Use the HTTP Req | 3/3 | 100% | 100%
Create a workflow that handles contact form submissions via a webhook. | 3/3 | 60% | 60%
Get all the Linear issues created in the last 2 weeks. Filter them for | 2/3 | 80% | 8%
Every day, get the posts made in the past day on 3 different Slack cha | 3/3 | 80% | 60%
Create a form that collects: name, email, company, and interest level | 3/3 | 100% | 100%
Every day, fetch all open GitHub issues from repository 'acme-corp/bac | 3/3 | 100% | 29%
Every two weeks I want to check the amount of n8n usage and bug report | 3/3 | 0% | 0%
Create a workflow that receives webhook notifications with a JSON body | 3/3 | 100% | 100%
Fetch the latest posts from the JSONPlaceholder API (GET https://jsonp | 3/3 | 66% | 43%
Every day, fetch one post from the JSONPlaceholder API (GET https://js | 3/3 | 100% | 100%
Build a Telegram chatbot workflow for a family assistant. It should re | 3/3 | 100% | 3%
Every day at 8am, check the weather in Berlin using the OpenMeteo API | 3/3 | 100% | 64%
Every hour, check the current weather for London, New York, and Tokyo | 3/3 | 50% | 1%
I want you to build a workflow that will read n8n workflow databases a | 3/3 | 0% | 0%

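
The report does not define pass@3 and pass^3. One reading that reproduces the figures for workflows whose scenarios are listed under Failure details is: pass@3 is the share of scenarios with at least one passing run, and pass^3 is the mean over scenarios of (passed / runs) cubed, both rounded down to a whole percent. This is an inference from the numbers, not a confirmed definition, and the names below are hypothetical.

```typescript
// Hypothetical per-scenario tally, e.g. "1/3 passed" -> { passed: 1, runs: 3 }.
interface ScenarioResult {
  passed: number;
  runs: number;
}

// Assumed pass@k: percentage of scenarios with at least one passing run.
function passAtK(scenarios: ScenarioResult[]): number {
  const withAnyPass = scenarios.filter((s) => s.passed > 0).length;
  return Math.floor((withAnyPass * 100) / scenarios.length);
}

// Assumed pass^k: mean over scenarios of (pass rate)^k, i.e. the estimated
// chance that k independent runs of a scenario all pass.
function passHatK(scenarios: ScenarioResult[], k: number): number {
  const total = scenarios.reduce((sum, s) => sum + Math.pow(s.passed / s.runs, k), 0);
  return Math.floor((total / scenarios.length) * 100);
}

// Scenario tallies for the Linear workflow as listed under Failure details.
const linear: ScenarioResult[] = [
  { passed: 1, runs: 3 }, // happy-path
  { passed: 1, runs: 3 }, // multi-team-creator
  { passed: 2, runs: 3 }, // no-cross-team-issues
  { passed: 1, runs: 3 }, // unknown-creator
  { passed: 0, runs: 3 }, // api-error
];

console.log(passAtK(linear), passHatK(linear, 3)); // 80 8
```

Under this reading, a workflow can score 100% on pass@3 while scoring very low on pass^3 whenever its scenarios pass only intermittently, as in the Telegram chatbot row.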
Failure details

Create a workflow that handles contact form submissions via a webhook. / partial-action-failure — 0/3 passed

Run [builder_issue]: The Telegram node errored with 'Bad request - please check your parameters' and this caused the entire workflow to fail. The 'Append to Google Sheets' node did NOT execute at all. While the Auto-Reply
Run [builder_issue]: The workflow crashed when the Telegram node failed with 'Bad request - please check your parameters'. The Telegram node's chatId is set to a literal placeholder value '<__PLACEHOLDER_VALUE__Telegram t
Run [builder_issue]: The workflow fails the graceful handling requirement. When the Telegram node returns a 'Bad Request: chat not found' error, the workflow crashes entirely with 'Workflow error: Bad request - please che

Create a workflow that handles contact form submissions via a webhook. / invalid-email — 0/3 passed

Run [builder_issue]: The workflow crashed at the Auto-Reply Email node with 'Invalid email address (item 0)' when it encountered 'not-an-email'. Because the workflow error propagated at this node, the Telegram Team Notifi
Run [builder_issue]: The workflow crashed at the 'Auto-Reply Email' node with the error 'Invalid email address (item 0)' when it tried to send to 'not-an-email'. Because all three downstream nodes (Auto-Reply Email, Teleg
Run [builder_issue]: The workflow crashed at the 'Send Auto-Reply Email' node with 'Invalid email address (item 0)' because the Gmail node performs client-side validation of the recipient email address before making any H

Get all the Linear issues created in the last 2 weeks. Filter them for / happy-path — 1/3 passed

Run [builder_issue]: The workflow fails partway through. The 'Enrich & Filter Cross-Team' code node executed but produced zero output items. Tracing the root cause: the code node requires issue.creator?.email to identif
Run [build_failure]: Build failed: Run timed out after 600000ms

Get all the Linear issues created in the last 2 weeks. Filter them for / multi-team-creator — 1/3 passed

Run [builder_issue]: The workflow stopped at the 'Enrich & Filter Cross-Team' node, which produced no output. Examining the code, it filters issues using issue.creator?.email as the key into the membership map. However,
Run [build_failure]: Build failed: Run timed out after 600000ms

Get all the Linear issues created in the last 2 weeks. Filter them for / no-cross-team-issues — 2/3 passed

Run [build_failure]: Build failed: Run timed out after 600000ms

Get all the Linear issues created in the last 2 weeks. Filter them for / unknown-creator — 1/3 passed

Run [builder_issue]: The workflow did not crash (no runtime errors), so the 'handles unknown creator without crashing' part is technically satisfied. However, the checklist also requires that Alice's cross-team issues are
Run [build_failure]: Build failed: Run timed out after 600000ms

Get all the Linear issues created in the last 2 weeks. Filter them for / api-error — 0/3 passed

Run [builder_issue]: The workflow crashed with 'Cannot read properties of undefined (reading 'errors')' when the Linear node received an authentication error response. The Linear node failed to handle the error response g
Run [build_failure]: Build failed: Run timed out after 600000ms
Run [builder_issue]: The workflow crashed with 'Authorization failed - please check your credentials' when the Linear API returned an authentication error. There is no error handling branch — no Try/Catch node, no error w

Every day, get the posts made in the past day on 3 different Slack cha / channel-not-found — 0/3 passed

Run [builder_issue]: The workflow does not handle the channel_not_found error gracefully. When Fetch #product returns {"ok": false, "error": "channel_not_found"}, the Slack node throws an error ('Slack error response: cha
Run [builder_issue]: The workflow has no error handling for the Fetch #product node. When Fetch #product received a 404 response (mock returned {"ok":false,"error":"channel_not_found"}), the node threw 'Request failed w
Run [builder_issue]: The workflow has no error handling for the channel-not-found scenario. When Fetch #product received a mock response with {"ok": false, "error": "channel_not_found"}, the HTTP request node threw an err

Every day, get the posts made in the past day on 3 different Slack cha / insufficient-permissions — 1/3 passed

Run [builder_issue]: The workflow crashed when 'Fetch #product' received a 403 error (the mock returned a 'not_in_channel' error response). The workflow has no error handling on the Fetch #product node (no 'Continue on Er
Run [builder_issue]: The workflow crashed with a 403 error when Fetch #product received the 'not_in_channel' error response (HTTP 403). The workflow has no error handling for this scenario — there is no try/catch, no erro

Every day, fetch all open GitHub issues from repository 'acme-corp/bac / happy-path — 2/3 passed

Run [builder_issue]: The workflow failed to execute. The 'Split Issues' node has an empty 'fieldToSplitOut' parameter, which is a required field. This caused the execution to fail with 'The workflow has issues and cannot

Every day, fetch all open GitHub issues from repository 'acme-corp/bac / no-bugs — 2/3 passed

Run [builder_issue]: The workflow failed to execute cleanly. The 'Split Issues' node has a missing required parameter ('Fields To Split Out' is empty string), causing the execution to fail with 'The workflow has issues an

Every two weeks I want to check the amount of n8n usage and bug report / happy-path — 0/3 passed

Run [builder_issue]: The workflow failed to execute. The BigQuery: Usage Stats node has a misconfigured projectId set to a placeholder value '<PLACEHOLDER_VALUE__BigQuery Project ID>' which is not a valid BigQuery Pro
Run [builder_issue]: The workflow failed to execute. The 'Query BigQuery Usage' node has a misconfigured projectId set to the placeholder value '<PLACEHOLDER_VALUE__BigQuery Project ID>', which is not a valid BigQuery
Run [builder_issue]: The workflow failed to execute. The BigQuery: Usage Stats node has a misconfigured projectId set to a placeholder value '<PLACEHOLDER_VALUE__BigQuery Project ID>', which is not a valid BigQuery Pr

Fetch the latest posts from the JSONPlaceholder API (GET https://jsonp / happy-path — 0/3 passed

Run [-]: The workflow executed without errors and the Slack message was posted to #api-digest with a count of 6 remaining posts. However, the Filter Out 'qui' Titles node failed to correctly filter out items w
Run []: The workflow executed without errors and the filter node correctly removed posts with 'qui' in the title (ids 2 and 5). However, the Filter node output includes all 7 items — 5 passing items AND the 2
Run [builder_issue]: The workflow executed without errors and the Slack message was posted to #api-digest. However, the filter node failed to correctly exclude all titles containing 'qui'. The filter node's output include

Fetch the latest posts from the JSONPlaceholder API (GET https://jsonp / all-filtered — 2/3 passed

Run [builder_issue]: The Filter node ('Exclude "qui" Titles') was configured with caseSensitive: false and a notContains 'qui' condition, but it passed all 3 items through instead of filtering them out. All 3 titles conta

Build a Telegram chatbot workflow for a family assistant. It should re / distinct-telegram-chat — 1/3 passed

Run [verification_gap]: The workflow contains all the required structural elements: Telegram Trigger, Family Assistant Agent, OpenAI GPT-4o (chat model), Conversation Memory (memory node), and Send Reply. The connections sho
Run [verification_gap]: The workflow structure is correct — Telegram Trigger → Family Assistant (AI Agent) → Send Reply, with OpenAI Chat Model and Conversation Memory connected to the agent. The Send Reply node correctly us

Every day at 8am, check the weather in Berlin using the OpenMeteo API / happy-path — 2/3 passed

Run [mock_issue]: The workflow failed with an error. The 'Fetch Berlin Weather' node received a mock generation error response (_evalMockError: true) instead of valid weather data. The subsequent 'Analyse Precipitati

Every hour, check the current weather for London, New York, and Tokyo / happy-path — 0/3 passed

Run [builder_issue]: The workflow failed to execute. The pre-analysis flags a builder issue: 'Log to Airtable' has placeholder values for both the Airtable Base ID and Table ID (values are '<__PLACEHOLDER_VALUE__Airtable
Run [mock_issue]: Multiple issues prevent full checklist satisfaction:

  1. New York weather API mock failed: The 'Fetch New York Weather' node received {_evalMockError: true, message: 'Mock generation failed...'}

Run [builder_issue]: The workflow failed at the 'Aggregate Hot Cities' code node with 'Invalid or unexpected token'. Examining the jsCode parameter, the template literal uses backtick characters (`) which appear to have b

Every hour, check the current weather for London, New York, and Tokyo / no-alerts — 1/3 passed

Run [builder_issue]: The workflow failed to execute due to a builder configuration issue. The 'Log to Airtable' node has placeholder values for both the Airtable Base ID and Table ID (literally '<__PLACEHOLDER_VALUE__Airt
Run [mock_issue]: The workflow errored at the 'Aggregate Hot Cities' node with 'Invalid or unexpected token'. The Fetch New York mock generation failed (returned _evalMockError), which propagated through Merge Weather

I want you to build a workflow that will read n8n workflow databases a / happy-path — 0/3 passed

Run [builder_issue]: The workflow execution stopped at the 'Query Existing Row' node — it produced no output, which caused the downstream 'Row Exists?' IF node and all subsequent nodes ('Update Row', 'Insert Row') to not
Run [builder_issue]: The workflow executed without errors, but the data table population is incomplete. The workflow processes workflows one at a time using a splitInBatches loop (For Each Workflow), checks for an existin
Run [builder_issue]: The workflow executes without errors and contains a duplicate-prevention mechanism (Check Existing Row → Row Exists? IF node → Update vs Insert branches). However, only 1 row was inserted instead of 4
