feat(instance-ai): Capture orchestrator plans into a builder eval dataset #29785
…aset
New `eval:capture-plans` CLI drives a live n8n instance with each
parent dataset prompt, captures the reconciled `PlannedTask[]` at the
`submit-plan` boundary via SSE, and writes one row per `build-workflow`
task to a target LangSmith dataset.
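As a rough illustration of the capture step, the sketch below folds an already-parsed event stream into a captured plan. The event names `add-plan-item`, `tasks-update`, and `plan-review` appear in this PR's description of `plan-capture.ts`; the payload shapes, the `CaptureEvent` and `CapturedPlan` types, the `reducePlanEvents` helper, and the choice to model `plan-review` as a stream event are assumptions made for illustration, not the actual implementation.

```typescript
// Hypothetical shapes: only the event names are taken from the PR text.
interface PlannedTask {
  kind: string;
  workflowName: string;
  spec: string;
}

type CaptureEvent =
  | { type: 'add-plan-item'; item: PlannedTask }
  | { type: 'tasks-update'; planItems: PlannedTask[] }
  | { type: 'plan-review' };

interface CapturedPlan {
  blueprint: PlannedTask[]; // items in the order the planner added them
  planItems: PlannedTask[]; // reconciled tasks from the latest tasks-update
  reachedReview: boolean; // true once the review boundary was seen
}

function reducePlanEvents(events: CaptureEvent[]): CapturedPlan {
  const captured: CapturedPlan = { blueprint: [], planItems: [], reachedReview: false };
  for (const event of events) {
    // Assumption: nothing after the review boundary belongs to the plan.
    if (captured.reachedReview) break;
    switch (event.type) {
      case 'add-plan-item':
        captured.blueprint.push(event.item);
        break;
      case 'tasks-update':
        // Later updates replace earlier ones, so the last one wins.
        captured.planItems = event.planItems;
        break;
      case 'plan-review':
        captured.reachedReview = true;
        break;
    }
  }
  return captured;
}
```

A real capture helper would also have to handle timeouts, malformed frames, and runs that never emit a plan, which this sketch leaves out.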
Resulting rows are shape-compatible with the existing pairwise CLI:
`inputs.prompt` carries the task spec, `inputs.evals.{dos,donts}` are
inherited from the parent. Metadata records the originating blueprint,
parent provenance, and a stable `derivedId` so re-runs match existing
rows by `${parentExampleId}/${slug(workflowName)}`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
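The row shape described in this commit message could be sketched roughly as follows. The field paths `inputs.prompt`, `inputs.evals.{dos,donts}`, and `metadata.derivedId` come from the text above; the `slug` implementation, the `ParentExample` and `BuilderRow` types, the string type for `dos`/`donts`, and the `toBuilderRows` helper are hypothetical, and the blueprint and provenance metadata are left out for brevity.

```typescript
// Hypothetical types: only the field paths and the derivedId format are from the PR.
interface PlannedTask {
  kind: string;
  workflowName: string;
  spec: string;
}

interface ParentExample {
  id: string;
  evals: { dos: string; donts: string };
}

interface BuilderRow {
  inputs: { prompt: string; evals: { dos: string; donts: string } };
  metadata: { derivedId: string; parentExampleId: string };
}

// Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
function slug(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// One row per build-workflow task; other task kinds are dropped.
function toBuilderRows(parent: ParentExample, tasks: PlannedTask[]): BuilderRow[] {
  return tasks
    .filter((task) => task.kind === 'build-workflow')
    .map((task) => ({
      inputs: { prompt: task.spec, evals: parent.evals },
      metadata: {
        derivedId: `${parent.id}/${slug(task.workflowName)}`,
        parentExampleId: parent.id,
      },
    }));
}
```

Because the `derivedId` depends only on the parent example ID and the workflow name, capturing the same plan twice yields the same key, which is what lets re-runs match existing rows.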
! PR exceeds size limit (1,125 lines added)
This PR adds 1,125 lines, exceeding the 1,000-line limit (test files excluded). Large PRs are harder to review and increase the risk of bugs going unnoticed. Please consider:
If the size is genuinely justified (e.g. generated code, large migrations, test fixtures), a maintainer can override by commenting |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Instance AI Workflow Eval Results
14/14 built | 3 run(s) | pass@3: 77% | pass^3: 50% | iterations: 57% / 57% / 69%
Failure details
Create a workflow that handles contact form submissions via a webhook. / partial-action-failure — 0/3 passed
Create a workflow that handles contact form submissions via a webhook. / invalid-email — 0/3 passed
Get all the Linear issues created in the last 2 weeks. Filter them for / happy-path — 1/3 passed
Get all the Linear issues created in the last 2 weeks. Filter them for / multi-team-creator — 1/3 passed
Get all the Linear issues created in the last 2 weeks. Filter them for / no-cross-team-issues — 2/3 passed
Get all the Linear issues created in the last 2 weeks. Filter them for / unknown-creator — 1/3 passed
Get all the Linear issues created in the last 2 weeks. Filter them for / api-error — 0/3 passed
Every day, get the posts made in the past day on 3 different Slack cha / channel-not-found — 0/3 passed
Every day, get the posts made in the past day on 3 different Slack cha / insufficient-permissions — 1/3 passed
Every day, fetch all open GitHub issues from repository 'acme-corp/bac / happy-path — 2/3 passed
Every day, fetch all open GitHub issues from repository 'acme-corp/bac / no-bugs — 2/3 passed
Every two weeks I want to check the amount of n8n usage and bug report / happy-path — 0/3 passed
Fetch the latest posts from the JSONPlaceholder API (GET https://jsonp / happy-path — 0/3 passed
Fetch the latest posts from the JSONPlaceholder API (GET https://jsonp / all-filtered — 2/3 passed
Build a Telegram chatbot workflow for a family assistant. It should re / distinct-telegram-chat — 1/3 passed
Every day at 8am, check the weather in Berlin using the OpenMeteo API / happy-path — 2/3 passed
Every hour, check the current weather for London, New York, and Tokyo / happy-path — 0/3 passed
Every hour, check the current weather for London, New York, and Tokyo / no-alerts — 1/3 passed
I want you to build a workflow that will read n8n workflow databases a / happy-path — 0/3 passed
Summary
Adds `eval:capture-plans`, a new CLI that drives a live n8n instance with each parent dataset prompt, captures the reconciled `PlannedTask[]` at the `submit-plan` boundary via SSE, and writes one row per `build-workflow` task to a target LangSmith dataset.
The point: the existing pairwise eval feeds vague Notion prompts to the builder, mixing orchestrator quality with builder quality. This new dataset isolates builder quality given a clean handover: the eval input is the same task spec that production hands to the builder via `dispatchPlannedTask`.
Resulting rows are shape-compatible with the existing pairwise CLI:
- `inputs.prompt` = `task.spec` (fed straight into `buildInProcess`)
- `inputs.evals.{dos,donts}` = inherited from the parent
- `metadata.derivedId` = `${parentExampleId}/${slug(workflowName)}` for stable, idempotent re-syncs
Resulting dataset on LangSmith: `instance-ai-builder-from-plans` (91 rows, captured from 75 of 77 parent examples; 2 skipped where the planner answered conversationally without submitting a plan).
How to test
Re-running the capture is idempotent: rows match by `derivedId` and update only when the spec content drifts.
Files
- `evaluations/harness/plan-capture.ts`: pure capture helper. Drives SSE, reconstructs `PlanningBlueprint` from `add-plan-item` events, reads reconciled `planItems` from the latest `tasks-update`, declines `plan-review`, cancels the run.
- `evaluations/langsmith/builder-from-plans-sync.ts`: sync orchestrator mirroring `dataset-sync.ts`. Captures concurrently, flattens to one row per `build-workflow` task, diffs against existing rows.
- `evaluations/cli/capture-plans.ts`: CLI entry. Defaults: `--parent-dataset notion-pairwise-workflows`, `--target-dataset instance-ai-builder-from-plans`, concurrency 4, timeout 180s.
- `evaluations/__tests__/plan-capture.test.ts`: 5 unit tests covering happy path, ask-user auto-resolve, timeout, no-plan-emitted, malformed SSE.
- `package.json`: `eval:capture-plans` script entry.
Related Linear tickets, Github issues, and Community forum posts
n/a (internal eval tooling)
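The idempotent re-sync behavior described in this PR (rows match by `derivedId` and update only when the spec content drifts) might look something like the sketch below. The `Row` and `SyncPlan` types and the `planSync` helper are hypothetical and are not taken from `builder-from-plans-sync.ts`; the sketch also assumes the spec is stored as a plain `prompt` string and does not decide what happens to existing rows that are missing from a new capture.

```typescript
// Hypothetical row shape, flattened for the purpose of diffing.
interface Row {
  derivedId: string;
  prompt: string;
}

interface SyncPlan {
  create: Row[]; // derivedId not present in the target dataset yet
  update: Row[]; // derivedId present, but the spec text changed
  unchanged: Row[]; // derivedId present with identical spec text
}

function planSync(existing: Row[], captured: Row[]): SyncPlan {
  // Index existing rows by their stable key for constant-time lookup.
  const existingById = new Map(existing.map((row) => [row.derivedId, row] as const));
  const plan: SyncPlan = { create: [], update: [], unchanged: [] };
  for (const row of captured) {
    const previous = existingById.get(row.derivedId);
    if (previous === undefined) {
      plan.create.push(row);
    } else if (previous.prompt !== row.prompt) {
      plan.update.push(row);
    } else {
      plan.unchanged.push(row);
    }
  }
  return plan;
}
```

Under these assumptions, running the sync twice with the same captured rows produces an empty `create` and `update` list on the second pass, which is the sense in which the capture is idempotent.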
Review / Merge checklist
🤖 PR Summary generated by AI