feat(0.25.0): improvement engine — one improvementDriver + reflective/agentic generators by drewstone · Pull Request #61 · tangle-network/agent-runtime

drewstone · 2026-05-25T16:56:37Z

Phase 3 (runtime side). Implements agent-eval 0.40.2's ImprovementDriver contract as ONE driver with a pluggable cost dial.

What ships

- improvementDriver (src/improvement/): the ONE driver. Owns the candidate lifecycle (worktree create → generate → finalize/discard, × populationSize) and implements agent-eval's ImprovementDriver. Delegates the only thing that varies, how a candidate is produced, to a pluggable CandidateGenerator.
- reflectiveGenerator: shots=1, no sandbox. Drafts patches via the existing improvement adapter and applies them. (The former 'analystDriver', now a generator.)
- agenticGenerator: shots=N, full harness. Runs a real coding harness (claude/codex/opencode) in the worktree via the verified runLocalHarness primitive; the agent edits in place; retries up to maxShots on no-change; trusts the diff, not stdout.
- New ./improvement subpath.
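The lifecycle the driver owns can be sketched roughly as follows. This is a simplified, synchronous illustration, not the shipped code: `Worktree`, `WorktreeAdapter`, `proposeCandidates`, and the exact shape of `CandidateGenerator` and `CodeSurface` are assumptions made for the example, and error handling is omitted for brevity.

```typescript
// Hypothetical shapes for illustration; the real agent-eval and
// agent-runtime types may differ.
interface Worktree {
  ref: string;
  path: string;
}

interface WorktreeAdapter {
  create(baseRef: string): Worktree;
  finalize(wt: Worktree): string; // commits the changes, returns a ref
  discard(wt: Worktree): void;
}

// The seam: makes uncommitted changes in the worktree and reports
// whether anything was produced.
type CandidateGenerator = (wt: Worktree) => { applied: boolean };

interface CodeSurface {
  worktreeRef: string;
}

// One driver: create → generate → finalize/discard, repeated populationSize times.
function proposeCandidates(
  adapter: WorktreeAdapter,
  generate: CandidateGenerator,
  baseRef: string,
  populationSize: number,
): CodeSurface[] {
  const surfaces: CodeSurface[] = [];
  for (let i = 0; i < populationSize; i++) {
    const wt = adapter.create(baseRef);
    const { applied } = generate(wt);
    if (applied) {
      surfaces.push({ worktreeRef: adapter.finalize(wt) });
    } else {
      adapter.discard(wt); // nothing produced: leave no worktree behind
    }
  }
  return surfaces;
}
```

Under this sketch, swapping the reflective generator for the agentic one changes only the `generate` argument; the lifecycle code stays the same.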

Why one driver, not two (analyst + autoresearch)

They're the same operation at two settings of one dial (maxImprovementShots + sandbox/tools). The reflective path is the cheap corner of the agentic path. One driver, two generators — no proliferation. (Design review settled this.)

Descoped: default tracing in handleChatTurn

The flywheel's production-trace capture is ALREADY served by src/otel-export.ts (shipped) + runCampaign's labeledStore (eval-time). A turn-level span default-on in handleChatTurn would be a marginal public-contract change — descoped rather than ship code that doesn't earn its place.

Tests

13 real-git tests (improvement-driver 5 + agentic-generator 5 + worktree-backed reflective 3), mocking ONLY the harness subprocess (the real process boundary). Suite 313/313, build + lint clean, dep bumped to agent-eval ^0.40.0 (0.40.2).

Mechanism confidence

Built on the verified Phase-2.8 pattern (in-process executor: createWorktree → runLocalHarness(cwd=worktree) → diff). Edits land in place on the same filesystem — no sandbox-mount ambiguity.

Phase 3 foundation. Picks up the published 0.40.1 campaign substrate (runCampaign, runImprovementLoop, ImprovementDriver, CodeSurface, evolutionaryDriver) so agent-runtime can implement analystDriver against the ImprovementDriver contract and wire default tracing. Baseline typecheck green against the new dep.

…worktree adapter

The reflective analyst is now a DRIVER of agent-eval's one improvement loop, not a parallel loop. Implements ImprovementDriver<AnalystFinding> from @tangle-network/agent-eval@0.40.2.

propose():
- pulls findings from the Phase-2 research report (report.findings when present, else ctx.findings)
- drafts surface edits via the existing improvement adapter's proposeFromFindings (no new patch-drafting logic)
- applies every drafted patch as ONE coherent improvement into a SINGLE worktree (PR-like) via the VCS-pluggable gitWorktreeAdapter
- returns a CodeSurface{worktreeRef} the improvement loop measures on holdout
- discards the worktree + proposes nothing if no patch applies (fail-clean)

Mirrors the improvement adapter's proven apply invocation exactly (git apply --whitespace=fix -p0 -), run inside the candidate worktree.

5 real-git tests: applies+commits into a worktree and returns a CodeSurface (baseRef untouched); prefers report findings over ctx.findings; proposes nothing on no-findings / no-edits; discards the worktree and proposes nothing when a patch fails to apply (no orphaned worktree).

Proves the 0.40.2 worktree adapter + ProposeContext shapes end-to-end: the prove-one-before-fanning gate before the heavyweight autoresearchDriver (sandbox runLoop propose) and the rest of 0.25.0.

Suite: 308/308 (+5). Typecheck + lint clean against 0.40.2.
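The propose() steps listed in this commit can be sketched as below. It is an illustration under stated assumptions, not the actual implementation: `Finding`, `Patch`, `SingleWorktree`, `selectFindings`, and `proposeOnce` are hypothetical names, "present" is taken to mean that the report carries a findings array at all, and a single failed patch discards the worktree (which matches the test described above).

```typescript
// Hypothetical shapes for illustration only; not the actual
// agent-eval or agent-runtime types.
interface Finding {
  id: string;
}
interface Patch {
  diff: string;
}
interface CodeSurface {
  worktreeRef: string;
}

interface SingleWorktree {
  create(): string; // returns a worktree ref
  apply(ref: string, patch: Patch): boolean; // e.g. `git apply` inside the worktree
  commit(ref: string): void;
  discard(ref: string): void;
}

// Assumption: report findings win whenever the report has a findings array.
function selectFindings(
  report: { findings?: Finding[] } | undefined,
  ctx: { findings: Finding[] },
): Finding[] {
  return report?.findings ?? ctx.findings;
}

function proposeOnce(
  findings: Finding[],
  draft: (findings: Finding[]) => Patch[],
  wt: SingleWorktree,
): CodeSurface | undefined {
  if (findings.length === 0) return undefined; // no findings: propose nothing
  const patches = draft(findings);
  if (patches.length === 0) return undefined; // no edits: propose nothing

  // Every patch lands in ONE worktree, like a single PR.
  const ref = wt.create();
  for (const patch of patches) {
    if (!wt.apply(ref, patch)) {
      wt.discard(ref); // fail-clean: no orphaned worktree
      return undefined;
    }
  }
  wt.commit(ref);
  return { worktreeRef: ref };
}
```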

…Driver + pluggable generators

Per the design review: we don't need two drivers. The reflective analyst and the full agentic autoresearch are the SAME operation at two settings of one dial (maxImprovementShots + sandbox/tools). Collapse them.

- improvementDriver (src/improvement/improvement-driver.ts): the ONE driver. Implements agent-eval's ImprovementDriver; owns the candidate lifecycle (worktree create → generate → finalize/discard, × populationSize). Delegates the only thing that genuinely varies, HOW a candidate change is produced, to a pluggable CandidateGenerator.
- CandidateGenerator: the byte-producing seam. Makes (uncommitted) changes in a worktree; the driver commits via the worktree adapter's finalize.
- reflectiveGenerator (src/improvement/reflective-generator.ts): the shots=1, no-sandbox setting. Drafts patches via the existing improvement adapter and applies them. This is the former 'analystDriver', now expressed as a generator of the one driver.
- agenticGenerator (shots=N, sandbox runLoop) is the forthcoming setting; it plugs into the SAME improvementDriver, not a parallel 'autoresearchDriver'.

New ./improvement subpath (tsup entry + package.json export). Removed the standalone analystDriver export from ./analyst-loop.

5 real-git tests (renamed to improvement-driver.test.ts) pass through the unified API. Suite 308/308, build clean (dist/improvement.{js,d.ts} emitted), typecheck + lint clean against agent-eval 0.40.2.

Result: one driver, one contract, dialed cheap→agentic, no proliferation.

…n the worktree

The shots=N, full-tools setting of the one improvementDriver. Runs a real coding harness (claude/codex/opencode) inside the candidate worktree the driver already created; the agent reads the codebase + research report and edits in place; the driver commits the result into a CodeSurface.

Built on the VERIFIED runLocalHarness primitive (src/mcp/local-harness.ts), the same mechanism the Phase-2.8 in-process executor already uses: spawn the harness with cwd = the worktree, on the same filesystem, so edits land in place. No nested per-candidate sandbox (which would reintroduce a host<->sandbox worktree-transport problem); the OUTER sandbox is the loop's own execution context.

maxShots = the DEPTH dial: run the harness; if the worktree stays clean (git status --porcelain empty), refine the prompt and retry up to maxShots; return on the first shot that changes the tree. We trust the DIFF, not the harness stdout.

5 tests, mocking ONLY the harness subprocess (the real process boundary) and using REAL git worktrees + a real dirty check:
- applied=true when the harness changes the tree (+ prompt carries findings)
- retries up to maxShots on no-change, then gives up
- stops on the first shot that produces a change
- end-to-end through improvementDriver: harness edit -> committed CodeSurface, main untouched
- improvementDriver discards the worktree when nothing is produced

With reflectiveGenerator (shots=1, no sandbox) this completes the cost dial: ONE improvementDriver, two generators (cheap reflective <-> full agentic), both emitting CodeSurfaces the loop measures + gates.

Suite 313/313, build + lint clean against agent-eval 0.40.2.
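The maxShots retry described in this commit can be sketched as a small loop. This is a minimal illustration with injected dependencies, assuming hypothetical names (`ShotDeps`, `generateWithShots`, `refine`); the real agenticGenerator and runLocalHarness signatures are not shown in this PR and may differ.

```typescript
// Hypothetical dependency bundle; in the real code the harness is a
// subprocess spawned with cwd = the worktree.
interface ShotDeps {
  runHarness: (prompt: string, cwd: string) => void; // stdout is deliberately ignored
  isDirty: (cwd: string) => boolean; // e.g. `git status --porcelain` is non-empty
  refine: (prompt: string, shot: number) => string; // sharpen the prompt for a retry
}

function generateWithShots(
  deps: ShotDeps,
  cwd: string,
  prompt: string,
  maxShots: number,
): { applied: boolean; shots: number } {
  let current = prompt;
  for (let shot = 1; shot <= maxShots; shot++) {
    deps.runHarness(current, cwd);
    // Trust the diff, not the harness output: only a changed tree counts.
    if (deps.isDirty(cwd)) return { applied: true, shots: shot };
    current = deps.refine(current, shot);
  }
  return { applied: false, shots: maxShots };
}
```

In this sketch, `maxShots = 1` gives a single attempt with no retry, and larger values deepen the search at higher cost.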

…dirty check

Two must-fix defects from the adversarial review of the improvement engine:

1. Worktree leak on throw (improvement-driver.ts). propose() created a worktree then called generate()/finalize() with no cleanup on throw, so a throw from either leaked the worktree + branch (N per population). Wrapped the per-candidate body in try/catch: discard best-effort on throw (never masking the original error), then rethrow loud. Added a test asserting the error propagates AND no orphaned worktree remains.
2. Silent fallback in worktreeDirty (agentic-generator.ts). It returned false on git error, folding 'I can't tell' into 'no change', which would discard a candidate and mask a real failure (git missing / corrupt index / killed mid-run). Now throws on result.error (ENOENT) and non-zero exit, per the no-silent-fallbacks doctrine. The worktree is a fresh checkout, so a git status failure is genuinely broken state, not a normal outcome.

Suite 314/314 (+1), typecheck + lint clean.
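Both fixes can be illustrated in a few lines. This is a hedged sketch rather than the committed code: `GitResult` mimics the `error`/`status`/`stdout` fields returned by Node's `spawnSync`, the real `worktreeDirty` presumably runs git itself rather than receiving a result, and `withCleanup` is a hypothetical helper standing in for the try/catch in propose().

```typescript
// Subset of the fields Node's child_process.spawnSync returns.
interface GitResult {
  error?: Error; // set when the process could not be spawned (e.g. ENOENT)
  status: number | null; // exit code, null if killed by a signal
  stdout: string; // output of `git status --porcelain`
}

// Throws on "can't tell" instead of folding it into "no change".
function worktreeDirty(result: GitResult): boolean {
  if (result.error) {
    throw new Error(`git status failed to run: ${result.error.message}`);
  }
  if (result.status !== 0) {
    throw new Error(`git status exited with ${result.status}`);
  }
  return result.stdout.trim().length > 0;
}

// Runs the per-candidate body; on throw, discards best-effort and
// rethrows the ORIGINAL error.
function withCleanup<T>(body: () => T, discard: () => void): T {
  try {
    return body();
  } catch (err) {
    try {
      discard();
    } catch {
      // Best-effort cleanup: a failing discard must not mask the original error.
    }
    throw err;
  }
}
```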

drewstone added 6 commits May 25, 2026 09:02

chore(0.25.0): bump version — improvement engine (drivers + generators)

493f9ca

drewstone merged commit 3d7aad9 into main May 25, 2026
1 check passed

drewstone deleted the feat/0.25.0-phase3-tracing-analystdriver branch May 25, 2026 17:00

drewstone merged 6 commits into main from feat/0.25.0-phase3-tracing-analystdriver
