feat(0.25.0): improvement engine - one improvementDriver + reflective/agentic generators #61
Merged
Phase 3 foundation. Picks up the published 0.40.1 campaign substrate (runCampaign, runImprovementLoop, ImprovementDriver, CodeSurface, evolutionaryDriver) so agent-runtime can implement analystDriver against the ImprovementDriver contract and wire default tracing. Baseline typecheck green against the new dep.
…worktree adapter
The reflective analyst is now a DRIVER of agent-eval's one improvement loop,
not a parallel loop. Implements ImprovementDriver<AnalystFinding> from
@tangle-network/agent-eval@0.40.2.
propose():
- pulls findings from the Phase-2 research report (report.findings when
present, else ctx.findings)
- drafts surface edits via the existing improvement adapter's
proposeFromFindings (no new patch-drafting logic)
- applies every drafted patch as ONE coherent improvement into a SINGLE
worktree (PR-like) via the VCS-pluggable gitWorktreeAdapter
- returns a CodeSurface{worktreeRef} the improvement loop measures on holdout
- discards the worktree + proposes nothing if no patch applies (fail-clean)
Mirrors the improvement adapter's proven apply invocation exactly
(git apply --whitespace=fix -p0 -), run inside the candidate worktree.
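The apply step described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the helper name `applyPatch` and its return convention are assumptions; only the `git apply --whitespace=fix -p0 -` invocation and the "run inside the candidate worktree" detail come from the commit message.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper (name and signature are assumptions for illustration).
// Feeds one unified diff to `git apply` on stdin, with cwd set to the
// candidate worktree so the edit lands there and not in the base checkout.
export function applyPatch(worktreePath: string, patch: string): boolean {
  const result = spawnSync("git", ["apply", "--whitespace=fix", "-p0", "-"], {
    cwd: worktreePath,
    input: patch, // "-" tells git apply to read the patch from stdin
    encoding: "utf8",
  });
  // Assumption: a spawn failure (git missing, bad cwd) is treated as broken
  // state and surfaced, not folded into "patch did not apply".
  if (result.error) throw result.error;
  // Non-zero exit means the patch was rejected; the caller can then discard
  // the worktree and propose nothing.
  return result.status === 0;
}
```

A caller would apply every drafted patch into the same worktree and discard that worktree if none of them returns `true`.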
5 real-git tests: applies+commits into a worktree and returns a CodeSurface
(baseRef untouched); prefers report findings over ctx.findings; proposes
nothing on no-findings / no-edits; discards the worktree and proposes nothing
when a patch fails to apply (no orphaned worktree).
Proves the 0.40.2 worktree adapter + ProposeContext shapes end-to-end — the
prove-one-before-fanning gate before the heavyweight autoresearchDriver
(sandbox runLoop propose) and the rest of 0.25.0.
Suite: 308/308 (+5). Typecheck + lint clean against 0.40.2.
…Driver + pluggable generators
Per the design review: we don't need two drivers. The reflective analyst and
the full agentic autoresearch are the SAME operation at two settings of one
dial (maxImprovementShots + sandbox/tools). Collapse them.
- improvementDriver (src/improvement/improvement-driver.ts): the ONE driver.
Implements agent-eval's ImprovementDriver; owns the candidate lifecycle
(worktree create → generate → finalize/discard, × populationSize). Delegates
the only thing that genuinely varies — HOW a candidate change is produced —
to a pluggable CandidateGenerator.
- CandidateGenerator: the byte-producing seam. Makes (uncommitted) changes in
a worktree; the driver commits via the worktree adapter's finalize.
- reflectiveGenerator (src/improvement/reflective-generator.ts): the
shots=1, no-sandbox setting. Drafts patches via the existing improvement
adapter and applies them. This is the former 'analystDriver', now expressed
as a generator of the one driver.
- agenticGenerator (shots=N, sandbox runLoop) is the forthcoming setting —
it plugs into the SAME improvementDriver, not a parallel 'autoresearchDriver'.
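To make the driver/generator split concrete, here is a minimal sketch of the lifecycle under stated assumptions. Every type and function name below (`Worktree`, `WorktreeAdapter`, `CodeSurfaceSketch`, `proposeCandidates`) is hypothetical and only mirrors the shape described in this commit; the real ImprovementDriver and CandidateGenerator signatures live in the packages and may differ.

```typescript
// Illustrative shapes only; not the real agent-eval / agent-runtime types.
interface Worktree {
  ref: string;
  path: string;
}

interface WorktreeAdapter {
  create(): Promise<Worktree>;
  finalize(wt: Worktree): Promise<string>; // commits and returns a ref
  discard(wt: Worktree): Promise<void>;
}

// The seam: makes (uncommitted) changes in the worktree and reports whether
// it produced anything.
interface CandidateGenerator {
  generate(wt: Worktree): Promise<{ applied: boolean }>;
}

interface CodeSurfaceSketch {
  worktreeRef: string;
}

// The driver owns create -> generate -> finalize/discard, repeated
// populationSize times; it never needs to know how the change was produced.
// Sketch only: cleanup when generate() or finalize() throws is not shown here.
export async function proposeCandidates(
  adapter: WorktreeAdapter,
  generator: CandidateGenerator,
  populationSize: number,
): Promise<CodeSurfaceSketch[]> {
  const surfaces: CodeSurfaceSketch[] = [];
  for (let i = 0; i < populationSize; i++) {
    const wt = await adapter.create();
    const { applied } = await generator.generate(wt);
    if (!applied) {
      await adapter.discard(wt); // nothing produced: leave no worktree behind
      continue;
    }
    surfaces.push({ worktreeRef: await adapter.finalize(wt) });
  }
  return surfaces;
}
```

Under this shape, a reflective generator and an agentic generator are two implementations of the same `generate` method, which is why a second driver is unnecessary.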
New ./improvement subpath (tsup entry + package.json export). Removed the
standalone analystDriver export from ./analyst-loop.
5 real-git tests (renamed to improvement-driver.test.ts) pass through the
unified API. Suite 308/308, build clean (dist/improvement.{js,d.ts} emitted),
typecheck + lint clean against agent-eval 0.40.2.
Result: one driver, one contract, dialed cheap→agentic — no proliferation.
…n the worktree
The shots=N, full-tools setting of the one improvementDriver. Runs a real
coding harness (claude/codex/opencode) inside the candidate worktree the
driver already created; the agent reads the codebase + research report and
edits in place; the driver commits the result into a CodeSurface.
Built on the VERIFIED runLocalHarness primitive (src/mcp/local-harness.ts) —
the same mechanism the Phase-2.8 in-process executor already uses: spawn the
harness with cwd = the worktree, on the same filesystem, so edits land in
place. No nested per-candidate sandbox (which would reintroduce a
host<->sandbox worktree-transport problem); the OUTER sandbox is the loop's
own execution context.
maxShots = the DEPTH dial: run the harness; if the worktree stays clean
(git status --porcelain empty), refine the prompt and retry up to maxShots;
return on the first shot that changes the tree. We trust the DIFF, not the
harness stdout.
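The maxShots dial described above amounts to a bounded retry loop keyed on the dirty check. The sketch below is an assumption-laden illustration: `ShotDeps`, `generateWithShots`, and `refinePrompt` are hypothetical names, and the harness and dirty check are injected so the loop itself stays independent of how they are implemented.

```typescript
// Hypothetical dependency bundle; the real generator wires these differently.
interface ShotDeps {
  runHarness(prompt: string, cwd: string): Promise<void>;
  isDirty(cwd: string): Promise<boolean>; // e.g. `git status --porcelain` is non-empty
  refinePrompt(prompt: string, shot: number): string;
}

export async function generateWithShots(
  deps: ShotDeps,
  worktreePath: string,
  prompt: string,
  maxShots: number,
): Promise<{ applied: boolean; shots: number }> {
  let current = prompt;
  for (let shot = 1; shot <= maxShots; shot++) {
    await deps.runHarness(current, worktreePath);
    // Decide on the state of the tree, not on anything the harness printed.
    if (await deps.isDirty(worktreePath)) {
      return { applied: true, shots: shot }; // stop on the first change
    }
    if (shot < maxShots) {
      current = deps.refinePrompt(current, shot);
    }
  }
  return { applied: false, shots: maxShots }; // gave up: caller discards
}
```

With `maxShots = 1` this degenerates to a single attempt, which matches the "cheap corner" framing used for the reflective setting.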
5 tests, mocking ONLY the harness subprocess (the real process boundary) and
using REAL git worktrees + a real dirty check:
- applied=true when the harness changes the tree (+ prompt carries findings)
- retries up to maxShots on no-change, then gives up
- stops on the first shot that produces a change
- end-to-end through improvementDriver: harness edit -> committed CodeSurface,
main untouched
- improvementDriver discards the worktree when nothing is produced
With reflectiveGenerator (shots=1, no sandbox) this completes the cost dial:
ONE improvementDriver, two generators (cheap reflective <-> full agentic),
both emitting CodeSurfaces the loop measures + gates. Suite 313/313, build +
lint clean against agent-eval 0.40.2.
…dirty check
Two must-fix defects from the adversarial review of the improvement engine:
1. Worktree leak on throw (improvement-driver.ts). propose() created a
worktree then called generate()/finalize() with no cleanup on throw, so a
throw from either leaked the worktree + branch (N per population). Wrapped
the per-candidate body in try/catch: discard best-effort on throw (never
masking the original error), then rethrow loud. + a test asserting the
error propagates AND no orphaned worktree remains.
2. Silent fallback in worktreeDirty (agentic-generator.ts). It returned false
on git error, folding 'I can't tell' into 'no change', which would discard a
candidate and mask a real failure (git missing / corrupt index / killed
mid-run). Now throws on result.error (ENOENT) and non-zero exit, per the
no-silent-fallbacks doctrine. The worktree is a fresh checkout, so a git
status failure is genuinely broken state, not a normal outcome.
Suite 314/314 (+1), typecheck + lint clean.
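Both fixes can be sketched in a few lines. This is an illustration under assumptions, not the code in improvement-driver.ts or agentic-generator.ts: `withCandidateCleanup` is a hypothetical wrapper name, and the exact error messages are invented; the behaviors (discard best-effort then rethrow, throw on `result.error` and on non-zero exit) follow the commit message.

```typescript
import { spawnSync } from "node:child_process";

// Dirty check that refuses to guess: a git failure is an error, never "clean".
export function worktreeDirty(worktreePath: string): boolean {
  const result = spawnSync("git", ["status", "--porcelain"], {
    cwd: worktreePath,
    encoding: "utf8",
  });
  if (result.error) throw result.error; // e.g. ENOENT when git cannot be spawned
  if (result.status !== 0) {
    throw new Error(`git status exited ${result.status}: ${result.stderr}`);
  }
  return result.stdout.trim().length > 0;
}

// Hypothetical wrapper for the per-candidate body: on throw, discard the
// worktree best-effort, then rethrow the ORIGINAL error.
export async function withCandidateCleanup<T>(
  discard: () => Promise<void>,
  body: () => Promise<T>,
): Promise<T> {
  try {
    return await body();
  } catch (err) {
    try {
      await discard();
    } catch {
      // Swallowed on purpose: a cleanup failure must not mask the real error.
    }
    throw err;
  }
}
```

The design choice in both halves is the same: an unknown state is reported loudly instead of being mapped onto a normal outcome such as "no change".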
Phase 3 (runtime side). Implements agent-eval 0.40.2's ImprovementDriver contract as ONE driver with a pluggable cost dial.
What ships
- improvementDriver (src/improvement/): the ONE driver. Owns the candidate lifecycle (worktree create → generate → finalize/discard, × populationSize); implements agent-eval's ImprovementDriver. Delegates the only thing that varies, how a candidate is produced, to a pluggable CandidateGenerator.
- reflectiveGenerator: shots=1, no sandbox. Drafts patches via the existing improvement adapter + applies them. (The former 'analystDriver', now a generator.)
- agenticGenerator: shots=N, full harness. Runs a real coding harness (claude/codex/opencode) in the worktree via the verified runLocalHarness primitive; the agent edits in place; maxShots retries on no-change; trusts the diff, not stdout.
- ./improvement subpath.
Why one driver, not two (analyst + autoresearch)
They're the same operation at two settings of one dial (maxImprovementShots + sandbox/tools). The reflective path is the cheap corner of the agentic path. One driver, two generators, no proliferation. (Design review settled this.)
Descoped: default tracing in handleChatTurn
The flywheel's production-trace capture is ALREADY served by src/otel-export.ts (shipped) + runCampaign's labeledStore (eval-time). A turn-level span default-on in handleChatTurn would be a marginal public-contract change: descoped rather than ship code that doesn't earn its place.
Tests
13 real-git tests (improvement-driver 5 + agentic-generator 5 + worktree-backed reflective 3), mocking ONLY the harness subprocess (the real process boundary). Suite 313/313, build + lint clean, dep bumped to agent-eval ^0.40.0 (0.40.2).
Mechanism confidence
Built on the verified Phase-2.8 pattern (in-process executor: createWorktree → runLocalHarness(cwd=worktree) → diff). Edits land in place on the same filesystem — no sandbox-mount ambiguity.