feat(0.25.0): improvement engine — one improvementDriver + reflective/agentic generators#61

Merged
drewstone merged 6 commits into main from feat/0.25.0-phase3-tracing-analystdriver
May 25, 2026

@drewstone
Contributor

Phase 3 (runtime side). Implements agent-eval 0.40.2's ImprovementDriver contract as ONE driver with a pluggable cost dial.

What ships

  • improvementDriver (src/improvement/) — the ONE driver. Owns the candidate lifecycle (worktree create → generate → finalize/discard, × populationSize); implements agent-eval's ImprovementDriver. Delegates the only thing that varies — how a candidate is produced — to a pluggable CandidateGenerator.
  • reflectiveGenerator: shots=1, no sandbox. Drafts patches via the existing improvement adapter and applies them. (The former 'analystDriver', now a generator.)
  • agenticGenerator: shots=N, full harness. Runs a real coding harness (claude/codex/opencode) in the worktree via the verified runLocalHarness primitive; the agent edits in place; it retries up to maxShots when a shot leaves the tree unchanged; it trusts the diff, not stdout.
  • New ./improvement subpath.
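
The candidate lifecycle described in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: the `WorktreeAdapter`, `Worktree`, and `CandidateGenerator` shapes and the `proposeCandidates` helper are hypothetical names invented for the example, the sketch is synchronous for brevity, and cleanup on throw is left out.

```typescript
// Hypothetical shapes for illustration; the real agent-eval and
// agent-runtime types may differ.
interface Worktree { path: string; branch: string }
interface CodeSurface { worktreeRef: string }

interface WorktreeAdapter {
  create(baseRef: string, index: number): Worktree;
  finalize(wt: Worktree): string; // commits the changes, returns a ref
  discard(wt: Worktree): void;    // removes the worktree and its branch
}

// The only part that varies between the cheap and the agentic setting.
interface CandidateGenerator {
  generate(wt: Worktree, findings: string[]): { applied: boolean };
}

function proposeCandidates(
  adapter: WorktreeAdapter,
  generator: CandidateGenerator,
  opts: { baseRef: string; populationSize: number; findings: string[] },
): CodeSurface[] {
  const surfaces: CodeSurface[] = [];
  for (let i = 0; i < opts.populationSize; i++) {
    const wt = adapter.create(opts.baseRef, i);
    const { applied } = generator.generate(wt, opts.findings);
    if (applied) {
      surfaces.push({ worktreeRef: adapter.finalize(wt) });
    } else {
      adapter.discard(wt); // nothing produced: leave no orphaned worktree
    }
  }
  return surfaces;
}
```

Keeping create, finalize, and discard in one place is what lets a reflective or agentic generator be swapped in without touching the lifecycle.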

Why one driver, not two (analyst + autoresearch)

They're the same operation at two settings of one dial (maxImprovementShots + sandbox/tools). The reflective path is the cheap corner of the agentic path. One driver, two generators — no proliferation. (Design review settled this.)

Descoped: default tracing in handleChatTurn

The flywheel's production-trace capture is ALREADY served by src/otel-export.ts (shipped) + runCampaign's labeledStore (eval-time). A turn-level span default-on in handleChatTurn would be a marginal public-contract change — descoped rather than ship code that doesn't earn its place.

Tests

13 real-git tests (improvement-driver 5 + agentic-generator 5 + worktree-backed reflective 3), mocking ONLY the harness subprocess (the real process boundary). Suite 313/313, build + lint clean, dep bumped to agent-eval ^0.40.0 (0.40.2).

Mechanism confidence

Built on the verified Phase-2.8 pattern (in-process executor: createWorktree → runLocalHarness(cwd=worktree) → diff). Edits land in place on the same filesystem — no sandbox-mount ambiguity.

drewstone added 6 commits May 25, 2026 09:02
Phase 3 foundation. Picks up the published 0.40.1 campaign substrate
(runCampaign, runImprovementLoop, ImprovementDriver, CodeSurface,
evolutionaryDriver) so agent-runtime can implement analystDriver against
the ImprovementDriver contract and wire default tracing. Baseline
typecheck green against the new dep.
…worktree adapter

The reflective analyst is now a DRIVER of agent-eval's one improvement loop,
not a parallel loop. Implements ImprovementDriver<AnalystFinding> from
@tangle-network/agent-eval@0.40.2.

propose():
  - pulls findings from the Phase-2 research report (report.findings when
    present, else ctx.findings)
  - drafts surface edits via the existing improvement adapter's
    proposeFromFindings (no new patch-drafting logic)
  - applies every drafted patch as ONE coherent improvement into a SINGLE
    worktree (PR-like) via the VCS-pluggable gitWorktreeAdapter
  - returns a CodeSurface{worktreeRef} the improvement loop measures on holdout
  - discards the worktree + proposes nothing if no patch applies (fail-clean)

Mirrors the improvement adapter's proven apply invocation exactly
(git apply --whitespace=fix -p0 -), run inside the candidate worktree.
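
As a rough sketch of the two mechanical pieces above (choosing the findings, then piping each drafted patch to `git apply` on stdin inside the worktree): the `Finding` shape, `pickFindings`, `applyPatches`, and the injectable `ApplyRunner` are hypothetical names for illustration, and the caller is assumed to discard the worktree when nothing applies. Only the `git apply` arguments are taken from the message above.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical finding shape; only the selection rule comes from the text.
interface Finding { id: string; summary: string }

// report.findings when present, else ctx.findings.
function pickFindings(
  report: { findings?: Finding[] } | undefined,
  ctx: { findings: Finding[] },
): Finding[] {
  return report?.findings ?? ctx.findings;
}

interface ApplyResult { status: number | null; stderr: string; error?: Error }
type ApplyRunner = (patch: string, cwd: string) => ApplyResult;

// Default runner: the trailing "-" makes git read the patch from stdin.
const gitApply: ApplyRunner = (patch, cwd) => {
  const r = spawnSync("git", ["apply", "--whitespace=fix", "-p0", "-"], {
    cwd,
    input: patch,
    encoding: "utf8",
  });
  return { status: r.status, stderr: r.stderr ?? "", error: r.error };
};

// Applies every patch inside one worktree and reports how many landed.
function applyPatches(
  patches: string[],
  worktreePath: string,
  run: ApplyRunner = gitApply,
): { applied: number; failed: string[] } {
  let applied = 0;
  const failed: string[] = [];
  for (const patch of patches) {
    const r = run(patch, worktreePath);
    if (r.error) throw r.error; // git itself could not be started
    if (r.status === 0) applied++;
    else failed.push(r.stderr.trim());
  }
  return { applied, failed };
}
```

Passing the runner as a parameter keeps the apply logic testable without a real repository, while the default still shells out to git.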

5 real-git tests: applies+commits into a worktree and returns a CodeSurface
(baseRef untouched); prefers report findings over ctx.findings; proposes
nothing on no-findings / no-edits; discards the worktree and proposes nothing
when a patch fails to apply (no orphaned worktree).

Proves the 0.40.2 worktree adapter + ProposeContext shapes end-to-end — the
prove-one-before-fanning gate before the heavyweight autoresearchDriver
(sandbox runLoop propose) and the rest of 0.25.0.

Suite: 308/308 (+5). Typecheck + lint clean against 0.40.2.
…Driver + pluggable generators

Per the design review: we don't need two drivers. The reflective analyst and
the full agentic autoresearch are the SAME operation at two settings of one
dial (maxImprovementShots + sandbox/tools). Collapse them.

- improvementDriver (src/improvement/improvement-driver.ts): the ONE driver.
  Implements agent-eval's ImprovementDriver; owns the candidate lifecycle
  (worktree create → generate → finalize/discard, × populationSize). Delegates
  the only thing that genuinely varies — HOW a candidate change is produced —
  to a pluggable CandidateGenerator.
- CandidateGenerator: the byte-producing seam. Makes (uncommitted) changes in
  a worktree; the driver commits via the worktree adapter's finalize.
- reflectiveGenerator (src/improvement/reflective-generator.ts): the
  shots=1, no-sandbox setting. Drafts patches via the existing improvement
  adapter and applies them. This is the former 'analystDriver', now expressed
  as a generator of the one driver.
- agenticGenerator (shots=N, sandbox runLoop) is the forthcoming setting —
  it plugs into the SAME improvementDriver, not a parallel 'autoresearchDriver'.

New ./improvement subpath (tsup entry + package.json export). Removed the
standalone analystDriver export from ./analyst-loop.

5 real-git tests (renamed to improvement-driver.test.ts) pass through the
unified API. Suite 308/308, build clean (dist/improvement.{js,d.ts} emitted),
typecheck + lint clean against agent-eval 0.40.2.

Result: one driver, one contract, dialed cheap→agentic — no proliferation.
…n the worktree

The shots=N, full-tools setting of the one improvementDriver. Runs a real
coding harness (claude/codex/opencode) inside the candidate worktree the
driver already created; the agent reads the codebase + research report and
edits in place; the driver commits the result into a CodeSurface.

Built on the VERIFIED runLocalHarness primitive (src/mcp/local-harness.ts) —
the same mechanism the Phase-2.8 in-process executor already uses: spawn the
harness with cwd = the worktree, on the same filesystem, so edits land in
place. No nested per-candidate sandbox (which would reintroduce a
host<->sandbox worktree-transport problem); the OUTER sandbox is the loop's
own execution context.

maxShots = the DEPTH dial: run the harness; if the worktree stays clean
(git status --porcelain empty), refine the prompt and retry up to maxShots;
return on the first shot that changes the tree. We trust the DIFF, not the
harness stdout.
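
The depth dial can be sketched as a small loop. This is a minimal illustration, not the shipped generator: `generateWithShots`, its option names, and the `refine` callback are hypothetical, and the harness and the dirty check are injected so the retry logic stands on its own.

```typescript
interface ShotResult { applied: boolean; shots: number }

function generateWithShots(opts: {
  maxShots: number;
  prompt: string;
  runHarness: (prompt: string) => void; // harness output is deliberately ignored
  isDirty: () => boolean;               // e.g. `git status --porcelain` is non-empty
  refine: (prompt: string, shot: number) => string;
}): ShotResult {
  let prompt = opts.prompt;
  let shots = 0;
  while (shots < opts.maxShots) {
    shots++;
    opts.runHarness(prompt);
    // The worktree state is the only success signal.
    if (opts.isDirty()) return { applied: true, shots };
    prompt = opts.refine(prompt, shots);
  }
  return { applied: false, shots };
}
```

Under this shape, the reflective setting corresponds to `maxShots: 1`, and a larger value buys more attempts at the cost of more harness runs.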

5 tests, mocking ONLY the harness subprocess (the real process boundary) and
using REAL git worktrees + a real dirty check:
  - applied=true when the harness changes the tree (+ prompt carries findings)
  - retries up to maxShots on no-change, then gives up
  - stops on the first shot that produces a change
  - end-to-end through improvementDriver: harness edit -> committed CodeSurface,
    main untouched
  - improvementDriver discards the worktree when nothing is produced

With reflectiveGenerator (shots=1, no sandbox) this completes the cost dial:
ONE improvementDriver, two generators (cheap reflective <-> full agentic),
both emitting CodeSurfaces the loop measures + gates. Suite 313/313, build +
lint clean against agent-eval 0.40.2.
…dirty check

Two must-fix defects from the adversarial review of the improvement engine:

1. Worktree leak on throw (improvement-driver.ts). propose() created a
   worktree then called generate()/finalize() with no cleanup on throw — a
   throw from either leaked the worktree + branch (N per population). Wrapped
   the per-candidate body in try/catch: discard best-effort on throw (never
   masking the original error), then rethrow loud. + a test asserting the
   error propagates AND no orphaned worktree remains.

2. Silent fallback in worktreeDirty (agentic-generator.ts). It returned false
   on git error — folding 'I can't tell' into 'no change', which would discard
   a candidate and mask a real failure (git missing / corrupt index / killed
   mid-run). Now throws on result.error (ENOENT) and non-zero exit, per the
   no-silent-fallbacks doctrine. The worktree is a fresh checkout, so a git
   status failure is genuinely broken state, not a normal outcome.
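
Both fixes can be sketched as follows, as an illustration under assumptions rather than the actual diff. `worktreeDirty` is named above but its signature here is assumed; `withWorktree`, `RunGit`, and `runGit` are hypothetical helpers added for the example, and the sketch is synchronous for brevity.

```typescript
import { spawnSync } from "node:child_process";

// The subset of spawnSync's result this sketch relies on.
interface GitResult { status: number | null; stdout: string; stderr: string; error?: Error }
type RunGit = (args: string[], cwd: string) => GitResult;

// Default runner: shells out to the real git binary.
const runGit: RunGit = (args, cwd) => {
  const r = spawnSync("git", args, { cwd, encoding: "utf8" });
  return { status: r.status, stdout: r.stdout ?? "", stderr: r.stderr ?? "", error: r.error };
};

// Fix 2: "cannot tell" is never folded into "no change".
function worktreeDirty(cwd: string, run: RunGit = runGit): boolean {
  const r = run(["status", "--porcelain"], cwd);
  if (r.error) {
    throw new Error(`git status could not start in ${cwd}: ${r.error.message}`);
  }
  if (r.status !== 0) {
    throw new Error(`git status exited ${r.status} in ${cwd}: ${r.stderr.trim()}`);
  }
  return r.stdout.trim().length > 0;
}

// Fix 1: discard the worktree on any throw without masking the original error.
function withWorktree<T>(
  create: () => string,
  discard: (worktreePath: string) => void,
  body: (worktreePath: string) => T,
): T {
  const wt = create();
  try {
    return body(wt);
  } catch (err) {
    try {
      discard(wt);
    } catch {
      // Best-effort cleanup: a failing discard must not hide `err`.
    }
    throw err;
  }
}
```

The inner try around `discard` is the part that keeps the original error visible: if cleanup itself fails, the caller still sees why the candidate failed.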

Suite 314/314 (+1), typecheck + lint clean.
@drewstone drewstone merged commit 3d7aad9 into main May 25, 2026
1 check passed
@drewstone drewstone deleted the feat/0.25.0-phase3-tracing-analystdriver branch May 25, 2026 17:00