Add draft-to-design-doc eval (TDD-style, three variants)#8
Conversation
Adds a quick benchmark project under `evals/draft-to-design-doc/` that runs
Claude Code at three levels of scaffolding on the same Markdown design draft
and produces a Design Doc JSON for inspection.
Variants:
- vanilla — minimal CLAUDE.md, no plugin/MCP. Reads draft.md +
schema-reference.md and writes /app/output/design-doc.json directly.
- guided — extended prompt mirroring noesis:analyze-design-draft's
extract-design-model.md. Same task as vanilla.
- noesis — full noesis plugin installed via Dockerfile (bun install on
ubuntu:24.04 — needs GLIBC 2.39 for lbug native module). noesis-graph MCP
server registered through [[environment.mcp_servers]]. The skill is invoked
on /app/draft.md.
Both task sandboxes clone DDD-starter-dotnet into /app/repo/ so the agent can
inspect what is already implemented and classify model elements correctly as
modified vs added.
`tests/test.sh` is TDD-style: ONE specification (byte-for-byte identical
between the two task copies) asserts the expected correct ChangeSet shape:
- Sales BC in boundedContexts.modified (already in /app/repo/Sources/Sales)
- Sales.Pricing.Discounts module in modules.modified (already exists)
- ThresholdDiscount in buildingBlocks.added (genuinely new)
- Discount in buildingBlocks.modified (existing union gains a new variant)
Currently noesis-graph:save_design_doc enforces a green-field shape, so all
three variants fail this test. The failure IS the signal — when the
green-field constraint is lifted, runs producing the correct shape will start
passing.
`assessment_dimensions.json` is intentionally empty for now — quality grading
is left to manual inspection until the dimensions are designed.
Use `prepare-context.sh` to stage plugin sources into the noesis Dockerfile's
build context before running the noesis variant; see README.md for invocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First three-way TDD runRan all three variants on the new TDD test (sequential checks, not parallel concurrency):
What works (vanilla + guided, with /app/repo mounted)Both ad-hoc variants consume What breaks for noesis (plugin green-field constraint)The noesis variant is worse on this dimension despite having scanning + MCP — // noesis output
{
"bcs_added": ["Sales"], // wrong — should be modified
"bcs_modified": [],
"modules_added": ["Pricing.Discounts"], // wrong shape AND wrong name
"bbs_added": ["ThresholdDiscount", "Discount", "DiscountsSqlRepository"]
// Discount and DiscountsSqlRepository wrong — exist in /app/repo
}Two distinct bugs
Run artifacts under |
Summary
evals/draft-to-design-doc/— a quick benchmark project comparing how Claude Code extracts a Design Doc JSON from a Markdown design draft across three levels of scaffolding (vanilla,guided,noesis).test.sh(identical between both task copies) asserts the expected correct ChangeSet shape — currently fails for all three variants because of the green-field constraint insave_design_doc. The failure is the spec; passes will arrive when the constraint is lifted.lbugnative module), registers thenoesis-graphMCP server through[[environment.mcp_servers]], and clones DDD-starter-dotnet into/app/repo/so the agent has the implemented codebase as the diff baseline.assessment_dimensions.jsonis intentionally empty for now — quality grading is owned by manual inspection until the dimensions are designed.Why
Local development needs a fast feedback loop on
analyze-design-draftskill quality across:Smoke results from a first run: vanilla 6:38 / 12 KB, guided 13:25 / 21 KB, noesis 9:29 / 21 KB. Detailed comparison lives in the run artifacts under
jobs/(gitignored).Open issues this surfaces
noesis-graph:save_design_docenforces a green-field ChangeSet shape (rejectsmodified/removedoutsideboundedContexts). The TDD test asserts the correct (post-fix) shape, so all variants currently fail. See README's "TDD note".analyze-design-draftworkflow (no topics / decisions extracted in our run);merge_documentreturnstopics_added: 0, decisions_added: 0. Worth investigating separately whether prompt-side or skill-side fix is needed.Test plan
bash evals/draft-to-design-doc/tasks/threshold-discount-extraction-noesis/environment/prepare-context.shsucceeds locallynasde run --variant vanilla --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-docruns to completion (expected: test.sh fails)nasde run --variant guided --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-docruns to completion (expected: test.sh fails)nasde run --variant noesis --tasks threshold-discount-extraction-noesis --without-eval -C evals/draft-to-design-docruns to completion (expected: test.sh fails)artifacts/workspace/output/design-doc.json(vanilla/guided) orartifacts/workspace/noesis/design-docs/*.json(noesis) and confirm the failure mode matches the green-field bug🤖 Generated with Claude Code