Add implement-design-doc-java skill #6
Skill translates DesignDoc JSON contracts into Java domain model code, adapting to the target project's coding style (Lombok, plain Java, records). Includes 3 eval test cases with diverse Java project fixtures and DesignDoc JSONs covering shipping, pricing, and transfers domains. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes generated Java outputs (with/without skill), grading script, timing data, and benchmark JSONs for reviewer inspection. Iteration 3 with neutral prompts pending. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… fix Iteration 3: neutral prompts (no style hints in baseline), with_skill 96.3% vs without_skill 92.6%. Money fixture was missing — caused false failure. Iteration 4: added Money.java to plain-java fixture, with_skill 100% (27/27) vs without_skill 96.3% (26/27). Consistent skill advantage on package-private entity encapsulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iter 5-6 New fixtures: style-unusual-conventions (nested repo, sealed events in separate file, factory as inner class), style-cross-module (2 bounded contexts with shared types). Cleaned all DesignDoc JSONs of "Already exists"/"NEW" hints. Simplified all eval prompts to minimal form. Iteration 5 (payroll + order-delta): with_skill 100% vs without 95%. Iteration 6 (receiving + returns): with_skill 94.4% vs without 77.8% — first significant delta (+16.6pp) on hard scenarios. Key skill wins: reusing shared types and detecting subtle conventions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Payroll: with_skill 9/9 vs without 7/9 (100% vs 77.8%). Removing "Already exists" hint from JSON exposed baseline failure on EmployeeId reuse. Order-delta: both 11/11, non-discriminating, will be removed. Combined across both evals: 20/20 vs 18/20 (+10pp). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed shipping and transfers evals (both 9/9 with and without skill). Kept 4 discriminating evals: pricing, payroll, receiving, returns. Iteration 8 (clean prompts, original 3 evals): with_skill 100% vs without 92.6%. Shipping/transfers confirmed non-discriminating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results identical to iter 6-8: with_skill 97.2% (35/36) vs without 80.6% (29/36), delta +16.7pp. All pass/fail patterns reproduced exactly. Consistent skill wins: package-private entities, shared type reuse. Consistent shared failure: factory_as_inner_class (both configs). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er 10 Factory_as_inner_class assertion was invalid — DesignDoc has no factory building block, so expecting one violates "DesignDoc is authoritative". Reverted skill change that encouraged adding unrequested factories. Iteration 10 (receiving x2): with_skill 8/8 (100%) vs without 6/8 (75%) on both runs. Delta +25pp, fully repeatable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expanded description with more trigger phrases and broader coverage. Baseline trigger eval: 11/20 (55%); all should-not-trigger cases correct, most should-trigger cases failing. May be an eval tooling limitation (`claude -p` command-based triggering vs the real skill system). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Packaged implement-design-doc-java as installable .skill file (399KB ZIP). Fixed skill name from "Implement Design Doc (Java)" to "implement-design-doc-java" (kebab-case required by packager). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Input

A DesignDoc JSON conforming to the schema at `contracts/design-doc-schema.json`. The JSON contains:
Review comment: Shouldn't the contract be bundled with the skill? In which scenarios would that matter for the skill's effectiveness?
4. **Replicate the project's exact patterns.** When the project uses a specific pattern for domain events (e.g., `sealed interface Event permits ...` with inner records), aggregates (e.g., `pendingEvents` + `flushEvents()`), or use case coordination (e.g., facade + package-private services + @Configuration), replicate that exact pattern. Do not substitute a different pattern even if it's valid DDD; the goal is consistency with the existing codebase.

5. **Implement domain model only.** Infrastructure implementations (persistence, messaging, HTTP) are out of scope unless the contract explicitly includes `external_integration` building blocks. Repository interfaces are in scope; their implementations are not.
Review comment: We need to talk this through together. The skill should certainly be responsible for implementing the domain model. HOWEVER, if infrastructure-level classes appear in the contract and are properly described, I think they should be implemented as normal. We definitely need to add evals for this.
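For readers unfamiliar with the event and aggregate patterns that principle 4 names, here is a minimal sketch of what such a project convention might look like. Only `pendingEvents`, `flushEvents()`, and the sealed-interface-with-inner-records shape come from the skill text; `Parcel`, `ParcelEvent`, and the method names are hypothetical and are not taken from the PR's fixtures. It assumes Java 17 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain events: a sealed interface whose permitted
// implementations are records nested inside it.
sealed interface ParcelEvent permits ParcelEvent.Registered, ParcelEvent.Dispatched {
    record Registered(String parcelId) implements ParcelEvent {}
    record Dispatched(String parcelId) implements ParcelEvent {}
}

// Hypothetical aggregate that buffers events until the caller flushes them.
final class Parcel {
    private final String id;
    private boolean dispatched;
    private final List<ParcelEvent> pendingEvents = new ArrayList<>();

    private Parcel(String id) {
        this.id = id;
    }

    static Parcel register(String id) {
        Parcel parcel = new Parcel(id);
        parcel.pendingEvents.add(new ParcelEvent.Registered(id));
        return parcel;
    }

    void dispatch() {
        if (dispatched) {
            throw new IllegalStateException("Parcel already dispatched: " + id);
        }
        dispatched = true;
        pendingEvents.add(new ParcelEvent.Dispatched(id));
    }

    // Returns an immutable snapshot of the buffered events and clears the buffer.
    List<ParcelEvent> flushEvents() {
        List<ParcelEvent> flushed = List.copyOf(pendingEvents);
        pendingEvents.clear();
        return flushed;
    }
}
```

If a target project already used this shape, the principle implies a new aggregate should follow it too, rather than, say, returning events directly from command methods.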
szjanikowski left a comment:

Comments on the skill
2. **Adapt to the project's style.** Before writing any code, read existing sources to detect the project's conventions (see Style Detection below). The contract dictates *what* exists; the project dictates *how* it looks.

3. **Reuse existing classes, never duplicate.** When a building block's `description` says "Already exists in the project" or a class with that name already exists in the codebase, import it from its current location. Do not create a new copy in a different package. Search the project for the class before creating it. This is critical for types like shared value objects (IDs, Money) that are used across modules.
Review comment: The contract does not strictly guarantee that an "Already exists in the project" note will be present every time reuse is required. It's a nice convention, but we also need tests where the note is absent and the skill is STILL able to work out on its own that it should use or extend existing classes.
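The "search the project for the class before creating it" step in principle 3 does not depend on a hint in the contract, which is the case this comment asks to test. A rough sketch of such a lookup follows. `ExistingTypeLocator` and `findFullyQualifiedName` are hypothetical names, not part of the skill, and the sketch assumes one top-level type per `.java` file named after that type.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper: locates an existing type by its simple name so it can
// be imported instead of duplicated in a new package.
final class ExistingTypeLocator {
    private static final Pattern PACKAGE_DECL =
            Pattern.compile("^\\s*package\\s+([\\w.]+)\\s*;", Pattern.MULTILINE);

    // Returns the fully qualified name of the first SimpleName.java found
    // under sourceRoot, or an empty Optional if no such file exists.
    static Optional<String> findFullyQualifiedName(Path sourceRoot, String simpleName)
            throws IOException {
        String fileName = simpleName + ".java";
        Optional<Path> hit;
        try (Stream<Path> files = Files.walk(sourceRoot)) {
            hit = files.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals(fileName))
                    .findFirst();
        }
        if (hit.isEmpty()) {
            return Optional.empty();
        }
        // Read the package declaration; a file without one is in the default package.
        Matcher m = PACKAGE_DECL.matcher(Files.readString(hit.get()));
        String pkg = m.find() ? m.group(1) + "." : "";
        return Optional.of(pkg + simpleName);
    }
}
```

A name match alone only covers the simplest case. Deciding whether an existing class should be extended rather than reused unchanged would need more than this lookup.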
Summary
Eval results (10 iterations, final 4 discriminating scenarios)
Results confirmed repeatable across 2 independent iterations with identical pass/fail patterns.
Structure
Key design decisions
Methodology notes
🤖 Generated with Claude Code