Ensemble test generation: use thinktank to validate test suites before running implementations #159

## Problem

During development of the A* pathfinding examples, a single agent wrote a test asserting `len(path) == 13` for a maze whose actual shortest path is 9. This incorrect test became the oracle for 13+ thinktank runs across Python and TypeScript, so every correctly implemented A* solution was reported as "failed." We misdiagnosed this as correlated model failure when in fact **every agent was right and the test was wrong**.

This is the test-oracle version of the single-point-of-failure problem that thinktank was built to solve. If we'd used an ensemble to _write_ the tests, convergence analysis would have caught the outlier assertion before it became the ground truth.

## Proposed solution: two-phase ensemble workflow

### Phase 1: Ensemble test generation
```bash
thinktank run "write comprehensive tests for grid.py pathfinding" -n 5 -t "python -m pytest --collect-only"
```
- N agents independently write test cases
- Convergence analysis on the _assertions themselves_ catches disagreements
- If 4/5 agents assert path length 9 and 1 asserts 13, the outlier is flagged
- The user reviews the converged tests before adopting them as the oracle
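
The convergence check on assertions could look something like the sketch below. It is not thinktank's actual analysis: the function name `find_outlier_assertions`, the `min_agreement` threshold, and the input shape (each agent's expected values keyed by the asserted expression) are all assumptions made for illustration, and extracting those values from generated test files is left out.

```python
from collections import Counter


def find_outlier_assertions(agent_assertions, min_agreement=0.6):
    """Flag asserted expressions whose expected value is not unanimous.

    agent_assertions maps a (hypothetical) agent id to a dict of
    {asserted expression: expected value}. Values must be hashable.
    """
    flagged = {}
    keys = {key for asserted in agent_assertions.values() for key in asserted}
    for key in sorted(keys):
        # Count how many agents proposed each expected value for this expression.
        votes = Counter(
            asserted[key] for asserted in agent_assertions.values() if key in asserted
        )
        majority_value, majority_count = votes.most_common(1)[0]
        agreement = majority_count / sum(votes.values())
        dissenters = sorted(
            agent
            for agent, asserted in agent_assertions.items()
            if key in asserted and asserted[key] != majority_value
        )
        if dissenters:
            flagged[key] = {
                "majority": majority_value,
                "agreement": agreement,
                "dissenters": dissenters,
                # False means no value reached the threshold; needs human review.
                "consensus": agreement >= min_agreement,
            }
    return flagged


# The 4-vs-1 split described above: four agents assert 9, one asserts 13.
assertions = {
    "agent-1": {"len(path)": 9},
    "agent-2": {"len(path)": 9},
    "agent-3": {"len(path)": 13},
    "agent-4": {"len(path)": 9},
    "agent-5": {"len(path)": 9},
}
report = find_outlier_assertions(assertions)
print(report)
```

A majority is only a hint, not proof of correctness, which is why the review step in the last bullet still matters.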

### Phase 2: Ensemble implementation (existing workflow)
```bash
thinktank run "implement A* pathfinding in grid.py" -n 5 -t "python -m pytest"
```
- Runs against the ensemble-verified test suite
- Normal Copeland scoring applies

## Implementation options

### Option A: Documented workflow (minimal code change)
- Add a "Recommended workflows" section to README
- Document the two-phase pattern: generate tests first, then generate implementation
- Lowest effort, immediately useful

### Option B: `thinktank test-gen` command
- New command that runs Phase 1 with test-specific defaults
- Could auto-detect the test framework and generate the appropriate `--collect-only` or `--dry-run` validation command
- Applies convergence analysis to test assertions specifically
- Could flag assertions where agents disagree on expected values
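
As a rough illustration of the auto-detection idea, the sketch below maps marker files in a project to a collect-only style command. The function names, the marker-file heuristics, and the choice of frameworks are assumptions, not part of thinktank. `pytest --collect-only` and `jest --listTests` are existing options that list tests without running them; any other framework would need its own entry.

```python
from pathlib import Path

# Commands that enumerate tests without executing them.
VALIDATION_COMMANDS = {
    "pytest": "python -m pytest --collect-only",
    "jest": "npx jest --listTests",
}


def detect_test_framework(project_root):
    """Guess the test framework from marker files (hypothetical heuristic)."""
    root = Path(project_root)
    if any((root / name).exists() for name in ("pytest.ini", "conftest.py")):
        return "pytest"
    if any(root.glob("jest.config.*")):
        return "jest"
    # Fall back to pytest's default test file naming convention.
    if any(root.glob("test_*.py")) or any(root.glob("tests/test_*.py")):
        return "pytest"
    return None


def validation_command(project_root):
    """Return a validation command for the project, or None if nothing matched."""
    return VALIDATION_COMMANDS.get(detect_test_framework(project_root))
```

When nothing matches, returning `None` lets the caller fall back to an explicit `-t` value instead of guessing.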

### Option C: `--generate-tests` flag on `thinktank run`
- Automatically runs Phase 1 before Phase 2 in a single invocation
- `thinktank run "implement A* in grid.py" -n 5 -t "python -m pytest" --generate-tests`
- Generates tests → reviews convergence → then runs implementation against the converged tests

## Real-world evidence from our own development

| Run | What happened | Root cause |
|-----|---------------|------------|
| 13 A* runs (Python + TS) | 0% pass rate, diagnosed as "correlated model failure" | Test asserted 13, correct answer is 9 |
| All agents passed 6/7 tests | The one failing test was the buggy one | Signal was there — we didn't investigate |
| Fix: changed assertion from 13 to 9 | Immediately: 20/20 agents pass all tests | Every agent was correct all along |

Key lesson: **a wrong test is worse than no test.** It actively penalizes correct implementations and corrupts Copeland scoring. Ensemble test generation directly addresses this.
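
The missed signal in the second table row (every agent failing the same single test) can be turned into an automatic warning. The sketch below assumes a hypothetical per-agent pass/fail mapping and hypothetical test names; it is not an existing thinktank feature.

```python
def suspect_tests(results, min_agents=3):
    """Return tests that every agent ran and every agent failed.

    results maps a (hypothetical) agent id to {test name: passed?}.
    With fewer than min_agents agents, unanimity says too little to report.
    """
    if len(results) < min_agents:
        return []
    test_names = sorted({name for outcome in results.values() for name in outcome})
    suspects = []
    for name in test_names:
        outcomes = [outcome[name] for outcome in results.values() if name in outcome]
        # Only flag a test when all agents ran it and none of them passed.
        if len(outcomes) == len(results) and not any(outcomes):
            suspects.append(name)
    return suspects


# Five agents, seven tests, all failing the same one (names are made up).
tests = [f"test_{i}" for i in range(1, 7)] + ["test_path_length"]
results = {
    f"agent-{n}": {name: name != "test_path_length" for name in tests}
    for n in range(1, 6)
}
print(suspect_tests(results))  # ['test_path_length']
```

A unanimously failing test may also reflect a genuinely hard requirement, so a hit here is a prompt to inspect the test, not a verdict that it is wrong.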

## Scope

Starting with Option A (documented workflow) is the pragmatic choice: it is useful immediately and validates the pattern before we build dedicated tooling. Options B and C are follow-ups if the pattern proves valuable.
