Problem
During development of the A* pathfinding examples, a single agent wrote a test asserting len(path) == 13 for a maze whose actual shortest path is 9. This incorrect test became the oracle for 13+ thinktank runs across Python and TypeScript, so every correctly implemented A* solution was reported as "failed." We misdiagnosed this as correlated model failure when in fact every agent was right and the test was wrong.
This is the test-oracle version of the single-point-of-failure problem that thinktank was built to solve. If we'd used an ensemble to write the tests, convergence analysis would have caught the outlier assertion before it became the ground truth.
Proposed solution: two-phase ensemble workflow
Phase 1: Ensemble test generation
thinktank run "write comprehensive tests for grid.py pathfinding" -n 5 -t "python -m pytest --collect-only"
- N agents independently write test cases
- Convergence analysis on the assertions themselves catches disagreements
- If 4/5 agents assert path length 9 and 1 asserts 13, the outlier is flagged
- User reviews the converged tests before using them as oracle
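The outlier check described above can be sketched as a majority vote over each assertion's expected value. This is a minimal illustration, not thinktank's actual convergence analysis: the function name `flag_outlier_assertions`, the input shape, and the agent IDs are all assumptions made for the example.

```python
from collections import Counter


def flag_outlier_assertions(expected_by_agent):
    """Majority-vote each assertion's expected value across agents.

    expected_by_agent: {agent_id: {assertion_key: expected_value}}
    (hypothetical shape; expected values must be hashable).
    """
    # Regroup votes by assertion so each one is judged independently.
    votes_by_key = {}
    for agent, assertions in expected_by_agent.items():
        for key, value in assertions.items():
            votes_by_key.setdefault(key, []).append((agent, value))

    report = {}
    for key, votes in votes_by_key.items():
        counts = Counter(value for _, value in votes)
        # Ties are not handled specially: the first value seen wins.
        majority_value, support = counts.most_common(1)[0]
        report[key] = {
            "majority": majority_value,
            "support": support,
            "total": len(votes),
            "outliers": sorted(a for a, v in votes if v != majority_value),
        }
    return report


# Four agents assert a path length of 9, one asserts 13.
votes = {f"agent-{i}": {"len(path)": 9} for i in range(1, 5)}
votes["agent-5"] = {"len(path)": 13}
report = flag_outlier_assertions(votes)
print(report["len(path)"]["majority"])  # 9
print(report["len(path)"]["outliers"])  # ['agent-5']
```

A majority can still be wrong, which is why the last step in the list above (user review before the tests become the oracle) stays in the loop.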
Phase 2: Ensemble implementation (existing workflow)
thinktank run "implement A* pathfinding in grid.py" -n 5 -t "python -m pytest"
- Runs against the ensemble-verified test suite
- Normal Copeland scoring applies
Implementation options
Option A: Documented workflow (minimal code change)
- Add a "Recommended workflows" section to README
- Document the two-phase pattern: generate tests first, then generate implementation
- Lowest effort, immediately useful
Option B: thinktank test-gen command
- New command that runs Phase 1 with test-specific defaults
- Could auto-detect test framework and generate appropriate --collect-only or --dry-run validation
- Applies convergence analysis to test assertions specifically
- Could flag assertions where agents disagree on expected values
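To compare assertions across agents, a test-gen command would first need to pull expected values out of each generated test file. One possible approach for Python tests, sketched here with the standard `ast` module, collects every `assert <expr> == <literal>` statement. The helper name `extract_equality_assertions` and the sample test (including `find_path`, `grid`, `start`, and `goal`) are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def extract_equality_assertions(source):
    """Return {left_expression_text: literal} for `assert X == <literal>`.

    Hypothetical helper: only simple equality against a literal is
    handled; every other assertion form is ignored.
    """
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and isinstance(test.comparators[0], ast.Constant)
        ):
            # ast.unparse turns the left-hand expression back into source text.
            found[ast.unparse(test.left)] = test.comparators[0].value
    return found


# The source is only parsed, never executed, so undefined names are fine.
sample = '''
def test_shortest_path():
    path = find_path(grid, start, goal)
    assert len(path) == 9
    assert path[0] == start
'''
extracted = extract_equality_assertions(sample)
print(extracted)  # {'len(path)': 9}
```

Keying on the expression text alone is a simplification: two agents may name variables differently or test different mazes, so a real implementation would also need some way to align equivalent test cases before comparing values.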
Option C: --generate-tests flag on thinktank run
- Automatically runs Phase 1 before Phase 2 in a single invocation
thinktank run "implement A* in grid.py" -n 5 -t "python -m pytest" --generate-tests
- Generates tests → reviews convergence → then runs implementation against the converged tests
Real-world evidence from our own development
| Run | What happened | Root cause |
| --- | --- | --- |
| 13 A* runs (Python + TS) | 0% pass rate, diagnosed as "correlated model failure" | Test asserted 13, correct answer is 9 |
| All agents passed 6/7 tests | The one failing test was the buggy one | Signal was there, but we didn't investigate |
| Fix: changed assertion from 13 to 9 | Immediately: 20/20 agents pass all tests | Every agent was correct all along |
Key lesson: a wrong test is worse than no test. It actively penalizes correct implementations and corrupts Copeland scoring. Ensemble test generation directly addresses this.
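The "6/7 tests passed, same test failed everywhere" signal from the table can be checked mechanically. The sketch below flags any test that fails for at least a given fraction of independent agents. It is a heuristic for prompting manual review, not part of thinktank today, and the function name `suspect_tests`, the result shape, and the test names are assumptions for illustration.

```python
def suspect_tests(results, threshold=1.0):
    """Return tests whose failure rate across agents is >= threshold.

    results: {agent_id: {test_name: passed}} (hypothetical shape).
    A test that every independent agent fails is either genuinely hard
    or wrong, so it deserves a look before it is trusted as an oracle.
    """
    totals = {}
    failures = {}
    for outcomes in results.values():
        for test_name, passed in outcomes.items():
            totals[test_name] = totals.get(test_name, 0) + 1
            if not passed:
                failures[test_name] = failures.get(test_name, 0) + 1
    return sorted(
        name
        for name, total in totals.items()
        if failures.get(name, 0) / total >= threshold
    )


# Three agents each pass six tests and fail the same seventh one.
tests = [f"test_case_{i}" for i in range(1, 7)] + ["test_path_length"]
results = {
    agent: {name: name != "test_path_length" for name in tests}
    for agent in ("agent-1", "agent-2", "agent-3")
}
print(suspect_tests(results))  # ['test_path_length']
```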
Scope
Starting with Option A (documented workflow) is the pragmatic choice — it's useful immediately and validates the pattern before building dedicated tooling. Options B/C are follow-ups if the pattern proves valuable.