Ensemble test generation: use thinktank to validate test suites before running implementations #159

Description

@that-github-user

Problem

During development of the A* pathfinding examples, a single agent wrote a test asserting len(path) == 13 for a maze whose actual shortest path is 9. This incorrect test became the oracle for 13+ thinktank runs across Python and TypeScript, causing every correctly-implemented A* solution to show as "failed." We misdiagnosed this as correlated model failure when in fact every agent was right and the test was wrong.

This is the test-oracle version of the single-point-of-failure problem that thinktank was built to solve. If we'd used an ensemble to write the tests, convergence analysis would have caught the outlier assertion before it became the ground truth.

Proposed solution: two-phase ensemble workflow

Phase 1: Ensemble test generation

thinktank run "write comprehensive tests for grid.py pathfinding" -n 5 -t "python -m pytest --collect-only"
  • N agents independently write test cases
  • Convergence analysis on the assertions themselves catches disagreements
  • If 4/5 agents assert path length 9 and 1 asserts 13, the outlier is flagged
  • User reviews the converged tests before using them as oracle
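The convergence check on assertions described above could be sketched roughly as follows. This is a minimal illustration, not thinktank's implementation: the function name `flag_outlier_assertions`, the input shape (one dict of expected values per agent), and the `min_agreement` threshold are all assumptions made for the example.

```python
from collections import Counter


def flag_outlier_assertions(expected_by_agent, min_agreement=0.6):
    """Compare the expected value each agent asserted for each test case.

    expected_by_agent: {agent_id: {case_name: expected_value}}
    (hypothetical shape; values must be hashable).
    Returns, per case, the majority value (or None if agreement is below
    min_agreement), the agreement ratio, and the agents that disagreed
    with the most common value.
    """
    votes_by_case = {}
    for agent, assertions in expected_by_agent.items():
        for case, value in assertions.items():
            votes_by_case.setdefault(case, []).append((agent, value))

    report = {}
    for case, votes in votes_by_case.items():
        counts = Counter(value for _, value in votes)
        # On a tie, most_common returns the value seen first.
        majority_value, majority_count = counts.most_common(1)[0]
        agreement = majority_count / len(votes)
        report[case] = {
            "consensus": majority_value if agreement >= min_agreement else None,
            "agreement": agreement,
            "outliers": sorted(a for a, v in votes if v != majority_value),
        }
    return report


# The 4-of-5 scenario from the list above: four agents assert 9, one asserts 13.
votes = {f"agent-{i}": {"maze_shortest_path_len": 9} for i in range(1, 5)}
votes["agent-5"] = {"maze_shortest_path_len": 13}
report = flag_outlier_assertions(votes)
```

Here `report["maze_shortest_path_len"]` names `agent-5` as the outlier, which is the point at which a human reviewer would look at the assertion before it becomes the oracle.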

Phase 2: Ensemble implementation (existing workflow)

thinktank run "implement A* pathfinding in grid.py" -n 5 -t "python -m pytest"
  • Runs against the ensemble-verified test suite
  • Normal Copeland scoring applies
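For readers unfamiliar with Copeland scoring, here is a generic sketch of the method: every pair of candidates is compared head to head, the pairwise winner gains a point and the loser drops one. The criteria used below (`tests_passed`, `convergence`) and the "more criteria won" pairwise rule are assumptions chosen for illustration; thinktank's actual criteria and weighting may differ.

```python
from itertools import combinations


def copeland_scores(metrics):
    """Generic Copeland scoring over pairwise comparisons.

    metrics: {agent_id: {criterion: value}}, where a higher value is
    assumed to be better for every criterion (hypothetical input shape).
    An agent wins a pairing if it is strictly better on more shared
    criteria than its opponent; ties leave both scores unchanged.
    """
    scores = {agent: 0 for agent in metrics}
    for a, b in combinations(metrics, 2):
        shared = metrics[a].keys() & metrics[b].keys()
        a_wins = sum(metrics[a][c] > metrics[b][c] for c in shared)
        b_wins = sum(metrics[b][c] > metrics[a][c] for c in shared)
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
    return scores


example = {
    "agent-1": {"tests_passed": 7, "convergence": 0.8},
    "agent-2": {"tests_passed": 6, "convergence": 0.9},
    "agent-3": {"tests_passed": 6, "convergence": 0.5},
}
scores = copeland_scores(example)
```

Because the score depends on test outcomes, the quality of the test suite produced in Phase 1 feeds directly into the ranking.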

Implementation options

Option A: Documented workflow (minimal code change)

  • Add a "Recommended workflows" section to README
  • Document the two-phase pattern: generate tests first, then generate implementation
  • Lowest effort, immediately useful

Option B: thinktank test-gen command

  • New command that runs Phase 1 with test-specific defaults
  • Could auto-detect test framework and generate appropriate --collect-only or --dry-run validation
  • Applies convergence analysis to test assertions specifically
  • Could flag assertions where agents disagree on expected values
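One way the framework auto-detection in Option B might work is a lookup from well-known marker files to a collection-only command. The sketch below is an assumption about how such a command could behave, not a design thinktank has committed to; the function name, the marker-file table, and the choice of frameworks are illustrative. The commands themselves (`pytest --collect-only`, `jest --listTests`, `cargo test --no-run`) are existing options of those tools, though they validate to different depths: Jest only lists test files, while Cargo compiles the tests without running them.

```python
# (marker files, framework name, command that validates tests without running them)
VALIDATION_COMMANDS = [
    (("pytest.ini", "conftest.py"), "pytest", "python -m pytest --collect-only"),
    (("jest.config.js", "jest.config.ts"), "jest", "npx jest --listTests"),
    (("Cargo.toml",), "cargo", "cargo test --no-run"),
]


def detect_validation_command(filenames):
    """Pick a validation command from the file names found in a project root.

    filenames: iterable of file names (hypothetical input; a real command
    would list the directory itself). Returns (framework, command), or
    None when no known marker file is present.
    """
    present = set(filenames)
    for markers, framework, command in VALIDATION_COMMANDS:
        if present.intersection(markers):
            return framework, command
    return None


detected = detect_validation_command(["grid.py", "conftest.py", "README.md"])
```

Returning `None` for an unrecognized project leaves room to fall back to a user-supplied `-t` command rather than guessing.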

Option C: --generate-tests flag on thinktank run

  • Automatically runs Phase 1 before Phase 2 in a single invocation
  • thinktank run "implement A* in grid.py" -n 5 -t "python -m pytest" --generate-tests
  • Generates tests → reviews convergence → then runs implementation against the converged tests

Real-world evidence from our own development

| Run | What happened | Root cause |
| --- | --- | --- |
| 13 A* runs (Python + TS) | 0% pass rate, diagnosed as "correlated model failure" | Test asserted 13, correct answer is 9 |
| All agents passed 6/7 tests | The one failing test was the buggy one | Signal was there; we didn't investigate |
| Fix: changed assertion from 13 to 9 | Immediately: 20/20 agents pass all tests | Every agent was correct all along |

Key lesson: a wrong test is worse than no test. It actively penalizes correct implementations and corrupts Copeland scoring. Ensemble test generation directly addresses this.
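The signal that went uninvestigated (every agent failing the same single test) is cheap to detect mechanically. The following is a minimal sketch under assumed names: `suspect_tests`, the per-agent results shape, and the `failure_threshold` parameter are hypothetical and not part of thinktank today.

```python
def suspect_tests(results, failure_threshold=1.0):
    """List tests that fail for at least failure_threshold of the agents.

    results: {agent_id: {test_name: passed}} with boolean outcomes
    (hypothetical shape). A test that fails for every independent agent
    is more likely a bad oracle than a shared implementation bug, so it
    is flagged for human review rather than counted as a failure.
    """
    all_tests = set().union(*(r.keys() for r in results.values()))
    suspects = []
    for test in sorted(all_tests):
        outcomes = [r[test] for r in results.values() if test in r]
        failure_rate = outcomes.count(False) / len(outcomes)
        if failure_rate >= failure_threshold:
            suspects.append(test)
    return suspects


# Three agents, all passing everything except the path-length assertion.
runs = {
    agent: {"test_finds_path": True, "test_no_path": True, "test_path_length": False}
    for agent in ("agent-1", "agent-2", "agent-3")
}
flagged = suspect_tests(runs)
```

A check like this would not prove the test is wrong, but it would have prompted a look at the assertion instead of a "correlated model failure" diagnosis.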

Scope

Starting with Option A (documented workflow) is the pragmatic choice — it's useful immediately and validates the pattern before building dedicated tooling. Options B/C are follow-ups if the pattern proves valuable.

Labels: enhancement (New feature or request)
