Agent-based simulation of automated science under publish-or-perish incentives.
Paper-chase simulates a scientific publishing ecosystem to test which governance interventions keep the literature's truth-content (precision) and discovery rate (recall) high as agents optimize for novelty rewards. Each intervention — pre-registration, replication-and-retraction, measurement-invariance — is mapped onto the precision/recall Pareto plane across incentive pressure and systematic bias. A recurring result: an independent, cross-model auditor recovers precision that a same-base-model auditor cannot, because a same-source check carries the same systematic bias it is meant to catch.
We start with a validity gate on the statistical engine. As incentives increasingly reward novel results over replication of previously published work, the literature's precision falls: rising use of questionable research practices (QRP) inflates the effective false-positive rate, so false positives accumulate in the standing literature and drive truth-content down.
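The mechanism can be made concrete with the standard positive-predictive-value identity. This is an illustrative sketch only: `base_rate`, `power`, and `alpha_eff` are hypothetical parameter names, not the simulator's API, and the numbers are stylized.

```python
def literature_precision(base_rate: float, power: float, alpha_eff: float) -> float:
    """Expected precision (PPV) of published positive results.

    base_rate: fraction of tested hypotheses that are true
    power:     probability that a true hypothesis yields a positive result
    alpha_eff: effective false-positive rate (nominal alpha inflated by QRP)
    """
    true_pos = power * base_rate
    false_pos = alpha_eff * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# Stylized values: with base rate and power held fixed, precision falls
# as QRP pushes the effective false-positive rate above the nominal 0.05.
for alpha_eff in (0.05, 0.10, 0.20, 0.40):
    print(alpha_eff, round(literature_precision(0.1, 0.8, alpha_eff), 3))
```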
Next, we implement interventions hypothesized to affect precision, recall, or both. We then map the baseline, each intervention, and combinations of interventions onto the Pareto plane and characterize their dominance regimes and tradeoffs.
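Dominance on the precision/recall plane can be sketched as follows. The function names and the numbers in `example` are made up for illustration; they are not simulation output and not the project's code.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (precision, recall)


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(results: Dict[str, Point]) -> List[str]:
    """Names of configurations that no other configuration dominates."""
    return [
        name
        for name, point in results.items()
        if not any(dominates(other, point) for other in results.values())
    ]


# Hypothetical configurations with invented (precision, recall) values.
example = {
    "baseline": (0.40, 0.70),
    "config_a": (0.55, 0.65),
    "config_b": (0.50, 0.60),
    "config_c": (0.70, 0.30),
}
print(pareto_front(example))
```

A configuration that is off the front is strictly worse than some alternative on at least one axis and no better on the other; everything on the front represents a genuine tradeoff.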
Initial interventions include:
- pre-registration (hypothesis, methods, analysis)
- incentivized replication + retraction
- measurement-invariance requirements
Builds on Smaldino & McElreath, The natural selection of bad science (RSOS 2016).
For recent results, see example runs. For the curated synthesis — established findings by regime, each with mechanism and a status tag — see FINDINGS.md.
Statistical engine validated (FPR ≈ α at q=0; power monotone in n)
- With no mitigation, the literature's truth-content falls as the novelty:replication reward ratio rises; the qualitative crisis dynamic is reproduced.
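A validity gate of this kind can be sketched with a small null simulation. This is a minimal illustration, not the project's test suite: it uses a two-sided z-test with known unit variance, and it models QRP as "measure `k` outcomes and report the smallest p-value" with probability `q`. Both choices, and all function names, are assumptions made here for concreteness.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_p(sample_mean: float, n: int) -> float:
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = abs(sample_mean) * n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def study_p(rng: random.Random, effect: float, n: int, q: float, k: int = 5) -> float:
    """Reported p-value of one simulated study.

    With probability q the study measures k independent outcomes and reports
    the smallest p-value (an assumed stand-in for QRP); otherwise it reports one.
    """
    outcomes = k if rng.random() < q else 1
    p_values = []
    for _ in range(outcomes):
        mean = sum(rng.gauss(effect, 1.0) for _ in range(n)) / n
        p_values.append(z_test_p(mean, n))
    return min(p_values)


def positive_rate(effect: float, n: int, q: float, alpha: float = 0.05,
                  trials: int = 4000, seed: int = 0) -> float:
    """Fraction of simulated studies with p < alpha."""
    rng = random.Random(seed)
    hits = sum(study_p(rng, effect, n, q) < alpha for _ in range(trials))
    return hits / trials


fpr_clean = positive_rate(effect=0.0, n=20, q=0.0)  # null effect, no QRP
fpr_qrp = positive_rate(effect=0.0, n=20, q=1.0)    # null effect, QRP in every study
powers = [positive_rate(effect=0.5, n=n, q=0.0) for n in (10, 20, 40)]
print(fpr_clean, fpr_qrp, powers)
```

The gate passes if `fpr_clean` sits close to `alpha` and `powers` increases with `n`; `fpr_qrp` shows how the same engine reports an inflated false-positive rate once outcome switching is allowed.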
Initial interventions (in progress)
- pre-registration: a modest precision lift (largest at low bias, vanishing as systematic bias takes over — it addresses QRP, not bias) at a small recall cost
- incentivized replication + retraction: raises precision substantially at low–moderate bias and cuts recall; but a same-base audit inherits the shared systematic bias, so its precision collapses once that bias is strong — an argument for cross-model audit
- measurement-invariance: under the compute budget, the initial uniform sampling algorithm limited each experiment to a small number of invariance replications. Precision improved, but a significant reduction in recall dominated the outcome for this intervention; next up: realistic sampling algorithms
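The same-base versus cross-model contrast in the replication bullet can be illustrated with a toy model. Everything here is an assumption for illustration, not the simulator's implementation: `bias` is the probability that an evaluator systematically mistakes a false claim for a true one, a same-base auditor reuses the author's bias draw, and a cross-model auditor draws its own independently.

```python
import random


def accepts(rng: random.Random, is_true: bool, fooled: bool,
            power: float, alpha: float) -> bool:
    """Whether one evaluator (author or auditor) accepts a claim."""
    if is_true:
        return rng.random() < power        # true effects are detected with `power`
    return fooled or rng.random() < alpha  # false claims pass via bias or by chance


def audited_precision(bias: float, shared_bias: bool, n_claims: int = 20000,
                      base_rate: float = 0.1, power: float = 0.8,
                      alpha: float = 0.05, seed: int = 0) -> float:
    """Precision of the toy literature after one audit-and-retract pass."""
    rng = random.Random(seed)
    tp = fp = 0
    for _ in range(n_claims):
        is_true = rng.random() < base_rate
        author_fooled = rng.random() < bias
        # Same-base auditor inherits the author's systematic error;
        # cross-model auditor makes its own, independent of the author's.
        auditor_fooled = author_fooled if shared_bias else rng.random() < bias
        published = accepts(rng, is_true, author_fooled, power, alpha)
        if published and accepts(rng, is_true, auditor_fooled, power, alpha):
            if is_true:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp)


for bias in (0.0, 0.2, 0.5):
    same = audited_precision(bias, shared_bias=True)
    cross = audited_precision(bias, shared_bias=False)
    print(bias, round(same, 3), round(cross, 3))
```

In this toy, a biased false claim that fooled the author always passes a same-base audit, so the audit removes mostly chance false positives and some true claims. An independent auditor catches a share of the biased claims as well.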
Future extensions:
- cross-context bias persistence — does invariance keep its advantage when a shared base model's bias persists across contexts, not just within? (the load-bearing test; the current model draws each context's bias independently)
- adaptive (RL) agents that learn to game/reward-hack
- early-warning detection
uv sync # creates .venv, installs deps + dev group, writes uv.lock
uv run pytest # statistical engine: FPR ≈ α, power monotone in n
uv run python scripts/run_baseline.py # produces results/validity_gate.png
- truth-content = TP / (TP + FP) = precision over the standing literature = 1 − FDR
- discovery rate = TP / (TP + FN) = recall (field-level power)
- Pareto gives a read on both and is necessary here since precision alone is gameable ('publish nothing' maps to precision = 1)
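The two metrics can be computed from confusion counts as in the sketch below. The function name and the zero-denominator conventions are assumptions chosen here: an empty literature is assigned precision 1.0, matching the 'publish nothing' case, and a field with no true effects is assigned recall 0.0.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) for the standing literature.

    tp: true effects that were published
    fp: false effects that were published
    fn: true effects that were never published
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # assumed convention
    return precision, recall


print(precision_recall(tp=40, fp=10, fn=10))  # a productive but imperfect field
print(precision_recall(tp=0, fp=0, fn=50))    # 'publish nothing'
```

The second call is the degenerate case: precision is perfect while recall is zero, which is why the two are read together on the Pareto plane.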
The framing — replication crisis in automated science as a problem worth stress-testing via simulation of mitigations — is from one of the project bullets on Konstantinos Voudouris's Pivotal mentor profile. The implementation here is mine; design choices, stylized parameter values, and errors are mine alone.