PhysLit

A pre-registered, audit-resolved diagnostic for physics literacy in large language models. PhysLit asks whether a frontier LLM can reason inside an unfamiliar physics framework — not whether it can solve textbook problems. Outputs are binary cognitive judgments, not leaderboard scores.

PhysLit is a research artifact, not a product. Every design decision optimizes for methodological auditability: pre-registered predictions, SHA-256-sealed inputs, fresh API session per stage, dual-LLM judging with an IRR gate, and a human-audit pathway for disagreement.

v0.1 result (2026-05-11)

Two predictions, locked at SHA-256 769818275e6a256...0c7df425 (tag prereg-v0.1-locked) before any production trial, evaluated on Aristotelian Mechanics across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro at N=5 trials each:

P1 — Induction failure under training-data conflict: CONFIRMED. 2 of 3 models (Claude, Gemini) introduce banned modern-physics concepts (dense, forceful, surface-supported, …) in ≥ 3/5 trials of Stage 1, despite an explicit ban in the prompt.
P3 — Meta-cognitive miscalibration: CONFIRMED. 10 trials contain at least one Stage-1-3 failure; in 7 of those 10 (70 %) the model fails to identify its own failure during Stage 4 self-reflection — well above the pre-registered 30 % threshold.
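
The P3 verdict reduces to one ratio: among trials containing at least one Stage 1-3 failure, the share whose Stage 4 self-reflection misses that failure. A minimal sketch, assuming the prediction counts as confirmed when that share exceeds the 30 % threshold; the `Trial` type and its field names are hypothetical and not taken from the repository:

```python
from dataclasses import dataclass
from typing import Iterable

PREREG_THRESHOLD = 0.30  # pre-registered miscalibration threshold (30 %)


@dataclass(frozen=True)
class Trial:
    failed_stage_1_to_3: bool      # at least one Stage 1-3 FAIL in this trial
    self_identified_failure: bool  # Stage 4 reflection named that failure


def miscalibration_rate(trials: Iterable[Trial]) -> float:
    """Share of failure-containing trials whose Stage 4 missed the failure."""
    failing = [t for t in trials if t.failed_stage_1_to_3]
    if not failing:
        raise ValueError("no failure-containing trials to evaluate")
    missed = sum(not t.self_identified_failure for t in failing)
    return missed / len(failing)


# Synthetic trials matching the reported counts: 10 failing, 7 unnoticed.
trials = (
    [Trial(True, False)] * 7
    + [Trial(True, True)] * 3
    + [Trial(False, False)] * 5
)
rate = miscalibration_rate(trials)
print(f"{rate:.0%}", rate > PREREG_THRESHOLD)  # 70% True
```

Trials without any Stage 1-3 failure are excluded from the denominator, which is why the rate is 7/10 rather than 7/15.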

A third finding emerged from the methodology itself:

Cross-vendor LLM-judge disagreement (the figure reported as IRR) = 36.67 %. Two independent judges (Claude + OpenAI) disagreed on more than a third of all PASS/FAIL classifications, triggering the prereg-mandated human audit. On this material, no single-judge LLM benchmark would have been reliable.
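
The 36.67 % figure is consistent with 22 disagreements out of 60 paired judgments. The 22 matches the number of audited DISAGREE cases; the denominator of 60 is an assumption (one ruling per judge per production call). A minimal sketch of that arithmetic and of the 25 % audit gate, with hypothetical function names and a single overall rate rather than a per-stage one:

```python
from typing import Sequence

AUDIT_THRESHOLD = 0.25  # disagreement above 25 % triggers the human audit


def disagreement_rate(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Fraction of paired PASS/FAIL verdicts on which the two judges differ."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and equally long")
    disagreements = sum(a != b for a, b in zip(judge_a, judge_b))
    return disagreements / len(judge_a)


def needs_human_audit(rate: float, threshold: float = AUDIT_THRESHOLD) -> bool:
    """The gate is strict: exactly 25 % does not trigger an audit."""
    return rate > threshold


# Illustrative data only: 60 paired verdicts, 22 of which differ.
judge_a = ["PASS"] * 60
judge_b = ["FAIL"] * 22 + ["PASS"] * 38
rate = disagreement_rate(judge_a, judge_b)
print(f"{rate:.2%}", needs_human_audit(rate))  # 36.67% True
```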

Scope: 1 framework × 3 models × N=5 × 4 stages = 60 production API calls, plus 120 judge calls, for 180 calls and ≈ $14 USD in total.
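
The call counts follow directly from the design. A short sketch of the bookkeeping, assuming the 120 judge calls come from two judges each ruling once on every production call; `RoundScope` is a hypothetical helper, not a class from the repository:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundScope:
    frameworks: int
    models: int
    trials_per_model: int
    stages: int
    judges: int  # independent LLM judges ruling on each production call

    @property
    def production_calls(self) -> int:
        return self.frameworks * self.models * self.trials_per_model * self.stages

    @property
    def judge_calls(self) -> int:
        return self.production_calls * self.judges

    @property
    def total_calls(self) -> int:
        return self.production_calls + self.judge_calls


v0_1 = RoundScope(frameworks=1, models=3, trials_per_model=5, stages=4, judges=2)
print(v0_1.production_calls, v0_1.judge_calls, v0_1.total_calls)  # 60 120 180
```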

| Where to look | What's in it |
| --- | --- |
| `analysis/v0_1_report.md` | English narrative report: motivation, design, results, next steps |
| `analysis/v0_1_findings.md` | Auto-generated pre- and post-audit numerics + pipeline diagram |
| `analysis/v0_1_audit_human_review.md` | All 22 human-audit verdicts on DISAGREE cases |
| `results/<model-id>/` | Verbatim trial JSONs + judge verdicts for every API call |

02_fmv result (2026-05-18)

PhysLit's second framework experiment: the F=mv World, a counterfactual world where a body's pace is set by the push acting on it at that moment (force ∝ velocity, not acceleration). Four predictions locked at tag prereg-02_fmv-locked before any production trial, evaluated across the same three models at N=5:

P1 — Induction failure: REFUTED. Only 4 of 15 Stage 1 trials failed — all four Gemini. Claude and GPT induced the F=mv rules cleanly, without sliding back to F=ma. Frontier models did not fail to reason inside this counterfactual world — the opposite of v0.1.
P2 — Meta-cognitive miscalibration: CONFIRMED. 4 of 6 failure-containing trials over-claim in Stage 4 self-reflection (66.7 %).
P3 — Mechanical criteria reduce judge disagreement: PARTIALLY CONFIRMED. Dual-judge IRR 26.67 % — down from v0.1's 36.67 %, but not below the 25 % bar.
P4 — Stage 3 quantitative leak: REFUTED. 0 of 45 quantitative predictions named the right direction with an F=ma ratio.

Two methodology findings:

A mechanically-specified criterion makes an LLM disagree-resolver reliable. An LLM resolver run against the mechanical 02_fmv criteria agreed with the human audit on 12/12 content cases (100 %) — versus 29.4 % in v0.2 on v0.1's interpretation-laden criteria.
Judge reliability does not transfer across frameworks. The OpenAI judge was the more reliable of the two on v0.1 Aristotelian; on F=mv it agreed with the human audit on 3/14 disagreement cases, the Claude judge on 11/14 — same prompts, same models.
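
On a disagreement case with binary verdicts, exactly one judge can match the human audit, which is why the two counts above sum to 14. A minimal sketch of the per-judge agreement calculation on synthetic verdicts; the data and function names are illustrative and not drawn from the committed results:

```python
from typing import List, Sequence


def agreement_with_audit(judge: Sequence[str], audit: Sequence[str]) -> float:
    """Fraction of audited cases where the judge's verdict matches the human's."""
    if len(judge) != len(audit) or not audit:
        raise ValueError("verdict lists must be non-empty and equally long")
    return sum(j == a for j, a in zip(judge, audit)) / len(audit)


def flip(verdict: str) -> str:
    return "FAIL" if verdict == "PASS" else "PASS"


# Synthetic stand-in for 14 audited disagreement cases.
audit: List[str] = ["FAIL"] * 14
claude_judge = audit[:11] + [flip(v) for v in audit[11:]]  # matches on 11 cases
openai_judge = [flip(v) for v in claude_judge]             # disagrees with Claude everywhere

print(round(agreement_with_audit(claude_judge, audit), 3))  # 0.786
print(round(agreement_with_audit(openai_judge, audit), 3))  # 0.214
```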

Scope: content axis only (the N9-N12 structural axis is out of scope by explicit prereg decision). 60 production + 120 judge + 12 resolver calls ≈ $17.3 USD.

| Where to look | What's in it |
| --- | --- |
| `analysis/02_fmv_report.md` | English narrative report: motivation, design, results |
| `analysis/02_fmv_findings.md` | Judging report + post-audit numerics |
| `analysis/02_fmv_audit_human_review.md` | Human verdicts on all 14 disagreement cases |
| `results/<model-id>/02_fmv/` | Verbatim trial JSONs (+ `.md` companions) + judge verdicts |

02_fmv.1 result (2026-05-18)

The structural axis (necessary conditions N9-N12: parsimony, independence, traceability, hierarchy) applied to the 60 frozen 02_fmv trials — an additive re-analysis layer, no new tested-model trials. Two predictions locked at tag prereg-02_fmv.1-locked before judging:

P1 — Mechanical (Stage-1-only) criteria lower the structural IRR: REFUTED. The structural-axis dual-judge IRR is 46.67 % — above v0.2 Aristotelian's 40 %, not below. The v0.2 Stage 1+2 double-count was a real defect, but it was not the dominant cause of structural disagreement: the 7 splits were N10/N11/N12 judgment calls, not counting artifacts.
P2 — The structural axis catches a content-missed failure: CONFIRMED. 8 of the 9 trials that passed all three content stages fail the structural axis. Only 1 of 15 trials survives as composite PASS.
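
The composite verdict is a logical AND of the two axes. A minimal sketch on synthetic trials that reproduce the reported totals (9 content passes, 8 of which fail structurally, 1 composite PASS); the `TrialVerdict` type is hypothetical, and the split of the six content-failing trials is illustrative only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialVerdict:
    content_pass: bool     # passed all three content stages
    structural_pass: bool  # passed the structural axis (N9-N12)

    @property
    def composite_pass(self) -> bool:
        return self.content_pass and self.structural_pass


trials = (
    [TrialVerdict(True, True)] * 1      # content PASS, structural PASS
    + [TrialVerdict(True, False)] * 8   # content PASS, structural FAIL
    + [TrialVerdict(False, True)] * 4   # illustrative split of the other six
    + [TrialVerdict(False, False)] * 2
)

content = sum(t.content_pass for t in trials)
caught = sum(t.content_pass and not t.structural_pass for t in trials)
composite = sum(t.composite_pass for t in trials)
print(content, caught, composite)  # 9 8 1
```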

Two methodology findings:

Judge reliability reverses completely between axes. On the content axis the Claude judge agreed with the human audit on 11/14 disagreement cases (79 %) and the OpenAI judge on 3/14 (21 %); on the structural axis the order flips: Claude 14 %, OpenAI 86 %. Same models, same trials, only the task changed. Judge reliability is task-dependent, not model-dependent.
Content and structural quality are anti-correlated. GPT passed the content axis in all 5 trials but failed the structural axis in all 5; Gemini failed the content axis in all 5 trials but passed the structural axis in 3 of 5. The two axes measure genuinely different competences.

Scope: structural axis over the frozen 02_fmv trials. 30 structural-judge + 7 resolver calls ≈ $4.0 USD.

| Where to look | What's in it |
| --- | --- |
| `analysis/02_fmv_1_report.md` | English narrative report: design, results |
| `analysis/02_fmv_1_findings.md` | Judging report + post-audit numerics |
| `analysis/02_fmv_1_structural_audit_human_review.md` | Human verdicts on all 7 structural disagreement cases |
| `results/<model-id>/02_fmv/structural/` | Verbatim structural-judge verdicts |

02_fmv.2 result (2026-05-20)

Single-variable control experiment on the F=mv framework: same observations, same models, same N=5, same Stage 2-4 prompts, same judges, same criteria — only the Stage 1 prompt changes (one natural-language axiomatisation paragraph added). Two predictions locked at tag prereg-02_fmv.2-locked:

P1 — Axiomatisation raises structural pass rate: STRONGLY CONFIRMED. Treatment structural PASS 11/15 vs control 5/15 — at the doubling threshold. Per-model: Claude 2/5 → 5/5, GPT 0/5 → 2/5, Gemini 3/5 → 4/5.
P2 — Content competence does not degrade: CONFIRMED. Treatment content PASS 9/15 vs control 9/15 — exactly flat. No coverage was traded for parsimony.
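
The per-model counts for P1 can be checked against the totals in a few lines. A sketch of that arithmetic; reading "doubling threshold" as treatment passes being at least twice control passes is an assumption, and the variable names are illustrative:

```python
TRIALS_PER_MODEL = 5

# Structural PASS counts per model, as reported for P1.
control = {"Claude": 2, "GPT": 0, "Gemini": 3}
treatment = {"Claude": 5, "GPT": 2, "Gemini": 4}

control_total = sum(control.values())      # 5 of 15
treatment_total = sum(treatment.values())  # 11 of 15

lift = treatment_total - control_total     # absolute lift in passing trials
ratio = treatment_total / control_total    # relative lift
deltas = {model: treatment[model] - control[model] for model in control}

# Assumed reading of the threshold: treatment >= 2 x control.
meets_doubling = treatment_total >= 2 * control_total

print(lift, round(ratio, 2), deltas, meets_doubling)
# 6 2.2 {'Claude': 3, 'GPT': 2, 'Gemini': 1} True
```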

Composite (content AND structural) jumped from 1/15 to 6/15, a six-fold increase driven entirely by the structural axis.

The 02_fmv.1 self-organisation thesis is causally confirmed. 02_fmv.1 §2.7 predicted that the structural failure was a self-organisation gap, not a knowledge gap: models that know the right rules can axiomatise them when asked but don't by default; models that do not know them cannot. Both halves replicate — Claude/GPT (content-strong) respond to the cue; Gemini (content-weak) barely moves. The descriptive finding from 02_fmv.1 is now a causal/mechanistic one.

One failure mode worth recording: Claude trial 2 lost its content axis under the treatment — Stage 2 fabricated "the track pushes upward to cancel the downward pull" (P3 violation). Parsimony pressure can push a model to invent a balancing mechanism when the observations are silent. A follow-on instruction should explicitly forbid introducing forces not in the observations.

Scope: 60 new tested-model trials + 120 judge + 16 resolver calls ≈ $5.5 USD.

| Where to look | What's in it |
| --- | --- |
| `analysis/02_fmv_2_report.md` | English narrative report |
| `analysis/02_fmv_2_findings.md` | Judging report + post-audit numerics |
| `analysis/02_fmv_2_audit_human_review.md` | Human verdicts on all 16 disagreement cases |
| `frameworks/02_fmv/prompts/stage1_induction_axiomatised.md` | The treatment Stage 1 prompt (the manipulated variable) |
| `results/<model-id>/02_fmv_2/` | Treatment trials + judge verdicts |

v0.3 result (2026-05-20)

Cross-framework replication of the axiomatisation control: the same one-paragraph instruction from 02_fmv.2, applied to the v0.1 Aristotelian framework. Byte-identical wording. Two predictions locked at tag prereg-v0.3-locked:

P1 — Axiomatisation raises structural pass rate: STRONGLY CONFIRMED. Treatment structural PASS 15/15 vs control 8/15 — saturated. Absolute lift +7, exceeding the prereg's +5 STRONGLY threshold and the 02_fmv.2 lift of +6. Per-model: Claude 5/5 → 5/5 (already saturated), GPT 0/5 → 5/5 (perfect ceiling), Gemini 3/5 → 5/5.
P2 — Content competence does not degrade: CONFIRMED. Treatment content PASS 6/15 vs control 5/15 (+1).

Composite (content AND structural) jumped from 2/15 to 6/15, reaching the same composite ceiling as F=mv (1/15 to 6/15).

The axiomatisation effect generalises across frameworks. The same intervention produced the same shape of result on a counterfactual world (F=mv) and a historical one (Aristotelian): structural moves dramatically up, content holds roughly flat, composite jumps. The 02_fmv.1 self-organisation thesis is causally confirmed on a second framework.

One important side finding: all 8 content disagreements were audited as FAIL. The Claude content judge took the lenient direction (PASS on 8/8) and was wrong every time. The pattern: parsimony pressure can pull a model toward training-data vocabulary ("denser", "speeds up / slows down", explicit naming of Galileo's vacuum result). This parallels, but is broader than, the 02_fmv.2 Claude trial 2 fabrication (a P3 violation). A future round should sharpen the instruction to forbid introducing vocabulary beyond what the observations provide.

Scope: 60 new tested-model trials + 120 judge + 13 resolver calls ≈ $6.8 USD.

| Where to look | What's in it |
| --- | --- |
| `analysis/v0_3_report.md` | English narrative report, centered on the cross-framework comparison |
| `analysis/v0_3_findings.md` | Judging report + post-audit numerics |
| `analysis/v0_3_audit_human_review.md` | Human verdicts on all 11 disagreement cases |
| `frameworks/01_aristotelian/prompts/stage1_induction_axiomatised.md` | Treatment Stage 1 prompt (byte-identical insertion to 02_fmv.2's) |
| `results/<model-id>/01_aristotelian_3/` | Treatment trials + judge verdicts |

Why this exists

Existing LLM physics benchmarks count correct answers and report a percentage. Two structural flaws follow:

The percentage cannot distinguish "understands physics" from "has seen similar problems during training."
The percentage carries no information about cognitive boundaries — 90 % vs 91 % tells you nothing about what the model can and cannot do.

PhysLit asks a different question: can the model do the cognitive work that constitutes physical reasoning — induction, formulation, prediction — inside a framework whose conclusions don't match its training prior? Aristotelian Mechanics is the cleanest test case: historically real, internally consistent, present in training data primarily as a position the training data argues against. A model that has "learned Aristotle" is precisely a model that has learned to dismiss this framework; the test is whether it can suspend that dismissal long enough to reason inside the framework on its own terms.

Full motivation and design rationale: docs/product-spec.md (in Chinese).

How it works

flowchart LR
    PRE["prereg<br/>(SHA-256 sealed)"] --> RUN["Production runner<br/>3 models × 5 trials × 4 stages"]
    RUN --> JUDGE["Dual-judge<br/>(Claude + OpenAI)"]
    JUDGE --> IRR{"IRR > 25 %?"}
    IRR -->|no| PUB["Publish verdicts"]
    IRR -->|yes| AUDIT["Human audit"] --> PUB

Four design rules, all enforced in code:

Pre-registration is irreversible. Predictions live in predictions/v0_1_prereg.md, SHA-256-sealed and git-tag-locked. A pre-commit hook (scripts/verify_prereg_integrity.py) and a matching CI check fail any silent edit. New predictions require a new tag.
Fresh API session per stage. Stages 1, 2, 3, 4 each create a new client and a new session UUID. No context reuse, no multi-turn — the model only sees its own prior outputs replayed as text.
Open data verbatim. Every prompt sent + every response received is committed under prompts/ and results/. Selective publishing is forbidden — failed trials are committed as failure records.
Dual-judge IRR + human-audit gate. Stage-1-3 PASS/FAIL judgments run through two independent LLM judges; disagreement > 25 % on any stage triggers a human audit before results can be published.
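
The first rule amounts to comparing a file's current SHA-256 digest with the digest recorded at lock time. The sketch below illustrates that idea with the standard library only; it is not the contents of scripts/verify_prereg_integrity.py, and the file name and helper functions are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: Path, sealed_hex: str) -> bool:
    """True when the file still hashes to the digest recorded at lock time."""
    return sha256_of(path) == sealed_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    prereg = Path(tmp) / "example_prereg.md"  # stand-in for a sealed file
    prereg.write_text("P1: example prediction\n", encoding="utf-8")
    sealed = sha256_of(prereg)                # digest recorded at lock time

    intact = is_unmodified(prereg, sealed)
    prereg.write_text("P1: edited prediction\n", encoding="utf-8")
    tampered = not is_unmodified(prereg, sealed)

print(intact, tampered)  # True True
```

A hook built on this check would exit non-zero when `is_unmodified` returns False, so a silent edit blocks the commit.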

Full architectural rules: CLAUDE.md.

Reproduce v0.1

Every verdict in the v0.1 report is reproducible from the locked tag.

git clone https://github.com/dongzhang84/physlit
cd physlit
git checkout prereg-v0.1-locked
uv sync

# .env.local
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
GEMINI_API_KEY=...

uv run python scripts/run_v0_1.py        # ≈ $5.76, 60 production API calls, ~30 min
uv run python scripts/judge_v0_1.py      # ≈ $8.23, 120 judge API calls
uv run python scripts/apply_audit.py     # 0 cost — replays the 22 committed audit verdicts

analysis/v0_1_findings.md will now contain both pre-audit and post-audit blocks. The 22 audit verdicts are committed both as prose (analysis/v0_1_audit_human_review.md) and as an embedded dict in scripts/apply_audit.py; no human re-audit is required to reproduce the published verdicts. Tested-model output is non-deterministic across vendors, so your trial responses will not be byte-identical to ours — but the verdict pattern is robust per prereg.
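
Replaying an audit can be pictured as a lookup: where the two judges agree, their verdict stands; where they disagree, the committed human verdict replaces it. A minimal sketch under that assumption; the case-ID format and function signature are hypothetical and do not describe the actual scripts/apply_audit.py:

```python
from typing import Dict, Mapping, Tuple

Verdict = str  # "PASS" or "FAIL"


def apply_audit(
    paired: Mapping[str, Tuple[Verdict, Verdict]],
    audit: Mapping[str, Verdict],
) -> Dict[str, Verdict]:
    """Resolve each case: judge consensus if any, else the human verdict."""
    final: Dict[str, Verdict] = {}
    for case_id, (judge_a, judge_b) in paired.items():
        if judge_a == judge_b:
            final[case_id] = judge_a
        elif case_id in audit:
            final[case_id] = audit[case_id]
        else:
            raise KeyError(f"no audit verdict committed for {case_id}")
    return final


# Hypothetical case IDs; the repository's own keys may differ.
paired = {
    "model-x/trial-1/stage-1": ("PASS", "PASS"),
    "model-x/trial-1/stage-2": ("PASS", "FAIL"),
}
audit = {"model-x/trial-1/stage-2": "FAIL"}

final = apply_audit(paired, audit)
print(final)
```

Raising on a missing audit entry keeps an unresolved disagreement from being published by accident.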

Repo layout

physlit/
├── predictions/v0_1_prereg.md            pre-reg, SHA-256 sealed, tag-locked
├── frameworks/01_aristotelian/           12 observations, criteria, prediction scenarios
├── prompts/                              all stage + judge prompts, frozen at lock
├── results/<model-id>/01_aristotelian/   60 trial JSONs + 120 judge verdicts, verbatim
├── analysis/                             findings, audit, narrative report
├── scripts/                              run / judge / audit / verify
├── src/physlit/                          runners, schema, judges (Python, mypy strict)
├── docs/product-spec.md                  methodology, design rules, predictions
├── docs/implementation-guide.md          phase-by-phase build plan
├── CLAUDE.md                             architectural rules (load-bearing)
├── CHANGELOG.md                          phase-by-phase release notes
├── LICENSE                               MIT — code
└── LICENSE-DATA                          CC BY 4.0 — frameworks, predictions, prompts, results, analysis

Local development

uv sync                              # install deps + dev tools
uv run pre-commit install            # one-time: hook ruff + prereg-integrity + spec validators

Local gates (must all pass before commit):

uv run ruff format --check .
uv run ruff check .
uv run mypy
uv run pytest
uv run python scripts/verify_prereg_integrity.py    # confirms prereg SHA-256 unchanged

CI never runs real API calls — only mocks in tests/test_runners_with_mock.py. Costly runs (run_v0_1.py, judge_v0_1.py) are gated by a confirmation prompt when the estimated spend exceeds $5.
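
A spend gate of this kind can be sketched as a threshold check followed by an explicit yes/no prompt. The function names and prompt wording below are assumptions, not the repository's implementation; the prompt function is injectable so the gate can be exercised without a terminal:

```python
from typing import Callable

CONFIRM_THRESHOLD_USD = 5.0  # estimated spend above this requires confirmation


def requires_confirmation(
    estimated_usd: float, threshold: float = CONFIRM_THRESHOLD_USD
) -> bool:
    return estimated_usd > threshold


def confirm_spend(estimated_usd: float, ask: Callable[[str], str] = input) -> bool:
    """Return True when the run may proceed."""
    if not requires_confirmation(estimated_usd):
        return True  # cheap runs proceed without prompting
    answer = ask(f"Estimated spend ${estimated_usd:.2f}. Proceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


# Scripted answers stand in for a human at the keyboard.
print(confirm_spend(0.0, ask=lambda _: "n"))   # True: below threshold, never asked
print(confirm_spend(5.76, ask=lambda _: "n"))  # False: declined
print(confirm_spend(8.23, ask=lambda _: "y"))  # True: confirmed
```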

Status & roadmap

| Round | Scope | Status |
| --- | --- | --- |
| v0.1 | Aristotelian Mechanics, content axis × 3 models × N=5 | ✅ Done (2026-05-11) |
| v0.2 | Structural axis (N9-N12) + LLM disagree-resolvers, additive re-analysis of v0.1 | ✅ Done (2026-05-13) |
| 02_fmv | F=mv counterfactual world, content axis × 3 models × N=5 | ✅ Done (2026-05-18) |
| 02_fmv.1 | Structural axis (N9-N12) on the F=mv trials, additive re-analysis | ✅ Done (2026-05-18) |
| 02_fmv.2 | Axiomatisation control: single-variable Stage 1 prompt change vs `02_fmv` | ✅ Done (2026-05-20) |
| v0.3 | Cross-framework replication of `02_fmv.2`'s axiomatisation control on Aristotelian | ✅ Done (2026-05-20) |
| next | Further frameworks judged under a common mechanical-criteria standard; per-axis judge validation | Planned |

Pre-registration is framework-scoped from 02_fmv onward (tag prereg-<id>-locked). The original v1.0 ambition of 15 frameworks has been retired in favor of methodology-first iteration.

Contributing

PhysLit welcomes:

Reproduction reports — run scripts/run_v0_1.py + judge_v0_1.py + apply_audit.py yourself and open an issue if your verdict pattern diverges from ours.
Methodology critique as GitHub issues — especially around the IRR threshold (25 %) and the audit pathway.
Framework proposals for future rounds: open an issue describing the framework, its Category (A: historical / B: counterfactual self-consistent / C: arbitrary rules), and a draft observation set. Authoring tier and minimum content checklist live in docs/implementation-guide.md.
Code PRs must pass ruff check, mypy --strict, pytest, and the prereg integrity hook.

PhysLit does not accept:

Changes to the locked prereg or any frozen artifact under the prereg-v0.1-locked tag.
Pull requests that compromise the four design rules (multi-turn shortcuts, judge-pruning to lower IRR, selective result publishing, alias-pinned model IDs).

License

Code (src/, tests/, scripts/, configs) — MIT
Data (frameworks/, predictions/, prompts/, results/, analysis/) — CC BY 4.0

The split is deliberate: re-use the code freely without attribution friction; re-use the data with attribution so the prereg trail stays traceable.

Citation

If you use PhysLit in academic or evaluation work, please cite the locked prereg tag:

@misc{physlit_v0_1_2026,
  author       = {Zhang, Dong},
  title        = {{PhysLit v0.1}: A Pre-Registered Diagnostic of LLM Physics Literacy on Aristotelian Mechanics},
  year         = {2026},
  howpublished = {\url{https://github.com/dongzhang84/physlit}},
  note         = {Pre-registration tag \texttt{prereg-v0.1-locked}, SHA-256 \texttt{769818275e6a25665116f13be2a4be440f00a8f49453fd8587239b410c7df425}}
}

Upstream

PhysLit grew out of indie-product-playbook. The original spec lives at ideas/physlit.md upstream.
