TracEx extracts structured experimental data from condensed-matter physics papers and turns every extracted measurement into a citable, machine-verifiable record — traceable back to the exact sentence in the PDF where it was reported.
- Motivation
- Approach
- Output
- Why This Matters
- Quickstart
- Repository Layout
- Key Design Invariants
- Domain
TracEx was built to support my PhD thesis writing and a companion perspective article on magnetic materials. Materials science papers carry a lot of fluff meant to convince referees of the work's importance, plus a lot of interpretation of results that is often wrong. I wanted to extract the experimental results themselves, with provenance and at a low API cost.
Unfortunately, existing extraction systems do not provide this. NEMAD (Zhang et al., 2024; arXiv:2409.15675) and GPTArticleExtractor (Zhang et al., 2024; arXiv:2401.05875) build large magnetic-property databases using GPT-3.5/4 with manual spot-checking; neither links a value to its source sentence. Broader extraction pipelines — Ghosh & Tewari (2025; arXiv:2510.01235), Kamatchi Sundaram et al. (2026; arXiv:2602.04602), MatSKRAFT (Hira et al., 2025; arXiv:2509.10448), AgentCAT (2026; arXiv:2602.18479) — share the same gap: the chain of custody from source sentence to database entry is broken. The most provenance-aware system published to date, Rameshbabu et al. (2026; arXiv:2604.07584), tags each value by extraction method (text, figure, derivation) but not by source location. LitXBench (Chong & Colindres, 2026; arXiv:2604.07649) benchmarks frontier LLMs on experimental extraction and documents the structural failure: pipelines systematically assign measurements to the wrong experimental context because attribution is never anchored to specific text. Therefore, I had to do it myself.
TracEx addresses this by stamping every sentence with an immutable ID before the LLM sees anything, extracting numeric candidates deterministically before any LLM synthesis, and verifying every output value against its cited sentence after the LLM responds.
Development note. The pipeline has not been formally benchmarked. Data extraction from figures is not yet implemented and remains in development.
The pipeline has five stages:
PDF
│
▼
[Stage 1 — Parse + Render] (deterministic)
document_index.json ← structured index with immutable sentence IDs
manuscript.md ← full text with <!--s:section:sN--> anchors on every sentence
│
▼
[Stage 2 — Deterministic Extraction] (deterministic, no LLM)
compounds.json ← all chemical entities (regex + pymatgen validation)
quantities.json ← all number-unit pairs (quantulum3)
│
▼
[Stage 3 — LLM Synthesis] (1 LLM call per paper)
synthesis.json ← per-sample measurement records, each citing sentence IDs
│
▼
[Stage 4 — Verification Scoring] (deterministic, no LLM)
synthesis_scored.json ← same records with per-measurement A/B/C provenance grade
verification_report.json
│
▼
[Stage 5 — Render Notes] (deterministic)
notes/sample_<id>.md ← one Obsidian-compatible note per sample
Stage 1 stamps every sentence in the paper with an immutable ID (`abstract:s3`,
`results:s12`, `table:1:r2c3`, `fig:mvsH:caption`). These IDs are embedded as HTML comments in
`manuscript.md` and are never modified by any later stage. The LLM cannot invent them; it can
only cite them.
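A minimal sketch of the stamping idea, not the actual Stage 1 code: it assumes a naive regex sentence splitter, numbering that starts at 1, and an anchor placed before each sentence. The function name `stamp_sentences` is hypothetical.

```python
import re


def stamp_sentences(section: str, text: str) -> tuple[dict[str, str], str]:
    """Assign IDs like 'abstract:s1' and return (index, anchored markdown)."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    index: dict[str, str] = {}
    anchored: list[str] = []
    for n, sentence in enumerate(sentences, start=1):
        sid = f"{section}:s{n}"
        index[sid] = sentence  # what a document index could store
        anchored.append(f"<!--s:{sid}--> {sentence}")  # what the manuscript could show
    return index, "\n".join(anchored)


index, manuscript = stamp_sentences(
    "abstract",
    "CrSBr is a layered antiferromagnet. Its ordering temperature is 132 K.",
)
print(manuscript)
```

Because the IDs are generated before any model is involved, a later stage can treat the index keys as the closed set of valid citations.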
Stage 2 runs two deterministic extractors over the full paper, building an inventory of what
the paper actually contains: every chemical formula (validated against the periodic table via
pymatgen), and every number-unit pair (parsed by quantulum3). This inventory is not an
interpretation — it is a fact about the text.
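To illustrate what "deterministic inventory" means, here is a stdlib-only stand-in. TracEx itself uses pymatgen and quantulum3; the regexes, the small `ELEMENTS` subset, the unit list, and the function names below are all assumptions made for the sketch.

```python
import re

# Illustrative subset of element symbols; the real pipeline validates with pymatgen.
ELEMENTS = {"H", "O", "S", "Cr", "Br", "Fe", "Mn", "Co", "Ni", "Te", "Ge", "I", "Cl"}

# One formula token: element symbol plus an optional integer or decimal subscript.
TOKEN = re.compile(r"([A-Z][a-z]?)(?:\d+(?:\.\d+)?)?")
# Candidate formula: two or more tokens in a row, bounded by word edges.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}\b")
# A number followed by one of a few units (hypothetical short list).
QUANTITY = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*(kOe|Oe|emu/g|cm-1|K|T)(?!\w)")


def is_valid_formula(candidate: str) -> bool:
    """Greedy left-to-right tokenization; every symbol must be a known element."""
    pos = 0
    while pos < len(candidate):
        m = TOKEN.match(candidate, pos)
        if not m or m.group(1) not in ELEMENTS:
            return False
        pos = m.end()
    return True


def extract_compounds(sentence: str) -> list[str]:
    return [m.group(0) for m in CANDIDATE.finditer(sentence) if is_valid_formula(m.group(0))]


def extract_quantities(sid: str, sentence: str) -> list[dict]:
    return [
        {"sentence_id": sid, "value": float(m.group(1)), "unit": m.group(2)}
        for m in QUANTITY.finditer(sentence)
    ]


text = "CrSBr orders below 132 K and saturates at 0.5 T along the a-axis."
compounds = extract_compounds(text)
quantities = extract_quantities("abstract:s9", text)
print(compounds, quantities)
```

The same input always yields the same inventory, which is the property the later stages rely on.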
Stage 3 gives the LLM the full `manuscript.md` (with sentence anchors) and the two
inventories as context. The LLM's task is synthesis and attribution: for each sample, identify
every experimental measurement, assign a property and technique, and cite the sentence IDs that
support the claim. This is what the LLM is good at. The pipeline makes exactly one LLM call per
paper, regardless of the number of samples.
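One way to picture a synthesis record and the "cite only existing IDs" rule is sketched below. The `Measurement` fields and the `unresolved_citations` helper are hypothetical; the actual schema of `synthesis.json` is not shown in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    # Hypothetical record shape, for illustration only.
    sample: str
    property: str
    value: str
    technique: str
    sentence_ids: list[str] = field(default_factory=list)


def unresolved_citations(m: Measurement, index: dict[str, str]) -> list[str]:
    """Return cited IDs that do not exist in the Stage 1 index."""
    return [sid for sid in m.sentence_ids if sid not in index]


# Placeholder index entry; the sentence text is not taken from a real paper.
index = {"abstract:s16": "placeholder sentence text"}
record = Measurement(
    sample="CrSBr",
    property="saturation field (a-axis)",
    value="0.5 T",
    technique="DC magnetometry",
    sentence_ids=["abstract:s16", "results:s99"],
)
missing = unresolved_citations(record, index)
print(missing)  # ['results:s99']
```

A non-empty result means the model cited a location that was never stamped, which can be rejected without any further model call.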
Stage 4 verifies every output measurement against the source text without calling an LLM.
For each measurement, it looks up the cited sentences in document_index.json and checks:
- Does at least one cited sentence contain the sample's name or alias?
- Does at least one cited sentence contain the reported numeric value?
Each measurement is graded deterministically:
| Grade | Condition |
|---|---|
| A | At least one cited sentence contains both the value and the sample alias |
| B | Value found in a cited sentence; sample alias found in a different cited sentence |
| C | Value not found in any cited sentence — the LLM cited the wrong location, or fabricated the value |
Grade-C measurements are flagged with a warning in the rendered notes and in
verification_report.json. They are retained (not silently dropped), so a human reviewer can
inspect them.
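The grading rule can be sketched as a pure text search. This is an assumption-laden illustration, not the Stage 4 implementation: the number-boundary regex, the case-insensitive alias match, and the function names are mine, and grade B is returned whenever the value is found but not co-located with an alias.

```python
import re


def contains_value(sentence: str, value: str) -> bool:
    # Require the number to stand alone, so "3" does not match inside "132".
    pattern = rf"(?<![\d.]){re.escape(value)}(?!\d|\.\d)"
    return re.search(pattern, sentence) is not None


def contains_alias(sentence: str, aliases: list[str]) -> bool:
    lowered = sentence.lower()
    return any(alias.lower() in lowered for alias in aliases)


def grade(cited: list[str], value: str, aliases: list[str]) -> str:
    """Return 'A', 'B' or 'C' for one measurement given its cited sentences."""
    value_hits = [s for s in cited if contains_value(s, value)]
    if not value_hits:
        return "C"  # value not present in any cited sentence
    if any(contains_alias(s, aliases) for s in value_hits):
        return "A"  # value and alias share a sentence
    return "B"  # value found, but not next to the alias


cited = [
    "The crystals of CrSBr were grown by chemical vapor transport.",
    "The saturation field is 0.5 T.",
]
print(grade(["CrSBr orders below 132 K."], "132", ["CrSBr"]))  # A
print(grade(cited, "0.5", ["CrSBr"]))  # B
print(grade(cited, "2", ["CrSBr"]))  # C
```

Since the function has no hidden state and calls no model, rerunning it on the same inputs always gives the same grade.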
Each sample gets an Obsidian-compatible markdown note:
---
aliases: ["CrSBr"]
paper: 2309.04778v1
type: sample
---
# CrSBr crystal
## Known Properties
| Property | Value | Source |
|----------|-------|--------|
| Néel temperature | 132 ± 1 K | `abstract:s9` |
| single-layer magnetic transition temperature | 160 K | `abstract:s9` |
## Experiments
### micro-Raman spectroscopy
| Property | Value | Source | Grade |
|----------|-------|--------|-------|
| Ag1 mode position | 113.5 cm-1 | `fig:1c:caption` | A |
| Ag3 mode position | 342.7 cm-1 | `fig:1c:caption` | A |
| Ag2 mode FWHM | 3.5 cm-1 | `fig:1c:caption` | A |
---
### DC magnetometry
*temperature: 2 K*
> **Fig.**: Fig. 3a, Fig. S3, Fig. 4b
| Property | Value | Source | Grade |
|----------|-------|--------|-------|
| saturation magnetization | 3 μB/Cr | `abstract:s15` | B |
| saturation field (a-axis) | 0.5 T | `abstract:s16` | B |
| saturation field (c-axis) | 2 T | `abstract:s16` | B |
| ⚠️ critical temperature | 160 to 170 K | `abstract:s18` | C |
---
*Grade A: value and sample alias co-occur in the same cited sentence. B: value found in a cited sentence, but not co-located with the sample alias. C: value not found in any cited sentence (⚠️).*
The Source column contains sentence IDs as written; they are searchable strings that uniquely
identify the location in the paper. The Grade column gives an at-a-glance provenance signal.
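As a rough sketch of how one graded measurement could become a table row in the note, the helper below prefixes grade-C rows with the warning marker. `render_row` and the comma separator for multiple IDs are assumptions, not the Stage 5 code.

```python
WARNING = "\u26a0\ufe0f"  # the warning sign used in the sample note above


def render_row(prop: str, value: str, sentence_ids: list[str], grade: str) -> str:
    """Format one measurement as a markdown table row."""
    label = f"{WARNING} {prop}" if grade == "C" else prop
    source = ", ".join(f"`{sid}`" for sid in sentence_ids)
    return f"| {label} | {value} | {source} | {grade} |"


row_b = render_row("saturation magnetization", "3 μB/Cr", ["abstract:s15"], "B")
row_c = render_row("critical temperature", "160 to 170 K", ["abstract:s18"], "C")
print(row_b)
print(row_c)
```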
A measurement database built with TracEx has a property that the extraction systems surveyed above lack: every value can be challenged at the sentence level. A reviewer can take any row from the output, look up the cited sentence IDs in `manuscript.md`, and check whether the value is actually there. Grade-A records have already passed that check automatically; Grade-C records are flagged for human review.
This directly addresses the failure mode documented in LitXBench (arXiv:2604.07649): extraction pipelines that assign measurements to the wrong experimental context. Because TracEx stamps sentence IDs before any LLM reasoning occurs and checks every cited ID afterwards, an attribution error is auditable rather than silent.
Compared to the closest prior art (arXiv:2604.07584's tier-based tagging), TracEx operates at finer granularity: a T1 tag in that system confirms a value came from text or a table; a grade-A record in TracEx confirms the value and the sample name both appear in the same cited sentence. The difference matters for scientific reproducibility — a tier tag tells a reviewer where to look; a sentence ID tells them exactly what to read.
This is not perfect — the verifier can be fooled by values that appear in the paper in a different context than the one the LLM intended. But it eliminates the most common failure mode of LLM extraction: numbers that look right but come from nowhere.
- Python 3.13
- An Anthropic or compatible API key (TracEx uses `litellm` as the LLM gateway)
- `docling` for PDF parsing (requires `torch`; see `TraceEx/requirements.txt`)
git clone https://github.com/sadityakr/TracEx
cd TracEx
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r TraceEx/requirements.txt
Set your API key:
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY, GEMINI_API_KEY, etc.
Run the full pipeline on a PDF:
PYTHONUTF8=1 HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
python -m TraceEx.scripts.run_pipeline "path/to/paper.pdf" \
--output-dir .tmp/my_run/
Stage outputs are written to the run directory as they complete. Sample notes appear in
`.tmp/my_run/notes/`. The verification report is at `.tmp/my_run/verification_report.json`.
You can also run individual stages:
# Stage 1 only — parse and render manuscript
python -m TraceEx.scripts.blocks.parse paper.pdf --output-dir .tmp/run1/
# Stage 2 only — extract compounds and quantities
python -m TraceEx.scripts.blocks.extract_raw --run-dir .tmp/run1/
# Stage 3 only — LLM synthesis
python -m TraceEx.scripts.blocks.synthesize --run-dir .tmp/run1/ --paper-id my_paper
# Stage 4 only — verification scoring
python -m TraceEx.scripts.blocks.score --run-dir .tmp/run1/
# Stage 5 only — render notes
python -m TraceEx.scripts.blocks.render --run-dir .tmp/run1/
TraceEx/scripts/
  blocks/
    parse/              Stage 1: PDF parsing and sentence-ID stamping
    extract_raw/        Stage 2: Deterministic compound and quantity extraction
    synthesize/         Stage 3: LLM synthesis (prompt builder + single call)
    score/              Stage 4: Deterministic verification scoring
    render/             Stage 5: Obsidian note rendering
  tools/
    llm_caller.py       Universal LLM gateway (all calls go through here)
    llm_call_budget.py  Budget tracker, enforces the 1-call-per-paper constraint
    model_config.py     Model tier configuration ("think" vs "fast")
  run_pipeline.py       End-to-end orchestrator
TraceEx/tests/          124 unit and integration tests
directives/             Architecture specs and implementation status
input/                  Sample PDFs
- Every extracted value has a sentence-ID citation. No value appears in the output without at least one `section:sN` pointer that can be resolved against the original text.
- Sentence IDs are stamped before the LLM sees anything. They are set in Stage 1 and never modified. The LLM cannot invent them; it can only cite existing IDs.
- Verification is deterministic. Stage 4 never calls an LLM. The A/B/C grade is a text-search result, not a model judgment.
- One LLM call per paper. The deterministic pre-extraction (Stage 2) compresses the paper's numerical content into a scannable inventory, making single-call synthesis feasible even for papers with many samples and measurements.
- All LLM calls go through `call_llm()`. The gateway enforces budget tracking, structured output via tool-use, and schema validation. No stage calls the API directly.
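The one-call-per-paper invariant could be enforced with a small counter in front of the gateway. The sketch below is hypothetical: `CallBudget`, `CallBudgetExceeded`, and `guarded_call` are illustrative names, and the stubbed response stands in for a real request; the actual `call_llm()` signature is not documented here.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a paper asks for more LLM calls than its budget allows."""


class CallBudget:
    def __init__(self, max_calls_per_paper: int = 1) -> None:
        self.max_calls = max_calls_per_paper
        self._used: dict[str, int] = {}

    def charge(self, paper_id: str) -> None:
        """Record one call for this paper, or refuse if the budget is spent."""
        used = self._used.get(paper_id, 0)
        if used >= self.max_calls:
            raise CallBudgetExceeded(f"{paper_id}: budget of {self.max_calls} call(s) spent")
        self._used[paper_id] = used + 1


def guarded_call(prompt: str, paper_id: str, budget: CallBudget) -> str:
    # Charge first, so a refused call never reaches the provider.
    budget.charge(paper_id)
    return f"[stub response for {paper_id}]"  # a real gateway would call the model here


budget = CallBudget()
first = guarded_call("synthesize measurements", "2309.04778v1", budget)
try:
    guarded_call("retry", "2309.04778v1", budget)
    second_refused = False
except CallBudgetExceeded:
    second_refused = True
print(first, second_refused)
```

Charging before the request keeps the constraint independent of whatever the model returns.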
TracEx is designed for condensed-matter physics papers. The deterministic extractors handle domain-specific notation (chemical formulas validated against the periodic table, units like Oe, kOe, emu/g, µ_B). The LLM synthesis prompt is written for the domain vocabulary (exchange bias, coercive field, zero-field-cooled protocols, SQUID magnetometry).
Adapting TracEx to another domain would require updating the compound extractor patterns, the quantity extractor unit list, and the synthesis prompt — the pipeline architecture is otherwise domain-agnostic.