sadityakr/TracEx

TracEx

TracEx extracts structured experimental data from condensed-matter physics papers and turns every extracted measurement into a citable, machine-verifiable record — traceable back to the exact sentence in the PDF where it was reported.



Motivation

TracEx was built to support my PhD thesis writing and a companion perspective article on magnetic materials. Materials science papers contain a lot of padding meant to convince referees of the work's importance, plus a lot of interpretation of results that is often wrong. I wanted to extract the experimental results themselves, with provenance and at a low API cost.

Unfortunately, existing extraction systems do not provide this. NEMAD (Zhang et al., 2024; arXiv:2409.15675) and GPTArticleExtractor (Zhang et al., 2024; arXiv:2401.05875) build large magnetic-property databases using GPT-3.5/4 with manual spot-checking; neither links a value to its source sentence. Broader extraction pipelines — Ghosh & Tewari (2025; arXiv:2510.01235), Kamatchi Sundaram et al. (2026; arXiv:2602.04602), MatSKRAFT (Hira et al., 2025; arXiv:2509.10448), AgentCAT (2026; arXiv:2602.18479) — share the same gap: the chain of custody from source sentence to database entry is broken. The most provenance-aware system published to date, Rameshbabu et al. (2026; arXiv:2604.07584), tags each value by extraction method (text, figure, derivation) but not by source location. LitXBench (Chong & Colindres, 2026; arXiv:2604.07649) benchmarks frontier LLMs on experimental extraction and documents the structural failure: pipelines systematically assign measurements to the wrong experimental context because attribution is never anchored to specific text. Therefore, I had to do it myself.

TracEx addresses this by stamping every sentence with an immutable ID before the LLM sees anything, extracting numeric candidates deterministically before any LLM synthesis, and verifying every output value against its cited sentence after the LLM responds.

Development note. The pipeline has not been formally benchmarked. Data extraction from figures is still in development and not yet implemented.


Approach

The pipeline has five stages:

PDF
 │
 ▼
[Stage 1 — Parse + Render]                    (deterministic)
 document_index.json   ← structured index with immutable sentence IDs
 manuscript.md         ← full text with <!--s:section:sN--> anchors on every sentence
 │
 ▼
[Stage 2 — Deterministic Extraction]           (deterministic, no LLM)
 compounds.json        ← all chemical entities (regex + pymatgen validation)
 quantities.json       ← all number-unit pairs (quantulum3)
 │
 ▼
[Stage 3 — LLM Synthesis]                      (1 LLM call per paper)
 synthesis.json        ← per-sample measurement records, each citing sentence IDs
 │
 ▼
[Stage 4 — Verification Scoring]               (deterministic, no LLM)
 synthesis_scored.json ← same records with per-measurement A/B/C provenance grade
 verification_report.json
 │
 ▼
[Stage 5 — Render Notes]                       (deterministic)
 notes/sample_<id>.md  ← one Obsidian-compatible note per sample

How provenance is enforced

Stage 1 stamps every sentence in the paper with an immutable ID (abstract:s3, results:s12, table:1:r2c3, fig:mvsH:caption). These IDs are embedded as HTML comments in manuscript.md and are never modified by any later stage. The LLM cannot invent them; it can only cite them.
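The stamping step can be sketched as follows. This is a minimal illustration, not the code in `blocks/parse`: the sentence splitter is deliberately naive, the function names are hypothetical, and both the 1-based numbering and the exact anchor text (`<!--s:abstract:s1-->`, anchor placed before the sentence) are assumptions based on the `<!--s:section:sN-->` pattern.

```python
import re

# Naive boundary: ., ! or ? followed by whitespace and an uppercase letter or digit.
# Real papers need a sturdier splitter (abbreviations such as "Fig. 3", etc.).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def stamp_section(section: str, text: str) -> list[tuple[str, str]]:
    """Return (sentence_id, sentence) pairs with IDs like 'abstract:s1'."""
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    # Assumption: numbering starts at 1 within each section.
    return [(f"{section}:s{i}", s) for i, s in enumerate(sentences, start=1)]


def render_with_anchors(stamped: list[tuple[str, str]]) -> str:
    """Prefix each sentence with an HTML-comment anchor, one sentence per line."""
    return "\n".join(f"<!--s:{sid}-->{sentence}" for sid, sentence in stamped)


stamped = stamp_section(
    "abstract",
    "CrSBr is a layered magnet. Its Néel temperature is 132 K.",
)
manuscript = render_with_anchors(stamped)
```

Because the IDs are computed once from section name and position, later stages can treat them as opaque keys and never need to re-split the text.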

Stage 2 runs two deterministic extractors over the full paper, building an inventory of what the paper actually contains: every chemical formula (validated against the periodic table via pymatgen), and every number-unit pair (parsed by quantulum3). This inventory is not an interpretation — it is a fact about the text.
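To illustrate the "regex candidates, then periodic-table validation" idea for compounds, here is a standard-library sketch. It swaps the pymatgen validation for a plain lookup in a set of element symbols, so it is a stand-in for the real extractor, and the names `find_formulas`, `_CANDIDATE` and `_SYMBOL` are hypothetical.

```python
import re

# All 118 element symbols; used here in place of pymatgen validation.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())

# Candidate: two or more "Symbol + optional amount" units, e.g. CrSBr, Fe3GeTe2.
_CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}\b")
_SYMBOL = re.compile(r"[A-Z][a-z]?")


def find_formulas(text: str) -> list[str]:
    """Return candidate tokens whose every symbol is a real element."""
    found = []
    for match in _CANDIDATE.finditer(text):
        token = match.group(0)
        if all(symbol in ELEMENTS for symbol in _SYMBOL.findall(token)):
            found.append(token)
    return found


formulas = find_formulas("SQUID data on CrSBr and Fe3GeTe2 flakes.")
```

The acronym `SQUID` matches the candidate pattern but is rejected because `Q` and `D` are not element symbols. A lookup this simple still lets through all-caps words that happen to decompose into elements (for example `NO`), and it skips single-element tokens.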

Stage 3 gives the LLM the full manuscript.md (with sentence anchors) and the two inventories as context. The LLM's task is synthesis and attribution: for each sample, identify every experimental measurement, assign a property and technique, and cite the sentence IDs that support the claim. This is what the LLM is good at. The pipeline makes exactly one LLM call per paper, regardless of the number of samples.
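One way to hold the "cite only existing IDs" rule after the call returns is to compare every cited ID against the Stage 1 index. The sketch below assumes a record shape with a `sources` list. That field name, like the function name, is hypothetical and is not taken from the synthesis.json schema.

```python
def unknown_citations(records: list[dict], known_ids: list[str]) -> list[tuple[int, str]]:
    """Return (record_index, sentence_id) for every cited ID missing from the index."""
    known = set(known_ids)
    return [
        (i, sid)
        for i, record in enumerate(records)
        for sid in record["sources"]  # hypothetical field name
        if sid not in known
    ]


# Toy records for illustration; "results:s99" does not exist in the index.
records = [
    {"property": "Néel temperature", "value": "132 K", "sources": ["abstract:s9"]},
    {"property": "critical temperature", "value": "160 to 170 K",
     "sources": ["abstract:s18", "results:s99"]},
]
bad = unknown_citations(records, ["abstract:s9", "abstract:s18"])
```

An empty result means every citation resolves to a stamped sentence. Whether the cited sentence supports the value is a separate check.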

Stage 4 verifies every output measurement against the source text without calling an LLM. For each measurement, it looks up the cited sentences in document_index.json and checks:

  • Does at least one cited sentence contain the sample's name or alias?
  • Does at least one cited sentence contain the reported numeric value?

Each measurement is graded deterministically:

| Grade | Condition |
|-------|-----------|
| A | At least one cited sentence contains both the value and the sample alias |
| B | Value found in a cited sentence; sample alias found in a different cited sentence |
| C | Value not found in any cited sentence; the LLM cited the wrong location, or fabricated the value |

Grade-C measurements are flagged with a warning in the rendered notes and in verification_report.json. They are retained (not silently dropped), so a human reviewer can inspect them.


Output

Each sample gets an Obsidian-compatible markdown note:

---
aliases: ["CrSBr"]
paper: 2309.04778v1
type: sample
---

# CrSBr crystal

## Known Properties

| Property | Value | Source |
|----------|-------|--------|
| Néel temperature | 132 ± 1 K | `abstract:s9` |
| single-layer magnetic transition temperature | 160 K | `abstract:s9` |

## Experiments

### micro-Raman spectroscopy

| Property | Value | Source | Grade |
|----------|-------|--------|-------|
| Ag1 mode position | 113.5 cm-1 | `fig:1c:caption` | A |
| Ag3 mode position | 342.7 cm-1 | `fig:1c:caption` | A |
| Ag2 mode FWHM | 3.5 cm-1 | `fig:1c:caption` | A |

---

### DC magnetometry
*temperature: 2 K*
> **Fig.**: Fig. 3a, Fig. S3, Fig. 4b

| Property | Value | Source | Grade |
|----------|-------|--------|-------|
| saturation magnetization | 3 μB/Cr | `abstract:s15` | B |
| saturation field (a-axis) | 0.5 T | `abstract:s16` | B |
| saturation field (c-axis) | 2 T | `abstract:s16` | B |
| ⚠️ critical temperature | 160 to 170 K | `abstract:s18` | C |

---
*Grade A: value and sample alias co-occur in the same cited sentence. B: value found in a cited sentence, but not co-located with the sample alias. C: value not found in any cited sentence (⚠️).*

The Source column contains sentence IDs as written — they are searchable strings that uniquely identify the location in the paper. The Grade column gives an at-a-glance provenance signal.
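Since the IDs are plain strings, resolving one back to its sentence only takes a scan of manuscript.md for anchors. A minimal sketch, assuming each anchor has the form `<!--s:ID-->` and sits immediately before its sentence (the placement and the helper name `index_manuscript` are assumptions):

```python
import re

_ANCHOR = re.compile(r"<!--s:(?P<sid>[^>]+?)-->")


def index_manuscript(markdown: str) -> dict[str, str]:
    """Map each anchor ID to the text between that anchor and the next one."""
    matches = list(_ANCHOR.finditer(markdown))
    index = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(markdown)
        index[current.group("sid")] = markdown[current.end():end].strip()
    return index


# Toy manuscript text for illustration.
sample = (
    "<!--s:abstract:s1-->First sentence.\n"
    "<!--s:fig:1c:caption-->Raman spectrum of CrSBr."
)
lookup = index_manuscript(sample)
```

With this, `lookup["fig:1c:caption"]` returns the caption text, which is all a reviewer needs to check a row by hand.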


Why This Matters

A measurement database built with TracEx has a property that every published extraction system surveyed in 2025–2026 lacks: every value can be challenged at the sentence level. A reviewer can take any row from the output, look up the cited sentence IDs in the PDF, and check whether the value is actually there. Grade-A records can be trusted at machine speed; Grade-C records are automatically flagged for human review.

This directly addresses the failure mode documented in LitXBench (arXiv:2604.07649): extraction pipelines that assign measurements to the wrong experimental context. Because TracEx anchors values to specific sentence IDs before any LLM reasoning occurs, the attribution error is auditable rather than silent.

Compared to the closest prior art (arXiv:2604.07584's tier-based tagging), TracEx operates at finer granularity: a T1 tag in that system confirms a value came from text or a table; a grade-A record in TracEx confirms the value and the sample name both appear in the same cited sentence. The difference matters for scientific reproducibility — a tier tag tells a reviewer where to look; a sentence ID tells them exactly what to read.

This is not perfect — the verifier can be fooled by values that appear in the paper in a different context than the one the LLM intended. But it eliminates the most common failure mode of LLM extraction: numbers that look right but come from nowhere.


Quickstart

Requirements

  • Python 3.13
  • An Anthropic or compatible API key (TracEx uses litellm as the LLM gateway)
  • docling for PDF parsing (requires torch — see TraceEx/requirements.txt)

Installation

git clone https://github.com/sadityakr/TracEx
cd TracEx
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r TraceEx/requirements.txt

Set your API key:

export ANTHROPIC_API_KEY=sk-ant-...   # or OPENAI_API_KEY, GEMINI_API_KEY, etc.

Run the pipeline

PYTHONUTF8=1 HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
  python -m TraceEx.scripts.run_pipeline "path/to/paper.pdf" \
         --output-dir .tmp/my_run/

Stage outputs are written to the run directory as they complete. Sample notes appear in .tmp/my_run/notes/. The verification report is at .tmp/my_run/verification_report.json.

You can also run individual stages:

# Stage 1 only — parse and render manuscript
python -m TraceEx.scripts.blocks.parse paper.pdf --output-dir .tmp/run1/

# Stage 2 only — extract compounds and quantities
python -m TraceEx.scripts.blocks.extract_raw --run-dir .tmp/run1/

# Stage 3 only — LLM synthesis
python -m TraceEx.scripts.blocks.synthesize --run-dir .tmp/run1/ --paper-id my_paper

# Stage 4 only — verification scoring
python -m TraceEx.scripts.blocks.score --run-dir .tmp/run1/

# Stage 5 only — render notes
python -m TraceEx.scripts.blocks.render --run-dir .tmp/run1/

Repository Layout

TraceEx/scripts/
  blocks/
    parse/           Stage 1 — PDF parsing and sentence-ID stamping
    extract_raw/     Stage 2 — Deterministic compound and quantity extraction
    synthesize/      Stage 3 — LLM synthesis (prompt builder + single call)
    score/           Stage 4 — Deterministic verification scoring
    render/          Stage 5 — Obsidian note rendering
  tools/
    llm_caller.py    Universal LLM gateway (all calls go through here)
    llm_call_budget.py  Budget tracker — enforces 1-call-per-paper constraint
    model_config.py  Model tier configuration ("think" vs "fast")
  run_pipeline.py    End-to-end orchestrator

TraceEx/tests/       124 unit and integration tests
directives/          Architecture specs and implementation status
input/               Sample PDFs

Key Design Invariants

  1. Every extracted value has a sentence-ID citation. No value appears in the output without at least one section:sN pointer that can be resolved against the original text.

  2. Sentence IDs are stamped before the LLM sees anything. They are set in Stage 1 and never modified. The LLM cannot invent them; it can only cite existing IDs.

  3. Verification is deterministic. Stage 4 never calls an LLM. The A/B/C grade is a text-search result, not a model judgment.

  4. One LLM call per paper. The deterministic pre-extraction (Stage 2) compresses the paper's numerical content into a scannable inventory, making single-call synthesis feasible even for papers with many samples and measurements.

  5. All LLM calls go through call_llm(). The gateway enforces budget tracking, structured output via tool-use, and schema validation. No stage calls the API directly.
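Invariants 4 and 5 amount to a counter checked inside the gateway. The sketch below shows the idea. `CallBudget`, `CallBudgetExceeded` and `guarded_call` are hypothetical names, not the contents of `llm_call_budget.py` or the signature of `call_llm()`, and the sender is a stub rather than a real API client.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a paper has already used its allowed LLM calls."""


class CallBudget:
    def __init__(self, max_calls_per_paper: int = 1):
        self.max_calls = max_calls_per_paper
        self._used: dict[str, int] = {}

    def charge(self, paper_id: str) -> None:
        """Record one call for `paper_id`, or raise if the budget is spent."""
        used = self._used.get(paper_id, 0)
        if used >= self.max_calls:
            raise CallBudgetExceeded(f"{paper_id}: {used} call(s) already made")
        self._used[paper_id] = used + 1


def guarded_call(budget: CallBudget, paper_id: str, send, prompt: str):
    """Charge the budget first, then forward the prompt to `send`."""
    budget.charge(paper_id)
    return send(prompt)


sent = []


def fake_send(prompt: str) -> str:
    # Stub sender used for the demo; no network access.
    sent.append(prompt)
    return "ok"


budget = CallBudget()
first = guarded_call(budget, "2309.04778v1", fake_send, "prompt")
try:
    guarded_call(budget, "2309.04778v1", fake_send, "retry")
    second_blocked = False
except CallBudgetExceeded:
    second_blocked = True
```

Charging before sending means a second attempt fails without reaching the API, so the constraint holds even if a stage is re-run by mistake.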


Domain

TracEx is designed for condensed-matter physics papers. The deterministic extractors handle domain-specific notation (chemical formulas validated against the periodic table, units like Oe, kOe, emu/g, µ_B). The LLM synthesis prompt is written for the domain vocabulary (exchange bias, coercive field, zero-field-cooled protocols, SQUID magnetometry).

Adapting TracEx to another domain would require updating the compound extractor patterns, the quantity extractor unit list, and the synthesis prompt — the pipeline architecture is otherwise domain-agnostic.
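The "unit list" part of that adaptation can be pictured as a parameter of the quantity extractor. The sketch below swaps quantulum3 for a plain regex built from a caller-supplied unit list, so it is a simplified stand-in. The function names and `MAGNETISM_UNITS` are hypothetical.

```python
import re


def build_quantity_pattern(units: list[str]) -> re.Pattern:
    """Compile a number-unit pattern for the given unit symbols."""
    # Longest first, so "kOe" is tried before "Oe".
    alternation = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
    return re.compile(rf"(?<![\w.])(-?\d+(?:\.\d+)?)\s*({alternation})(?![A-Za-z])")


def extract_quantities(text: str, units: list[str]) -> list[tuple[float, str]]:
    """Return (value, unit) pairs in order of appearance."""
    pattern = build_quantity_pattern(units)
    return [(float(value), unit) for value, unit in pattern.findall(text)]


MAGNETISM_UNITS = ["Oe", "kOe", "emu/g", "K", "T"]

magnetic = extract_quantities(
    "The coercive field is 1.2 kOe at 5 K, with a moment of 35 emu/g.",
    MAGNETISM_UNITS,
)
# Switching domain means passing a different unit list.
optical = extract_quantities("The gap is 1.5 eV at 300 K.", ["eV", "nm", "K"])
```

The pattern only handles plain decimal numbers followed by a listed unit. Ranges, uncertainties such as `132 ± 1 K`, and scientific notation would need extra handling.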


About

Five-stage pipeline that extracts experimental measurements from condensed-matter physics papers and grades every value's provenance against the source sentences.
