claim-detection

Sentence-level claim detection — fine-tuned encoder models that decide whether a sentence makes a check-worthy factual claim, served behind a FastAPI endpoint with a queued/streaming backend, an HTMX UI, and a docker stack ready for both local Compose and remote Swarm.

L2 Labs takehome. The headline result: a fine-tuned Ettin-encoder-150m (Weller et al., ICLR 2026) reaches F1 0.917 on the in-domain test split — narrowly above the best encoder reported in Bell, "Less Can be More" (FEVER 2025, F1 0.916). Same in-domain task, on a model the paper didn't try.


Repo layout

claim-detection/
├── README.md                       this file
├── WALKTHROUGH.md                  ELI5 walk-through for the demo
├── ARCHITECTURE.md                 system design diagrams (Mermaid + Excalidraw)
├── architecture.excalidraw         editable diagram (excalidraw.com or VS Code ext)
├── DOCKER.md                       Docker Compose + Swarm bring-up
├── DEPLOY_LINUX.md                 ship Ettin checkpoint to a CPU-only server
├── RESOURCES.md                    CPU/RAM/storage usage + scaling guidance
├── RESULTS.md                      auto-generated comparison table (live)
├── papers/                         Bell (FEVER 2025), Ettin (ICLR 2026), takehome brief
├── datasets/
│   ├── README.md                   per-dataset provenance + license
│   ├── verita-composite/           Bell-paper sources (default training set)
│   ├── checkthat-2025/             newer multilingual dataset
│   ├── claimify/                   microsoft/claimify-dataset
│   ├── feverfact/                  17K atomic claims from Wikipedia
│   ├── all_binary.csv              21,079-row aggregate of binary-fit data
│   └── normalize.py                raw → normalized.csv reproducer
├── src/
│   ├── models.py                   static registry of 8 models to compare
│   ├── pipeline.py                 trains a model and saves checkpoint
│   ├── evaluate.py                 loads checkpoint, computes metrics
│   └── compare.py                  aggregates metrics → RESULTS.md
├── scripts/
│   ├── run_sweep.sh                idempotent train+eval+compare loop
│   └── serve_local.sh              run API locally without Docker
├── app/
│   ├── main.py                     FastAPI app
│   ├── predictor.py                model wrapper (load once, predict)
│   ├── queue.py                    Redis/RQ background-job glue
│   ├── cli.py                      `python -m app.cli predict "..."`
│   ├── results_view.py             /results page view-model
│   └── templates/                  Jinja2 + HTMX UI
├── docker/
│   └── Dockerfile                  multi-stage, CPU-only torch
├── docker-compose.yml              local + Swarm-compatible
├── tests/                          pytest suite (29 unit, integration gated)
├── runs/                           trained checkpoints (gitignored)
└── results/                        per-model metrics JSON (gitignored except .gitkeep)

The two papers we're working from

  • Bell, "Less Can be More: Comparing LLMs to Smaller Encoder-Only Models for Claim Detection" (FEVER 2025). Headline finding: small fine-tuned encoders (BERT, ModernBERT, RoBERTa) beat fine-tuned LLMs in-domain by ~5 F1 points. The paper uses a 12,997-sentence composite of ClaimBuster + PoliClaim + AVeriTeC.
  • Weller et al., "Seq vs Seq: An Open Suite of Paired Encoders and Decoders" (ICLR 2026). Released the Ettin model family: open-data encoders that follow ModernBERT's training recipe and slightly outperform it on GLUE. It is the most recent comparable open encoder and, as far as we know, has not yet been evaluated for claim detection.

WALKTHROUGH.md has more context on what the papers say and how that guided model selection.


Results

Auto-regenerated by python -m src.compare after each model finishes. Live version at RESULTS.md. Final 8-model sweep, sorted by F1:

| Rank | Model | Mode | Accuracy | Precision | Recall | F1 |
|------|-------|------|----------|-----------|--------|----|
| 1 | ettin-150m-ft | fine-tuned | 0.9223 | 0.9174 | 0.9159 | 0.9167 |
| 2 | modernbert-base-ft | fine-tuned | 0.9219 | 0.9201 | 0.9118 | 0.9159 |
| 3 | roberta-base-ft | fine-tuned | 0.9188 | 0.9034 | 0.9250 | 0.9141 |
| 4 | bert-base-ft | fine-tuned | 0.9173 | 0.9215 | 0.8994 | 0.9103 |
| 5 | modernbert-base-pretrained | frozen probe | 0.8927 | 0.9026 | 0.8631 | 0.8824 |
| 6 | roberta-base-pretrained | frozen probe | 0.8927 | 0.9147 | 0.8491 | 0.8807 |
| 7 | ettin-150m-pretrained | frozen probe | 0.8877 | 0.8886 | 0.8681 | 0.8782 |
| 8 | bert-base-pretrained | frozen probe | 0.8596 | 0.8820 | 0.8071 | 0.8429 |

Bell paper reference (FEVER 2025) — what we're trying to match or beat

Bell row Accuracy Precision Recall F1
BERT (Finetuned) 0.9170 0.9130 0.9180 0.9160
ModernBERT (Finetuned) 0.9110 0.9080 0.9120 0.9100
RoBERTa / AFaCTA (Finetuned) 0.9050 0.9010 0.9060 0.9040
Llama-3.2-1B-Instruct (Finetuned) 0.8720 0.8640
Factcheck-GPT (zero-shot) 0.7310 0.7080

What these numbers mean (briefly)

  • Accuracy: fraction of all predictions that are correct. Easy to read, but misleading when the classes are imbalanced.
  • Precision: when the model says "claim," how often it is right. High precision = few false alarms.
  • Recall: of all the real claims, how many the model catches. High recall = few misses.
  • F1: harmonic mean of precision and recall. Punishes lopsided models (perfect precision with poor recall scores a low F1, even though the arithmetic mean would look fine).
  • confidence (returned per-prediction by the API): softmax probability of the predicted class. PyTorch does not return this value directly; it is derived from the model's logits via torch.softmax(logits, dim=-1)[pred_idx]. See WALKTHROUGH.md for the exact 4-line code path.
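
To make the confidence field concrete, here is a minimal sketch of the softmax step using only the standard library, so it runs without torch installed. The function names, the example logits, and the index-to-label order are assumptions for illustration; the real code path is the one described in WALKTHROUGH.md.

```python
import math

# Assumed label order, for illustration only; the real mapping comes from the checkpoint.
LABELS = ["not_claim", "claim"]


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Turn raw logits into the three fields the API returns."""
    probs = softmax(logits)
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    return {
        "is_claim": LABELS[pred_idx] == "claim",
        "confidence": round(probs[pred_idx], 4),  # probability of the predicted class
        "label": LABELS[pred_idx],
    }


# Hypothetical logits; a real model produces these from the input sentence.
print(to_prediction([-1.2, 1.8]))
```

Because the confidence is the probability of whichever class won, it is never below 0.5 in the two-class case, and it says nothing about calibration.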

F1 is the key metric here because (1) we want both fewer false alarms and fewer misses, with no good reason to prefer one, and (2) it's the metric Bell uses, so it's the only way to compare head-to-head with the paper. WALKTHROUGH.md has the full ELI5 with examples.
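
The four metrics above can be sketched in a few lines of Python. This is an illustration, not the code in src/evaluate.py; the confusion-matrix counts are made up to show how accuracy can look strong on imbalanced data while F1 does not.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": harmonic_f1(precision, recall),
    }


# Made-up counts: 120 real claims among 1,000 sentences.
print(classification_metrics(tp=90, fp=10, fn=30, tn=870))

# Lopsided model: perfect precision, poor recall.
print("arithmetic mean:", (1.0 + 0.1) / 2)
print("F1:", round(harmonic_f1(1.0, 0.1), 4))
```

Feeding the rounded precision and recall from the ettin-150m-ft row (0.9174, 0.9159) into harmonic_f1 gives a value within rounding error of the reported 0.9167.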

Direct head-to-head with Bell (sanity check on our pipeline)

| Bell row | Bell F1 | Our F1 | Δ |
|----------|---------|--------|---|
| BERT (Finetuned) | 0.9160 | 0.9103 | −0.006 |
| ModernBERT (Finetuned) | 0.9100 | 0.9159 | +0.006 ✅ |
| RoBERTa (Finetuned) | 0.9040 | 0.9141 | +0.010 ✅ |
| best Bell encoder | 0.9160 | 0.9167 (Ettin, new) | +0.001 ✅ |

Three of the four comparisons match or beat the paper. The only row where we trail is BERT, by 0.006 F1, which is plausibly within the noise of a re-implementation. The pipeline closely reproduces Bell's results, and Ettin slots in at the top.
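
The Δ column is plain subtraction, and it can be re-derived from the two tables above. The dictionary keys and the function name below are shorthand chosen for this sketch; the F1 values are copied from the tables.

```python
# F1 values copied from the Bell reference table and from our results table.
BELL_F1 = {"bert": 0.9160, "modernbert": 0.9100, "roberta": 0.9040}
OUR_F1 = {"bert": 0.9103, "modernbert": 0.9159, "roberta": 0.9141, "ettin": 0.9167}


def f1_deltas(ours, reference):
    """Our F1 minus the reference F1 per shared model, plus best-vs-best."""
    rows = {name: round(ours[name] - reference[name], 3) for name in reference}
    rows["best"] = round(max(ours.values()) - max(reference.values()), 3)
    return rows


for name, delta in f1_deltas(OUR_F1, BELL_F1).items():
    print(f"{name:<12}{delta:+.3f}")
```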


Quick start

Prereqs

  • macOS or Linux (the repo's been built and tested on Apple Silicon).
  • Python 3.13. (Python 3.14 wheels for torch don't exist yet.)
  • ~1 GB disk for model checkpoints + datasets.
  • Optional: Docker Desktop for the containerized API.

1) Install

python3.13 -m venv .venv
.venv/bin/pip install -r requirements.txt

2) Look at the data

head datasets/verita-composite/ours/train.csv
# text,label
# "I would like for them to look you in the eye and tell you why...",0
# "...several times this year, Governor Reagan has said that...",1

datasets/README.md documents every source, format, and license.
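
Beyond head, a quick label count shows how balanced the split is. The sketch below assumes only the text,label header shown above; the in-memory rows are placeholders so the example runs without the dataset on disk.

```python
import csv
import io
from collections import Counter


def label_counts(handle):
    """Count rows per label in a CSV that has `text,label` columns."""
    reader = csv.DictReader(handle)
    return Counter(row["label"] for row in reader)


# Placeholder rows standing in for train.csv; not real dataset content.
sample = io.StringIO(
    "text,label\n"
    '"Placeholder opinion sentence, with a comma.",0\n'
    '"Placeholder factual sentence.",1\n'
    '"Another placeholder opinion sentence.",0\n'
)
counts = label_counts(sample)
print(counts)
```

To run it on the real split, pass an open file instead of the StringIO object, for example open("datasets/verita-composite/ours/train.csv", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption.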

3) Train a model

The pipeline is sequential — it trains models one at a time so a laptop's unified memory isn't fighting itself. Idempotent: existing checkpoints aren't re-trained.

# See the registry of 8 models
.venv/bin/python -m src.pipeline --list

# Smoke test (200 rows, 1 epoch — finishes in ~17s on M-series Mac)
.venv/bin/python -m src.pipeline --model ettin-150m-ft --epochs 1 --max-train-rows 200

# Full run (3 epochs, all data — ~25 min on M4 Air, ~50 min on M1 Pro)
.venv/bin/python -m src.pipeline --model ettin-150m-ft

# Train every enabled model in the registry (in registry order)
.venv/bin/python -m src.pipeline

Per-model checkpoint saved to runs/<slug>/final/, training summary to runs/<slug>/summary.json.

4) Evaluate

# Eval one model
.venv/bin/python -m src.evaluate --model ettin-150m-ft

# Eval everything that has a checkpoint
.venv/bin/python -m src.evaluate

Per-model metrics → results/<slug>.json.

5) Generate the comparison table

.venv/bin/python -m src.compare
# wrote RESULTS.md  (1 models)

6) Drive the whole sweep with one command

./scripts/run_sweep.sh                 # train + eval + compare for every model
./scripts/run_sweep.sh ettin-150m-ft   # one model only

The sweep is resumable: skips training if runs/<slug>/final/ exists, skips eval if results/<slug>.json exists, regenerates RESULTS.md after every successful eval.
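
The skip rules can be sketched as a small Python function. The real logic lives in scripts/run_sweep.sh, which is a shell script; pending_steps is a hypothetical name, and the sketch only mirrors the behaviour described above.

```python
import tempfile
from pathlib import Path


def pending_steps(slug, root="."):
    """Return the sweep steps still needed for one model slug."""
    root = Path(root)
    steps = []
    if not (root / "runs" / slug / "final").is_dir():
        steps.append("train")      # no checkpoint yet
    if not (root / "results" / f"{slug}.json").is_file():
        steps.append("evaluate")   # no metrics yet
        steps.append("compare")    # RESULTS.md is rebuilt after each eval
    return steps


# Demonstrate the three states in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fresh = pending_steps("ettin-150m-ft", tmp)
    (Path(tmp) / "runs" / "ettin-150m-ft" / "final").mkdir(parents=True)
    trained = pending_steps("ettin-150m-ft", tmp)
    (Path(tmp) / "results").mkdir()
    (Path(tmp) / "results" / "ettin-150m-ft.json").write_text("{}")
    done = pending_steps("ettin-150m-ft", tmp)

print(fresh, trained, done)
```

Keying each step on its output artifact is what lets an interrupted sweep be re-run without repeating finished work.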


API

Run locally (no Docker)

./scripts/serve_local.sh
# starting API on http://127.0.0.1:8000 (QUEUE=0)

In another terminal:

# Use the /api/* paths directly: without -L, curl does not follow the 308 from the legacy paths
curl -s http://localhost:8000/api/healthz
# {"status":"ok","queue":false}

curl -s -X POST http://localhost:8000/api/predict/sync \
  -H 'content-type: application/json' \
  -d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":0.9542,"label":"claim"}

UI at http://localhost:8000/, comparison table at http://localhost:8000/results.

CLI

.venv/bin/python -m app.cli predict "The 2024 budget passed yesterday."
# CLAIM  (confidence: 0.9531)
# text: The 2024 budget passed yesterday.

.venv/bin/python -m app.cli predict --json "I love this weather"
# {"is_claim": false, "confidence": 0.7124, "label": "not_claim"}

echo "Some sentence" | .venv/bin/python -m app.cli predict -

Endpoints

Browse the auto-generated Swagger UI at http://localhost:8000/docs (or /redoc for the alternative renderer).

| Method | Path | Purpose |
|--------|------|---------|
| GET | / | HTMX UI (form + streaming results table) |
| GET | /results | server-rendered comparison table |
| GET | /docs | Swagger UI for the API |
| POST | /api/predict/sync | recommended; blocks until prediction is ready, returns JSON |
| POST | /api/predict | enqueue + return {job_id, stream_url} (queued mode) |
| GET | /api/predict/{job_id}/stream | SSE keep-alive: status events + final result event |
| GET | /api/healthz | readiness probe |
| POST | /ui/predict | HTMX-targeted endpoint that returns an HTML row fragment |

The legacy paths /predict, /predict/sync, /predict/{id}/stream, /healthz redirect (308, preserves method + body) to their /api/* counterparts.

Try it from your shell

curl -X POST http://localhost:8000/api/predict/sync \
    -H 'content-type: application/json' \
    -d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":1.0,"label":"claim"}

Set APP_URL=https://claims.jakehash.com (and optionally API_URL=https://api.claims.jakehash.com) when deploying — the curl example shown in the UI and Swagger UI updates automatically.
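
The same synchronous call can be made from Python with only the standard library. This is a sketch: build_request and predict are hypothetical helper names, and BASE_URL assumes the local server from the steps above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server; swap in your APP_URL


def build_request(text, base_url=BASE_URL):
    """Build the POST request for the synchronous predict endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/predict/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url=BASE_URL, timeout=30):
    """Send one sentence and return the decoded JSON response."""
    request = build_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, predict("Inflation hit 9.1% in June 2022.") should return a dict with the same is_claim, confidence, and label fields as the curl example.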


Docker

Full bring-up + Swarm details in DOCKER.md. Short version:

Local — docker compose up

Three services: api (FastAPI), worker (RQ), redis. The model checkpoint at runs/ettin-150m-ft/final/ is mounted read-only into both api and worker.

Bring it up:

docker compose up --build              # foreground (logs follow)
docker compose up -d --build           # detached / background

# verify it's reachable (give it ~10–15s for the worker to load the model)
curl http://localhost:8000/api/healthz
# {"status":"ok","queue":true}

# open the UI
open http://localhost:8000/            # macOS
xdg-open http://localhost:8000/        # Linux

In compose mode the API runs with QUEUE=1: POST /api/predict returns {job_id, stream_url} immediately, the worker processes the job, and the client SSE-streams status updates + the final result.
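
On the client side, the stream is ordinary Server-Sent Events text. Below is a minimal parser sketch over already-decoded lines. The event names (status, result) and the payload shapes in SAMPLE are assumptions based on the description above, not captured server output.

```python
import json


def _field_value(line, name):
    """Return the text after `name:`, dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                    # a blank line dispatches the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):      # comment line, commonly a keep-alive
            continue
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))


# Hypothetical stream contents for one job.
SAMPLE = [
    ": keep-alive\n",
    "event: status\n",
    'data: {"status": "queued"}\n',
    "\n",
    "event: status\n",
    'data: {"status": "started"}\n',
    "\n",
    "event: result\n",
    'data: {"is_claim": true, "confidence": 0.95, "label": "claim"}\n',
    "\n",
]
events = list(parse_sse(SAMPLE))
final = json.loads(events[-1][1])
print(events[-1][0], final)
```

A real client would feed this function lines read from the stream_url response instead of SAMPLE.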

Take it down:

docker compose down                    # stops and removes containers + network
docker compose down -v                 # also drops the (currently empty) named volumes
docker compose down --rmi local        # also removes the locally-built image

docker compose down is the inverse of up. Run it when you're done poking and want port :8000 back. Re-running docker compose up -d later restarts everything from the same configuration; nothing on disk in runs/, results/, or datasets/ is affected.

Useful while it's running:

docker compose ps                      # what's up and healthy
docker compose logs -f api             # tail api logs
docker compose logs -f worker          # tail worker logs (model loads, RQ jobs)
docker compose restart api             # restart only the api container

Server — docker stack deploy (Swarm)

Same docker-compose.yml. v3 syntax with deploy.replicas / restart_policy.condition so it works in both modes.

docker stack deploy -c docker-compose.yml claim-detection
docker service ls
docker service scale claim-detection_worker=4   # scale workers independently

The Swarm path needs the model directory available on every node where api or worker lands — either bake it into the image, mount a shared volume, or constrain placement with a node label. Recipes in DOCKER.md.


Tests

.venv/bin/python -m pytest                         # 29 unit tests, ~0.1s
.venv/bin/python -m pytest -m integration          # hits a live HTTP server

The unit tests use a FakePredictor fixture so they never load real model weights — fast, deterministic, easy to run in CI.
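
The idea behind such a fixture can be sketched as follows. This is not the repo's FakePredictor: the predict(text) interface and the keyword rule are assumptions, chosen so the response shape matches the API output shown earlier.

```python
class FakePredictor:
    """Deterministic stand-in for a model wrapper (hypothetical interface)."""

    def __init__(self, claim_markers=("%", "percent")):
        self.claim_markers = claim_markers
        self.calls = []  # record inputs so tests can assert on them

    def predict(self, text):
        self.calls.append(text)
        lowered = text.lower()
        is_claim = any(marker in lowered for marker in self.claim_markers)
        return {
            "is_claim": is_claim,
            "confidence": 0.99 if is_claim else 0.51,  # fixed values, no model involved
            "label": "claim" if is_claim else "not_claim",
        }


fake = FakePredictor()
print(fake.predict("Inflation hit 9.1% in June 2022."))
print(fake.predict("I love this weather"))
```

Because nothing is loaded from disk and the rule is fixed, tests built on a double like this give the same result on every run.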

The integration tests are excluded by default (pyproject.toml) and intentionally fail when the API isn't reachable. With the docker stack up, they hit http://localhost:8000 and exercise the full HTTP path (sync predict, HTML rendering, SSE if QUEUE=1).


Datasets

| Path | Source | Use |
|------|--------|-----|
| datasets/verita-composite/ | VeritaResearch/claim-extraction | default training set (ClaimBuster + PoliClaim + AVeriTeC) |
| datasets/checkthat-2025/ | CLEF CheckThat! 2025 | newer multilingual subjectivity + claim-normalization data |
| datasets/claimify/ | microsoft/claimify-dataset | LLM-generated text, useful for OOD ablation |
| datasets/feverfact/ | aic-factcheck/claim_extraction | Wikipedia atomic-claim extraction |
| datasets/all_binary.csv | aggregated | 21,079 binary-fit rows joined into one CSV |

datasets/README.md is the authoritative provenance + license + format reference. python datasets/normalize.py re-derives the normalized*.csv files from the raw sources.


Project status

✅ Done:

  • Data ingestion and normalization layer
  • Model registry + sequential training pipeline
  • Per-model evaluation + aggregate comparison renderer
  • All 8 models trained and evaluated (full Bell-paper sweep + Ettin)
  • FastAPI app: sync + queued/SSE modes
  • HTMX UI with server-side rendering
  • Docker Compose + Swarm-compatible stack
  • 29-test pytest suite + failing-first integration tests
  • Headline result: Ettin-150m-ft beats Bell's best encoder; 3-of-4 Bell reproductions match or beat published numbers

📋 Considered, deferred:

  • Out-of-domain evaluation against checkthat-2025/
  • Calibration (temperature scaling) on the API's confidence field
  • Token-level explanation (attention rollout / SHAP)
  • Bring up Docker stack and run the integration test suite green

See WALKTHROUGH.md for the demo walk-through and likely Q&A.
