claim-detection

Sentence-level claim detection — fine-tuned encoder models that decide whether a sentence makes a check-worthy factual claim, served behind a FastAPI endpoint with a queued/streaming backend, an HTMX UI, and a docker stack ready for both local Compose and remote Swarm.

L2 Labs takehome. The headline result: a fine-tuned Ettin-encoder-150m (Weller et al., ICLR 2026) reaches F1 0.917 on the in-domain test split — narrowly above the best encoder reported in Bell, "Less Can be More" (FEVER 2025, F1 0.916). Same in-domain task, on a model the paper didn't try.

Repo layout

claim-detection/
├── README.md                       this file
├── WALKTHROUGH.md                  ELI5 walk-through for the demo
├── ARCHITECTURE.md                 system design diagrams (Mermaid + Excalidraw)
├── architecture.excalidraw         editable diagram (excalidraw.com or VS Code ext)
├── DOCKER.md                       Docker Compose + Swarm bring-up
├── DEPLOY_LINUX.md                 ship Ettin checkpoint to a CPU-only server
├── RESOURCES.md                    CPU/RAM/storage usage + scaling guidance
├── RESULTS.md                      auto-generated comparison table (live)
├── papers/                         Bell (FEVER 2025), Ettin (ICLR 2026), takehome brief
├── datasets/
│   ├── README.md                   per-dataset provenance + license
│   ├── verita-composite/           Bell-paper sources (default training set)
│   ├── checkthat-2025/             newer multilingual dataset
│   ├── claimify/                   microsoft/claimify-dataset
│   ├── feverfact/                  17K atomic claims from Wikipedia
│   ├── all_binary.csv              21,079-row aggregate of binary-fit data
│   └── normalize.py                raw → normalized.csv reproducer
├── src/
│   ├── models.py                   static registry of 8 models to compare
│   ├── pipeline.py                 trains a model and saves checkpoint
│   ├── evaluate.py                 loads checkpoint, computes metrics
│   └── compare.py                  aggregates metrics → RESULTS.md
├── scripts/
│   ├── run_sweep.sh                idempotent train+eval+compare loop
│   └── serve_local.sh              run API locally without Docker
├── app/
│   ├── main.py                     FastAPI app
│   ├── predictor.py                model wrapper (load once, predict)
│   ├── queue.py                    Redis/RQ background-job glue
│   ├── cli.py                      `python -m app.cli predict "..."`
│   ├── results_view.py             /results page view-model
│   └── templates/                  Jinja2 + HTMX UI
├── docker/
│   └── Dockerfile                  multi-stage, CPU-only torch
├── docker-compose.yml              local + Swarm-compatible
├── tests/                          pytest suite (29 unit, integration gated)
├── runs/                           trained checkpoints (gitignored)
└── results/                        per-model metrics JSON (gitignored except .gitkeep)

The two papers we're working from

Bell, "Less Can be More: Comparing LLMs to Smaller Encoder-Only Models for Claim Detection" (FEVER 2025). Headline finding: small fine-tuned encoders (BERT, ModernBERT, RoBERTa) beat fine-tuned LLMs in-domain, by ~5 F1 points. Used a 12,997-sentence composite of ClaimBuster + PoliClaim + AVeriTeC.
Weller et al., "Seq vs Seq: An Open Suite of Paired Encoders and Decoders" (ICLR 2026). Released the Ettin model family — open-data encoders that re-train ModernBERT's recipe and slightly outperform it on GLUE. Most recent comparable open encoder; not yet evaluated for claim detection.

WALKTHROUGH.md has more context on what the papers say and how that guided model selection.

Results

Auto-regenerated by python -m src.compare after each model finishes. Live version at RESULTS.md. Final 8-model sweep, sorted by F1:

Rank	Model	Mode	Accuracy	Precision	Recall	F1
1	`ettin-150m-ft`	fine-tuned	0.9223	0.9174	0.9159	0.9167
2	`modernbert-base-ft`	fine-tuned	0.9219	0.9201	0.9118	0.9159
3	`roberta-base-ft`	fine-tuned	0.9188	0.9034	0.9250	0.9141
4	`bert-base-ft`	fine-tuned	0.9173	0.9215	0.8994	0.9103
5	`modernbert-base-pretrained`	frozen probe	0.8927	0.9026	0.8631	0.8824
6	`roberta-base-pretrained`	frozen probe	0.8927	0.9147	0.8491	0.8807
7	`ettin-150m-pretrained`	frozen probe	0.8877	0.8886	0.8681	0.8782
8	`bert-base-pretrained`	frozen probe	0.8596	0.8820	0.8071	0.8429

Bell paper reference (FEVER 2025) — what we're trying to match or beat

Bell row	Accuracy	Precision	Recall	F1
BERT (Finetuned)	0.9170	0.9130	0.9180	0.9160
ModernBERT (Finetuned)	0.9110	0.9080	0.9120	0.9100
RoBERTa / AFaCTA (Finetuned)	0.9050	0.9010	0.9060	0.9040
Llama-3.2-1B-Instruct (Finetuned)	0.8720	—	—	0.8640
Factcheck-GPT (zero-shot)	0.7310	—	—	0.7080

What these numbers mean (briefly)

Accuracy: fraction of all predictions that are correct. Easy to read, but misleading when the classes are imbalanced.
Precision: when the model says "claim," how often it's right. High precision = few false alarms.
Recall: of all the real claims, how many the model catches. High recall = few misses.
F1: harmonic mean of precision and recall. Punishes lopsided models (perfect precision with bad recall scores a low F1, even though the arithmetic mean would look fine).
confidence (returned per-prediction by the API): softmax probability of the predicted class. The model does not output it directly; it is derived from the logits via torch.softmax(logits, dim=-1)[pred_idx]. See WALKTHROUGH.md for the exact 4-line code path.
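To make the F1 and confidence definitions concrete, here is a minimal sketch in plain Python. The softmax is a standard-library stand-in for torch.softmax(logits, dim=-1) so the snippet runs without PyTorch; the function names and the example logits are illustrative assumptions, not code from this repo.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confidence(logits: list[float]) -> tuple[int, float]:
    """Return (predicted class index, softmax probability of that class)."""
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    return pred_idx, probs[pred_idx]


# A lopsided model: perfect precision, poor recall.
lopsided_mean = (1.0 + 0.2) / 2     # arithmetic mean hides the weak recall
lopsided_f1 = f1_score(1.0, 0.2)    # harmonic mean exposes it

# Made-up logits in the order [not_claim, claim] (the order is an assumption).
pred_idx, conf = confidence([-1.0, 2.0])
print(f"mean={lopsided_mean:.2f} f1={lopsided_f1:.2f} idx={pred_idx} conf={conf:.4f}")
```

In this example the arithmetic mean is 0.6 while F1 is about 0.33, which is the penalty the F1 definition describes.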

F1 is the key metric here because (1) we want both fewer false alarms and fewer misses, with no good reason to prefer one, and (2) it's the metric Bell uses, so it's the only way to compare head-to-head with the paper. WALKTHROUGH.md has the full ELI5 with examples.

Direct head-to-head with Bell (sanity check on our pipeline)

Bell row	Bell F1	Our F1	Δ
BERT (Finetuned)	0.9160	0.9103	−0.006
ModernBERT (Finetuned)	0.9100	0.9159	+0.006 ✅
RoBERTa (Finetuned)	0.9040	0.9141	+0.010 ✅
best Bell encoder	0.9160	0.9167 (Ettin, new)	+0.001 ✅

Three of the four comparisons match or beat the paper. The one we trail (BERT) is behind by 0.006 F1, a gap small enough to plausibly come from seed and re-implementation differences. The pipeline reproduces Bell's results closely, and Ettin slots in at the top.
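The Δ column can be re-derived from the two results tables. The sketch below hard-codes the F1 values shown above; the dictionary and function names are illustrative and not part of src/compare.py.

```python
# (Bell F1, our F1), copied from the tables in this README.
HEAD_TO_HEAD = {
    "BERT (Finetuned)": (0.9160, 0.9103),
    "ModernBERT (Finetuned)": (0.9100, 0.9159),
    "RoBERTa (Finetuned)": (0.9040, 0.9141),
    "best Bell encoder vs Ettin": (0.9160, 0.9167),
}


def f1_deltas(rows: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Our F1 minus Bell's F1, rounded to three decimals as in the table."""
    return {name: round(ours - bell, 3) for name, (bell, ours) in rows.items()}


deltas = f1_deltas(HEAD_TO_HEAD)
matched_or_beat = sum(1 for delta in deltas.values() if delta >= 0)

for name, delta in deltas.items():
    print(f"{name:<28} {delta:+.3f}")
print(f"{matched_or_beat} of {len(deltas)} rows match or beat Bell")
```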

Quick start

Prereqs

macOS or Linux (the repo's been built and tested on Apple Silicon).
Python 3.13. (At the time of writing, torch wheels for Python 3.14 were not available.)
~1 GB disk for model checkpoints + datasets.
Optional: Docker Desktop for the containerized API.

1) Install

python3.13 -m venv .venv
.venv/bin/pip install -r requirements.txt

2) Look at the data

head datasets/verita-composite/ours/train.csv
# text,label
# "I would like for them to look you in the eye and tell you why...",0
# "...several times this year, Governor Reagan has said that...",1

datasets/README.md documents every source, format, and license.
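Because accuracy is only trustworthy when the classes are reasonably balanced, a label count is a useful first check on any split. The sketch below assumes only the text,label header shown in the head output above; the two sample rows are placeholders, not real dataset rows.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file) -> Counter:
    """Count rows per label in a CSV with `text` and `label` columns."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# In-memory sample in the same shape as train.csv (placeholder sentences).
sample = io.StringIO(
    "text,label\n"
    '"I think we can all agree, broadly, on that.",0\n'
    '"Unemployment fell to 4% last year.",1\n'
)
counts = label_counts(sample)
print(counts)

# Against the real file, the call would look like this:
# with open("datasets/verita-composite/ours/train.csv", newline="", encoding="utf-8") as f:
#     print(label_counts(f))
```

Labels come back as strings ("0" and "1") because the csv module does not convert types.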

3) Train a model

The pipeline is sequential — it trains models one at a time so a laptop's unified memory isn't fighting itself. Idempotent: existing checkpoints aren't re-trained.

# See the registry of 8 models
.venv/bin/python -m src.pipeline --list

# Smoke test (200 rows, 1 epoch — finishes in ~17s on M-series Mac)
.venv/bin/python -m src.pipeline --model ettin-150m-ft --epochs 1 --max-train-rows 200

# Full run (3 epochs, all data — ~25 min on M4 Air, ~50 min on M1 Pro)
.venv/bin/python -m src.pipeline --model ettin-150m-ft

# Train every enabled model in the registry (in registry order)
.venv/bin/python -m src.pipeline

Per-model checkpoint saved to runs/<slug>/final/, training summary to runs/<slug>/summary.json.

4) Evaluate

# Eval one model
.venv/bin/python -m src.evaluate --model ettin-150m-ft

# Eval everything that has a checkpoint
.venv/bin/python -m src.evaluate

Per-model metrics → results/<slug>.json.

5) Generate the comparison table

.venv/bin/python -m src.compare
# wrote RESULTS.md  (1 models)
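Conceptually, the compare step reads every results/<slug>.json and sorts by F1. The sketch below shows that idea under a loud assumption: each JSON file is taken to have a top-level numeric "f1" key, which may not match the schema src/evaluate.py actually writes. It is not the repo's implementation.

```python
import json
import tempfile
from pathlib import Path


def load_ranked(results_dir: Path) -> list[tuple[str, dict]]:
    """Load every <slug>.json in results_dir and sort by F1, best first."""
    rows = []
    for path in sorted(results_dir.glob("*.json")):
        metrics = json.loads(path.read_text(encoding="utf-8"))
        rows.append((path.stem, metrics))
    # ASSUMPTION: each metrics file exposes a top-level "f1" number.
    return sorted(rows, key=lambda row: row[1]["f1"], reverse=True)


def render_table(rows: list[tuple[str, dict]]) -> str:
    """Render a tab-separated ranking table like the one in this README."""
    lines = ["Rank\tModel\tF1"]
    for rank, (slug, metrics) in enumerate(rows, start=1):
        lines.append(f"{rank}\t`{slug}`\t{metrics['f1']:.4f}")
    return "\n".join(lines)


# Demo against a throwaway directory so the real results/ is untouched.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "bert-base-ft.json").write_text(json.dumps({"f1": 0.9103}))
    (tmp_dir / "ettin-150m-ft.json").write_text(json.dumps({"f1": 0.9167}))
    ranked = load_ranked(tmp_dir)
    table = render_table(ranked)

print(table)
```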

6) Drive the whole sweep with one command

./scripts/run_sweep.sh                 # train + eval + compare for every model
./scripts/run_sweep.sh ettin-150m-ft   # one model only

The sweep is resumable: skips training if runs/<slug>/final/ exists, skips eval if results/<slug>.json exists, regenerates RESULTS.md after every successful eval.
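The skip rules above amount to two existence checks per model. The following Python sketch mirrors that logic for illustration only; scripts/run_sweep.sh is a shell script, and the function name pending_steps is hypothetical.

```python
import tempfile
from pathlib import Path


def pending_steps(slug: str, root: Path) -> list[str]:
    """Return the sweep steps still needed for one model slug."""
    steps = []
    # Training is skipped when the final checkpoint directory already exists.
    if not (root / "runs" / slug / "final").is_dir():
        steps.append("train")
    # Evaluation is skipped when the metrics JSON already exists.
    if not (root / "results" / f"{slug}.json").is_file():
        steps.append("eval")
    # RESULTS.md is regenerated only after an eval actually runs.
    if "eval" in steps:
        steps.append("compare")
    return steps


# Walk one model through the three states in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fresh = pending_steps("ettin-150m-ft", root)

    (root / "runs" / "ettin-150m-ft" / "final").mkdir(parents=True)
    after_train = pending_steps("ettin-150m-ft", root)

    (root / "results").mkdir()
    (root / "results" / "ettin-150m-ft.json").write_text("{}")
    after_eval = pending_steps("ettin-150m-ft", root)

print(fresh, after_train, after_eval)
```

Keying each step on its output artifact is what makes an interrupted sweep safe to re-run: finished work is detected from disk rather than from any in-memory state.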

API

Run locally (no Docker)

./scripts/serve_local.sh
# starting API on http://127.0.0.1:8000 (QUEUE=0)

In another terminal:

curl -s http://localhost:8000/api/healthz
# {"status":"ok","queue":false}

curl -s -X POST http://localhost:8000/api/predict/sync \
  -H 'content-type: application/json' \
  -d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":0.9542,"label":"claim"}

UI at http://localhost:8000/, comparison table at http://localhost:8000/results.

CLI

.venv/bin/python -m app.cli predict "The 2024 budget passed yesterday."
# CLAIM  (confidence: 0.9531)
# text: The 2024 budget passed yesterday.

.venv/bin/python -m app.cli predict --json "I love this weather"
# {"is_claim": false, "confidence": 0.7124, "label": "not_claim"}

echo "Some sentence" | .venv/bin/python -m app.cli predict -

Endpoints

Browse the auto-generated Swagger UI at http://localhost:8000/docs (or /redoc for the alternative renderer).

method	path	purpose
`GET`	`/`	HTMX UI (form + streaming results table)
`GET`	`/results`	server-rendered comparison table
`GET`	`/docs`	Swagger UI for the API
`POST`	`/api/predict/sync`	recommended — blocks until prediction is ready, returns JSON
`POST`	`/api/predict`	enqueue + return `{job_id, stream_url}` (queued mode)
`GET`	`/api/predict/{job_id}/stream`	SSE keep-alive: status events + final result event
`GET`	`/api/healthz`	readiness probe
`POST`	`/ui/predict`	HTMX-targeted endpoint that returns an HTML row fragment

The legacy paths /predict, /predict/sync, /predict/{id}/stream, /healthz redirect (308, preserves method + body) to their /api/* counterparts.

Try it from your shell

curl -X POST http://localhost:8000/api/predict/sync \
    -H 'content-type: application/json' \
    -d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":1.0,"label":"claim"}
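For callers that prefer Python over curl, the same request can be built with the standard library. This is a sketch: the endpoint path and the {"text": ...} body come from this README, while the function names and the 10-second timeout are assumptions.

```python
import json
import urllib.request


def build_sync_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST request for /api/predict/sync without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_sync(
    text: str, base_url: str = "http://localhost:8000", timeout: float = 10.0
) -> dict:
    """Send the request and decode the JSON response (needs a running API)."""
    request = build_sync_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request is safe offline; predict_sync() needs the server up.
request = build_sync_request("Inflation hit 9.1% in June 2022.")
print(request.get_method(), request.full_url)
```

Pointing base_url at the deployed host is the only change needed for a remote instance.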

Set APP_URL=https://claims.jakehash.com (and optionally API_URL=https://api.claims.jakehash.com) when deploying — the curl example shown in the UI and Swagger UI updates automatically.

Docker

Full bring-up + Swarm details in DOCKER.md. Short version:

Local — `docker compose up`

Three services: api (FastAPI), worker (RQ), redis. The model checkpoint at runs/ettin-150m-ft/final/ is mounted read-only into both api and worker.

Bring it up:

docker compose up --build              # foreground (logs follow)
docker compose up -d --build           # detached / background

# verify it's reachable (give it ~10–15s for the worker to load the model)
curl http://localhost:8000/api/healthz
# {"status":"ok","queue":true}

# open the UI
open http://localhost:8000/            # macOS
xdg-open http://localhost:8000/        # Linux

In compose mode the API runs with QUEUE=1 — POST /api/predict returns {job_id, stream_url} immediately, the worker processes the job, and the client SSE-streams status updates + the final result.
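On the client side, consuming the stream means parsing Server-Sent Events. The sketch below is a small parser for the standard SSE line format. The event names (status, result) and the payload shapes in the sample are assumptions for illustration; check app/main.py for what the server actually emits.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from Server-Sent Events lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            # Comment line, commonly used as a keep-alive ping.
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].removeprefix(" ")
        elif line.startswith("data:"):
            data.append(line[len("data:"):].removeprefix(" "))


# Hypothetical stream: two status updates, then the final result.
sample_stream = [
    ": keep-alive\n",
    "event: status\n", 'data: {"status": "queued"}\n', "\n",
    "event: status\n", 'data: {"status": "started"}\n', "\n",
    "event: result\n",
    'data: {"is_claim": true, "confidence": 0.95, "label": "claim"}\n',
    "\n",
]

events = [(name, json.loads(data)) for name, data in iter_sse_events(sample_stream)]
for name, payload in events:
    print(name, payload)
```

With a live stack, the same function can be fed the decoded lines of the HTTP response returned for stream_url.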

Take it down:

docker compose down                    # stops and removes containers + network
docker compose down -v                 # also drops the (currently empty) named volumes
docker compose down --rmi local        # also removes the locally-built image

docker compose down is the inverse of up. Run it when you're done poking and want port :8000 back. Re-running docker compose up -d later restarts everything from the same configuration; nothing on disk in runs/, results/, or datasets/ is affected.

Useful while it's running:

docker compose ps                      # what's up and healthy
docker compose logs -f api             # tail api logs
docker compose logs -f worker          # tail worker logs (model loads, RQ jobs)
docker compose restart api             # restart only the api container

Server — `docker stack deploy` (Swarm)

Same docker-compose.yml. It uses v3 syntax with deploy.replicas and restart_policy.condition, so the one file works in both Compose and Swarm modes.

docker stack deploy -c docker-compose.yml claim-detection
docker service ls
docker service scale claim-detection_worker=4   # scale workers independently

The Swarm path needs the model directory available on every node where api or worker lands — either bake it into the image, mount a shared volume, or constrain placement with a node label. Recipes in DOCKER.md.

Tests

.venv/bin/python -m pytest                         # 29 unit tests, ~0.1s
.venv/bin/python -m pytest -m integration          # hits a live HTTP server

The unit tests use a FakePredictor fixture so they never load real model weights — fast, deterministic, easy to run in CI.
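The pattern looks roughly like the sketch below. It is not the repo's fixture: the predict method name and the format_prediction helper are hypothetical, and only the response keys (is_claim, confidence, label) are taken from the API examples in this README.

```python
from dataclasses import dataclass


@dataclass
class FakePredictor:
    """Stand-in for the real model wrapper: canned output, no weights loaded."""

    is_claim: bool = True
    confidence: float = 0.9

    def predict(self, text: str) -> dict:
        # Ignores the text on purpose so the result is deterministic.
        return {
            "is_claim": self.is_claim,
            "confidence": self.confidence,
            "label": "claim" if self.is_claim else "not_claim",
        }


def format_prediction(predictor, text: str) -> str:
    """Hypothetical code under test: render one prediction as a display line."""
    result = predictor.predict(text)
    verdict = "CLAIM" if result["is_claim"] else "NOT CLAIM"
    return f"{verdict}  (confidence: {result['confidence']:.4f})"


def test_format_prediction_uses_injected_predictor():
    fake = FakePredictor(is_claim=False, confidence=0.7124)
    line = format_prediction(fake, "I love this weather")
    assert line == "NOT CLAIM  (confidence: 0.7124)"


# pytest would collect the test by name; calling it directly also works.
test_format_prediction_uses_injected_predictor()
```

Injecting the predictor keeps the test independent of checkpoint files, which is why a suite built this way can run in a fraction of a second.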

The integration tests are excluded by default (pyproject.toml) and intentionally fail when the API isn't reachable. With the docker stack up, they hit http://localhost:8000 and exercise the full HTTP path (sync predict, HTML rendering, SSE if QUEUE=1).

Datasets

Path	Source	Use
`datasets/verita-composite/`	VeritaResearch/claim-extraction	default training set (ClaimBuster + PoliClaim + AVeriTeC)
`datasets/checkthat-2025/`	CLEF CheckThat! 2025	newer multilingual subjectivity + claim-normalization data
`datasets/claimify/`	microsoft/claimify-dataset	LLM-generated text, useful for OOD ablation
`datasets/feverfact/`	aic-factcheck/claim_extraction	Wikipedia atomic-claim extraction
`datasets/all_binary.csv`	aggregated	21,079 binary-fit rows joined into one CSV

datasets/README.md is the authoritative provenance + license + format reference. python datasets/normalize.py re-derives the normalized*.csv files from the raw sources.

Project status

✅ Done:

Data ingestion and normalization layer
Model registry + sequential training pipeline
Per-model evaluation + aggregate comparison renderer
All 8 models trained and evaluated (full Bell-paper sweep + Ettin)
FastAPI app: sync + queued/SSE modes
HTMX UI with server-side rendering
Docker Compose + Swarm-compatible stack
29-test pytest suite + failing-first integration tests
Headline result: Ettin-150m-ft beats Bell's best encoder; 3-of-4 Bell reproductions match or beat published numbers

📋 Considered, deferred:

Out-of-domain evaluation against checkthat-2025/
Calibration (temperature scaling) on the API's confidence field
Token-level explanation (attention rollout / SHAP)
Bring up Docker stack and run the integration test suite green

See WALKTHROUGH.md for the demo walk-through and likely Q&A.
