Sentence-level claim detection — fine-tuned encoder models that decide whether a sentence makes a check-worthy factual claim, served behind a FastAPI endpoint with a queued/streaming backend, an HTMX UI, and a docker stack ready for both local Compose and remote Swarm.
L2 Labs takehome. The headline result: a fine-tuned Ettin-encoder-150m (Weller et al., ICLR 2026) reaches F1 0.917 on the in-domain test split — narrowly above the best encoder reported in Bell, "Less Can be More" (FEVER 2025, F1 0.916). Same in-domain task, on a model the paper didn't try.
claim-detection/
├── README.md this file
├── WALKTHROUGH.md ELI5 walk-through for the demo
├── ARCHITECTURE.md system design diagrams (Mermaid + Excalidraw)
├── architecture.excalidraw editable diagram (excalidraw.com or VS Code ext)
├── DOCKER.md Docker Compose + Swarm bring-up
├── DEPLOY_LINUX.md ship Ettin checkpoint to a CPU-only server
├── RESOURCES.md CPU/RAM/storage usage + scaling guidance
├── RESULTS.md auto-generated comparison table (live)
├── papers/ Bell (FEVER 2025), Ettin (ICLR 2026), takehome brief
├── datasets/
│ ├── README.md per-dataset provenance + license
│ ├── verita-composite/ Bell-paper sources (default training set)
│ ├── checkthat-2025/ newer multilingual dataset
│ ├── claimify/ microsoft/claimify-dataset
│ ├── feverfact/ 17K atomic claims from Wikipedia
│ ├── all_binary.csv 21,079-row aggregate of binary-fit data
│ └── normalize.py raw → normalized.csv reproducer
├── src/
│ ├── models.py static registry of 8 models to compare
│ ├── pipeline.py trains a model and saves checkpoint
│ ├── evaluate.py loads checkpoint, computes metrics
│ └── compare.py aggregates metrics → RESULTS.md
├── scripts/
│ ├── run_sweep.sh idempotent train+eval+compare loop
│ └── serve_local.sh run API locally without Docker
├── app/
│ ├── main.py FastAPI app
│ ├── predictor.py model wrapper (load once, predict)
│ ├── queue.py Redis/RQ background-job glue
│ ├── cli.py `python -m app.cli predict "..."`
│ ├── results_view.py /results page view-model
│ └── templates/ Jinja2 + HTMX UI
├── docker/
│ └── Dockerfile multi-stage, CPU-only torch
├── docker-compose.yml local + Swarm-compatible
├── tests/ pytest suite (29 unit, integration gated)
├── runs/ trained checkpoints (gitignored)
└── results/ per-model metrics JSON (gitignored except .gitkeep)
- Bell, "Less Can be More: Comparing LLMs to Smaller Encoder-Only Models for Claim Detection" (FEVER 2025). Headline finding: small fine-tuned encoders (BERT, ModernBERT, RoBERTa) beat fine-tuned LLMs in-domain, by ~5 F1 points. Used a 12,997-sentence composite of ClaimBuster + PoliClaim + AVeriTeC.
- Weller et al., "Seq vs Seq: An Open Suite of Paired Encoders and Decoders" (ICLR 2026). Released the Ettin model family — open-data encoders that re-train ModernBERT's recipe and slightly outperform it on GLUE. Most recent comparable open encoder; not yet evaluated for claim detection.
WALKTHROUGH.md has more context on what the papers say and how
that guided model selection.
Auto-regenerated by python -m src.compare after each model finishes.
Live version at RESULTS.md. Final 8-model sweep, sorted by F1:
| Rank | Model | Mode | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | ettin-150m-ft | fine-tuned | 0.9223 | 0.9174 | 0.9159 | 0.9167 |
| 2 | modernbert-base-ft | fine-tuned | 0.9219 | 0.9201 | 0.9118 | 0.9159 |
| 3 | roberta-base-ft | fine-tuned | 0.9188 | 0.9034 | 0.9250 | 0.9141 |
| 4 | bert-base-ft | fine-tuned | 0.9173 | 0.9215 | 0.8994 | 0.9103 |
| 5 | modernbert-base-pretrained | frozen probe | 0.8927 | 0.9026 | 0.8631 | 0.8824 |
| 6 | roberta-base-pretrained | frozen probe | 0.8927 | 0.9147 | 0.8491 | 0.8807 |
| 7 | ettin-150m-pretrained | frozen probe | 0.8877 | 0.8886 | 0.8681 | 0.8782 |
| 8 | bert-base-pretrained | frozen probe | 0.8596 | 0.8820 | 0.8071 | 0.8429 |
| Bell row | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| BERT (Finetuned) | 0.9170 | 0.9130 | 0.9180 | 0.9160 |
| ModernBERT (Finetuned) | 0.9110 | 0.9080 | 0.9120 | 0.9100 |
| RoBERTa / AFaCTA (Finetuned) | 0.9050 | 0.9010 | 0.9060 | 0.9040 |
| Llama-3.2-1B-Instruct (Finetuned) | 0.8720 | — | — | 0.8640 |
| Factcheck-GPT (zero-shot) | 0.7310 | — | — | 0.7080 |
- Accuracy — fraction of all predictions that are correct. Easy to read, but lies when the classes aren't balanced.
- Precision — when the model says "claim," how often it's right. High precision = few false alarms.
- Recall — of all the real claims, how many the model catches. High recall = few misses.
- F1 — harmonic mean of precision and recall. Punishes lopsided models (perfect precision but bad recall scores a low F1, even though the average would look fine).
- `confidence` (returned per-prediction by the API): softmax probability of the predicted class. This is derived from the model's logits via `torch.softmax(logits)[pred_idx]`, not built into PyTorch; see WALKTHROUGH.md for the exact 4-line code path.
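As a rough illustration of how such a confidence value can be derived, the sketch below applies a softmax to a pair of logits using only the standard library. The logit values and the label order (index 0 = `not_claim`, index 1 = `claim`) are assumptions made for this example; the project's actual code path uses `torch.softmax` and is described in WALKTHROUGH.md.

```python
import math

# Assumed index order for illustration: 0 = not_claim, 1 = claim.
LABELS = ["not_claim", "claim"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_from_logits(logits):
    """Build the three fields the API returns from raw logits."""
    probs = softmax(logits)
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    return {
        "is_claim": LABELS[pred_idx] == "claim",
        "confidence": round(probs[pred_idx], 4),
        "label": LABELS[pred_idx],
    }


# Made-up logits, not taken from a real checkpoint.
result = predict_from_logits([-1.5, 1.5])
print(result)
```

Because the confidence is the probability of the winning class only, it says how strongly the model prefers its answer, not whether that answer is calibrated.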
F1 is the key metric here because (1) we want both fewer false
alarms and fewer misses, with no good reason to prefer one, and (2)
it's the metric Bell uses, so it's the only way to compare head-to-head
with the paper. WALKTHROUGH.md has the full ELI5 with examples.
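The point about F1 punishing lopsided models can be checked with a few lines of arithmetic. The sketch below uses made-up confusion counts (20 true positives, 0 false positives, 80 false negatives), not numbers from this project's test split.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A lopsided model: never raises a false alarm, but misses most claims.
precision, recall, f1 = precision_recall_f1(tp=20, fp=0, fn=80)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"arithmetic mean={arithmetic_mean:.2f} f1={f1:.2f}")
```

Here precision is 1.0 and recall is 0.2, so the plain average is 0.6 while F1 is about 0.33, which is why a model cannot score well on F1 by maximizing only one of the two.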
| Bell row | Bell F1 | Our F1 | Δ |
|---|---|---|---|
| BERT (Finetuned) | 0.9160 | 0.9103 | −0.006 |
| ModernBERT (Finetuned) | 0.9100 | 0.9159 | +0.006 ✅ |
| RoBERTa (Finetuned) | 0.9040 | 0.9141 | +0.010 ✅ |
| best Bell encoder | 0.9160 | 0.9167 (Ettin, new) | +0.001 ✅ |
Three of four match or beat the paper — the only one we trail (BERT) is by 0.006 F1, well within the noise floor of a re-implementation. The pipeline reproduces Bell's results, and Ettin slots in at the top.
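The Δ column can be re-derived from the two F1 columns. The snippet below is only a convenience check of that arithmetic using the values printed in the tables; the dictionary names are illustrative and not part of the repo.

```python
# F1 values copied from the tables in this README.
bell_f1 = {"BERT": 0.9160, "ModernBERT": 0.9100, "RoBERTa": 0.9040}
our_f1 = {"BERT": 0.9103, "ModernBERT": 0.9159, "RoBERTa": 0.9141}
ettin_f1 = 0.9167

# Per-model difference, rounded to three decimals as in the table.
deltas = {name: round(our_f1[name] - bell_f1[name], 3) for name in bell_f1}

# Ettin compared against the best encoder Bell reports.
best_delta = round(ettin_f1 - max(bell_f1.values()), 3)

print(deltas)
print(best_delta)
```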
- macOS or Linux (the repo's been built and tested on Apple Silicon).
- Python 3.13. (Python 3.14 wheels for torch don't exist yet.)
- ~1 GB disk for model checkpoints + datasets.
- Optional: Docker Desktop for the containerized API.
python3.13 -m venv .venv
.venv/bin/pip install -r requirements.txt
head datasets/verita-composite/ours/train.csv
# text,label
# "I would like for them to look you in the eye and tell you why...",0
# "...several times this year, Governor Reagan has said that...",1
datasets/README.md documents every source, format, and license.
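For a quick look at the two-column format, a minimal sketch using the standard `csv` module follows. It parses the two sample rows shown above from an in-memory string rather than the real file, and it assumes that label `1` means "claim" and `0` means "not a claim".

```python
import csv
import io
from collections import Counter

# The two sample rows shown above, inlined so the sketch runs anywhere.
sample = '''text,label
"I would like for them to look you in the eye and tell you why...",0
"...several times this year, Governor Reagan has said that...",1
'''

rows = list(csv.DictReader(io.StringIO(sample)))

# Count rows per label (assumption: 1 = claim, 0 = not a claim).
label_counts = Counter(int(row["label"]) for row in rows)

print(label_counts)
```

To run the same count on the training set, replace `io.StringIO(sample)` with an opened handle on datasets/verita-composite/ours/train.csv.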
The pipeline is sequential — it trains models one at a time so a laptop's unified memory isn't fighting itself. Idempotent: existing checkpoints aren't re-trained.
# See the registry of 8 models
.venv/bin/python -m src.pipeline --list
# Smoke test (200 rows, 1 epoch — finishes in ~17s on M-series Mac)
.venv/bin/python -m src.pipeline --model ettin-150m-ft --epochs 1 --max-train-rows 200
# Full run (3 epochs, all data — ~25 min on M4 Air, ~50 min on M1 Pro)
.venv/bin/python -m src.pipeline --model ettin-150m-ft
# Train every enabled model in the registry (in registry order)
.venv/bin/python -m src.pipeline
Per-model checkpoint saved to runs/<slug>/final/, training summary to
runs/<slug>/summary.json.
# Eval one model
.venv/bin/python -m src.evaluate --model ettin-150m-ft
# Eval everything that has a checkpoint
.venv/bin/python -m src.evaluate
Per-model metrics → results/<slug>.json.
.venv/bin/python -m src.compare
# wrote RESULTS.md (1 models)
./scripts/run_sweep.sh # train + eval + compare for every model
./scripts/run_sweep.sh ettin-150m-ft # one model only
The sweep is resumable: it skips training if runs/<slug>/final/
exists, skips eval if results/<slug>.json exists, and regenerates
RESULTS.md after every successful eval.
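The resume rules can be sketched in Python as a small planning function. This is one reading of the behaviour described above, not the contents of run_sweep.sh; `plan_steps` is a hypothetical helper, and treating `compare` as running only after an eval is an assumption.

```python
import tempfile
from pathlib import Path


def plan_steps(slug, root):
    """Return the steps still needed for one model (hypothetical helper)."""
    root = Path(root)
    steps = []
    # Training is skipped when the final checkpoint directory exists.
    if not (root / "runs" / slug / "final").is_dir():
        steps.append("train")
    # Eval is skipped when the metrics JSON exists; compare follows an eval.
    if not (root / "results" / f"{slug}.json").is_file():
        steps.extend(["eval", "compare"])
    return steps


with tempfile.TemporaryDirectory() as tmp:
    slug = "ettin-150m-ft"
    fresh = plan_steps(slug, tmp)

    (Path(tmp) / "runs" / slug / "final").mkdir(parents=True)
    trained = plan_steps(slug, tmp)

    (Path(tmp) / "results").mkdir()
    (Path(tmp) / "results" / f"{slug}.json").write_text("{}")
    done = plan_steps(slug, tmp)

print(fresh, trained, done)
```

Keying each step on the artifact it produces is what makes an interrupted sweep safe to re-run.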
./scripts/serve_local.sh
# starting API on http://127.0.0.1:8000 (QUEUE=0)
In another terminal:
curl -s http://localhost:8000/healthz
# {"status":"ok","queue":false}
curl -s -X POST http://localhost:8000/predict \
-H 'content-type: application/json' \
-d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":0.9542,"label":"claim"}
UI at http://localhost:8000/, comparison table at http://localhost:8000/results.
.venv/bin/python -m app.cli predict "The 2024 budget passed yesterday."
# CLAIM (confidence: 0.9531)
# text: The 2024 budget passed yesterday.
.venv/bin/python -m app.cli predict --json "I love this weather"
# {"is_claim": false, "confidence": 0.7124, "label": "not_claim"}
echo "Some sentence" | .venv/bin/python -m app.cli predict -
Browse the auto-generated Swagger UI at http://localhost:8000/docs
(or /redoc for the alternative renderer).
| method | path | purpose |
|---|---|---|
| GET | / | HTMX UI (form + streaming results table) |
| GET | /results | server-rendered comparison table |
| GET | /docs | Swagger UI for the API |
| POST | /api/predict/sync | recommended: blocks until prediction is ready, returns JSON |
| POST | /api/predict | enqueue + return {job_id, stream_url} (queued mode) |
| GET | /api/predict/{job_id}/stream | SSE keep-alive: status events + final result event |
| GET | /api/healthz | readiness probe |
| POST | /ui/predict | HTMX-targeted endpoint that returns an HTML row fragment |
The legacy paths /predict, /predict/sync, /predict/{id}/stream,
/healthz redirect (308, preserves method + body) to their /api/*
counterparts.
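The legacy-path rule amounts to prefixing `/api` and answering with status 308, which (unlike 301 or 302) tells clients to repeat the request with the same method and body. A minimal sketch of that mapping follows; `legacy_redirect` is a hypothetical helper for illustration and is not taken from app/main.py.

```python
PERMANENT_REDIRECT = 308  # preserves the HTTP method and request body


def legacy_redirect(path):
    """Return (status, location) for a legacy path, or None otherwise."""
    is_legacy = (
        path == "/healthz"
        or path == "/predict"
        or path.startswith("/predict/")
    )
    if is_legacy:
        return PERMANENT_REDIRECT, "/api" + path
    return None


print(legacy_redirect("/predict/sync"))
print(legacy_redirect("/results"))
```

Note that clients which do not follow redirects by default (for example `curl` without `-L`) will see the 308 response rather than the JSON body.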
curl -X POST http://localhost:8000/api/predict/sync \
-H 'content-type: application/json' \
-d '{"text": "Inflation hit 9.1% in June 2022."}'
# {"is_claim":true,"confidence":1.0,"label":"claim"}
Set APP_URL=https://claims.jakehash.com (and optionally
API_URL=https://api.claims.jakehash.com) when deploying; the curl
example shown in the UI and Swagger UI updates automatically.
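The sync endpoint can also be called from Python with only the standard library. The sketch below mirrors the curl example; the function names, the default base URL, and the timeout are illustrative choices, not part of the repo.

```python
import json
import urllib.request


def build_sync_request(text, base_url="http://localhost:8000"):
    """Build a POST request for /api/predict/sync (illustrative helper)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/predict/sync",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def predict_sync(text, base_url="http://localhost:8000", timeout=30.0):
    """Send the request and decode the JSON response; needs a running API."""
    request = build_sync_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_sync_request("Inflation hit 9.1% in June 2022.")
print(request.get_method(), request.full_url)
```

With the API running, `predict_sync("Inflation hit 9.1% in June 2022.")` should return a dictionary with the `is_claim`, `confidence`, and `label` fields shown above.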
Full bring-up + Swarm details in DOCKER.md. Short version:
Three services: api (FastAPI), worker (RQ), redis. The model
checkpoint at runs/ettin-150m-ft/final/ is mounted read-only into
both api and worker.
Bring it up:
docker compose up --build # foreground (logs follow)
docker compose up -d --build # detached / background
# verify it's reachable (give it ~10–15s for the worker to load the model)
curl http://localhost:8000/api/healthz
# {"status":"ok","queue":true}
# open the UI
open http://localhost:8000/ # macOS
xdg-open http://localhost:8000/ # Linux
In compose mode the API runs with QUEUE=1: POST /api/predict returns
{job_id, stream_url} immediately, the worker processes the job, and
the client SSE-streams status updates + the final result.
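On the client side, consuming the stream comes down to parsing Server-Sent Events. The sketch below is a small parser for the `event:` / `data:` wire format run against a hard-coded sample. The event names `status` and `result` follow the endpoint table; the JSON payload shapes and the keep-alive comment line are assumptions for illustration.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            # Comment lines are commonly used as keep-alives; ignore them.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream contents; the real payloads may differ.
sample = [
    "event: status\n",
    'data: {"status": "queued"}\n',
    "\n",
    ": keep-alive\n",
    "event: status\n",
    'data: {"status": "started"}\n',
    "\n",
    "event: result\n",
    'data: {"is_claim": true, "confidence": 0.9542, "label": "claim"}\n',
    "\n",
]

events = [(name, json.loads(payload)) for name, payload in parse_sse(sample)]
for name, payload in events:
    print(name, payload)
```

In practice the `lines` argument would be the decoded lines of the HTTP response from the stream URL, and the client would stop reading once the `result` event arrives.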
Take it down:
docker compose down # stops and removes containers + network
docker compose down -v # also drops the (currently empty) named volumes
docker compose down --rmi local # also removes the locally-built image
docker compose down is the inverse of up. Run it when you're done
poking and want port :8000 back. Re-running docker compose up -d
later restarts everything from the same configuration; nothing on disk
in runs/, results/, or datasets/ is affected.
Useful while it's running:
docker compose ps # what's up and healthy
docker compose logs -f api # tail api logs
docker compose logs -f worker # tail worker logs (model loads, RQ jobs)
docker compose restart api # restart only the api container
Swarm uses the same docker-compose.yml. The file is written in v3 syntax with
deploy.replicas / restart_policy.condition, so it works in both modes.
docker stack deploy -c docker-compose.yml claim-detection
docker service ls
docker service scale claim-detection_worker=4 # scale workers independently
The Swarm path needs the model directory available on every node where
api or worker lands: either bake it into the image, mount a shared
volume, or constrain placement with a node label. Recipes in
DOCKER.md.
.venv/bin/python -m pytest # 29 unit tests, ~0.1s
.venv/bin/python -m pytest -m integration # hits a live HTTP server
The unit tests use a FakePredictor fixture so they never load real
model weights, which keeps them fast, deterministic, and easy to run in CI.
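To give a feel for the idea, here is a minimal sketch of what a fake predictor could look like. The class body, the `predict` method name, and the digit-based rule are assumptions for illustration; the fixture in tests/ may be shaped differently.

```python
class FakePredictor:
    """Hypothetical stand-in that mimics the API's response fields."""

    def __init__(self, confidence=0.99):
        self.confidence = confidence
        self.calls = []  # record inputs so tests can assert on them

    def predict(self, text):
        self.calls.append(text)
        # Toy rule: any sentence containing a digit counts as a claim.
        is_claim = any(ch.isdigit() for ch in text)
        return {
            "is_claim": is_claim,
            "confidence": self.confidence,
            "label": "claim" if is_claim else "not_claim",
        }


fake = FakePredictor()
claim = fake.predict("Inflation hit 9.1% in June 2022.")
other = fake.predict("I love this weather")
print(claim, other)
```

Because the output depends only on the input string, tests built on a fake like this give the same result on every run and need no model download.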
The integration tests are excluded by default (pyproject.toml) and
intentionally fail when the API isn't reachable. With the docker stack
up, they hit http://localhost:8000 and exercise the full HTTP path
(sync predict, HTML rendering, SSE if QUEUE=1).
| Path | Source | Use |
|---|---|---|
| datasets/verita-composite/ | VeritaResearch/claim-extraction | default training set (ClaimBuster + PoliClaim + AVeriTeC) |
| datasets/checkthat-2025/ | CLEF CheckThat! 2025 | newer multilingual subjectivity + claim-normalization data |
| datasets/claimify/ | microsoft/claimify-dataset | LLM-generated text, useful for OOD ablation |
| datasets/feverfact/ | aic-factcheck/claim_extraction | Wikipedia atomic-claim extraction |
| datasets/all_binary.csv | aggregated | 21,079 binary-fit rows joined into one CSV |
datasets/README.md is the authoritative provenance + license + format
reference. python datasets/normalize.py re-derives the
normalized*.csv files from the raw sources.
✅ Done:
- Data ingestion and normalization layer
- Model registry + sequential training pipeline
- Per-model evaluation + aggregate comparison renderer
- All 8 models trained and evaluated (full Bell-paper sweep + Ettin)
- FastAPI app: sync + queued/SSE modes
- HTMX UI with server-side rendering
- Docker Compose + Swarm-compatible stack
- 29-test pytest suite + failing-first integration tests
- Headline result: Ettin-150m-ft beats Bell's best encoder; 3-of-4 Bell reproductions match or beat published numbers
📋 Considered, deferred:
- Out-of-domain evaluation against checkthat-2025/
- Calibration (temperature scaling) on the API's confidence field
- Token-level explanation (attention rollout / SHAP)
- Bring up Docker stack and run the integration test suite green
See WALKTHROUGH.md for the demo walk-through and likely Q&A.