A writing tool that helps forecasters and question-authors catch ambiguity, fuzzy resolution criteria, and missing operationalization before a question goes live. Targets the well-known quality problems in platforms like Metaculus, where vague wording leads to disputed resolutions and wasted forecaster attention.
- Paste a draft question (title, optional resolution criteria, background) and get a rubric-driven critique.
- Rubric coverage: resolution criteria clarity, time bounds, operationalization of fuzzy terms, edge cases, source authority, scope drift.
- Inline flagging: every issue is tied to the exact span that triggered it.
- Suggested rewrites: concrete alternative phrasing per issue — accept, edit, or ignore.
- Severity ranking: ranked by likelihood of causing a disputed resolution, not model confidence.
- Per-user history: revisit past critiques and spot recurring patterns in your own writing.
Not a forecaster, not a Metaculus integration, not a team collaboration tool, not a general-purpose grammar checker. Scope is narrow to forecasting-question quality.
- ≥70% recall on real disputed-question issues from a held-out set of 50 Metaculus questions.
- ≤1 false-positive flag per question on clean, never-disputed questions.
- Blind reviewer prefers suggested rewrite over original on ≥70% of test questions.
Backend — Python 3.11+, FastAPI, Anthropic Claude API (Sonnet), Pydantic for structured output.
Frontend — Next.js 14 (App Router) + TypeScript, Tailwind + shadcn/ui, TipTap (or contenteditable) for inline span highlighting and click-to-accept rewrites.
Auth & storage — Clerk (magic link + Google OAuth), Supabase Postgres keyed by Clerk user ID.
Infra — Vercel (frontend), Railway (backend), Upstash Redis (rate limiting), Sentry (errors).
Eval & CI — pytest + custom eval script, 50-question annotated JSONL test set in git, dated eval-history JSON, GitHub Actions failing the PR if recall drops.
Cost controls — 10 req/hr anonymous per-IP, daily spend cap on the Anthropic key, ~4,000 char input cap, kill-switch endpoint.
| Phase | Theme | Output | Exit criteria | Status |
|---|---|---|---|---|
| 1 | Rubric spike | CLI returning structured critique against hand-built rubric | Catches ≥8/10 known-ambiguous questions; findings are specific, not generic | ✅ Met (recall@high 79%, fp@high 0.40 at n=19 labeled questions) |
| 2 | Critique quality & suggestions | Per-issue rewrites + eval harness vs. hand-annotated references | Rewrites rated "meaningfully better" by blind reviewer on ≥70% of 50-question eval set | ✅ Met on internal sample (94% rewrite-better across 51 pairs from n=19 questions); confirm at n=50 before public launch |
| 3 | Web interface | Next.js + FastAPI app with auth, inline-flagged editor, accept/reject rewrites, history page | Paste-to-critique <5s; signup-to-first-critique <30s; history loads across devices | 8 / 14 steps done. Deploys live on Vercel + Railway with Clerk JWT verification + Sentry error tracking + question-field scrubbing. Upstash rate limit / Supabase persistence / frontend Clerk still to integrate |
| 4 | Polish & distribution | Hosted demo, README with examples, rubric/eval writeup | Public for 7 days without spend-cap breach or downtime; ≥1 rubric item fires per session on average | Not started |
- Linter is live end-to-end and deployed. Next.js frontend on Vercel, FastAPI backend on Railway. Backend now verifies Clerk session JWTs via the official
clerk-backend-apiSDK; static-bearer fallback remains for back-compat until the frontend's Clerk integration (step 8) lands. - Held-out set is at n=19 (14 ambiguous + 5 clean) — under the spec's 50-question target. Bulk scraping isn't an option (Metaculus API doesn't expose resolution criteria; web pages are Cloudflare-protected against non-browser clients), so curation runs one tab at a time via a Claude-in-Chrome extractor at ~3-5 min per question. Growing to n=50 is ~1.5-2.5 hours of focused work and is deferred until pre-public-launch. See
backend/README.md. - Phase 1 exit met at high severity (recall@high 79%, fp@high 0.40 — both well past spec targets).
- Phase 2 success metric met on internal sample. Blind review on 51 rewrite pairs scored 94% rewrite-better (target ≥70%). Re-confirmation at n=50 questions before any public launch.
- Phase 3 — Clerk backend auth + Sentry error tracking live. Eight of fourteen Phase 3 steps done (FastAPI wrapper, Clerk JWT verification, Next.js scaffold, TipTap editor, example gallery, Sentry init both ends with question-field scrubbing, Vercel deploy, Railway deploy). All six external service accounts configured. Remaining: Upstash rate-limit middleware, Supabase persistence + history page, frontend Clerk provider.