diff --git a/.claude/skills/e2e-pr/SKILL.md b/.claude/skills/e2e-pr/SKILL.md
new file mode 100644
index 00000000..d0941475
--- /dev/null
+++ b/.claude/skills/e2e-pr/SKILL.md
@@ -0,0 +1,65 @@
+---
+name: e2e-pr
+description: Run the Playwright end-to-end suite against the routes a PR changes and report results. Use when verifying UI behavior for a branch or PR, or checking that a change didn't break key pages.
+---
+
+# e2e-pr
+
+Run the Playwright end-to-end suite against the routes a PR touches, and report results.
+
+## Prerequisites
+
+- `@playwright/test` is installed and `playwright.config.ts` exists at the repo root.
+- The config auto-boots `pnpm dev` (its `webServer` block) and waits for `/about`,
+  so you do NOT need to start a server yourself; one already running locally is reused.
+- In CI the browser comes from `playwright install`; in the hosted sandbox it's
+  the pre-installed Chromium at `/opt/pw-browsers/chromium`. The config handles both.
+
+## Steps
+
+1. **Find changed files** vs the base branch (default `origin/main`):
+   ```
+   git diff --name-only origin/main...HEAD
+   ```
+
+2. **Map changed files to affected routes** (best-effort):
+   - `src/app/(group)/foo/page.tsx` → `/foo` (drop `src/app`, drop `(route-group)`
+     segments, drop the trailing `/page.tsx`)
+   - `src/app/(standard)/page.tsx` → `/`
+   - `layout.tsx` / `template.tsx` changes affect every route beneath them.
+   - Dynamic segments (`[id]`) can't be visited without a concrete value — note
+     them but don't try to test them blindly.
+   - Changes under `src/components/**` are shared and can't be mapped to a single
+     route → treat as "broad" (run the whole suite).
+
+3. **Choose scope:**
+   - If specific route files changed and there are specs covering them, run those:
+     ```
+     pnpm test:e2e -g "<route or feature keyword>"
+     ```
+     or pass specific spec files: `pnpm test:e2e tests/e2e/<file>.spec.ts`
+   - If changes are broad (shared components, layout, config) OR the suite is
+     small, just run everything: `pnpm test:e2e`
+
+4. **Run** the chosen command. The first run cold-compiles routes in `next dev`,
+   so allow time — the config already uses generous timeouts.
+
+5. **Report:**
+   - Pass/fail, number of tests run, and any failures with their error message.
+   - On failure, point to the HTML report (`pnpm test:e2e:report`) and the trace
+     under `test-results/`; both are produced only with `CI` set (html reporter, retries).
+   - Call out any changed routes that have NO e2e coverage as gaps (not failures),
+     so the author can decide whether to add a spec.
+
+## Notes
+
+- Data-dependent routes (homepage, `/analysis/*`) need env/secrets or network
+  mocking — see `tests/e2e/README.md`. Don't add flaky assertions on them.
+- Do NOT auto-write new specs here; that's authoring. This skill runs existing
+  tests. (Use a dedicated authoring step to add coverage.)
+
+## Args
+
+Optional base branch to diff against (default `origin/main`).
+
+Example: `/e2e-pr` or `/e2e-pr staging`
diff --git a/.claude/skills/pr-check/SKILL.md b/.claude/skills/pr-check/SKILL.md
new file mode 100644
index 00000000..a3b56ace
--- /dev/null
+++ b/.claude/skills/pr-check/SKILL.md
@@ -0,0 +1,57 @@
+---
+name: pr-check
+description: Run the full quality gate on the current branch — TypeScript typecheck, ESLint, web/CLI Jest tests, and blueprint validation — then summarize pass/fail. Use before opening a PR, or to check whether a branch is CI-ready.
+---
+
+# pr-check
+
+Run a full quality gate on the current branch before or after a PR is created. Orchestrates type checking, linting, and tests, then produces a structured summary.
+
+## Steps
+
+1. **Identify scope** — get the diff summary:
+   ```
+   git diff --stat origin/main...HEAD
+   ```
+   Note whether changes touch: source code, blueprint files, tests, docs, config.
+
+2. **Run checks in parallel where possible** (report results as each finishes):
+
+   | Check | Command | When to run |
+   |-------|---------|-------------|
+   | TypeScript | `pnpm typecheck` | Always |
+   | Lint | `pnpm lint` | Always |
+   | Web tests | `pnpm test:web` | If `src/app/`, `src/components/`, `src/hooks/`, or `src/point-functions/` changed |
+   | CLI tests | `pnpm test:cli` | If `src/cli/` or `src/lib/` changed |
+   | Blueprint validate | see `blueprint-validate` skill | If any `.yaml`/`.yml`/`.json` blueprint files changed |
+
+3. **Collect results** — for each check record:
+   - Status: ✅ pass / ❌ fail / ⚠️ warnings / ⏭️ skipped (not applicable)
+   - Error/warning count
+   - Key details (first 5 errors max per check to keep output readable)
+
+4. **Produce a summary table:**
+   ```
+   ## PR Check Results — <branch-name>
+
+   | Check         | Status | Details                  |
+   |---------------|--------|--------------------------|
+   | TypeScript    | ✅     | 0 errors                 |
+   | Lint          | ⚠️     | 2 warnings, 0 errors     |
+   | Web tests     | ✅     | 47 passed                |
+   | CLI tests     | ⏭️     | No CLI files changed     |
+   | Blueprints    | ✅     | 1 file validated         |
+
+   **Overall: READY TO MERGE** / **NEEDS FIXES**
+   ```
+
+5. If any check fails, list the specific errors that need to be addressed.
+
+6. If `--comment` is passed as an arg, post this summary as a GitHub PR comment using the available GitHub MCP tools.
+
+## Args
+
+- `--comment`: Post the results as a GitHub PR comment (requires PR number to be detectable from the branch)
+- Base branch (e.g. `main`, `staging`): defaults to `origin/main`
+
+Example: `/pr-check` or `/pr-check --comment` or `/pr-check staging`
diff --git a/.claude/skills/pr-describe/SKILL.md b/.claude/skills/pr-describe/SKILL.md
new file mode 100644
index 00000000..eafafa41
--- /dev/null
+++ b/.claude/skills/pr-describe/SKILL.md
@@ -0,0 +1,266 @@
+---
+name: pr-describe
+description: Generate a structured PR description from the branch diff (Summary, Changes, Test plan, Risks, Related Issues), with optional static-gated before/after screenshots for visual changes. Use when writing or updating a pull request description.
+---
+
+# pr-describe
+
+Generate a high-quality PR description from the branch diff. The description
+itself is the core deliverable and is always produced. Before/after screenshots are a
+**bonus that only activates on static, visual PRs** — they're strictly
+time-boxed, **fail soft** (any problem → clean text-only description), and are
+**off by default for data-driven routes** (see "Default screenshot scope").
+
+## Output contract
+
+- If a PR already exists for the current branch → update its **title and body**
+  (GitHub MCP `update_pull_request`). Only change the title if it doesn't already
+  follow the convention below.
+- If no PR exists → print the finished **title + markdown body** for the user.
+  Do **NOT** open a PR unless the user explicitly asked for one.
+
+## PR title convention
+
+Produce a properly tagged title in **Conventional Commits** form:
+
+```
+type(scope): imperative summary
+```
+
+- **type** — infer from the dominant change: `feat` (new capability), `fix`
+  (bug fix), `docs`, `test` (tests/test infra), `ci` (CI/workflows), `refactor`,
+  `perf`, `chore`, `build`, `style`.
+- **scope** — optional, a short area derived from the diff (e.g. `e2e`, `auth`,
+  `header`, `cli`, `pr-eval`). Omit if it spans many areas.
+- **summary** — imperative mood, lower-case start, **no trailing period**, aim
+  for ≤ ~70 chars total.
+- If the change mixes types, pick the one that best describes the user-facing
+  intent (a feature with its tests is still `feat`).
+
+Examples: `feat(header): collapse nav into a menu under 380px` ·
+`test(e2e): add Playwright smoke suite` · `fix(cli): handle empty blueprint id`.
+
+## Security guardrails (NON-NEGOTIABLE — this repo is public)
+
+Screenshots publish whatever they render. On a **public** repo, anything you
+commit is world-readable via raw URLs **permanently** (git history, forks, CDN
+cache) — deleting the branch does NOT undo it. So:
+
+1. **Route denylist — never screenshot these, no exceptions:**
+   `/admin*`, `/api*`, and any authenticated/session/account route. If a changed
+   route matches, skip it and note "omitted for safety", do not capture it.
+2. **Capture only against a secret-free environment.** Use the local `pnpm dev`
+   with NO real storage/API secrets wired up, so data-driven pages render
+   empty/mock and there is nothing sensitive in frame. Do **not** screenshot a
+   preview deploy that is backed by real data/secrets and then commit it here.
+3. **Treat committed images as permanent and public.** Never tell the user they
+   can "delete before merge" to undo exposure — that is false.
+4. **If a capture could contain anything sensitive, do NOT commit it to the
+   branch.** Prefer CI-artifact hosting (auto-expiring, but on a public repo still downloadable by any signed-in user) or a
+   PR comment. Committing to the public branch is only for plainly non-sensitive,
+   static UI.
+5. **Surface images for human review before pushing.** Show the user what was
+   captured and let them confirm; never push blind.
+
+---
+
+## Phase 1 — Analyze the diff (always runs)
+
+```
+git fetch origin <base> --quiet
+git diff --stat origin/<base>...HEAD
+git diff origin/<base>...HEAD
+git log origin/<base>..HEAD --format='%s%n%b'
+```
+
+Draft the description with these sections:
+
+- **Summary** — 1-3 sentences on what changed and why.
+- **Motivation / context** — the problem or request behind it.
+- **Changes** — bulleted, grouped by area (UI, API, CLI, tests, docs…).
+- **Test plan** — what you ran (`pnpm typecheck`, `pnpm lint`, `pnpm test:web`,
+  `pnpm test:e2e`) and the result, plus manual steps if any.
+- **Risks / rollback** — the blast radius and how to undo. Call out anything
+  reviewers should scrutinize (data migrations, auth/permissions, external API
+  or cost impact, breaking changes, shared components touched). State how to
+  revert (usually "revert this PR" — but note it if a migration or deploy step
+  makes rollback non-trivial). If the change is low-risk and self-contained, say
+  so in one line rather than padding.
+- **Screenshots** — filled in by Phase 4 if applicable, else omitted.
+- **Related Issues** — link tickets and related work. Use `Closes #123` for
+  issues this PR resolves (auto-closes them on merge), `Refs #456` for related
+  PRs/issues, plus any relevant docs. Omit the section entirely if there's
+  nothing to link — don't invent issue numbers.
+
+### Worked example
+
+```md
+## Summary
+Tightens the site header on mobile so the nav no longer wraps under 380px.
+
+## Motivation / context
+The logo + nav links overflowed on small screens, pushing the theme toggle
+off-canvas. Reported in #41.
+
+## Changes
+- **UI:** right-align nav links and shrink logo on `sm` breakpoint (`Header.tsx`)
+- **UI:** swap hover underline for opacity to avoid layout shift
+- **Tests:** add an e2e assertion that the header is visible at 360px width
+
+## Test plan
+- `pnpm typecheck` ✅  ·  `pnpm lint` ✅  ·  `pnpm test:e2e` ✅ (3 passed)
+- Manually checked /about at 320 / 375 / 768px.
+
+## Risks / rollback
+Low-risk, CSS-only and self-contained. Revert this PR to undo.
+
+## Screenshots
+(before/after table inserted by Phase 4)
+
+## Related Issues
+Closes #41
+```
+
+## Phase 2 — Does the screenshot step apply?
+
+Visual-change heuristic — TRUE if any changed file matches:
+- `src/app/**/(page|layout|template).tsx`
+- `src/components/**/*.tsx`
+- `**/*.css`
+
+If FALSE → skip to Phase 5 (text-only). If TRUE → continue, subject to the
+default scope and give-up policy below.
+
+### Default screenshot scope (practicality gate)
+
+Screenshots only pay off on **static, secret-free routes**. Data-driven routes
+(homepage, `/analysis/*`, `/pairs`, `/latest`, `/model/*`, etc.) render
+empty/mock against the secret-free dev env — an uninformative shot — and cost a
+slow capture. So **by default, only capture known-static routes**:
+
+- **Default static allowlist:** `/about`, `/what-is-an-eval`. (Extend this list
+  as more static pages are confirmed safe + stable.)
+- Any mapped route **not** on the allowlist is **skipped by default** with a note:
+  _"skipped: data-driven route (pass `--routes` to force)"_.
+- The user can **override** with `--routes /foo,/bar` to force specific routes
+  (e.g. against a preview deploy with real data, where they accept the tradeoff).
+- The **security denylist always wins** over any override — `/admin*`, `/api*`,
+  and auth routes are never captured even if explicitly passed.
+
+If, after this gate, there are **no routes left to shoot** → skip to Phase 5
+(text-only) with a one-line note. Don't boot servers for nothing.
+
+## Phase 3 — Capture screenshots (time-boxed, fail-soft)
+
+> **GIVE-UP POLICY — bail to text-only (Phase 5) and add a one-line note if ANY hold:**
+> - Total screenshot phase exceeds **~6 minutes** wall-clock.
+> - A dev server fails to become ready within **120s**.
+> - Zero routes can be mapped (e.g. only shared components / dynamic routes changed
+>   and the user gave no route to shoot).
+> - The screenshot script captures nothing (`scripts/pr-screenshots.mjs` exits non-zero).
+> - Any unexpected error. Never let screenshots block the description.
+
+**3a. Determine routes** (cap at the 3 most relevant; note any you dropped).
+Apply these filters **in order**:
+1. Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules).
+2. **Security denylist (always, non-overridable):** drop any `/admin*`, `/api*`,
+   or authenticated route; note as "omitted for safety".
+3. **Default static gate (see Phase 2):** unless the user passed `--routes`, drop
+   anything not on the static allowlist; note as "skipped: data-driven route".
+   If `--routes` was passed, use exactly those (still subject to step 2).
+4. Skip dynamic (`[id]`) routes unless the user supplies a concrete URL.
+5. Shared-component-only change with nothing left → ask the user for 1-2
+   representative static routes, or skip with a note. Don't guess across the app.
+- Confirm the dev server has **no real storage/API secrets** in its env before
+  capturing (guardrail #2). If you can't confirm that, skip screenshots.
+- If no routes survive the filters → skip to Phase 5 (text-only).
+
+Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`.
+
+**3b. Capture AFTER (current branch, HEAD).** Reuse a running dev server on
+`:3172` if present; otherwise start `pnpm dev` first, since the script does not boot one. Then:
+```
+node scripts/pr-screenshots.mjs --base-url http://localhost:3172 \
+  --routes "$ROUTES" --out .github/pr-media/$SLUG --label after
+```
+
+**3c. Capture BEFORE (base branch) in an isolated worktree** so the working tree
+is untouched. Symlink `node_modules` to avoid a slow reinstall (valid as long as
+the PR didn't change dependencies — if it did, note that the "before" shot may
+be approximate):
+```
+WT=$(mktemp -d)
+git worktree add --detach "$WT" origin/<base>
+ln -s "$PWD/node_modules" "$WT/node_modules"
+( cd "$WT" && pnpm exec next dev -p 3173 ) & DEV_PID=$!   # PID of the background job
+# poll http://localhost:3173/about until it responds (cap 120s)
+node scripts/pr-screenshots.mjs --base-url http://localhost:3173 \
+  --routes "$ROUTES" --out .github/pr-media/$SLUG --label before
+# then ALWAYS clean up:
+kill "$DEV_PID"; git worktree remove --force "$WT"
+```
+
+If BEFORE fails but AFTER succeeded, proceed with after-only + a note.
+
+## Phase 4 — Host & embed
+
+> **Before any commit: re-confirm the captures are non-sensitive (guardrails
+> #1–#4) and show them to the user for a quick look (guardrail #5).** Committing
+> to a public branch is permanent and irreversible. If there is any doubt about
+> the contents, use CI-artifact hosting instead (see "Sensitive captures" below)
+> or skip embedding entirely.
+
+GitHub renders images only from URLs, so the PNGs must be committed and pushed
+before they resolve. For plainly non-sensitive, static UI, commit them to the PR
+branch and reference raw URLs:
+
+```
+git add .github/pr-media/$SLUG
+git commit -m "Add PR before/after screenshots"
+git push
+```
+
+Derive `OWNER/REPO` from `git remote get-url origin` and `BRANCH` from
+`git rev-parse --abbrev-ref HEAD`. For each route build a row:
+
+```md
+### Screenshots
+
+#### `/about`
+| Before | After |
+|--------|-------|
+| ![before](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-before.png) | ![after](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-after.png) |
+```
+
+(Omit the "Before" cell for routes where only the after shot exists, e.g. new pages.)
+
+> **Tradeoff:** this commits PNGs into the PR branch and they show in the diff.
+> On a public repo this is **permanent and world-readable** — do NOT claim they
+> can be "deleted before merge" to undo exposure. To keep them out of the PR's
+> own diff you can use a dedicated orphan `pr-media` branch, but that is still
+> public; it changes visibility-in-diff, not exposure.
+
+**Sensitive captures → don't commit; use CI artifacts instead.** If a shot could
+contain anything non-public, skip the commit and have the e2e workflow upload the
+images via `actions/upload-artifact` (auto-expiring; on a public repo, signed-in users with read access can still download them). Link the
+run/artifact from the PR body rather than embedding a public raw URL.
+
+## Phase 5 — Finalize
+
+- Assemble the full body (Phase 1 sections + Phase 4 screenshots if any).
+- If a PR exists → update its body via GitHub MCP. Else → print the markdown.
+- If screenshots were skipped, include one honest line, e.g.
+  _"Screenshots skipped: change is API-only"_ or _"…: dev server didn't boot in time"_.
+  Never silently drop them without saying why.
+
+## Args
+
+- Base branch to diff against (default `origin/main`).
+- `--routes /foo,/bar` — **override the default static-only gate** and capture
+  exactly these routes (still subject to the security denylist). Use this for
+  shared-component changes or when shooting a preview deploy with real data.
+
+By default (no `--routes`), only known-static routes (`/about`,
+`/what-is-an-eval`) are captured; data-driven routes are skipped with a note.
+
+Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /pairs,/latest`
diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml
new file mode 100644
index 00000000..ab4669cb
--- /dev/null
+++ b/.github/workflows/e2e.yml
@@ -0,0 +1,49 @@
+name: E2E Tests
+
+on:
+  pull_request:
+    paths-ignore:
+      - '**/*.md'
+      - 'docs/**'
+  push:
+    branches: [main]
+    paths-ignore:
+      - '**/*.md'
+      - 'docs/**'
+
+concurrency:
+  group: e2e-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  e2e:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 18
+          cache: pnpm
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Install Playwright Chromium
+        run: pnpm exec playwright install --with-deps chromium
+
+      - name: Run E2E tests
+        run: pnpm test:e2e
+        env:
+          CI: 'true'
+
+      - name: Upload Playwright report
+        if: ${{ !cancelled() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: playwright-report
+          path: playwright-report/
+          retention-days: 14
diff --git a/.gitignore b/.gitignore
index aecac04c..0c90d529 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,7 +16,8 @@
 catechism_dump.txt
 
 **/node_modules/**
-.claude
+.claude/*
+!.claude/skills
 
 # testing
 /coverage
@@ -24,6 +25,12 @@ catechism_dump.txt
 .swc
 .auth
 
+# playwright
+/test-results/
+/playwright-report/
+/blob-report/
+/playwright/.cache/
+
 # next.js
 /.next/
 /out/
diff --git a/package.json b/package.json
index c30e12b1..d0206fd3 100644
--- a/package.json
+++ b/package.json
@@ -20,6 +20,10 @@
     "test": "pnpm test:web && pnpm test:cli",
     "test:web": "jest --config jest.config.js",
     "test:cli": "node --experimental-vm-modules node_modules/jest/bin/jest.js --config jest.config.cli.js",
+    "test:e2e": "playwright test",
+    "test:e2e:ui": "playwright test --ui",
+    "test:e2e:report": "playwright show-report",
+    "pr:screenshots": "node scripts/pr-screenshots.mjs",
     "test:personality": "tsx src/cli/personality-test.ts",
     "compare:personality": "tsx src/cli/compare-personality.ts",
     "test:preference": "tsx src/cli/preference-test.ts",
@@ -129,6 +133,7 @@
   "devDependencies": {
     "@jest/globals": "^30.0.5",
     "@next/bundle-analyzer": "^15.4.4",
+    "@playwright/test": "1.55.0",
     "@sentry/cli": "^2.57.0",
     "@tailwindcss/postcss": "^4.1.11",
     "@tailwindcss/typography": "^0.5.16",
diff --git a/playwright.config.ts b/playwright.config.ts
new file mode 100644
index 00000000..dfa89006
--- /dev/null
+++ b/playwright.config.ts
@@ -0,0 +1,56 @@
+import { defineConfig, devices } from '@playwright/test';
+import { existsSync } from 'node:fs';
+
+/**
+ * In the hosted agent sandbox, Chromium is pre-installed at this path and
+ * `playwright install` is disabled. When the binary is present we point
+ * Playwright at it directly; otherwise (local dev, CI) we let Playwright use
+ * its own bundled browser installed via `playwright install chromium`.
+ */
+const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium';
+const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined;
+
+const PORT = 3172;
+const BASE_URL = process.env.E2E_BASE_URL ?? `http://localhost:${PORT}`;
+
+export default defineConfig({
+  testDir: './tests/e2e',
+  fullyParallel: true,
+  forbidOnly: !!process.env.CI,
+  retries: process.env.CI ? 2 : 0,
+  workers: process.env.CI ? 1 : undefined,
+  reporter: process.env.CI
+    ? [['list'], ['html', { open: 'never' }]]
+    : [['list']],
+  timeout: 60_000,
+  expect: { timeout: 15_000 },
+  use: {
+    baseURL: BASE_URL,
+    navigationTimeout: 45_000,
+    trace: 'on-first-retry',
+    screenshot: 'only-on-failure',
+    video: 'retain-on-failure',
+  },
+  projects: [
+    {
+      name: 'chromium',
+      use: { ...devices['Desktop Chrome'], launchOptions: { executablePath } },
+    },
+  ],
+  /**
+   * When E2E_BASE_URL is set we assume the app is already running (e.g. a
+   * production build or a remote deploy) and skip booting a dev server.
+   * Otherwise boot `pnpm dev`; the readiness probe hits /about — a static,
+   * dependency-free route — which also pre-compiles it so the first test is fast.
+   */
+  webServer: process.env.E2E_BASE_URL
+    ? undefined
+    : {
+        command: 'pnpm dev',
+        url: `http://localhost:${PORT}/about`,
+        reuseExistingServer: !process.env.CI,
+        timeout: 180_000,
+        stdout: 'pipe',
+        stderr: 'pipe',
+      },
+});
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 2794962b..15eb0e19 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -82,7 +82,7 @@ importers:
         version: 1.2.7(@types/react-dom@19.1.6(@types/react@19.1.8))(@types/react@19.1.8)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       '@sentry/nextjs':
         specifier: ^10.21.0
-        version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)
+        version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)
       '@sentry/node':
         specifier: ^10.21.0
         version: 10.21.0
@@ -166,7 +166,7 @@ importers:
         version: 3.1.0
       next:
         specifier: 15.5.9
-        version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
+        version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       next-themes:
         specifier: ^0.4.6
         version: 0.4.6(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
@@ -237,6 +237,9 @@ importers:
       '@next/bundle-analyzer':
         specifier: ^15.4.4
         version: 15.4.4
+      '@playwright/test':
+        specifier: 1.55.0
+        version: 1.55.0
       '@sentry/cli':
         specifier: ^2.57.0
         version: 2.57.0(encoding@0.1.13)
@@ -1825,6 +1828,11 @@ packages:
     resolution: {integrity: sha512-YLT9Zo3oNPJoBjBc4q8G2mjU4tqIbf5CEOORbUUr48dCD9q3umJ3IPlVqOqDakPfd2HuwccBaqlGhN4Gmr5OWg==}
     engines: {node: ^12.20.0 || ^14.18.0 || >=16.0.0}
 
+  '@playwright/test@1.55.0':
+    resolution: {integrity: sha512-04IXzPwHrW69XusN/SIdDdKZBzMfOT9UNT/YiJit/xpy2VuAoB8NHc8Aplb96zsWDddLnbkPL3TsmrS04ZU2xQ==}
+    engines: {node: '>=18'}
+    hasBin: true
+
   '@polka/url@1.0.0-next.29':
     resolution: {integrity: sha512-wwQAWhWSuHaag8c4q/KN/vCoeOJYshAIvMQwD4GpSb3OiZklFfvAgmj0VCBBImRpuF/aFgIRzllXlVX93Jevww==}
 
@@ -4558,6 +4566,11 @@ packages:
   fs.realpath@1.0.0:
     resolution: {integrity: sha512-OO0pH2lK6a0hZnAdau5ItzHPI6pUlvI7jMVnxUQRtw4owF2wk8lOSabtGDCTP4Ggrg2MbGnWO9X8K1t4+fGMDw==}
 
+  fsevents@2.3.2:
+    resolution: {integrity: sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==}
+    engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0}
+    os: [darwin]
+
   fsevents@2.3.3:
     resolution: {integrity: sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==}
     engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0}
@@ -5885,6 +5898,16 @@ packages:
     resolution: {integrity: sha512-HRDzbaKjC+AOWVXxAU/x54COGeIv9eb+6CkDSQoNTt4XyWoIJvuPsXizxu/Fr23EiekbtZwmh1IcIG/l/a10GQ==}
     engines: {node: '>=8'}
 
+  playwright-core@1.55.0:
+    resolution: {integrity: sha512-GvZs4vU3U5ro2nZpeiwyb0zuFaqb9sUiAJuyrWpcGouD8y9/HLgGbNRjIph7zU9D3hnPaisMl9zG9CgFi/biIg==}
+    engines: {node: '>=18'}
+    hasBin: true
+
+  playwright@1.55.0:
+    resolution: {integrity: sha512-sdCWStblvV1YU909Xqx0DhOjPZE4/5lJsIS84IfN9dAZfcl/CIZ5O8l3o0j7hPMjDvqoTF8ZUcc+i/GL5erstA==}
+    engines: {node: '>=18'}
+    hasBin: true
+
   postcss-import@15.1.0:
     resolution: {integrity: sha512-hpr+J05B2FVYUAXHeK1YyI267J/dDDhMU6B6civm8hSY1jYJnBXxzKDKDswzJmtLHryrjhnDjqqp/49t8FALew==}
     engines: {node: '>=14.0.0'}
@@ -9217,6 +9240,10 @@ snapshots:
 
   '@pkgr/core@0.2.7': {}
 
+  '@playwright/test@1.55.0':
+    dependencies:
+      playwright: 1.55.0
+
   '@polka/url@1.0.0-next.29': {}
 
   '@prisma/instrumentation@6.15.0(@opentelemetry/api@1.9.0)':
@@ -9997,7 +10024,7 @@ snapshots:
 
   '@sentry/core@10.21.0': {}
 
-  '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)':
+  '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)':
     dependencies:
       '@opentelemetry/api': 1.9.0
       '@opentelemetry/semantic-conventions': 1.37.0
@@ -10011,7 +10038,7 @@ snapshots:
       '@sentry/vercel-edge': 10.21.0
       '@sentry/webpack-plugin': 4.5.0(encoding@0.1.13)(webpack@5.102.1)
       chalk: 3.0.0
-      next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
+      next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       resolve: 1.22.8
       rollup: 4.44.2
       stacktrace-parser: 0.1.11
@@ -12331,6 +12358,9 @@ snapshots:
 
   fs.realpath@1.0.0: {}
 
+  fsevents@2.3.2:
+    optional: true
+
   fsevents@2.3.3:
     optional: true
 
@@ -13818,7 +13848,7 @@ snapshots:
       react: 19.1.0
       react-dom: 19.1.0(react@19.1.0)
 
-  next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0):
+  next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0):
     dependencies:
       '@next/env': 15.5.9
       '@swc/helpers': 0.5.15
@@ -13837,6 +13867,7 @@ snapshots:
       '@next/swc-win32-arm64-msvc': 15.5.7
       '@next/swc-win32-x64-msvc': 15.5.7
       '@opentelemetry/api': 1.9.0
+      '@playwright/test': 1.55.0
       sharp: 0.34.3
     transitivePeerDependencies:
       - '@babel/core'
@@ -14068,6 +14099,14 @@ snapshots:
     dependencies:
       find-up: 4.1.0
 
+  playwright-core@1.55.0: {}
+
+  playwright@1.55.0:
+    dependencies:
+      playwright-core: 1.55.0
+    optionalDependencies:
+      fsevents: 2.3.2
+
   postcss-import@15.1.0(postcss@8.5.6):
     dependencies:
       postcss: 8.5.6
diff --git a/scripts/pr-screenshots.mjs b/scripts/pr-screenshots.mjs
new file mode 100644
index 00000000..8688a48b
--- /dev/null
+++ b/scripts/pr-screenshots.mjs
@@ -0,0 +1,80 @@
+// Capture full-page screenshots of routes, for PR before/after comparisons.
+//
+// Usage:
+//   node scripts/pr-screenshots.mjs \
+//     --base-url http://localhost:3172 \
+//     --routes /about,/what-is-an-eval \
+//     --out .github/pr-media/my-branch \
+//     --label after
+//
+// Designed to FAIL SOFT: a route that errors or times out is skipped (and
+// logged), not fatal, so the orchestrating skill can still produce a partial
+// result. Exits non-zero only if NOTHING was captured.
+//
+// Reuses the @playwright/test Chromium (no extra dependency). In the hosted
+// sandbox it launches the pre-installed browser at /opt/pw-browsers/chromium.
+
+import { chromium } from '@playwright/test';
+import { existsSync, mkdirSync } from 'node:fs';
+import path from 'node:path';
+
+function arg(name, fallback) {
+  const i = process.argv.indexOf(`--${name}`);
+  return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : fallback;
+}
+
+const baseUrl = arg('base-url', 'http://localhost:3172').replace(/\/$/, '');
+const routes = arg('routes', '/')
+  .split(',')
+  .map((r) => r.trim())
+  .filter(Boolean);
+const outDir = arg('out', '.github/pr-media');
+const label = arg('label', 'after');
+const viewport = {
+  width: Number(arg('width', '1280')),
+  height: Number(arg('height', '800')),
+};
+const perRouteTimeoutMs = Number(arg('timeout', '45000'));
+
+const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium';
+const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined;
+
+function slug(route) {
+  const s = route.replace(/^\/+|\/+$/g, '').replace(/[^a-zA-Z0-9._-]+/g, '_');
+  return s || 'home';
+}
+
+mkdirSync(outDir, { recursive: true });
+
+const browser = await chromium.launch({ executablePath });
+const context = await browser.newContext({ viewport });
+const results = [];
+
+for (const route of routes) {
+  const page = await context.newPage();
+  const file = path.join(outDir, `${slug(route)}-${label}.png`);
+  try {
+    // 'load' (not 'networkidle') — Next.js dev keeps an HMR websocket open,
+    // so networkidle would never settle.
+    await page.goto(`${baseUrl}${route}`, {
+      waitUntil: 'load',
+      timeout: perRouteTimeoutMs,
+    });
+    await page.waitForTimeout(750); // let fonts/animations settle
+    await page.screenshot({ path: file, fullPage: true });
+    results.push({ route, file, ok: true });
+    console.log(`OK   ${route} -> ${file}`);
+  } catch (err) {
+    results.push({ route, ok: false, error: String(err?.message || err) });
+    console.log(`FAIL ${route}: ${err?.message || err}`);
+  } finally {
+    await page.close();
+  }
+}
+
+await context.close();
+await browser.close();
+
+const ok = results.filter((r) => r.ok).length;
+console.log(`\n${ok}/${routes.length} screenshots captured in ${outDir}`);
+process.exit(ok > 0 ? 0 : 1);
diff --git a/tests/e2e/README.md b/tests/e2e/README.md
new file mode 100644
index 00000000..bd7c1179
--- /dev/null
+++ b/tests/e2e/README.md
@@ -0,0 +1,51 @@
+# End-to-end tests
+
+Playwright e2e tests for the Weval web app.
+
+## Running
+
+```bash
+pnpm test:e2e          # run the suite (auto-boots `pnpm dev` on :3172)
+pnpm test:e2e:ui       # interactive UI mode
+pnpm test:e2e:report   # open the last HTML report (written only when CI is set)
+```
+
+You don't need to start the dev server yourself — `playwright.config.ts` has a
+`webServer` block that boots `pnpm dev` and waits for `/about` to respond. If a
+dev server is already running on `:3172` it is reused (locally).
+
+To run against an already-running app (e.g. a production build or a deployed
+preview) instead of booting dev:
+
+```bash
+E2E_BASE_URL=https://your-preview.example.com pnpm test:e2e
+```
+
+## Browser binary
+
+- **Local / CI:** Playwright uses its own bundled Chromium. Install it once with
+  `pnpm exec playwright install --with-deps chromium`.
+- **Hosted agent sandbox:** Chromium is pre-installed at `/opt/pw-browsers/chromium`. The
+  config auto-detects it (`executablePath`) and never downloads.
+
+## What's safe to test here
+
+Smoke tests deliberately target **statically rendered, dependency-free routes**
+(`/about`, `/what-is-an-eval`, …) so they pass in CI without any secrets.
+
+Routes that read from storage (S3) or call external LLM APIs — the homepage,
+`/analysis/*`, `/latest`, etc. — will be slow or error without env/network. To
+cover those, either:
+
+- provide the relevant env vars (see `.env.template`), or
+- intercept network calls with `page.route(...)` and serve fixtures.
+
+Keep flaky, data-dependent assertions out of the default suite.
+
+## Conventions
+
+- Prefer role/text locators (`getByRole`, `getByText`) and `a[href*="…"]` over
+  brittle CSS/nth-child selectors.
+- Never use hard `waitForTimeout` for synchronization — rely on web-first
+  assertions (`await expect(locator).toBeVisible()`), which auto-wait.
+- Add a `data-testid` to a component only when no accessible/role selector works.
diff --git a/tests/e2e/smoke.spec.ts b/tests/e2e/smoke.spec.ts
new file mode 100644
index 00000000..10da7e08
--- /dev/null
+++ b/tests/e2e/smoke.spec.ts
@@ -0,0 +1,25 @@
+import { test, expect } from '@playwright/test';
+
+/**
+ * Smoke tests target dependency-free, statically rendered routes so they pass
+ * in CI without storage/API secrets. Data-driven routes (the homepage,
+ * /analysis, etc.) hit storage and external LLM APIs — to test those, provide
+ * env vars or mock the network first. See tests/e2e/README.md.
+ */
+test.describe('smoke', () => {
+  test('about page renders its title and key content', async ({ page }) => {
+    await page.goto('/about');
+
+    await expect(page).toHaveTitle(/About Weval/i);
+    await expect(
+      page.getByRole('heading', { name: /what are evaluations\?/i }),
+    ).toBeVisible();
+  });
+
+  test('about page links out to the Collective Intelligence Project', async ({ page }) => {
+    await page.goto('/about');
+
+    const cipLink = page.locator('a[href*="cip.org"]').first();
+    await expect(cipLink).toBeVisible();
+  });
+});