From 31b225170305915418fab99fdcf26760fff79f46 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 18:10:07 +0000
Subject: [PATCH 01/11] Add agent skills for PR testing and evaluation

Five Claude Code skills for quality-gating PRs in this Next.js/TypeScript project:

- test-pr: run Jest tests scoped to files changed vs main
- typecheck-pr: tsc --noEmit, surfaces errors new to the branch
- lint-pr: Next.js ESLint on changed files, with optional --fix
- blueprint-validate: validates Weval blueprint YAML structure and staging limits
- pr-check: orchestrates all checks and produces a structured summary table

Also updates .gitignore to use .claude/* + !.claude/skills so skill files are tracked while the rest of .claude/ stays ignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/blueprint-validate.md | 45 ++++++++++++++++++++++++
 .claude/skills/lint-pr.md            | 34 ++++++++++++++++++
 .claude/skills/pr-check.md           | 52 ++++++++++++++++++++++++++++
 .claude/skills/test-pr.md            | 37 ++++++++++++++++++++
 .claude/skills/typecheck-pr.md       | 30 ++++++++++++++++
 .gitignore                           |  1 +
 6 files changed, 199 insertions(+)
 create mode 100644 .claude/skills/blueprint-validate.md
 create mode 100644 .claude/skills/lint-pr.md
 create mode 100644 .claude/skills/pr-check.md
 create mode 100644 .claude/skills/test-pr.md
 create mode 100644 .claude/skills/typecheck-pr.md

diff --git a/.claude/skills/blueprint-validate.md b/.claude/skills/blueprint-validate.md
new file mode 100644
index 0000000..aff3139
--- /dev/null
+++ b/.claude/skills/blueprint-validate.md
@@ -0,0 +1,45 @@
+# blueprint-validate
+
+Validate Weval blueprint YAML/JSON files changed in the current branch. This mirrors Layer 1 of the three-layer PR evaluation system described in `docs/PR_EVALUATION_SETUP.md`.
+
+## Steps
+
+1. Find changed blueprint files:
+   ```
+   git diff --name-only origin/main...HEAD -- 'examples/**' '*.yaml' '*.yml' '*.json'
+   ```
+   Focus on files in `examples/blueprints/` or any `.yaml`/`.json` blueprint files.
+
+2. For each changed blueprint file, validate:
+
+   **Structural checks:**
+   - Parse YAML/JSON — if it fails to parse, report the syntax error immediately
+   - Required top-level fields: `id`, `prompts` (check `docs/BLUEPRINT_FORMAT.md` for full schema)
+   - `id` must be a non-empty string
+   - `prompts` must be a non-empty array
+   - Each prompt must have the required fields
+
+   **Size/cost checks (staging limits from Layer 2):**
+   - Warn if prompt count > 10 (would be auto-trimmed in staging eval)
+   - Warn if model list is large (only CORE models supported in staging)
+   - Note: these are warnings, not errors — production evals have no limits
+
+   **Security check:**
+   - If the file is under `blueprints/users/{username}/`, the username in the path must match the PR author
+   - Get the PR author from `git log --format='%ae' -1` or from GitHub context if available
+
+3. For any blueprint that passes validation, summarize:
+   - Number of prompts
+   - Models targeted
+   - Estimated staging evaluation scope (after auto-trim if applicable)
+
+4. Report:
+   - ✅ or ❌ per file with specific error details
+   - Warnings for staging-limit exceedances
+   - A final pass/fail verdict
+
+## Args
+
+Optional: path to a specific blueprint file to validate instead of scanning changed files.
+
+Example: `/blueprint-validate` or `/blueprint-validate examples/blueprints/my-blueprint.yaml`
diff --git a/.claude/skills/lint-pr.md b/.claude/skills/lint-pr.md
new file mode 100644
index 0000000..8d1c0da
--- /dev/null
+++ b/.claude/skills/lint-pr.md
@@ -0,0 +1,34 @@
+# lint-pr
+
+Run Next.js ESLint on files changed in the current branch and report new lint errors.
+
+## Steps
+
+1. Find changed JS/TS/TSX files:
+   ```
+   git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx'
+   ```
+
+2. Run lint on the changed files only (faster than full lint):
+   ```
+   pnpm exec next lint --file <file1> --file <file2> ...
+   ```
+   If there are more than 20 changed files, run the full lint instead:
+   ```
+   pnpm lint
+   ```
+
+3. Report:
+   - Pass/fail overall
+   - Each error and warning with file path, line number, rule name, and message
+   - Distinguish errors (must fix) from warnings (should fix)
+   - If zero issues: confirm clean
+
+4. Do NOT auto-fix lint errors unless the user explicitly passes `--fix` as an arg.
+   If `--fix` is passed, run `pnpm lint:fix` on the changed files and show a diff of what changed.
+
+## Args
+
+Optional: `--fix` to automatically apply safe lint fixes.
+
+Example: `/lint-pr` or `/lint-pr --fix`
diff --git a/.claude/skills/pr-check.md b/.claude/skills/pr-check.md
new file mode 100644
index 0000000..c66b2d4
--- /dev/null
+++ b/.claude/skills/pr-check.md
@@ -0,0 +1,52 @@
+# pr-check
+
+Run a full quality gate on the current branch before or after a PR is created. Orchestrates type checking, linting, and tests, then produces a structured summary.
+
+## Steps
+
+1. **Identify scope** — get the diff summary:
+   ```
+   git diff --stat origin/main...HEAD
+   ```
+   Note whether changes touch: source code, blueprint files, tests, docs, config.
+
+2. **Run checks in parallel where possible** (report results as each finishes):
+
+   | Check | Command | When to run |
+   |-------|---------|-------------|
+   | TypeScript | `pnpm typecheck` | Always |
+   | Lint | `pnpm lint` | Always |
+   | Web tests | `pnpm test:web` | If `src/app/`, `src/components/`, `src/hooks/`, or `src/point-functions/` changed |
+   | CLI tests | `pnpm test:cli` | If `src/cli/` or `src/lib/` changed |
+   | Blueprint validate | see `blueprint-validate` skill | If any `.yaml`/`.yml`/`.json` blueprint files changed |
+
+3. **Collect results** — for each check record:
+   - Status: ✅ pass / ❌ fail / ⚠️ warnings / ⏭️ skipped (not applicable)
+   - Error/warning count
+   - Key details (first 5 errors max per check to keep output readable)
+
+4. **Produce a summary table:**
+   ```
+   ## PR Check Results — <branch-name>
+
+   | Check         | Status | Details                  |
+   |---------------|--------|--------------------------|
+   | TypeScript    | ✅     | 0 errors                 |
+   | Lint          | ⚠️     | 2 warnings, 0 errors     |
+   | Web tests     | ✅     | 47 passed                |
+   | CLI tests     | ⏭️     | No CLI files changed     |
+   | Blueprints    | ✅     | 1 file validated         |
+
+   **Overall: READY TO MERGE** / **NEEDS FIXES**
+   ```
+
+5. If any check fails, list the specific errors that need to be addressed.
+
+6. If `--comment` is passed as an arg, post this summary as a GitHub PR comment using the available GitHub MCP tools.
+
+## Args
+
+- `--comment`: Post the results as a GitHub PR comment (requires PR number to be detectable from the branch)
+- Base branch (e.g. `main`, `staging`): defaults to `origin/main`
+
+Example: `/pr-check` or `/pr-check --comment` or `/pr-check staging`
diff --git a/.claude/skills/test-pr.md b/.claude/skills/test-pr.md
new file mode 100644
index 0000000..1be1953
--- /dev/null
+++ b/.claude/skills/test-pr.md
@@ -0,0 +1,37 @@
+# test-pr
+
+Run Jest tests for files changed in the current branch compared to main.
+
+## Steps
+
+1. Find changed files:
+   ```
+   git diff --name-only origin/main...HEAD
+   ```
+
+2. Identify which changed files have corresponding tests or ARE test files. Test files live in:
+   - Same directory as source with `.test.ts` / `.test.tsx` suffix
+   - `src/point-functions/` tests are colocated
+   - Jest configs: `jest.config.js` (web) and `jest.config.cli.js` (CLI)
+
+3. Decide which test suite(s) to run:
+   - If changes touch `src/app/`, `src/components/`, `src/hooks/` → web suite: `pnpm test:web`
+   - If changes touch `src/cli/`, `src/lib/` → CLI suite: `pnpm test:cli`
+   - If changes touch `src/point-functions/` → web suite (tests are colocated)
+   - Run both suites if changes span both areas
+
+4. Run only the tests for changed/affected files where possible:
+   - For targeted runs: `pnpm test:web -- --testPathPattern="<pattern>"`
+   - If no specific pattern can be identified, run the full relevant suite
+
+5. Report:
+   - Pass/fail status
+   - Number of test suites and tests run
+   - Any failing tests with their error messages
+   - Which files were changed that had no test coverage (note as gaps, not failures)
+
+## Args
+
+Optional: a PR number or branch name to compare against. Defaults to `origin/main`.
+
+Example: `/test-pr 42` or `/test-pr feature/my-branch`
diff --git a/.claude/skills/typecheck-pr.md b/.claude/skills/typecheck-pr.md
new file mode 100644
index 0000000..404bbe8
--- /dev/null
+++ b/.claude/skills/typecheck-pr.md
@@ -0,0 +1,30 @@
+# typecheck-pr
+
+Run TypeScript type checking on the project and surface errors introduced by the current branch.
+
+## Steps
+
+1. Get the list of changed `.ts` / `.tsx` files:
+   ```
+   git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx'
+   ```
+
+2. Run the full type check (tsc checks the whole project, not individual files):
+   ```
+   pnpm typecheck
+   ```
+   This runs `tsc --noEmit` using the root `tsconfig.json`.
+
+3. If the full typecheck fails, compare against main to isolate which errors are NEW:
+   - Note the total error count from the current branch
+   - Optionally stash changes, run `pnpm typecheck` on main, then restore — but only do this if there are errors and the user needs to know which are pre-existing vs. introduced
+
+4. Report:
+   - Overall pass/fail
+   - List of NEW type errors introduced by this branch (file, line, error message)
+   - Count of pre-existing errors if any (so the author knows what's theirs to fix)
+   - Flag any errors in files the PR didn't touch (may indicate type cascades from interface changes)
+
+## Args
+
+Optional: base branch to compare against. Defaults to `origin/main`.
diff --git a/.gitignore b/.gitignore
index aecac04..1513e23 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,6 +17,7 @@ catechism_dump.txt
 
 **/node_modules/**
 .claude
+!.claude/skills
 
 # testing
 /coverage

From 5c1411b1a3b9b91804d4329c8a713470ad080b40 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 18:10:28 +0000
Subject: [PATCH 02/11] Fix .gitignore to correctly unignore .claude/skills
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Change .claude → .claude/* so the !.claude/skills negation takes effect.
Git cannot negate inside an ignored directory; switching to a glob on
the directory's contents allows the exception to work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .gitignore | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 1513e23..9bc32f4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,7 +16,7 @@
 catechism_dump.txt
 
 **/node_modules/**
-.claude
+.claude/*
 !.claude/skills
 
 # testing

From 9a139d9a534fff2daa09ead720960359f4029468 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 18:30:45 +0000
Subject: [PATCH 03/11] Add Playwright e2e foundation + PR testing skills
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stand up a deterministic, version-controlled e2e layer (best-practice
@playwright/test specs in CI) and two skills that drive it as PRs are created.

Foundation:
- playwright.config.ts: boots `pnpm dev`, pre-warms /about as the readiness
  probe, retries+trace+video on CI. Auto-detects the sandbox Chromium at
  /opt/pw-browsers and falls back to bundled browser locally/in CI.
- tests/e2e/smoke.spec.ts: smoke tests on /about (static, secret-free → green
  in CI without storage/API keys). README documents env caveats + conventions.
- .github/workflows/e2e.yml: runs the suite on PRs/pushes, uploads HTML report.
- scripts/pr-screenshots.mjs: fail-soft full-page screenshotter reused by the
  PR-description skill.
- package.json scripts: test:e2e, test:e2e:ui, test:e2e:report, pr:screenshots.

Skills:
- e2e-pr: run the suite scoped to a PR's changed routes and report.
- pr-describe: generate a PR description from the diff and, for visual changes,
  attach before/after screenshots (base captured in an isolated git worktree).
  Strictly time-boxed and fail-soft — falls back to a text-only description.

Verified locally: both smoke tests pass and the screenshot script captures a
full-page PNG via the sandbox Chromium.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/e2e-pr.md      |  60 ++++++++++++++++
 .claude/skills/pr-describe.md | 132 ++++++++++++++++++++++++++++++++++
 .github/workflows/e2e.yml     |  51 +++++++++++++
 .gitignore                    |   6 ++
 package.json                  |   5 ++
 playwright.config.ts          |  56 +++++++++++++++
 pnpm-lock.yaml                |  49 +++++++++++--
 scripts/pr-screenshots.mjs    |  80 +++++++++++++++++++++
 tests/e2e/README.md           |  51 +++++++++++++
 tests/e2e/smoke.spec.ts       |  25 +++++++
 10 files changed, 510 insertions(+), 5 deletions(-)
 create mode 100644 .claude/skills/e2e-pr.md
 create mode 100644 .claude/skills/pr-describe.md
 create mode 100644 .github/workflows/e2e.yml
 create mode 100644 playwright.config.ts
 create mode 100644 scripts/pr-screenshots.mjs
 create mode 100644 tests/e2e/README.md
 create mode 100644 tests/e2e/smoke.spec.ts

diff --git a/.claude/skills/e2e-pr.md b/.claude/skills/e2e-pr.md
new file mode 100644
index 0000000..8ae923c
--- /dev/null
+++ b/.claude/skills/e2e-pr.md
@@ -0,0 +1,60 @@
+# e2e-pr
+
+Run the Playwright end-to-end suite against the routes a PR touches, and report results.
+
+## Prerequisites
+
+- `@playwright/test` is installed and `playwright.config.ts` exists at the repo root.
+- The config auto-boots `pnpm dev` (its `webServer` block) and waits for `/about`,
+  so you do NOT need to start a server yourself unless one is already running.
+- In CI the browser comes from `playwright install`; in the hosted sandbox it's
+  the pre-installed Chromium at `/opt/pw-browsers`. The config handles both.
+
+## Steps
+
+1. **Find changed files** vs the base branch (default `origin/main`):
+   ```
+   git diff --name-only origin/main...HEAD
+   ```
+
+2. **Map changed files to affected routes** (best-effort):
+   - `src/app/(group)/foo/page.tsx` → `/foo` (drop `src/app`, drop `(route-group)`
+     segments, drop the trailing `/page.tsx`)
+   - `src/app/(standard)/page.tsx` → `/`
+   - `layout.tsx` / `template.tsx` changes affect every route beneath them.
+   - Dynamic segments (`[id]`) can't be visited without a concrete value — note
+     them but don't try to test them blindly.
+   - Changes under `src/components/**` are shared and can't be mapped to a single
+     route → treat as "broad" (run the whole suite).
+
+3. **Choose scope:**
+   - If specific route files changed and there are specs covering them, run those:
+     ```
+     pnpm test:e2e -g "<route or feature keyword>"
+     ```
+     or pass specific spec files: `pnpm test:e2e tests/e2e/<file>.spec.ts`
+   - If changes are broad (shared components, layout, config) OR the suite is
+     small, just run everything: `pnpm test:e2e`
+
+4. **Run** the chosen command. The first run cold-compiles routes in `next dev`,
+   so allow time — the config already uses generous timeouts.
+
+5. **Report:**
+   - Pass/fail, number of tests run, and any failures with their error message.
+   - On failure, point to the HTML report (`pnpm test:e2e:report`) and the trace
+     (saved under `test-results/` on first retry).
+   - Call out any changed routes that have NO e2e coverage as gaps (not failures),
+     so the author can decide whether to add a spec.
+
+## Notes
+
+- Data-dependent routes (homepage, `/analysis/*`) need env/secrets or network
+  mocking — see `tests/e2e/README.md`. Don't add flaky assertions on them.
+- Do NOT auto-write new specs here; that's authoring. This skill runs existing
+  tests. (Use `pr-describe` or a dedicated authoring step to add coverage.)
+
+## Args
+
+Optional base branch to diff against (default `origin/main`).
+
+Example: `/e2e-pr` or `/e2e-pr staging`
diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md
new file mode 100644
index 0000000..690bf44
--- /dev/null
+++ b/.claude/skills/pr-describe.md
@@ -0,0 +1,132 @@
+# pr-describe
+
+Generate a high-quality PR description from the branch diff. When the change is
+**visual** (touches UI routes/components/styles), TRY to attach before/after
+screenshots — but the attempt is strictly time-boxed and **fails soft**: if
+anything goes wrong, produce a clean text-only description and move on.
+
+## Output contract
+
+- If a PR already exists for the current branch → update its body
+  (GitHub MCP `update_pull_request`).
+- If no PR exists → print the finished markdown for the user to use. Do **NOT**
+  open a PR unless the user explicitly asked for one.
+
+---
+
+## Phase 1 — Analyze the diff (always runs)
+
+```
+git fetch origin <base> --quiet
+git diff --stat origin/<base>...HEAD
+git diff origin/<base>...HEAD
+git log origin/<base>..HEAD --format='%s%n%b'
+```
+
+Draft the description with these sections:
+
+- **Summary** — 1-3 sentences on what changed and why.
+- **Motivation / context** — the problem or request behind it.
+- **Changes** — bulleted, grouped by area (UI, API, CLI, tests, docs…).
+- **Test plan** — what you ran (`pnpm typecheck`, `pnpm lint`, `pnpm test:web`,
+  `pnpm test:e2e`) and the result, plus manual steps if any.
+- **Screenshots** — filled in by Phase 4 if applicable, else omitted.
+
+## Phase 2 — Does the screenshot step apply?
+
+Visual-change heuristic — TRUE if any changed file matches:
+- `src/app/**/(page|layout|template).tsx`
+- `src/components/**/*.tsx`
+- `**/*.css`
+
+If FALSE → skip to Phase 5 (text-only). If TRUE → continue, subject to the
+give-up policy below.
+
+## Phase 3 — Capture screenshots (time-boxed, fail-soft)
+
+> **GIVE-UP POLICY — bail to text-only (Phase 5) and add a one-line note if ANY hold:**
+> - Total screenshot phase exceeds **~6 minutes** wall-clock.
+> - A dev server fails to become ready within **120s**.
+> - Zero routes can be mapped (e.g. only shared components / dynamic routes changed
+>   and the user gave no route to shoot).
+> - The screenshot script captures nothing (`scripts/pr-screenshots.mjs` exits non-zero).
+> - Any unexpected error. Never let screenshots block the description.
+
+**3a. Determine routes** (cap at the 3 most relevant; note any you dropped):
+- Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules).
+- Shared-component-only change → ask the user for 1-2 representative routes, OR
+  skip with a note. Don't guess across the whole app.
+- Skip dynamic (`[id]`) routes unless the user supplies a concrete URL.
+
+Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`.
+
+**3b. Capture AFTER (current branch, HEAD).** Reuse a running dev server on
+`:3172` if present, else the config/`pnpm dev` will serve it. Then:
+```
+node scripts/pr-screenshots.mjs --base-url http://localhost:3172 \
+  --routes "$ROUTES" --out .github/pr-media/$SLUG --label after
+```
+
+**3c. Capture BEFORE (base branch) in an isolated worktree** so the working tree
+is untouched. Symlink `node_modules` to avoid a slow reinstall (valid as long as
+the PR didn't change dependencies — if it did, note that the "before" shot may
+be approximate):
+```
+WT=$(mktemp -d)
+git worktree add --detach "$WT" origin/<base>
+ln -s "$PWD/node_modules" "$WT/node_modules"
+( cd "$WT" && pnpm exec next dev -p 3173 ) &   # remember the PID
+# poll http://localhost:3173/about until it responds (cap 120s)
+node scripts/pr-screenshots.mjs --base-url http://localhost:3173 \
+  --routes "$ROUTES" --out .github/pr-media/$SLUG --label before
+# then ALWAYS clean up:
+kill <pid>; git worktree remove --force "$WT"
+```
+
+If BEFORE fails but AFTER succeeded, proceed with after-only + a note.
+
+## Phase 4 — Host & embed
+
+GitHub renders images only from URLs, so the PNGs must be committed and pushed
+before they resolve. Commit them to the PR branch and reference raw URLs:
+
+```
+git add .github/pr-media/$SLUG
+git commit -m "Add PR before/after screenshots"
+git push
+```
+
+Derive `OWNER/REPO` from `git remote get-url origin` and `BRANCH` from
+`git rev-parse --abbrev-ref HEAD`. For each route build a row:
+
+```md
+### Screenshots
+
+#### `/about`
+| Before | After |
+|--------|-------|
+| ![before](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-before.png) | ![after](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-after.png) |
+```
+
+(Omit the "Before" cell for routes where only the after shot exists — new pages.)
+
+> **Tradeoff:** this commits PNGs into the PR branch (they show in the diff). They
+> can be deleted before merge, or — if you'd rather keep them out of the diff —
+> commit them to a dedicated orphan `pr-media` branch and point the raw URLs at
+> that branch instead. Default is the PR branch for simplicity.
+
+## Phase 5 — Finalize
+
+- Assemble the full body (Phase 1 sections + Phase 4 screenshots if any).
+- If a PR exists → update its body via GitHub MCP. Else → print the markdown.
+- If screenshots were skipped, include one honest line, e.g.
+  _"Screenshots skipped: change is API-only"_ or _"…: dev server didn't boot in time"_.
+  Never silently drop them without saying why.
+
+## Args
+
+- Base branch to diff against (default `origin/main`).
+- Optional route list to screenshot, e.g. `--routes /about,/pairs` (overrides
+  auto-detection — useful for shared-component changes).
+
+Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /about,/pairs`
diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml
new file mode 100644
index 0000000..75a799b
--- /dev/null
+++ b/.github/workflows/e2e.yml
@@ -0,0 +1,51 @@
+name: E2E Tests
+
+on:
+  pull_request:
+    paths-ignore:
+      - '**/*.md'
+      - 'docs/**'
+  push:
+    branches: [main]
+    paths-ignore:
+      - '**/*.md'
+      - 'docs/**'
+
+concurrency:
+  group: e2e-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  e2e:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+        with:
+          version: 9
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 18
+          cache: pnpm
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Install Playwright Chromium
+        run: pnpm exec playwright install --with-deps chromium
+
+      - name: Run E2E tests
+        run: pnpm test:e2e
+        env:
+          CI: 'true'
+
+      - name: Upload Playwright report
+        if: ${{ !cancelled() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: playwright-report
+          path: playwright-report/
+          retention-days: 14
diff --git a/.gitignore b/.gitignore
index 9bc32f4..0c90d52 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,6 +25,12 @@ catechism_dump.txt
 .swc
 .auth
 
+# playwright
+/test-results/
+/playwright-report/
+/blob-report/
+/playwright/.cache/
+
 # next.js
 /.next/
 /out/
diff --git a/package.json b/package.json
index c30e12b..d0206fd 100644
--- a/package.json
+++ b/package.json
@@ -20,6 +20,10 @@
     "test": "pnpm test:web && pnpm test:cli",
     "test:web": "jest --config jest.config.js",
     "test:cli": "node --experimental-vm-modules node_modules/jest/bin/jest.js --config jest.config.cli.js",
+    "test:e2e": "playwright test",
+    "test:e2e:ui": "playwright test --ui",
+    "test:e2e:report": "playwright show-report",
+    "pr:screenshots": "node scripts/pr-screenshots.mjs",
     "test:personality": "tsx src/cli/personality-test.ts",
     "compare:personality": "tsx src/cli/compare-personality.ts",
     "test:preference": "tsx src/cli/preference-test.ts",
@@ -129,6 +133,7 @@
   "devDependencies": {
     "@jest/globals": "^30.0.5",
     "@next/bundle-analyzer": "^15.4.4",
+    "@playwright/test": "1.55.0",
     "@sentry/cli": "^2.57.0",
     "@tailwindcss/postcss": "^4.1.11",
     "@tailwindcss/typography": "^0.5.16",
diff --git a/playwright.config.ts b/playwright.config.ts
new file mode 100644
index 0000000..dfa8900
--- /dev/null
+++ b/playwright.config.ts
@@ -0,0 +1,56 @@
+import { defineConfig, devices } from '@playwright/test';
+import { existsSync } from 'node:fs';
+
+/**
+ * In the hosted agent sandbox, Chromium is pre-installed at this path and
+ * `playwright install` is disabled. When the binary is present we point
+ * Playwright at it directly; otherwise (local dev, CI) we let Playwright use
+ * its own bundled browser installed via `playwright install chromium`.
+ */
+const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium';
+const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined;
+
+const PORT = 3172;
+const BASE_URL = process.env.E2E_BASE_URL ?? `http://localhost:${PORT}`;
+
+export default defineConfig({
+  testDir: './tests/e2e',
+  fullyParallel: true,
+  forbidOnly: !!process.env.CI,
+  retries: process.env.CI ? 2 : 0,
+  workers: process.env.CI ? 1 : undefined,
+  reporter: process.env.CI
+    ? [['list'], ['html', { open: 'never' }]]
+    : [['list']],
+  timeout: 60_000,
+  expect: { timeout: 15_000 },
+  use: {
+    baseURL: BASE_URL,
+    navigationTimeout: 45_000,
+    trace: 'on-first-retry',
+    screenshot: 'only-on-failure',
+    video: 'retain-on-failure',
+  },
+  projects: [
+    {
+      name: 'chromium',
+      use: { ...devices['Desktop Chrome'], launchOptions: { executablePath } },
+    },
+  ],
+  /**
+   * When E2E_BASE_URL is set we assume the app is already running (e.g. a
+   * production build or a remote deploy) and skip booting a dev server.
+   * Otherwise boot `pnpm dev`; the readiness probe hits /about — a static,
+   * dependency-free route — which also pre-compiles it so the first test is fast.
+   */
+  webServer: process.env.E2E_BASE_URL
+    ? undefined
+    : {
+        command: 'pnpm dev',
+        url: `http://localhost:${PORT}/about`,
+        reuseExistingServer: !process.env.CI,
+        timeout: 180_000,
+        stdout: 'pipe',
+        stderr: 'pipe',
+      },
+});
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 2794962..15eb0e1 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -82,7 +82,7 @@ importers:
         version: 1.2.7(@types/react-dom@19.1.6(@types/react@19.1.8))(@types/react@19.1.8)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       '@sentry/nextjs':
         specifier: ^10.21.0
-        version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)
+        version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)
       '@sentry/node':
         specifier: ^10.21.0
         version: 10.21.0
@@ -166,7 +166,7 @@ importers:
         version: 3.1.0
       next:
         specifier: 15.5.9
-        version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
+        version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       next-themes:
         specifier: ^0.4.6
         version: 0.4.6(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
@@ -237,6 +237,9 @@ importers:
       '@next/bundle-analyzer':
         specifier: ^15.4.4
         version: 15.4.4
+      '@playwright/test':
+        specifier: 1.55.0
+        version: 1.55.0
       '@sentry/cli':
         specifier: ^2.57.0
         version: 2.57.0(encoding@0.1.13)
@@ -1825,6 +1828,11 @@ packages:
     resolution: {integrity: sha512-YLT9Zo3oNPJoBjBc4q8G2mjU4tqIbf5CEOORbUUr48dCD9q3umJ3IPlVqOqDakPfd2HuwccBaqlGhN4Gmr5OWg==}
     engines: {node: ^12.20.0 || ^14.18.0 || >=16.0.0}
 
+  '@playwright/test@1.55.0':
+    resolution: {integrity: sha512-04IXzPwHrW69XusN/SIdDdKZBzMfOT9UNT/YiJit/xpy2VuAoB8NHc8Aplb96zsWDddLnbkPL3TsmrS04ZU2xQ==}
+    engines: {node: '>=18'}
+    hasBin: true
+
   '@polka/url@1.0.0-next.29':
     resolution: {integrity: sha512-wwQAWhWSuHaag8c4q/KN/vCoeOJYshAIvMQwD4GpSb3OiZklFfvAgmj0VCBBImRpuF/aFgIRzllXlVX93Jevww==}
 
@@ -4558,6 +4566,11 @@ packages:
   fs.realpath@1.0.0:
     resolution: {integrity: sha512-OO0pH2lK6a0hZnAdau5ItzHPI6pUlvI7jMVnxUQRtw4owF2wk8lOSabtGDCTP4Ggrg2MbGnWO9X8K1t4+fGMDw==}
 
+  fsevents@2.3.2:
+    resolution: {integrity: sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==}
+    engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0}
+    os: [darwin]
+
   fsevents@2.3.3:
     resolution: {integrity: sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==}
     engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0}
@@ -5885,6 +5898,16 @@ packages:
     resolution: {integrity: sha512-HRDzbaKjC+AOWVXxAU/x54COGeIv9eb+6CkDSQoNTt4XyWoIJvuPsXizxu/Fr23EiekbtZwmh1IcIG/l/a10GQ==}
     engines: {node: '>=8'}
 
+  playwright-core@1.55.0:
+    resolution: {integrity: sha512-GvZs4vU3U5ro2nZpeiwyb0zuFaqb9sUiAJuyrWpcGouD8y9/HLgGbNRjIph7zU9D3hnPaisMl9zG9CgFi/biIg==}
+    engines: {node: '>=18'}
+    hasBin: true
+
+  playwright@1.55.0:
+    resolution: {integrity: sha512-sdCWStblvV1YU909Xqx0DhOjPZE4/5lJsIS84IfN9dAZfcl/CIZ5O8l3o0j7hPMjDvqoTF8ZUcc+i/GL5erstA==}
+    engines: {node: '>=18'}
+    hasBin: true
+
   postcss-import@15.1.0:
     resolution: {integrity: sha512-hpr+J05B2FVYUAXHeK1YyI267J/dDDhMU6B6civm8hSY1jYJnBXxzKDKDswzJmtLHryrjhnDjqqp/49t8FALew==}
     engines: {node: '>=14.0.0'}
@@ -9217,6 +9240,10 @@ snapshots:
 
   '@pkgr/core@0.2.7': {}
 
+  '@playwright/test@1.55.0':
+    dependencies:
+      playwright: 1.55.0
+
   '@polka/url@1.0.0-next.29': {}
 
   '@prisma/instrumentation@6.15.0(@opentelemetry/api@1.9.0)':
@@ -9997,7 +10024,7 @@ snapshots:
 
   '@sentry/core@10.21.0': {}
 
-  '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)':
+  '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)':
     dependencies:
       '@opentelemetry/api': 1.9.0
       '@opentelemetry/semantic-conventions': 1.37.0
@@ -10011,7 +10038,7 @@ snapshots:
       '@sentry/vercel-edge': 10.21.0
       '@sentry/webpack-plugin': 4.5.0(encoding@0.1.13)(webpack@5.102.1)
       chalk: 3.0.0
-      next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
+      next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0)
       resolve: 1.22.8
       rollup: 4.44.2
       stacktrace-parser: 0.1.11
@@ -12331,6 +12358,9 @@ snapshots:
 
   fs.realpath@1.0.0: {}
 
+  fsevents@2.3.2:
+    optional: true
+
   fsevents@2.3.3:
     optional: true
 
@@ -13818,7 +13848,7 @@ snapshots:
       react: 19.1.0
       react-dom: 19.1.0(react@19.1.0)
 
-  next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0):
+  next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0):
     dependencies:
       '@next/env': 15.5.9
       '@swc/helpers': 0.5.15
@@ -13837,6 +13867,7 @@ snapshots:
       '@next/swc-win32-arm64-msvc': 15.5.7
       '@next/swc-win32-x64-msvc': 15.5.7
       '@opentelemetry/api': 1.9.0
+      '@playwright/test': 1.55.0
       sharp: 0.34.3
     transitivePeerDependencies:
       - '@babel/core'
@@ -14068,6 +14099,14 @@ snapshots:
     dependencies:
       find-up: 4.1.0
 
+  playwright-core@1.55.0: {}
+
+  playwright@1.55.0:
+    dependencies:
+      playwright-core: 1.55.0
+    optionalDependencies:
+      fsevents: 2.3.2
+
   postcss-import@15.1.0(postcss@8.5.6):
     dependencies:
       postcss: 8.5.6
diff --git a/scripts/pr-screenshots.mjs b/scripts/pr-screenshots.mjs
new file mode 100644
index 0000000..8688a48
--- /dev/null
+++ b/scripts/pr-screenshots.mjs
@@ -0,0 +1,80 @@
+// Capture full-page screenshots of routes, for PR before/after comparisons.
+//
+// Usage:
+//   node scripts/pr-screenshots.mjs \
+//     --base-url http://localhost:3172 \
+//     --routes /about,/what-is-an-eval \
+//     --out .github/pr-media/my-branch \
+//     --label after
+//
+// Designed to FAIL SOFT: a route that errors or times out is skipped (and
+// logged), not fatal, so the orchestrating skill can still produce a partial
+// result. Exits non-zero only if NOTHING was captured.
+//
+// Reuses the @playwright/test Chromium (no extra dependency). In the hosted
+// sandbox it launches the pre-installed browser at /opt/pw-browsers/chromium.
+
+import { chromium } from '@playwright/test';
+import { existsSync, mkdirSync } from 'node:fs';
+import path from 'node:path';
+
+function arg(name, fallback) {
+  const i = process.argv.indexOf(`--${name}`);
+  return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : fallback;
+}
+
+const baseUrl = arg('base-url', 'http://localhost:3172').replace(/\/$/, '');
+const routes = arg('routes', '/')
+  .split(',')
+  .map((r) => r.trim())
+  .filter(Boolean);
+const outDir = arg('out', '.github/pr-media');
+const label = arg('label', 'after');
+const viewport = {
+  width: Number(arg('width', '1280')),
+  height: Number(arg('height', '800')),
+};
+const perRouteTimeoutMs = Number(arg('timeout', '45000'));
+
+const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium';
+const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined;
+
+function slug(route) {
+  const s = route.replace(/^\/+|\/+$/g, '').replace(/[^a-zA-Z0-9._-]+/g, '_');
+  return s || 'home';
+}
+
+mkdirSync(outDir, { recursive: true });
+
+const browser = await chromium.launch({ executablePath });
+const context = await browser.newContext({ viewport });
+const results = [];
+
+for (const route of routes) {
+  const page = await context.newPage();
+  const file = path.join(outDir, `${slug(route)}-${label}.png`);
+  try {
+    // 'load' (not 'networkidle') — Next.js dev keeps an HMR websocket open,
+    // so networkidle would never settle.
+    await page.goto(`${baseUrl}${route}`, {
+      waitUntil: 'load',
+      timeout: perRouteTimeoutMs,
+    });
+    await page.waitForTimeout(750); // let fonts/animations settle
+    await page.screenshot({ path: file, fullPage: true });
+    results.push({ route, file, ok: true });
+    console.log(`OK   ${route} -> ${file}`);
+  } catch (err) {
+    results.push({ route, ok: false, error: String(err?.message || err) });
+    console.log(`FAIL ${route}: ${err?.message || err}`);
+  } finally {
+    await page.close();
+  }
+}
+
+await context.close();
+await browser.close();
+
+const ok = results.filter((r) => r.ok).length;
+console.log(`\n${ok}/${routes.length} screenshots captured in ${outDir}`);
+process.exit(ok > 0 ? 0 : 1);
diff --git a/tests/e2e/README.md b/tests/e2e/README.md
new file mode 100644
index 0000000..bd7c117
--- /dev/null
+++ b/tests/e2e/README.md
@@ -0,0 +1,51 @@
+# End-to-end tests
+
+Playwright e2e tests for the Weval web app.
+
+## Running
+
+```bash
+pnpm test:e2e          # run the suite (auto-boots `pnpm dev` on :3172)
+pnpm test:e2e:ui       # interactive UI mode
+pnpm test:e2e:report   # open the last HTML report
+```
+
+You don't need to start the dev server yourself — `playwright.config.ts` has a
+`webServer` block that boots `pnpm dev` and waits for `/about` to respond. If a
+dev server is already running on `:3172` it is reused (locally).
+
+To run against an already-running app (e.g. a production build or a deployed
+preview) instead of booting dev:
+
+```bash
+E2E_BASE_URL=https://your-preview.example.com pnpm test:e2e
+```
+
+## Browser binary
+
+- **Local / CI:** Playwright uses its own bundled Chromium. Install it once with
+  `pnpm exec playwright install --with-deps chromium`.
+- **Hosted agent sandbox:** Chromium is pre-installed at `/opt/pw-browsers`. The
+  config auto-detects it (`executablePath`) and never downloads.
+
+## What's safe to test here
+
+Smoke tests deliberately target **statically rendered, dependency-free routes**
+(`/about`, `/what-is-an-eval`, …) so they pass in CI without any secrets.
+
+Routes that read from storage (S3) or call external LLM APIs — the homepage,
+`/analysis/*`, `/latest`, etc. — will be slow or error without env/network. To
+cover those, either:
+
+- provide the relevant env vars (see `.env.template`), or
+- intercept network calls with `page.route(...)` and serve fixtures.
+
+Keep flaky, data-dependent assertions out of the default suite.
+
+## Conventions
+
+- Prefer role/text locators (`getByRole`, `getByText`) and `a[href*="…"]` over
+  brittle CSS/nth-child selectors.
+- Never use hard `waitForTimeout` for synchronization — rely on web-first
+  assertions (`await expect(locator).toBeVisible()`), which auto-wait.
+- Add a `data-testid` to a component only when no accessible/role selector works.
diff --git a/tests/e2e/smoke.spec.ts b/tests/e2e/smoke.spec.ts
new file mode 100644
index 0000000..10da7e0
--- /dev/null
+++ b/tests/e2e/smoke.spec.ts
@@ -0,0 +1,25 @@
+import { test, expect } from '@playwright/test';
+
+/**
+ * Smoke tests target dependency-free, statically rendered routes so they pass
+ * in CI without storage/API secrets. Data-driven routes (the homepage,
+ * /analysis, etc.) hit storage and external LLM APIs — to test those, provide
+ * env vars or mock the network first. See tests/e2e/README.md.
+ */
+test.describe('smoke', () => {
+  test('about page renders its title and key content', async ({ page }) => {
+    await page.goto('/about');
+
+    await expect(page).toHaveTitle(/About Weval/i);
+    await expect(
+      page.getByRole('heading', { name: /what are evaluations\?/i }),
+    ).toBeVisible();
+  });
+
+  test('about page links out to the Collective Intelligence Project', async ({ page }) => {
+    await page.goto('/about');
+
+    const cipLink = page.locator('a[href*="cip.org"]').first();
+    await expect(cipLink).toBeVisible();
+  });
+});

From d2e631a3e8ac5f3145f3e03e97394bf342aabcd7 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 19:15:36 +0000
Subject: [PATCH 04/11] Add security guardrails to pr-describe skill

The repo is public, so committed screenshots are permanent and world-readable.
Bake in non-negotiable rules: route denylist (/admin*, /api*, auth routes),
secret-free capture env only, no false "delete before merge" comfort, CI-artifact
hosting for anything potentially sensitive, and human review before pushing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/pr-describe.md | 50 +++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md
index 690bf44..c177fbb 100644
--- a/.claude/skills/pr-describe.md
+++ b/.claude/skills/pr-describe.md
@@ -12,6 +12,28 @@ anything goes wrong, produce a clean text-only description and move on.
 - If no PR exists → print the finished markdown for the user to use. Do **NOT**
   open a PR unless the user explicitly asked for one.
 
+## Security guardrails (NON-NEGOTIABLE — this repo is public)
+
+Screenshots publish whatever they render. On a **public** repo, anything you
+commit is world-readable via raw URLs **permanently** (git history, forks, CDN
+cache) — deleting the branch does NOT undo it. So:
+
+1. **Route denylist — never screenshot these, no exceptions:**
+   `/admin*`, `/api*`, and any authenticated/session/account route. If a changed
+   route matches, skip it and note "omitted for safety", do not capture it.
+2. **Capture only against a secret-free environment.** Use the local `pnpm dev`
+   with NO real storage/API secrets wired up, so data-driven pages render
+   empty/mock and there is nothing sensitive in frame. Do **not** screenshot a
+   preview deploy that is backed by real data/secrets and then commit it here.
+3. **Treat committed images as permanent and public.** Never tell the user they
+   can "delete before merge" to undo exposure — that is false.
+4. **If a capture could contain anything sensitive, do NOT commit it to the
+   branch.** Prefer CI-artifact hosting (collaborator-only, auto-expiring) or a
+   PR comment. Committing to the public branch is only for plainly non-sensitive,
+   static UI.
+5. **Surface images for human review before pushing.** Show the user what was
+   captured and let them confirm; never push blind.
+
 ---
 
 ## Phase 1 — Analyze the diff (always runs)
@@ -54,9 +76,14 @@ give-up policy below.
 
 **3a. Determine routes** (cap at the 3 most relevant; note any you dropped):
 - Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules).
+- **Apply the route denylist (see Security guardrails):** drop any `/admin*`,
+  `/api*`, or authenticated route and note it as "omitted for safety". This
+  filter runs before anything is captured.
 - Shared-component-only change → ask the user for 1-2 representative routes, OR
   skip with a note. Don't guess across the whole app.
 - Skip dynamic (`[id]`) routes unless the user supplies a concrete URL.
+- Confirm the dev server has **no real storage/API secrets** in its env before
+  capturing (guardrail #2). If you can't confirm that, skip screenshots.
 
 Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`.
 
@@ -87,8 +114,15 @@ If BEFORE fails but AFTER succeeded, proceed with after-only + a note.
 
 ## Phase 4 — Host & embed
 
+> **Before any commit: re-confirm the captures are non-sensitive (guardrails
+> #1–#4) and show them to the user for a quick look (guardrail #5).** Committing
+> to a public branch is permanent and irreversible. If there is any doubt about
+> the contents, use CI-artifact hosting instead (see "Sensitive captures" below)
+> or skip embedding entirely.
+
 GitHub renders images only from URLs, so the PNGs must be committed and pushed
-before they resolve. Commit them to the PR branch and reference raw URLs:
+before they resolve. For plainly non-sensitive, static UI, commit them to the PR
+branch and reference raw URLs:
 
 ```
 git add .github/pr-media/$SLUG
@@ -110,10 +144,16 @@ Derive `OWNER/REPO` from `git remote get-url origin` and `BRANCH` from
 
 (Omit the "Before" cell for routes where only the after shot exists — new pages.)
 
-> **Tradeoff:** this commits PNGs into the PR branch (they show in the diff). They
-> can be deleted before merge, or — if you'd rather keep them out of the diff —
-> commit them to a dedicated orphan `pr-media` branch and point the raw URLs at
-> that branch instead. Default is the PR branch for simplicity.
+> **Tradeoff:** this commits PNGs into the PR branch and they show in the diff.
+> On a public repo this is **permanent and world-readable** — do NOT claim they
+> can be "deleted before merge" to undo exposure. To keep them out of the PR's
+> own diff you can use a dedicated orphan `pr-media` branch, but that is still
+> public; it changes visibility-in-diff, not exposure.
+
+**Sensitive captures → don't commit; use CI artifacts instead.** If a shot could
+contain anything non-public, skip the commit and have the e2e workflow upload the
+images via `actions/upload-artifact` (collaborator-only, auto-expiring). Link the
+run/artifact from the PR body rather than embedding a public raw URL.
 
 ## Phase 5 — Finalize
 

From 1f36dd0505403043dd06732cecfedf3b6c59abb0 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 19:31:10 +0000
Subject: [PATCH 05/11] Gate pr-describe screenshots to static routes by
 default

Screenshots only pay off on static, secret-free routes; data-driven pages render
empty/mock against the local dev env. Default to a static allowlist (/about,
/what-is-an-eval) and skip everything else with a note, overridable via --routes.
Filters run in order: route mapping -> security denylist (non-overridable) ->
static gate (overridable) -> dynamic-route skip. If nothing survives, fall back
to a text-only description without booting servers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/pr-describe.md | 60 +++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md
index c177fbb..724e32c 100644
--- a/.claude/skills/pr-describe.md
+++ b/.claude/skills/pr-describe.md
@@ -1,9 +1,10 @@
 # pr-describe
 
-Generate a high-quality PR description from the branch diff. When the change is
-**visual** (touches UI routes/components/styles), TRY to attach before/after
-screenshots — but the attempt is strictly time-boxed and **fails soft**: if
-anything goes wrong, produce a clean text-only description and move on.
+Generate a high-quality PR description from the branch diff. The description
+itself is the core deliverable and always runs. Before/after screenshots are a
+**bonus that only activates on static, visual PRs** — they're strictly
+time-boxed, **fail soft** (any problem → clean text-only description), and are
+**off by default for data-driven routes** (see "Default screenshot scope").
 
 ## Output contract
 
@@ -62,7 +63,26 @@ Visual-change heuristic — TRUE if any changed file matches:
 - `**/*.css`
 
 If FALSE → skip to Phase 5 (text-only). If TRUE → continue, subject to the
-give-up policy below.
+default scope and give-up policy below.
+
+### Default screenshot scope (practicality gate)
+
+Screenshots only pay off on **static, secret-free routes**. Data-driven routes
+(homepage, `/analysis/*`, `/pairs`, `/latest`, `/model/*`, etc.) render
+empty/mock against the secret-free dev env — an uninformative shot — and cost a
+slow capture. So **by default, only capture known-static routes**:
+
+- **Default static allowlist:** `/about`, `/what-is-an-eval`. (Extend this list
+  as more static pages are confirmed safe + stable.)
+- Any mapped route **not** on the allowlist is **skipped by default** with a note:
+  _"skipped: data-driven route (pass `--routes` to force)"_.
+- The user can **override** with `--routes /foo,/bar` to force specific routes
+  (e.g. against a preview deploy with real data, where they accept the tradeoff).
+- The **security denylist always wins** over any override — `/admin*`, `/api*`,
+  and auth routes are never captured even if explicitly passed.
+
+If, after this gate, there are **no routes left to shoot** → skip to Phase 5
+(text-only) with a one-line note. Don't boot servers for nothing.
 
 ## Phase 3 — Capture screenshots (time-boxed, fail-soft)
 
@@ -74,16 +94,20 @@ give-up policy below.
 > - The screenshot script captures nothing (`scripts/pr-screenshots.mjs` exits non-zero).
 > - Any unexpected error. Never let screenshots block the description.
 
-**3a. Determine routes** (cap at the 3 most relevant; note any you dropped):
-- Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules).
-- **Apply the route denylist (see Security guardrails):** drop any `/admin*`,
-  `/api*`, or authenticated route and note it as "omitted for safety". This
-  filter runs before anything is captured.
-- Shared-component-only change → ask the user for 1-2 representative routes, OR
-  skip with a note. Don't guess across the whole app.
-- Skip dynamic (`[id]`) routes unless the user supplies a concrete URL.
+**3a. Determine routes** (cap at the 3 most relevant; note any you dropped).
+Apply these filters **in order**:
+1. Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules).
+2. **Security denylist (always, non-overridable):** drop any `/admin*`, `/api*`,
+   or authenticated route; note as "omitted for safety".
+3. **Default static gate (see Phase 2):** unless the user passed `--routes`, drop
+   anything not on the static allowlist; note as "skipped: data-driven route".
+   If `--routes` was passed, use exactly those (still subject to step 2).
+4. Skip dynamic (`[id]`) routes unless the user supplies a concrete URL.
+5. Shared-component-only change with nothing left → ask the user for 1-2
+   representative static routes, or skip with a note. Don't guess across the app.
 - Confirm the dev server has **no real storage/API secrets** in its env before
   capturing (guardrail #2). If you can't confirm that, skip screenshots.
+- If no routes survive the filters → skip to Phase 5 (text-only).
 
 Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`.
 
@@ -166,7 +190,11 @@ run/artifact from the PR body rather than embedding a public raw URL.
 ## Args
 
 - Base branch to diff against (default `origin/main`).
-- Optional route list to screenshot, e.g. `--routes /about,/pairs` (overrides
-  auto-detection — useful for shared-component changes).
+- `--routes /foo,/bar` — **override the default static-only gate** and capture
+  exactly these routes (still subject to the security denylist). Use this for
+  shared-component changes or when shooting a preview deploy with real data.
+
+By default (no `--routes`), only known-static routes (`/about`,
+`/what-is-an-eval`) are captured; data-driven routes are skipped with a note.
 
-Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /about,/pairs`
+Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /pairs,/latest`

From fe96f099c7473e5399ab66c0251a75d5f46f3cef Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 19:52:00 +0000
Subject: [PATCH 06/11] Remove redundant single-check PR skills

Drop test-pr, typecheck-pr, lint-pr, and blueprint-validate. pr-check already
orchestrates typecheck/lint/tests directly via their pnpm commands, so the
standalone variants were redundant. Remaining skills: pr-check, e2e-pr, pr-describe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/blueprint-validate.md | 45 ----------------------------
 .claude/skills/lint-pr.md            | 34 ---------------------
 .claude/skills/test-pr.md            | 37 -----------------------
 .claude/skills/typecheck-pr.md       | 30 -------------------
 4 files changed, 146 deletions(-)
 delete mode 100644 .claude/skills/blueprint-validate.md
 delete mode 100644 .claude/skills/lint-pr.md
 delete mode 100644 .claude/skills/test-pr.md
 delete mode 100644 .claude/skills/typecheck-pr.md

diff --git a/.claude/skills/blueprint-validate.md b/.claude/skills/blueprint-validate.md
deleted file mode 100644
index aff3139..0000000
--- a/.claude/skills/blueprint-validate.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# blueprint-validate
-
-Validate Weval blueprint YAML/JSON files changed in the current branch. This mirrors Layer 1 of the three-layer PR evaluation system described in `docs/PR_EVALUATION_SETUP.md`.
-
-## Steps
-
-1. Find changed blueprint files:
-   ```
-   git diff --name-only origin/main...HEAD -- 'examples/**' '*.yaml' '*.yml' '*.json'
-   ```
-   Focus on files in `examples/blueprints/` or any `.yaml`/`.json` blueprint files.
-
-2. For each changed blueprint file, validate:
-
-   **Structural checks:**
-   - Parse YAML/JSON — if it fails to parse, report the syntax error immediately
-   - Required top-level fields: `id`, `prompts` (check `docs/BLUEPRINT_FORMAT.md` for full schema)
-   - `id` must be a non-empty string
-   - `prompts` must be a non-empty array
-   - Each prompt must have the required fields
-
-   **Size/cost checks (staging limits from Layer 2):**
-   - Warn if prompt count > 10 (would be auto-trimmed in staging eval)
-   - Warn if model list is large (only CORE models supported in staging)
-   - Note: these are warnings, not errors — production evals have no limits
-
-   **Security check:**
-   - If the file is under `blueprints/users/{username}/`, the username in the path must match the PR author
-   - Get the PR author from `git log --format='%ae' -1` or from GitHub context if available
-
-3. For any blueprint that passes validation, summarize:
-   - Number of prompts
-   - Models targeted
-   - Estimated staging evaluation scope (after auto-trim if applicable)
-
-4. Report:
-   - ✅ or ❌ per file with specific error details
-   - Warnings for staging-limit exceedances
-   - A final pass/fail verdict
-
-## Args
-
-Optional: path to a specific blueprint file to validate instead of scanning changed files.
-
-Example: `/blueprint-validate` or `/blueprint-validate examples/blueprints/my-blueprint.yaml`
diff --git a/.claude/skills/lint-pr.md b/.claude/skills/lint-pr.md
deleted file mode 100644
index 8d1c0da..0000000
--- a/.claude/skills/lint-pr.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# lint-pr
-
-Run Next.js ESLint on files changed in the current branch and report new lint errors.
-
-## Steps
-
-1. Find changed JS/TS/TSX files:
-   ```
-   git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx'
-   ```
-
-2. Run lint on the changed files only (faster than full lint):
-   ```
-   pnpm exec next lint --file <file1> --file <file2> ...
-   ```
-   If there are more than 20 changed files, run the full lint instead:
-   ```
-   pnpm lint
-   ```
-
-3. Report:
-   - Pass/fail overall
-   - Each error and warning with file path, line number, rule name, and message
-   - Distinguish errors (must fix) from warnings (should fix)
-   - If zero issues: confirm clean
-
-4. Do NOT auto-fix lint errors unless the user explicitly passes `--fix` as an arg.
-   If `--fix` is passed, run `pnpm lint:fix` on the changed files and show a diff of what changed.
-
-## Args
-
-Optional: `--fix` to automatically apply safe lint fixes.
-
-Example: `/lint-pr` or `/lint-pr --fix`
diff --git a/.claude/skills/test-pr.md b/.claude/skills/test-pr.md
deleted file mode 100644
index 1be1953..0000000
--- a/.claude/skills/test-pr.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# test-pr
-
-Run Jest tests for files changed in the current branch compared to main.
-
-## Steps
-
-1. Find changed files:
-   ```
-   git diff --name-only origin/main...HEAD
-   ```
-
-2. Identify which changed files have corresponding tests or ARE test files. Test files live in:
-   - Same directory as source with `.test.ts` / `.test.tsx` suffix
-   - `src/point-functions/` tests are colocated
-   - Jest configs: `jest.config.js` (web) and `jest.config.cli.js` (CLI)
-
-3. Decide which test suite(s) to run:
-   - If changes touch `src/app/`, `src/components/`, `src/hooks/` → web suite: `pnpm test:web`
-   - If changes touch `src/cli/`, `src/lib/` → CLI suite: `pnpm test:cli`
-   - If changes touch `src/point-functions/` → web suite (tests are colocated)
-   - Run both suites if changes span both areas
-
-4. Run only the tests for changed/affected files where possible:
-   - For targeted runs: `pnpm test:web -- --testPathPattern="<pattern>"`
-   - If no specific pattern can be identified, run the full relevant suite
-
-5. Report:
-   - Pass/fail status
-   - Number of test suites and tests run
-   - Any failing tests with their error messages
-   - Which files were changed that had no test coverage (note as gaps, not failures)
-
-## Args
-
-Optional: a PR number or branch name to compare against. Defaults to `origin/main`.
-
-Example: `/test-pr 42` or `/test-pr feature/my-branch`
diff --git a/.claude/skills/typecheck-pr.md b/.claude/skills/typecheck-pr.md
deleted file mode 100644
index 404bbe8..0000000
--- a/.claude/skills/typecheck-pr.md
+++ /dev/null
@@ -1,30 +0,0 @@
-# typecheck-pr
-
-Run TypeScript type checking on the project and surface errors introduced by the current branch.
-
-## Steps
-
-1. Get the list of changed `.ts` / `.tsx` files:
-   ```
-   git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx'
-   ```
-
-2. Run the full type check (tsc checks the whole project, not individual files):
-   ```
-   pnpm typecheck
-   ```
-   This runs `tsc --noEmit` using the root `tsconfig.json`.
-
-3. If the full typecheck fails, compare against main to isolate which errors are NEW:
-   - Note the total error count from the current branch
-   - Optionally stash changes, run `pnpm typecheck` on main, then restore — but only do this if there are errors and the user needs to know which are pre-existing vs. introduced
-
-4. Report:
-   - Overall pass/fail
-   - List of NEW type errors introduced by this branch (file, line, error message)
-   - Count of pre-existing errors if any (so the author knows what's theirs to fix)
-   - Flag any errors in files the PR didn't touch (may indicate type cascades from interface changes)
-
-## Args
-
-Optional: base branch to compare against. Defaults to `origin/main`.

From d077034c3176ac70e493329e897a4babc97d0bd9 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 20:10:36 +0000
Subject: [PATCH 07/11] Add Risks/rollback section to pr-describe template
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Borrowed from common PR-description skills: an explicit blast-radius + how-to-undo
section so reviewers see what to scrutinize. Kept honest — one line for low-risk,
self-contained changes rather than padding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/pr-describe.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md
index 724e32c..03d2d1d 100644
--- a/.claude/skills/pr-describe.md
+++ b/.claude/skills/pr-describe.md
@@ -53,6 +53,12 @@ Draft the description with these sections:
 - **Changes** — bulleted, grouped by area (UI, API, CLI, tests, docs…).
 - **Test plan** — what you ran (`pnpm typecheck`, `pnpm lint`, `pnpm test:web`,
   `pnpm test:e2e`) and the result, plus manual steps if any.
+- **Risks / rollback** — the blast radius and how to undo. Call out anything
+  reviewers should scrutinize (data migrations, auth/permissions, external API
+  or cost impact, breaking changes, shared components touched). State how to
+  revert (usually "revert this PR" — but note it if a migration or deploy step
+  makes rollback non-trivial). If the change is low-risk and self-contained, say
+  so in one line rather than padding.
 - **Screenshots** — filled in by Phase 4 if applicable, else omitted.
 
 ## Phase 2 — Does the screenshot step apply?

From e9faf94afd2a0d39627a0d2687d32339eac934f4 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 20:15:40 +0000
Subject: [PATCH 08/11] Add Related Issues section and a worked example to
 pr-describe

Related Issues supports Closes/Refs linking (auto-closes issues on merge). The
worked example anchors the model to consistent output and shows all sections
filled in for a small UI change. Test plan + Risks/rollback retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/pr-describe.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md
index 03d2d1d..6562bd8 100644
--- a/.claude/skills/pr-describe.md
+++ b/.claude/skills/pr-describe.md
@@ -60,6 +60,39 @@ Draft the description with these sections:
   makes rollback non-trivial). If the change is low-risk and self-contained, say
   so in one line rather than padding.
 - **Screenshots** — filled in by Phase 4 if applicable, else omitted.
+- **Related Issues** — link tickets and related work. Use `Closes #123` for
+  issues this PR resolves (auto-closes them on merge), `Refs #456` for related
+  PRs/issues, plus any relevant docs. Omit the section entirely if there's
+  nothing to link — don't invent issue numbers.
+
+### Worked example
+
+```md
+## Summary
+Tightens the site header on mobile so the nav no longer wraps under 380px.
+
+## Motivation / context
+The logo + nav links overflowed on small screens, pushing the theme toggle
+off-canvas. Reported in #41.
+
+## Changes
+- **UI:** right-align nav links and shrink logo on `sm` breakpoint (`Header.tsx`)
+- **UI:** swap hover underline for opacity to avoid layout shift
+- **Tests:** add an e2e assertion that the header is visible at 360px width
+
+## Test plan
+- `pnpm typecheck` ✅  ·  `pnpm lint` ✅  ·  `pnpm test:e2e` ✅ (3 passed)
+- Manually checked /about at 320 / 375 / 768px.
+
+## Risks / rollback
+Low-risk, CSS-only and self-contained. Revert this PR to undo.
+
+## Screenshots
+(before/after table inserted by Phase 4)
+
+## Related Issues
+Closes #41
+```
 
 ## Phase 2 — Does the screenshot step apply?
 

From cb6b9c730b892eaec9802c7061972b47e920c037 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 20:17:01 +0000
Subject: [PATCH 09/11] Convert skills to discoverable <name>/SKILL.md layout

Claude Code only auto-discovers skills at .claude/skills/<name>/SKILL.md with
YAML frontmatter (name + description). The flat .md files were not discoverable.
Move pr-check, e2e-pr, and pr-describe into that structure and add trigger-
oriented descriptions so the harness surfaces them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/{e2e-pr.md => e2e-pr/SKILL.md}           | 5 +++++
 .claude/skills/{pr-check.md => pr-check/SKILL.md}       | 5 +++++
 .claude/skills/{pr-describe.md => pr-describe/SKILL.md} | 5 +++++
 3 files changed, 15 insertions(+)
 rename .claude/skills/{e2e-pr.md => e2e-pr/SKILL.md} (91%)
 rename .claude/skills/{pr-check.md => pr-check/SKILL.md} (89%)
 rename .claude/skills/{pr-describe.md => pr-describe/SKILL.md} (97%)

diff --git a/.claude/skills/e2e-pr.md b/.claude/skills/e2e-pr/SKILL.md
similarity index 91%
rename from .claude/skills/e2e-pr.md
rename to .claude/skills/e2e-pr/SKILL.md
index 8ae923c..d094147 100644
--- a/.claude/skills/e2e-pr.md
+++ b/.claude/skills/e2e-pr/SKILL.md
@@ -1,3 +1,8 @@
+---
+name: e2e-pr
+description: Run the Playwright end-to-end suite against the routes a PR changes and report results. Use when verifying UI behavior for a branch or PR, or checking that a change didn't break key pages.
+---
+
 # e2e-pr
 
 Run the Playwright end-to-end suite against the routes a PR touches, and report results.
diff --git a/.claude/skills/pr-check.md b/.claude/skills/pr-check/SKILL.md
similarity index 89%
rename from .claude/skills/pr-check.md
rename to .claude/skills/pr-check/SKILL.md
index c66b2d4..a3b56ac 100644
--- a/.claude/skills/pr-check.md
+++ b/.claude/skills/pr-check/SKILL.md
@@ -1,3 +1,8 @@
+---
+name: pr-check
+description: Run the full quality gate on the current branch — TypeScript typecheck, ESLint, web/CLI Jest tests, and blueprint validation — then summarize pass/fail. Use before opening a PR, or to check whether a branch is CI-ready.
+---
+
 # pr-check
 
 Run a full quality gate on the current branch before or after a PR is created. Orchestrates type checking, linting, and tests, then produces a structured summary.
diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe/SKILL.md
similarity index 97%
rename from .claude/skills/pr-describe.md
rename to .claude/skills/pr-describe/SKILL.md
index 6562bd8..0d535cb 100644
--- a/.claude/skills/pr-describe.md
+++ b/.claude/skills/pr-describe/SKILL.md
@@ -1,3 +1,8 @@
+---
+name: pr-describe
+description: Generate a structured PR description from the branch diff (Summary, Changes, Test plan, Risks, Related Issues), with optional static-gated before/after screenshots for visual changes. Use when writing or updating a pull request description.
+---
+
 # pr-describe
 
 Generate a high-quality PR description from the branch diff. The description

From 4dad721a5bfc5daaaa95fd4c91accfa6963a46c2 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 20:52:44 +0000
Subject: [PATCH 10/11] Fix e2e workflow pnpm version conflict

pnpm/action-setup@v4 errors when both `version` and package.json's
`packageManager` specify a pnpm version. Drop the explicit version: 9 and let
the action read pnpm@9.6.0 from packageManager.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .github/workflows/e2e.yml | 2 --
 1 file changed, 2 deletions(-)

diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml
index 75a799b..ab4669c 100644
--- a/.github/workflows/e2e.yml
+++ b/.github/workflows/e2e.yml
@@ -23,8 +23,6 @@ jobs:
       - uses: actions/checkout@v4
 
       - uses: pnpm/action-setup@v4
-        with:
-          version: 9
 
       - uses: actions/setup-node@v4
         with:

From c14cc658e75d1938d741dc1c72d399b74ce61f66 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 30 Jun 2026 20:57:10 +0000
Subject: [PATCH 11/11] Add PR title convention to pr-describe

Produce Conventional Commits titles (type(scope): summary), inferring type/scope
from the diff. The skill now sets/updates the PR title as well as the body.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g
---
 .claude/skills/pr-describe/SKILL.md | 30 +++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/.claude/skills/pr-describe/SKILL.md b/.claude/skills/pr-describe/SKILL.md
index 0d535cb..eafafa4 100644
--- a/.claude/skills/pr-describe/SKILL.md
+++ b/.claude/skills/pr-describe/SKILL.md
@@ -13,10 +13,32 @@ time-boxed, **fail soft** (any problem → clean text-only description), and are
 
 ## Output contract
 
-- If a PR already exists for the current branch → update its body
-  (GitHub MCP `update_pull_request`).
-- If no PR exists → print the finished markdown for the user to use. Do **NOT**
-  open a PR unless the user explicitly asked for one.
+- If a PR already exists for the current branch → update its **title and body**
+  (GitHub MCP `update_pull_request`). Only change the title if it doesn't already
+  follow the convention below.
+- If no PR exists → print the finished **title + markdown body** for the user.
+  Do **NOT** open a PR unless the user explicitly asked for one.
+
+## PR title convention
+
+Produce a properly tagged title in **Conventional Commits** form:
+
+```
+type(scope): imperative summary
+```
+
+- **type** — infer from the dominant change: `feat` (new capability), `fix`
+  (bug fix), `docs`, `test` (tests/test infra), `ci` (CI/workflows), `refactor`,
+  `perf`, `chore`, `build`, `style`.
+- **scope** — optional, a short area derived from the diff (e.g. `e2e`, `auth`,
+  `header`, `cli`, `pr-eval`). Omit if it spans many areas.
+- **summary** — imperative mood, lower-case start, **no trailing period**, aim
+  for ≤ ~70 chars total.
+- If the change mixes types, pick the one that best describes the user-facing
+  intent (a feature with its tests is still `feat`).
+
+Examples: `feat(header): collapse nav into a menu under 380px` ·
+`test(e2e): add Playwright smoke suite` · `fix(cli): handle empty blueprint id`.
 
 ## Security guardrails (NON-NEGOTIABLE — this repo is public)