From 31b225170305915418fab99fdcf26760fff79f46 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 18:10:07 +0000 Subject: [PATCH 01/11] Add agent skills for PR testing and evaluation Five Claude Code skills for quality-gating PRs in this Next.js/TypeScript project: - test-pr: run Jest tests scoped to files changed vs main - typecheck-pr: tsc --noEmit, surfaces errors new to the branch - lint-pr: Next.js ESLint on changed files, with optional --fix - blueprint-validate: validates Weval blueprint YAML structure and staging limits - pr-check: orchestrates all checks and produces a structured summary table Also updates .gitignore to use .claude/* + !.claude/skills so skill files are tracked while the rest of .claude/ stays ignored. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/blueprint-validate.md | 45 ++++++++++++++++++++++++ .claude/skills/lint-pr.md | 34 ++++++++++++++++++ .claude/skills/pr-check.md | 52 ++++++++++++++++++++++++++++ .claude/skills/test-pr.md | 37 ++++++++++++++++++++ .claude/skills/typecheck-pr.md | 30 ++++++++++++++++ .gitignore | 1 + 6 files changed, 199 insertions(+) create mode 100644 .claude/skills/blueprint-validate.md create mode 100644 .claude/skills/lint-pr.md create mode 100644 .claude/skills/pr-check.md create mode 100644 .claude/skills/test-pr.md create mode 100644 .claude/skills/typecheck-pr.md diff --git a/.claude/skills/blueprint-validate.md b/.claude/skills/blueprint-validate.md new file mode 100644 index 0000000..aff3139 --- /dev/null +++ b/.claude/skills/blueprint-validate.md @@ -0,0 +1,45 @@ +# blueprint-validate + +Validate Weval blueprint YAML/JSON files changed in the current branch. This mirrors Layer 1 of the three-layer PR evaluation system described in `docs/PR_EVALUATION_SETUP.md`. + +## Steps + +1. Find changed blueprint files: + ``` + git diff --name-only origin/main...HEAD -- 'examples/**' '*.yaml' '*.yml' '*.json' + ``` + Focus on files in `examples/blueprints/` or any `.yaml`/`.json` blueprint files. + +2. For each changed blueprint file, validate: + + **Structural checks:** + - Parse YAML/JSON — if it fails to parse, report the syntax error immediately + - Required top-level fields: `id`, `prompts` (check `docs/BLUEPRINT_FORMAT.md` for full schema) + - `id` must be a non-empty string + - `prompts` must be a non-empty array + - Each prompt must have the required fields + + **Size/cost checks (staging limits from Layer 2):** + - Warn if prompt count > 10 (would be auto-trimmed in staging eval) + - Warn if model list is large (only CORE models supported in staging) + - Note: these are warnings, not errors — production evals have no limits + + **Security check:** + - If the file is under `blueprints/users/{username}/`, the username in the path must match the PR author + - Get the PR author from `git log --format='%ae' -1` or from GitHub context if available + +3. For any blueprint that passes validation, summarize: + - Number of prompts + - Models targeted + - Estimated staging evaluation scope (after auto-trim if applicable) + +4. Report: + - ✅ or ❌ per file with specific error details + - Warnings for staging-limit exceedances + - A final pass/fail verdict + +## Args + +Optional: path to a specific blueprint file to validate instead of scanning changed files. + +Example: `/blueprint-validate` or `/blueprint-validate examples/blueprints/my-blueprint.yaml` diff --git a/.claude/skills/lint-pr.md b/.claude/skills/lint-pr.md new file mode 100644 index 0000000..8d1c0da --- /dev/null +++ b/.claude/skills/lint-pr.md @@ -0,0 +1,34 @@ +# lint-pr + +Run Next.js ESLint on files changed in the current branch and report new lint errors. + +## Steps + +1. Find changed JS/TS/TSX files: + ``` + git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx' + ``` + +2. Run lint on the changed files only (faster than full lint): + ``` + pnpm exec next lint --file --file ... + ``` + If there are more than 20 changed files, run the full lint instead: + ``` + pnpm lint + ``` + +3. Report: + - Pass/fail overall + - Each error and warning with file path, line number, rule name, and message + - Distinguish errors (must fix) from warnings (should fix) + - If zero issues: confirm clean + +4. Do NOT auto-fix lint errors unless the user explicitly passes `--fix` as an arg. + If `--fix` is passed, run `pnpm lint:fix` on the changed files and show a diff of what changed. + +## Args + +Optional: `--fix` to automatically apply safe lint fixes. + +Example: `/lint-pr` or `/lint-pr --fix` diff --git a/.claude/skills/pr-check.md b/.claude/skills/pr-check.md new file mode 100644 index 0000000..c66b2d4 --- /dev/null +++ b/.claude/skills/pr-check.md @@ -0,0 +1,52 @@ +# pr-check + +Run a full quality gate on the current branch before or after a PR is created. Orchestrates type checking, linting, and tests, then produces a structured summary. + +## Steps + +1. **Identify scope** — get the diff summary: + ``` + git diff --stat origin/main...HEAD + ``` + Note whether changes touch: source code, blueprint files, tests, docs, config. + +2. **Run checks in parallel where possible** (report results as each finishes): + + | Check | Command | When to run | + |-------|---------|-------------| + | TypeScript | `pnpm typecheck` | Always | + | Lint | `pnpm lint` | Always | + | Web tests | `pnpm test:web` | If `src/app/`, `src/components/`, `src/hooks/`, or `src/point-functions/` changed | + | CLI tests | `pnpm test:cli` | If `src/cli/` or `src/lib/` changed | + | Blueprint validate | see `blueprint-validate` skill | If any `.yaml`/`.yml`/`.json` blueprint files changed | + +3. **Collect results** — for each check record: + - Status: ✅ pass / ❌ fail / ⚠️ warnings / ⏭️ skipped (not applicable) + - Error/warning count + - Key details (first 5 errors max per check to keep output readable) + +4. **Produce a summary table:** + ``` + ## PR Check Results — + + | Check | Status | Details | + |---------------|--------|--------------------------| + | TypeScript | ✅ | 0 errors | + | Lint | ⚠️ | 2 warnings, 0 errors | + | Web tests | ✅ | 47 passed | + | CLI tests | ⏭️ | No CLI files changed | + | Blueprints | ✅ | 1 file validated | + + **Overall: READY TO MERGE** / **NEEDS FIXES** + ``` + +5. If any check fails, list the specific errors that need to be addressed. + +6. If `--comment` is passed as an arg, post this summary as a GitHub PR comment using the available GitHub MCP tools. + +## Args + +- `--comment`: Post the results as a GitHub PR comment (requires PR number to be detectable from the branch) +- Base branch (e.g. `main`, `staging`): defaults to `origin/main` + +Example: `/pr-check` or `/pr-check --comment` or `/pr-check staging` diff --git a/.claude/skills/test-pr.md b/.claude/skills/test-pr.md new file mode 100644 index 0000000..1be1953 --- /dev/null +++ b/.claude/skills/test-pr.md @@ -0,0 +1,37 @@ +# test-pr + +Run Jest tests for files changed in the current branch compared to main. + +## Steps + +1. Find changed files: + ``` + git diff --name-only origin/main...HEAD + ``` + +2. Identify which changed files have corresponding tests or ARE test files. Test files live in: + - Same directory as source with `.test.ts` / `.test.tsx` suffix + - `src/point-functions/` tests are colocated + - Jest configs: `jest.config.js` (web) and `jest.config.cli.js` (CLI) + +3. Decide which test suite(s) to run: + - If changes touch `src/app/`, `src/components/`, `src/hooks/` → web suite: `pnpm test:web` + - If changes touch `src/cli/`, `src/lib/` → CLI suite: `pnpm test:cli` + - If changes touch `src/point-functions/` → web suite (tests are colocated) + - Run both suites if changes span both areas + +4. Run only the tests for changed/affected files where possible: + - For targeted runs: `pnpm test:web -- --testPathPattern=""` + - If no specific pattern can be identified, run the full relevant suite + +5. Report: + - Pass/fail status + - Number of test suites and tests run + - Any failing tests with their error messages + - Which files were changed that had no test coverage (note as gaps, not failures) + +## Args + +Optional: a PR number or branch name to compare against. Defaults to `origin/main`. + +Example: `/test-pr 42` or `/test-pr feature/my-branch` diff --git a/.claude/skills/typecheck-pr.md b/.claude/skills/typecheck-pr.md new file mode 100644 index 0000000..404bbe8 --- /dev/null +++ b/.claude/skills/typecheck-pr.md @@ -0,0 +1,30 @@ +# typecheck-pr + +Run TypeScript type checking on the project and surface errors introduced by the current branch. + +## Steps + +1. Get the list of changed `.ts` / `.tsx` files: + ``` + git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' + ``` + +2. Run the full type check (tsc checks the whole project, not individual files): + ``` + pnpm typecheck + ``` + This runs `tsc --noEmit` using the root `tsconfig.json`. + +3. If the full typecheck fails, compare against main to isolate which errors are NEW: + - Note the total error count from the current branch + - Optionally stash changes, run `pnpm typecheck` on main, then restore — but only do this if there are errors and the user needs to know which are pre-existing vs. introduced + +4. Report: + - Overall pass/fail + - List of NEW type errors introduced by this branch (file, line, error message) + - Count of pre-existing errors if any (so the author knows what's theirs to fix) + - Flag any errors in files the PR didn't touch (may indicate type cascades from interface changes) + +## Args + +Optional: base branch to compare against. Defaults to `origin/main`. diff --git a/.gitignore b/.gitignore index aecac04..1513e23 100644 --- a/.gitignore +++ b/.gitignore @@ -17,6 +17,7 @@ catechism_dump.txt **/node_modules/** .claude +!.claude/skills # testing /coverage From 5c1411b1a3b9b91804d4329c8a713470ad080b40 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 18:10:28 +0000 Subject: [PATCH 02/11] Fix .gitignore to correctly unignore .claude/skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Change .claude → .claude/* so the !.claude/skills negation takes effect. Git cannot negate inside an ignored directory; switching to a glob on the directory's contents allows the exception to work. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .gitignore | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 1513e23..9bc32f4 100644 --- a/.gitignore +++ b/.gitignore @@ -16,7 +16,7 @@ catechism_dump.txt **/node_modules/** -.claude +.claude/* !.claude/skills # testing From 9a139d9a534fff2daa09ead720960359f4029468 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 18:30:45 +0000 Subject: [PATCH 03/11] Add Playwright e2e foundation + PR testing skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Stand up a deterministic, version-controlled e2e layer (best-practice @playwright/test specs in CI) and two skills that drive it as PRs are created. Foundation: - playwright.config.ts: boots `pnpm dev`, pre-warms /about as the readiness probe, retries+trace+video on CI. Auto-detects the sandbox Chromium at /opt/pw-browsers and falls back to bundled browser locally/in CI. - tests/e2e/smoke.spec.ts: smoke tests on /about (static, secret-free → green in CI without storage/API keys). README documents env caveats + conventions. - .github/workflows/e2e.yml: runs the suite on PRs/pushes, uploads HTML report. - scripts/pr-screenshots.mjs: fail-soft full-page screenshotter reused by the PR-description skill. - package.json scripts: test:e2e, test:e2e:ui, test:e2e:report, pr:screenshots. Skills: - e2e-pr: run the suite scoped to a PR's changed routes and report. - pr-describe: generate a PR description from the diff and, for visual changes, attach before/after screenshots (base captured in an isolated git worktree). Strictly time-boxed and fail-soft — falls back to a text-only description. Verified locally: both smoke tests pass and the screenshot script captures a full-page PNG via the sandbox Chromium. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/e2e-pr.md | 60 ++++++++++++++++ .claude/skills/pr-describe.md | 132 ++++++++++++++++++++++++++++++++++ .github/workflows/e2e.yml | 51 +++++++++++++ .gitignore | 6 ++ package.json | 5 ++ playwright.config.ts | 56 +++++++++++++++ pnpm-lock.yaml | 49 +++++++++++-- scripts/pr-screenshots.mjs | 80 +++++++++++++++++++++ tests/e2e/README.md | 51 +++++++++++++ tests/e2e/smoke.spec.ts | 25 +++++++ 10 files changed, 510 insertions(+), 5 deletions(-) create mode 100644 .claude/skills/e2e-pr.md create mode 100644 .claude/skills/pr-describe.md create mode 100644 .github/workflows/e2e.yml create mode 100644 playwright.config.ts create mode 100644 scripts/pr-screenshots.mjs create mode 100644 tests/e2e/README.md create mode 100644 tests/e2e/smoke.spec.ts diff --git a/.claude/skills/e2e-pr.md b/.claude/skills/e2e-pr.md new file mode 100644 index 0000000..8ae923c --- /dev/null +++ b/.claude/skills/e2e-pr.md @@ -0,0 +1,60 @@ +# e2e-pr + +Run the Playwright end-to-end suite against the routes a PR touches, and report results. + +## Prerequisites + +- `@playwright/test` is installed and `playwright.config.ts` exists at the repo root. +- The config auto-boots `pnpm dev` (its `webServer` block) and waits for `/about`, + so you do NOT need to start a server yourself unless one is already running. +- In CI the browser comes from `playwright install`; in the hosted sandbox it's + the pre-installed Chromium at `/opt/pw-browsers`. The config handles both. + +## Steps + +1. **Find changed files** vs the base branch (default `origin/main`): + ``` + git diff --name-only origin/main...HEAD + ``` + +2. **Map changed files to affected routes** (best-effort): + - `src/app/(group)/foo/page.tsx` → `/foo` (drop `src/app`, drop `(route-group)` + segments, drop the trailing `/page.tsx`) + - `src/app/(standard)/page.tsx` → `/` + - `layout.tsx` / `template.tsx` changes affect every route beneath them. + - Dynamic segments (`[id]`) can't be visited without a concrete value — note + them but don't try to test them blindly. + - Changes under `src/components/**` are shared and can't be mapped to a single + route → treat as "broad" (run the whole suite). + +3. **Choose scope:** + - If specific route files changed and there are specs covering them, run those: + ``` + pnpm test:e2e -g "" + ``` + or pass specific spec files: `pnpm test:e2e tests/e2e/.spec.ts` + - If changes are broad (shared components, layout, config) OR the suite is + small, just run everything: `pnpm test:e2e` + +4. **Run** the chosen command. The first run cold-compiles routes in `next dev`, + so allow time — the config already uses generous timeouts. + +5. **Report:** + - Pass/fail, number of tests run, and any failures with their error message. + - On failure, point to the HTML report (`pnpm test:e2e:report`) and the trace + (saved under `test-results/` on first retry). + - Call out any changed routes that have NO e2e coverage as gaps (not failures), + so the author can decide whether to add a spec. + +## Notes + +- Data-dependent routes (homepage, `/analysis/*`) need env/secrets or network + mocking — see `tests/e2e/README.md`. Don't add flaky assertions on them. +- Do NOT auto-write new specs here; that's authoring. This skill runs existing + tests. (Use `pr-describe` or a dedicated authoring step to add coverage.) + +## Args + +Optional base branch to diff against (default `origin/main`). + +Example: `/e2e-pr` or `/e2e-pr staging` diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md new file mode 100644 index 0000000..690bf44 --- /dev/null +++ b/.claude/skills/pr-describe.md @@ -0,0 +1,132 @@ +# pr-describe + +Generate a high-quality PR description from the branch diff. When the change is +**visual** (touches UI routes/components/styles), TRY to attach before/after +screenshots — but the attempt is strictly time-boxed and **fails soft**: if +anything goes wrong, produce a clean text-only description and move on. + +## Output contract + +- If a PR already exists for the current branch → update its body + (GitHub MCP `update_pull_request`). +- If no PR exists → print the finished markdown for the user to use. Do **NOT** + open a PR unless the user explicitly asked for one. + +--- + +## Phase 1 — Analyze the diff (always runs) + +``` +git fetch origin --quiet +git diff --stat origin/...HEAD +git diff origin/...HEAD +git log origin/..HEAD --format='%s%n%b' +``` + +Draft the description with these sections: + +- **Summary** — 1-3 sentences on what changed and why. +- **Motivation / context** — the problem or request behind it. +- **Changes** — bulleted, grouped by area (UI, API, CLI, tests, docs…). +- **Test plan** — what you ran (`pnpm typecheck`, `pnpm lint`, `pnpm test:web`, + `pnpm test:e2e`) and the result, plus manual steps if any. +- **Screenshots** — filled in by Phase 4 if applicable, else omitted. + +## Phase 2 — Does the screenshot step apply? + +Visual-change heuristic — TRUE if any changed file matches: +- `src/app/**/(page|layout|template).tsx` +- `src/components/**/*.tsx` +- `**/*.css` + +If FALSE → skip to Phase 5 (text-only). If TRUE → continue, subject to the +give-up policy below. + +## Phase 3 — Capture screenshots (time-boxed, fail-soft) + +> **GIVE-UP POLICY — bail to text-only (Phase 5) and add a one-line note if ANY hold:** +> - Total screenshot phase exceeds **~6 minutes** wall-clock. +> - A dev server fails to become ready within **120s**. +> - Zero routes can be mapped (e.g. only shared components / dynamic routes changed +> and the user gave no route to shoot). +> - The screenshot script captures nothing (`scripts/pr-screenshots.mjs` exits non-zero). +> - Any unexpected error. Never let screenshots block the description. + +**3a. Determine routes** (cap at the 3 most relevant; note any you dropped): +- Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules). +- Shared-component-only change → ask the user for 1-2 representative routes, OR + skip with a note. Don't guess across the whole app. +- Skip dynamic (`[id]`) routes unless the user supplies a concrete URL. + +Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`. + +**3b. Capture AFTER (current branch, HEAD).** Reuse a running dev server on +`:3172` if present, else the config/`pnpm dev` will serve it. Then: +``` +node scripts/pr-screenshots.mjs --base-url http://localhost:3172 \ + --routes "$ROUTES" --out .github/pr-media/$SLUG --label after +``` + +**3c. Capture BEFORE (base branch) in an isolated worktree** so the working tree +is untouched. Symlink `node_modules` to avoid a slow reinstall (valid as long as +the PR didn't change dependencies — if it did, note that the "before" shot may +be approximate): +``` +WT=$(mktemp -d) +git worktree add --detach "$WT" origin/ +ln -s "$PWD/node_modules" "$WT/node_modules" +( cd "$WT" && pnpm exec next dev -p 3173 ) & # remember the PID +# poll http://localhost:3173/about until it responds (cap 120s) +node scripts/pr-screenshots.mjs --base-url http://localhost:3173 \ + --routes "$ROUTES" --out .github/pr-media/$SLUG --label before +# then ALWAYS clean up: +kill ; git worktree remove --force "$WT" +``` + +If BEFORE fails but AFTER succeeded, proceed with after-only + a note. + +## Phase 4 — Host & embed + +GitHub renders images only from URLs, so the PNGs must be committed and pushed +before they resolve. Commit them to the PR branch and reference raw URLs: + +``` +git add .github/pr-media/$SLUG +git commit -m "Add PR before/after screenshots" +git push +``` + +Derive `OWNER/REPO` from `git remote get-url origin` and `BRANCH` from +`git rev-parse --abbrev-ref HEAD`. For each route build a row: + +```md +### Screenshots + +#### `/about` +| Before | After | +|--------|-------| +| ![before](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-before.png) | ![after](https://raw.githubusercontent.com/OWNER/REPO/BRANCH/.github/pr-media/SLUG/about-after.png) | +``` + +(Omit the "Before" cell for routes where only the after shot exists — new pages.) + +> **Tradeoff:** this commits PNGs into the PR branch (they show in the diff). They +> can be deleted before merge, or — if you'd rather keep them out of the diff — +> commit them to a dedicated orphan `pr-media` branch and point the raw URLs at +> that branch instead. Default is the PR branch for simplicity. + +## Phase 5 — Finalize + +- Assemble the full body (Phase 1 sections + Phase 4 screenshots if any). +- If a PR exists → update its body via GitHub MCP. Else → print the markdown. +- If screenshots were skipped, include one honest line, e.g. + _"Screenshots skipped: change is API-only"_ or _"…: dev server didn't boot in time"_. + Never silently drop them without saying why. + +## Args + +- Base branch to diff against (default `origin/main`). +- Optional route list to screenshot, e.g. `--routes /about,/pairs` (overrides + auto-detection — useful for shared-component changes). + +Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /about,/pairs` diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml new file mode 100644 index 0000000..75a799b --- /dev/null +++ b/.github/workflows/e2e.yml @@ -0,0 +1,51 @@ +name: E2E Tests + +on: + pull_request: + paths-ignore: + - '**/*.md' + - 'docs/**' + push: + branches: [main] + paths-ignore: + - '**/*.md' + - 'docs/**' + +concurrency: + group: e2e-${{ github.ref }} + cancel-in-progress: true + +jobs: + e2e: + runs-on: ubuntu-latest + timeout-minutes: 20 + steps: + - uses: actions/checkout@v4 + + - uses: pnpm/action-setup@v4 + with: + version: 9 + + - uses: actions/setup-node@v4 + with: + node-version: 18 + cache: pnpm + + - name: Install dependencies + run: pnpm install --frozen-lockfile + + - name: Install Playwright Chromium + run: pnpm exec playwright install --with-deps chromium + + - name: Run E2E tests + run: pnpm test:e2e + env: + CI: 'true' + + - name: Upload Playwright report + if: ${{ !cancelled() }} + uses: actions/upload-artifact@v4 + with: + name: playwright-report + path: playwright-report/ + retention-days: 14 diff --git a/.gitignore b/.gitignore index 9bc32f4..0c90d52 100644 --- a/.gitignore +++ b/.gitignore @@ -25,6 +25,12 @@ catechism_dump.txt .swc .auth +# playwright +/test-results/ +/playwright-report/ +/blob-report/ +/playwright/.cache/ + # next.js /.next/ /out/ diff --git a/package.json b/package.json index c30e12b..d0206fd 100644 --- a/package.json +++ b/package.json @@ -20,6 +20,10 @@ "test": "pnpm test:web && pnpm test:cli", "test:web": "jest --config jest.config.js", "test:cli": "node --experimental-vm-modules node_modules/jest/bin/jest.js --config jest.config.cli.js", + "test:e2e": "playwright test", + "test:e2e:ui": "playwright test --ui", + "test:e2e:report": "playwright show-report", + "pr:screenshots": "node scripts/pr-screenshots.mjs", "test:personality": "tsx src/cli/personality-test.ts", "compare:personality": "tsx src/cli/compare-personality.ts", "test:preference": "tsx src/cli/preference-test.ts", @@ -129,6 +133,7 @@ "devDependencies": { "@jest/globals": "^30.0.5", "@next/bundle-analyzer": "^15.4.4", + "@playwright/test": "1.55.0", "@sentry/cli": "^2.57.0", "@tailwindcss/postcss": "^4.1.11", "@tailwindcss/typography": "^0.5.16", diff --git a/playwright.config.ts b/playwright.config.ts new file mode 100644 index 0000000..dfa8900 --- /dev/null +++ b/playwright.config.ts @@ -0,0 +1,56 @@ +import { defineConfig, devices } from '@playwright/test'; +import { existsSync } from 'node:fs'; + +/** + * In the hosted agent sandbox, Chromium is pre-installed at this path and + * `playwright install` is disabled. When the binary is present we point + * Playwright at it directly; otherwise (local dev, CI) we let Playwright use + * its own bundled browser installed via `playwright install chromium`. + */ +const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium'; +const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined; + +const PORT = 3172; +const BASE_URL = process.env.E2E_BASE_URL ?? `http://localhost:${PORT}`; + +export default defineConfig({ + testDir: './tests/e2e', + fullyParallel: true, + forbidOnly: !!process.env.CI, + retries: process.env.CI ? 2 : 0, + workers: process.env.CI ? 1 : undefined, + reporter: process.env.CI + ? [['list'], ['html', { open: 'never' }]] + : [['list']], + timeout: 60_000, + expect: { timeout: 15_000 }, + use: { + baseURL: BASE_URL, + navigationTimeout: 45_000, + trace: 'on-first-retry', + screenshot: 'only-on-failure', + video: 'retain-on-failure', + }, + projects: [ + { + name: 'chromium', + use: { ...devices['Desktop Chrome'], launchOptions: { executablePath } }, + }, + ], + /** + * When E2E_BASE_URL is set we assume the app is already running (e.g. a + * production build or a remote deploy) and skip booting a dev server. + * Otherwise boot `pnpm dev`; the readiness probe hits /about — a static, + * dependency-free route — which also pre-compiles it so the first test is fast. + */ + webServer: process.env.E2E_BASE_URL + ? undefined + : { + command: 'pnpm dev', + url: `http://localhost:${PORT}/about`, + reuseExistingServer: !process.env.CI, + timeout: 180_000, + stdout: 'pipe', + stderr: 'pipe', + }, +}); diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 2794962..15eb0e1 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -82,7 +82,7 @@ importers: version: 1.2.7(@types/react-dom@19.1.6(@types/react@19.1.8))(@types/react@19.1.8)(react-dom@19.1.0(react@19.1.0))(react@19.1.0) '@sentry/nextjs': specifier: ^10.21.0 - version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1) + version: 10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1) '@sentry/node': specifier: ^10.21.0 version: 10.21.0 @@ -166,7 +166,7 @@ importers: version: 3.1.0 next: specifier: 15.5.9 - version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0) + version: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0) next-themes: specifier: ^0.4.6 version: 0.4.6(react-dom@19.1.0(react@19.1.0))(react@19.1.0) @@ -237,6 +237,9 @@ importers: '@next/bundle-analyzer': specifier: ^15.4.4 version: 15.4.4 + '@playwright/test': + specifier: 1.55.0 + version: 1.55.0 '@sentry/cli': specifier: ^2.57.0 version: 2.57.0(encoding@0.1.13) @@ -1825,6 +1828,11 @@ packages: resolution: {integrity: sha512-YLT9Zo3oNPJoBjBc4q8G2mjU4tqIbf5CEOORbUUr48dCD9q3umJ3IPlVqOqDakPfd2HuwccBaqlGhN4Gmr5OWg==} engines: {node: ^12.20.0 || ^14.18.0 || >=16.0.0} + '@playwright/test@1.55.0': + resolution: {integrity: sha512-04IXzPwHrW69XusN/SIdDdKZBzMfOT9UNT/YiJit/xpy2VuAoB8NHc8Aplb96zsWDddLnbkPL3TsmrS04ZU2xQ==} + engines: {node: '>=18'} + hasBin: true + '@polka/url@1.0.0-next.29': resolution: {integrity: sha512-wwQAWhWSuHaag8c4q/KN/vCoeOJYshAIvMQwD4GpSb3OiZklFfvAgmj0VCBBImRpuF/aFgIRzllXlVX93Jevww==} @@ -4558,6 +4566,11 @@ packages: fs.realpath@1.0.0: resolution: {integrity: sha512-OO0pH2lK6a0hZnAdau5ItzHPI6pUlvI7jMVnxUQRtw4owF2wk8lOSabtGDCTP4Ggrg2MbGnWO9X8K1t4+fGMDw==} + fsevents@2.3.2: + resolution: {integrity: sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==} + engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0} + os: [darwin] + fsevents@2.3.3: resolution: {integrity: sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==} engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0} @@ -5885,6 +5898,16 @@ packages: resolution: {integrity: sha512-HRDzbaKjC+AOWVXxAU/x54COGeIv9eb+6CkDSQoNTt4XyWoIJvuPsXizxu/Fr23EiekbtZwmh1IcIG/l/a10GQ==} engines: {node: '>=8'} + playwright-core@1.55.0: + resolution: {integrity: sha512-GvZs4vU3U5ro2nZpeiwyb0zuFaqb9sUiAJuyrWpcGouD8y9/HLgGbNRjIph7zU9D3hnPaisMl9zG9CgFi/biIg==} + engines: {node: '>=18'} + hasBin: true + + playwright@1.55.0: + resolution: {integrity: sha512-sdCWStblvV1YU909Xqx0DhOjPZE4/5lJsIS84IfN9dAZfcl/CIZ5O8l3o0j7hPMjDvqoTF8ZUcc+i/GL5erstA==} + engines: {node: '>=18'} + hasBin: true + postcss-import@15.1.0: resolution: {integrity: sha512-hpr+J05B2FVYUAXHeK1YyI267J/dDDhMU6B6civm8hSY1jYJnBXxzKDKDswzJmtLHryrjhnDjqqp/49t8FALew==} engines: {node: '>=14.0.0'} @@ -9217,6 +9240,10 @@ snapshots: '@pkgr/core@0.2.7': {} + '@playwright/test@1.55.0': + dependencies: + playwright: 1.55.0 + '@polka/url@1.0.0-next.29': {} '@prisma/instrumentation@6.15.0(@opentelemetry/api@1.9.0)': @@ -9997,7 +10024,7 @@ snapshots: '@sentry/core@10.21.0': {} - '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)': + '@sentry/nextjs@10.21.0(@opentelemetry/context-async-hooks@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/core@2.2.0(@opentelemetry/api@1.9.0))(@opentelemetry/sdk-trace-base@2.2.0(@opentelemetry/api@1.9.0))(encoding@0.1.13)(next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0))(react@19.1.0)(webpack@5.102.1)': dependencies: '@opentelemetry/api': 1.9.0 '@opentelemetry/semantic-conventions': 1.37.0 @@ -10011,7 +10038,7 @@ snapshots: '@sentry/vercel-edge': 10.21.0 '@sentry/webpack-plugin': 4.5.0(encoding@0.1.13)(webpack@5.102.1) chalk: 3.0.0 - next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0) + next: 15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0) resolve: 1.22.8 rollup: 4.44.2 stacktrace-parser: 0.1.11 @@ -12331,6 +12358,9 @@ snapshots: fs.realpath@1.0.0: {} + fsevents@2.3.2: + optional: true + fsevents@2.3.3: optional: true @@ -13818,7 +13848,7 @@ snapshots: react: 19.1.0 react-dom: 19.1.0(react@19.1.0) - next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0): + next@15.5.9(@babel/core@7.27.4)(@opentelemetry/api@1.9.0)(@playwright/test@1.55.0)(react-dom@19.1.0(react@19.1.0))(react@19.1.0): dependencies: '@next/env': 15.5.9 '@swc/helpers': 0.5.15 @@ -13837,6 +13867,7 @@ snapshots: '@next/swc-win32-arm64-msvc': 15.5.7 '@next/swc-win32-x64-msvc': 15.5.7 '@opentelemetry/api': 1.9.0 + '@playwright/test': 1.55.0 sharp: 0.34.3 transitivePeerDependencies: - '@babel/core' @@ -14068,6 +14099,14 @@ snapshots: dependencies: find-up: 4.1.0 + playwright-core@1.55.0: {} + + playwright@1.55.0: + dependencies: + playwright-core: 1.55.0 + optionalDependencies: + fsevents: 2.3.2 + postcss-import@15.1.0(postcss@8.5.6): dependencies: postcss: 8.5.6 diff --git a/scripts/pr-screenshots.mjs b/scripts/pr-screenshots.mjs new file mode 100644 index 0000000..8688a48 --- /dev/null +++ b/scripts/pr-screenshots.mjs @@ -0,0 +1,80 @@ +// Capture full-page screenshots of routes, for PR before/after comparisons. +// +// Usage: +// node scripts/pr-screenshots.mjs \ +// --base-url http://localhost:3172 \ +// --routes /about,/what-is-an-eval \ +// --out .github/pr-media/my-branch \ +// --label after +// +// Designed to FAIL SOFT: a route that errors or times out is skipped (and +// logged), not fatal, so the orchestrating skill can still produce a partial +// result. Exits non-zero only if NOTHING was captured. +// +// Reuses the @playwright/test Chromium (no extra dependency). In the hosted +// sandbox it launches the pre-installed browser at /opt/pw-browsers/chromium. + +import { chromium } from '@playwright/test'; +import { existsSync, mkdirSync } from 'node:fs'; +import path from 'node:path'; + +function arg(name, fallback) { + const i = process.argv.indexOf(`--${name}`); + return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : fallback; +} + +const baseUrl = arg('base-url', 'http://localhost:3172').replace(/\/$/, ''); +const routes = arg('routes', '/') + .split(',') + .map((r) => r.trim()) + .filter(Boolean); +const outDir = arg('out', '.github/pr-media'); +const label = arg('label', 'after'); +const viewport = { + width: Number(arg('width', '1280')), + height: Number(arg('height', '800')), +}; +const perRouteTimeoutMs = Number(arg('timeout', '45000')); + +const SANDBOX_CHROMIUM = '/opt/pw-browsers/chromium'; +const executablePath = existsSync(SANDBOX_CHROMIUM) ? SANDBOX_CHROMIUM : undefined; + +function slug(route) { + const s = route.replace(/^\/+|\/+$/g, '').replace(/[^a-zA-Z0-9._-]+/g, '_'); + return s || 'home'; +} + +mkdirSync(outDir, { recursive: true }); + +const browser = await chromium.launch({ executablePath }); +const context = await browser.newContext({ viewport }); +const results = []; + +for (const route of routes) { + const page = await context.newPage(); + const file = path.join(outDir, `${slug(route)}-${label}.png`); + try { + // 'load' (not 'networkidle') — Next.js dev keeps an HMR websocket open, + // so networkidle would never settle. + await page.goto(`${baseUrl}${route}`, { + waitUntil: 'load', + timeout: perRouteTimeoutMs, + }); + await page.waitForTimeout(750); // let fonts/animations settle + await page.screenshot({ path: file, fullPage: true }); + results.push({ route, file, ok: true }); + console.log(`OK ${route} -> ${file}`); + } catch (err) { + results.push({ route, ok: false, error: String(err?.message || err) }); + console.log(`FAIL ${route}: ${err?.message || err}`); + } finally { + await page.close(); + } +} + +await context.close(); +await browser.close(); + +const ok = results.filter((r) => r.ok).length; +console.log(`\n${ok}/${routes.length} screenshots captured in ${outDir}`); +process.exit(ok > 0 ? 0 : 1); diff --git a/tests/e2e/README.md b/tests/e2e/README.md new file mode 100644 index 0000000..bd7c117 --- /dev/null +++ b/tests/e2e/README.md @@ -0,0 +1,51 @@ +# End-to-end tests + +Playwright e2e tests for the Weval web app. + +## Running + +```bash +pnpm test:e2e # run the suite (auto-boots `pnpm dev` on :3172) +pnpm test:e2e:ui # interactive UI mode +pnpm test:e2e:report # open the last HTML report +``` + +You don't need to start the dev server yourself — `playwright.config.ts` has a +`webServer` block that boots `pnpm dev` and waits for `/about` to respond. If a +dev server is already running on `:3172` it is reused (locally). + +To run against an already-running app (e.g. a production build or a deployed +preview) instead of booting dev: + +```bash +E2E_BASE_URL=https://your-preview.example.com pnpm test:e2e +``` + +## Browser binary + +- **Local / CI:** Playwright uses its own bundled Chromium. Install it once with + `pnpm exec playwright install --with-deps chromium`. +- **Hosted agent sandbox:** Chromium is pre-installed at `/opt/pw-browsers`. The + config auto-detects it (`executablePath`) and never downloads. + +## What's safe to test here + +Smoke tests deliberately target **statically rendered, dependency-free routes** +(`/about`, `/what-is-an-eval`, …) so they pass in CI without any secrets. + +Routes that read from storage (S3) or call external LLM APIs — the homepage, +`/analysis/*`, `/latest`, etc. — will be slow or error without env/network. To +cover those, either: + +- provide the relevant env vars (see `.env.template`), or +- intercept network calls with `page.route(...)` and serve fixtures. + +Keep flaky, data-dependent assertions out of the default suite. + +## Conventions + +- Prefer role/text locators (`getByRole`, `getByText`) and `a[href*="…"]` over + brittle CSS/nth-child selectors. +- Never use hard `waitForTimeout` for synchronization — rely on web-first + assertions (`await expect(locator).toBeVisible()`), which auto-wait. +- Add a `data-testid` to a component only when no accessible/role selector works. diff --git a/tests/e2e/smoke.spec.ts b/tests/e2e/smoke.spec.ts new file mode 100644 index 0000000..10da7e0 --- /dev/null +++ b/tests/e2e/smoke.spec.ts @@ -0,0 +1,25 @@ +import { test, expect } from '@playwright/test'; + +/** + * Smoke tests target dependency-free, statically rendered routes so they pass + * in CI without storage/API secrets. Data-driven routes (the homepage, + * /analysis, etc.) hit storage and external LLM APIs — to test those, provide + * env vars or mock the network first. See tests/e2e/README.md. + */ +test.describe('smoke', () => { + test('about page renders its title and key content', async ({ page }) => { + await page.goto('/about'); + + await expect(page).toHaveTitle(/About Weval/i); + await expect( + page.getByRole('heading', { name: /what are evaluations\?/i }), + ).toBeVisible(); + }); + + test('about page links out to the Collective Intelligence Project', async ({ page }) => { + await page.goto('/about'); + + const cipLink = page.locator('a[href*="cip.org"]').first(); + await expect(cipLink).toBeVisible(); + }); +}); From d2e631a3e8ac5f3145f3e03e97394bf342aabcd7 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 19:15:36 +0000 Subject: [PATCH 04/11] Add security guardrails to pr-describe skill The repo is public, so committed screenshots are permanent and world-readable. Bake in non-negotiable rules: route denylist (/admin*, /api*, auth routes), secret-free capture env only, no false "delete before merge" comfort, CI-artifact hosting for anything potentially sensitive, and human review before pushing. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/pr-describe.md | 50 +++++++++++++++++++++++++++++++---- 1 file changed, 45 insertions(+), 5 deletions(-) diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md index 690bf44..c177fbb 100644 --- a/.claude/skills/pr-describe.md +++ b/.claude/skills/pr-describe.md @@ -12,6 +12,28 @@ anything goes wrong, produce a clean text-only description and move on. - If no PR exists → print the finished markdown for the user to use. Do **NOT** open a PR unless the user explicitly asked for one. +## Security guardrails (NON-NEGOTIABLE — this repo is public) + +Screenshots publish whatever they render. On a **public** repo, anything you +commit is world-readable via raw URLs **permanently** (git history, forks, CDN +cache) — deleting the branch does NOT undo it. So: + +1. **Route denylist — never screenshot these, no exceptions:** + `/admin*`, `/api*`, and any authenticated/session/account route. If a changed + route matches, skip it and note "omitted for safety", do not capture it. +2. **Capture only against a secret-free environment.** Use the local `pnpm dev` + with NO real storage/API secrets wired up, so data-driven pages render + empty/mock and there is nothing sensitive in frame. Do **not** screenshot a + preview deploy that is backed by real data/secrets and then commit it here. +3. **Treat committed images as permanent and public.** Never tell the user they + can "delete before merge" to undo exposure — that is false. +4. **If a capture could contain anything sensitive, do NOT commit it to the + branch.** Prefer CI-artifact hosting (collaborator-only, auto-expiring) or a + PR comment. Committing to the public branch is only for plainly non-sensitive, + static UI. +5. **Surface images for human review before pushing.** Show the user what was + captured and let them confirm; never push blind. + --- ## Phase 1 — Analyze the diff (always runs) @@ -54,9 +76,14 @@ give-up policy below. **3a. Determine routes** (cap at the 3 most relevant; note any you dropped): - Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules). +- **Apply the route denylist (see Security guardrails):** drop any `/admin*`, + `/api*`, or authenticated route and note it as "omitted for safety". This + filter runs before anything is captured. - Shared-component-only change → ask the user for 1-2 representative routes, OR skip with a note. Don't guess across the whole app. - Skip dynamic (`[id]`) routes unless the user supplies a concrete URL. +- Confirm the dev server has **no real storage/API secrets** in its env before + capturing (guardrail #2). If you can't confirm that, skip screenshots. Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`. @@ -87,8 +114,15 @@ If BEFORE fails but AFTER succeeded, proceed with after-only + a note. ## Phase 4 — Host & embed +> **Before any commit: re-confirm the captures are non-sensitive (guardrails +> #1–#4) and show them to the user for a quick look (guardrail #5).** Committing +> to a public branch is permanent and irreversible. If there is any doubt about +> the contents, use CI-artifact hosting instead (see "Sensitive captures" below) +> or skip embedding entirely. + GitHub renders images only from URLs, so the PNGs must be committed and pushed -before they resolve. Commit them to the PR branch and reference raw URLs: +before they resolve. For plainly non-sensitive, static UI, commit them to the PR +branch and reference raw URLs: ``` git add .github/pr-media/$SLUG @@ -110,10 +144,16 @@ Derive `OWNER/REPO` from `git remote get-url origin` and `BRANCH` from (Omit the "Before" cell for routes where only the after shot exists — new pages.) -> **Tradeoff:** this commits PNGs into the PR branch (they show in the diff). They -> can be deleted before merge, or — if you'd rather keep them out of the diff — -> commit them to a dedicated orphan `pr-media` branch and point the raw URLs at -> that branch instead. Default is the PR branch for simplicity. +> **Tradeoff:** this commits PNGs into the PR branch and they show in the diff. +> On a public repo this is **permanent and world-readable** — do NOT claim they +> can be "deleted before merge" to undo exposure. To keep them out of the PR's +> own diff you can use a dedicated orphan `pr-media` branch, but that is still +> public; it changes visibility-in-diff, not exposure. + +**Sensitive captures → don't commit; use CI artifacts instead.** If a shot could +contain anything non-public, skip the commit and have the e2e workflow upload the +images via `actions/upload-artifact` (collaborator-only, auto-expiring). Link the +run/artifact from the PR body rather than embedding a public raw URL. ## Phase 5 — Finalize From 1f36dd0505403043dd06732cecfedf3b6c59abb0 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 19:31:10 +0000 Subject: [PATCH 05/11] Gate pr-describe screenshots to static routes by default Screenshots only pay off on static, secret-free routes; data-driven pages render empty/mock against the local dev env. Default to a static allowlist (/about, /what-is-an-eval) and skip everything else with a note, overridable via --routes. Filters run in order: route mapping -> security denylist (non-overridable) -> static gate (overridable) -> dynamic-route skip. If nothing survives, fall back to a text-only description without booting servers. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/pr-describe.md | 60 +++++++++++++++++++++++++---------- 1 file changed, 44 insertions(+), 16 deletions(-) diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md index c177fbb..724e32c 100644 --- a/.claude/skills/pr-describe.md +++ b/.claude/skills/pr-describe.md @@ -1,9 +1,10 @@ # pr-describe -Generate a high-quality PR description from the branch diff. When the change is -**visual** (touches UI routes/components/styles), TRY to attach before/after -screenshots — but the attempt is strictly time-boxed and **fails soft**: if -anything goes wrong, produce a clean text-only description and move on. +Generate a high-quality PR description from the branch diff. The description +itself is the core deliverable and always runs. Before/after screenshots are a +**bonus that only activates on static, visual PRs** — they're strictly +time-boxed, **fail soft** (any problem → clean text-only description), and are +**off by default for data-driven routes** (see "Default screenshot scope"). ## Output contract @@ -62,7 +63,26 @@ Visual-change heuristic — TRUE if any changed file matches: - `**/*.css` If FALSE → skip to Phase 5 (text-only). If TRUE → continue, subject to the -give-up policy below. +default scope and give-up policy below. + +### Default screenshot scope (practicality gate) + +Screenshots only pay off on **static, secret-free routes**. Data-driven routes +(homepage, `/analysis/*`, `/pairs`, `/latest`, `/model/*`, etc.) render +empty/mock against the secret-free dev env — an uninformative shot — and cost a +slow capture. So **by default, only capture known-static routes**: + +- **Default static allowlist:** `/about`, `/what-is-an-eval`. (Extend this list + as more static pages are confirmed safe + stable.) +- Any mapped route **not** on the allowlist is **skipped by default** with a note: + _"skipped: data-driven route (pass `--routes` to force)"_. +- The user can **override** with `--routes /foo,/bar` to force specific routes + (e.g. against a preview deploy with real data, where they accept the tradeoff). +- The **security denylist always wins** over any override — `/admin*`, `/api*`, + and auth routes are never captured even if explicitly passed. + +If, after this gate, there are **no routes left to shoot** → skip to Phase 5 +(text-only) with a one-line note. Don't boot servers for nothing. ## Phase 3 — Capture screenshots (time-boxed, fail-soft) @@ -74,16 +94,20 @@ give-up policy below. > - The screenshot script captures nothing (`scripts/pr-screenshots.mjs` exits non-zero). > - Any unexpected error. Never let screenshots block the description. -**3a. Determine routes** (cap at the 3 most relevant; note any you dropped): -- Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules). -- **Apply the route denylist (see Security guardrails):** drop any `/admin*`, - `/api*`, or authenticated route and note it as "omitted for safety". This - filter runs before anything is captured. -- Shared-component-only change → ask the user for 1-2 representative routes, OR - skip with a note. Don't guess across the whole app. -- Skip dynamic (`[id]`) routes unless the user supplies a concrete URL. +**3a. Determine routes** (cap at the 3 most relevant; note any you dropped). +Apply these filters **in order**: +1. Map changed `src/app/**/page.tsx` to URLs (see `e2e-pr` for the mapping rules). +2. **Security denylist (always, non-overridable):** drop any `/admin*`, `/api*`, + or authenticated route; note as "omitted for safety". +3. **Default static gate (see Phase 2):** unless the user passed `--routes`, drop + anything not on the static allowlist; note as "skipped: data-driven route". + If `--routes` was passed, use exactly those (still subject to step 2). +4. Skip dynamic (`[id]`) routes unless the user supplies a concrete URL. +5. Shared-component-only change with nothing left → ask the user for 1-2 + representative static routes, or skip with a note. Don't guess across the app. - Confirm the dev server has **no real storage/API secrets** in its env before capturing (guardrail #2). If you can't confirm that, skip screenshots. +- If no routes survive the filters → skip to Phase 5 (text-only). Let `SLUG` = sanitized branch name, `ROUTES` = comma-separated list, e.g. `/about,/what-is-an-eval`. @@ -166,7 +190,11 @@ run/artifact from the PR body rather than embedding a public raw URL. ## Args - Base branch to diff against (default `origin/main`). -- Optional route list to screenshot, e.g. `--routes /about,/pairs` (overrides - auto-detection — useful for shared-component changes). +- `--routes /foo,/bar` — **override the default static-only gate** and capture + exactly these routes (still subject to the security denylist). Use this for + shared-component changes or when shooting a preview deploy with real data. + +By default (no `--routes`), only known-static routes (`/about`, +`/what-is-an-eval`) are captured; data-driven routes are skipped with a note. -Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /about,/pairs` +Example: `/pr-describe` · `/pr-describe staging` · `/pr-describe --routes /pairs,/latest` From fe96f099c7473e5399ab66c0251a75d5f46f3cef Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 19:52:00 +0000 Subject: [PATCH 06/11] Remove redundant single-check PR skills Drop test-pr, typecheck-pr, lint-pr, and blueprint-validate. pr-check already orchestrates typecheck/lint/tests directly via their pnpm commands, so the standalone variants were redundant. Remaining skills: pr-check, e2e-pr, pr-describe. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/blueprint-validate.md | 45 ---------------------------- .claude/skills/lint-pr.md | 34 --------------------- .claude/skills/test-pr.md | 37 ----------------------- .claude/skills/typecheck-pr.md | 30 ------------------- 4 files changed, 146 deletions(-) delete mode 100644 .claude/skills/blueprint-validate.md delete mode 100644 .claude/skills/lint-pr.md delete mode 100644 .claude/skills/test-pr.md delete mode 100644 .claude/skills/typecheck-pr.md diff --git a/.claude/skills/blueprint-validate.md b/.claude/skills/blueprint-validate.md deleted file mode 100644 index aff3139..0000000 --- a/.claude/skills/blueprint-validate.md +++ /dev/null @@ -1,45 +0,0 @@ -# blueprint-validate - -Validate Weval blueprint YAML/JSON files changed in the current branch. This mirrors Layer 1 of the three-layer PR evaluation system described in `docs/PR_EVALUATION_SETUP.md`. - -## Steps - -1. Find changed blueprint files: - ``` - git diff --name-only origin/main...HEAD -- 'examples/**' '*.yaml' '*.yml' '*.json' - ``` - Focus on files in `examples/blueprints/` or any `.yaml`/`.json` blueprint files. - -2. For each changed blueprint file, validate: - - **Structural checks:** - - Parse YAML/JSON — if it fails to parse, report the syntax error immediately - - Required top-level fields: `id`, `prompts` (check `docs/BLUEPRINT_FORMAT.md` for full schema) - - `id` must be a non-empty string - - `prompts` must be a non-empty array - - Each prompt must have the required fields - - **Size/cost checks (staging limits from Layer 2):** - - Warn if prompt count > 10 (would be auto-trimmed in staging eval) - - Warn if model list is large (only CORE models supported in staging) - - Note: these are warnings, not errors — production evals have no limits - - **Security check:** - - If the file is under `blueprints/users/{username}/`, the username in the path must match the PR author - - Get the PR author from `git log --format='%ae' -1` or from GitHub context if available - -3. For any blueprint that passes validation, summarize: - - Number of prompts - - Models targeted - - Estimated staging evaluation scope (after auto-trim if applicable) - -4. Report: - - ✅ or ❌ per file with specific error details - - Warnings for staging-limit exceedances - - A final pass/fail verdict - -## Args - -Optional: path to a specific blueprint file to validate instead of scanning changed files. - -Example: `/blueprint-validate` or `/blueprint-validate examples/blueprints/my-blueprint.yaml` diff --git a/.claude/skills/lint-pr.md b/.claude/skills/lint-pr.md deleted file mode 100644 index 8d1c0da..0000000 --- a/.claude/skills/lint-pr.md +++ /dev/null @@ -1,34 +0,0 @@ -# lint-pr - -Run Next.js ESLint on files changed in the current branch and report new lint errors. - -## Steps - -1. Find changed JS/TS/TSX files: - ``` - git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx' - ``` - -2. Run lint on the changed files only (faster than full lint): - ``` - pnpm exec next lint --file --file ... - ``` - If there are more than 20 changed files, run the full lint instead: - ``` - pnpm lint - ``` - -3. Report: - - Pass/fail overall - - Each error and warning with file path, line number, rule name, and message - - Distinguish errors (must fix) from warnings (should fix) - - If zero issues: confirm clean - -4. Do NOT auto-fix lint errors unless the user explicitly passes `--fix` as an arg. - If `--fix` is passed, run `pnpm lint:fix` on the changed files and show a diff of what changed. - -## Args - -Optional: `--fix` to automatically apply safe lint fixes. - -Example: `/lint-pr` or `/lint-pr --fix` diff --git a/.claude/skills/test-pr.md b/.claude/skills/test-pr.md deleted file mode 100644 index 1be1953..0000000 --- a/.claude/skills/test-pr.md +++ /dev/null @@ -1,37 +0,0 @@ -# test-pr - -Run Jest tests for files changed in the current branch compared to main. - -## Steps - -1. Find changed files: - ``` - git diff --name-only origin/main...HEAD - ``` - -2. Identify which changed files have corresponding tests or ARE test files. Test files live in: - - Same directory as source with `.test.ts` / `.test.tsx` suffix - - `src/point-functions/` tests are colocated - - Jest configs: `jest.config.js` (web) and `jest.config.cli.js` (CLI) - -3. Decide which test suite(s) to run: - - If changes touch `src/app/`, `src/components/`, `src/hooks/` → web suite: `pnpm test:web` - - If changes touch `src/cli/`, `src/lib/` → CLI suite: `pnpm test:cli` - - If changes touch `src/point-functions/` → web suite (tests are colocated) - - Run both suites if changes span both areas - -4. Run only the tests for changed/affected files where possible: - - For targeted runs: `pnpm test:web -- --testPathPattern=""` - - If no specific pattern can be identified, run the full relevant suite - -5. Report: - - Pass/fail status - - Number of test suites and tests run - - Any failing tests with their error messages - - Which files were changed that had no test coverage (note as gaps, not failures) - -## Args - -Optional: a PR number or branch name to compare against. Defaults to `origin/main`. - -Example: `/test-pr 42` or `/test-pr feature/my-branch` diff --git a/.claude/skills/typecheck-pr.md b/.claude/skills/typecheck-pr.md deleted file mode 100644 index 404bbe8..0000000 --- a/.claude/skills/typecheck-pr.md +++ /dev/null @@ -1,30 +0,0 @@ -# typecheck-pr - -Run TypeScript type checking on the project and surface errors introduced by the current branch. - -## Steps - -1. Get the list of changed `.ts` / `.tsx` files: - ``` - git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx' - ``` - -2. Run the full type check (tsc checks the whole project, not individual files): - ``` - pnpm typecheck - ``` - This runs `tsc --noEmit` using the root `tsconfig.json`. - -3. If the full typecheck fails, compare against main to isolate which errors are NEW: - - Note the total error count from the current branch - - Optionally stash changes, run `pnpm typecheck` on main, then restore — but only do this if there are errors and the user needs to know which are pre-existing vs. introduced - -4. Report: - - Overall pass/fail - - List of NEW type errors introduced by this branch (file, line, error message) - - Count of pre-existing errors if any (so the author knows what's theirs to fix) - - Flag any errors in files the PR didn't touch (may indicate type cascades from interface changes) - -## Args - -Optional: base branch to compare against. Defaults to `origin/main`. From d077034c3176ac70e493329e897a4babc97d0bd9 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 20:10:36 +0000 Subject: [PATCH 07/11] Add Risks/rollback section to pr-describe template MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Borrowed from common PR-description skills: an explicit blast-radius + how-to-undo section so reviewers see what to scrutinize. Kept honest — one line for low-risk, self-contained changes rather than padding. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/pr-describe.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md index 724e32c..03d2d1d 100644 --- a/.claude/skills/pr-describe.md +++ b/.claude/skills/pr-describe.md @@ -53,6 +53,12 @@ Draft the description with these sections: - **Changes** — bulleted, grouped by area (UI, API, CLI, tests, docs…). - **Test plan** — what you ran (`pnpm typecheck`, `pnpm lint`, `pnpm test:web`, `pnpm test:e2e`) and the result, plus manual steps if any. +- **Risks / rollback** — the blast radius and how to undo. Call out anything + reviewers should scrutinize (data migrations, auth/permissions, external API + or cost impact, breaking changes, shared components touched). State how to + revert (usually "revert this PR" — but note it if a migration or deploy step + makes rollback non-trivial). If the change is low-risk and self-contained, say + so in one line rather than padding. - **Screenshots** — filled in by Phase 4 if applicable, else omitted. ## Phase 2 — Does the screenshot step apply? From e9faf94afd2a0d39627a0d2687d32339eac934f4 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 20:15:40 +0000 Subject: [PATCH 08/11] Add Related Issues section and a worked example to pr-describe Related Issues supports Closes/Refs linking (auto-closes issues on merge). The worked example anchors the model to consistent output and shows all sections filled in for a small UI change. Test plan + Risks/rollback retained. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/pr-describe.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe.md index 03d2d1d..6562bd8 100644 --- a/.claude/skills/pr-describe.md +++ b/.claude/skills/pr-describe.md @@ -60,6 +60,39 @@ Draft the description with these sections: makes rollback non-trivial). If the change is low-risk and self-contained, say so in one line rather than padding. - **Screenshots** — filled in by Phase 4 if applicable, else omitted. +- **Related Issues** — link tickets and related work. Use `Closes #123` for + issues this PR resolves (auto-closes them on merge), `Refs #456` for related + PRs/issues, plus any relevant docs. Omit the section entirely if there's + nothing to link — don't invent issue numbers. + +### Worked example + +```md +## Summary +Tightens the site header on mobile so the nav no longer wraps under 380px. + +## Motivation / context +The logo + nav links overflowed on small screens, pushing the theme toggle +off-canvas. Reported in #41. + +## Changes +- **UI:** right-align nav links and shrink logo on `sm` breakpoint (`Header.tsx`) +- **UI:** swap hover underline for opacity to avoid layout shift +- **Tests:** add an e2e assertion that the header is visible at 360px width + +## Test plan +- `pnpm typecheck` ✅ · `pnpm lint` ✅ · `pnpm test:e2e` ✅ (3 passed) +- Manually checked /about at 320 / 375 / 768px. + +## Risks / rollback +Low-risk, CSS-only and self-contained. Revert this PR to undo. + +## Screenshots +(before/after table inserted by Phase 4) + +## Related Issues +Closes #41 +``` ## Phase 2 — Does the screenshot step apply? From cb6b9c730b892eaec9802c7061972b47e920c037 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 20:17:01 +0000 Subject: [PATCH 09/11] Convert skills to discoverable /SKILL.md layout Claude Code only auto-discovers skills at .claude/skills//SKILL.md with YAML frontmatter (name + description). The flat .md files were not discoverable. Move pr-check, e2e-pr, and pr-describe into that structure and add trigger- oriented descriptions so the harness surfaces them. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/{e2e-pr.md => e2e-pr/SKILL.md} | 5 +++++ .claude/skills/{pr-check.md => pr-check/SKILL.md} | 5 +++++ .claude/skills/{pr-describe.md => pr-describe/SKILL.md} | 5 +++++ 3 files changed, 15 insertions(+) rename .claude/skills/{e2e-pr.md => e2e-pr/SKILL.md} (91%) rename .claude/skills/{pr-check.md => pr-check/SKILL.md} (89%) rename .claude/skills/{pr-describe.md => pr-describe/SKILL.md} (97%) diff --git a/.claude/skills/e2e-pr.md b/.claude/skills/e2e-pr/SKILL.md similarity index 91% rename from .claude/skills/e2e-pr.md rename to .claude/skills/e2e-pr/SKILL.md index 8ae923c..d094147 100644 --- a/.claude/skills/e2e-pr.md +++ b/.claude/skills/e2e-pr/SKILL.md @@ -1,3 +1,8 @@ +--- +name: e2e-pr +description: Run the Playwright end-to-end suite against the routes a PR changes and report results. Use when verifying UI behavior for a branch or PR, or checking that a change didn't break key pages. +--- + # e2e-pr Run the Playwright end-to-end suite against the routes a PR touches, and report results. diff --git a/.claude/skills/pr-check.md b/.claude/skills/pr-check/SKILL.md similarity index 89% rename from .claude/skills/pr-check.md rename to .claude/skills/pr-check/SKILL.md index c66b2d4..a3b56ac 100644 --- a/.claude/skills/pr-check.md +++ b/.claude/skills/pr-check/SKILL.md @@ -1,3 +1,8 @@ +--- +name: pr-check +description: Run the full quality gate on the current branch — TypeScript typecheck, ESLint, web/CLI Jest tests, and blueprint validation — then summarize pass/fail. Use before opening a PR, or to check whether a branch is CI-ready. +--- + # pr-check Run a full quality gate on the current branch before or after a PR is created. Orchestrates type checking, linting, and tests, then produces a structured summary. diff --git a/.claude/skills/pr-describe.md b/.claude/skills/pr-describe/SKILL.md similarity index 97% rename from .claude/skills/pr-describe.md rename to .claude/skills/pr-describe/SKILL.md index 6562bd8..0d535cb 100644 --- a/.claude/skills/pr-describe.md +++ b/.claude/skills/pr-describe/SKILL.md @@ -1,3 +1,8 @@ +--- +name: pr-describe +description: Generate a structured PR description from the branch diff (Summary, Changes, Test plan, Risks, Related Issues), with optional static-gated before/after screenshots for visual changes. Use when writing or updating a pull request description. +--- + # pr-describe Generate a high-quality PR description from the branch diff. The description From 4dad721a5bfc5daaaa95fd4c91accfa6963a46c2 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 20:52:44 +0000 Subject: [PATCH 10/11] Fix e2e workflow pnpm version conflict pnpm/action-setup@v4 errors when both `version` and package.json's `packageManager` specify a pnpm version. Drop the explicit version: 9 and let the action read pnpm@9.6.0 from packageManager. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .github/workflows/e2e.yml | 2 -- 1 file changed, 2 deletions(-) diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index 75a799b..ab4669c 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -23,8 +23,6 @@ jobs: - uses: actions/checkout@v4 - uses: pnpm/action-setup@v4 - with: - version: 9 - uses: actions/setup-node@v4 with: From c14cc658e75d1938d741dc1c72d399b74ce61f66 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 30 Jun 2026 20:57:10 +0000 Subject: [PATCH 11/11] Add PR title convention to pr-describe Produce Conventional Commits titles (type(scope): summary), inferring type/scope from the diff. The skill now sets/updates the PR title as well as the body. Co-Authored-By: Claude Sonnet 4.6 Claude-Session: https://claude.ai/code/session_011rZDvWQFXH3m42kYMzEY2g --- .claude/skills/pr-describe/SKILL.md | 30 +++++++++++++++++++++++++---- 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/.claude/skills/pr-describe/SKILL.md b/.claude/skills/pr-describe/SKILL.md index 0d535cb..eafafa4 100644 --- a/.claude/skills/pr-describe/SKILL.md +++ b/.claude/skills/pr-describe/SKILL.md @@ -13,10 +13,32 @@ time-boxed, **fail soft** (any problem → clean text-only description), and are ## Output contract -- If a PR already exists for the current branch → update its body - (GitHub MCP `update_pull_request`). -- If no PR exists → print the finished markdown for the user to use. Do **NOT** - open a PR unless the user explicitly asked for one. +- If a PR already exists for the current branch → update its **title and body** + (GitHub MCP `update_pull_request`). Only change the title if it doesn't already + follow the convention below. +- If no PR exists → print the finished **title + markdown body** for the user. + Do **NOT** open a PR unless the user explicitly asked for one. + +## PR title convention + +Produce a properly tagged title in **Conventional Commits** form: + +``` +type(scope): imperative summary +``` + +- **type** — infer from the dominant change: `feat` (new capability), `fix` + (bug fix), `docs`, `test` (tests/test infra), `ci` (CI/workflows), `refactor`, + `perf`, `chore`, `build`, `style`. +- **scope** — optional, a short area derived from the diff (e.g. `e2e`, `auth`, + `header`, `cli`, `pr-eval`). Omit if it spans many areas. +- **summary** — imperative mood, lower-case start, **no trailing period**, aim + for ≤ ~70 chars total. +- If the change mixes types, pick the one that best describes the user-facing + intent (a feature with its tests is still `feat`). + +Examples: `feat(header): collapse nav into a menu under 380px` · +`test(e2e): add Playwright smoke suite` · `fix(cli): handle empty blueprint id`. ## Security guardrails (NON-NEGOTIABLE — this repo is public)