
ci: add 2027 eval workflow for per-PR CLI evaluation#1088

Open
caffeinum wants to merge 1 commit into sanity-io:main from team2027:ci/2027-eval

@caffeinum

@caffeinum caffeinum commented May 19, 2026

Description

Integrates 2027's agent-experience evaluations so that preview builds of the Sanity CLI are tested automatically on PRs labeled trigger: preview. It uses the existing pkg-pr-new action to pass the preview build URL to the 2027 runner (https://github.com/team2027/evals-action).

Uses the prompt https://2027.dev/evals/sanity.io/prompts/1d8004c6-d00c-432e-998b-e868a957807c and triggers an eval against the per-commit @sanity/cli build. evals-action posts a sticky comment with the eval status.

The intent is to catch CLI regressions that hurt agent workflows (Claude Code, Codex, Cursor, etc.) before they ship — measuring time, cost, error count, and score for an end-to-end "set up a Sanity project with the CLI" task.

Dogfooded extensively in the team2027/sanity-cli fork: the pipeline was confirmed to exercise the per-PR build (team2027#3 (comment): an intentionally broken CLI scored differently from a working CLI).

Uses the 2027 API; the full reference is at https://2027.dev/evals/api/openapi

What to review

  • .github/workflows/2027-eval.yml — single new file
  • Triggers off the existing Publish Preview Packages (pkg-pr-new) workflow's completion via SHA polling
  • Uses team2027/evals-action@v0.5.0 (pinned to a tag, action source is public)
  • Sticky PR comment + commit status come from the action; no custom commenting

Prerequisites for this to do anything:

  1. Repo secret EVALS_API_KEY set in sanity-io/cli → Settings → Secrets and variables → Actions. Generate at https://2027.dev/evals/sanity.io/settings.
  2. Without the secret, the job will fail loudly with api-key input is required — no silent skip.
  3. The prompt id (1d8004c6-d00c-432e-998b-e868a957807c) is owned by the sanity.io org domain on 2027 and is the "Getting Started: Staging CLI" prompt. Anyone with a sanity.io 2027 API key can run it. To use a different prompt, edit the PROMPT_ID env var.

Security considerations

  1. Prompt ID 1d8004c6-d00c-432e-998b-e868a957807c is public in this repo. However, the 2027 API is authenticated with an org-level API token (stored in the EVALS_API_KEY repo secret), so nothing is visible to the public except the eval results in the public GitHub comment.
  2. The only data sent to the 2027 API is the pkg-pr-new build SHA (which is public).
  3. By default, the action polls for 60 min, so it might run into GitHub Actions limits if 2027 runs take too long. Alternatively, the same testing could be configured via a GitHub App, which would remove the need to run a custom GitHub Action.

Testing

  • Validated end-to-end in fork: 4 successful eval runs across 2 PRs, with both happy-path and intentionally-broken CLI scenarios
  • Action itself (team2027/evals-action) has its own test suite with contract tests for the API wire format

@caffeinum caffeinum requested a review from a team as a code owner May 19, 2026 00:47
@runeb
Member

runeb commented May 19, 2026

@claude review this PR please

@squiggler-app
Contributor

squiggler-app Bot commented May 19, 2026

Claude finished @runeb's task in 1m 35s


Review: 2027 eval workflow

Nice, focused change — single new workflow, polling logic is well-commented, and the failure-loud-on-missing-secret behavior is the right call. A few observations, ordered by importance.


1. Missing concurrency group — likely wastes CI time on rapid pushes

.github/workflows/2027-eval.yml:15-18 has no concurrency block, while pkg-pr-new.yml:7-9 does (cancel-in-progress: true).

Scenario: someone pushes commit A, then commit B 30s later on a trigger: preview PR.

  • pkg-pr-new for A is cancelled (its concurrency group fires).
  • The eval workflow for SHA A is not cancelled — it keeps polling for up to 10 minutes (30 × 20s), sees only the cancelled pkg-pr-new run, and never matches success / hard-failure, so it times out at step 1.
  • Net cost: a wasted 10-minute runner per superseded push, plus a confusing failed check on the PR for the old SHA.

Recommend mirroring pkg-pr-new:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
  cancel-in-progress: true

2. 10-minute poll window may be tight in cold-cache scenarios

.github/workflows/2027-eval.yml:26 caps the wait at 30 iterations × 20s = 10 min. pkg-pr-new does pnpm install + pnpm build:cli; on a cold turbo cache or busy runners this can run 6-9 min. The eval job will then time out at step 1 even though pkg-pr-new is healthy and only a few seconds away.

Two small options: bump to 45-60 iterations, or extend the per-iteration sleep slightly. Cheap insurance — the polling loop itself costs almost nothing.
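The trade-off between iteration count and sleep interval can be sketched as a small bounded-retry helper. This is a minimal illustration, not the workflow's actual loop: `poll_until` and `poll_budget_seconds` are hypothetical names, and the real step presumably queries the pkg-pr-new run status instead of running an arbitrary command.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: these helpers are illustrative and do not come from the PR.

# poll_until MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
poll_until() {
  local max_attempts sleep_seconds attempt
  max_attempts=$1
  sleep_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final attempt so the timeout is not padded.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Nominal wait budget in seconds, computed the same way as in the review
# (iterations x sleep interval).
poll_budget_seconds() {
  echo $(( $1 * $2 ))
}
```

With these assumptions, `poll_budget_seconds 30 20` gives the current 600-second cap, and raising the iteration count to 60 doubles it to 1200 seconds without changing how quickly a successful run is detected.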

3. Third-party action pinned by tag, not SHA

.github/workflows/2027-eval.yml:60 uses team2027/evals-action@v0.5.0. Tags are mutable; for a workflow that handles a secret (EVALS_API_KEY), SHA-pinning is the standard hardening recommendation (it's also what GitHub's own docs call out). Other actions in this repo are mostly pinned by major-version tag too, so this matches house style — flagging only because the threat model is a bit higher here (API key exfil) than for, say, actions/checkout.

Suggested form, preserving readability:

uses: team2027/evals-action@<full-sha>  # v0.5.0
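A pin like this can be checked mechanically. The sketch below is an assumption-laden illustration rather than part of the PR: `is_sha_pinned` is a hypothetical helper that only tests whether the ref after `@` looks like a full 40-character hex commit SHA. It does not verify that the SHA exists or corresponds to the tag named in the trailing comment.

```shell
#!/usr/bin/env bash
# Hypothetical lint helper, not part of the workflow under review.

# is_sha_pinned REF
# Succeeds when the part after the last "@" is exactly 40 lowercase hex characters.
is_sha_pinned() {
  local ref
  ref=${1##*@}
  # A full SHA-1 commit id is 40 characters long.
  [ "${#ref}" -eq 40 ] || return 1
  case "$ref" in
    *[!0-9a-f]*) return 1 ;;  # contains a non-hex character
    *) return 0 ;;
  esac
}
```

Under this check a tag ref such as `team2027/evals-action@v0.5.0` is rejected, as is an abbreviated SHA, while a full-length SHA passes.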

4. Minor: issues: write permission likely unused

.github/workflows/2027-eval.yml:8 requests issues: write. PR comments on GitHub are technically issue comments via the REST API, so pull-requests: write already covers the sticky comment behavior the action does. Unless evals-action opens issues on failure, you can probably drop this and tighten the surface. Worth confirming against the action's source before removing.

5. Conclusion-matching logic — looks correct, nicely subtle

.github/workflows/2027-eval.yml:35-44 uses case ",$CONCLUSIONS," with patterns like *,success,*. Worth a callout because at first glance it looks like it could false-match (an unbracketed *failure* would also match startup_failure), but the comma-bracketing makes it a clean token match. The "cancelled keeps polling" comment is exactly the right thing to document inline. Good code.
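The difference between token matching and substring matching can be shown in isolation. This is a minimal sketch of the comma-bracketing technique, not the workflow's code: the function names are hypothetical, and the set of conclusions treated as hard failures in `classify_conclusions` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the technique; names are not from the PR.

# has_token CSV TOKEN
# Whole-token match: wrapping both sides in commas means TOKEN must be a
# complete list entry, not just part of one.
has_token() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# has_substring CSV TEXT
# Naive variant for contrast: matches TEXT anywhere in the string.
has_substring() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# classify_conclusions CSV
# Prints success, failure, or pending. Anything that is not a success or an
# assumed hard failure (for example cancelled) is treated as "keep polling".
classify_conclusions() {
  if has_token "$1" success; then
    echo success
  elif has_token "$1" failure || has_token "$1" startup_failure; then
    echo failure
  else
    echo pending
  fi
}
```

Here `has_substring "startup_failure" failure` succeeds while `has_token "startup_failure" failure` does not, which is the false match the bracketing avoids.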

6. Nit: --jq produces a CSV inside a JSON command — works, but a --json + native parse is more idiomatic

.github/workflows/2027-eval.yml:30-34 joins conclusions into a comma-separated string and then does shell case-matching. It works and is concise. An alternative is to let jq do the predicate (jq 'any(.[]; .status=="completed" and .conclusion=="success")') and exit on the boolean. Not worth changing if you like the current shape — purely stylistic.


Summary

Approve in spirit — the only thing I'd push for before merging is adding the concurrency block (#1). #2 and #3 are worth doing but not blockers. #4-#6 are polish.

Dogfooding evidence in the description (4 runs in the fork, intentionally-broken CLI scoring differently) is exactly what I'd want to see for a workflow that gates on an external paid service.

  • Read .github/workflows/2027-eval.yml
  • Compare against PR description / verify wiring (pkg-pr-new alignment)
  • Check for security issues, correctness, and best practices
  • Post review
    · ci/2027-eval

@caffeinum
Author

caffeinum commented May 19, 2026

Thanks for the thorough review! Pushed fc832e24 addressing 1, 2, 3, and 4:

  • 1 concurrency group — added matching pkg-pr-new's pattern

  • 3 SHA pin: team2027/evals-action@a6cff23b… with # v0.5.0 comment

  • 4 issues: write — dropped. Confirmed against evals-action/scripts/run.js — only calls github.rest.issues.{listComments,createComment,updateComment} which all run on PRs under pull-requests: write. No issue creation. Tighter surface.

Verified clean against actionlint locally.

🤖 Generated with Claude Code

@runeb
Member

runeb commented May 19, 2026

Thanks @caffeinum. Question: is a more accurate permission set

  permissions:
    actions: read
    issues: write
    statuses: write

?

@caffeinum
Author

caffeinum commented May 19, 2026

@runeb I don't think we need issues: write, since we're only posting comments on the PR; however, we do need pull-requests: write, as posting comments requires it:

permissions:
  pull-requests: write
  statuses: write

Edit (22 May): We also need checks: read to access https://pkg.pr.new builds

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
