
evals: harness for capturing flow runs and qualitative reviews#21

Open
jyliang wants to merge 2 commits into main from evals-harness

@jyliang jyliang commented May 7, 2026

Summary

  • Adds make eval-{task-new,record,review,compare,list} targets for capturing flow runs as versioned, reviewable artifacts.
  • A run is pinned to kernel SHA + active-cell SHA, copies the thread's handoff docs, captures the diff against the base branch, and writes a JSON manifest.
  • The review template is markdown with rich qualitative sections (overall impression, per-stage notes, doc readability for both human + machine readers, code quality, and a "patterns flow should learn" section). Numeric scores are optional.
  • Reviews live in evals/runs/<task>/<run-id>/reviews/<reviewer>.md and are committed alongside the run, so /flow:reflect (and humans iterating on flow) can read them next time.
  • eval-compare prints version pins and metric deltas side by side and points at shared reviewers' files for prose-level diffing.
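
As a rough illustration of the run record described above, the manifest could look something like the sketch below. The field names, the `manifest.json` filename, and the `mini-todo` task id used in the usage note are assumptions for illustration; the PR only states that the manifest is JSON and pins the kernel and cell SHAs.

```python
import json
from pathlib import Path


def build_manifest(task_id, run_id, kernel_sha, cell_sha, base_branch):
    """Assemble a run manifest. All key names here are hypothetical."""
    return {
        "task": task_id,
        "run_id": run_id,
        "kernel_sha": kernel_sha,    # version pin for the kernel
        "cell_sha": cell_sha,        # version pin for the active cell
        "base_branch": base_branch,  # branch the diff was captured against
        "artifacts": {"thread": "thread/", "diff": "diff.patch", "reviews": "reviews/"},
    }


def write_manifest(evals_root: Path, manifest: dict) -> Path:
    """Write the manifest under runs/<task-id>/<run-id>/ and create reviews/."""
    run_dir = evals_root / "runs" / manifest["task"] / manifest["run_id"]
    (run_dir / "reviews").mkdir(parents=True, exist_ok=True)
    path = run_dir / "manifest.json"
    # Sorted keys keep the file stable, so committed manifests diff cleanly.
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

Calling `write_manifest(Path("evals"), build_manifest("mini-todo", run_id, kernel_sha, cell_sha, "main"))` would place the file next to the `reviews/` directory, where per-reviewer markdown files are expected to live.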

Layout

evals/
  tasks/<task-id>/             # task definitions
  runs/<task-id>/<run-id>/     # captured runs (manifest + thread + diff + reviews)
  templates/                   # review template + task README scaffold

run-id = <utc-iso>-<kernel-sha7> — sortable, scannable.
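
A minimal sketch of how such a run-id might be built. The exact timestamp format is an assumption: the compact ISO 8601 form below avoids colons so the id is safe as a directory name, and `make_run_id` is a hypothetical helper rather than the harness's actual function.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(kernel_sha: str, now: Optional[datetime] = None) -> str:
    """Return <utc-iso>-<kernel-sha7>, e.g. 20260507T064000Z-0123456."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC so ids from different machines sort consistently.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{kernel_sha[:7]}"
```

Because the fixed-width timestamp comes first, a plain lexicographic sort of the run directories is also a chronological sort.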

Test plan

  • make eval-task-new TASK=<id> scaffolds tasks//{task.json, prompt.md, README.md}
  • make eval-record TASK=<id> PROJECT=<path> discovers the most recent thread, copies it, generates the diff, writes a manifest pinned to current kernel + cell SHAs
  • make eval-review TASK=<id> RUN=<run-id> opens $EDITOR on a pre-filled review template with an artifact index appended
  • make eval-compare TASK=<id> A=<a> B=<b> prints version + metric deltas and lists reviews
  • make eval-list summarises tasks and runs
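
To make the eval-compare step concrete, here is a hedged sketch of the two computations it implies: numeric metric deltas between runs A and B, and the reviewers who reviewed both. The metric names and both helper functions are hypothetical; the PR does not specify how metrics are stored.

```python
from pathlib import PurePath


def metric_deltas(a: dict, b: dict) -> dict:
    """Return B minus A for every numeric metric present in both runs."""
    deltas = {}
    for key in sorted(a.keys() & b.keys()):
        if isinstance(a[key], (int, float)) and isinstance(b[key], (int, float)):
            deltas[key] = b[key] - a[key]
    return deltas


def shared_reviewers(reviews_a, reviews_b):
    """Given reviews/<reviewer>.md filenames per run, list reviewers in both."""
    names_a = {PurePath(name).stem for name in reviews_a}
    names_b = {PurePath(name).stem for name in reviews_b}
    return sorted(names_a & names_b)
```

Shared reviewers matter because their two review files can then be diffed directly, which is the prose-level comparison the summary mentions.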

🤖 Generated with Claude Code

jyliang and others added 2 commits May 6, 2026 23:40
Adds make eval-task-new / eval-record / eval-review / eval-compare /
eval-list to capture a flow run as a versioned record (kernel SHA + cell
SHA + thread + diff + manifest), open a long-form review template in
$EDITOR, and compare two runs side by side. Reviews are markdown so
they're diffable and committed alongside the run for the next iteration
of /flow:reflect to read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Greenfield code-pipeline target — single-file Python TODO CLI in a
fresh repo at ~/Workspace/jyliang/mini-todo (init commit 9f05045).
Prompt is intentionally tight so divergences across runs reflect the
cell, not prompt ambiguity. Watching for spec stage pinning the
implicit edge cases, plan stage avoiding unnecessary structure, and
implement stage staying under 150 LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>