evals: harness for capturing flow runs and qualitative reviews by jyliang · Pull Request #21 · jyliang/flow

jyliang · 2026-05-07T03:41:15Z

Summary

Adds make eval-{task-new,record,review,compare,list} targets for capturing flow runs as versioned, reviewable artifacts.
A run is pinned to kernel SHA + active-cell SHA, copies the thread's handoff docs, captures the diff against the base branch, and writes a JSON manifest.
The review template is markdown with rich qualitative sections (overall impression, per-stage notes, doc readability for both human + machine readers, code quality, and a "patterns flow should learn" section). Numeric scores are optional.
Reviews live in evals/runs/<task>/<run-id>/reviews/<reviewer>.md and are committed alongside the run, so /flow:reflect (and humans iterating on flow) can read them next time.
eval-compare prints version pins and metric deltas side by side and points at shared reviewers' files for prose-level diffing.
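As a rough illustration of the manifest described above, here is a minimal Python sketch. The field names (`task_id`, `run_id`, `kernel_sha`, `cell_sha`, `base_branch`, `metrics`) and the `manifest_to_json` helper are assumptions made for this example, not the harness's actual schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RunManifest:
    """Hypothetical manifest shape; every field name here is an assumption."""
    task_id: str
    run_id: str
    kernel_sha: str   # full SHA of the kernel the run is pinned to
    cell_sha: str     # full SHA of the active cell
    base_branch: str  # branch the captured diff was taken against
    metrics: dict = field(default_factory=dict)  # optional numeric scores


def manifest_to_json(manifest: RunManifest) -> str:
    # Sorted keys and fixed indentation keep the file stable under git diff.
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

Sorting the keys is a design choice for this sketch: since runs are committed, a stable key order keeps manifest diffs limited to the values that actually changed.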

Layout

evals/
  tasks/<task-id>/             # task definitions
  runs/<task-id>/<run-id>/     # captured runs (manifest + thread + diff + reviews)
  templates/                   # review template + task README scaffold

run-id = <utc-iso>-<kernel-sha7> — sortable, scannable.
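One way the run-id and review path could be built is sketched below. The compact timestamp format (`%Y%m%dT%H%M%SZ`) is an assumption, chosen here because it avoids colons in directory names. The function names `make_run_id` and `review_path` are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def make_run_id(kernel_sha: str, now: Optional[datetime] = None) -> str:
    """Build '<utc-iso>-<kernel-sha7>'. Pass a timezone-aware datetime, or omit for 'now'."""
    now = now or datetime.now(timezone.utc)
    # Assumed format: compact ISO 8601 in UTC, so ids sort chronologically as strings.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{kernel_sha[:7]}"


def review_path(task_id: str, run_id: str, reviewer: str) -> PurePosixPath:
    # Mirrors the layout above: evals/runs/<task>/<run-id>/reviews/<reviewer>.md
    return PurePosixPath("evals") / "runs" / task_id / run_id / "reviews" / f"{reviewer}.md"
```

Because the timestamp is fixed-width and leads the id, a plain lexicographic sort of the run directories lists runs oldest first.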

Test plan

make eval-task-new TASK=<id> scaffolds tasks//{task.json, prompt.md, README.md}
make eval-record TASK=<id> PROJECT=<path> discovers the most recent thread, copies it, generates the diff, writes a manifest pinned to current kernel + cell SHAs
make eval-review TASK=<id> RUN=<run-id> opens $EDITOR on a pre-filled review template with an artifact index appended
make eval-compare TASK=<id> A=<a> B=<b> prints version + metric deltas and lists reviews
make eval-list summarises tasks and runs
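The comparison step can be pictured roughly as follows. This is a sketch under assumptions: metrics are taken to be flat name-to-number dicts, and reviewers are identified by review file name. `metric_deltas` and `shared_reviewers` are hypothetical helpers, not the harness's code.

```python
def metric_deltas(a: dict, b: dict) -> dict:
    """Map each metric name to (value_a, value_b, delta); delta is None unless both are numeric."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        out[key] = (va, vb, vb - va if is_number(va) and is_number(vb) else None)
    return out


def shared_reviewers(reviews_a: list, reviews_b: list) -> list:
    """Given review file names for two runs, return reviewers present in both."""
    def names(files):
        return {f[:-3] for f in files if f.endswith(".md")}

    return sorted(names(reviews_a) & names(reviews_b))
```

Metrics present in only one run are kept with a `None` delta rather than dropped, since numeric scores are optional and a missing score is itself worth seeing in a side-by-side view.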

🤖 Generated with Claude Code

Adds make eval-task-new / eval-record / eval-review / eval-compare / eval-list to capture a flow run as a versioned record (kernel SHA + cell SHA + thread + diff + manifest), open a long-form review template in $EDITOR, and compare two runs side by side. Reviews are markdown so they're diffable and committed alongside the run for the next iteration of /flow:reflect to read.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Greenfield code-pipeline target: single-file Python TODO CLI in a fresh repo at ~/Workspace/jyliang/mini-todo (init commit 9f05045). Prompt is intentionally tight so divergences across runs reflect the cell, not prompt ambiguity. Watching for spec stage pinning the implicit edge cases, plan stage avoiding unnecessary structure, and implement stage staying under 150 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jyliang and others added 2 commits May 6, 2026 23:40

jyliang wants to merge 2 commits into main from evals-harness
