Seed copilot workflow benchmarks #29
|
Agreed. The conceptual measurement target should come before the benchmark infrastructure. The benchmark should not just ask whether a scripted workflow completes. It should ask whether Gently can turn a scientist's intent into a safe, inspectable, scientifically useful experimental trace. I would define the benchmark around dimensions like:
So I would treat this PR as premature infrastructure unless it is reframed around a benchmark spec first. A better next step may be a short conceptual benchmark design document, then derive the task suite/scoring code from that. |
|
Follow-up implemented from this thread in commit
I reframed the benchmark seed around a conceptual measurement contract before the runner/scoring mechanics. The docs now state that Gently should be measured on whether it turns scientific intent into a safe, inspectable, useful experimental trace, with dimensions for completion, scientific validity, safety, trace quality, efficiency, robustness, operator experience, and generalization. I also moved the new PR surface away from "copilot" terminology toward "agent workflow" benchmarks, while keeping a compatibility alias for older imports.
Verification:
|
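For readers following the thread, here is a minimal sketch of how the eight dimensions named in the follow-up above could be recorded per benchmark run. The dimension names come from that comment. Everything else is an assumption made for illustration and is not taken from the PR or from Gently's code: the `TraceScore` class, the `summarize` helper, and the 0.0 to 1.0 scale are all hypothetical.

```python
from dataclasses import dataclass, fields


# Hypothetical record: field names mirror the dimensions listed in the thread;
# the class name and the [0, 1] scale are illustrative assumptions.
@dataclass(frozen=True)
class TraceScore:
    completion: float
    scientific_validity: float
    safety: float
    trace_quality: float
    efficiency: float
    robustness: float
    operator_experience: float
    generalization: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


def summarize(score: TraceScore) -> dict:
    """Return the per-dimension scores plus the name of the weakest dimension."""
    per_dimension = {f.name: getattr(score, f.name) for f in fields(score)}
    weakest = min(per_dimension, key=per_dimension.get)
    return {"dimensions": per_dimension, "weakest": weakest}
```

Keeping the dimensions separate, rather than collapsing them into a single number, is one way to keep a completed but unsafe or scientifically weak run visible in the results. Whether the actual scoring code does this is not stated in the thread.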
|
Follow-up implemented from the benchmark-concept thread in commit
What changed:
Verification:
|
Addresses #4 and #8.
Summary:
Verification: