Seed copilot workflow benchmarks by ceej640 · Pull Request #29 · gently-project/gently

ceej640 · 2026-05-31T02:16:38Z

Addresses #4 and #8.

Summary:

- Add a deterministic copilot workflow task suite covering navigation, acquisition, analysis, multi-step workflows, and error recovery.
- Add trace-based scoring for completion, parameters, efficiency, and error handling.
- Add a scriptable MockQueueServerClient for benchmark scenarios.
- Extend the benchmark runner with a copilot command and document the workflow.
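To make the idea of a scriptable mock client and trace-based scoring concrete, here is a minimal sketch. It is not the PR's `MockQueueServerClient` or evaluator: the class `ScriptedMockClient`, the function `completion_score`, the tool names `move_stage` and `acquire_image`, and the result shape `{"ok": ...}` are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScriptedMockClient:
    """Hypothetical stand-in for a queue-server client (illustration only)."""

    # Maps a tool name to the responses it should return, in order.
    script: dict[str, deque] = field(default_factory=dict)
    # Every call is recorded here so a scorer can inspect it afterwards.
    trace: list[dict[str, Any]] = field(default_factory=list)

    def call(self, tool: str, **params: Any) -> dict[str, Any]:
        responses = self.script.get(tool)
        if responses:
            result = responses.popleft()
        else:
            # Unscripted calls fail loudly instead of silently succeeding.
            result = {"ok": False, "error": f"no scripted response for {tool}"}
        self.trace.append({"tool": tool, "params": params, "result": result})
        return result


def completion_score(trace, expected_tools):
    """Fraction of expected tools that appear in the trace with an ok result."""
    if not expected_tools:
        return 1.0
    succeeded = {step["tool"] for step in trace if step["result"].get("ok")}
    return sum(tool in succeeded for tool in expected_tools) / len(expected_tools)


# Script a scenario where the first stage move times out and the retry succeeds.
client = ScriptedMockClient(
    script={
        "move_stage": deque([{"ok": False, "error": "timeout"}, {"ok": True}]),
        "acquire_image": deque([{"ok": True}]),
    }
)
client.call("move_stage", x=10.0, y=5.0)      # fails as scripted
client.call("move_stage", x=10.0, y=5.0)      # retry succeeds
client.call("acquire_image", exposure_ms=50)
score = completion_score(client.trace, ["move_stage", "acquire_image"])
```

Because the responses are scripted in advance, the same scenario produces the same trace on every run, which is what makes scoring against the trace deterministic.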

Verification:

- .\.venv\Scripts\python.exe -m pytest tests/test_copilot_benchmarks.py -q -p no:cacheprovider
- .\.venv\Scripts\python.exe -m benchmarks.runner copilot --tags navigation
- git diff --check

pskeshu · 2026-06-01T02:28:20Z

We first need to work out, at the conceptual level, what a benchmark is trying to measure before we create the infrastructure.

…

On Sat, 30 May 2026 at 22:16, ceej640 wrote:

Addresses #4 (https://github.com/pskeshu/gently/issues/4) and #8 (https://github.com/pskeshu/gently/issues/8).

Summary:

- Add a deterministic copilot workflow task suite covering navigation, acquisition, analysis, multi-step workflows, and error recovery.
- Add trace-based scoring for completion, parameters, efficiency, and error handling.
- Add a scriptable MockQueueServerClient for benchmark scenarios.
- Extend the benchmark runner with a copilot command and document the workflow.

Verification:

- .\.venv\Scripts\python.exe -m pytest tests/test_copilot_benchmarks.py -q -p no:cacheprovider
- .\.venv\Scripts\python.exe -m benchmarks.runner copilot --tags navigation
- git diff --check

You can view, comment on, or merge this pull request online at: https://github.com/pskeshu/gently/pull/29

Commit Summary

- 92bdbbd Seed copilot workflow benchmarks

File Changes (7 files, https://github.com/pskeshu/gently/pull/29/files)

- M benchmarks/__init__.py (8)
- A benchmarks/evaluator.py (297)
- A benchmarks/mock_client.py (103)
- M benchmarks/runner.py (55)
- A benchmarks/tasks/copilot_workflows.json (91)
- A docs/copilot-benchmarks.md (59)
- A tests/test_copilot_benchmarks.py (106)

Patch Links:

- https://github.com/pskeshu/gently/pull/29.patch
- https://github.com/pskeshu/gently/pull/29.diff

-- Kesavan

ceej640 · 2026-06-01T04:27:02Z

Agreed. The conceptual measurement target should come before the benchmark infrastructure.

The benchmark should not just ask whether a scripted workflow completes. It should ask whether Gently can turn a scientist's intent into a safe, inspectable, scientifically useful experimental trace.

I would define the benchmark around dimensions like:

- task completion: did it reach the requested experimental state?
- scientific validity: are controls, constraints, sample assumptions, and decision points appropriate?
- hardware safety: did it avoid unsafe moves/exposure/device states?
- trace quality: can a human reconstruct what happened and why?
- efficiency: latency, tool-call count, unnecessary retries, queue/device wait time
- robustness: recovery from missing data, failed tools, stale state, and ambiguous intent
- operator experience: how many clarifications/edits were needed?
- generalization: does the same benchmark concept work across imaging, bench, genetics, and analysis tasks?
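Of these dimensions, efficiency is the one most directly computable from a recorded trace. The following is a rough sketch of how latency, tool-call count, and repeated calls might be derived; the `ToolCall` record, its fields, and `efficiency_metrics` are assumptions made for illustration and do not come from the PR's evaluator. Whether a repeated call was actually unnecessary is a judgment this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical trace record; field names are illustrative assumptions."""

    tool: str
    params: tuple      # hashable snapshot of the call parameters
    start_s: float     # seconds since the start of the run
    end_s: float
    ok: bool


def efficiency_metrics(trace):
    """Summarize call count, repeated identical calls, and time spent in calls."""
    seen = set()
    repeated = 0
    for call in trace:
        key = (call.tool, call.params)
        if key in seen:
            # Same tool with the same parameters was already issued once.
            repeated += 1
        seen.add(key)
    return {
        "tool_calls": len(trace),
        "repeated_calls": repeated,
        "latency_s": sum(call.end_s - call.start_s for call in trace),
    }


trace = [
    ToolCall("move_stage", (("x", 10.0),), 0.0, 0.5, False),
    ToolCall("move_stage", (("x", 10.0),), 0.5, 1.0, True),
    ToolCall("acquire_image", (("exposure_ms", 50),), 1.0, 3.0, True),
]
metrics = efficiency_metrics(trace)
# metrics == {"tool_calls": 3, "repeated_calls": 1, "latency_s": 3.0}
```

The other dimensions, such as scientific validity or operator experience, do not reduce to counters like this and would likely need human review.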

So I would treat this PR as premature infrastructure unless it is reframed around a benchmark spec first. A better next step may be a short conceptual benchmark design document, then derive the task suite/scoring code from that.

ceej640 · 2026-06-01T04:59:03Z

Follow-up implemented from this thread in commit eb3d3bb.

I reframed the benchmark seed around a conceptual measurement contract before the runner/scoring mechanics. The docs now state that Gently should be measured on whether it turns scientific intent into a safe, inspectable, useful experimental trace, with dimensions for completion, scientific validity, safety, trace quality, efficiency, robustness, operator experience, and generalization.

I also moved the new PR surface away from "copilot" terminology toward "agent workflow" benchmarks, while keeping a compatibility alias for older imports.

Verification:

- pytest tests/test_agent_workflow_benchmarks.py -q -p no:cacheprovider
- python -m benchmarks.runner workflow --tags navigation
- git diff --check

ceej640 · 2026-06-01T05:53:31Z

Follow-up implemented from the benchmark-concept thread in commit 3f282c9.

What changed:

- Added rubric fields directly to benchmark task definitions: safety_constraints, scientific_validity, trace_quality_checks, operator_experience_checks, and expected_evidence.
- The evaluator now includes a review_checklist and a manual_review_required flag in each result, so deterministic trace scores are kept separate from the human-review dimensions.
- The runner lists manual-review check counts, and scored reports include the number of tasks requiring manual review.
- Expanded the seed tasks and docs so the benchmark spec carries the conceptual measurement contract instead of only tool-call expectations.
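As an illustration of how rubric fields on a task could feed a review checklist, here is a small sketch. The five rubric field names and the `review_checklist` / `manual_review_required` keys are taken from the description above; everything else is assumed, including the task `id`, the `expected_tools` key, the use of lists of strings as field values, and the `build_review` helper. The actual schema in benchmarks/tasks/agent_workflows.json may differ.

```python
import json

# Hypothetical task definition; only the five rubric field names come from the PR text.
TASK_JSON = """
{
  "id": "example-navigation-task",
  "expected_tools": ["move_stage"],
  "safety_constraints": ["stage stays within configured travel limits"],
  "scientific_validity": [],
  "trace_quality_checks": ["each move records its target coordinates"],
  "operator_experience_checks": [],
  "expected_evidence": ["final stage position appears in the trace"]
}
"""

RUBRIC_FIELDS = (
    "safety_constraints",
    "scientific_validity",
    "trace_quality_checks",
    "operator_experience_checks",
    "expected_evidence",
)


def build_review(task):
    """Collect every rubric entry into a checklist for a human reviewer."""
    checklist = [
        {"dimension": name, "check": check, "status": "pending"}
        for name in RUBRIC_FIELDS
        for check in task.get(name, [])
    ]
    return {
        "task_id": task["id"],
        "review_checklist": checklist,
        # Any pending rubric entry means a human still has to sign off.
        "manual_review_required": bool(checklist),
    }


task = json.loads(TASK_JSON)
result = build_review(task)
```

Keeping the checklist in its own key, apart from any numeric trace scores, lets a report show automated results immediately while flagging which tasks still await review.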

Verification:

- pytest tests/test_agent_workflow_benchmarks.py -q -p no:cacheprovider
- python -m benchmarks.runner workflow --tags navigation
- JSON validation for benchmarks/tasks/agent_workflows.json
- git diff --check

Seed copilot workflow benchmarks

92bdbbd

Frame agent workflow benchmarks conceptually

eb3d3bb

Add benchmark review rubric fields

3f282c9

ceej640 wants to merge 3 commits into gently-project:development from ceej640:ceej/issue-4-8-benchmark-seed


2 participants
