Seed copilot workflow benchmarks #29

Open
ceej640 wants to merge 3 commits into gently-project:development from ceej640:ceej/issue-4-8-benchmark-seed

ceej640 commented May 31, 2026

Collaborator

Addresses #4 and #8.

Summary:

  • Add a deterministic copilot workflow task suite covering navigation, acquisition, analysis, multi-step workflows, and error recovery.
  • Add trace-based scoring for completion, parameters, efficiency, and error handling.
  • Add a scriptable MockQueueServerClient for benchmark scenarios.
  • Extend the benchmark runner with a copilot command and document the workflow.
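The trace-based scoring named in the summary can be illustrated with a minimal sketch. This is not the scorer added in this PR: the `ToolCall` shape, the `score_trace` function, the tool names, and all four formulas are assumptions chosen only to show how a recorded trace might be turned into completion, parameter, efficiency, and error-handling scores.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded step in an agent trace (hypothetical shape)."""
    name: str
    params: dict = field(default_factory=dict)
    ok: bool = True


def score_trace(trace, expected, max_calls):
    """Score a trace on four dimensions, each in [0, 1].

    `expected` maps tool name -> required parameter values.
    `max_calls` is the call budget for the task.
    All formulas here are illustrative assumptions.
    """
    ok_calls = [c for c in trace if c.ok]
    ok_names = {c.name for c in ok_calls}

    # Completion: fraction of expected tools that succeeded at least once.
    completion = (
        sum(name in ok_names for name in expected) / len(expected)
        if expected else 1.0
    )

    # Parameters: fraction of required (tool, key, value) triples that
    # appear in some successful call.
    wanted = [(n, k, v) for n, p in expected.items() for k, v in p.items()]
    matched = sum(
        any(c.name == n and c.params.get(k) == v for c in ok_calls)
        for n, k, v in wanted
    )
    parameters = matched / len(wanted) if wanted else 1.0

    # Efficiency: 1.0 within budget, shrinking as the trace exceeds it.
    efficiency = min(1.0, max_calls / len(trace)) if trace else 0.0

    # Error handling: a failed call counts as recovered if the same tool
    # succeeds later in the trace.
    failed = [i for i, c in enumerate(trace) if not c.ok]
    recovered = sum(
        any(later.ok and later.name == trace[i].name for later in trace[i + 1:])
        for i in failed
    )
    error_handling = recovered / len(failed) if failed else 1.0

    return {
        "completion": completion,
        "parameters": parameters,
        "efficiency": efficiency,
        "error_handling": error_handling,
    }


if __name__ == "__main__":
    # Hypothetical tool names; one failed acquisition followed by a retry.
    trace = [
        ToolCall("move_stage", {"x": 1.0}),
        ToolCall("acquire", {"exposure_ms": 50}, ok=False),
        ToolCall("acquire", {"exposure_ms": 50}),
    ]
    expected = {"move_stage": {"x": 1.0}, "acquire": {"exposure_ms": 50}}
    print(score_trace(trace, expected, max_calls=2))
```

Keeping the four scores separate, rather than folding them into one number, makes it visible when a run completes the task but only by overshooting its call budget.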

Verification:

  • .\.venv\Scripts\python.exe -m pytest tests/test_copilot_benchmarks.py -q -p no:cacheprovider
  • .\.venv\Scripts\python.exe -m benchmarks.runner copilot --tags navigation
  • git diff --check

pskeshu commented Jun 1, 2026 via email

Collaborator

ceej640 commented Jun 1, 2026

Collaborator Author

Agreed. The conceptual measurement target should come before the benchmark infrastructure.

The benchmark should not just ask whether a scripted workflow completes. It should ask whether Gently can turn a scientist's intent into a safe, inspectable, scientifically useful experimental trace.

I would define the benchmark around dimensions like:

  • task completion: did it reach the requested experimental state?
  • scientific validity: are controls, constraints, sample assumptions, and decision points appropriate?
  • hardware safety: did it avoid unsafe moves/exposure/device states?
  • trace quality: can a human reconstruct what happened and why?
  • efficiency: latency, tool-call count, unnecessary retries, queue/device wait time
  • robustness: recovery from missing data, failed tools, stale state, and ambiguous intent
  • operator experience: how many clarifications/edits were needed?
  • generalization: does the same benchmark concept work across imaging, bench, genetics, and analysis tasks?

So I would treat this PR as premature infrastructure unless it is reframed around a benchmark spec first. A better next step may be to write a short conceptual benchmark design document, then derive the task suite and scoring code from it.
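As a rough illustration of how the dimensions listed above could be written down before any runner exists, here is a minimal sketch. The `Dimension` enum and `summarize` helper are hypothetical names, not part of Gently. The sketch deliberately avoids combining dimensions into one number and only reports which ones have been assessed so far.

```python
from enum import Enum


class Dimension(Enum):
    """Benchmark dimensions from the list above (enum itself is hypothetical)."""
    TASK_COMPLETION = "task completion"
    SCIENTIFIC_VALIDITY = "scientific validity"
    HARDWARE_SAFETY = "hardware safety"
    TRACE_QUALITY = "trace quality"
    EFFICIENCY = "efficiency"
    ROBUSTNESS = "robustness"
    OPERATOR_EXPERIENCE = "operator experience"
    GENERALIZATION = "generalization"


def summarize(scores):
    """Split dimensions into assessed and unassessed.

    `scores` maps Dimension -> float score, or None when the dimension
    has not been assessed. No weighting or aggregation is applied,
    because the thread does not define one.
    """
    assessed = {
        d.value: scores[d] for d in Dimension if scores.get(d) is not None
    }
    unassessed = [d.value for d in Dimension if scores.get(d) is None]
    return {"assessed": assessed, "unassessed": unassessed}


if __name__ == "__main__":
    # Example values are made up for illustration.
    report = summarize({
        Dimension.TASK_COMPLETION: 1.0,
        Dimension.EFFICIENCY: 0.5,
    })
    print(report)
```

Listing unassessed dimensions explicitly keeps a partially scored run from looking like a complete evaluation.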

ceej640 commented Jun 1, 2026

Collaborator Author

Follow-up implemented from this thread in commit eb3d3bb.

I reframed the benchmark seed around a conceptual measurement contract before the runner/scoring mechanics. The docs now state that Gently should be measured on whether it turns scientific intent into a safe, inspectable, useful experimental trace, with dimensions for completion, scientific validity, safety, trace quality, efficiency, robustness, operator experience, and generalization.

I also moved the new PR surface away from "copilot" terminology toward "agent workflow" benchmarks, while keeping a compatibility alias for older imports.

Verification:

  • pytest tests/test_agent_workflow_benchmarks.py -q -p no:cacheprovider
  • python -m benchmarks.runner workflow --tags navigation
  • git diff --check

ceej640 commented Jun 1, 2026

Collaborator Author

Follow-up implemented from the benchmark-concept thread in commit 3f282c9.

What changed:

  • Added rubric fields directly to benchmark task definitions: safety_constraints, scientific_validity, trace_quality_checks, operator_experience_checks, and expected_evidence.
  • The evaluator now includes a review_checklist and a manual_review_required flag in each result, so deterministic trace scores are separated from the human-review dimensions.
  • The runner lists manual-review check counts, and scored reports include the number of tasks requiring manual review.
  • Expanded the seed tasks and docs so the benchmark spec carries the conceptual measurement contract instead of only tool-call expectations.
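To make the split between deterministic scores and human review concrete, here is a minimal sketch of how rubric fields on a task might be turned into a review checklist. The field names come from the list above. Everything else is an assumption: the `build_review` helper, the `id` key, the example task, and the idea that each rubric field holds a list of strings. The actual schema in benchmarks/tasks/agent_workflows.json may differ.

```python
# Rubric field names as listed in this comment.
RUBRIC_FIELDS = (
    "safety_constraints",
    "scientific_validity",
    "trace_quality_checks",
    "operator_experience_checks",
    "expected_evidence",
)


def build_review(task):
    """Collect a task's rubric entries into a checklist for human review.

    Assumes each rubric field, when present, is a list of strings.
    A task needs manual review when it carries at least one rubric entry.
    """
    checklist = [
        {"field": name, "check": check}
        for name in RUBRIC_FIELDS
        for check in task.get(name, [])
    ]
    return {
        "task_id": task["id"],  # "id" is a hypothetical key
        "review_checklist": checklist,
        "manual_review_required": bool(checklist),
    }


if __name__ == "__main__":
    # Hypothetical task definition for illustration only.
    task = {
        "id": "nav-001",
        "safety_constraints": ["stage stays within travel limits"],
        "expected_evidence": ["final stage position is logged"],
    }
    result = build_review(task)
    print(result["manual_review_required"], len(result["review_checklist"]))
```

Carrying the checklist in the result, next to the trace scores, lets a report count tasks awaiting review without re-reading the task file.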

Verification:

  • pytest tests/test_agent_workflow_benchmarks.py -q -p no:cacheprovider
  • python -m benchmarks.runner workflow --tags navigation
  • JSON validation for benchmarks/tasks/agent_workflows.json
  • git diff --check

2 participants