feat(harness): default harness wiring + env-snapshot reporting #33

Draft: pradeepvrd wants to merge 1 commit into submit/7-metrics from submit/8-harness

pradeepvrd (Owner) commented Jun 20, 2026

Previously the per-task run was a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results). This PR decomposes it into `devops_bench/harness/`: a default harness that wires together the deployer, agent, verification, and metrics; resolves the agent's capabilities once; snapshots the run environment; and persists results through a pluggable reporter.
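
For reviewers who want a mental model of the decomposition, here is a minimal sketch of a harness that runs the stages in order and writes through a pluggable reporter. Every name in it (`Reporter`, `InMemoryReporter`, `SketchHarness`, `run_task`) is hypothetical and chosen for illustration; the actual classes in `devops_bench/harness/` may be shaped differently, and the position of the writes relative to teardown is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


class Reporter(Protocol):
    """Hypothetical reporter interface; the PR's real one may differ."""

    def write(self, record: dict[str, Any]) -> None: ...


@dataclass
class InMemoryReporter:
    """Toy reporter that keeps every written record, for illustration only."""

    records: list[dict[str, Any]] = field(default_factory=list)

    def write(self, record: dict[str, Any]) -> None:
        # Store a copy so later mutation of the record is not visible here.
        self.records.append(dict(record))


@dataclass
class SketchHarness:
    """Wires the per-task stages together; every name here is an assumption."""

    provision: Callable[[], None]
    run_agent: Callable[[], str]
    score: Callable[[str], float]
    teardown: Callable[[], None]
    reporter: Reporter

    def run_task(self, task_id: str) -> dict[str, Any]:
        record: dict[str, Any] = {"task_id": task_id}
        self.provision()
        try:
            record["output"] = self.run_agent()
            self.reporter.write(record)  # raw output persisted before scoring
            record["score"] = self.score(record["output"])
            self.reporter.write(record)  # scored record persisted afterwards
        finally:
            self.teardown()  # always release the provisioned environment
        return record
```

Because the reporter is injected, swapping the in-memory toy for a file or database writer would not touch the stage ordering.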

**Behavior changes**

- Agent capabilities are resolved once at harness construction and snapshotted, so an environment change mid-batch cannot desynchronize what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (`use_mcp` plus `skills`), so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output can be inspected independently of scoring.
- `chaos_spec` and `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before and after the agent run; the captured files are written to `<run_dir>/generated_files/`.
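
The capability snapshot described in the first two bullets could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `BENCH_USE_MCP` and the `capabilities_granted` keys come from the description, while `BENCH_SKILLS`, the accepted truthy strings, and the class and function names are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Capabilities:
    """Immutable snapshot of what the agent was granted (field names assumed)."""

    use_mcp: bool
    skills: tuple[str, ...]


def resolve_capabilities(env: Mapping[str, str]) -> Capabilities:
    # BENCH_USE_MCP is named in the description; BENCH_SKILLS and the set of
    # truthy values are hypothetical, chosen only for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").strip().lower() in {"1", "true", "yes"}
    raw_skills = env.get("BENCH_SKILLS", "")
    skills = tuple(s.strip() for s in raw_skills.split(",") if s.strip())
    return Capabilities(use_mcp=use_mcp, skills=skills)


class SnapshotHarness:
    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        # Resolved exactly once; later env changes do not affect this harness.
        self.capabilities = resolve_capabilities(env)

    def make_record(self, task_id: str) -> dict[str, Any]:
        # The record reports the snapshot, never a fresh read of the env.
        return {
            "task_id": task_id,
            "capabilities_granted": {
                "use_mcp": self.capabilities.use_mcp,
                "skills": list(self.capabilities.skills),
            },
        }
```

With this shape, flipping `BENCH_USE_MCP` after the harness is built leaves both the agent's capabilities and the reported `capabilities_granted` unchanged.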

**Bugs fixed**

- The operator agent and the chaos injector now resolve a shared default deployment and namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
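
One way to picture the fix is a single resolver that both components call at init, so there is only one place where defaults live. The sketch below assumes hypothetical env var names (`BENCH_DEPLOYMENT`, `BENCH_NAMESPACE`), placeholder fallback values, and simplified class shapes; none of these are taken from the PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Target:
    deployment: str
    namespace: str


def resolve_target(env: Mapping[str, str] = os.environ) -> Target:
    # Env var names and fallback values are hypothetical placeholders.
    # Unset or empty variables fall back to the same defaults for every caller.
    return Target(
        deployment=env.get("BENCH_DEPLOYMENT") or "demo-app",
        namespace=env.get("BENCH_NAMESPACE") or "default",
    )


class OperatorAgent:
    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        self.target = resolve_target(env)  # shared resolver, no local defaults


class ChaosInjector:
    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        self.target = resolve_target(env)  # same resolver as the agent
```

Since neither class carries its own fallback, the two cannot drift apart when the variables are unset.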

Comment thread devops_bench/harness/__init__.py Outdated
Comment thread devops_bench/harness/scenario.py Outdated
Comment thread devops_bench/harness/scenario.py Outdated
Comment thread devops_bench/harness/scenario.py Outdated
@pradeepvrd pradeepvrd changed the title feat(harness): default harness wiring + env-snapshot reporting [+#19] feat(harness): default harness wiring + env-snapshot reporting Jun 23, 2026