feat(harness): default harness wiring + env-snapshot reporting by pradeepvrd · Pull Request #33 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-20T21:05:47Z

The per-task run used to be a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results). This PR decomposes it into `devops_bench/harness/`: a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.
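As a rough illustration of the wiring described above, here is a minimal sketch and not the PR's actual code. The class and function names (`DefaultHarness`, `InMemoryReporter`, `resolve_capabilities`), the `BENCH_SKILLS` variable, and the truthy-string parsing are all assumptions made for the example; only `BENCH_USE_MCP` and `capabilities_granted` come from the description. Provisioning and teardown are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Mapping, Protocol


@dataclass(frozen=True)
class Capabilities:
    use_mcp: bool
    skills: tuple[str, ...]


class Reporter(Protocol):
    def write(self, record: dict) -> None: ...


@dataclass
class InMemoryReporter:
    """Hypothetical reporter that keeps every written record in a list."""
    records: list = field(default_factory=list)

    def write(self, record: dict) -> None:
        # Shallow copy so keys added to the record later do not alter history.
        self.records.append(dict(record))


def resolve_capabilities(env: Mapping[str, str]) -> Capabilities:
    # BENCH_USE_MCP is named in the PR; BENCH_SKILLS and the accepted
    # truthy strings are assumptions for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").lower() in {"1", "true", "yes"}
    skills = tuple(s for s in env.get("BENCH_SKILLS", "").split(",") if s)
    return Capabilities(use_mcp, skills)


class DefaultHarness:
    def __init__(
        self,
        env: Mapping[str, str],
        run_agent: Callable[[str, Capabilities], str],
        score: Callable[[str], float],
        reporter: Reporter,
    ) -> None:
        # Resolved once at construction; later env mutations are not seen.
        self.capabilities = resolve_capabilities(env)
        self.env_snapshot = dict(env)
        self._run_agent = run_agent
        self._score = score
        self._reporter = reporter

    def run_task(self, task_id: str) -> dict:
        output = self._run_agent(task_id, self.capabilities)
        record = {
            "task_id": task_id,
            "capabilities_granted": {
                "use_mcp": self.capabilities.use_mcp,
                "skills": list(self.capabilities.skills),
            },
            "output": output,
        }
        self._reporter.write(record)  # raw execution output, before scoring
        record["score"] = self._score(output)
        self._reporter.write(record)  # same record again, now with the score
        return record
```

Because the harness holds its own `Capabilities` value and a copy of the environment, changing `BENCH_USE_MCP` between tasks in a batch does not change what later records report.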

**Behavior changes**

- Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (`use_mcp` + skills) so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
- `chaos_spec` / `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before/after the agent run into `<run_dir>/generated_files/`.
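The last two items can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the harness's implementation: the helper names (`normalize_spec`, `snapshot`, `capture_generated_files`) are hypothetical, the spec is assumed to arrive already parsed by a YAML loader (a mapping for native YAML, a string for JSON-in-YAML), and "generated" is assumed to mean new or modified files, detected by content hash.

```python
import hashlib
import json
import shutil
from pathlib import Path


def normalize_spec(value):
    """Return a spec as structured data.

    Native YAML arrives from the loader as a dict or list and is returned
    as-is; a JSON-in-YAML value arrives as a string and is decoded.
    """
    if isinstance(value, str):
        return json.loads(value)
    return value


def snapshot(workspace: Path) -> dict[Path, str]:
    """Map each file's path (relative to the workspace) to its SHA-256 digest."""
    return {
        p.relative_to(workspace): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in workspace.rglob("*")
        if p.is_file()
    }


def capture_generated_files(
    workspace: Path, before: dict[Path, str], run_dir: Path
) -> list[Path]:
    """Copy files that are new or changed since `before` into run_dir/generated_files/."""
    after = snapshot(workspace)
    changed = sorted(rel for rel, digest in after.items() if before.get(rel) != digest)
    dest_root = run_dir / "generated_files"
    for rel in changed:
        dest = dest_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(workspace / rel, dest)
    return changed
```

A caller would take `before = snapshot(workspace)` ahead of the agent run and call `capture_generated_files(workspace, before, run_dir)` afterwards. The sketch assumes `run_dir` lives outside the workspace so that copied files are not picked up by a later snapshot.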

**Bugs fixed**

- The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
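One way to picture this fix is a single resolver that both components call at init. The sketch below is an assumption-laden illustration: the env var names (`BENCH_DEPLOYMENT`, `BENCH_NAMESPACE`), the default values, and the class shapes are hypothetical, since the description does not name them.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical defaults; the PR does not state the real values.
DEFAULT_DEPLOYMENT = "demo-app"
DEFAULT_NAMESPACE = "default"


@dataclass(frozen=True)
class Target:
    deployment: str
    namespace: str


def resolve_target(env: Mapping[str, str]) -> Target:
    """Single source of truth for the workload both components act on."""
    # BENCH_DEPLOYMENT / BENCH_NAMESPACE are assumed names for this sketch.
    return Target(
        deployment=env.get("BENCH_DEPLOYMENT") or DEFAULT_DEPLOYMENT,
        namespace=env.get("BENCH_NAMESPACE") or DEFAULT_NAMESPACE,
    )


class OperatorAgent:
    def __init__(self, env: Mapping[str, str]) -> None:
        self.target = resolve_target(env)  # resolved at init


class ChaosInjector:
    def __init__(self, env: Mapping[str, str]) -> None:
        self.target = resolve_target(env)  # same resolver, same defaults
```

With the defaults defined in one place, the two components cannot drift apart when the variables are unset, which is the divergence the fix describes.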

pradeepvrd mentioned this pull request Jun 20, 2026

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled) #23

Closed

pradeepvrd commented Jun 20, 2026

Comment thread devops_bench/harness/__init__.py Outdated

Comment thread devops_bench/harness/scenario.py Outdated

Comment thread devops_bench/harness/scenario.py Outdated

Comment thread devops_bench/harness/scenario.py Outdated

pradeepvrd force-pushed the submit/7-metrics branch from fd6fd3c to 0fc8f78 Compare June 21, 2026 01:30

pradeepvrd force-pushed the submit/8-harness branch from 46a0cbd to d219d2c Compare June 21, 2026 01:30

pradeepvrd force-pushed the submit/7-metrics branch from 0fc8f78 to 396ce1f Compare June 22, 2026 01:53

pradeepvrd force-pushed the submit/8-harness branch from d219d2c to 0c74d15 Compare June 22, 2026 01:53

pradeepvrd force-pushed the submit/7-metrics branch from 396ce1f to 148323c Compare June 23, 2026 05:04

pradeepvrd force-pushed the submit/8-harness branch from 0c74d15 to 9865495 Compare June 23, 2026 05:04

pradeepvrd force-pushed the submit/7-metrics branch from 148323c to 6cbbf71 Compare June 23, 2026 06:09

pradeepvrd force-pushed the submit/8-harness branch from 9865495 to 54182fe Compare June 23, 2026 06:09

pradeepvrd changed the title ~~feat(harness): default harness wiring + env-snapshot reporting [+#19]~~ feat(harness): default harness wiring + env-snapshot reporting Jun 23, 2026

pradeepvrd force-pushed the submit/7-metrics branch from 6cbbf71 to cb3145e Compare June 23, 2026 06:37

pradeepvrd force-pushed the submit/8-harness branch from 54182fe to a92f734 Compare June 23, 2026 06:37

pradeepvrd force-pushed the submit/7-metrics branch from cb3145e to 617f5e4 Compare June 23, 2026 07:33

pradeepvrd force-pushed the submit/8-harness branch from a92f734 to ecf7f3f Compare June 23, 2026 07:40

pradeepvrd force-pushed the submit/7-metrics branch from 617f5e4 to e349d59 Compare June 23, 2026 08:22

pradeepvrd force-pushed the submit/8-harness branch from ecf7f3f to 75281ee Compare June 23, 2026 08:22

pradeepvrd force-pushed the submit/7-metrics branch from e349d59 to 5eb3685 Compare June 23, 2026 18:18

pradeepvrd force-pushed the submit/8-harness branch from 75281ee to 3763c42 Compare June 23, 2026 18:24

pradeepvrd force-pushed the submit/7-metrics branch from 5eb3685 to 96891b5 Compare June 23, 2026 18:35

pradeepvrd force-pushed the submit/8-harness branch from 3763c42 to 969eda8 Compare June 23, 2026 18:35

pradeepvrd wants to merge 1 commit into submit/7-metrics from submit/8-harness
