feat(harness): default harness wiring + env-snapshot reporting by pradeepvrd · Pull Request #120 · gke-labs/devops-bench

pradeepvrd · 2026-06-23T19:00:38Z

The per-task run used to be a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results). This PR decomposes it into `devops_bench/harness/`: a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.

**Behavior changes**

- Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (use_mcp + skills) so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
- `chaos_spec` / `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before/after the agent run into `<run_dir>/generated_files/`.
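To make the first two bullets concrete, here is a minimal sketch of resolving capabilities once and stamping them onto each record. It is not the PR's implementation: `Capabilities`, `resolve_capabilities`, `SketchHarness`, the `BENCH_SKILLS` variable, and the truthy-string parsing are all assumptions made for illustration. Only `BENCH_USE_MCP` and the `capabilities_granted` key come from the description above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Immutable snapshot of what the agent is allowed to use (hypothetical type)."""
    use_mcp: bool
    skills: tuple[str, ...]


def resolve_capabilities(env) -> Capabilities:
    # Assumption: BENCH_USE_MCP is a truthy string; BENCH_SKILLS is a
    # hypothetical comma-separated variable used only for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").strip().lower() in {"1", "true", "yes"}
    skills = tuple(s.strip() for s in env.get("BENCH_SKILLS", "").split(",") if s.strip())
    return Capabilities(use_mcp=use_mcp, skills=skills)


class SketchHarness:
    def __init__(self, env=None):
        # Resolve exactly once, at construction; later env changes are ignored.
        self._caps = resolve_capabilities(os.environ if env is None else env)

    def run(self, task_id: str) -> dict:
        # Every record reports the snapshot, not the current environment.
        return {
            "task_id": task_id,
            "capabilities_granted": {
                "use_mcp": self._caps.use_mcp,
                "skills": list(self._caps.skills),
            },
        }
```

Because the snapshot is a frozen dataclass held by the harness, a consumer reading `capabilities_granted` sees what the agent ran with even if the environment is mutated between tasks.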

**Bugs fixed**

- The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
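One way to picture the shared-default fix is a single resolver whose result is handed to both components. The sketch below assumes hypothetical env var names (`BENCH_DEPLOYMENT`, `BENCH_NAMESPACE`), hypothetical fallback values, and stand-in classes; the PR does not name any of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    deployment: str
    namespace: str


def resolve_target(env) -> Target:
    # Hypothetical variable names and fallbacks, for illustration only.
    return Target(
        deployment=env.get("BENCH_DEPLOYMENT") or "demo-app",
        namespace=env.get("BENCH_NAMESPACE") or "default",
    )


class OperatorAgent:  # stand-in, not the PR's class
    def __init__(self, target: Target):
        self.target = target


class ChaosInjector:  # stand-in, not the PR's class
    def __init__(self, target: Target):
        self.target = target


def build(env):
    # Resolve once at init and share the result, so both sides agree
    # even when neither variable is set.
    target = resolve_target(env)
    return OperatorAgent(target), ChaosInjector(target)
```

With each component applying its own fallback, two different defaults can drift apart; routing both through one resolver removes that possibility.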


…ic chaos report; rename package

Address review feedback on the harness PR (#120):

- Rename the package `devops_bench/harness` -> `devops_bench/evalharness` (and `tests/unit/harness` -> `tests/unit/evalharness`) so it is not confused with the agent harness; update all imports, logger channels, and migration-doc paths. Public classes (`Harness`, `DefaultHarness`, `ResultReporter`, `ScenarioManager`) are unchanged.
- `run()` now wraps `_score` + the post-score rewrite in try/except: a judge or config failure (e.g. `get_judge_model()`) no longer escapes `run()` and discards an otherwise successful execution pass, whose raw results are already on disk.
- Drop the scalar `score` field from every record (and from `_RECORD_KEYS`): it was seeded to 0 and never written, so every record shipped `score: 0`. The per-metric `scores` map is the source of truth; tests updated accordingly.
- `_drain_scenario` deep-copies the chaos report on the join-timeout branch before stamping status, since the daemon thread may still be mid-write and a shallow `dict()` could capture a torn nested structure.
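The `run()` change above can be sketched as "write raw results, then score inside a guard". This is an illustration under assumptions, not the PR's code: `JsonReporter`, the `results.json` filename, and the free-standing `run` signature are hypothetical stand-ins for `ResultReporter` and `DefaultHarness.run()`.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("evalharness.sketch")


class JsonReporter:
    """Hypothetical reporter that overwrites one JSON file per run."""

    def __init__(self, run_dir):
        self.path = Path(run_dir) / "results.json"

    def write(self, record: dict) -> None:
        self.path.write_text(json.dumps(record, indent=2))


def run(record: dict, reporter: JsonReporter, score_fn) -> dict:
    # First write: raw execution output is on disk before any judge runs.
    reporter.write(record)
    try:
        scored = {**record, "scores": score_fn(record)}
        # Second write: the same record, now carrying per-metric scores.
        reporter.write(scored)
        return scored
    except Exception:
        # A judge or config failure is logged, not re-raised, so the
        # already-persisted raw results survive.
        logger.exception("scoring failed; keeping raw results")
        return record
```

Catching a broad `Exception` here is a deliberate choice for the sketch: scoring is treated as best-effort, and the raw record is the artifact that must not be lost.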

* feat(metrics): metrics suite + bundled skills

The metrics suite used to be built inline in `pkg/evaluator/evaluate.py` (GEval construction, checklist parsing, scoring in `evaluate_metrics_batch`); this extracts it into `devops_bench/metrics/` as registry-driven evaluators (`METRICS`) over a typed `MetricContext`/`MetricScore`, with judges routed through the `models` layer and judge skills shipped as package data under `devops_bench/skills/`.

**Behavior changes**

- Each metric declares its own `applies()` and yields typed scores via the `METRICS` registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
- Judges are instantiated through the `models` layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
- Judge skill markdown moves from a hard-coded `skills/` filesystem path to `devops_bench/skills/` package data, so it resolves under `pip install` / wheels.
- The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.

**Bugs fixed**

- Checklist parsing uses `lstrip("- ")` instead of `strip("- ")`, so trailing hyphens in requirement text are no longer truncated.
- Document-retrieval scoring guards empty `doc_name`/`url` fields, which previously matched all text and inflated the rate.

* feat(harness): default harness wiring + env-snapshot reporting (#120)

* feat(harness): default harness wiring + env-snapshot reporting

The per-task run used to be a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results); this decomposes it into `devops_bench/harness/`: a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.

**Behavior changes**

- Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (use_mcp + skills) so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
- `chaos_spec` / `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before/after the agent run into `<run_dir>/generated_files/`.

**Bugs fixed**

- The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
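The checklist-parsing fix in the metrics commit message above comes down to how `str.strip` and `str.lstrip` treat their argument: as a set of characters to remove, from both ends or from the left only. The sketch below shows the difference; `parse_checklist` is a hypothetical helper, not the function in `devops_bench/metrics/`.

```python
def parse_checklist(text: str) -> list[str]:
    """Extract '- item' lines, removing only the leading bullet characters."""
    items = []
    for line in text.splitlines():
        line = line.strip()  # whitespace only
        if not line.startswith("-"):
            continue
        # lstrip("- ") removes any leading run of '-' and ' ' characters
        # and leaves the right-hand side of the requirement untouched.
        item = line.lstrip("- ")
        if item:
            items.append(item)
    return items


requirement = "- Roll back with kubectl rollout undo --"
old_behavior = requirement.strip("- ")   # also eats the trailing hyphens
new_behavior = requirement.lstrip("- ")  # keeps them
```

Note that `lstrip("- ")` still removes every leading hyphen, so an item whose text itself begins with `--` would lose that prefix; the fix only protects the trailing end.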

pradeepvrd force-pushed the submit/7-metrics branch from 96891b5 to a321243 Compare June 23, 2026 20:09

pradeepvrd force-pushed the submit/8-harness branch from 969eda8 to a1f1afa Compare June 23, 2026 20:10

pradeepvrd force-pushed the submit/7-metrics branch from a321243 to da8e7c5 Compare June 25, 2026 04:19

pradeepvrd force-pushed the submit/8-harness branch 2 times, most recently from 173c46e to 73b19d0 Compare June 25, 2026 05:03

pradeepvrd marked this pull request as ready for review June 25, 2026 19:52

pradeepvrd requested review from itssimrank and jessie1111101 June 25, 2026 19:52

itssimrank reviewed Jun 25, 2026

Comment thread devops_bench/evalharness/__init__.py

itssimrank reviewed Jun 25, 2026

Comment thread devops_bench/evalharness/artifacts.py

itssimrank approved these changes Jun 25, 2026

jessie1111101 reviewed Jun 25, 2026

Comment thread devops_bench/harness/default.py Outdated

jessie1111101 reviewed Jun 26, 2026

Comment thread devops_bench/harness/default.py Outdated

jessie1111101 reviewed Jun 26, 2026

Comment thread devops_bench/evalharness/default.py

jessie1111101 reviewed Jun 26, 2026

Comment thread devops_bench/evalharness/default.py

pradeepvrd added 2 commits June 25, 2026 17:58

pradeepvrd force-pushed the submit/8-harness branch from 73b19d0 to 90d8d89 Compare June 26, 2026 01:03

pradeepvrd merged commit 8a155d3 into submit/7-metrics Jun 26, 2026
1 check passed

pradeepvrd merged 2 commits into submit/7-metrics from submit/8-harness

3 participants
