feat(harness): default harness wiring + env-snapshot reporting #120

Merged
pradeepvrd merged 2 commits into submit/7-metrics from submit/8-harness on Jun 26, 2026

@pradeepvrd

Collaborator

The per-task run used to be a monolithic top-level loop in pkg/evaluator/evaluate.py (provision → execute agent → judge → teardown → write results); this decomposes it into devops_bench/harness/ — a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.

Behavior changes

  • Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
  • Each record carries capabilities_granted (use_mcp + skills) so consumers read what was actually granted instead of re-reading BENCH_USE_MCP.
  • Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
  • chaos_spec / verification_spec accept native YAML (JSON-in-YAML strings still parse).
  • Generated files are captured by diffing the workspace before/after the agent run into <run_dir>/generated_files/.
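
The first two bullets can be illustrated with a minimal sketch. It assumes capabilities are derived from `BENCH_USE_MCP` plus a hypothetical `BENCH_SKILLS` variable; `Capabilities`, `resolve_capabilities`, and `SketchHarness` are illustrative names, not the PR's actual API.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class Capabilities:
    """Immutable snapshot of what the agent is granted (hypothetical type)."""

    use_mcp: bool
    skills: Tuple[str, ...]


def resolve_capabilities(env: Mapping[str, str]) -> Capabilities:
    # BENCH_USE_MCP is named in the PR description; BENCH_SKILLS and the
    # truthy-string parsing below are assumptions made for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").lower() in {"1", "true", "yes"}
    skills = tuple(s for s in env.get("BENCH_SKILLS", "").split(",") if s)
    return Capabilities(use_mcp=use_mcp, skills=skills)


class SketchHarness:
    def __init__(self, env: Mapping[str, str]) -> None:
        # Resolved exactly once; later changes to env are not observed.
        self._caps = resolve_capabilities(env)

    def record(self, task_id: str) -> Dict[str, object]:
        # Every record reports the snapshot, not the current environment.
        return {"task_id": task_id, "capabilities_granted": asdict(self._caps)}


env = {"BENCH_USE_MCP": "true", "BENCH_SKILLS": "kubectl,helm"}
harness = SketchHarness(env)
env["BENCH_USE_MCP"] = "false"  # simulated mid-batch mutation
rec = harness.record("task-1")
```

Because the snapshot is frozen at construction, `rec["capabilities_granted"]` still reports `use_mcp` as true after the mutation, which is the property the bullets describe.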

Bugs fixed

  • The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
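
One way to picture the fix: resolve the target once and hand the same value to both components. This is a sketch under stated assumptions. The env var names (`BENCH_DEPLOYMENT`, `BENCH_NAMESPACE`), the fallback values, and the class shapes are all hypothetical and do not come from the PR.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical fallbacks; the real defaults are not stated in the PR.
DEFAULT_DEPLOYMENT = "demo-app"
DEFAULT_NAMESPACE = "default"


@dataclass(frozen=True)
class Target:
    deployment: str
    namespace: str


def resolve_target(env: Mapping[str, str]) -> Target:
    # Single resolution point, so both consumers share one fallback rule.
    return Target(
        deployment=env.get("BENCH_DEPLOYMENT") or DEFAULT_DEPLOYMENT,
        namespace=env.get("BENCH_NAMESPACE") or DEFAULT_NAMESPACE,
    )


class OperatorAgent:
    def __init__(self, target: Target) -> None:
        self.target = target


class ChaosInjector:
    def __init__(self, target: Target) -> None:
        self.target = target


shared = resolve_target({})  # relevant env vars unset
agent = OperatorAgent(shared)
injector = ChaosInjector(shared)
```

With a shared resolver, the two components cannot fall back to different defaults, which is the divergence the bullet describes.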

@pradeepvrd pradeepvrd force-pushed the submit/8-harness branch 2 times, most recently from 173c46e to 73b19d0 June 25, 2026 05:03
@pradeepvrd pradeepvrd marked this pull request as ready for review June 25, 2026 19:52
Comment thread devops_bench/evalharness/__init__.py
Comment thread devops_bench/evalharness/artifacts.py
Comment thread devops_bench/harness/default.py Outdated
Comment thread devops_bench/harness/default.py Outdated
Comment thread devops_bench/evalharness/default.py
Comment thread devops_bench/evalharness/default.py
…ic chaos report; rename package

Address review feedback on the harness PR (#120):

- Rename the package devops_bench/harness -> devops_bench/evalharness (and
  tests/unit/harness -> tests/unit/evalharness) so it is not confused with the
  agent harness; update all imports, logger channels, and migration-doc paths.
  Public classes (Harness, DefaultHarness, ResultReporter, ScenarioManager)
  are unchanged.
- run() now wraps _score + the post-score rewrite in try/except: a judge or
  config failure (e.g. get_judge_model()) no longer escapes run() and discards
  an otherwise successful execution pass, whose raw results are already on disk.
- Drop the scalar `score` field from every record (and from _RECORD_KEYS): it
  was seeded to 0 and never written, so every record shipped score: 0. The
  per-metric `scores` map is the source of truth; tests updated accordingly.
- _drain_scenario deep-copies the chaos report on the join-timeout branch
  before stamping status, since the daemon thread may still be mid-write and a
  shallow dict() could capture a torn nested structure.
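
The try/except and deep-copy items above can be sketched as follows. This is an illustration only: the `run` signature, the callables it takes, and the logger name are assumptions, not the package's real interface. The copy comparison shows why `dict()` shares nested structure with the original; it does not model the threading itself.

```python
import copy
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("evalharness.sketch")  # hypothetical channel name

Record = Dict[str, Any]


def run(
    execute: Callable[[], Record],
    score: Callable[[Record], Dict[str, float]],
    write: Callable[[Record], None],
) -> Record:
    record = execute()
    write(dict(record))  # raw results are persisted before scoring
    try:
        scored = {**record, "scores": score(record)}
        write(scored)  # post-score rewrite
        return scored
    except Exception:
        # A judge or config failure is logged; the raw record is kept.
        logger.exception("scoring failed; keeping unscored record")
        return record


def failing_score(_record: Record) -> Dict[str, float]:
    raise RuntimeError("judge unavailable")


written = []
result = run(lambda: {"task_id": "t1", "output": "ok"}, failing_score, written.append)

# Shallow vs deep copy of a nested report that is mutated afterwards.
report = {"status": "running", "events": [{"step": 1}]}
shallow = dict(report)
deep = copy.deepcopy(report)
report["events"].append({"step": 2})  # stands in for a late write
```

Here `shallow["events"]` is the same list object as `report["events"]`, so it sees the late append, while `deep` keeps the state it had when it was copied.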
@pradeepvrd pradeepvrd merged commit 8a155d3 into submit/7-metrics Jun 26, 2026
1 check passed
pradeepvrd added a commit that referenced this pull request Jun 26, 2026
* feat(metrics): metrics suite + bundled skills

The metrics suite used to be built inline in `pkg/evaluator/evaluate.py` (GEval construction, checklist parsing, scoring in `evaluate_metrics_batch`); this extracts it into `devops_bench/metrics/` as registry-driven evaluators (`METRICS`) over a typed `MetricContext`/`MetricScore`, with judges routed through the `models` layer and judge skills shipped as package data under `devops_bench/skills/`.

**Behavior changes**
- Each metric declares its own `applies()` and yields typed scores via the `METRICS` registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
- Judges are instantiated through the `models` layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
- Judge skill markdown moves from a hard-coded `skills/` filesystem path to `devops_bench/skills/` package data, so it resolves under `pip install` / wheels.
- The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.
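
A registry of this kind might look roughly like the sketch below. The names `METRICS`, `MetricContext`, `MetricScore`, and `applies()` come from the description above; the fields, the `register` decorator, the `evaluate` method, and the two example metrics are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol


@dataclass(frozen=True)
class MetricContext:
    output: str      # assumed field: raw agent output
    use_mcp: bool    # assumed field: whether MCP was granted


@dataclass(frozen=True)
class MetricScore:
    name: str
    value: float


class Metric(Protocol):
    def applies(self, ctx: MetricContext) -> bool: ...
    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]: ...


METRICS: List[Metric] = []


def register(cls):
    """Hypothetical decorator: instantiate the metric and add it to METRICS."""
    METRICS.append(cls())
    return cls


@register
class NonEmptyOutput:
    def applies(self, ctx: MetricContext) -> bool:
        return True

    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore("non_empty_output", 1.0 if ctx.output.strip() else 0.0)


@register
class ToolInvocation:
    def applies(self, ctx: MetricContext) -> bool:
        return ctx.use_mcp  # skipped entirely when MCP is disabled

    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore("tool_invocation", 1.0 if "tool_call" in ctx.output else 0.0)


def score_all(ctx: MetricContext) -> Dict[str, float]:
    # The pipeline only iterates the registry; it never names a metric.
    return {s.name: s.value for m in METRICS if m.applies(ctx) for s in m.evaluate(ctx)}
```

Under this shape, a downstream package adds a metric by registering another class, and a non-MCP run produces no `tool_invocation` entry at all rather than a zero.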

**Bugs fixed**
- Checklist parsing uses `lstrip("- ")` instead of `strip("- ")`, so trailing hyphens in requirement text are no longer truncated.
- Document-retrieval scoring guards empty `doc_name`/`url` fields, which previously matched all text and inflated the rate.
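
Both fixes rest on standard Python string behavior, shown in the sketch below. The helper names and the shape of the document entries are assumptions; only the `strip`/`lstrip` calls and the `doc_name`/`url` field names come from the description above.

```python
from typing import Dict, List


def parse_checklist(text: str) -> List[str]:
    """Strip the leading bullet marker only, keeping trailing hyphens."""
    items = []
    for line in text.splitlines():
        item = line.lstrip("- ").rstrip()  # rstrip() removes whitespace only
        if item:
            items.append(item)
    return items


def retrieval_rate(expected_docs: List[Dict[str, str]], text: str) -> float:
    """Fraction of expected docs mentioned in text (assumed scoring rule)."""
    if not expected_docs:
        return 0.0
    hits = 0
    for doc in expected_docs:
        # Guard: an empty string is a substring of every string, so empty
        # fields must be dropped before the membership test.
        needles = [v for v in (doc.get("doc_name"), doc.get("url")) if v]
        if any(n in text for n in needles):
            hits += 1
    return hits / len(expected_docs)


line = "- pass --force-"
old = line.strip("- ")    # strips the character set from both ends
new = line.lstrip("- ")   # strips it from the left end only

docs = [{"doc_name": "runbook.md", "url": ""}, {"doc_name": "", "url": ""}]
rate = retrieval_rate(docs, "see runbook.md for the rollback steps")
```

Without the guard, the second entry would count as a hit because `"" in text` is always true, which is how the rate was inflated.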


* feat(harness): default harness wiring + env-snapshot reporting (#120)

3 participants