feat(harness): default harness wiring + env-snapshot reporting #120
Merged
itssimrank reviewed Jun 25, 2026
itssimrank reviewed Jun 25, 2026
itssimrank approved these changes Jun 25, 2026
The per-task run used to be a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results); this decomposes it into `devops_bench/harness/`, a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.
**Behavior changes**
- Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (use_mcp + skills) so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
- `chaos_spec` / `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before/after the agent run into `<run_dir>/generated_files/`.
**Bugs fixed**
- The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
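The resolve-once behavior described above can be illustrated with a minimal sketch. This is not the PR's implementation: `CapabilitySnapshot`, `resolve_capabilities`, `SketchHarness`, the `BENCH_SKILLS` variable, and the truthy-value parsing are all hypothetical names and assumptions; only `BENCH_USE_MCP` and the `capabilities_granted` record key come from the description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class CapabilitySnapshot:
    """Immutable record of what the agent was granted (hypothetical shape)."""
    use_mcp: bool
    skills: Tuple[str, ...]


def resolve_capabilities(env: Mapping[str, str]) -> CapabilitySnapshot:
    # BENCH_USE_MCP is named in the PR; BENCH_SKILLS and the accepted
    # truthy strings are assumptions made for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").strip().lower() in {"1", "true", "yes"}
    raw_skills = env.get("BENCH_SKILLS", "")
    skills = tuple(s.strip() for s in raw_skills.split(",") if s.strip())
    return CapabilitySnapshot(use_mcp=use_mcp, skills=skills)


class SketchHarness:
    """Hypothetical stand-in for the default harness."""

    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        # Resolved once at construction; run() never re-reads the environment.
        self._capabilities = resolve_capabilities(env)

    def run(self, task_id: str) -> dict:
        # Every record reports the snapshot, not the current environment.
        return {
            "task_id": task_id,
            "capabilities_granted": {
                "use_mcp": self._capabilities.use_mcp,
                "skills": list(self._capabilities.skills),
            },
        }


env = {"BENCH_USE_MCP": "true", "BENCH_SKILLS": "kubectl, helm"}
harness = SketchHarness(env)
env["BENCH_USE_MCP"] = "false"  # simulated mid-batch env mutation
record = harness.run("task-1")
```

Because the snapshot is a frozen dataclass taken in `__init__`, the record still reports `use_mcp` as granted even though the environment changed afterwards, which is the desync the description says the harness now rules out.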
…ic chaos report; rename package
Address review feedback on the harness PR (#120):
- Rename the package devops_bench/harness -> devops_bench/evalharness (and tests/unit/harness -> tests/unit/evalharness) so it is not confused with the agent harness; update all imports, logger channels, and migration-doc paths. Public classes (Harness, DefaultHarness, ResultReporter, ScenarioManager) are unchanged.
- run() now wraps _score + the post-score rewrite in try/except: a judge or config failure (e.g. get_judge_model()) no longer escapes run() and discards an otherwise successful execution pass, whose raw results are already on disk.
- Drop the scalar `score` field from every record (and from _RECORD_KEYS): it was seeded to 0 and never written, so every record shipped score: 0. The per-metric `scores` map is the source of truth; tests updated accordingly.
- _drain_scenario deep-copies the chaos report on the join-timeout branch before stamping status, since the daemon thread may still be mid-write and a shallow dict() could capture a torn nested structure.
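The difference between a shallow `dict()` copy and a deep copy in the join-timeout case can be shown with a small sketch. The report layout below (`status`, `events`, `fault`, `ok`) is hypothetical, and the writer thread is simulated by plain mutations after the copies are taken.

```python
import copy

# Hypothetical chaos report; the nested list is still owned by the injector.
live_report = {"status": "running", "events": [{"fault": "pod-kill", "ok": True}]}

shallow = dict(live_report)          # copies top-level keys only
deep = copy.deepcopy(live_report)    # copies nested containers as well

# Simulate the daemon thread continuing to write after the join timed out.
live_report["events"].append({"fault": "net-delay", "ok": False})
live_report["events"][0]["ok"] = False

# Stamping status only touches the copies' own top-level key.
shallow["status"] = "timeout"
deep["status"] = "timeout"

shallow_event_count = len(shallow["events"])  # shares the live list
deep_events = deep["events"]                  # isolated from later writes
```

The shallow copy shares `events` with the live report, so later writes leak into it, while the deep copy stays as captured. A deep copy isolates the snapshot from subsequent writes but does not by itself make the copy atomic; if the writer can mutate the structure during the copy, a lock around both sides would be the stricter option.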
pradeepvrd added a commit that referenced this pull request on Jun 26, 2026
* feat(metrics): metrics suite + bundled skills
The metrics suite used to be built inline in `pkg/evaluator/evaluate.py` (GEval construction, checklist parsing, scoring in `evaluate_metrics_batch`); this extracts it into `devops_bench/metrics/` as registry-driven evaluators (`METRICS`) over a typed `MetricContext`/`MetricScore`, with judges routed through the `models` layer and judge skills shipped as package data under `devops_bench/skills/`.
**Behavior changes**
- Each metric declares its own `applies()` and yields typed scores via the `METRICS` registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
- Judges are instantiated through the `models` layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
- Judge skill markdown moves from a hard-coded `skills/` filesystem path to `devops_bench/skills/` package data, so it resolves under `pip install` / wheels.
- The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.
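A registry of metrics that each declare `applies()` might look roughly like the sketch below. The names `METRICS`, `MetricContext`, and `MetricScore` appear in the commit message, but their fields, the `register` decorator, the `evaluate` function, and both example metrics are hypothetical. Modelling the MCP-dependent check as a whole metric (rather than as one checklist item) is a simplification made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol


@dataclass(frozen=True)
class MetricContext:
    # Hypothetical fields; the real MetricContext is not shown in the PR.
    agent_output: str
    use_mcp: bool


@dataclass(frozen=True)
class MetricScore:
    name: str
    value: float


class Metric(Protocol):
    name: str

    def applies(self, ctx: MetricContext) -> bool: ...

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]: ...


METRICS: Dict[str, Metric] = {}


def register(metric_cls):
    """Class decorator (assumed) that adds one instance to the registry."""
    instance = metric_cls()
    METRICS[instance.name] = instance
    return metric_cls


@register
class NonEmptyOutput:
    name = "non_empty_output"

    def applies(self, ctx: MetricContext) -> bool:
        return True

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if ctx.agent_output.strip() else 0.0)


@register
class ToolInvocation:
    name = "tool_invocation"

    def applies(self, ctx: MetricContext) -> bool:
        # Skipped entirely for non-MCP runs.
        return ctx.use_mcp

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if "tool_call" in ctx.agent_output else 0.0)


def evaluate(ctx: MetricContext) -> List[MetricScore]:
    """Run every registered metric whose applies() accepts the context."""
    scores: List[MetricScore] = []
    for metric in METRICS.values():
        if metric.applies(ctx):
            scores.extend(metric.score(ctx))
    return scores
```

With this shape, a downstream package only needs to define a class and decorate it; the loop in `evaluate` does not change when metrics are added.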
**Bugs fixed**
- Checklist parsing uses `lstrip("- ")` instead of `strip("- ")`, so trailing hyphens in requirement text are no longer truncated.
- Document-retrieval scoring guards empty `doc_name`/`url` fields, which previously matched all text and inflated the rate.
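Both fixes rest on standard Python string behavior, which a short sketch can make concrete. The sample checklist line and the `doc_retrieved` helper are hypothetical; only the `strip("- ")` / `lstrip("- ")` calls and the empty `doc_name`/`url` case come from the commit message.

```python
# strip/lstrip take a set of characters, not a prefix or suffix string.
line = "- restart the pod named web-"
stripped_both = line.strip("- ")   # also eats the trailing hyphen
stripped_left = line.lstrip("- ")  # keeps the trailing hyphen

# The empty string is a substring of every string, so an unguarded
# membership test on an empty field always reports a match.
unguarded_match = "" in "answer that cites no documents"


def doc_retrieved(text: str, doc_name: str, url: str) -> bool:
    """Hypothetical guard: ignore empty fields before substring matching."""
    needles = [n for n in (doc_name, url) if n]
    return any(n in text for n in needles)


guarded_empty = doc_retrieved("answer that cites no documents", "", "")
guarded_hit = doc_retrieved("see runbook.md for steps", "runbook.md", "")
```

Here `stripped_both` is `"restart the pod named web"` while `stripped_left` keeps the final hyphen. Since `lstrip("- ")` still removes any run of leading hyphens and spaces, a requirement that itself begins with a hyphen (for example a `--flag`) would lose those characters too.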
* feat(harness): default harness wiring + env-snapshot reporting (#120)