feat(harness): default harness wiring + env-snapshot reporting #33
Draft
pradeepvrd wants to merge 1 commit into
pradeepvrd commented on Jun 20, 2026
The per-task run used to be a monolithic top-level loop in `pkg/evaluator/evaluate.py` (provision → execute agent → judge → teardown → write results). This PR decomposes it into `devops_bench/harness/`: a default harness that wires deployer + agent + verification + metrics, resolves the agent's capabilities once, snapshots the run environment, and persists results through a pluggable reporter.

**Behavior changes**

- Agent capabilities are resolved once at harness construction and snapshotted, so a mid-batch env mutation cannot desync what the agent ran with from what the record reports.
- Each record carries `capabilities_granted` (use_mcp + skills) so consumers read what was actually granted instead of re-reading `BENCH_USE_MCP`.
- Results are written through the reporter both before and after scoring, so raw execution output is inspectable independent of scoring.
- `chaos_spec` / `verification_spec` accept native YAML (JSON-in-YAML strings still parse).
- Generated files are captured by diffing the workspace before/after the agent run into `<run_dir>/generated_files/`.

**Bugs fixed**

- The operator agent and the chaos injector resolve a shared default deployment + namespace at init, so they target the same workload when the relevant env vars are unset (previously they could diverge).
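For reviewers, the resolve-once behavior can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the code in this PR: `Capabilities`, `resolve_capabilities`, `Harness.build_record`, the `BENCH_SKILLS` variable, and the set of accepted truthy strings are all hypothetical names chosen for illustration. Only `BENCH_USE_MCP` and the `capabilities_granted` field (use_mcp + skills) come from the description above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    """Immutable snapshot of what the agent was granted (hypothetical type)."""

    use_mcp: bool
    skills: Tuple[str, ...]


def resolve_capabilities(env: Mapping[str, str]) -> Capabilities:
    # BENCH_USE_MCP is named in the PR description. BENCH_SKILLS and the
    # truthy-string set are assumptions made for this sketch.
    use_mcp = env.get("BENCH_USE_MCP", "").strip().lower() in {"1", "true", "yes"}
    raw_skills = env.get("BENCH_SKILLS", "")
    skills = tuple(s.strip() for s in raw_skills.split(",") if s.strip())
    return Capabilities(use_mcp=use_mcp, skills=skills)


class Harness:
    """Hypothetical stand-in for the default harness."""

    def __init__(self, env: Optional[Mapping[str, str]] = None) -> None:
        # Resolve once, at construction. Later changes to the environment
        # are never re-read, so every record reports the same grant.
        source = os.environ if env is None else env
        self.capabilities = resolve_capabilities(source)

    def build_record(self, task_id: str) -> dict:
        caps = self.capabilities
        return {
            "task_id": task_id,
            "capabilities_granted": {
                "use_mcp": caps.use_mcp,
                "skills": list(caps.skills),
            },
        }
```

In this sketch, mutating the environment after `Harness(...)` is constructed does not change what `build_record` reports, which is the property the first two behavior changes describe. The frozen dataclass is one way to make the snapshot hard to alter by accident; the actual harness may store it differently.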