feat(harness): evaluation orchestrator (Stage 3a) by pradeepvrd · Pull Request #10 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-18T07:06:49Z

Decomposes the legacy orchestrator into devops_bench/harness/ (← pkg/manager/manager.py + the main() loop of pkg/evaluator/evaluate.py).

- scenario.py (ScenarioManager: chaos+verification, port-forward, reports), default.py (phased pipeline), artifacts.py, base.py (Harness ABC + RunContext).
- Agent execution is agent/model-agnostic via the AGENTS registry; scoring is delegated to devops_bench.metrics (not reimplemented).
- Stacked on Stage 2 (needs agents/chaos/verification/metrics).
- Tests under tests/unit/harness/.

Stacked draft PR: part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.

Modules moved/refactored:
- pkg/manager/manager.py -> devops_bench/harness/scenario.py (ScenarioManager)
- pkg/evaluator/evaluate.py main() loop -> devops_bench/harness/{base,default,artifacts}.py (decomposed pipeline)

The monolithic main() run loop and the threaded ScenarioManager are split into the engine layer under devops_bench/harness/:
- base.py: Harness ABC + RunContext construction; the run-phase contract.
- scenario.py: ScenarioManager faithfully ported from manager.py (daemon-thread chaos+verification, chaos_active_event coordination, kubectl port-forward, get_reports -> (chaos_report, perf_report)), wired to devops_bench.chaos.ChaosAgent + devops_bench.verification.
- default.py: DefaultHarness, the decomposed pipeline (provision -> optional background chaos -> agent run -> artifact capture -> teardown -> score -> report).
- artifacts.py: generated-files diff/copy extracted from main().

Bugs fixed vs legacy:
- none in this commit; the port is behavior-preserving and inherits the legacy bugs (port-forward PIPE deadlock, default deployment/namespace mismatch, join shorter than the verification budget, resource leak on early exception). Those are addressed in the follow-up fix(harness) commit.

Improvements vs legacy:
- Agent execution is model/provider-agnostic: the agent is resolved from the AGENTS registry by type (importing the concrete submodule so it self-registers) instead of the legacy hardcoded cli-vs-api dispatch + provider adapters.
- RunContext (from core) threads task metadata + cluster details across phases.
- Scoring is delegated to devops_bench.metrics.evaluate_metrics_batch and the ModelLayerJudge (via get_judge_model) rather than reimplementing scoring or the legacy provider-specific DeepEval judge wrappers.
- Logging goes through core.get_logger; env reads go through core.config helpers.
- Heavy deps (deepeval / mcp / provider SDKs) stay lazy: importing the harness package pulls none of them.
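For reviewers, the phase order described for default.py can be sketched roughly as below. This is a minimal illustration, not the actual DefaultHarness: the `RunContext` fields, the `run_task` method name, and the `SketchHarness` class are assumptions made up for the example, and every phase is reduced to a recorded label.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RunContext:
    """Hypothetical stand-in: task metadata + cluster details shared by phases."""
    task_id: str
    cluster: dict = field(default_factory=dict)
    with_chaos: bool = False


class Harness(ABC):
    """Run-phase contract: a harness turns one RunContext into one result record."""

    @abstractmethod
    def run_task(self, ctx: RunContext) -> dict:
        ...


class SketchHarness(Harness):
    """Records the order of phases instead of doing real work."""

    def __init__(self) -> None:
        self.phases: list[str] = []

    def run_task(self, ctx: RunContext) -> dict:
        self.phases.append("provision")
        if ctx.with_chaos:
            # Background chaos is optional and only starts when requested.
            self.phases.append("chaos")
        self.phases.append("agent_run")
        output = f"agent output for {ctx.task_id}"  # placeholder agent output
        self.phases.append("artifact_capture")
        self.phases.append("teardown")
        # Scoring and reporting happen after teardown, as in the listed order.
        self.phases.append("score")
        self.phases.append("report")
        return {"task_id": ctx.task_id, "output": output}
```

Calling `SketchHarness().run_task(RunContext("t1", with_chaos=True))` walks all seven phases; with `with_chaos=False` the chaos phase is skipped and the rest of the order is unchanged.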

Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)

Bugs fixed vs legacy:
- Port-forward pipe deadlock: the kubectl port-forward Popen captured stdout and stderr with PIPE but nothing ever read them, so kubectl could block once its output buffer filled under sustained chaos load. Both streams now go to DEVNULL.
- Default deployment/namespace mismatch: replace_placeholders() defaulted TARGET_DEPLOYMENT_NAME/NAMESPACE to hello-app/production while start_scenario() defaulted them to hypercomputer-d1-frontend/default, so when the env was unset the agent prompt and the chaos target diverged. Both now read a single shared pair of module constants (_DEFAULT_TARGET_DEPLOYMENT / _DEFAULT_NAMESPACE).
- Scenario join shorter than the verification budget: _SCENARIO_JOIN_SEC was 90s while the verification timeout is 120s, so a slow-but-completing verification was cut off, partial reports were read, and the join raced teardown. The verification budget is now the public scenario.VERIFICATION_TIMEOUT_SEC and the join is set above it (budget + 60 = 180s).
- Resource leak on early exception: when a task errored before _drain_scenario joined the scenario thread, the daemon thread plus its kubectl port-forward and fortio load kept running across tasks. ScenarioManager.stop() now aborts the scenario (skips a pending verification) and terminates the port-forward, and _run_one calls it in its finally block.

Improvements vs legacy:
- none in this commit; behavioral/robustness improvements land in the follow-up feat(harness) commit.

…s poll, spec-driven chaos URL

Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)

Bugs fixed vs legacy:
- none (the legacy-inherited bugs were addressed in the preceding fix(harness) commit)

Improvements vs legacy:
- Clear error on an unregistered agent: resolve_agent() raises NotRegisteredError (naming the resolved key and listing the registered agents) when an imported agent module fails to register, instead of an opaque TypeError from calling None(). AGENTS.get already raises on a true miss; this guards the imported-but-did-not-register case.
- Failed-task result records: a task that errors mid-run no longer drops out of results.json. _run_one returns a _failed_record() ({status: "failed", error, score: 0, output: "", plus the task's identifying fields}); successful records gain a status: "success" marker; run() collects every record; _score() skips failed records (no agent output to judge) while they are still written out for downstream parsers.
- Port-forward liveness poll: after the settle delay, _inject_chaos_with_delay polls the kubectl port-forward and raises a clear RuntimeError if it already exited (missing deployment, auth error), so chaos load is never pointed at a dead tunnel.
- Spec-driven chaos target URL (coordinates with the chaos agent reading the target from the spec): _inject_chaos_with_delay rewrites the action's target.service_url to the shared local-port constant, and _inject_fault reads that URL back from the action for the goal text rather than re-hardcoding it, so the spec is the single source of truth for the target URL.
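A rough sketch of the first two improvements, under stated assumptions: the `Registry` class, its `lookup` method, the injectable `load_module` callable, and the `task_id` field are hypothetical stand-ins chosen so the example runs on its own; the real AGENTS registry and record fields may differ.

```python
class NotRegisteredError(LookupError):
    """Raised when an agent type has no registered factory."""


class Registry:
    """Hypothetical minimal registry keyed by agent type."""

    def __init__(self):
        self._items = {}

    def register(self, key):
        def decorator(obj):
            self._items[key] = obj
            return obj
        return decorator

    def lookup(self, key):
        return self._items.get(key)  # None when nothing is registered

    def keys(self):
        return sorted(self._items)


AGENTS = Registry()


def resolve_agent(agent_type, load_module):
    """Import the agent's module (via load_module), then fetch its factory."""
    load_module(agent_type)  # the module is expected to self-register on import
    factory = AGENTS.lookup(agent_type)
    if factory is None:
        # Fail here with a named key rather than later with a TypeError on None().
        raise NotRegisteredError(
            f"agent {agent_type!r} did not register; registered agents: {AGENTS.keys()}"
        )
    return factory


def failed_record(task, error):
    """Result record for a task that errored, so it still appears in the results."""
    return {**task, "status": "failed", "error": str(error), "score": 0, "output": ""}


def scorable(records):
    """Failed records are kept in the output but skipped by scoring."""
    return [r for r in records if r.get("status") != "failed"]
```

Passing the loader in as a parameter is only a convenience for the sketch; it lets the imported-but-did-not-register case be exercised without real agent modules.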

pradeepvrd · 2026-06-20T08:07:34Z

Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.

pradeepvrd force-pushed the feat/devops-bench-harness branch from ce82ed5 to ccb985b · June 18, 2026 07:57

pradeepvrd force-pushed the integration/devops-bench-stage2-merged branch from a45ce16 to 110210e · June 18, 2026 07:57

pradeepvrd added 3 commits June 18, 2026 01:22

pradeepvrd force-pushed the integration/devops-bench-stage2-merged branch from 110210e to c80543d · June 18, 2026 08:23

pradeepvrd force-pushed the feat/devops-bench-harness branch from ccb985b to 8bbef2c · June 18, 2026 08:23

pradeepvrd closed this Jun 20, 2026

pradeepvrd wants to merge 3 commits into integration/devops-bench-stage2-merged from feat/devops-bench-harness
