feat(harness): evaluation orchestrator (Stage 3a)#10

Closed
pradeepvrd wants to merge 3 commits into integration/devops-bench-stage2-merged from feat/devops-bench-harness

@pradeepvrd

Owner

Decomposes the legacy orchestrator into devops_bench/harness/ (← pkg/manager/manager.py + the main() loop of pkg/evaluator/evaluate.py).

  • New modules: scenario.py (ScenarioManager: chaos + verification, port-forward, reports), default.py (phased pipeline), artifacts.py (generated-files diff/copy), base.py (Harness ABC + RunContext).
  • Agent execution is agent/model-agnostic via the AGENTS registry; scoring delegated to devops_bench.metrics (not reimplemented).
  • Stacked on Stage 2 (needs agents/chaos/verification/metrics).
  • Tests under tests/unit/harness/.

Stacked draft PR — part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.

@pradeepvrd pradeepvrd force-pushed the feat/devops-bench-harness branch from ce82ed5 to ccb985b Compare June 18, 2026 07:57
@pradeepvrd pradeepvrd force-pushed the integration/devops-bench-stage2-merged branch from a45ce16 to 110210e Compare June 18, 2026 07:57
Modules moved/refactored:
- pkg/manager/manager.py                 -> devops_bench/harness/scenario.py (ScenarioManager)
- pkg/evaluator/evaluate.py main() loop  -> devops_bench/harness/{base,default,artifacts}.py (decomposed pipeline)

The monolithic main() run loop and the threaded ScenarioManager are split into
the engine layer under devops_bench/harness/:
- base.py:      Harness ABC + RunContext construction; the run-phase contract.
- scenario.py:  ScenarioManager faithfully ported from manager.py (daemon-thread
                chaos+verification, chaos_active_event coordination, kubectl
                port-forward, get_reports -> (chaos_report, perf_report)), wired
                to devops_bench.chaos.ChaosAgent + devops_bench.verification.
- default.py:   DefaultHarness, the decomposed pipeline (provision -> optional
                background chaos -> agent run -> artifact capture -> teardown ->
                score -> report).
- artifacts.py: generated-files diff/copy extracted from main().
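
The phase ordering listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the method names (`provision`, `run_agent`, `teardown`, `score`), the `RunContext` fields, and `RecordingHarness` are all hypothetical, and the background-chaos and artifact-capture phases are left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RunContext:
    """Hypothetical stand-in for the RunContext that the harness gets from core."""
    task_id: str
    cluster: dict = field(default_factory=dict)
    phases: list = field(default_factory=list)  # illustration only: records phase order


class Harness(ABC):
    """Hypothetical run-phase contract; the method names are assumptions."""

    @abstractmethod
    def provision(self, ctx: RunContext) -> None: ...

    @abstractmethod
    def run_agent(self, ctx: RunContext) -> str: ...

    @abstractmethod
    def teardown(self, ctx: RunContext) -> None: ...

    @abstractmethod
    def score(self, ctx: RunContext, output: str) -> float: ...

    def run(self, ctx: RunContext) -> dict:
        # Phase order follows the commit message: provision, agent run,
        # teardown, score, report. Chaos and artifact capture are omitted.
        self.provision(ctx)
        output = self.run_agent(ctx)
        self.teardown(ctx)
        return {"task_id": ctx.task_id, "output": output, "score": self.score(ctx, output)}


class RecordingHarness(Harness):
    """Toy subclass that only records which phase ran."""

    def provision(self, ctx):
        ctx.phases.append("provision")

    def run_agent(self, ctx):
        ctx.phases.append("agent")
        return "scaled deployment to 3 replicas"

    def teardown(self, ctx):
        ctx.phases.append("teardown")

    def score(self, ctx, output):
        ctx.phases.append("score")
        return 1.0 if output else 0.0


ctx = RunContext(task_id="task-1")
result = RecordingHarness().run(ctx)
print(ctx.phases)
print(result["score"])
```

Keeping `run()` on the base class and the phases abstract is one way to express a "run-phase contract"; the actual split between base.py and default.py may differ.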

Bugs fixed vs legacy:
- none in this commit; the port is behavior-preserving and inherits the legacy
  bugs (port-forward PIPE deadlock, default deployment/namespace mismatch, join
  shorter than the verification budget, resource leak on early exception). Those
  are addressed in the follow-up fix(harness) commit.

Improvements vs legacy:
- Agent execution is model/provider-agnostic: the agent is resolved from the
  AGENTS registry by type (importing the concrete submodule so it self-registers)
  instead of the legacy hardcoded cli-vs-api dispatch + provider adapters.
- RunContext (from core) threads task metadata + cluster details across phases.
- Scoring is delegated to devops_bench.metrics.evaluate_metrics_batch and the
  ModelLayerJudge (via get_judge_model) rather than reimplementing scoring or the
  legacy provider-specific DeepEval judge wrappers.
- Logging goes through core.get_logger; env reads go through core.config helpers.
- Heavy deps (deepeval / mcp / provider SDKs) stay lazy: importing the harness
  package pulls none of them.
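
The registry-based resolution described above might look something like the sketch below. Everything here is an assumption made for illustration: the `register` decorator, the dict-backed `AGENTS`, the `devops_bench.agents.<type>` module path, and the injectable `importer` (present only so the sketch runs without the real package) are not taken from the PR.

```python
import importlib

AGENTS = {}  # hypothetical registry: agent type -> agent class


def register(name):
    """Class decorator that records the class under `name` at import time."""
    def decorator(cls):
        AGENTS[name] = cls
        return cls
    return decorator


def resolve_agent(agent_type, importer=importlib.import_module):
    """Import the concrete module lazily so it self-registers, then look it up."""
    if agent_type not in AGENTS:
        # Hypothetical module layout; importing runs the @register decorators.
        importer(f"devops_bench.agents.{agent_type}")
    try:
        return AGENTS[agent_type]
    except KeyError:
        raise LookupError(
            f"agent {agent_type!r} is not registered; known: {sorted(AGENTS)}"
        ) from None


def fake_importer(module_name):
    """Stands in for a real import: defining the class triggers registration."""
    @register("echo")
    class EchoAgent:
        def run(self, prompt):
            return f"echo: {prompt}"


agent_cls = resolve_agent("echo", importer=fake_importer)
print(agent_cls().run("scale the deployment"))
```

Because the import happens inside `resolve_agent`, importing the module that defines the registry pulls in no agent implementation or provider SDK until an agent type is actually requested.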
Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)

Bugs fixed vs legacy:
- Port-forward pipe deadlock: the kubectl port-forward Popen captured stdout and
  stderr with PIPE but nothing ever read them, so kubectl could block once its
  output buffer filled under sustained chaos load. Both streams now go to
  DEVNULL.
- Default deployment/namespace mismatch: replace_placeholders() defaulted
  TARGET_DEPLOYMENT_NAME/NAMESPACE to hello-app/production while start_scenario()
  defaulted them to hypercomputer-d1-frontend/default, so when the env was unset
  the agent prompt and the chaos target diverged. Both now read a single shared
  pair of module constants (_DEFAULT_TARGET_DEPLOYMENT / _DEFAULT_NAMESPACE).
- Scenario join shorter than the verification budget: _SCENARIO_JOIN_SEC was 90s
  while the verification timeout is 120s, so a slow-but-completing verification
  was cut off, partial reports were read, and the join raced teardown. The
  verification budget is now the public scenario.VERIFICATION_TIMEOUT_SEC and the
  join is set above it (budget + 60 = 180s).
- Resource leak on early exception: when a task errored before _drain_scenario
  joined the scenario thread, the daemon thread plus its kubectl port-forward and
  fortio load kept running across tasks. ScenarioManager.stop() now aborts the
  scenario (skips a pending verification) and terminates the port-forward, and
  _run_one calls it in its finally block.
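
Two of the fixes above (discarding the child's output instead of leaving unread pipes, and stopping the tunnel from a `finally` block) can be illustrated with a small sketch. The `Tunnel` class and `run_one` function are hypothetical, and a noisy Python child process stands in for `kubectl port-forward` so the example runs anywhere.

```python
import subprocess
import sys


class Tunnel:
    """Hypothetical wrapper; in practice `cmd` would be a kubectl port-forward argv."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def start(self):
        # DEVNULL instead of PIPE: nothing reads the child's output, so an
        # unread pipe could fill up and block the child.
        self.proc = subprocess.Popen(
            self.cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )

    def stop(self, timeout=5.0):
        if self.proc is None or self.proc.poll() is not None:
            return  # never started, or already exited
        self.proc.terminate()
        try:
            self.proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self.proc.kill()
            self.proc.wait()


def run_one(task, tunnel):
    """Run one task; the tunnel is stopped even if the task raises."""
    tunnel.start()
    try:
        return task()
    finally:
        tunnel.stop()


# A child that writes continuously, standing in for a chatty port-forward.
noisy = [
    sys.executable,
    "-c",
    "import sys, time\nwhile True:\n    sys.stdout.write('x' * 4096)\n    time.sleep(0.01)",
]
tunnel = Tunnel(noisy)


def failing_task():
    raise RuntimeError("agent crashed before the scenario was drained")


try:
    run_one(failing_task, tunnel)
except RuntimeError as exc:
    print("task failed:", exc)
print("tunnel exited:", tunnel.proc.poll() is not None)
```

The same `finally` placement is what keeps a failed task from leaking its child process into the next task.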

Improvements vs legacy:
- none in this commit; behavioral/robustness improvements land in the follow-up
  feat(harness) commit.
…s poll, spec-driven chaos URL

Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)

Bugs fixed vs legacy:
- none (the legacy-inherited bugs were addressed in the preceding fix(harness)
  commit)

Improvements vs legacy:
- Clear error on an unregistered agent: resolve_agent() raises NotRegisteredError
  (naming the resolved key and listing the registered agents) when an imported
  agent module fails to register, instead of an opaque TypeError from calling
  None(). AGENTS.get already raises on a true miss; this guards the
  imported-but-did-not-register case.
- Failed-task result records: a task that errors mid-run no longer drops out of
  results.json. _run_one returns a _failed_record() ({status: "failed", error,
  score: 0, output: "", plus the task's identifying fields}); successful records
  gain a status: "success" marker; run() collects every record; _score() skips
  failed records (no agent output to judge) while they are still written out for
  downstream parsers.
- Port-forward liveness poll: after the settle delay, _inject_chaos_with_delay
  polls the kubectl port-forward and raises a clear RuntimeError if it already
  exited (missing deployment, auth error), so chaos load is never pointed at a
  dead tunnel.
- Spec-driven chaos target URL (coordinates with the chaos agent reading the
  target from the spec): _inject_chaos_with_delay rewrites the action's
  target.service_url to the shared local-port constant, and _inject_fault reads
  that URL back from the action for the goal text rather than re-hardcoding it, so
  the spec is the single source of truth for the target URL.
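
The failed-task record behaviour described above can be sketched as follows. This is an illustration under assumptions: the function names drop the leading underscores, the identifying field is assumed to be a single `id`, and the agent and judge are plain callables rather than the real agent and metrics layers.

```python
def failed_record(task, error):
    """Placeholder record so a failed task still appears in the results."""
    return {
        "id": task["id"],  # hypothetical identifying field
        "status": "failed",
        "error": str(error),
        "score": 0,
        "output": "",
    }


def run_one(task, agent):
    """Run the agent on one task; convert any error into a failed record."""
    try:
        output = agent(task)
    except Exception as exc:
        return failed_record(task, exc)
    return {"id": task["id"], "status": "success", "output": output}


def run(tasks, agent):
    """Collect every record, failed or not."""
    return [run_one(task, agent) for task in tasks]


def score(records, judge):
    """Judge only successful records; failed ones keep score 0."""
    for record in records:
        if record["status"] == "failed":
            continue  # no agent output to judge
        record["score"] = judge(record["output"])
    return records


def flaky_agent(task):
    if task["id"] == "t2":
        raise TimeoutError("agent timed out")
    return f"fixed {task['id']}"


records = score(run([{"id": "t1"}, {"id": "t2"}], flaky_agent), judge=lambda out: 1.0)
for record in records:
    print(record)
```

Downstream parsers then see one record per task and can filter on `status` instead of inferring failures from missing entries.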
@pradeepvrd pradeepvrd force-pushed the integration/devops-bench-stage2-merged branch from 110210e to c80543d Compare June 18, 2026 08:23
@pradeepvrd pradeepvrd force-pushed the feat/devops-bench-harness branch from ccb985b to 8bbef2c Compare June 18, 2026 08:23
@pradeepvrd

Owner Author

Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.

@pradeepvrd pradeepvrd closed this Jun 20, 2026