feat(harness): evaluation orchestrator (Stage 3a) #10
Closed
pradeepvrd wants to merge 3 commits into
Modules moved/refactored:
- pkg/manager/manager.py -> devops_bench/harness/scenario.py (ScenarioManager)
- pkg/evaluator/evaluate.py main() loop -> devops_bench/harness/{base,default,artifacts}.py (decomposed pipeline)
The monolithic main() run loop and the threaded ScenarioManager are split into
the engine layer under devops_bench/harness/:
- base.py: Harness ABC + RunContext construction; the run-phase contract.
- scenario.py: ScenarioManager faithfully ported from manager.py (daemon-thread
chaos+verification, chaos_active_event coordination, kubectl
port-forward, get_reports -> (chaos_report, perf_report)), wired
to devops_bench.chaos.ChaosAgent + devops_bench.verification.
- default.py: DefaultHarness, the decomposed pipeline (provision -> optional
background chaos -> agent run -> artifact capture -> teardown ->
score -> report).
- artifacts.py: generated-files diff/copy extracted from main().
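The phase order listed for default.py can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchHarness, the trace list, and the fields on this stand-in RunContext are hypothetical and are not the actual devops_bench classes; running teardown in a finally block is also an assumption of the sketch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RunContext:
    """Hypothetical stand-in: task metadata plus cluster details shared by phases."""
    task: Dict[str, Any]
    cluster: Dict[str, Any] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # records phase order (sketch only)


class Harness(ABC):
    """Run-phase contract: a harness turns a context into a result record."""

    @abstractmethod
    def run(self, ctx: RunContext) -> Dict[str, Any]:
        ...


class SketchHarness(Harness):
    """Hypothetical pipeline that only records the order of its phases."""

    def run(self, ctx: RunContext) -> Dict[str, Any]:
        ctx.trace.append("provision")
        try:
            if ctx.task.get("chaos"):
                # Chaos is optional and would run in the background.
                ctx.trace.append("background chaos")
            ctx.trace.append("agent run")
            output = f"handled {ctx.task['name']}"
            ctx.trace.append("artifact capture")
        finally:
            # Assumption: teardown runs even when the agent phase raises.
            ctx.trace.append("teardown")
        ctx.trace.append("score")
        record = {"task": ctx.task["name"], "output": output, "score": 1.0}
        ctx.trace.append("report")
        return record


ctx = RunContext(task={"name": "rollout", "chaos": True})
record = SketchHarness().run(ctx)
```

After the call, ctx.trace lists the phases in the order given for default.py, and a task without a "chaos" entry simply skips the background chaos step.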
Bugs fixed vs legacy:
- none in this commit; the port is behavior-preserving and inherits the legacy
bugs (port-forward PIPE deadlock, default deployment/namespace mismatch, join
shorter than the verification budget, resource leak on early exception). Those
are addressed in the follow-up fix(harness) commit.
Improvements vs legacy:
- Agent execution is model/provider-agnostic: the agent is resolved from the
AGENTS registry by type (importing the concrete submodule so it self-registers)
instead of the legacy hardcoded cli-vs-api dispatch + provider adapters.
- RunContext (from core) threads task metadata + cluster details across phases.
- Scoring is delegated to devops_bench.metrics.evaluate_metrics_batch and the
ModelLayerJudge (via get_judge_model) rather than reimplementing scoring or the
legacy provider-specific DeepEval judge wrappers.
- Logging goes through core.get_logger; env reads go through core.config helpers.
- Heavy deps (deepeval / mcp / provider SDKs) stay lazy: importing the harness
package pulls none of them.
Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)
Bugs fixed vs legacy:
- Port-forward pipe deadlock: the kubectl port-forward Popen captured stdout and
stderr with PIPE but nothing ever read them, so kubectl could block once its
output buffer filled under sustained chaos load. Both streams now go to DEVNULL.
- Default deployment/namespace mismatch: replace_placeholders() defaulted
TARGET_DEPLOYMENT_NAME/NAMESPACE to hello-app/production while start_scenario()
defaulted them to hypercomputer-d1-frontend/default, so when the env was unset
the agent prompt and the chaos target diverged. Both now read a single shared
pair of module constants (_DEFAULT_TARGET_DEPLOYMENT / _DEFAULT_NAMESPACE).
- Scenario join shorter than the verification budget: _SCENARIO_JOIN_SEC was 90s
while the verification timeout is 120s, so a slow-but-completing verification
was cut off, partial reports were read, and the join raced teardown. The
verification budget is now the public scenario.VERIFICATION_TIMEOUT_SEC and the
join is set above it (budget + 60 = 180s).
- Resource leak on early exception: when a task errored before _drain_scenario
joined the scenario thread, the daemon thread plus its kubectl port-forward and
fortio load kept running across tasks. ScenarioManager.stop() now aborts the
scenario (skips a pending verification) and terminates the port-forward, and
_run_one calls it in its finally block.
Improvements vs legacy:
- none in this commit; behavioral/robustness improvements land in the follow-up
feat(harness) commit.
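The pipe-deadlock and join-budget fixes can be illustrated with a small sketch. It does not launch kubectl: a noisy Python child process stands in for the port-forward, and start_quiet and SCENARIO_JOIN_SEC are hypothetical names used only here. The 120s budget and the +60s margin are the values quoted in the commit message.

```python
import subprocess
import sys
from typing import List

# Values quoted in the commit message; the constant names are illustrative.
VERIFICATION_TIMEOUT_SEC = 120
SCENARIO_JOIN_SEC = VERIFICATION_TIMEOUT_SEC + 60  # join must outlast verification


def start_quiet(cmd: List[str]) -> subprocess.Popen:
    """Start a long-lived helper whose output nobody intends to read.

    With stdout/stderr set to PIPE and no reader, the child blocks once the
    OS pipe buffer fills. Sending both streams to DEVNULL avoids that.
    """
    return subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


# Stand-in for a chatty port-forward: writes about 1 MB to each stream.
noisy_child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('x' * 1_000_000); sys.stderr.write('y' * 1_000_000)",
]

proc = start_quiet(noisy_child)
exit_code = proc.wait(timeout=60)  # completes because nothing backs up in a pipe
```

Deriving the join timeout from the verification budget, rather than keeping two independent literals, is what keeps the join from being shorter than the work it waits for.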
…s poll, spec-driven chaos URL
Modules moved/refactored:
- see base move commit (devops_bench/harness/scenario.py, default.py)
Bugs fixed vs legacy:
- none (the legacy-inherited bugs were addressed in the preceding fix(harness)
commit)
Improvements vs legacy:
- Clear error on an unregistered agent: resolve_agent() raises NotRegisteredError
(naming the resolved key and listing the registered agents) when an imported
agent module fails to register, instead of an opaque TypeError from calling
None(). AGENTS.get already raises on a true miss; this guards the
imported-but-did-not-register case.
- Failed-task result records: a task that errors mid-run no longer drops out of
results.json. _run_one returns a _failed_record() ({status: "failed", error,
score: 0, output: "", plus the task's identifying fields}); successful records
gain a status: "success" marker; run() collects every record; _score() skips
failed records (no agent output to judge) while they are still written out for
downstream parsers.
- Port-forward liveness poll: after the settle delay, _inject_chaos_with_delay
polls the kubectl port-forward and raises a clear RuntimeError if it already
exited (missing deployment, auth error), so chaos load is never pointed at a
dead tunnel.
- Spec-driven chaos target URL (coordinates with the chaos agent reading the
target from the spec): _inject_chaos_with_delay rewrites the action's
target.service_url to the shared local-port constant, and _inject_fault reads
that URL back from the action for the goal text rather than re-hardcoding it, so
the spec is the single source of truth for the target URL.
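The failed-task record behavior can be sketched as below. This is an illustration under assumptions, not the harness code: failed_record, run_one, and score here are simplified stand-ins for _failed_record(), _run_one, and _score(), and task_id is a hypothetical identifying field.

```python
from typing import Any, Callable, Dict, List

Task = Dict[str, Any]
Record = Dict[str, Any]


def failed_record(task: Task, error: Exception) -> Record:
    """Build the placeholder record for a task that raised mid-run."""
    return {
        "task_id": task["task_id"],  # hypothetical identifying field
        "status": "failed",
        "error": str(error),
        "score": 0,
        "output": "",
    }


def run_one(task: Task, agent: Callable[[Task], str]) -> Record:
    """Run one task; an exception becomes a failed record instead of a gap."""
    try:
        output = agent(task)
    except Exception as exc:
        return failed_record(task, exc)
    return {"task_id": task["task_id"], "status": "success", "output": output}


def score(records: List[Record], judge: Callable[[str], float]) -> List[Record]:
    """Score successful records only; failed ones have no output to judge."""
    for record in records:
        if record["status"] == "failed":
            continue
        record["score"] = judge(record["output"])
    return records


def flaky_agent(task: Task) -> str:
    """Demo agent that fails for tasks flagged with 'explode'."""
    if task.get("explode"):
        raise RuntimeError("cluster unreachable")
    return f"fixed {task['task_id']}"


tasks = [{"task_id": "t1"}, {"task_id": "t2", "explode": True}]
records = score([run_one(t, flaky_agent) for t in tasks], judge=lambda output: 1.0)
```

Both tasks appear in records, so a downstream parser sees one entry per task and can tell a failure from a missing run by the status field.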
Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.
Decomposes the legacy orchestrator into devops_bench/harness/ (← pkg/manager/manager.py + the main() loop of pkg/evaluator/evaluate.py).
- scenario.py (ScenarioManager: chaos+verification, port-forward, reports), default.py (phased pipeline), artifacts.py, base.py (Harness ABC + RunContext).
- Agents resolved via the AGENTS registry; scoring delegated to devops_bench.metrics (not reimplemented).
- Unit tests under tests/unit/harness/.
Stacked draft PR: part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.
Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.