Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled) by pradeepvrd · Pull Request #23 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-20T07:55:10Z

Cross-cutting harness refactor — assembled & reviewed

Implements the reconciled landing plan in docs/refactor/e2e-refactor-sequencing-plan.md, migrating the legacy monolith into a strictly layered devops_bench/ package. This branch (refactor/integration) is the assembled result of the reworked component PRs #11–#22, each peer-reviewed (two reviewers) and then covered by two senior reviews.

Diff base: integration/devops-bench-stage1 (the clean Stage-1 foundation: core + tasks + deployers + k8s + models), so this PR shows exactly the Stage-1.5–3 refactor work.

What landed (by wave)

Stage 1.5 — models/loop.py shared run_tool_loop primitive (PR feat(models): run_tool_loop shared turn-loop primitive (Phase 1, D1) #11); agents ⊥ chaos become siblings.
Verification — discriminated-union spec, single-monotonic-deadline runner, typed VerificationResult (PR feat(verification): discriminated-union spec + deadline runner + typed result (Phase 2A/2B) #12); later swapped to registry-driven parsing (PR feat(metrics): registry + run_geval + verifier registry; Phase-0 fixes (Phase 4) #16).
Agents — AgentHarness/AgentConfig/AgentResult + CLI agents (PR feat(agents): AgentHarness base + CLI agents (Phase 3 PR1) #13), ApiAgent on run_tool_loop (PR feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2) #14), capability bindings McpBinding/SkillBinding/AgentRules (PR feat(agents): capability bindings (McpBinding/SkillBinding/AgentRules) (Phase 3 PR3) #17).
Chaos — typed ChaosResult/Fault/Trigger, loop reuse (PR feat(chaos): typed ChaosResult/Fault/Trigger; loop reuse; authoring contract (Phase 3) #15); Phase-4 registry-driven parsing + lazy agent import (PR refactor(chaos): Phase-4 swap — registry-driven ChaosSpec parsing; lazy ChaosAgent import #21).
Metrics — METRICS/MetricScore/MetricContext/run_geval + VERIFIERS registry (PR feat(metrics): registry + run_geval + verifier registry; Phase-0 fixes (Phase 4) #16).
Harness — registry-only wiring, single BENCH_USE_MCP read, typed seams, ResultReporter, optimize-scale/task.yaml native-YAML migration, E2E smoke (PR feat(harness): registry-only wiring + typed seams + reporter + task.yaml migration (Phase 5) #18).
Senior-review fixes — provider-aware token extraction, results.json key symmetry, openclaw orphan-policy unification, capabilities-snapshot consistency, chaos B1/S1, and a chaos test-isolation fix (PRs fix(agents): openclaw orphan tool_result drops to errors, matching API + Gemini policy #20–test(chaos): fix B2 — pre-load bundled concretes before snapshotting registries #22).

Acceptance

654 unit tests pass; ruff check devops_bench/ clean; every test file passes in isolation (no order-dependence).
Strict one-directional layering (core ← leaves ← components ← harness); zero reverse/sibling imports; agents/chaos depend only on models.
Registry-only resolution on all five axes (AGENTS/FAULTS/TRIGGERS/METRICS/VERIFIERS), each with a "drop a decorator, no central edit" acceptance test.
BENCH_USE_MCP read once in the harness, threaded to both agent and judge.
import devops_bench.<pkg> pulls no provider SDK / deepeval / mcp for every package.
results.json schema preserved (D3) via to_entry()/to_dict(), success/failed records key-symmetric.
E2E smoke runs complextasks/optimize-scale through the real wiring against NoOpDeployer (network/SDK stubbed only at the boundary).

Draft for review.

…ons contract Vendors the refactor planning docs (e2e sequencing plan, five component handoffs) plus the distilled CONVENTIONS.md so the refactor work is self-contained on the integration branch.

Add devops_bench/models/loop.py providing the single canonical model-agnostic tool-use loop consumed by both agents/api and chaos/agent. The primitive imports only models.base + core (Layer 1) and lets the caller format tools so it stays ignorant of provider tool-descriptor shapes. Behavior preserved from the duplicated chaos/api loops: - seeds contents with the user goal - per turn appends the assistant message + one tool-result entry per call using the §5 neutral dict shapes (role/content/tool_calls/tool_call_id/name) - retains final_text on EVERY turn so a tool call on the last turn or the turn cap never discards the model's summary - for/else turn cap with a warning, never an infinite loop - accumulates latency across every generate_content call Dispatch errors propagate to the caller (the loop never silently swallows them); each consumer wraps dispatch as it sees fit. Unit tests cover: turn-cap + warning, final-text retention (last-turn tool call + natural exit), latency accumulation, dispatch error surfacing, tools_used tracking, caller-formats-tools pass-through, full conversation shape, max_turns=0 edge case, and that importing the loop pulls no provider SDK. Refs: CONVENTIONS.md §5/§6; e2e-refactor-sequencing-plan.md §3.1/§4.1 (D1); chaos-refactor-handoff.md §5.1.

…d result Phase 2A (refactor: PR #6 rework). Replaces the loose list/dict spec arms with a hand-maintained discriminated union (`SequenceSpec` / `ParallelSpec` + type-tagged leaves), swaps `VerificationResult.details` for typed `children` and `raw` fields, and rewrites `VerifierAgent` around a single monotonic deadline (no per-node budget arithmetic, one clock source). - `verification/spec.py`: `VerificationNode = Annotated[Union, ...]`. Bare leaves are valid whole specs; bare list/dict roots are rejected. The union is hand-maintained at Phase A — structured so a Wave 2 registry-driven `parse_node` can swap the parsing without reshaping the runner's isinstance dispatch. - `verification/base.py`: pydantic `VerificationResult` with `success`, `elapsed_time`, `reason`, optional `name`, `children`, `raw`. `BaseVerifier` gains an optional `name` for result labeling. - `verification/runner.py`: `VerifierAgent.wait_for_condition(spec, timeout_sec=120)` computes one `time.monotonic()` deadline at entry. Parallel children each see the full remaining deadline; sequences are fail-fast (later steps recorded as skipped). - `verification/verifiers/{pod_healthy,scaling_complete}.py`: leaves consume `k8s.kubectl.get_json` / `k8s.conditions.poll_until`; both null-guard `.status` on raw kubectl returns. Result `details=` -> `raw=`, `name` threaded through. Phase 2B authoring contract: - `verification/schema.py`: `json_schema()` emits the pydantic JSON Schema for editor autocomplete; `validate_spec(data)` returns the parsed leaf node and propagates `pydantic.ValidationError` on bad input. - Regression test pins a literal of the migrated `optimize-scale` verification (`type: parallel` with `pod_healthy` + `scaling_complete` children); both leaf `verify` methods are stubbed so it runs without a cluster. Test bar (`tests/unit/verification/`): spec parsing/discrimination; result shape (`details` is gone); deadline semantics (parallel sees full remaining deadline, sequence fail-fast, one monotonic clock); leaf null-status guards on pods AND deployment; bare list/dict spec rejected; lightweight-import guard asserts `import devops_bench.verification` pulls no provider SDK / deepeval / mcp. `ruff check` and `pytest tests/` are green (345 passed). Harness/task-file migration (§7c) is intentionally out of scope — Wave 4 harness PR.

Lays the agents/ Layer-2 component on the post-Stage-1 foundation: the template-method AgentHarness, typed AgentConfig / AgentResult / ToolCall, the AGENTS registry, and rewritten gemini + openclaw CLI harnesses on the new base. The legacy run_cli_agent dispatch, SSH transport, and hardcoded GKE tool list are gone; trajectories come from the official CLI channels. Highlights ---------- - AgentHarness owns latency bookkeeping and a broad safety net so a single agent crash never aborts the benchmark; subclasses implement _execute and return AgentResult. deepeval tracing is wrapped on demand, kept import-time optional. - AgentResult / ToolCall match CONVENTIONS §3 exactly: typed dataclasses, to_dict() boundary shim, AgentResult.errored() classmethod for failures, and a has_errors() helper. The metrics layer reads one canonical trajectory across CLI and (PR2) API agents. - AgentConfig.from_env() is the single env entry point (AGENT_MODEL / PROVIDER / API_KEY / TARGET / TIMEOUT_SEC / ALLOWED_TOOLS); agents stop self-reading the environment per §7. AGENT_ALLOWED_TOOLS is the PR1 interim for what becomes McpBinding.tools in PR3. - GeminiCliAgent invokes `gemini --output-format stream-json --skip-trust`, selects between `--allowed-tools <name>` (one per tool) and `-e=""` extensions-disabled, threads model/key into GEMINI_MODEL/GOOGLE_API_KEY, passes timeout into every subprocess.run, and parses tool_use / tool_result / result events from stdout into the canonical trajectory — no session-file glob, no internal-schema parsing. - OpenClawAgent runs `oc agent --local` through bash (nvm sourced; every interpolated value shlex.quoted), wipes ~/.openclaw/agents/<name>/sessions before each run, picks the resulting session via `oc sessions --agent <name> --json`, and extracts the trajectory via the official `oc sessions export-trajectory --workspace <tmpdir>` bundle per docs/openclaw/sessions rather than the legacy debug-log sessionFile= scrape. SSH transport and the OPENCLAW_LOCAL toggle are removed. - Extraction failures (JSON decode misses, unpaired tool_results, non-zero exit codes, missing export bundles, timeouts) land on AgentResult.errors — never silent-empty. Tests (tests/unit/agents/) ------------------------- - typed-result shape + to_dict() defensive copies + errored() classmethod - AgentConfig.from_env() field mapping and default fallthroughs - abstract base cannot be instantiated; safety-net converts a raising _execute to AgentResult.errored; latency always stamped; a third-party @AGENTS.register("dummy-extension") resolves with no central edit - Gemini stream-json parser: tool_use/tool_result folding, error events, unmatched results, is_error→status, empty input - Gemini agent: argv flips on -e="" / --allowed-tools, env overlay, timeout threaded, SubprocessError + OSError paths, legacy run_cli_agent removed - OpenClaw: model id normalization (gemini→google), shlex-quoted command fragments, session-key selection across list/wrapper payloads, trajectory export parsing, sessions-empty + export-failure + bash-timeout paths, legacy SSH/local runners removed - Import-hygiene test (spawned interpreter): `import devops_bench.agents` pulls no mcp / deepeval / anthropic / google.genai / openai Downstream contract ------------------ AgentHarness.run(prompt) → AgentResult is the single entry point the harness will call after PR2 / capabilities. The API agent driving models.loop.run_tool_loop slots in by overriding _execute and returning the same AgentResult shape — no reshape needed.

PR #11 review (APPROVE-WITH-NITS) follow-up: - Switch latency timing to time.monotonic() (immune to wall-clock adjustments; matches the verification runner's pattern). - Coerce text to str at the source: ``text = client.get_text_content(...) or ""``. §5 says ``content`` must be a ``str``; guards a future adapter returning ``None``. Both ``final_text`` and the assistant message inherit the coerced value, so no further ``or ""`` is needed downstream. - test_loop_import_pulls_no_provider_sdk now runs in a fresh interpreter subprocess (``sys.executable -c ...``). In-process module probing was masked by sibling test_models_* imports of the adapter SDKs. - test_contents_record_user_assistant_and_tool_entries now also asserts the loop forwards function-call dicts by IDENTITY (not by copy) into assistant_message["tool_calls"] — locks the byte-identical-trajectory guarantee for the upcoming agents/api + chaos consumers.

…t; doc + test gaps Addresses peer-review findings on PR #13: BLOCKING — `_execute` was sourcing `AgentResult.output` from `_strip_ansi(completed.stdout)` (the noisy `oc --log-level debug` stream) even when `_read_export_bundle` had read the clean `output.md`/`output.txt`/ `final.txt` from the trajectory bundle. The judge would grade the debug noise. Now `_extract_trajectory` plumbs the bundle output through and `_execute` prefers it, falling back to stripped stdout only when the bundle has no output file. Two new tests cover bundle-preferred and fallback paths. Non-blocking nits, all addressed: 1. `oc models set <id>` now chains with `&&` so a failed model-set aborts the run and surfaces via `completed.returncode` → AgentResult.errors, instead of silently falling through to oc's stored-default model (which would invalidate the benchmark arm). New test asserts `&&` not `;`. 2. Gemini stream-json field-name leniency (`tool_use_id`/`id` and `args`/`input` fallbacks) is now covered by two regression tests so the tolerance can't silently rot. 3. New test asserts `GeminiCliAgent._execute` actually wires `extra_env` (GEMINI_MODEL / GOOGLE_API_KEY / GEMINI_API_KEY / OTEL_SDK_DISABLED) into the `run(...)` call site — not just that `_build_env` returns it in isolation. 4. `AgentResult.to_dict()` docstring corrected: the copies are shallow, not deep — outer containers are fresh but inner ToolCall dicts and other nested values stay aliased. Tests: 367 passed (was 361; +6). ruff clean.

…sub-second leaf floor Review nits from PR #12 (APPROVE-WITH-NITS). - _run_parallel: convert an unexpected leaf exception into a failed child result instead of letting it abort the whole group (catch around ``f.result()``). Pre-fill children with timed-out defaults so a never-completed future stays accounted for without a second materialize pass. - _run_parallel: ``ex.shutdown(wait=False, cancel_futures=True)`` so a deadline-exhausted group does not block on queued workers we no longer want. - _run_leaf: short-circuit when remaining < ``_MIN_LEAF_BUDGET_SECONDS`` (1s) to avoid issuing useless ``kubectl wait --timeout=0.001s`` at the tail of the deadline. - test_runner: add a parallel-with-exhausted-deadline test that uses the REAL ``PodHealthyVerifier`` / ``ScalingCompleteVerifier`` classes (kubectl primitives patched to raise) and asserts all children come back timed-out with ``elapsed_time == 0.0``. Add an unhandled-exception test asserting one bad leaf does not abort the parallel group. pytest: 347 passed (was 345, +2 new). ruff clean.

… (Phase 3 PR2) Rework the open PR #9 ``agents/api`` package onto the AgentHarness template method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared ``models/loop.run_tool_loop`` primitive — agents and chaos are now true siblings, neither carrying its own loop. Concrete changes: * ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)`` with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent), pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per CONVENTIONS §6), and drives ``run_tool_loop``. * A ``_build_dispatch`` closure routes each tool call to either a skill file read or MCP, wrapping every call in its own try/except so one tool failure lands on ``AgentResult.errors`` and the trajectory (``status="error"``) rather than aborting the run (``run_tool_loop`` propagates dispatch errors by design). * ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output`` is ``LoopResult.final_text``; tokens/latency/tools_used flow through unchanged (semantically byte-identical to the OLD api loop). * **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill discovery and ``config.target`` drives MCP — an agent may have one, both, or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone. * The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s public surface (Rules arrive in PR3). * ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates ``skills_paths`` into ``SkillBinding.paths``. Tests (``tests/unit/agents/api/``): * fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔ ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``); * a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run does not crash; * skills load on the MCP-off path (skills ⊥ MCP); * an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env`` reads anywhere under ``agents/api/``; * ``import devops_bench.agents.api`` and ``...api.agent`` pull no ``mcp``/``deepeval``/SDK at module top. Quality bar: ``uv run ruff check devops_bench/ tests/`` clean; ``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).

…ontract Phase 3 chaos refactor (rework PR #7). Brings the chaos package onto the shared model-agnostic loop and the same discriminated-union typed-node idiom verification uses, so chaos and agents become true Layer-2 siblings. Changes - New ``ChaosResult`` pydantic model (success / injected_fault / output / elapsed_time / error) replacing the legacy ``{status, output}`` dict. - ``Fault`` and ``Trigger`` reshaped as ``type``-tagged pydantic models in ``chaos/base.py``; FAULTS / TRIGGERS registries unchanged. - ``chaos/spec.py`` adds the ``ChaosSpec`` discriminated union (``ChaosAction`` / ``ChaosTrigger``). ``verify`` is a plain string key — chaos never imports a verification node. The legacy ``verification`` field name is accepted as an alias so the real ``optimize-scale/task.yaml`` parses unchanged ahead of the Phase-B task-file migration. - ``ChaosAgent`` reworked to drive ``models.loop.run_tool_loop``; fortio specifics (system prompt builder, RUN_COMMAND_TOOL, run_chaos_command, load target) move into ``faults/generate_load.py`` along with the typed ``GenerateLoadFault`` / ``LoadTarget`` nodes. The agent is now fault-agnostic — system instruction, tool descriptor, and command handler are injected by the concrete fault. - New ``triggers/time_delay.py`` houses the ``TimeTrigger`` previously inlined in ``harness/scenario.py``. - Slim ``chaos/__init__.py`` exports only ``Fault``, ``Trigger``, ``ChaosResult``, ``FAULTS``, ``TRIGGERS``, ``ChaosSpec``; no eager import of ``ChaosAgent``. - New ``chaos/schema.py`` emits JSON Schema and validates a single chaos entry (the Phase-A authoring contract). Tests (35 new under ``tests/unit/chaos/``) - Spec discrimination: minimal / named / verify-reference / legacy alias / bare-list rejection / extra-forbid / unknown discriminator / registry membership. - ``ChaosResult`` shape + round-trip; ``Fault`` / ``Trigger`` abstractness. - ``ChaosAgent`` against a fake ``LLMClient`` (no SDK / no network): no-tool turn, tool dispatch + message-shape pinning, non-dict args, unknown tool, final-text retention across the turn cap. - ``GenerateLoadFault.inject`` with mocked subprocess + stubbed agent: success path, exception → ``success=False`` with error, event threaded through. - ``TimeTrigger`` zero vs positive delay (``time.sleep`` patched). - Regression: the REAL ``complextasks/optimize-scale/task.yaml`` chaos entry parses through ``ChaosSpec`` and discriminates to ``GenerateLoadFault`` / ``TimeTrigger`` with the documented ``target.service_url`` / ``qps`` / ``delay_seconds``. - Import hygiene: ``import devops_bench.chaos`` pulls no provider SDK, ``mcp``, ``deepeval``, or ``ollama``; ``ChaosAgent`` is not exported. Bar - ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean. - Full ``uv run pytest tests/ -q`` green (456 passed; +35 chaos tests on top of the 421-test baseline).

…s (Phase 4) Land the metrics extensibility refactor and the verifier registry swap in one drop, building on the Wave-1 verification union with no changes to the runner or the leaf-verifier logic. Metrics: - New `metrics/base.py` carrying `METRICS` (`Registry[type[MetricEvaluator]]`), the `MetricScore` dataclass with a D3-preserving `to_entry()`, the `MetricContext` that threads `use_mcp` from the harness (CONVENTIONS.md §7), and the single shared `run_geval` recorder. - Phase-0 fixes folded in: `CHECKLIST_THRESHOLD` constant decouples the checklist cutoff from the tool-invocation threshold, dead grounding branches are removed, the ` [GEval]` strip lives in exactly one place (`run_geval`), and `extract_checklist_items` is exported from the package facade. - Each metric family (outcome_validity, tool_invocation, checklist, grounding, chaos) is now a registered `@METRICS.register(...)` evaluator; `evaluate_metrics_batch` collapses to a registry loop, importing the builtin metric modules at call time so the package import stays free of `deepeval` (CONVENTIONS.md §8). - The `use_mcp` boolean is threaded through `evaluate_metrics_batch(..., *, use_mcp=...)` and into `MetricContext`; metrics no longer self-read `BENCH_USE_MCP`. The kwarg falls back to the env var so the harness can flip its caller in a follow-up without breaking the rest of the tree. Verifier registry: - New `verification/registry.py` carrying `VERIFIERS`, populated by `@VERIFIERS.register("...")` decorators on the existing leaves (pod_healthy, scaling_complete) and on `SequenceSpec` / `ParallelSpec`. - `verification/spec.py` swaps the static `Annotated[Union, Field( discriminator="type")]` for a `parse_node()` helper invoked from a `model_validator(mode="before")` on `VerificationSpec` (and on each compound spec, for `checks` recursion). Lookup misses surface as `pydantic.ValidationError` so callers see the same error surface. - Touch on `verification/runner.py`, `verification/base.py`, and the leaf verifier logic is zero — the only change to leaves is the decorator. The existing verification regression + runner tests stay green unchanged. Tests under `tests/unit/metrics/` cover: checklist parsing, the registry loop, MCP gating via `MetricContext`, the legacy `to_entry()` shape (bare value for rates, dict for judged), the single-place ` [GEval]` strip, chaos/grounding/skill loading, the dummy-metric extension-axis acceptance, and a fresh-interpreter assert that `import devops_bench.metrics` pulls no `deepeval`/`mcp`/provider SDK. The verifier extension-axis test lives next to its peers under `tests/unit/verification/`.

…; align __init__ docstring CONVENTIONS §9: dummy ``@FAULTS.register("dummy")`` / ``@TRIGGERS.register("dummy")`` resolve via ``REGISTRY.get(key)`` with no edit to ``chaos/spec.py``. Comment calls out Phase A's split: registry drives harness resolution; ``ChaosSpec`` parsing still requires a Union edit until Phase 4's registry-driven swap. Also: pin ``args={}`` (missing ``command`` key) dispatch returning an ``"Error: ..."`` string (no crash); align ``chaos/__init__.py`` docstring with the import-hygiene test it claims to satisfy — concretes load with the package (required for the discriminated union), but provider SDKs / fortio / ``deepeval`` / ``mcp`` stay strictly lazy.

…el; dedupe compound _parse_children Wave-2 review nits on PR #16: - Pipeline ordering (D3): `evaluate_metrics_batch` no longer iterates `METRICS.values()` (which is import-order-sensitive across test sessions and third-party registrations). Iteration now walks a pinned `_BUILTIN_METRIC_ORDER` tuple of `(module, key)` pairs (outcome -> tool -> checklist -> grounding -> chaos), then appends any third-party keys in registry order. New `test_batch_score_insertion_order_matches_legacy_results_json` pins the on-disk schema. - `parse_node` no longer bypasses every `BaseModel`. It now only accepts already-parsed instances whose concrete class is in `VERIFIERS`; an unregistered model (including as a `checks` child of a compound node) raises the same `ValidationError` surface as an unknown `type` string. Two new tests in `test_verifier_registry.py` cover the standalone and compound-child cases. - The two byte-identical `_parse_children` validators on `SequenceSpec` and `ParallelSpec` collapse to one module-level `_parse_compound_children` helper attached via `model_validator(mode="before")(...)`. Tests: 476 passed (was 473; +1 ordering, +2 unregistered-BaseModel guard). Ruff: clean.

Wave 2 review of PR #14 found a blocking bug + three non-blocking nits. BLOCKING — `extract_tokens` was Google-only Read only ``prompt_token_count`` / ``candidates_token_count`` / ``total_token_count`` so Anthropic (``usage.input_tokens`` / ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` / ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens, corrupting token metrics in results.json for every non-Google provider. Fix: detect which usage shape is present and map each provider's source fields onto the legacy on-disk key scheme (D3 results.json stability). A new ``_first_int_attr`` helper looks up the first non-None attribute name in order. Anthropic emits no aggregated total, so the helper computes ``prompt + candidates`` when source ``total`` is absent (with a non-zero prompt/candidates) so non-empty runs never log ``total_tokens: 0``. Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute`` asserting ``AgentResult.tokens`` is non-zero and correctly mapped. NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript`` Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so ``os.environ["FOO"]`` patterns can't be bypassed even if the inner Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly. NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully The legacy ``from deepeval.tracing import observe`` was unconditional, so a host without deepeval crashed with ImportError on the first tool call. Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall through to the bare ``call_tool`` on failure. Tested by monkey-patching ``builtins.__import__`` to raise on ``deepeval`` and confirming the call still completes. NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory`` An unpaired ``{role: tool}`` result (no matching assistant call_id) is now intentionally dropped from the canonical trajectory (preserving the "every trajectory item is a real ToolCall the model issued" invariant metrics relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the new ``_fold_with_extraction_errors`` helper. Matches the CLI agents' "no silent-empty" policy. The drop is also debug-logged. The decision is documented in the function's docstring. Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean; ``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).

…) (Phase 3 PR3) Introduce devops_bench/agents/capabilities/ — the binding vocabulary the orchestrator threads through AgentConfig.capabilities. Three orthogonal axes: McpBinding (server command + tool names), SkillBinding (local SKILL.md paths), AgentRules (operator brief text). Each binding ships with a runtime_checkable Protocol (SupportsMcp / SupportsSkills / SupportsRules) plus a trivial mixin so concrete agents declare *only* what they can actually consume — capability negotiation refuses bindings an agent would silently ignore. AgentConfig migrates from the interim PR1/PR2 fields (target/allowed_tools/ skills_paths) to a single capabilities aggregate; the legacy AGENT_TARGET env still maps to the CLI binary path, while the new AGENT_MCP_SERVER / AGENT_ALLOWED_TOOLS / AGENT_SKILLS_PATHS / AGENT_RULES_TEXT env vars build the bindings. The harness will skip the env path entirely (it constructs AgentCapabilities directly from a benchmark catalog). Each agent now consumes the bindings: - ApiAgent (McpMixin + SkillsMixin + RulesMixin): MCP gate = capabilities.mcp with a non-empty command; skills gate = capabilities.skills.paths; rules ride on the provider's system_instruction. Skills ⊥ MCP independence preserved. - GeminiCliAgent (McpMixin + RulesMixin): --allowed-tools comes from capabilities.allowed_tools (aggregated across bound servers). - OpenClawAgent (RulesMixin only): the installed oc build exposes no in-agent MCP or skills wiring. Tests: new tests/unit/agents/test_agents_capabilities.py covers binding construction, Protocol membership, and the aggregate's accessors; updated config / api / cli tests assert each agent reads the bindings; a new AST- walking test forbids GKE-specific runtime literals anywhere in agents/; an import-hygiene test pins capabilities to stdlib-only. The existing BENCH_USE_MCP / os.environ AST guard still passes. 600/600 tests green; ruff clean on devops_bench/ + tests/unit/agents/.

… shlex.join MCP path Wave 3 review surfaced both CLI agents advertising SupportsRules without actually delivering the bound rules text — the mixin attribute mirror was the entire delivery story. Fix: - **Gemini** (cli/gemini.py): _execute now runs the binary inside a per-run tempfile.TemporaryDirectory and writes capabilities.rules.text to GEMINI.md there before invocation, then passes cwd=tmpdir to core.subprocess.run. The CLI auto-loads GEMINI.md from its working dir, which is the binary's native startup-context channel. Temp dir is cleaned up after _execute returns. - **OpenClaw** (cli/openclaw.py): new _prepend_rules helper splices the rules text ahead of the prompt (separated by a blank line) before _build_local_command interpolates it into the oc agent -m '<prompt>' segment. oc has no dedicated rules flag, so prompt-prepending is the reliable, binary-agnostic delivery channel. Tests assert the bindings actually reach the subprocess invocation: - tests/unit/agents/test_agents_cli_gemini.py snapshots the cwd handed to run() and reads GEMINI.md from inside the fake run (before TemporaryDirectory cleanup) — proves the binary would see the file on startup. Also asserts no GEMINI.md is written for empty rules, and the temp dir is cleaned up. - tests/unit/agents/test_agents_cli_openclaw.py captures the bash command string and asserts the rules text + prompt both appear with rules ahead of the prompt. Helper _prepend_rules has unit tests for the empty / non- empty branches. Non-blocking nits from the review, addressed in the same commit: - api/agent.py: shlex.join(mcp_binding.command) replaces the lossy " ".join so a spaced argv token (("uv run", "mcp-server")) round-trips MCPClient's shlex.split intact. New test test_execute_preserves_spaced_command_token_through_shlex_roundtrip verifies the round-trip. - capabilities/aggregate.py: AgentCapabilities.mcp docstring now warns callers that ApiAgent honors only the first binding (multi-server fan- out is not implemented in PR3) — Wave 4 will see the gotcha when wiring. - tests/unit/agents/api/test_agents_api_agent.py: renamed test_execute_runs_with_no_tools_when_target_unset → ..._when_capabilities_default to match the post-PR3 gate. - config.py: tightened the inline comment explaining why from_env uses name="default" on the synthesized McpBinding. Side effect: the prior test_execute_returns_errored_on_empty_mcp_server_string relied on " ".join collapsing a whitespace-only command. With shlex.join that path is no longer reachable through a binding, so the test was rewritten to patch MCPClient directly — it still proves the agent converts MCPClient.__aenter__ ValueErrors to AgentResult.errored() rather than letting them bubble through the base safety net. ruff clean (devops_bench/ + tests/unit/agents/). pytest tests/ -q → 608 passed.

…aml migration (Phase 5) Phase 5 (D4 option b): the harness is written once, born consuming the typed Layer-2 contracts every component now exposes — no behavior-preserving port plus retrofit churn. The orchestrator is the only place that wires components, and it does so by consuming registries. Harness package (new): - ``harness/base.py`` — ``Harness`` ABC: ``run(list[Task]) -> list[dict]``. - ``harness/default.py`` — ``DefaultHarness``: registry-only agent resolution (no ``_AGENT_MODULES`` table; builtin agent modules imported once at call time so ``@AGENTS.register`` fires, then pure ``AGENTS.get(key)``); single ``BENCH_USE_MCP`` read at construction, threaded into both the built ``AgentConfig.capabilities`` and the ``evaluate_metrics_batch(use_mcp=...)`` call; explicit ``AgentConfig`` + ``AgentCapabilities`` (MCP / skills / rules) built from the env-layer bindings with MCP gated by the harness-owned boolean; opaque task blobs parsed into typed ``ChaosSpec`` / ``VerificationSpec`` at the harness boundary; ``RunContext.workspace_path`` rooted artifact diff (no hardcoded cwd); ``capabilities_granted`` snapshot on the result. - ``harness/scenario.py`` — runs ``trigger.wait(ctx)`` then ``action.inject(ctx, event)`` on the typed ``ChaosSpec``; resolves the opaque ``verify:`` key against a name-keyed verification mapping the harness builds (chaos never imports verification); ``VerifierAgent().wait_for_condition(node, timeout_sec=120)`` for typed ``VerificationResult``; ``skip_port_forward=True`` lets the E2E smoke exercise the wiring without ``kubectl``. - ``harness/reporter.py`` — ``ResultReporter`` writes ``results.json``; the engine depends on the sink so output stores become a substitution. - ``harness/artifacts.py`` — workspace-aware ``snapshot_dir`` / ``collect_generated_files``. - ``harness/__init__.py`` — lazy facade; ``import devops_bench.harness`` pulls no SDK / deepeval / mcp / concrete agent. results.json (Decision D3 — schema preserved): - Every typed value routed through ``AgentResult.to_dict()`` / ``ChaosResult.model_dump()`` / ``VerificationResult.model_dump()`` / ``MetricScore.to_entry()``; the legacy mixed shape is unchanged. - Failed records keep ``status: "failed"`` + ``score: 0`` + ``error`` and carry the same identifying keys as success records. - New additive key: ``capabilities_granted = {"use_mcp", "skills"}`` so metrics / dashboards can read what the harness granted instead of re-reading env (CONVENTIONS.md §7 closure step). task.yaml migration (Decision D2 — keep field names, migrate values): - ``complextasks/optimize-scale/task.yaml``: ``chaos_spec`` / ``verification_spec`` migrated from JSON-in-YAML strings to native YAML. - Field names unchanged so gke-labs#89's loader + schema continue to parse unmodified. The harness accepts both shapes (legacy strings round-trip through ``json.loads``) so any fork on the old shape lands the harness PR without touching its task files. - Cross-reference: chaos ``verify: "Planned Load Spike Verification"`` now resolves through the name-keyed mapping the harness builds from ``verification_spec``. - The chaos optimize-scale regression test now exercises the native YAML shape end-to-end. Tests (24 new under ``tests/unit/harness/`` + 1 in chaos): - ``test_registry_resolution.py`` — dummy ``@AGENTS.register("dummy")`` resolves with no harness edit; legacy ``cli`` / ``binary`` alias normalizes to the canonical key; unknown key raises ``NotRegisteredError``. - ``test_single_env_read.py`` — ``BENCH_USE_MCP`` is read once at construction; same boolean reaches the agent's capability gate and the metrics scoring call; grep-style assertion that only one executable read site exists in the package. - ``test_scenario.py`` — fake fault + trigger + verifier exercise the seam wiring (``trigger.wait`` -> ``action.inject``; mapping lookup; rewritten local URL; typed ``ChaosResult`` failure path); ``subprocess.Popen`` poisoned to guard against accidental real ``kubectl`` calls. - ``test_reporter.py`` — golden schema for ``results.json`` (preserved legacy key set + additive ``capabilities_granted``). - ``test_spec_parsing.py`` — the real optimize-scale blob parses into typed ``ChaosSpec`` / ``VerificationSpec``; legacy string shape still round-trips; loader carries native-YAML values through verbatim. - ``test_package_import.py`` — bare ``import devops_bench.harness`` pulls no SDK / deepeval / mcp. - ``test_e2e_smoke.py`` — full ``complextasks/optimize-scale`` flows through ``DefaultHarness.run`` against ``NoOpDeployer`` with stubbed agent + judge + faked trigger/fault/verifier but the REAL wiring; asserts exactly one ``results.json`` with the preserved schema, populated trajectory, and both ``chaos_report`` + ``perf_report``. Quality bar: - ``uv run ruff check devops_bench/ tests/unit/harness/`` clean. - ``uv run pytest tests/ -q``: 633 passed (608 baseline + 25 new).

BLOCKING fix (D3 symmetric records): - Success and failed records now carry the SAME top-level key set (DefaultHarness._RECORD_KEYS); a downstream parser iterating one shape can never KeyError crossing into the other. Previously the success path emitted ``errors`` + ``scores`` while the failed path emitted ``error`` + ``score`` + NO ``scores``. - New shared seed ``_empty_record(task)`` populates the symmetric union; ``_build_success_record`` / renamed ``_build_failed_record`` overlay the differing values. Both records now carry ``error``, ``errors``, ``score``, ``scores``, ``verification_parse_errors``, and the existing capabilities/chaos/perf/etc fields. - Golden tests rewritten to invoke the real builders on a stub AgentResult and assert ``set(record.keys()) == required_keys`` on the actual output (was: hand-rolled dict that let the builders drift silently). New ``test_legacy_failed_record_keys_are_emitted_verbatim`` mirrors the success golden test; a direct ``test_success_and_failed_records_have_identical_top_level_keys`` pins the symmetry invariant. Required non-blocking nits: 1. Single-read regex extended to ``os.getenv`` / ``first_env`` / ``os.environ.get`` / ``os.environ[...]`` / bare ``getenv|environ[`` so a sneaked-in second BENCH_USE_MCP read fails CI rather than passing review. 2. Verification authoring loud-fail: - ``_build_verification_mapping`` now returns ``(mapping, errors)`` and accumulates every authoring failure (non-mapping entry, missing ``name``, JSON parse failure, model validation failure) so it lands on the run record as ``verification_parse_errors`` — operator sees the broken entry on results.json, not just in the log. - ``ScenarioManager._resolve_verification`` writes a failure record into ``chaos_report["verification"]`` (with ``unresolved_reference`` / ``known_references``) when a chaos ``verify:`` key doesn't resolve, so a typo'd cross-reference is visible on the run record. - Canonical ``[{name, spec: <typed-node>}]`` shape documented in the docstring; bare-node compat path retained with a regression test that pins the two-shape acceptance explicitly. 3. Port-forward early-exit reap: when the Popen child has already exited at ``poll()`` time, ``wait()`` is called before raising so the process is reaped (not zombied). 4. ``_granted_skill_paths`` snapshot captured once at ``__init__``; per-record builders read the cached tuple instead of re-calling ``AgentConfig.from_env()``. Removes the env-drift surface. 5. ``_drain_scenario`` checks ``thread.is_alive()`` after join; warns and stamps ``chaos_report["status"]="timed_out"`` so partial results are flagged on the record. 6. ``_ensure_builtin_agents_registered`` now catches ONLY ``ImportError`` / ``MissingDependencyError``; a real bug (``SyntaxError``, etc.) in an agent module re-raises so it is not invisible behind a debug log. 7. E2E smoke ``_CHAOS_ACTIVE_WAIT_SEC`` monkeypatch now references the live module attribute and asserts the patched value (was a no-op assertion on the test's frozen ``from ... import`` alias). OPTIONAL (also landed): - ``_DEFAULT_TARGET_DEPLOYMENT`` / ``_DEFAULT_NAMESPACE`` are now ``DefaultHarness`` constructor args (current literals as defaults) so a non-Hypercompute embedder can override without monkey-patching the module constant. Tests: - 13 new tests under tests/unit/harness/ (4 reporter goldens, 5 default-harness internals, 4 verification authoring path tests). - ``uv run pytest tests/ -q``: 646 passed (was 633; +13 new). - ``uv run ruff check devops_bench/ tests/unit/``: clean.

…eporter + task.yaml migration (PR #18)

…ted matches the agent's actual config Resolves the consistency gap a senior reviewer flagged on PR #18: the harness snapshotted ``use_mcp`` and ``_granted_skill_paths`` at ``__init__``, but ``build_agent_config()`` re-read ``AgentConfig.from_env()`` per agent run, so a mid-batch ``AGENT_*`` env mutation desynced what the agent ran with from what the record claimed it ran with. Fix: build the gated ``AgentConfig`` ONCE in ``__init__`` via the new ``_build_agent_config_snapshot()`` helper and store it on ``self._agent_config``. ``build_agent_config()`` becomes a pure accessor that returns the same object identity across calls; ``_granted_skill_paths`` becomes a derived property reading off the snapshot. ``capabilities_granted`` on every record is therefore the SAME source the agent was constructed from — structurally impossible to desync. Tests (tests/unit/harness/test_default_harness.py): - ``test_build_agent_config_returns_identical_snapshot_across_calls`` pins ``a is b`` over back-to-back calls (no per-call rebuild). - ``test_capabilities_granted_matches_agent_config_even_after_env_mutation`` mutates ``AGENT_SKILLS_PATHS`` / ``BENCH_USE_MCP`` / ``AGENT_MCP_SERVER`` AFTER construction and asserts the record + agent agree. Verified the new tests catch the regression by temporarily reverting the production fix: both new tests fail, the identity check catches the per-call rebuild and the desync check catches the env-mutation leak. Bar: ``uv run ruff check devops_bench/ tests/unit/harness/`` clean; ``uv run pytest tests/ -q``: 648 passed (was 646 on base; +2 new).

…I + Gemini policy `parse_trajectory_export` used to synthesize a free-floating ToolCall entry for an unpaired `tool_result` while also surfacing it on `errors`. The API agent's `_fold_with_extraction_errors` and the Gemini stream-json parser both DROP the orphan (errors-only), so the two CLI agents disagreed with the API agent at the metrics seam — only OpenClaw would feed metrics synthetic entries that no real tool_call backed. Make OpenClaw mirror the API + Gemini policy: drop the orphan from the trajectory, surface it on `AgentResult.errors` with a content preview so nothing is silent-empty. Comment quotes the API agent's invariant verbatim ("every trajectory item is a real ToolCall the model issued") so the rule is uniform and rediscoverable across all three agents. Updates the docstring and replaces the test (now asserts orphan ABSENT from trajectory AND PRESENT in errors, with the id surfaced). Tests: 646 passed. ruff clean.

…zy ChaosAgent import Completes the chaos Phase-A → Phase-4 swap from CONVENTIONS §4, fixing senior review's B1 (chaos import drags in the agent + models chain) and S1 (adding a fault still needs a central edit to ``chaos/spec.py``). Changes - New ``chaos/registry.py`` hosts ``FAULTS`` / ``TRIGGERS`` (mirrors ``verification/registry.py``), with ``entry_point_group`` for external packages. ``chaos/base.py`` re-exports both for backward compatibility. - ``chaos/spec.py`` drops the hand-maintained ``Annotated[Union, Field(discriminator="type")]``. ``ChaosSpec`` now carries a ``model_validator(mode="before")`` that routes ``trigger`` / ``action`` payloads through ``parse_trigger`` / ``parse_fault`` — registry-driven resolution with ``ValidationError`` listing registered keys on miss. The parser lazy-imports the bundled fault/trigger modules on first parse, so the spec module itself depends only on ``core`` and ``pydantic``. - ``GenerateLoadFault.inject`` lazy-imports ``ChaosAgent`` so loading the fault module for registry registration does not pull ``devops_bench.chaos.agent`` or the ``devops_bench.models`` chain. - ``chaos/__init__.py`` docstring rewritten to the Phase-4 reality: importing the package now pulls only ``chaos.{__init__, base, registry, spec}`` and ``core.*`` — concretes load lazily on first parse; agent + models load only when ``Fault.inject`` runs. Tests - Stronger import-hygiene tests assert that ``devops_bench.chaos.agent``, ``devops_bench.models.loop``, and ``devops_bench.models.base`` are absent from ``sys.modules`` after both ``import devops_bench.chaos`` and after ``ChaosSpec.model_validate(...)`` runs. - Extension-axis tests upgraded: a dummy ``@FAULTS.register("dummy")`` / ``@TRIGGERS.register("dummy")`` now parses through ``ChaosSpec`` with no edit to ``chaos/spec.py`` (the Phase-4 bar). Plus an error-message test proving the unknown-type ValidationError enumerates the registered keys. - ``test_generate_load`` re-targets the patch at ``devops_bench.chaos.agent.ChaosAgent`` (lazy import inside ``inject``). - ``test_spec`` swaps the JSON-schema assertion to reflect Phase-4 (the schema no longer enumerates concrete types — that's what lets a new fault land without a schema edit) and adds a missing-type-discriminator case. Bar: ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean; ``uv run pytest tests/ -q`` → 652 passed.

… — closes B1/S1 (PR #21)

…agent (PR #20)

…l config

…ture Fixes B2: `tests/unit/chaos/test_extension_axis.py` failed in isolation because the fixture snapshotted `FAULTS`/`TRIGGERS` *before* Phase 4's lazy `_ensure_concretes_loaded()` populated them; teardown then removed the lazily-added bundled keys, leaving subsequent tests staring at `registered: []`. Calling `_ensure_concretes_loaded()` at fixture entry — before the snapshot — keeps the real baseline intact so teardown restores to the bundled set, not an empty one. Only this file uses the snapshot pattern; nothing else needed updating.

pradeepvrd · 2026-06-20T21:31:05Z

Superseded by the decomposed stacked PRs #26→#33 (reproduce this tree exactly; §7 lossless diff empty). Keeping the refactor/integration branch for now.

pradeepvrd added 30 commits June 19, 2026 22:10

docs(refactor): add cross-cutting plan, handoffs, and shared conventi…

e6e733e

…ons contract Vendors the refactor planning docs (e2e sequencing plan, five component handoffs) plus the distilled CONVENTIONS.md so the refactor work is self-contained on the integration branch.

merge(wave1): models/loop.py shared turn-loop (PR #11)

0c5c5e6

merge(wave1): verification redesign (PR #12)

305f62a

merge(wave1): agents base + CLI (PR #13)

ca1d98d

merge(wave2): agents/api on run_tool_loop (PR #14)

42ec015

merge(wave2): chaos refactor (PR #15)

378dee0

merge(wave2): metrics + verifier registry (PR #16)

274fbb9

merge(wave3): agents capability bindings (PR #17)

253ee02

merge(wave4): harness convergence — registry wiring + typed seams + r…

7eafe98

…eporter + task.yaml migration (PR #18)

merge(senior-fix): chaos Phase-4 registry parsing + lazy agent import…

1e01fa2

… — closes B1/S1 (PR #21)

merge(senior-fix): unify openclaw orphan tool-result policy with API …

089bb5e

…agent (PR #20)

pradeepvrd added 3 commits June 20, 2026 00:47

merge(senior-fix): record capabilities_granted from the agent's actua…

54aa391

…l config

merge(senior-fix): fix chaos extension-axis test isolation (B2) (PR #22)

592c233

docs(refactor): add upstream PR cut plan (8 stacked PRs, fixes folded)

4d18986

pradeepvrd closed this Jun 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23
pradeepvrd wants to merge 34 commits into
integration/devops-bench-stage1from
refactor/integration

pradeepvrd commented Jun 20, 2026

Uh oh!

pradeepvrd commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pradeepvrd commented Jun 20, 2026

Cross-cutting harness refactor — assembled & reviewed

What landed (by wave)

Acceptance

Uh oh!

pradeepvrd commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant