Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23
Closed
pradeepvrd wants to merge 34 commits into
Closed
Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23pradeepvrd wants to merge 34 commits into
pradeepvrd wants to merge 34 commits into
Conversation
…ons contract Vendors the refactor planning docs (e2e sequencing plan, five component handoffs) plus the distilled CONVENTIONS.md so the refactor work is self-contained on the integration branch.
Add devops_bench/models/loop.py providing the single canonical model-agnostic tool-use loop consumed by both agents/api and chaos/agent. The primitive imports only models.base + core (Layer 1) and lets the caller format tools so it stays ignorant of provider tool-descriptor shapes. Behavior preserved from the duplicated chaos/api loops: - seeds contents with the user goal - per turn appends the assistant message + one tool-result entry per call using the §5 neutral dict shapes (role/content/tool_calls/tool_call_id/name) - retains final_text on EVERY turn so a tool call on the last turn or the turn cap never discards the model's summary - for/else turn cap with a warning, never an infinite loop - accumulates latency across every generate_content call Dispatch errors propagate to the caller (the loop never silently swallows them); each consumer wraps dispatch as it sees fit. Unit tests cover: turn-cap + warning, final-text retention (last-turn tool call + natural exit), latency accumulation, dispatch error surfacing, tools_used tracking, caller-formats-tools pass-through, full conversation shape, max_turns=0 edge case, and that importing the loop pulls no provider SDK. Refs: CONVENTIONS.md §5/§6; e2e-refactor-sequencing-plan.md §3.1/§4.1 (D1); chaos-refactor-handoff.md §5.1.
…d result Phase 2A (refactor: PR #6 rework). Replaces the loose list/dict spec arms with a hand-maintained discriminated union (`SequenceSpec` / `ParallelSpec` + type-tagged leaves), swaps `VerificationResult.details` for typed `children` and `raw` fields, and rewrites `VerifierAgent` around a single monotonic deadline (no per-node budget arithmetic, one clock source). - `verification/spec.py`: `VerificationNode = Annotated[Union, ...]`. Bare leaves are valid whole specs; bare list/dict roots are rejected. The union is hand-maintained at Phase A — structured so a Wave 2 registry-driven `parse_node` can swap the parsing without reshaping the runner's isinstance dispatch. - `verification/base.py`: pydantic `VerificationResult` with `success`, `elapsed_time`, `reason`, optional `name`, `children`, `raw`. `BaseVerifier` gains an optional `name` for result labeling. - `verification/runner.py`: `VerifierAgent.wait_for_condition(spec, timeout_sec=120)` computes one `time.monotonic()` deadline at entry. Parallel children each see the full remaining deadline; sequences are fail-fast (later steps recorded as skipped). - `verification/verifiers/{pod_healthy,scaling_complete}.py`: leaves consume `k8s.kubectl.get_json` / `k8s.conditions.poll_until`; both null-guard `.status` on raw kubectl returns. Result `details=` -> `raw=`, `name` threaded through. Phase 2B authoring contract: - `verification/schema.py`: `json_schema()` emits the pydantic JSON Schema for editor autocomplete; `validate_spec(data)` returns the parsed leaf node and propagates `pydantic.ValidationError` on bad input. - Regression test pins a literal of the migrated `optimize-scale` verification (`type: parallel` with `pod_healthy` + `scaling_complete` children); both leaf `verify` methods are stubbed so it runs without a cluster. Test bar (`tests/unit/verification/`): spec parsing/discrimination; result shape (`details` is gone); deadline semantics (parallel sees full remaining deadline, sequence fail-fast, one monotonic clock); leaf null-status guards on pods AND deployment; bare list/dict spec rejected; lightweight-import guard asserts `import devops_bench.verification` pulls no provider SDK / deepeval / mcp. `ruff check` and `pytest tests/` are green (345 passed). Harness/task-file migration (§7c) is intentionally out of scope — Wave 4 harness PR.
Lays the agents/ Layer-2 component on the post-Stage-1 foundation: the
template-method AgentHarness, typed AgentConfig / AgentResult / ToolCall,
the AGENTS registry, and rewritten gemini + openclaw CLI harnesses on the
new base. The legacy run_cli_agent dispatch, SSH transport, and hardcoded
GKE tool list are gone; trajectories come from the official CLI channels.
Highlights
----------
- AgentHarness owns latency bookkeeping and a broad safety net so a single
agent crash never aborts the benchmark; subclasses implement _execute and
return AgentResult. deepeval tracing is wrapped on demand, kept import-time
optional.
- AgentResult / ToolCall match CONVENTIONS §3 exactly: typed dataclasses,
to_dict() boundary shim, AgentResult.errored() classmethod for failures,
and a has_errors() helper. The metrics layer reads one canonical
trajectory across CLI and (PR2) API agents.
- AgentConfig.from_env() is the single env entry point (AGENT_MODEL /
PROVIDER / API_KEY / TARGET / TIMEOUT_SEC / ALLOWED_TOOLS); agents stop
self-reading the environment per §7. AGENT_ALLOWED_TOOLS is the PR1
interim for what becomes McpBinding.tools in PR3.
- GeminiCliAgent invokes `gemini --output-format stream-json --skip-trust`,
selects between `--allowed-tools <name>` (one per tool) and `-e=""`
extensions-disabled, threads model/key into GEMINI_MODEL/GOOGLE_API_KEY,
passes timeout into every subprocess.run, and parses tool_use /
tool_result / result events from stdout into the canonical trajectory —
no session-file glob, no internal-schema parsing.
- OpenClawAgent runs `oc agent --local` through bash (nvm sourced; every
interpolated value shlex.quoted), wipes ~/.openclaw/agents/<name>/sessions
before each run, picks the resulting session via `oc sessions --agent
<name> --json`, and extracts the trajectory via the official `oc sessions
export-trajectory --workspace <tmpdir>` bundle per docs/openclaw/sessions
rather than the legacy debug-log sessionFile= scrape. SSH transport and
the OPENCLAW_LOCAL toggle are removed.
- Extraction failures (JSON decode misses, unpaired tool_results, non-zero
exit codes, missing export bundles, timeouts) land on AgentResult.errors
— never silent-empty.
Tests (tests/unit/agents/)
-------------------------
- typed-result shape + to_dict() defensive copies + errored() classmethod
- AgentConfig.from_env() field mapping and default fallthroughs
- abstract base cannot be instantiated; safety-net converts a raising
_execute to AgentResult.errored; latency always stamped; a third-party
@AGENTS.register("dummy-extension") resolves with no central edit
- Gemini stream-json parser: tool_use/tool_result folding, error events,
unmatched results, is_error→status, empty input
- Gemini agent: argv flips on -e="" / --allowed-tools, env overlay, timeout
threaded, SubprocessError + OSError paths, legacy run_cli_agent removed
- OpenClaw: model id normalization (gemini→google), shlex-quoted command
fragments, session-key selection across list/wrapper payloads, trajectory
export parsing, sessions-empty + export-failure + bash-timeout paths,
legacy SSH/local runners removed
- Import-hygiene test (spawned interpreter): `import devops_bench.agents`
pulls no mcp / deepeval / anthropic / google.genai / openai
Downstream contract
------------------
AgentHarness.run(prompt) → AgentResult is the single entry point the
harness will call after PR2 / capabilities. The API agent driving
models.loop.run_tool_loop slots in by overriding _execute and returning the
same AgentResult shape — no reshape needed.
PR #11 review (APPROVE-WITH-NITS) follow-up: - Switch latency timing to time.monotonic() (immune to wall-clock adjustments; matches the verification runner's pattern). - Coerce text to str at the source: ``text = client.get_text_content(...) or ""``. §5 says ``content`` must be a ``str``; guards a future adapter returning ``None``. Both ``final_text`` and the assistant message inherit the coerced value, so no further ``or ""`` is needed downstream. - test_loop_import_pulls_no_provider_sdk now runs in a fresh interpreter subprocess (``sys.executable -c ...``). In-process module probing was masked by sibling test_models_* imports of the adapter SDKs. - test_contents_record_user_assistant_and_tool_entries now also asserts the loop forwards function-call dicts by IDENTITY (not by copy) into assistant_message["tool_calls"] — locks the byte-identical-trajectory guarantee for the upcoming agents/api + chaos consumers.
…t; doc + test gaps Addresses peer-review findings on PR #13: BLOCKING — `_execute` was sourcing `AgentResult.output` from `_strip_ansi(completed.stdout)` (the noisy `oc --log-level debug` stream) even when `_read_export_bundle` had read the clean `output.md`/`output.txt`/ `final.txt` from the trajectory bundle. The judge would grade the debug noise. Now `_extract_trajectory` plumbs the bundle output through and `_execute` prefers it, falling back to stripped stdout only when the bundle has no output file. Two new tests cover bundle-preferred and fallback paths. Non-blocking nits, all addressed: 1. `oc models set <id>` now chains with `&&` so a failed model-set aborts the run and surfaces via `completed.returncode` → AgentResult.errors, instead of silently falling through to oc's stored-default model (which would invalidate the benchmark arm). New test asserts `&&` not `;`. 2. Gemini stream-json field-name leniency (`tool_use_id`/`id` and `args`/`input` fallbacks) is now covered by two regression tests so the tolerance can't silently rot. 3. New test asserts `GeminiCliAgent._execute` actually wires `extra_env` (GEMINI_MODEL / GOOGLE_API_KEY / GEMINI_API_KEY / OTEL_SDK_DISABLED) into the `run(...)` call site — not just that `_build_env` returns it in isolation. 4. `AgentResult.to_dict()` docstring corrected: the copies are shallow, not deep — outer containers are fresh but inner ToolCall dicts and other nested values stay aliased. Tests: 367 passed (was 361; +6). ruff clean.
…sub-second leaf floor Review nits from PR #12 (APPROVE-WITH-NITS). - _run_parallel: convert an unexpected leaf exception into a failed child result instead of letting it abort the whole group (catch around ``f.result()``). Pre-fill children with timed-out defaults so a never-completed future stays accounted for without a second materialize pass. - _run_parallel: ``ex.shutdown(wait=False, cancel_futures=True)`` so a deadline-exhausted group does not block on queued workers we no longer want. - _run_leaf: short-circuit when remaining < ``_MIN_LEAF_BUDGET_SECONDS`` (1s) to avoid issuing useless ``kubectl wait --timeout=0.001s`` at the tail of the deadline. - test_runner: add a parallel-with-exhausted-deadline test that uses the REAL ``PodHealthyVerifier`` / ``ScalingCompleteVerifier`` classes (kubectl primitives patched to raise) and asserts all children come back timed-out with ``elapsed_time == 0.0``. Add an unhandled-exception test asserting one bad leaf does not abort the parallel group. pytest: 347 passed (was 345, +2 new). ruff clean.
… (Phase 3 PR2) Rework the open PR #9 ``agents/api`` package onto the AgentHarness template method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared ``models/loop.run_tool_loop`` primitive — agents and chaos are now true siblings, neither carrying its own loop. Concrete changes: * ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)`` with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent), pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per CONVENTIONS §6), and drives ``run_tool_loop``. * A ``_build_dispatch`` closure routes each tool call to either a skill file read or MCP, wrapping every call in its own try/except so one tool failure lands on ``AgentResult.errors`` and the trajectory (``status="error"``) rather than aborting the run (``run_tool_loop`` propagates dispatch errors by design). * ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output`` is ``LoopResult.final_text``; tokens/latency/tools_used flow through unchanged (semantically byte-identical to the OLD api loop). * **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill discovery and ``config.target`` drives MCP — an agent may have one, both, or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone. * The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s public surface (Rules arrive in PR3). * ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates ``skills_paths`` into ``SkillBinding.paths``. Tests (``tests/unit/agents/api/``): * fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔ ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``); * a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run does not crash; * skills load on the MCP-off path (skills ⊥ MCP); * an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env`` reads anywhere under ``agents/api/``; * ``import devops_bench.agents.api`` and ``...api.agent`` pull no ``mcp``/``deepeval``/SDK at module top. Quality bar: ``uv run ruff check devops_bench/ tests/`` clean; ``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).
…ontract Phase 3 chaos refactor (rework PR #7). Brings the chaos package onto the shared model-agnostic loop and the same discriminated-union typed-node idiom verification uses, so chaos and agents become true Layer-2 siblings. Changes - New ``ChaosResult`` pydantic model (success / injected_fault / output / elapsed_time / error) replacing the legacy ``{status, output}`` dict. - ``Fault`` and ``Trigger`` reshaped as ``type``-tagged pydantic models in ``chaos/base.py``; FAULTS / TRIGGERS registries unchanged. - ``chaos/spec.py`` adds the ``ChaosSpec`` discriminated union (``ChaosAction`` / ``ChaosTrigger``). ``verify`` is a plain string key — chaos never imports a verification node. The legacy ``verification`` field name is accepted as an alias so the real ``optimize-scale/task.yaml`` parses unchanged ahead of the Phase-B task-file migration. - ``ChaosAgent`` reworked to drive ``models.loop.run_tool_loop``; fortio specifics (system prompt builder, RUN_COMMAND_TOOL, run_chaos_command, load target) move into ``faults/generate_load.py`` along with the typed ``GenerateLoadFault`` / ``LoadTarget`` nodes. The agent is now fault-agnostic — system instruction, tool descriptor, and command handler are injected by the concrete fault. - New ``triggers/time_delay.py`` houses the ``TimeTrigger`` previously inlined in ``harness/scenario.py``. - Slim ``chaos/__init__.py`` exports only ``Fault``, ``Trigger``, ``ChaosResult``, ``FAULTS``, ``TRIGGERS``, ``ChaosSpec``; no eager import of ``ChaosAgent``. - New ``chaos/schema.py`` emits JSON Schema and validates a single chaos entry (the Phase-A authoring contract). Tests (35 new under ``tests/unit/chaos/``) - Spec discrimination: minimal / named / verify-reference / legacy alias / bare-list rejection / extra-forbid / unknown discriminator / registry membership. - ``ChaosResult`` shape + round-trip; ``Fault`` / ``Trigger`` abstractness. - ``ChaosAgent`` against a fake ``LLMClient`` (no SDK / no network): no-tool turn, tool dispatch + message-shape pinning, non-dict args, unknown tool, final-text retention across the turn cap. - ``GenerateLoadFault.inject`` with mocked subprocess + stubbed agent: success path, exception → ``success=False`` with error, event threaded through. - ``TimeTrigger`` zero vs positive delay (``time.sleep`` patched). - Regression: the REAL ``complextasks/optimize-scale/task.yaml`` chaos entry parses through ``ChaosSpec`` and discriminates to ``GenerateLoadFault`` / ``TimeTrigger`` with the documented ``target.service_url`` / ``qps`` / ``delay_seconds``. - Import hygiene: ``import devops_bench.chaos`` pulls no provider SDK, ``mcp``, ``deepeval``, or ``ollama``; ``ChaosAgent`` is not exported. Bar - ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean. - Full ``uv run pytest tests/ -q`` green (456 passed; +35 chaos tests on top of the 421-test baseline).
…s (Phase 4)
Land the metrics extensibility refactor and the verifier registry swap in
one drop, building on the Wave-1 verification union with no changes to the
runner or the leaf-verifier logic.
Metrics:
- New `metrics/base.py` carrying `METRICS` (`Registry[type[MetricEvaluator]]`),
the `MetricScore` dataclass with a D3-preserving `to_entry()`, the
`MetricContext` that threads `use_mcp` from the harness (CONVENTIONS.md §7),
and the single shared `run_geval` recorder.
- Phase-0 fixes folded in: `CHECKLIST_THRESHOLD` constant decouples the
checklist cutoff from the tool-invocation threshold, dead grounding
branches are removed, the ` [GEval]` strip lives in exactly one place
(`run_geval`), and `extract_checklist_items` is exported from the package
facade.
- Each metric family (outcome_validity, tool_invocation, checklist,
grounding, chaos) is now a registered `@METRICS.register(...)` evaluator;
`evaluate_metrics_batch` collapses to a registry loop, importing the
builtin metric modules at call time so the package import stays free of
`deepeval` (CONVENTIONS.md §8).
- The `use_mcp` boolean is threaded through `evaluate_metrics_batch(..., *,
use_mcp=...)` and into `MetricContext`; metrics no longer self-read
`BENCH_USE_MCP`. The kwarg falls back to the env var so the harness can
flip its caller in a follow-up without breaking the rest of the tree.
Verifier registry:
- New `verification/registry.py` carrying `VERIFIERS`, populated by
`@VERIFIERS.register("...")` decorators on the existing leaves
(pod_healthy, scaling_complete) and on `SequenceSpec` / `ParallelSpec`.
- `verification/spec.py` swaps the static `Annotated[Union, Field(
discriminator="type")]` for a `parse_node()` helper invoked from a
`model_validator(mode="before")` on `VerificationSpec` (and on each
compound spec, for `checks` recursion). Lookup misses surface as
`pydantic.ValidationError` so callers see the same error surface.
- Touch on `verification/runner.py`, `verification/base.py`, and the leaf
verifier logic is zero — the only change to leaves is the decorator. The
existing verification regression + runner tests stay green unchanged.
Tests under `tests/unit/metrics/` cover: checklist parsing, the registry
loop, MCP gating via `MetricContext`, the legacy `to_entry()` shape (bare
value for rates, dict for judged), the single-place ` [GEval]` strip,
chaos/grounding/skill loading, the dummy-metric extension-axis acceptance,
and a fresh-interpreter assert that `import devops_bench.metrics` pulls no
`deepeval`/`mcp`/provider SDK. The verifier extension-axis test lives next
to its peers under `tests/unit/verification/`.
…; align __init__ docstring
CONVENTIONS §9: dummy ``@FAULTS.register("dummy")`` / ``@TRIGGERS.register("dummy")``
resolve via ``REGISTRY.get(key)`` with no edit to ``chaos/spec.py``. Comment
calls out Phase A's split: registry drives harness resolution; ``ChaosSpec``
parsing still requires a Union edit until Phase 4's registry-driven swap.
Also: pin ``args={}`` (missing ``command`` key) dispatch returning an
``"Error: ..."`` string (no crash); align ``chaos/__init__.py`` docstring with
the import-hygiene test it claims to satisfy — concretes load with the
package (required for the discriminated union), but provider SDKs / fortio /
``deepeval`` / ``mcp`` stay strictly lazy.
…el; dedupe compound _parse_children Wave-2 review nits on PR #16: - Pipeline ordering (D3): `evaluate_metrics_batch` no longer iterates `METRICS.values()` (which is import-order-sensitive across test sessions and third-party registrations). Iteration now walks a pinned `_BUILTIN_METRIC_ORDER` tuple of `(module, key)` pairs (outcome -> tool -> checklist -> grounding -> chaos), then appends any third-party keys in registry order. New `test_batch_score_insertion_order_matches_legacy_results_json` pins the on-disk schema. - `parse_node` no longer bypasses every `BaseModel`. It now only accepts already-parsed instances whose concrete class is in `VERIFIERS`; an unregistered model (including as a `checks` child of a compound node) raises the same `ValidationError` surface as an unknown `type` string. Two new tests in `test_verifier_registry.py` cover the standalone and compound-child cases. - The two byte-identical `_parse_children` validators on `SequenceSpec` and `ParallelSpec` collapse to one module-level `_parse_compound_children` helper attached via `model_validator(mode="before")(...)`. Tests: 476 passed (was 473; +1 ordering, +2 unregistered-BaseModel guard). Ruff: clean.
Wave 2 review of PR #14 found a blocking bug + three non-blocking nits. BLOCKING — `extract_tokens` was Google-only Read only ``prompt_token_count`` / ``candidates_token_count`` / ``total_token_count`` so Anthropic (``usage.input_tokens`` / ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` / ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens, corrupting token metrics in results.json for every non-Google provider. Fix: detect which usage shape is present and map each provider's source fields onto the legacy on-disk key scheme (D3 results.json stability). A new ``_first_int_attr`` helper looks up the first non-None attribute name in order. Anthropic emits no aggregated total, so the helper computes ``prompt + candidates`` when source ``total`` is absent (with a non-zero prompt/candidates) so non-empty runs never log ``total_tokens: 0``. Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute`` asserting ``AgentResult.tokens`` is non-zero and correctly mapped. NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript`` Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so ``os.environ["FOO"]`` patterns can't be bypassed even if the inner Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly. NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully The legacy ``from deepeval.tracing import observe`` was unconditional, so a host without deepeval crashed with ImportError on the first tool call. Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall through to the bare ``call_tool`` on failure. Tested by monkey-patching ``builtins.__import__`` to raise on ``deepeval`` and confirming the call still completes. NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory`` An unpaired ``{role: tool}`` result (no matching assistant call_id) is now intentionally dropped from the canonical trajectory (preserving the "every trajectory item is a real ToolCall the model issued" invariant metrics relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the new ``_fold_with_extraction_errors`` helper. Matches the CLI agents' "no silent-empty" policy. The drop is also debug-logged. The decision is documented in the function's docstring. Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean; ``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).
…) (Phase 3 PR3) Introduce devops_bench/agents/capabilities/ — the binding vocabulary the orchestrator threads through AgentConfig.capabilities. Three orthogonal axes: McpBinding (server command + tool names), SkillBinding (local SKILL.md paths), AgentRules (operator brief text). Each binding ships with a runtime_checkable Protocol (SupportsMcp / SupportsSkills / SupportsRules) plus a trivial mixin so concrete agents declare *only* what they can actually consume — capability negotiation refuses bindings an agent would silently ignore. AgentConfig migrates from the interim PR1/PR2 fields (target/allowed_tools/ skills_paths) to a single capabilities aggregate; the legacy AGENT_TARGET env still maps to the CLI binary path, while the new AGENT_MCP_SERVER / AGENT_ALLOWED_TOOLS / AGENT_SKILLS_PATHS / AGENT_RULES_TEXT env vars build the bindings. The harness will skip the env path entirely (it constructs AgentCapabilities directly from a benchmark catalog). Each agent now consumes the bindings: - ApiAgent (McpMixin + SkillsMixin + RulesMixin): MCP gate = capabilities.mcp with a non-empty command; skills gate = capabilities.skills.paths; rules ride on the provider's system_instruction. Skills ⊥ MCP independence preserved. - GeminiCliAgent (McpMixin + RulesMixin): --allowed-tools comes from capabilities.allowed_tools (aggregated across bound servers). - OpenClawAgent (RulesMixin only): the installed oc build exposes no in-agent MCP or skills wiring. Tests: new tests/unit/agents/test_agents_capabilities.py covers binding construction, Protocol membership, and the aggregate's accessors; updated config / api / cli tests assert each agent reads the bindings; a new AST- walking test forbids GKE-specific runtime literals anywhere in agents/; an import-hygiene test pins capabilities to stdlib-only. The existing BENCH_USE_MCP / os.environ AST guard still passes. 600/600 tests green; ruff clean on devops_bench/ + tests/unit/agents/.
… shlex.join MCP path
Wave 3 review surfaced both CLI agents advertising SupportsRules without
actually delivering the bound rules text — the mixin attribute mirror was
the entire delivery story. Fix:
- **Gemini** (cli/gemini.py): _execute now runs the binary inside a per-run
tempfile.TemporaryDirectory and writes capabilities.rules.text to
GEMINI.md there before invocation, then passes cwd=tmpdir to
core.subprocess.run. The CLI auto-loads GEMINI.md from its working dir,
which is the binary's native startup-context channel. Temp dir is cleaned
up after _execute returns.
- **OpenClaw** (cli/openclaw.py): new _prepend_rules helper splices the
rules text ahead of the prompt (separated by a blank line) before
_build_local_command interpolates it into the oc agent -m '<prompt>'
segment. oc has no dedicated rules flag, so prompt-prepending is the
reliable, binary-agnostic delivery channel.
Tests assert the bindings actually reach the subprocess invocation:
- tests/unit/agents/test_agents_cli_gemini.py snapshots the cwd handed to
run() and reads GEMINI.md from inside the fake run (before TemporaryDirectory
cleanup) — proves the binary would see the file on startup. Also asserts
no GEMINI.md is written for empty rules, and the temp dir is cleaned up.
- tests/unit/agents/test_agents_cli_openclaw.py captures the bash command
string and asserts the rules text + prompt both appear with rules ahead
of the prompt. Helper _prepend_rules has unit tests for the empty / non-
empty branches.
Non-blocking nits from the review, addressed in the same commit:
- api/agent.py: shlex.join(mcp_binding.command) replaces the lossy " ".join
so a spaced argv token (("uv run", "mcp-server")) round-trips MCPClient's
shlex.split intact. New test
test_execute_preserves_spaced_command_token_through_shlex_roundtrip
verifies the round-trip.
- capabilities/aggregate.py: AgentCapabilities.mcp docstring now warns
callers that ApiAgent honors only the first binding (multi-server fan-
out is not implemented in PR3) — Wave 4 will see the gotcha when wiring.
- tests/unit/agents/api/test_agents_api_agent.py: renamed
test_execute_runs_with_no_tools_when_target_unset →
..._when_capabilities_default to match the post-PR3 gate.
- config.py: tightened the inline comment explaining why from_env uses
name="default" on the synthesized McpBinding.
Side effect: the prior test_execute_returns_errored_on_empty_mcp_server_string
relied on " ".join collapsing a whitespace-only command. With shlex.join
that path is no longer reachable through a binding, so the test was rewritten
to patch MCPClient directly — it still proves the agent converts
MCPClient.__aenter__ ValueErrors to AgentResult.errored() rather than
letting them bubble through the base safety net.
ruff clean (devops_bench/ + tests/unit/agents/). pytest tests/ -q → 608 passed.
…aml migration (Phase 5)
Phase 5 (D4 option b): the harness is written once, born consuming the typed
Layer-2 contracts every component now exposes — no behavior-preserving port
plus retrofit churn. The orchestrator is the only place that wires
components, and it does so by consuming registries.
Harness package (new):
- ``harness/base.py`` — ``Harness`` ABC: ``run(list[Task]) -> list[dict]``.
- ``harness/default.py`` — ``DefaultHarness``: registry-only agent
resolution (no ``_AGENT_MODULES`` table; builtin agent modules imported
once at call time so ``@AGENTS.register`` fires, then pure
``AGENTS.get(key)``); single ``BENCH_USE_MCP`` read at construction,
threaded into both the built ``AgentConfig.capabilities`` and the
``evaluate_metrics_batch(use_mcp=...)`` call; explicit ``AgentConfig`` +
``AgentCapabilities`` (MCP / skills / rules) built from the env-layer
bindings with MCP gated by the harness-owned boolean; opaque task blobs
parsed into typed ``ChaosSpec`` / ``VerificationSpec`` at the harness
boundary; ``RunContext.workspace_path`` rooted artifact diff (no
hardcoded cwd); ``capabilities_granted`` snapshot on the result.
- ``harness/scenario.py`` — runs ``trigger.wait(ctx)`` then
``action.inject(ctx, event)`` on the typed ``ChaosSpec``; resolves the
opaque ``verify:`` key against a name-keyed verification mapping the
harness builds (chaos never imports verification);
``VerifierAgent().wait_for_condition(node, timeout_sec=120)`` for typed
``VerificationResult``; ``skip_port_forward=True`` lets the E2E smoke
exercise the wiring without ``kubectl``.
- ``harness/reporter.py`` — ``ResultReporter`` writes ``results.json``;
the engine depends on the sink so output stores become a substitution.
- ``harness/artifacts.py`` — workspace-aware ``snapshot_dir`` /
``collect_generated_files``.
- ``harness/__init__.py`` — lazy facade; ``import devops_bench.harness``
pulls no SDK / deepeval / mcp / concrete agent.
results.json (Decision D3 — schema preserved):
- Every typed value routed through ``AgentResult.to_dict()`` /
``ChaosResult.model_dump()`` / ``VerificationResult.model_dump()`` /
``MetricScore.to_entry()``; the legacy mixed shape is unchanged.
- Failed records keep ``status: "failed"`` + ``score: 0`` + ``error`` and
carry the same identifying keys as success records.
- New additive key: ``capabilities_granted = {"use_mcp", "skills"}`` so
metrics / dashboards can read what the harness granted instead of
re-reading env (CONVENTIONS.md §7 closure step).
task.yaml migration (Decision D2 — keep field names, migrate values):
- ``complextasks/optimize-scale/task.yaml``: ``chaos_spec`` /
``verification_spec`` migrated from JSON-in-YAML strings to native YAML.
- Field names unchanged so gke-labs#89's loader + schema continue to parse
unmodified. The harness accepts both shapes (legacy strings round-trip
through ``json.loads``) so any fork on the old shape lands the harness
PR without touching its task files.
- Cross-reference: chaos ``verify: "Planned Load Spike Verification"``
now resolves through the name-keyed mapping the harness builds from
``verification_spec``.
- The chaos optimize-scale regression test now exercises the native YAML
shape end-to-end.
Tests (24 new under ``tests/unit/harness/`` + 1 in chaos):
- ``test_registry_resolution.py`` — dummy ``@AGENTS.register("dummy")``
resolves with no harness edit; legacy ``cli`` / ``binary`` alias
normalizes to the canonical key; unknown key raises
``NotRegisteredError``.
- ``test_single_env_read.py`` — ``BENCH_USE_MCP`` is read once at
construction; same boolean reaches the agent's capability gate and the
metrics scoring call; grep-style assertion that only one executable
read site exists in the package.
- ``test_scenario.py`` — fake fault + trigger + verifier exercise the
seam wiring (``trigger.wait`` -> ``action.inject``; mapping lookup;
rewritten local URL; typed ``ChaosResult`` failure path);
``subprocess.Popen`` poisoned to guard against accidental real
``kubectl`` calls.
- ``test_reporter.py`` — golden schema for ``results.json`` (preserved
legacy key set + additive ``capabilities_granted``).
- ``test_spec_parsing.py`` — the real optimize-scale blob parses into
typed ``ChaosSpec`` / ``VerificationSpec``; legacy string shape still
round-trips; loader carries native-YAML values through verbatim.
- ``test_package_import.py`` — bare ``import devops_bench.harness``
pulls no SDK / deepeval / mcp.
- ``test_e2e_smoke.py`` — full ``complextasks/optimize-scale`` flows
through ``DefaultHarness.run`` against ``NoOpDeployer`` with stubbed
agent + judge + faked trigger/fault/verifier but the REAL wiring;
asserts exactly one ``results.json`` with the preserved schema,
populated trajectory, and both ``chaos_report`` + ``perf_report``.
Quality bar:
- ``uv run ruff check devops_bench/ tests/unit/harness/`` clean.
- ``uv run pytest tests/ -q``: 633 passed (608 baseline + 25 new).
BLOCKING fix (D3 symmetric records):
- Success and failed records now carry the SAME top-level key set
(DefaultHarness._RECORD_KEYS); a downstream parser iterating one
shape can never KeyError crossing into the other. Previously the
success path emitted ``errors`` + ``scores`` while the failed path
emitted ``error`` + ``score`` + NO ``scores``.
- New shared seed ``_empty_record(task)`` populates the symmetric
union; ``_build_success_record`` / renamed ``_build_failed_record``
overlay the differing values. Both records now carry ``error``,
``errors``, ``score``, ``scores``, ``verification_parse_errors``,
and the existing capabilities/chaos/perf/etc fields.
- Golden tests rewritten to invoke the real builders on a stub
AgentResult and assert ``set(record.keys()) == required_keys`` on
the actual output (was: hand-rolled dict that let the builders
drift silently). New ``test_legacy_failed_record_keys_are_emitted_verbatim``
mirrors the success golden test; a direct
``test_success_and_failed_records_have_identical_top_level_keys``
pins the symmetry invariant.
Required non-blocking nits:
1. Single-read regex extended to ``os.getenv`` / ``first_env`` /
``os.environ.get`` / ``os.environ[...]`` / bare ``getenv|environ[``
so a sneaked-in second BENCH_USE_MCP read fails CI rather than
passing review.
2. Verification authoring loud-fail:
- ``_build_verification_mapping`` now returns ``(mapping, errors)``
and accumulates every authoring failure (non-mapping entry,
missing ``name``, JSON parse failure, model validation failure)
so it lands on the run record as ``verification_parse_errors``
— operator sees the broken entry on results.json, not just in
the log.
- ``ScenarioManager._resolve_verification`` writes a failure
record into ``chaos_report["verification"]`` (with
``unresolved_reference`` / ``known_references``) when a chaos
``verify:`` key doesn't resolve, so a typo'd cross-reference is
visible on the run record.
- Canonical ``[{name, spec: <typed-node>}]`` shape documented in
the docstring; bare-node compat path retained with a regression
test that pins the two-shape acceptance explicitly.
3. Port-forward early-exit reap: when the Popen child has already
exited at ``poll()`` time, ``wait()`` is called before raising so
the process is reaped (not zombied).
4. ``_granted_skill_paths`` snapshot captured once at ``__init__``;
per-record builders read the cached tuple instead of re-calling
``AgentConfig.from_env()``. Removes the env-drift surface.
5. ``_drain_scenario`` checks ``thread.is_alive()`` after join;
warns and stamps ``chaos_report["status"]="timed_out"`` so
partial results are flagged on the record.
6. ``_ensure_builtin_agents_registered`` now catches ONLY
``ImportError`` / ``MissingDependencyError``; a real bug
(``SyntaxError``, etc.) in an agent module re-raises so it is not
invisible behind a debug log.
7. E2E smoke ``_CHAOS_ACTIVE_WAIT_SEC`` monkeypatch now references
the live module attribute and asserts the patched value (was a
no-op assertion on the test's frozen ``from ... import`` alias).
OPTIONAL (also landed):
- ``_DEFAULT_TARGET_DEPLOYMENT`` / ``_DEFAULT_NAMESPACE`` are now
``DefaultHarness`` constructor args (current literals as defaults)
so a non-Hypercompute embedder can override without monkey-patching
the module constant.
Tests:
- 13 new tests under tests/unit/harness/ (4 reporter goldens, 5
default-harness internals, 4 verification authoring path tests).
- ``uv run pytest tests/ -q``: 646 passed (was 633; +13 new).
- ``uv run ruff check devops_bench/ tests/unit/``: clean.
…eporter + task.yaml migration (PR #18)
…ted matches the agent's actual config Resolves the consistency gap a senior reviewer flagged on PR #18: the harness snapshotted ``use_mcp`` and ``_granted_skill_paths`` at ``__init__``, but ``build_agent_config()`` re-read ``AgentConfig.from_env()`` per agent run, so a mid-batch ``AGENT_*`` env mutation desynced what the agent ran with from what the record claimed it ran with. Fix: build the gated ``AgentConfig`` ONCE in ``__init__`` via the new ``_build_agent_config_snapshot()`` helper and store it on ``self._agent_config``. ``build_agent_config()`` becomes a pure accessor that returns the same object identity across calls; ``_granted_skill_paths`` becomes a derived property reading off the snapshot. ``capabilities_granted`` on every record is therefore the SAME source the agent was constructed from — structurally impossible to desync. Tests (tests/unit/harness/test_default_harness.py): - ``test_build_agent_config_returns_identical_snapshot_across_calls`` pins ``a is b`` over back-to-back calls (no per-call rebuild). - ``test_capabilities_granted_matches_agent_config_even_after_env_mutation`` mutates ``AGENT_SKILLS_PATHS`` / ``BENCH_USE_MCP`` / ``AGENT_MCP_SERVER`` AFTER construction and asserts the record + agent agree. Verified the new tests catch the regression by temporarily reverting the production fix: both new tests fail, the identity check catches the per-call rebuild and the desync check catches the env-mutation leak. Bar: ``uv run ruff check devops_bench/ tests/unit/harness/`` clean; ``uv run pytest tests/ -q``: 648 passed (was 646 on base; +2 new).
…I + Gemini policy
`parse_trajectory_export` used to synthesize a free-floating ToolCall entry
for an unpaired `tool_result` while also surfacing it on `errors`. The API
agent's `_fold_with_extraction_errors` and the Gemini stream-json parser
both DROP the orphan (errors-only), so the two CLI agents disagreed with
the API agent at the metrics seam — only OpenClaw would feed metrics
synthetic entries that no real tool_call backed.
Make OpenClaw mirror the API + Gemini policy: drop the orphan from the
trajectory, surface it on `AgentResult.errors` with a content preview so
nothing is silent-empty. Comment quotes the API agent's invariant verbatim
("every trajectory item is a real ToolCall the model issued") so the rule
is uniform and rediscoverable across all three agents. Updates the
docstring and replaces the test (now asserts orphan ABSENT from trajectory
AND PRESENT in errors, with the id surfaced).
Tests: 646 passed. ruff clean.
…zy ChaosAgent import
Completes the chaos Phase-A → Phase-4 swap from CONVENTIONS §4, fixing senior
review's B1 (chaos import drags in the agent + models chain) and S1 (adding a
fault still needs a central edit to ``chaos/spec.py``).
Changes
- New ``chaos/registry.py`` hosts ``FAULTS`` / ``TRIGGERS`` (mirrors
``verification/registry.py``), with ``entry_point_group`` for external
packages. ``chaos/base.py`` re-exports both for backward compatibility.
- ``chaos/spec.py`` drops the hand-maintained
``Annotated[Union, Field(discriminator="type")]``. ``ChaosSpec`` now carries
a ``model_validator(mode="before")`` that routes ``trigger`` / ``action``
payloads through ``parse_trigger`` / ``parse_fault`` — registry-driven
resolution with ``ValidationError`` listing registered keys on miss. The
parser lazy-imports the bundled fault/trigger modules on first parse, so the
spec module itself depends only on ``core`` and ``pydantic``.
- ``GenerateLoadFault.inject`` lazy-imports ``ChaosAgent`` so loading the
fault module for registry registration does not pull
``devops_bench.chaos.agent`` or the ``devops_bench.models`` chain.
- ``chaos/__init__.py`` docstring rewritten to the Phase-4 reality: importing
the package now pulls only ``chaos.{__init__, base, registry, spec}`` and
``core.*`` — concretes load lazily on first parse; agent + models load only
when ``Fault.inject`` runs.
Tests
- Stronger import-hygiene tests assert that ``devops_bench.chaos.agent``,
``devops_bench.models.loop``, and ``devops_bench.models.base`` are absent
from ``sys.modules`` after both ``import devops_bench.chaos`` and after
``ChaosSpec.model_validate(...)`` runs.
- Extension-axis tests upgraded: a dummy ``@FAULTS.register("dummy")`` /
``@TRIGGERS.register("dummy")`` now parses through ``ChaosSpec`` with no
edit to ``chaos/spec.py`` (the Phase-4 bar). Plus an error-message test
proving the unknown-type ValidationError enumerates the registered keys.
- ``test_generate_load`` re-targets the patch at
``devops_bench.chaos.agent.ChaosAgent`` (lazy import inside ``inject``).
- ``test_spec`` swaps the JSON-schema assertion to reflect Phase-4 (the
schema no longer enumerates concrete types — that's what lets a new fault
land without a schema edit) and adds a missing-type-discriminator case.
Bar: ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean;
``uv run pytest tests/ -q`` → 652 passed.
… — closes B1/S1 (PR #21)
…ture Fixes B2: `tests/unit/chaos/test_extension_axis.py` failed in isolation because the fixture snapshotted `FAULTS`/`TRIGGERS` *before* Phase 4's lazy `_ensure_concretes_loaded()` populated them; teardown then removed the lazily-added bundled keys, leaving subsequent tests staring at `registered: []`. Calling `_ensure_concretes_loaded()` at fixture entry — before the snapshot — keeps the real baseline intact so teardown restores to the bundled set, not an empty one. Only this file uses the snapshot pattern; nothing else needed updating.
This was referenced Jun 20, 2026
Owner
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cross-cutting harness refactor — assembled & reviewed
Implements the reconciled landing plan in
docs/refactor/e2e-refactor-sequencing-plan.md, migrating the legacy monolith into a strictly layereddevops_bench/package. This branch (refactor/integration) is the assembled result of the reworked component PRs #11–#22, each peer-reviewed (two reviewers) and then covered by two senior reviews.Diff base:
integration/devops-bench-stage1(the clean Stage-1 foundation: core + tasks + deployers + k8s + models), so this PR shows exactly the Stage-1.5–3 refactor work.What landed (by wave)
models/loop.pysharedrun_tool_loopprimitive (PR feat(models): run_tool_loop shared turn-loop primitive (Phase 1, D1) #11); agents ⊥ chaos become siblings.VerificationResult(PR feat(verification): discriminated-union spec + deadline runner + typed result (Phase 2A/2B) #12); later swapped to registry-driven parsing (PR feat(metrics): registry + run_geval + verifier registry; Phase-0 fixes (Phase 4) #16).AgentHarness/AgentConfig/AgentResult+ CLI agents (PR feat(agents): AgentHarness base + CLI agents (Phase 3 PR1) #13),ApiAgentonrun_tool_loop(PR feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2) #14), capability bindingsMcpBinding/SkillBinding/AgentRules(PR feat(agents): capability bindings (McpBinding/SkillBinding/AgentRules) (Phase 3 PR3) #17).ChaosResult/Fault/Trigger, loop reuse (PR feat(chaos): typed ChaosResult/Fault/Trigger; loop reuse; authoring contract (Phase 3) #15); Phase-4 registry-driven parsing + lazy agent import (PR refactor(chaos): Phase-4 swap — registry-driven ChaosSpec parsing; lazy ChaosAgent import #21).METRICS/MetricScore/MetricContext/run_geval+VERIFIERSregistry (PR feat(metrics): registry + run_geval + verifier registry; Phase-0 fixes (Phase 4) #16).BENCH_USE_MCPread, typed seams,ResultReporter,optimize-scale/task.yamlnative-YAML migration, E2E smoke (PR feat(harness): registry-only wiring + typed seams + reporter + task.yaml migration (Phase 5) #18).Acceptance
ruff check devops_bench/clean; every test file passes in isolation (no order-dependence).models.BENCH_USE_MCPread once in the harness, threaded to both agent and judge.import devops_bench.<pkg>pulls no provider SDK / deepeval / mcp for every package.results.jsonschema preserved (D3) viato_entry()/to_dict(), success/failed records key-symmetric.complextasks/optimize-scalethrough the real wiring againstNoOpDeployer(network/SDK stubbed only at the boundary).Draft for review.