Skip to content

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23

Closed
pradeepvrd wants to merge 34 commits into
integration/devops-bench-stage1from
refactor/integration
Closed

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled)#23
pradeepvrd wants to merge 34 commits into
integration/devops-bench-stage1from
refactor/integration

Conversation

@pradeepvrd

Copy link
Copy Markdown
Owner

Cross-cutting harness refactor — assembled & reviewed

Implements the reconciled landing plan in docs/refactor/e2e-refactor-sequencing-plan.md, migrating the legacy monolith into a strictly layered devops_bench/ package. This branch (refactor/integration) is the assembled result of the reworked component PRs #11#22, each peer-reviewed (two reviewers) and then covered by two senior reviews.

Diff base: integration/devops-bench-stage1 (the clean Stage-1 foundation: core + tasks + deployers + k8s + models), so this PR shows exactly the Stage-1.5–3 refactor work.

What landed (by wave)

Acceptance

  • 654 unit tests pass; ruff check devops_bench/ clean; every test file passes in isolation (no order-dependence).
  • Strict one-directional layering (core ← leaves ← components ← harness); zero reverse/sibling imports; agents/chaos depend only on models.
  • Registry-only resolution on all five axes (AGENTS/FAULTS/TRIGGERS/METRICS/VERIFIERS), each with a "drop a decorator, no central edit" acceptance test.
  • BENCH_USE_MCP read once in the harness, threaded to both agent and judge.
  • import devops_bench.<pkg> pulls no provider SDK / deepeval / mcp for every package.
  • results.json schema preserved (D3) via to_entry()/to_dict(), success/failed records key-symmetric.
  • E2E smoke runs complextasks/optimize-scale through the real wiring against NoOpDeployer (network/SDK stubbed only at the boundary).

Draft for review.

…ons contract

Vendors the refactor planning docs (e2e sequencing plan, five component
handoffs) plus the distilled CONVENTIONS.md so the refactor work is
self-contained on the integration branch.
Add devops_bench/models/loop.py providing the single canonical model-agnostic
tool-use loop consumed by both agents/api and chaos/agent. The primitive
imports only models.base + core (Layer 1) and lets the caller format tools so
it stays ignorant of provider tool-descriptor shapes.

Behavior preserved from the duplicated chaos/api loops:
- seeds contents with the user goal
- per turn appends the assistant message + one tool-result entry per call
  using the §5 neutral dict shapes (role/content/tool_calls/tool_call_id/name)
- retains final_text on EVERY turn so a tool call on the last turn or the
  turn cap never discards the model's summary
- for/else turn cap with a warning, never an infinite loop
- accumulates latency across every generate_content call

Dispatch errors propagate to the caller (the loop never silently swallows
them); each consumer wraps dispatch as it sees fit.

Unit tests cover: turn-cap + warning, final-text retention (last-turn tool
call + natural exit), latency accumulation, dispatch error surfacing,
tools_used tracking, caller-formats-tools pass-through, full conversation
shape, max_turns=0 edge case, and that importing the loop pulls no provider
SDK.

Refs: CONVENTIONS.md §5/§6; e2e-refactor-sequencing-plan.md §3.1/§4.1 (D1);
chaos-refactor-handoff.md §5.1.
…d result

Phase 2A (refactor: PR #6 rework). Replaces the loose list/dict spec arms with
a hand-maintained discriminated union (`SequenceSpec` / `ParallelSpec` +
type-tagged leaves), swaps `VerificationResult.details` for typed `children`
and `raw` fields, and rewrites `VerifierAgent` around a single monotonic
deadline (no per-node budget arithmetic, one clock source).

- `verification/spec.py`: `VerificationNode = Annotated[Union, ...]`. Bare
  leaves are valid whole specs; bare list/dict roots are rejected. The union
  is hand-maintained at Phase A — structured so a Wave 2 registry-driven
  `parse_node` can swap the parsing without reshaping the runner's
  isinstance dispatch.
- `verification/base.py`: pydantic `VerificationResult` with `success`,
  `elapsed_time`, `reason`, optional `name`, `children`, `raw`. `BaseVerifier`
  gains an optional `name` for result labeling.
- `verification/runner.py`: `VerifierAgent.wait_for_condition(spec,
  timeout_sec=120)` computes one `time.monotonic()` deadline at entry.
  Parallel children each see the full remaining deadline; sequences are
  fail-fast (later steps recorded as skipped).
- `verification/verifiers/{pod_healthy,scaling_complete}.py`: leaves consume
  `k8s.kubectl.get_json` / `k8s.conditions.poll_until`; both null-guard
  `.status` on raw kubectl returns. Result `details=` -> `raw=`, `name`
  threaded through.

Phase 2B authoring contract:
- `verification/schema.py`: `json_schema()` emits the pydantic JSON Schema for
  editor autocomplete; `validate_spec(data)` returns the parsed leaf node and
  propagates `pydantic.ValidationError` on bad input.
- Regression test pins a literal of the migrated `optimize-scale` verification
  (`type: parallel` with `pod_healthy` + `scaling_complete` children); both
  leaf `verify` methods are stubbed so it runs without a cluster.

Test bar (`tests/unit/verification/`): spec parsing/discrimination; result
shape (`details` is gone); deadline semantics (parallel sees full remaining
deadline, sequence fail-fast, one monotonic clock); leaf null-status guards
on pods AND deployment; bare list/dict spec rejected; lightweight-import
guard asserts `import devops_bench.verification` pulls no provider SDK /
deepeval / mcp.

`ruff check` and `pytest tests/` are green (345 passed). Harness/task-file
migration (§7c) is intentionally out of scope — Wave 4 harness PR.
Lays the agents/ Layer-2 component on the post-Stage-1 foundation: the
template-method AgentHarness, typed AgentConfig / AgentResult / ToolCall,
the AGENTS registry, and rewritten gemini + openclaw CLI harnesses on the
new base. The legacy run_cli_agent dispatch, SSH transport, and hardcoded
GKE tool list are gone; trajectories come from the official CLI channels.

Highlights
----------
- AgentHarness owns latency bookkeeping and a broad safety net so a single
  agent crash never aborts the benchmark; subclasses implement _execute and
  return AgentResult. deepeval tracing is wrapped on demand, kept import-time
  optional.
- AgentResult / ToolCall match CONVENTIONS §3 exactly: typed dataclasses,
  to_dict() boundary shim, AgentResult.errored() classmethod for failures,
  and a has_errors() helper. The metrics layer reads one canonical
  trajectory across CLI and (PR2) API agents.
- AgentConfig.from_env() is the single env entry point (AGENT_MODEL /
  PROVIDER / API_KEY / TARGET / TIMEOUT_SEC / ALLOWED_TOOLS); agents stop
  self-reading the environment per §7. AGENT_ALLOWED_TOOLS is the PR1
  interim for what becomes McpBinding.tools in PR3.
- GeminiCliAgent invokes `gemini --output-format stream-json --skip-trust`,
  selects between `--allowed-tools <name>` (one per tool) and `-e=""`
  extensions-disabled, threads model/key into GEMINI_MODEL/GOOGLE_API_KEY,
  passes timeout into every subprocess.run, and parses tool_use /
  tool_result / result events from stdout into the canonical trajectory —
  no session-file glob, no internal-schema parsing.
- OpenClawAgent runs `oc agent --local` through bash (nvm sourced; every
  interpolated value shlex.quoted), wipes ~/.openclaw/agents/<name>/sessions
  before each run, picks the resulting session via `oc sessions --agent
  <name> --json`, and extracts the trajectory via the official `oc sessions
  export-trajectory --workspace <tmpdir>` bundle per docs/openclaw/sessions
  rather than the legacy debug-log sessionFile= scrape. SSH transport and
  the OPENCLAW_LOCAL toggle are removed.
- Extraction failures (JSON decode misses, unpaired tool_results, non-zero
  exit codes, missing export bundles, timeouts) land on AgentResult.errors
  — never silent-empty.

Tests (tests/unit/agents/)
-------------------------
- typed-result shape + to_dict() defensive copies + errored() classmethod
- AgentConfig.from_env() field mapping and default fallthroughs
- abstract base cannot be instantiated; safety-net converts a raising
  _execute to AgentResult.errored; latency always stamped; a third-party
  @AGENTS.register("dummy-extension") resolves with no central edit
- Gemini stream-json parser: tool_use/tool_result folding, error events,
  unmatched results, is_error→status, empty input
- Gemini agent: argv flips on -e="" / --allowed-tools, env overlay, timeout
  threaded, SubprocessError + OSError paths, legacy run_cli_agent removed
- OpenClaw: model id normalization (gemini→google), shlex-quoted command
  fragments, session-key selection across list/wrapper payloads, trajectory
  export parsing, sessions-empty + export-failure + bash-timeout paths,
  legacy SSH/local runners removed
- Import-hygiene test (spawned interpreter): `import devops_bench.agents`
  pulls no mcp / deepeval / anthropic / google.genai / openai

Downstream contract
------------------
AgentHarness.run(prompt) → AgentResult is the single entry point the
harness will call after PR2 / capabilities. The API agent driving
models.loop.run_tool_loop slots in by overriding _execute and returning the
same AgentResult shape — no reshape needed.
PR #11 review (APPROVE-WITH-NITS) follow-up:

- Switch latency timing to time.monotonic() (immune to wall-clock
  adjustments; matches the verification runner's pattern).
- Coerce text to str at the source: ``text = client.get_text_content(...) or ""``.
  §5 says ``content`` must be a ``str``; guards a future adapter returning
  ``None``. Both ``final_text`` and the assistant message inherit the coerced
  value, so no further ``or ""`` is needed downstream.
- test_loop_import_pulls_no_provider_sdk now runs in a fresh interpreter
  subprocess (``sys.executable -c ...``). In-process module probing was masked
  by sibling test_models_* imports of the adapter SDKs.
- test_contents_record_user_assistant_and_tool_entries now also asserts the
  loop forwards function-call dicts by IDENTITY (not by copy) into
  assistant_message["tool_calls"] — locks the byte-identical-trajectory
  guarantee for the upcoming agents/api + chaos consumers.
…t; doc + test gaps

Addresses peer-review findings on PR #13:

BLOCKING — `_execute` was sourcing `AgentResult.output` from
`_strip_ansi(completed.stdout)` (the noisy `oc --log-level debug` stream)
even when `_read_export_bundle` had read the clean `output.md`/`output.txt`/
`final.txt` from the trajectory bundle. The judge would grade the debug
noise. Now `_extract_trajectory` plumbs the bundle output through and
`_execute` prefers it, falling back to stripped stdout only when the bundle
has no output file. Two new tests cover bundle-preferred and fallback paths.

Non-blocking nits, all addressed:
1. `oc models set <id>` now chains with `&&` so a failed model-set aborts
   the run and surfaces via `completed.returncode` → AgentResult.errors,
   instead of silently falling through to oc's stored-default model (which
   would invalidate the benchmark arm). New test asserts `&&` not `;`.
2. Gemini stream-json field-name leniency (`tool_use_id`/`id` and
   `args`/`input` fallbacks) is now covered by two regression tests so the
   tolerance can't silently rot.
3. New test asserts `GeminiCliAgent._execute` actually wires `extra_env`
   (GEMINI_MODEL / GOOGLE_API_KEY / GEMINI_API_KEY / OTEL_SDK_DISABLED) into
   the `run(...)` call site — not just that `_build_env` returns it in
   isolation.
4. `AgentResult.to_dict()` docstring corrected: the copies are shallow, not
   deep — outer containers are fresh but inner ToolCall dicts and other
   nested values stay aliased.

Tests: 367 passed (was 361; +6). ruff clean.
…sub-second leaf floor

Review nits from PR #12 (APPROVE-WITH-NITS).

- _run_parallel: convert an unexpected leaf exception into a failed child
  result instead of letting it abort the whole group (catch around
  ``f.result()``). Pre-fill children with timed-out defaults so a never-completed
  future stays accounted for without a second materialize pass.
- _run_parallel: ``ex.shutdown(wait=False, cancel_futures=True)`` so a
  deadline-exhausted group does not block on queued workers we no longer want.
- _run_leaf: short-circuit when remaining < ``_MIN_LEAF_BUDGET_SECONDS`` (1s)
  to avoid issuing useless ``kubectl wait --timeout=0.001s`` at the tail of
  the deadline.
- test_runner: add a parallel-with-exhausted-deadline test that uses the REAL
  ``PodHealthyVerifier`` / ``ScalingCompleteVerifier`` classes (kubectl
  primitives patched to raise) and asserts all children come back timed-out
  with ``elapsed_time == 0.0``. Add an unhandled-exception test asserting
  one bad leaf does not abort the parallel group.

pytest: 347 passed (was 345, +2 new). ruff clean.
… (Phase 3 PR2)

Rework the open PR #9 ``agents/api`` package onto the AgentHarness template
method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared
``models/loop.run_tool_loop`` primitive — agents and chaos are now true
siblings, neither carrying its own loop.

Concrete changes:

* ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)``
  with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent),
  pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per
  CONVENTIONS §6), and drives ``run_tool_loop``.
* A ``_build_dispatch`` closure routes each tool call to either a skill file
  read or MCP, wrapping every call in its own try/except so one tool failure
  lands on ``AgentResult.errors`` and the trajectory (``status="error"``)
  rather than aborting the run (``run_tool_loop`` propagates dispatch errors
  by design).
* ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical
  :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the
  matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output``
  is ``LoopResult.final_text``; tokens/latency/tools_used flow through
  unchanged (semantically byte-identical to the OLD api loop).
* **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill
  discovery and ``config.target`` drives MCP — an agent may have one, both,
  or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone.
* The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy
  ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s
  public surface (Rules arrive in PR3).
* ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields
  (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates
  ``skills_paths`` into ``SkillBinding.paths``.

Tests (``tests/unit/agents/api/``):

* fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness
  (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔
  ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``);
* a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run
  does not crash;
* skills load on the MCP-off path (skills ⊥ MCP);
* an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env``
  reads anywhere under ``agents/api/``;
* ``import devops_bench.agents.api`` and ``...api.agent`` pull no
  ``mcp``/``deepeval``/SDK at module top.

Quality bar: ``uv run ruff check devops_bench/ tests/`` clean;
``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).
…ontract

Phase 3 chaos refactor (rework PR #7). Brings the chaos package onto the
shared model-agnostic loop and the same discriminated-union typed-node idiom
verification uses, so chaos and agents become true Layer-2 siblings.

Changes
- New ``ChaosResult`` pydantic model (success / injected_fault / output /
  elapsed_time / error) replacing the legacy ``{status, output}`` dict.
- ``Fault`` and ``Trigger`` reshaped as ``type``-tagged pydantic models in
  ``chaos/base.py``; FAULTS / TRIGGERS registries unchanged.
- ``chaos/spec.py`` adds the ``ChaosSpec`` discriminated union
  (``ChaosAction`` / ``ChaosTrigger``). ``verify`` is a plain string key — chaos
  never imports a verification node. The legacy ``verification`` field name is
  accepted as an alias so the real ``optimize-scale/task.yaml`` parses
  unchanged ahead of the Phase-B task-file migration.
- ``ChaosAgent`` reworked to drive ``models.loop.run_tool_loop``; fortio
  specifics (system prompt builder, RUN_COMMAND_TOOL, run_chaos_command, load
  target) move into ``faults/generate_load.py`` along with the typed
  ``GenerateLoadFault`` / ``LoadTarget`` nodes. The agent is now
  fault-agnostic — system instruction, tool descriptor, and command handler
  are injected by the concrete fault.
- New ``triggers/time_delay.py`` houses the ``TimeTrigger`` previously inlined
  in ``harness/scenario.py``.
- Slim ``chaos/__init__.py`` exports only ``Fault``, ``Trigger``,
  ``ChaosResult``, ``FAULTS``, ``TRIGGERS``, ``ChaosSpec``; no eager import of
  ``ChaosAgent``.
- New ``chaos/schema.py`` emits JSON Schema and validates a single chaos entry
  (the Phase-A authoring contract).

Tests (35 new under ``tests/unit/chaos/``)
- Spec discrimination: minimal / named / verify-reference / legacy alias /
  bare-list rejection / extra-forbid / unknown discriminator / registry
  membership.
- ``ChaosResult`` shape + round-trip; ``Fault`` / ``Trigger`` abstractness.
- ``ChaosAgent`` against a fake ``LLMClient`` (no SDK / no network): no-tool
  turn, tool dispatch + message-shape pinning, non-dict args, unknown tool,
  final-text retention across the turn cap.
- ``GenerateLoadFault.inject`` with mocked subprocess + stubbed agent: success
  path, exception → ``success=False`` with error, event threaded through.
- ``TimeTrigger`` zero vs positive delay (``time.sleep`` patched).
- Regression: the REAL ``complextasks/optimize-scale/task.yaml`` chaos entry
  parses through ``ChaosSpec`` and discriminates to ``GenerateLoadFault`` /
  ``TimeTrigger`` with the documented ``target.service_url`` / ``qps`` /
  ``delay_seconds``.
- Import hygiene: ``import devops_bench.chaos`` pulls no provider SDK,
  ``mcp``, ``deepeval``, or ``ollama``; ``ChaosAgent`` is not exported.

Bar
- ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean.
- Full ``uv run pytest tests/ -q`` green (456 passed; +35 chaos tests on top
  of the 421-test baseline).
…s (Phase 4)

Land the metrics extensibility refactor and the verifier registry swap in
one drop, building on the Wave-1 verification union with no changes to the
runner or the leaf-verifier logic.

Metrics:
- New `metrics/base.py` carrying `METRICS` (`Registry[type[MetricEvaluator]]`),
  the `MetricScore` dataclass with a D3-preserving `to_entry()`, the
  `MetricContext` that threads `use_mcp` from the harness (CONVENTIONS.md §7),
  and the single shared `run_geval` recorder.
- Phase-0 fixes folded in: `CHECKLIST_THRESHOLD` constant decouples the
  checklist cutoff from the tool-invocation threshold, dead grounding
  branches are removed, the ` [GEval]` strip lives in exactly one place
  (`run_geval`), and `extract_checklist_items` is exported from the package
  facade.
- Each metric family (outcome_validity, tool_invocation, checklist,
  grounding, chaos) is now a registered `@METRICS.register(...)` evaluator;
  `evaluate_metrics_batch` collapses to a registry loop, importing the
  builtin metric modules at call time so the package import stays free of
  `deepeval` (CONVENTIONS.md §8).
- The `use_mcp` boolean is threaded through `evaluate_metrics_batch(..., *,
  use_mcp=...)` and into `MetricContext`; metrics no longer self-read
  `BENCH_USE_MCP`. The kwarg falls back to the env var so the harness can
  flip its caller in a follow-up without breaking the rest of the tree.

Verifier registry:
- New `verification/registry.py` carrying `VERIFIERS`, populated by
  `@VERIFIERS.register("...")` decorators on the existing leaves
  (pod_healthy, scaling_complete) and on `SequenceSpec` / `ParallelSpec`.
- `verification/spec.py` swaps the static `Annotated[Union, Field(
  discriminator="type")]` for a `parse_node()` helper invoked from a
  `model_validator(mode="before")` on `VerificationSpec` (and on each
  compound spec, for `checks` recursion). Lookup misses surface as
  `pydantic.ValidationError` so callers see the same error surface.
- Touch on `verification/runner.py`, `verification/base.py`, and the leaf
  verifier logic is zero — the only change to leaves is the decorator. The
  existing verification regression + runner tests stay green unchanged.

Tests under `tests/unit/metrics/` cover: checklist parsing, the registry
loop, MCP gating via `MetricContext`, the legacy `to_entry()` shape (bare
value for rates, dict for judged), the single-place ` [GEval]` strip,
chaos/grounding/skill loading, the dummy-metric extension-axis acceptance,
and a fresh-interpreter assert that `import devops_bench.metrics` pulls no
`deepeval`/`mcp`/provider SDK. The verifier extension-axis test lives next
to its peers under `tests/unit/verification/`.
…; align __init__ docstring

CONVENTIONS §9: dummy ``@FAULTS.register("dummy")`` / ``@TRIGGERS.register("dummy")``
resolve via ``REGISTRY.get(key)`` with no edit to ``chaos/spec.py``. Comment
calls out Phase A's split: registry drives harness resolution; ``ChaosSpec``
parsing still requires a Union edit until Phase 4's registry-driven swap.

Also: pin ``args={}`` (missing ``command`` key) dispatch returning an
``"Error: ..."`` string (no crash); align ``chaos/__init__.py`` docstring with
the import-hygiene test it claims to satisfy — concretes load with the
package (required for the discriminated union), but provider SDKs / fortio /
``deepeval`` / ``mcp`` stay strictly lazy.
…el; dedupe compound _parse_children

Wave-2 review nits on PR #16:

- Pipeline ordering (D3): `evaluate_metrics_batch` no longer iterates
  `METRICS.values()` (which is import-order-sensitive across test sessions
  and third-party registrations). Iteration now walks a pinned
  `_BUILTIN_METRIC_ORDER` tuple of `(module, key)` pairs (outcome -> tool
  -> checklist -> grounding -> chaos), then appends any third-party keys
  in registry order. New
  `test_batch_score_insertion_order_matches_legacy_results_json`
  pins the on-disk schema.

- `parse_node` no longer bypasses every `BaseModel`. It now only accepts
  already-parsed instances whose concrete class is in `VERIFIERS`; an
  unregistered model (including as a `checks` child of a compound node)
  raises the same `ValidationError` surface as an unknown `type` string.
  Two new tests in `test_verifier_registry.py` cover the standalone and
  compound-child cases.

- The two byte-identical `_parse_children` validators on `SequenceSpec`
  and `ParallelSpec` collapse to one module-level `_parse_compound_children`
  helper attached via `model_validator(mode="before")(...)`.

Tests: 476 passed (was 473; +1 ordering, +2 unregistered-BaseModel guard).
Ruff: clean.
Wave 2 review of PR #14 found a blocking bug + three non-blocking nits.

BLOCKING — `extract_tokens` was Google-only
  Read only ``prompt_token_count`` / ``candidates_token_count`` /
  ``total_token_count`` so Anthropic (``usage.input_tokens`` /
  ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` /
  ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens,
  corrupting token metrics in results.json for every non-Google provider.

  Fix: detect which usage shape is present and map each provider's source
  fields onto the legacy on-disk key scheme (D3 results.json stability).
  A new ``_first_int_attr`` helper looks up the first non-None attribute name
  in order. Anthropic emits no aggregated total, so the helper computes
  ``prompt + candidates`` when source ``total`` is absent (with a non-zero
  prompt/candidates) so non-empty runs never log ``total_tokens: 0``.

  Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and
  end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute``
  asserting ``AgentResult.tokens`` is non-zero and correctly mapped.

NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript``
  Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so
  ``os.environ["FOO"]`` patterns can't be bypassed even if the inner
  Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly.

NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully
  The legacy ``from deepeval.tracing import observe`` was unconditional, so a
  host without deepeval crashed with ImportError on the first tool call.
  Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall
  through to the bare ``call_tool`` on failure. Tested by monkey-patching
  ``builtins.__import__`` to raise on ``deepeval`` and confirming the call
  still completes.

NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory``
  An unpaired ``{role: tool}`` result (no matching assistant call_id) is now
  intentionally dropped from the canonical trajectory (preserving the "every
  trajectory item is a real ToolCall the model issued" invariant metrics
  relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the
  new ``_fold_with_extraction_errors`` helper. Matches the CLI agents'
  "no silent-empty" policy. The drop is also debug-logged. The decision is
  documented in the function's docstring.

Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean;
``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).
…) (Phase 3 PR3)

Introduce devops_bench/agents/capabilities/ — the binding vocabulary the
orchestrator threads through AgentConfig.capabilities. Three orthogonal axes:
McpBinding (server command + tool names), SkillBinding (local SKILL.md paths),
AgentRules (operator brief text). Each binding ships with a runtime_checkable
Protocol (SupportsMcp / SupportsSkills / SupportsRules) plus a trivial mixin
so concrete agents declare *only* what they can actually consume — capability
negotiation refuses bindings an agent would silently ignore.

AgentConfig migrates from the interim PR1/PR2 fields (target/allowed_tools/
skills_paths) to a single capabilities aggregate; the legacy AGENT_TARGET env
still maps to the CLI binary path, while the new AGENT_MCP_SERVER /
AGENT_ALLOWED_TOOLS / AGENT_SKILLS_PATHS / AGENT_RULES_TEXT env vars build
the bindings. The harness will skip the env path entirely (it constructs
AgentCapabilities directly from a benchmark catalog).

Each agent now consumes the bindings:
- ApiAgent (McpMixin + SkillsMixin + RulesMixin): MCP gate = capabilities.mcp
  with a non-empty command; skills gate = capabilities.skills.paths; rules
  ride on the provider's system_instruction. Skills ⊥ MCP independence
  preserved.
- GeminiCliAgent (McpMixin + RulesMixin): --allowed-tools comes from
  capabilities.allowed_tools (aggregated across bound servers).
- OpenClawAgent (RulesMixin only): the installed oc build exposes no
  in-agent MCP or skills wiring.

Tests: new tests/unit/agents/test_agents_capabilities.py covers binding
construction, Protocol membership, and the aggregate's accessors; updated
config / api / cli tests assert each agent reads the bindings; a new AST-
walking test forbids GKE-specific runtime literals anywhere in agents/; an
import-hygiene test pins capabilities to stdlib-only. The existing
BENCH_USE_MCP / os.environ AST guard still passes. 600/600 tests green;
ruff clean on devops_bench/ + tests/unit/agents/.
… shlex.join MCP path

Wave 3 review surfaced both CLI agents advertising SupportsRules without
actually delivering the bound rules text — the mixin attribute mirror was
the entire delivery story. Fix:

- **Gemini** (cli/gemini.py): _execute now runs the binary inside a per-run
  tempfile.TemporaryDirectory and writes capabilities.rules.text to
  GEMINI.md there before invocation, then passes cwd=tmpdir to
  core.subprocess.run. The CLI auto-loads GEMINI.md from its working dir,
  which is the binary's native startup-context channel. Temp dir is cleaned
  up after _execute returns.
- **OpenClaw** (cli/openclaw.py): new _prepend_rules helper splices the
  rules text ahead of the prompt (separated by a blank line) before
  _build_local_command interpolates it into the oc agent -m '<prompt>'
  segment. oc has no dedicated rules flag, so prompt-prepending is the
  reliable, binary-agnostic delivery channel.

Tests assert the bindings actually reach the subprocess invocation:
- tests/unit/agents/test_agents_cli_gemini.py snapshots the cwd handed to
  run() and reads GEMINI.md from inside the fake run (before TemporaryDirectory
  cleanup) — proves the binary would see the file on startup. Also asserts
  no GEMINI.md is written for empty rules, and the temp dir is cleaned up.
- tests/unit/agents/test_agents_cli_openclaw.py captures the bash command
  string and asserts the rules text + prompt both appear with rules ahead
  of the prompt. Helper _prepend_rules has unit tests for the empty / non-
  empty branches.

Non-blocking nits from the review, addressed in the same commit:
- api/agent.py: shlex.join(mcp_binding.command) replaces the lossy " ".join
  so a spaced argv token (("uv run", "mcp-server")) round-trips MCPClient's
  shlex.split intact. New test
  test_execute_preserves_spaced_command_token_through_shlex_roundtrip
  verifies the round-trip.
- capabilities/aggregate.py: AgentCapabilities.mcp docstring now warns
  callers that ApiAgent honors only the first binding (multi-server fan-
  out is not implemented in PR3) — Wave 4 will see the gotcha when wiring.
- tests/unit/agents/api/test_agents_api_agent.py: renamed
  test_execute_runs_with_no_tools_when_target_unset →
  ..._when_capabilities_default to match the post-PR3 gate.
- config.py: tightened the inline comment explaining why from_env uses
  name="default" on the synthesized McpBinding.

Side effect: the prior test_execute_returns_errored_on_empty_mcp_server_string
relied on " ".join collapsing a whitespace-only command. With shlex.join
that path is no longer reachable through a binding, so the test was rewritten
to patch MCPClient directly — it still proves the agent converts
MCPClient.__aenter__ ValueErrors to AgentResult.errored() rather than
letting them bubble through the base safety net.

ruff clean (devops_bench/ + tests/unit/agents/). pytest tests/ -q → 608 passed.
…aml migration (Phase 5)

Phase 5 (D4 option b): the harness is written once, born consuming the typed
Layer-2 contracts every component now exposes — no behavior-preserving port
plus retrofit churn. The orchestrator is the only place that wires
components, and it does so by consuming registries.

Harness package (new):
- ``harness/base.py`` — ``Harness`` ABC: ``run(list[Task]) -> list[dict]``.
- ``harness/default.py`` — ``DefaultHarness``: registry-only agent
  resolution (no ``_AGENT_MODULES`` table; builtin agent modules imported
  once at call time so ``@AGENTS.register`` fires, then pure
  ``AGENTS.get(key)``); single ``BENCH_USE_MCP`` read at construction,
  threaded into both the built ``AgentConfig.capabilities`` and the
  ``evaluate_metrics_batch(use_mcp=...)`` call; explicit ``AgentConfig`` +
  ``AgentCapabilities`` (MCP / skills / rules) built from the env-layer
  bindings with MCP gated by the harness-owned boolean; opaque task blobs
  parsed into typed ``ChaosSpec`` / ``VerificationSpec`` at the harness
  boundary; ``RunContext.workspace_path`` rooted artifact diff (no
  hardcoded cwd); ``capabilities_granted`` snapshot on the result.
- ``harness/scenario.py`` — runs ``trigger.wait(ctx)`` then
  ``action.inject(ctx, event)`` on the typed ``ChaosSpec``; resolves the
  opaque ``verify:`` key against a name-keyed verification mapping the
  harness builds (chaos never imports verification);
  ``VerifierAgent().wait_for_condition(node, timeout_sec=120)`` for typed
  ``VerificationResult``; ``skip_port_forward=True`` lets the E2E smoke
  exercise the wiring without ``kubectl``.
- ``harness/reporter.py`` — ``ResultReporter`` writes ``results.json``;
  the engine depends on the sink so output stores become a substitution.
- ``harness/artifacts.py`` — workspace-aware ``snapshot_dir`` /
  ``collect_generated_files``.
- ``harness/__init__.py`` — lazy facade; ``import devops_bench.harness``
  pulls no SDK / deepeval / mcp / concrete agent.

results.json (Decision D3 — schema preserved):
- Every typed value routed through ``AgentResult.to_dict()`` /
  ``ChaosResult.model_dump()`` / ``VerificationResult.model_dump()`` /
  ``MetricScore.to_entry()``; the legacy mixed shape is unchanged.
- Failed records keep ``status: "failed"`` + ``score: 0`` + ``error`` and
  carry the same identifying keys as success records.
- New additive key: ``capabilities_granted = {"use_mcp", "skills"}`` so
  metrics / dashboards can read what the harness granted instead of
  re-reading env (CONVENTIONS.md §7 closure step).

task.yaml migration (Decision D2 — keep field names, migrate values):
- ``complextasks/optimize-scale/task.yaml``: ``chaos_spec`` /
  ``verification_spec`` migrated from JSON-in-YAML strings to native YAML.
- Field names unchanged so gke-labs#89's loader + schema continue to parse
  unmodified. The harness accepts both shapes (legacy strings round-trip
  through ``json.loads``) so any fork on the old shape lands the harness
  PR without touching its task files.
- Cross-reference: chaos ``verify: "Planned Load Spike Verification"``
  now resolves through the name-keyed mapping the harness builds from
  ``verification_spec``.
- The chaos optimize-scale regression test now exercises the native YAML
  shape end-to-end.

Tests (24 new under ``tests/unit/harness/`` + 1 in chaos):
- ``test_registry_resolution.py`` — dummy ``@AGENTS.register("dummy")``
  resolves with no harness edit; legacy ``cli`` / ``binary`` alias
  normalizes to the canonical key; unknown key raises
  ``NotRegisteredError``.
- ``test_single_env_read.py`` — ``BENCH_USE_MCP`` is read once at
  construction; same boolean reaches the agent's capability gate and the
  metrics scoring call; grep-style assertion that only one executable
  read site exists in the package.
- ``test_scenario.py`` — fake fault + trigger + verifier exercise the
  seam wiring (``trigger.wait`` -> ``action.inject``; mapping lookup;
  rewritten local URL; typed ``ChaosResult`` failure path);
  ``subprocess.Popen`` poisoned to guard against accidental real
  ``kubectl`` calls.
- ``test_reporter.py`` — golden schema for ``results.json`` (preserved
  legacy key set + additive ``capabilities_granted``).
- ``test_spec_parsing.py`` — the real optimize-scale blob parses into
  typed ``ChaosSpec`` / ``VerificationSpec``; legacy string shape still
  round-trips; loader carries native-YAML values through verbatim.
- ``test_package_import.py`` — bare ``import devops_bench.harness``
  pulls no SDK / deepeval / mcp.
- ``test_e2e_smoke.py`` — full ``complextasks/optimize-scale`` flows
  through ``DefaultHarness.run`` against ``NoOpDeployer`` with stubbed
  agent + judge + faked trigger/fault/verifier but the REAL wiring;
  asserts exactly one ``results.json`` with the preserved schema,
  populated trajectory, and both ``chaos_report`` + ``perf_report``.

Quality bar:
- ``uv run ruff check devops_bench/ tests/unit/harness/`` clean.
- ``uv run pytest tests/ -q``: 633 passed (608 baseline + 25 new).
BLOCKING fix (D3 symmetric records):
- Success and failed records now carry the SAME top-level key set
  (DefaultHarness._RECORD_KEYS); a downstream parser iterating one
  shape can never KeyError crossing into the other. Previously the
  success path emitted ``errors`` + ``scores`` while the failed path
  emitted ``error`` + ``score`` + NO ``scores``.
- New shared seed ``_empty_record(task)`` populates the symmetric
  union; ``_build_success_record`` / renamed ``_build_failed_record``
  overlay the differing values. Both records now carry ``error``,
  ``errors``, ``score``, ``scores``, ``verification_parse_errors``,
  and the existing capabilities/chaos/perf/etc fields.
- Golden tests rewritten to invoke the real builders on a stub
  AgentResult and assert ``set(record.keys()) == required_keys`` on
  the actual output (was: hand-rolled dict that let the builders
  drift silently). New ``test_legacy_failed_record_keys_are_emitted_verbatim``
  mirrors the success golden test; a direct
  ``test_success_and_failed_records_have_identical_top_level_keys``
  pins the symmetry invariant.

Required non-blocking nits:
1. Single-read regex extended to ``os.getenv`` / ``first_env`` /
   ``os.environ.get`` / ``os.environ[...]`` / bare ``getenv|environ[``
   so a sneaked-in second BENCH_USE_MCP read fails CI rather than
   passing review.
2. Verification authoring loud-fail:
   - ``_build_verification_mapping`` now returns ``(mapping, errors)``
     and accumulates every authoring failure (non-mapping entry,
     missing ``name``, JSON parse failure, model validation failure)
     so it lands on the run record as ``verification_parse_errors``
     — operator sees the broken entry on results.json, not just in
     the log.
   - ``ScenarioManager._resolve_verification`` writes a failure
     record into ``chaos_report["verification"]`` (with
     ``unresolved_reference`` / ``known_references``) when a chaos
     ``verify:`` key doesn't resolve, so a typo'd cross-reference is
     visible on the run record.
   - Canonical ``[{name, spec: <typed-node>}]`` shape documented in
     the docstring; bare-node compat path retained with a regression
     test that pins the two-shape acceptance explicitly.
3. Port-forward early-exit reap: when the Popen child has already
   exited at ``poll()`` time, ``wait()`` is called before raising so
   the process is reaped (not zombied).
4. ``_granted_skill_paths`` snapshot captured once at ``__init__``;
   per-record builders read the cached tuple instead of re-calling
   ``AgentConfig.from_env()``. Removes the env-drift surface.
5. ``_drain_scenario`` checks ``thread.is_alive()`` after join;
   warns and stamps ``chaos_report["status"]="timed_out"`` so
   partial results are flagged on the record.
6. ``_ensure_builtin_agents_registered`` now catches ONLY
   ``ImportError`` / ``MissingDependencyError``; a real bug
   (``SyntaxError``, etc.) in an agent module re-raises so it is not
   invisible behind a debug log.
7. E2E smoke ``_CHAOS_ACTIVE_WAIT_SEC`` monkeypatch now references
   the live module attribute and asserts the patched value (was a
   no-op assertion on the test's frozen ``from ... import`` alias).

OPTIONAL (also landed):
- ``_DEFAULT_TARGET_DEPLOYMENT`` / ``_DEFAULT_NAMESPACE`` are now
  ``DefaultHarness`` constructor args (current literals as defaults)
  so a non-Hypercompute embedder can override without monkey-patching
  the module constant.

Tests:
- 13 new tests under tests/unit/harness/ (4 reporter goldens, 5
  default-harness internals, 4 verification authoring path tests).
- ``uv run pytest tests/ -q``: 646 passed (was 633; +13 new).
- ``uv run ruff check devops_bench/ tests/unit/``: clean.
…ted matches the agent's actual config

Resolves the consistency gap a senior reviewer flagged on PR #18: the
harness snapshotted ``use_mcp`` and ``_granted_skill_paths`` at
``__init__``, but ``build_agent_config()`` re-read ``AgentConfig.from_env()``
per agent run, so a mid-batch ``AGENT_*`` env mutation desynced what the
agent ran with from what the record claimed it ran with.

Fix: build the gated ``AgentConfig`` ONCE in ``__init__`` via the new
``_build_agent_config_snapshot()`` helper and store it on
``self._agent_config``. ``build_agent_config()`` becomes a pure accessor
that returns the same object identity across calls; ``_granted_skill_paths``
becomes a derived property reading off the snapshot. ``capabilities_granted``
on every record is therefore the SAME source the agent was constructed
from — structurally impossible to desync.

Tests (tests/unit/harness/test_default_harness.py):
- ``test_build_agent_config_returns_identical_snapshot_across_calls``
  pins ``a is b`` over back-to-back calls (no per-call rebuild).
- ``test_capabilities_granted_matches_agent_config_even_after_env_mutation``
  mutates ``AGENT_SKILLS_PATHS`` / ``BENCH_USE_MCP`` / ``AGENT_MCP_SERVER``
  AFTER construction and asserts the record + agent agree.

Verified the new tests catch the regression by temporarily reverting
the production fix: both new tests fail, the identity check catches
the per-call rebuild and the desync check catches the env-mutation
leak.

Bar: ``uv run ruff check devops_bench/ tests/unit/harness/`` clean;
``uv run pytest tests/ -q``: 648 passed (was 646 on base; +2 new).
…I + Gemini policy

`parse_trajectory_export` used to synthesize a free-floating ToolCall entry
for an unpaired `tool_result` while also surfacing it on `errors`. The API
agent's `_fold_with_extraction_errors` and the Gemini stream-json parser
both DROP the orphan (errors-only), so the two CLI agents disagreed with
the API agent at the metrics seam — only OpenClaw would feed metrics
synthetic entries that no real tool_call backed.

Make OpenClaw mirror the API + Gemini policy: drop the orphan from the
trajectory, surface it on `AgentResult.errors` with a content preview so
nothing is silent-empty. Comment quotes the API agent's invariant verbatim
("every trajectory item is a real ToolCall the model issued") so the rule
is uniform and rediscoverable across all three agents. Updates the
docstring and replaces the test (now asserts orphan ABSENT from trajectory
AND PRESENT in errors, with the id surfaced).

Tests: 646 passed. ruff clean.
…zy ChaosAgent import

Completes the chaos Phase-A → Phase-4 swap from CONVENTIONS §4, fixing senior
review's B1 (chaos import drags in the agent + models chain) and S1 (adding a
fault still needs a central edit to ``chaos/spec.py``).

Changes
- New ``chaos/registry.py`` hosts ``FAULTS`` / ``TRIGGERS`` (mirrors
  ``verification/registry.py``), with ``entry_point_group`` for external
  packages. ``chaos/base.py`` re-exports both for backward compatibility.
- ``chaos/spec.py`` drops the hand-maintained
  ``Annotated[Union, Field(discriminator="type")]``. ``ChaosSpec`` now carries
  a ``model_validator(mode="before")`` that routes ``trigger`` / ``action``
  payloads through ``parse_trigger`` / ``parse_fault`` — registry-driven
  resolution with ``ValidationError`` listing registered keys on miss. The
  parser lazy-imports the bundled fault/trigger modules on first parse, so the
  spec module itself depends only on ``core`` and ``pydantic``.
- ``GenerateLoadFault.inject`` lazy-imports ``ChaosAgent`` so loading the
  fault module for registry registration does not pull
  ``devops_bench.chaos.agent`` or the ``devops_bench.models`` chain.
- ``chaos/__init__.py`` docstring rewritten to the Phase-4 reality: importing
  the package now pulls only ``chaos.{__init__, base, registry, spec}`` and
  ``core.*`` — concretes load lazily on first parse; agent + models load only
  when ``Fault.inject`` runs.

Tests
- Stronger import-hygiene tests assert that ``devops_bench.chaos.agent``,
  ``devops_bench.models.loop``, and ``devops_bench.models.base`` are absent
  from ``sys.modules`` after both ``import devops_bench.chaos`` and after
  ``ChaosSpec.model_validate(...)`` runs.
- Extension-axis tests upgraded: a dummy ``@FAULTS.register("dummy")`` /
  ``@TRIGGERS.register("dummy")`` now parses through ``ChaosSpec`` with no
  edit to ``chaos/spec.py`` (the Phase-4 bar). Plus an error-message test
  proving the unknown-type ValidationError enumerates the registered keys.
- ``test_generate_load`` re-targets the patch at
  ``devops_bench.chaos.agent.ChaosAgent`` (lazy import inside ``inject``).
- ``test_spec`` swaps the JSON-schema assertion to reflect Phase-4 (the
  schema no longer enumerates concrete types — that's what lets a new fault
  land without a schema edit) and adds a missing-type-discriminator case.

Bar: ``uv run ruff check devops_bench/ tests/unit/chaos/`` clean;
``uv run pytest tests/ -q`` → 652 passed.
…ture

Fixes B2: `tests/unit/chaos/test_extension_axis.py` failed in isolation
because the fixture snapshotted `FAULTS`/`TRIGGERS` *before* Phase 4's lazy
`_ensure_concretes_loaded()` populated them; teardown then removed the
lazily-added bundled keys, leaving subsequent tests staring at
`registered: []`. Calling `_ensure_concretes_loaded()` at fixture entry —
before the snapshot — keeps the real baseline intact so teardown restores
to the bundled set, not an empty one. Only this file uses the snapshot
pattern; nothing else needed updating.
@pradeepvrd

Copy link
Copy Markdown
Owner Author

Superseded by the decomposed stacked PRs #26#33 (reproduce this tree exactly; §7 lossless diff empty). Keeping the refactor/integration branch for now.

@pradeepvrd pradeepvrd closed this Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant