feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2) by pradeepvrd · Pull Request #14 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-20T06:12:13Z

Scope (Phase 3 PR2)

Reworks open PR #9 — the devops_bench/agents/api/ package — onto the Wave 1 foundations:

the AgentHarness template / AgentConfig / AgentResult landed by Wave 1 PR1 (feat(agents): AgentHarness base + CLI agents (Phase 3 PR1) #13)
the shared models/loop.run_tool_loop primitive landed by Wave 1 PR (feat(models): run_tool_loop shared turn-loop primitive (Phase 1, D1) #11)

Agents and chaos are now true siblings: both depend on models, neither on the other. No agent code reaches around the typed seam any more.

Conventions adhered to

Convention (`docs/refactor/CONVENTIONS.md`)	How this PR honors it
§3 `AgentResult` / `ToolCall` shapes (dataclass)	`_execute` returns an `AgentResult`; trajectory is a list of `ToolCall.to_dict()` entries
§5 neutral message contract	Tool calls and results flow through `run_tool_loop` unchanged
§6 `run_tool_loop` (caller-formats-tools)	Agent calls `client.format_tools(...)` before invoking the loop; the loop sees pre-formatted descriptors
§7 env rules — no env-smuggling	No `BENCH_USE_MCP` read; no `os.environ`; capability/MCP on-off comes from `AgentConfig` (`target` gates MCP, `skills_paths` gates skills); `get_model(provider, model)` called with EXPLICIT args
§8 light `__init__` / lazy imports	`import devops_bench.agents.api` pulls no `mcp`/`deepeval`/SDK (asserted in two import-hygiene tests)
§9 test bar	ruff clean; 464 pytest GREEN (421 baseline + 43 new); Google-style docstrings throughout

Key implementation points

ApiAgent (agents/api/agent.py): subclasses AgentHarness, overrides _execute. Builds the LLM client via get_model(config.provider, config.model) with EXPLICIT args, opens an MCP session when config.target is set, discovers local skills when config.skills_paths is set, format tools, then drives run_tool_loop. AgentResult.output = LoopResult.final_text, latency / tokens / tools_used propagate through unchanged (semantically byte-identical to the OLD api loop — same usage_metadata/usage extraction, same per-generate_content latency accounting that run_tool_loop already preserves).
fold_trajectory: collapses LoopResult.contents into canonical ToolCall entries — each assistant tool_calls[*] is paired with its matching {role:tool, tool_call_id, ...} result by id. Unmatched calls survive as status="called" (no silent drop); dispatcher Error: ... strings surface as status="error".
Skills ⊥ MCP: agents/api/skills.py is a pure stdlib discovery module (no MCP coupling, no heavy imports). The dispatch closure routes skill names to a local file read; MCP-only tool names hit the MCP session.
Dispatch error handling: each tool call is wrapped in its own try/except inside _build_dispatch. A tool failure lands on AgentResult.errors and the trajectory (status="error"), but the loop keeps going — one failing tool never aborts the benchmark. (run_tool_loop itself propagates dispatch errors by design — the agent's wrapper is what makes that benchmark-safe.)
No legacy surface: ApiAgent.run(prompt) now takes only (self, prompt) — context / system_instruction are gone. The _SKILLS_DIR = "third_party/gke-mcp/skills" constant is deleted. BENCH_USE_MCP is never read inside the agent (an AST walk in tests enforces this).
AgentConfig gains two interim fields: skills_paths (PR3 migrates → SkillBinding.paths) and max_turns (env: AGENT_SKILLS_PATHS, AGENT_MAX_TURNS).

Tests added (`tests/unit/agents/api/`)

test_agents_api_agent.py — registration under "api", fold_trajectory (pairing, error status, unmatched call, text-only-turn skip, defensive args/id handling), extract_tokens (usage_metadata + fallback to usage, missing usage → {}), ApiAgent._execute end-to-end (MCP-off path, explicit provider/model wiring, full trajectory folding with MCP on, dispatch error → errors + trajectory.error, missing-MCP graceful path, skills-without-MCP, empty MCP server → errored result, env ignored), AST walk asserting zero BENCH_USE_MCP / os.environ / env-helper imports, public surface signature check, sync _execute invariant.
test_agents_api_skills.py — frontmatter parsing, normalization, multi-path walk, empty/missing path handling, plus a subprocess-isolated import-hygiene check on skills.py.
test_agents_api_mcp.py — extract_tool_text joining/fallbacks, empty server_path rejected with ValueError, subprocess-isolated import-hygiene on mcp.py (the SDK does not load until __aenter__).
test_agents_api_import_hygiene.py — subprocess-isolated assertion that both import devops_bench.agents.api and import devops_bench.agents.api.agent pull no mcp/deepeval/SDK at module top.

Existing tests/unit/agents/test_agents_config.py is extended to cover the two new fields.

Reviewer call-outs

fold_trajectory is the load-bearing extraction. Verify the assistant↔tool pairing by tool_call_id matches what the model adapters actually produce. The dispatcher emits "Error: ..." strings as the tool result text, and the folder detects that prefix to mark the entry status="error" — this is the single coupling between dispatcher and folder.
AgentResult.errors semantics — dispatch failures are recorded both on errors and as a status="error" trajectory entry. The model still sees the error text via Error: ... as the tool result, so it can react on the next turn. Confirm this matches what metrics/pipeline.py expects.
MCP gate = config.target (no BENCH_USE_MCP env) — the harness PR (Phase 5) will own the env→config translation. Until then, anyone driving ApiAgent directly must set config.target to enable MCP.
Skills gate = config.skills_paths — fully independent of MCP. The PR3 capabilities work migrates this to SkillBinding.paths.
max_turns lives on AgentConfig now; the agent never reads AGENT_MAX_TURNS directly. Confirm the orchestrator threads it through.
Static AST check for env-smuggling — the test walks agents/api/*.py for os.environ / get_env / BENCH_USE_MCP accessor patterns. Worth confirming the false-positive avoidance (it allows the names in docstrings but rejects them on any line that mentions environ / getenv / get_env).

Stack

This PR is stacked on refactor/integration (Wave 1 merged) and is the agents-side input to Wave 2. The chaos refactor (rework #7) runs in parallel and consumes the same run_tool_loop primitive.

… (Phase 3 PR2) Rework the open PR #9 ``agents/api`` package onto the AgentHarness template method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared ``models/loop.run_tool_loop`` primitive — agents and chaos are now true siblings, neither carrying its own loop. Concrete changes: * ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)`` with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent), pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per CONVENTIONS §6), and drives ``run_tool_loop``. * A ``_build_dispatch`` closure routes each tool call to either a skill file read or MCP, wrapping every call in its own try/except so one tool failure lands on ``AgentResult.errors`` and the trajectory (``status="error"``) rather than aborting the run (``run_tool_loop`` propagates dispatch errors by design). * ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output`` is ``LoopResult.final_text``; tokens/latency/tools_used flow through unchanged (semantically byte-identical to the OLD api loop). * **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill discovery and ``config.target`` drives MCP — an agent may have one, both, or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone. * The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s public surface (Rules arrive in PR3). * ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates ``skills_paths`` into ``SkillBinding.paths``. Tests (``tests/unit/agents/api/``): * fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔ ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``); * a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run does not crash; * skills load on the MCP-off path (skills ⊥ MCP); * an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env`` reads anywhere under ``agents/api/``; * ``import devops_bench.agents.api`` and ``...api.agent`` pull no ``mcp``/``deepeval``/SDK at module top. Quality bar: ``uv run ruff check devops_bench/ tests/`` clean; ``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).

Wave 2 review of PR #14 found a blocking bug + three non-blocking nits. BLOCKING — `extract_tokens` was Google-only Read only ``prompt_token_count`` / ``candidates_token_count`` / ``total_token_count`` so Anthropic (``usage.input_tokens`` / ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` / ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens, corrupting token metrics in results.json for every non-Google provider. Fix: detect which usage shape is present and map each provider's source fields onto the legacy on-disk key scheme (D3 results.json stability). A new ``_first_int_attr`` helper looks up the first non-None attribute name in order. Anthropic emits no aggregated total, so the helper computes ``prompt + candidates`` when source ``total`` is absent (with a non-zero prompt/candidates) so non-empty runs never log ``total_tokens: 0``. Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute`` asserting ``AgentResult.tokens`` is non-zero and correctly mapped. NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript`` Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so ``os.environ["FOO"]`` patterns can't be bypassed even if the inner Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly. NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully The legacy ``from deepeval.tracing import observe`` was unconditional, so a host without deepeval crashed with ImportError on the first tool call. Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall through to the bare ``call_tool`` on failure. Tested by monkey-patching ``builtins.__import__`` to raise on ``deepeval`` and confirming the call still completes. NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory`` An unpaired ``{role: tool}`` result (no matching assistant call_id) is now intentionally dropped from the canonical trajectory (preserving the "every trajectory item is a real ToolCall the model issued" invariant metrics relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the new ``_fold_with_extraction_errors`` helper. Matches the CLI agents' "no silent-empty" policy. The drop is also debug-logged. The decision is documented in the function's docstring. Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean; ``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).

pradeepvrd added 2 commits June 19, 2026 23:11

pradeepvrd merged commit 42ec015 into refactor/integration Jun 20, 2026

pradeepvrd mentioned this pull request Jun 20, 2026

Cross-cutting harness refactor: layered devops_bench (Stage 1.5–3, reconciled) #23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2)#14

feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2)#14
pradeepvrd merged 2 commits into
refactor/integrationfrom
refactor/agents-api

pradeepvrd commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pradeepvrd commented Jun 20, 2026

Scope (Phase 3 PR2)

Conventions adhered to

Key implementation points

Tests added (tests/unit/agents/api/)

Reviewer call-outs

Stack

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tests added (`tests/unit/agents/api/`)