feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2)#14
Merged
Merged
Conversation
… (Phase 3 PR2) Rework the open PR #9 ``agents/api`` package onto the AgentHarness template method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared ``models/loop.run_tool_loop`` primitive — agents and chaos are now true siblings, neither carrying its own loop. Concrete changes: * ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)`` with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent), pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per CONVENTIONS §6), and drives ``run_tool_loop``. * A ``_build_dispatch`` closure routes each tool call to either a skill file read or MCP, wrapping every call in its own try/except so one tool failure lands on ``AgentResult.errors`` and the trajectory (``status="error"``) rather than aborting the run (``run_tool_loop`` propagates dispatch errors by design). * ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output`` is ``LoopResult.final_text``; tokens/latency/tools_used flow through unchanged (semantically byte-identical to the OLD api loop). * **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill discovery and ``config.target`` drives MCP — an agent may have one, both, or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone. * The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s public surface (Rules arrive in PR3). * ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates ``skills_paths`` into ``SkillBinding.paths``. Tests (``tests/unit/agents/api/``): * fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔ ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``); * a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run does not crash; * skills load on the MCP-off path (skills ⊥ MCP); * an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env`` reads anywhere under ``agents/api/``; * ``import devops_bench.agents.api`` and ``...api.agent`` pull no ``mcp``/``deepeval``/SDK at module top. Quality bar: ``uv run ruff check devops_bench/ tests/`` clean; ``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).
Wave 2 review of PR #14 found a blocking bug + three non-blocking nits. BLOCKING — `extract_tokens` was Google-only Read only ``prompt_token_count`` / ``candidates_token_count`` / ``total_token_count`` so Anthropic (``usage.input_tokens`` / ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` / ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens, corrupting token metrics in results.json for every non-Google provider. Fix: detect which usage shape is present and map each provider's source fields onto the legacy on-disk key scheme (D3 results.json stability). A new ``_first_int_attr`` helper looks up the first non-None attribute name in order. Anthropic emits no aggregated total, so the helper computes ``prompt + candidates`` when source ``total`` is absent (with a non-zero prompt/candidates) so non-empty runs never log ``total_tokens: 0``. Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute`` asserting ``AgentResult.tokens`` is non-zero and correctly mapped. NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript`` Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so ``os.environ["FOO"]`` patterns can't be bypassed even if the inner Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly. NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully The legacy ``from deepeval.tracing import observe`` was unconditional, so a host without deepeval crashed with ImportError on the first tool call. Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall through to the bare ``call_tool`` on failure. Tested by monkey-patching ``builtins.__import__`` to raise on ``deepeval`` and confirming the call still completes. NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory`` An unpaired ``{role: tool}`` result (no matching assistant call_id) is now intentionally dropped from the canonical trajectory (preserving the "every trajectory item is a real ToolCall the model issued" invariant metrics relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the new ``_fold_with_extraction_errors`` helper. Matches the CLI agents' "no silent-empty" policy. The drop is also debug-logged. The decision is documented in the function's docstring. Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean; ``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Scope (Phase 3 PR2)
Reworks open PR #9 — the
devops_bench/agents/api/package — onto the Wave 1 foundations:models/loop.run_tool_loopprimitive landed by Wave 1 PR (feat(models): run_tool_loop shared turn-loop primitive (Phase 1, D1) #11)Agents and chaos are now true siblings: both depend on
models, neither on the other. No agent code reaches around the typed seam any more.Conventions adhered to
docs/refactor/CONVENTIONS.md)AgentResult/ToolCallshapes (dataclass)_executereturns anAgentResult; trajectory is a list ofToolCall.to_dict()entriesrun_tool_loopunchangedrun_tool_loop(caller-formats-tools)client.format_tools(...)before invoking the loop; the loop sees pre-formatted descriptorsBENCH_USE_MCPread; noos.environ; capability/MCP on-off comes fromAgentConfig(targetgates MCP,skills_pathsgates skills);get_model(provider, model)called with EXPLICIT args__init__/ lazy importsimport devops_bench.agents.apipulls nomcp/deepeval/SDK (asserted in two import-hygiene tests)Key implementation points
ApiAgent(agents/api/agent.py): subclassesAgentHarness, overrides_execute. Builds the LLM client viaget_model(config.provider, config.model)with EXPLICIT args, opens an MCP session whenconfig.targetis set, discovers local skills whenconfig.skills_pathsis set, format tools, then drivesrun_tool_loop.AgentResult.output = LoopResult.final_text,latency/tokens/tools_usedpropagate through unchanged (semantically byte-identical to the OLD api loop — sameusage_metadata/usageextraction, same per-generate_contentlatency accounting thatrun_tool_loopalready preserves).fold_trajectory: collapsesLoopResult.contentsinto canonicalToolCallentries — each assistanttool_calls[*]is paired with its matching{role:tool, tool_call_id, ...}result by id. Unmatched calls survive asstatus="called"(no silent drop); dispatcherError: ...strings surface asstatus="error".agents/api/skills.pyis a pure stdlib discovery module (no MCP coupling, no heavy imports). The dispatch closure routes skill names to a local file read; MCP-only tool names hit the MCP session._build_dispatch. A tool failure lands onAgentResult.errorsand the trajectory (status="error"), but the loop keeps going — one failing tool never aborts the benchmark. (run_tool_loopitself propagates dispatch errors by design — the agent's wrapper is what makes that benchmark-safe.)ApiAgent.run(prompt)now takes only(self, prompt)—context/system_instructionare gone. The_SKILLS_DIR = "third_party/gke-mcp/skills"constant is deleted.BENCH_USE_MCPis never read inside the agent (an AST walk in tests enforces this).AgentConfiggains two interim fields:skills_paths(PR3 migrates →SkillBinding.paths) andmax_turns(env:AGENT_SKILLS_PATHS,AGENT_MAX_TURNS).Tests added (
tests/unit/agents/api/)test_agents_api_agent.py— registration under"api",fold_trajectory(pairing, error status, unmatched call, text-only-turn skip, defensive args/id handling),extract_tokens(usage_metadata + fallback to usage, missing usage →{}),ApiAgent._executeend-to-end (MCP-off path, explicit provider/model wiring, full trajectory folding with MCP on, dispatch error → errors + trajectory.error, missing-MCP graceful path, skills-without-MCP, empty MCP server → errored result, env ignored), AST walk asserting zeroBENCH_USE_MCP/os.environ/ env-helper imports, public surface signature check, sync_executeinvariant.test_agents_api_skills.py— frontmatter parsing, normalization, multi-path walk, empty/missing path handling, plus a subprocess-isolated import-hygiene check onskills.py.test_agents_api_mcp.py—extract_tool_textjoining/fallbacks, emptyserver_pathrejected withValueError, subprocess-isolated import-hygiene onmcp.py(the SDK does not load until__aenter__).test_agents_api_import_hygiene.py— subprocess-isolated assertion that bothimport devops_bench.agents.apiandimport devops_bench.agents.api.agentpull nomcp/deepeval/SDK at module top.Existing
tests/unit/agents/test_agents_config.pyis extended to cover the two new fields.Reviewer call-outs
fold_trajectoryis the load-bearing extraction. Verify the assistant↔tool pairing bytool_call_idmatches what the model adapters actually produce. The dispatcher emits"Error: ..."strings as the tool result text, and the folder detects that prefix to mark the entrystatus="error"— this is the single coupling between dispatcher and folder.AgentResult.errorssemantics — dispatch failures are recorded both onerrorsand as astatus="error"trajectory entry. The model still sees the error text viaError: ...as the tool result, so it can react on the next turn. Confirm this matches whatmetrics/pipeline.pyexpects.config.target(noBENCH_USE_MCPenv) — the harness PR (Phase 5) will own the env→config translation. Until then, anyone drivingApiAgentdirectly must setconfig.targetto enable MCP.config.skills_paths— fully independent of MCP. The PR3 capabilities work migrates this toSkillBinding.paths.max_turnslives onAgentConfignow; the agent never readsAGENT_MAX_TURNSdirectly. Confirm the orchestrator threads it through.agents/api/*.pyforos.environ/get_env/BENCH_USE_MCPaccessor patterns. Worth confirming the false-positive avoidance (it allows the names in docstrings but rejects them on any line that mentionsenviron/getenv/get_env).Stack
This PR is stacked on
refactor/integration(Wave 1 merged) and is the agents-side input to Wave 2. The chaos refactor (rework #7) runs in parallel and consumes the samerun_tool_loopprimitive.