Skip to content

feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2)#14

Merged
pradeepvrd merged 2 commits into
refactor/integrationfrom
refactor/agents-api
Jun 20, 2026
Merged

feat(agents): ApiAgent on run_tool_loop; skills⊥MCP; no env-smuggling (Phase 3 PR2)#14
pradeepvrd merged 2 commits into
refactor/integrationfrom
refactor/agents-api

Conversation

@pradeepvrd

Copy link
Copy Markdown
Owner

Scope (Phase 3 PR2)

Reworks open PR #9 — the devops_bench/agents/api/ package — onto the Wave 1 foundations:

Agents and chaos are now true siblings: both depend on models, neither on the other. No agent code reaches around the typed seam any more.

Conventions adhered to

Convention (docs/refactor/CONVENTIONS.md) How this PR honors it
§3 AgentResult / ToolCall shapes (dataclass) _execute returns an AgentResult; trajectory is a list of ToolCall.to_dict() entries
§5 neutral message contract Tool calls and results flow through run_tool_loop unchanged
§6 run_tool_loop (caller-formats-tools) Agent calls client.format_tools(...) before invoking the loop; the loop sees pre-formatted descriptors
§7 env rules — no env-smuggling No BENCH_USE_MCP read; no os.environ; capability/MCP on-off comes from AgentConfig (target gates MCP, skills_paths gates skills); get_model(provider, model) called with EXPLICIT args
§8 light __init__ / lazy imports import devops_bench.agents.api pulls no mcp/deepeval/SDK (asserted in two import-hygiene tests)
§9 test bar ruff clean; 464 pytest GREEN (421 baseline + 43 new); Google-style docstrings throughout

Key implementation points

  • ApiAgent (agents/api/agent.py): subclasses AgentHarness, overrides _execute. Builds the LLM client via get_model(config.provider, config.model) with EXPLICIT args, opens an MCP session when config.target is set, discovers local skills when config.skills_paths is set, format tools, then drives run_tool_loop. AgentResult.output = LoopResult.final_text, latency / tokens / tools_used propagate through unchanged (semantically byte-identical to the OLD api loop — same usage_metadata/usage extraction, same per-generate_content latency accounting that run_tool_loop already preserves).
  • fold_trajectory: collapses LoopResult.contents into canonical ToolCall entries — each assistant tool_calls[*] is paired with its matching {role:tool, tool_call_id, ...} result by id. Unmatched calls survive as status="called" (no silent drop); dispatcher Error: ... strings surface as status="error".
  • Skills ⊥ MCP: agents/api/skills.py is a pure stdlib discovery module (no MCP coupling, no heavy imports). The dispatch closure routes skill names to a local file read; MCP-only tool names hit the MCP session.
  • Dispatch error handling: each tool call is wrapped in its own try/except inside _build_dispatch. A tool failure lands on AgentResult.errors and the trajectory (status="error"), but the loop keeps going — one failing tool never aborts the benchmark. (run_tool_loop itself propagates dispatch errors by design — the agent's wrapper is what makes that benchmark-safe.)
  • No legacy surface: ApiAgent.run(prompt) now takes only (self, prompt)context / system_instruction are gone. The _SKILLS_DIR = "third_party/gke-mcp/skills" constant is deleted. BENCH_USE_MCP is never read inside the agent (an AST walk in tests enforces this).
  • AgentConfig gains two interim fields: skills_paths (PR3 migrates → SkillBinding.paths) and max_turns (env: AGENT_SKILLS_PATHS, AGENT_MAX_TURNS).

Tests added (tests/unit/agents/api/)

  • test_agents_api_agent.py — registration under "api", fold_trajectory (pairing, error status, unmatched call, text-only-turn skip, defensive args/id handling), extract_tokens (usage_metadata + fallback to usage, missing usage → {}), ApiAgent._execute end-to-end (MCP-off path, explicit provider/model wiring, full trajectory folding with MCP on, dispatch error → errors + trajectory.error, missing-MCP graceful path, skills-without-MCP, empty MCP server → errored result, env ignored), AST walk asserting zero BENCH_USE_MCP / os.environ / env-helper imports, public surface signature check, sync _execute invariant.
  • test_agents_api_skills.py — frontmatter parsing, normalization, multi-path walk, empty/missing path handling, plus a subprocess-isolated import-hygiene check on skills.py.
  • test_agents_api_mcp.pyextract_tool_text joining/fallbacks, empty server_path rejected with ValueError, subprocess-isolated import-hygiene on mcp.py (the SDK does not load until __aenter__).
  • test_agents_api_import_hygiene.py — subprocess-isolated assertion that both import devops_bench.agents.api and import devops_bench.agents.api.agent pull no mcp/deepeval/SDK at module top.

Existing tests/unit/agents/test_agents_config.py is extended to cover the two new fields.

Reviewer call-outs

  1. fold_trajectory is the load-bearing extraction. Verify the assistant↔tool pairing by tool_call_id matches what the model adapters actually produce. The dispatcher emits "Error: ..." strings as the tool result text, and the folder detects that prefix to mark the entry status="error" — this is the single coupling between dispatcher and folder.
  2. AgentResult.errors semantics — dispatch failures are recorded both on errors and as a status="error" trajectory entry. The model still sees the error text via Error: ... as the tool result, so it can react on the next turn. Confirm this matches what metrics/pipeline.py expects.
  3. MCP gate = config.target (no BENCH_USE_MCP env) — the harness PR (Phase 5) will own the env→config translation. Until then, anyone driving ApiAgent directly must set config.target to enable MCP.
  4. Skills gate = config.skills_paths — fully independent of MCP. The PR3 capabilities work migrates this to SkillBinding.paths.
  5. max_turns lives on AgentConfig now; the agent never reads AGENT_MAX_TURNS directly. Confirm the orchestrator threads it through.
  6. Static AST check for env-smuggling — the test walks agents/api/*.py for os.environ / get_env / BENCH_USE_MCP accessor patterns. Worth confirming the false-positive avoidance (it allows the names in docstrings but rejects them on any line that mentions environ / getenv / get_env).

Stack

This PR is stacked on refactor/integration (Wave 1 merged) and is the agents-side input to Wave 2. The chaos refactor (rework #7) runs in parallel and consumes the same run_tool_loop primitive.

… (Phase 3 PR2)

Rework the open PR #9 ``agents/api`` package onto the AgentHarness template
method, AgentConfig, and AgentResult landed in Wave 1, and onto the shared
``models/loop.run_tool_loop`` primitive — agents and chaos are now true
siblings, neither carrying its own loop.

Concrete changes:

* ``ApiAgent._execute`` builds the LLM client via ``get_model(provider, model)``
  with EXPLICIT args from ``AgentConfig`` (no env reads inside the agent),
  pre-formats tools via ``client.format_tools(...)`` (caller-formats-tools, per
  CONVENTIONS §6), and drives ``run_tool_loop``.
* A ``_build_dispatch`` closure routes each tool call to either a skill file
  read or MCP, wrapping every call in its own try/except so one tool failure
  lands on ``AgentResult.errors`` and the trajectory (``status="error"``)
  rather than aborting the run (``run_tool_loop`` propagates dispatch errors
  by design).
* ``fold_trajectory`` collapses ``LoopResult.contents`` into canonical
  :class:`ToolCall` entries, pairing assistant-issued ``tool_calls`` with the
  matching ``role: tool`` results by ``tool_call_id``. ``AgentResult.output``
  is ``LoopResult.final_text``; tokens/latency/tools_used flow through
  unchanged (semantically byte-identical to the OLD api loop).
* **Skills are decoupled from MCP.** ``config.skills_paths`` drives skill
  discovery and ``config.target`` drives MCP — an agent may have one, both,
  or neither. The hardcoded ``third_party/gke-mcp/skills`` constant is gone.
* The agent never reads ``BENCH_USE_MCP`` (or any env var); the legacy
  ``context``/``system_instruction`` grab-bag is dropped from ``run()``'s
  public surface (Rules arrive in PR3).
* ``AgentConfig`` gains interim ``skills_paths`` / ``max_turns`` fields
  (CSV-parsed from ``AGENT_SKILLS_PATHS`` / ``AGENT_MAX_TURNS``); PR3 migrates
  ``skills_paths`` into ``SkillBinding.paths``.

Tests (``tests/unit/agents/api/``):

* fake ``LLMClient`` + fake MCP/skills; trajectory folding correctness
  (assistant+tool pairs → canonical ToolCall entries; unmatched call ↔
  ``status="called"``; dispatcher ``Error: ...`` strings ↔ ``status="error"``);
* a dispatch ``RuntimeError`` lands in errors and on the trajectory, the run
  does not crash;
* skills load on the MCP-off path (skills ⊥ MCP);
* an AST walk asserts no ``BENCH_USE_MCP`` / ``os.environ`` / ``get_env``
  reads anywhere under ``agents/api/``;
* ``import devops_bench.agents.api`` and ``...api.agent`` pull no
  ``mcp``/``deepeval``/SDK at module top.

Quality bar: ``uv run ruff check devops_bench/ tests/`` clean;
``uv run pytest tests/ -q`` GREEN — 464 passed (421 baseline + 43 new).
Wave 2 review of PR #14 found a blocking bug + three non-blocking nits.

BLOCKING — `extract_tokens` was Google-only
  Read only ``prompt_token_count`` / ``candidates_token_count`` /
  ``total_token_count`` so Anthropic (``usage.input_tokens`` /
  ``output_tokens``) and OpenAI / Ollama (``usage.prompt_tokens`` /
  ``completion_tokens`` / ``total_tokens``) responses returned all-zero tokens,
  corrupting token metrics in results.json for every non-Google provider.

  Fix: detect which usage shape is present and map each provider's source
  fields onto the legacy on-disk key scheme (D3 results.json stability).
  A new ``_first_int_attr`` helper looks up the first non-None attribute name
  in order. Anthropic emits no aggregated total, so the helper computes
  ``prompt + candidates`` when source ``total`` is absent (with a non-zero
  prompt/candidates) so non-empty runs never log ``total_tokens: 0``.

  Tests: Anthropic-shape, OpenAI/Ollama-shape, key-scheme stability, and
  end-to-end Anthropic/OpenAI runs that flow through ``ApiAgent._execute``
  asserting ``AgentResult.tokens`` is non-zero and correctly mapped.

NON-BLOCKING NIT 1 — AST env-scan now covers ``ast.Subscript``
  Added an explicit ``ast.Subscript`` branch in the env-smuggling guard so
  ``os.environ["FOO"]`` patterns can't be bypassed even if the inner
  Attribute walk regresses. Also caught ``os.environ.get(...)`` explicitly.

NON-BLOCKING NIT 2 — ``MCPClient.call_tool`` degrades gracefully
  The legacy ``from deepeval.tracing import observe`` was unconditional, so a
  host without deepeval crashed with ImportError on the first tool call.
  Mirrors ``_maybe_observe`` from ``agents/base.py``: try the import, fall
  through to the bare ``call_tool`` on failure. Tested by monkey-patching
  ``builtins.__import__`` to raise on ``deepeval`` and confirming the call
  still completes.

NON-BLOCKING NIT 3 — explicit orphan-result handling in ``fold_trajectory``
  An unpaired ``{role: tool}`` result (no matching assistant call_id) is now
  intentionally dropped from the canonical trajectory (preserving the "every
  trajectory item is a real ToolCall the model issued" invariant metrics
  relies on) AND surfaced as a diagnostic on ``AgentResult.errors`` via the
  new ``_fold_with_extraction_errors`` helper. Matches the CLI agents'
  "no silent-empty" policy. The drop is also debug-logged. The decision is
  documented in the function's docstring.

Bar: ``uv run ruff check devops_bench/ tests/unit/agents/`` clean;
``uv run pytest tests/ -q`` GREEN — 473 passed (464 baseline + 9 new).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant