Skip to content

Support a local option to test for inner dev loop#71

Closed
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:claude/ollama-e2e-testing-ugpvk4
Closed

Support a local option to test for inner dev loop#71
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:claude/ollama-e2e-testing-ugpvk4

Conversation

@adrianchung

Copy link
Copy Markdown
Collaborator

Closes #70

Summary

  • Adds OllamaClientAdapter and OllamaDeepEvalModel so benchmarks can run fully offline using a local Ollama server instead of a cloud LLM provider
  • Adds scripts/setup_local_env.sh to install Ollama, kind, and pull the model in a single step for CI and dev setup
  • Adds NoOpDeployer (activated by BENCH_NO_INFRA=true) to skip OpenTofu provisioning when a cluster is already available locally
  • Fixes empty agent responses when BENCH_USE_MCP=false by using a separate SYSTEM_INSTRUCTION_NO_MCP that does not require tool calls
  • Adds scripts/mock_ollama_server.py and scripts/run_ollama_e2e_test.sh for fully offline end-to-end testing without network access to Ollama's registry
  • Adds 31 unit tests covering adapter formatting, message conversion, tool-call extraction, judge model generation, and the NoOpDeployer

Test plan

  • pytest tests/test_ollama_adapters.py tests/test_factory.py passes (31 tests)
  • bash scripts/run_ollama_e2e_test.sh completes end-to-end with the mock server
  • With Ollama running locally (ollama serve) and gemma4:e2b pulled, run a task with AGENT_PROVIDER=ollama JUDGE_PROVIDER=ollama BENCH_NO_INFRA=true BENCH_USE_MCP=false

@google-cla

google-cla Bot commented Jun 14, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Adds OllamaClientAdapter (agent) and OllamaDeepEvalModel (judge) backed
by Ollama's OpenAI-compatible API. Wire AGENT_PROVIDER=ollama and
JUDGE_PROVIDER=ollama into the existing routing. Defaults to gemma4:2b
at http://localhost:11434/v1 (override via OLLAMA_BASE_URL / AGENT_MODEL
/ JUDGE_MODEL). Includes scripts/setup_ollama.sh to install Ollama and
pull the model in a single step.
Single script to install Ollama, kind, pre-pull the kind node image,
create a devops-bench cluster, and pull gemma4:2b — designed to run as
a web-based environment setup script so the result is
cached for subsequent sessions.
- deployers/factory.py: add NoOpDeployer activated by BENCH_NO_INFRA=true,
  which skips tofu/GCP cluster provisioning; enables running the benchmark
  pipeline without infrastructure when testing provider adapters locally
- scripts/mock_ollama_server.py: minimal HTTP server that simulates Ollama's
  OpenAI-compatible chat API; classifies calls as agent/steps/score and
  returns appropriate canned responses so DeepEval GEval metrics complete
- scripts/run_ollama_e2e_test.sh: driver script that starts the mock server,
  sets all Ollama env vars, and runs evaluate.py against the simplest task
  (tasks/generic/gateway-https-redirect) end-to-end with exit 0
When BENCH_USE_MCP=false, the original system instruction told the model
"Do NOT output templates, you MUST use tool calls." Models that follow
instructions strictly (including gemma4) returned empty responses because
no tools were available to call.

Add SYSTEM_INSTRUCTION_NO_MCP that asks for direct YAML/manifest output,
and select it in execute_agent() based on the bench_use_mcp flag.
tests/test_ollama_adapters.py (26 tests):
- OllamaClientAdapter: format_tools, extract_function_calls (dict args,
  JSON string args, malformed JSON fallback), get_text_content (None guard),
  _convert_to_openai_messages (system instruction ordering, tool call wire
  format, tool result messages, full conversation round-trip)
- OllamaDeepEvalModel: generate, a_generate, model name, None content guard

tests/test_factory.py (5 new tests):
- BENCH_NO_INFRA=true returns NoOpDeployer
- BENCH_NO_INFRA takes precedence over explicit infra config
- NoOpDeployer.up/down print and do not raise
- NoOpDeployer.get_cluster_info returns expected keys
@adrianchung adrianchung force-pushed the claude/ollama-e2e-testing-ugpvk4 branch from b4dda47 to 2b1bd9f Compare June 14, 2026 02:01
adrianchung and others added 2 commits June 14, 2026 19:15
Establishes a repeatable way to generate DevOps Bench tasks from an expert
catalog and run them against the framework, plus the first end-to-end
validated task.

- docs/task-generation/: methodology (schema, ID allocation, expected_output
  rules, task classes, kind/MCP cluster access), the 114-row expert catalog as
  source of truth with a generation-status tracker, and a run + leaderboard guide.
- AGENTS.md: vendor-neutral agent guidance pointing at the methodology.
- tasks/generic/debug-crashloop/: first generated task (task_id 1001) with a
  CrashLoopBackOff fixture whose root cause is a missing DATABASE_URL env var.
- pkg/agents/runner/api/mcp_client.py: forward the environment to the MCP
  server subprocess so it inherits KUBECONFIG and cloud credentials; without
  this the MCP server cannot resolve the target cluster's kubeconfig context.

Validated end-to-end on a local kind cluster via the GKE MCP server: the
investigation pipeline (generate -> fixture -> MCP tools on the live cluster
-> agent loop -> judge) discriminates model capability. gemma4:e4b passes
ChecklistScore 1.0 (lists pods, reads logs, finds the DATABASE_URL root cause
and fix); gemma4:e2b fails at 0.25 (hallucinates the pod name).
Task-generation framework + first validated task (debug-crashloop)
@adrianchung adrianchung force-pushed the claude/ollama-e2e-testing-ugpvk4 branch from 6faaffd to a52f8c5 Compare June 15, 2026 01:11
@adrianchung adrianchung deleted the claude/ollama-e2e-testing-ugpvk4 branch June 15, 2026 01:14
@adrianchung

Copy link
Copy Markdown
Collaborator Author

Superseded by #72 (branch renamed to ollama-e2e-testing; identical commits, Claude attribution removed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support a local option to test for inner dev loop

1 participant