Support a local option to test for inner dev loop by adrianchung · Pull Request #71 · gke-labs/devops-bench

adrianchung · 2026-06-14T01:50:17Z

Closes #70

Summary

Adds OllamaClientAdapter and OllamaDeepEvalModel so benchmarks can run fully offline using a local Ollama server instead of a cloud LLM provider
Adds scripts/setup_local_env.sh to install Ollama, kind, and pull the model in a single step for CI and dev setup
Adds NoOpDeployer (activated by BENCH_NO_INFRA=true) to skip OpenTofu provisioning when a cluster is already available locally
Fixes empty agent responses when BENCH_USE_MCP=false by using a separate SYSTEM_INSTRUCTION_NO_MCP that does not require tool calls
Adds scripts/mock_ollama_server.py and scripts/run_ollama_e2e_test.sh for fully offline end-to-end testing without network access to Ollama's registry
Adds 31 unit tests covering adapter formatting, message conversion, tool-call extraction, judge model generation, and the NoOpDeployer

Test plan

pytest tests/test_ollama_adapters.py tests/test_factory.py passes (31 tests)
bash scripts/run_ollama_e2e_test.sh completes end-to-end with the mock server
With Ollama running locally (ollama serve) and gemma4:e2b pulled, run a task with AGENT_PROVIDER=ollama JUDGE_PROVIDER=ollama BENCH_NO_INFRA=true BENCH_USE_MCP=false

google-cla · 2026-06-14T01:50:39Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Adds OllamaClientAdapter (agent) and OllamaDeepEvalModel (judge) backed by Ollama's OpenAI-compatible API. Wire AGENT_PROVIDER=ollama and JUDGE_PROVIDER=ollama into the existing routing. Defaults to gemma4:2b at http://localhost:11434/v1 (override via OLLAMA_BASE_URL / AGENT_MODEL / JUDGE_MODEL). Includes scripts/setup_ollama.sh to install Ollama and pull the model in a single step.

Single script to install Ollama, kind, pre-pull the kind node image, create a devops-bench cluster, and pull gemma4:2b — designed to run as a web-based environment setup script so the result is cached for subsequent sessions.

- deployers/factory.py: add NoOpDeployer activated by BENCH_NO_INFRA=true, which skips tofu/GCP cluster provisioning; enables running the benchmark pipeline without infrastructure when testing provider adapters locally - scripts/mock_ollama_server.py: minimal HTTP server that simulates Ollama's OpenAI-compatible chat API; classifies calls as agent/steps/score and returns appropriate canned responses so DeepEval GEval metrics complete - scripts/run_ollama_e2e_test.sh: driver script that starts the mock server, sets all Ollama env vars, and runs evaluate.py against the simplest task (tasks/generic/gateway-https-redirect) end-to-end with exit 0

When BENCH_USE_MCP=false, the original system instruction told the model "Do NOT output templates, you MUST use tool calls." Models that follow instructions strictly (including gemma4) returned empty responses because no tools were available to call. Add SYSTEM_INSTRUCTION_NO_MCP that asks for direct YAML/manifest output, and select it in execute_agent() based on the bench_use_mcp flag.

tests/test_ollama_adapters.py (26 tests): - OllamaClientAdapter: format_tools, extract_function_calls (dict args, JSON string args, malformed JSON fallback), get_text_content (None guard), _convert_to_openai_messages (system instruction ordering, tool call wire format, tool result messages, full conversation round-trip) - OllamaDeepEvalModel: generate, a_generate, model name, None content guard tests/test_factory.py (5 new tests): - BENCH_NO_INFRA=true returns NoOpDeployer - BENCH_NO_INFRA takes precedence over explicit infra config - NoOpDeployer.up/down print and do not raise - NoOpDeployer.get_cluster_info returns expected keys

Establishes a repeatable way to generate DevOps Bench tasks from an expert catalog and run them against the framework, plus the first end-to-end validated task. - docs/task-generation/: methodology (schema, ID allocation, expected_output rules, task classes, kind/MCP cluster access), the 114-row expert catalog as source of truth with a generation-status tracker, and a run + leaderboard guide. - AGENTS.md: vendor-neutral agent guidance pointing at the methodology. - tasks/generic/debug-crashloop/: first generated task (task_id 1001) with a CrashLoopBackOff fixture whose root cause is a missing DATABASE_URL env var. - pkg/agents/runner/api/mcp_client.py: forward the environment to the MCP server subprocess so it inherits KUBECONFIG and cloud credentials; without this the MCP server cannot resolve the target cluster's kubeconfig context. Validated end-to-end on a local kind cluster via the GKE MCP server: the investigation pipeline (generate -> fixture -> MCP tools on the live cluster -> agent loop -> judge) discriminates model capability. gemma4:e4b passes ChecklistScore 1.0 (lists pods, reads logs, finds the DATABASE_URL root cause and fix); gemma4:e2b fails at 0.25 (hallucinates the pod name).

Task-generation framework + first validated task (debug-crashloop)

adrianchung · 2026-06-15T01:15:47Z

Superseded by #72 (branch renamed to ollama-e2e-testing; identical commits, Claude attribution removed).

adrianchung added 5 commits June 14, 2026 02:01

Add setup_local_env.sh for Ollama + kind dev environment

d3cd72c

Single script to install Ollama, kind, pre-pull the kind node image, create a devops-bench cluster, and pull gemma4:2b — designed to run as a web-based environment setup script so the result is cached for subsequent sessions.

adrianchung force-pushed the claude/ollama-e2e-testing-ugpvk4 branch from b4dda47 to 2b1bd9f Compare June 14, 2026 02:01

adrianchung and others added 2 commits June 14, 2026 19:15

Merge pull request #2 from adrianchung/kubernetes-task-generation

a52f8c5

Task-generation framework + first validated task (debug-crashloop)

adrianchung force-pushed the claude/ollama-e2e-testing-ugpvk4 branch from 6faaffd to a52f8c5 Compare June 15, 2026 01:11

adrianchung closed this Jun 15, 2026

adrianchung deleted the claude/ollama-e2e-testing-ugpvk4 branch June 15, 2026 01:14

adrianchung mentioned this pull request Jun 15, 2026

Support a local option to test for inner dev loop #72

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support a local option to test for inner dev loop#71

Support a local option to test for inner dev loop#71
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:claude/ollama-e2e-testing-ugpvk4

adrianchung commented Jun 14, 2026

Uh oh!

google-cla Bot commented Jun 14, 2026

Uh oh!

adrianchung commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

adrianchung commented Jun 14, 2026

Summary

Test plan

Uh oh!

google-cla Bot commented Jun 14, 2026

Uh oh!

adrianchung commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant