Support a local option to test for inner dev loop by adrianchung · Pull Request #72 · gke-labs/devops-bench

adrianchung · 2026-06-15T01:15:29Z

Closes #70

Summary

Adds OllamaClientAdapter and OllamaDeepEvalModel so benchmarks can run fully offline using a local Ollama server instead of a cloud LLM provider
Adds scripts/setup_local_env.sh to install Ollama, kind, and pull the model in a single step for CI and dev setup
Adds NoOpDeployer (activated by BENCH_NO_INFRA=true) to skip OpenTofu provisioning when a cluster is already available locally
Fixes empty agent responses when BENCH_USE_MCP=false by using a separate SYSTEM_INSTRUCTION_NO_MCP that does not require tool calls
Adds scripts/mock_ollama_server.py and scripts/run_ollama_e2e_test.sh for fully offline end-to-end testing without network access to Ollama's registry
Adds 31 unit tests covering adapter formatting, message conversion, tool-call extraction, judge model generation, and the NoOpDeployer

Test plan

pytest tests/test_ollama_adapters.py tests/test_factory.py passes (31 tests)
bash scripts/run_ollama_e2e_test.sh completes end-to-end with the mock server
With Ollama running locally (ollama serve) and gemma4:e2b pulled, run a task with AGENT_PROVIDER=ollama JUDGE_PROVIDER=ollama BENCH_NO_INFRA=true BENCH_USE_MCP=false

Supersedes #71 — same commits; reopened from a renamed branch (ollama-e2e-testing).

Adds OllamaClientAdapter (agent) and OllamaDeepEvalModel (judge) backed by Ollama's OpenAI-compatible API. Wire AGENT_PROVIDER=ollama and JUDGE_PROVIDER=ollama into the existing routing. Defaults to gemma4:2b at http://localhost:11434/v1 (override via OLLAMA_BASE_URL / AGENT_MODEL / JUDGE_MODEL). Includes scripts/setup_ollama.sh to install Ollama and pull the model in a single step.

Single script to install Ollama, kind, pre-pull the kind node image, create a devops-bench cluster, and pull gemma4:2b — designed to run as a web-based environment setup script so the result is cached for subsequent sessions.

- deployers/factory.py: add NoOpDeployer activated by BENCH_NO_INFRA=true, which skips tofu/GCP cluster provisioning; enables running the benchmark pipeline without infrastructure when testing provider adapters locally - scripts/mock_ollama_server.py: minimal HTTP server that simulates Ollama's OpenAI-compatible chat API; classifies calls as agent/steps/score and returns appropriate canned responses so DeepEval GEval metrics complete - scripts/run_ollama_e2e_test.sh: driver script that starts the mock server, sets all Ollama env vars, and runs evaluate.py against the simplest task (tasks/generic/gateway-https-redirect) end-to-end with exit 0

When BENCH_USE_MCP=false, the original system instruction told the model "Do NOT output templates, you MUST use tool calls." Models that follow instructions strictly (including gemma4) returned empty responses because no tools were available to call. Add SYSTEM_INSTRUCTION_NO_MCP that asks for direct YAML/manifest output, and select it in execute_agent() based on the bench_use_mcp flag.

tests/test_ollama_adapters.py (26 tests): - OllamaClientAdapter: format_tools, extract_function_calls (dict args, JSON string args, malformed JSON fallback), get_text_content (None guard), _convert_to_openai_messages (system instruction ordering, tool call wire format, tool result messages, full conversation round-trip) - OllamaDeepEvalModel: generate, a_generate, model name, None content guard tests/test_factory.py (5 new tests): - BENCH_NO_INFRA=true returns NoOpDeployer - BENCH_NO_INFRA takes precedence over explicit infra config - NoOpDeployer.up/down print and do not raise - NoOpDeployer.get_cluster_info returns expected keys

Establishes a repeatable way to generate DevOps Bench tasks from an expert catalog and run them against the framework, plus the first end-to-end validated task. - docs/task-generation/: methodology (schema, ID allocation, expected_output rules, task classes, kind/MCP cluster access), the 114-row expert catalog as source of truth with a generation-status tracker, and a run + leaderboard guide. - AGENTS.md: vendor-neutral agent guidance pointing at the methodology. - tasks/generic/debug-crashloop/: first generated task (task_id 1001) with a CrashLoopBackOff fixture whose root cause is a missing DATABASE_URL env var. - pkg/agents/runner/api/mcp_client.py: forward the environment to the MCP server subprocess so it inherits KUBECONFIG and cloud credentials; without this the MCP server cannot resolve the target cluster's kubeconfig context. Validated end-to-end on a local kind cluster via the GKE MCP server: the investigation pipeline (generate -> fixture -> MCP tools on the live cluster -> agent loop -> judge) discriminates model capability. gemma4:e4b passes ChecklistScore 1.0 (lists pods, reads logs, finds the DATABASE_URL root cause and fix); gemma4:e2b fails at 0.25 (hallucinates the pod name).

Task-generation framework + first validated task (debug-crashloop)

adrianchung · 2026-06-15T18:29:19Z

Split into four focused, independently-reviewable PRs (per review feedback that this was too big). Together they cover this PR's changes exactly, with no file overlap, so they can be reviewed/merged in any order:

Add NoOpDeployer for infrastructure-free local runs #73 — Add NoOpDeployer (infrastructure-free local runs)
Add Ollama provider support (agent + judge) #74 — Add Ollama provider support (agent + judge)
Add local end-to-end test harness for the Ollama provider #75 — Add local end-to-end test harness
Add task-generation framework + first task (debug-crashloop) #76 — Add task-generation framework + first task (debug-crashloop)

Recommended review/merge order: #73, #74, #75, #76 (later ones only run once the earlier ones land; no merge dependency).

adrianchung and others added 7 commits June 14, 2026 02:01

Add setup_local_env.sh for Ollama + kind dev environment

d3cd72c

Single script to install Ollama, kind, pre-pull the kind node image, create a devops-bench cluster, and pull gemma4:2b — designed to run as a web-based environment setup script so the result is cached for subsequent sessions.

Merge pull request #2 from adrianchung/kubernetes-task-generation

a52f8c5

Task-generation framework + first validated task (debug-crashloop)

adrianchung closed this Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support a local option to test for inner dev loop#72

Support a local option to test for inner dev loop#72
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:ollama-e2e-testing

adrianchung commented Jun 15, 2026

Uh oh!

adrianchung commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

adrianchung commented Jun 15, 2026

Summary

Test plan

Uh oh!

adrianchung commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant