Skip to content

Support a local option to test for inner dev loop#72

Closed
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:ollama-e2e-testing
Closed

Support a local option to test for inner dev loop#72
adrianchung wants to merge 7 commits into
gke-labs:mainfrom
adrianchung:ollama-e2e-testing

Conversation

@adrianchung

Copy link
Copy Markdown
Collaborator

Closes #70

Summary

  • Adds OllamaClientAdapter and OllamaDeepEvalModel so benchmarks can run fully offline using a local Ollama server instead of a cloud LLM provider
  • Adds scripts/setup_local_env.sh to install Ollama, kind, and pull the model in a single step for CI and dev setup
  • Adds NoOpDeployer (activated by BENCH_NO_INFRA=true) to skip OpenTofu provisioning when a cluster is already available locally
  • Fixes empty agent responses when BENCH_USE_MCP=false by using a separate SYSTEM_INSTRUCTION_NO_MCP that does not require tool calls
  • Adds scripts/mock_ollama_server.py and scripts/run_ollama_e2e_test.sh for fully offline end-to-end testing without network access to Ollama's registry
  • Adds 31 unit tests covering adapter formatting, message conversion, tool-call extraction, judge model generation, and the NoOpDeployer

Test plan

  • pytest tests/test_ollama_adapters.py tests/test_factory.py passes (31 tests)
  • bash scripts/run_ollama_e2e_test.sh completes end-to-end with the mock server
  • With Ollama running locally (ollama serve) and gemma4:e2b pulled, run a task with AGENT_PROVIDER=ollama JUDGE_PROVIDER=ollama BENCH_NO_INFRA=true BENCH_USE_MCP=false

Supersedes #71 — same commits; reopened from a renamed branch (ollama-e2e-testing).

adrianchung and others added 7 commits June 14, 2026 02:01
Adds OllamaClientAdapter (agent) and OllamaDeepEvalModel (judge) backed
by Ollama's OpenAI-compatible API. Wire AGENT_PROVIDER=ollama and
JUDGE_PROVIDER=ollama into the existing routing. Defaults to gemma4:2b
at http://localhost:11434/v1 (override via OLLAMA_BASE_URL / AGENT_MODEL
/ JUDGE_MODEL). Includes scripts/setup_ollama.sh to install Ollama and
pull the model in a single step.
Single script to install Ollama, kind, pre-pull the kind node image,
create a devops-bench cluster, and pull gemma4:2b — designed to run as
a web-based environment setup script so the result is
cached for subsequent sessions.
- deployers/factory.py: add NoOpDeployer activated by BENCH_NO_INFRA=true,
  which skips tofu/GCP cluster provisioning; enables running the benchmark
  pipeline without infrastructure when testing provider adapters locally
- scripts/mock_ollama_server.py: minimal HTTP server that simulates Ollama's
  OpenAI-compatible chat API; classifies calls as agent/steps/score and
  returns appropriate canned responses so DeepEval GEval metrics complete
- scripts/run_ollama_e2e_test.sh: driver script that starts the mock server,
  sets all Ollama env vars, and runs evaluate.py against the simplest task
  (tasks/generic/gateway-https-redirect) end-to-end with exit 0
When BENCH_USE_MCP=false, the original system instruction told the model
"Do NOT output templates, you MUST use tool calls." Models that follow
instructions strictly (including gemma4) returned empty responses because
no tools were available to call.

Add SYSTEM_INSTRUCTION_NO_MCP that asks for direct YAML/manifest output,
and select it in execute_agent() based on the bench_use_mcp flag.
tests/test_ollama_adapters.py (26 tests):
- OllamaClientAdapter: format_tools, extract_function_calls (dict args,
  JSON string args, malformed JSON fallback), get_text_content (None guard),
  _convert_to_openai_messages (system instruction ordering, tool call wire
  format, tool result messages, full conversation round-trip)
- OllamaDeepEvalModel: generate, a_generate, model name, None content guard

tests/test_factory.py (5 new tests):
- BENCH_NO_INFRA=true returns NoOpDeployer
- BENCH_NO_INFRA takes precedence over explicit infra config
- NoOpDeployer.up/down print and do not raise
- NoOpDeployer.get_cluster_info returns expected keys
Establishes a repeatable way to generate DevOps Bench tasks from an expert
catalog and run them against the framework, plus the first end-to-end
validated task.

- docs/task-generation/: methodology (schema, ID allocation, expected_output
  rules, task classes, kind/MCP cluster access), the 114-row expert catalog as
  source of truth with a generation-status tracker, and a run + leaderboard guide.
- AGENTS.md: vendor-neutral agent guidance pointing at the methodology.
- tasks/generic/debug-crashloop/: first generated task (task_id 1001) with a
  CrashLoopBackOff fixture whose root cause is a missing DATABASE_URL env var.
- pkg/agents/runner/api/mcp_client.py: forward the environment to the MCP
  server subprocess so it inherits KUBECONFIG and cloud credentials; without
  this the MCP server cannot resolve the target cluster's kubeconfig context.

Validated end-to-end on a local kind cluster via the GKE MCP server: the
investigation pipeline (generate -> fixture -> MCP tools on the live cluster
-> agent loop -> judge) discriminates model capability. gemma4:e4b passes
ChecklistScore 1.0 (lists pods, reads logs, finds the DATABASE_URL root cause
and fix); gemma4:e2b fails at 0.25 (hallucinates the pod name).
Task-generation framework + first validated task (debug-crashloop)
@adrianchung

Copy link
Copy Markdown
Collaborator Author

Split into four focused, independently-reviewable PRs (per review feedback that this was too big). Together they cover this PR's changes exactly, with no file overlap, so they can be reviewed/merged in any order:

Recommended review/merge order: #73, #74, #75, #76 (later ones only run once the earlier ones land; no merge dependency).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support a local option to test for inner dev loop

1 participant