Support a local option to test for inner dev loop #72
Closed
adrianchung wants to merge 7 commits into
Adds OllamaClientAdapter (agent) and OllamaDeepEvalModel (judge) backed by Ollama's OpenAI-compatible API. Wire AGENT_PROVIDER=ollama and JUDGE_PROVIDER=ollama into the existing routing. Defaults to gemma4:2b at http://localhost:11434/v1 (override via OLLAMA_BASE_URL / AGENT_MODEL / JUDGE_MODEL). Includes scripts/setup_ollama.sh to install Ollama and pull the model in a single step.
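A minimal sketch of how the defaults and overrides described above could resolve, assuming each role reads its own model variable and both share OLLAMA_BASE_URL. The helper names (`resolve_ollama_settings`, `build_chat_request`) are hypothetical and not taken from the PR; only the environment variable names, the default model, the default URL, and the `/chat/completions` path of Ollama's OpenAI-compatible API come from the source or from Ollama itself.

```python
import json
import os
from dataclasses import dataclass

# Defaults quoted in the commit message.
DEFAULT_BASE_URL = "http://localhost:11434/v1"
DEFAULT_MODEL = "gemma4:2b"


@dataclass(frozen=True)
class OllamaSettings:
    base_url: str
    model: str


def resolve_ollama_settings(role, env=None):
    """Resolve base URL and model for 'agent' or 'judge' (hypothetical helper)."""
    if role not in ("agent", "judge"):
        raise ValueError(f"unknown role: {role!r}")
    env = os.environ if env is None else env
    model_var = "AGENT_MODEL" if role == "agent" else "JUDGE_MODEL"
    return OllamaSettings(
        # Strip a trailing slash so URL joining below stays predictable.
        base_url=env.get("OLLAMA_BASE_URL", DEFAULT_BASE_URL).rstrip("/"),
        model=env.get(model_var, DEFAULT_MODEL),
    )


def build_chat_request(settings, messages):
    """Return (url, JSON body) for an OpenAI-style chat completion call."""
    url = f"{settings.base_url}/chat/completions"
    body = json.dumps({"model": settings.model, "messages": messages})
    return url, body.encode("utf-8")
```

Keeping resolution separate from the HTTP call means the override logic can be unit-tested without a running Ollama server.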
Single script to install Ollama, kind, pre-pull the kind node image, create a devops-bench cluster, and pull gemma4:2b — designed to run as a web-based environment setup script so the result is cached for subsequent sessions.
- deployers/factory.py: add NoOpDeployer activated by BENCH_NO_INFRA=true, which skips tofu/GCP cluster provisioning; enables running the benchmark pipeline without infrastructure when testing provider adapters locally
- scripts/mock_ollama_server.py: minimal HTTP server that simulates Ollama's OpenAI-compatible chat API; classifies calls as agent/steps/score and returns appropriate canned responses so DeepEval GEval metrics complete
- scripts/run_ollama_e2e_test.sh: driver script that starts the mock server, sets all Ollama env vars, and runs evaluate.py against the simplest task (tasks/generic/gateway-https-redirect) end-to-end with exit 0
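The NoOpDeployer routing can be pictured roughly as below. This is a sketch under assumptions, not the code in deployers/factory.py: the `create_deployer` name, the string comparison used for the flag, and the keys returned by `get_cluster_info` are all hypothetical. The only behaviour taken from the PR is that BENCH_NO_INFRA=true selects NoOpDeployer, takes precedence over explicit infra config, and that `up`/`down` print without raising.

```python
import os


class NoOpDeployer:
    """Deployer that provisions nothing; for runs against an existing cluster."""

    def up(self):
        print("BENCH_NO_INFRA=true: skipping cluster provisioning")

    def down(self):
        print("BENCH_NO_INFRA=true: nothing to tear down")

    def get_cluster_info(self):
        # Hypothetical keys; the real deployer may report different ones.
        return {"name": "local", "provisioned": False}


def create_deployer(infra_config=None, env=None):
    """Pick a deployer; the no-infra flag wins over any explicit config."""
    env = os.environ if env is None else env
    if env.get("BENCH_NO_INFRA", "").strip().lower() == "true":
        return NoOpDeployer()
    # Real provisioning (tofu/GCP) is out of scope for this sketch.
    raise NotImplementedError("infrastructure deployers are not sketched here")
```

Checking the flag before looking at `infra_config` is what gives it precedence, which matches the factory test described in the test commit.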
When BENCH_USE_MCP=false, the original system instruction told the model "Do NOT output templates, you MUST use tool calls." Models that follow instructions strictly (including gemma4) returned empty responses because no tools were available to call. Add SYSTEM_INSTRUCTION_NO_MCP that asks for direct YAML/manifest output, and select it in execute_agent() based on the bench_use_mcp flag.
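The selection described above amounts to a small branch. The sketch below assumes the flag arrives as a string from the environment and defaults to enabled when unset; the prompt wording for SYSTEM_INSTRUCTION_NO_MCP and both helper names are placeholders, not the PR's actual text.

```python
# The tool-call sentence is quoted from the commit message; the rest is placeholder text.
SYSTEM_INSTRUCTION = "Do NOT output templates, you MUST use tool calls."
SYSTEM_INSTRUCTION_NO_MCP = (
    "No tools are available. Output the YAML/manifest directly in your reply."
)


def parse_bool_flag(value, default=True):
    """Interpret an env-style string flag (assumed convention: 'false' disables)."""
    if value is None:
        return default
    return value.strip().lower() not in ("false", "0", "no", "")


def select_system_instruction(bench_use_mcp):
    """Return the prompt matching whether MCP tools are actually available."""
    return SYSTEM_INSTRUCTION if bench_use_mcp else SYSTEM_INSTRUCTION_NO_MCP
```

With this shape, a strict instruction-following model is never told to call tools that do not exist, which is the failure mode the commit describes.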
tests/test_ollama_adapters.py (26 tests):
- OllamaClientAdapter: format_tools, extract_function_calls (dict args, JSON string args, malformed JSON fallback), get_text_content (None guard), _convert_to_openai_messages (system instruction ordering, tool call wire format, tool result messages, full conversation round-trip)
- OllamaDeepEvalModel: generate, a_generate, model name, None content guard

tests/test_factory.py (5 new tests):
- BENCH_NO_INFRA=true returns NoOpDeployer
- BENCH_NO_INFRA takes precedence over explicit infra config
- NoOpDeployer.up/down print and do not raise
- NoOpDeployer.get_cluster_info returns expected keys
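The three argument cases covered for extract_function_calls (dict args, JSON string args, malformed JSON) could be handled along these lines. This is an illustrative sketch over plain dicts in the OpenAI tool-call wire shape, not the adapter's real method; in particular, falling back to an empty dict on malformed JSON is an assumption, since the PR does not say what the fallback returns.

```python
import json


def parse_tool_arguments(raw):
    """Normalise tool-call arguments to a dict."""
    if isinstance(raw, dict):
        return raw  # already decoded
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}  # assumed fallback for malformed JSON
        return parsed if isinstance(parsed, dict) else {}
    return {}


def extract_function_calls(message):
    """Collect (id, name, args) from an OpenAI-style assistant message dict."""
    calls = []
    for tool_call in message.get("tool_calls") or []:
        function = tool_call.get("function") or {}
        calls.append(
            {
                "id": tool_call.get("id"),
                "name": function.get("name"),
                "args": parse_tool_arguments(function.get("arguments")),
            }
        )
    return calls
```

Normalising in one helper keeps the agent loop free of per-model special cases, since some servers return arguments as a JSON string and others as an object.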
Establishes a repeatable way to generate DevOps Bench tasks from an expert catalog and run them against the framework, plus the first end-to-end validated task.
- docs/task-generation/: methodology (schema, ID allocation, expected_output rules, task classes, kind/MCP cluster access), the 114-row expert catalog as source of truth with a generation-status tracker, and a run + leaderboard guide.
- AGENTS.md: vendor-neutral agent guidance pointing at the methodology.
- tasks/generic/debug-crashloop/: first generated task (task_id 1001) with a CrashLoopBackOff fixture whose root cause is a missing DATABASE_URL env var.
- pkg/agents/runner/api/mcp_client.py: forward the environment to the MCP server subprocess so it inherits KUBECONFIG and cloud credentials; without this the MCP server cannot resolve the target cluster's kubeconfig context.

Validated end-to-end on a local kind cluster via the GKE MCP server: the investigation pipeline (generate -> fixture -> MCP tools on the live cluster -> agent loop -> judge) discriminates model capability. gemma4:e4b passes ChecklistScore 1.0 (lists pods, reads logs, finds the DATABASE_URL root cause and fix); gemma4:e2b fails at 0.25 (hallucinates the pod name).
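The mcp_client.py change is about what environment the MCP server subprocess sees. The sketch below illustrates the idea with the standard library only; it does not reproduce the MCP client code, and `build_server_env` and `run_child` are hypothetical names. The point is that passing a copy of the parent environment lets the child read KUBECONFIG, whereas a launcher that passes a restricted environment would hide it.

```python
import os
import subprocess
import sys


def build_server_env(overrides=None):
    """Copy the parent environment and layer optional overrides on top."""
    env = dict(os.environ)
    env.update(overrides or {})
    return env


def run_child(args, overrides=None):
    """Run a child process with the forwarded environment and return its stdout."""
    result = subprocess.run(
        args,
        env=build_server_env(overrides),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for the MCP server: a child that reports the KUBECONFIG it sees.
    probe = [sys.executable, "-c", "import os; print(os.environ.get('KUBECONFIG', ''))"]
    print(run_child(probe))
```

Copying rather than mutating `os.environ` keeps per-server overrides from leaking back into the benchmark process.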
Task-generation framework + first validated task (debug-crashloop)
This was referenced Jun 15, 2026
Split into four focused, independently-reviewable PRs (per review feedback that this was too big). Together they cover this PR's changes exactly, with no file overlap, so they can be reviewed/merged in any order:
Recommended review/merge order: #73, #74, #75, #76 (later ones only run once the earlier ones land; no merge dependency).
Closes #70
Summary
- Add OllamaClientAdapter and OllamaDeepEvalModel so benchmarks can run fully offline using a local Ollama server instead of a cloud LLM provider
- Add scripts/setup_local_env.sh to install Ollama and kind and pull the model in a single step for CI and dev setup
- Add NoOpDeployer (activated by BENCH_NO_INFRA=true) to skip OpenTofu provisioning when a cluster is already available locally
- Handle BENCH_USE_MCP=false by using a separate SYSTEM_INSTRUCTION_NO_MCP that does not require tool calls
- Add scripts/mock_ollama_server.py and scripts/run_ollama_e2e_test.sh for fully offline end-to-end testing without network access to Ollama's registry

Test plan
- pytest tests/test_ollama_adapters.py tests/test_factory.py passes (31 tests)
- bash scripts/run_ollama_e2e_test.sh completes end-to-end with the mock server
- With Ollama running (ollama serve) and gemma4:e2b pulled, run a task with AGENT_PROVIDER=ollama JUDGE_PROVIDER=ollama BENCH_NO_INFRA=true BENCH_USE_MCP=false

Supersedes #71: same commits; reopened from a renamed branch (ollama-e2e-testing).