Benchmark quality: parallel-safety + harness/metrics/task fixes by pradeepvrd · Pull Request #141 · gke-labs/devops-bench

pradeepvrd · 2026-06-28T15:32:27Z

Summary

Benchmark quality pass: parallel-run safety plus harness, metrics, task-definition, and leaderboard-ingest fixes. Each fix's details are in its commit message; this lists the top-level commits by author.

Commits

Jessie Liu (7)

feat(deployers): per-run isolation engine (noop deployer + per-run tofu stack-dir)
fix(tf): run-scoped naming to avoid cross-run collisions
fix(tasks): mark manifest-generation tasks deployer: noop
fix(matrix): pin per-task NAMESPACE/TARGET_DEPLOYMENT_NAME
fix(optimize-scale): make chaos load reach the workload
fix(opa-remediation): retry Kyverno webhook apply
docs(evals): stale-state rerun cleanup + gke-mcp mutation blast radius

pradeepvrd (17)

feat(harness): add validated flag for leaderboard gating
feat(metrics): generation-only judging + MCP tool-name normalization
feat(chaos): route load via external LoadBalancer with port-forward fallback
feat(models): add google-vertex aliases; pass KUBECONFIG to MCP servers
fix(tasks): correct manifest schemas and make generate-only tasks noop
feat(tf): add hypercomputer-d1 stack; wire deploy-config/fix-config/get-app-architecture
feat(tasks): replace parallelstore-csi with lustre-csi deployment
fix(complextasks): grade migration-and-upgrade & multi-region-failover on outcome
fix(tf): sweep leaked hello-app Artifact Registry repo on minimum teardown
fix(bastion): wire agent +skills to gke-mcp repo; sync hygiene; agent timeout
docs(evals): bastion kind requirements, parallel-evals failure modes, stale-state wipe
fix(bastion): stop sync from excluding devops_bench/results
fix(tasks): report the real task folder for single-file task.yaml loads
feat(results): aggregate per-task parallel runs into one dashboard run
fix(ingest): accept a uniqueness suffix on runId
feat(ingest): add Gemini 3.1 Pro to the dashboard catalog
fix(tasks): make task ids globally unique across task trees

Notes

WIP tasks owned by separate PRs (deploy-postgres-web-app, troubleshoot-unhealthy-pod, gitops-auto-revert, debug-crashloop and their tf stacks) are intentionally excluded so this merges cleanly.

Testing

pytest tests/unit — 744 passed.
site_new vitest ingest/ — 25 passed.

Add a noop deployer for manifest-generation tasks and copy each run's OpenTofu stack into a per-run working dir so concurrent matrix runs no longer collide on shared tofu state.
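The per-run copy could look roughly like the sketch below. This is an illustration only, not the deployer's actual code: `prepare_run_stack`, the `runs_root` layout, and the ignored file patterns are all assumptions.

```python
import shutil
from pathlib import Path


def prepare_run_stack(stack_src: Path, runs_root: Path, run_id: str) -> Path:
    """Copy a stack directory into a working dir owned by exactly one run.

    Hypothetical helper: the directory layout is assumed for illustration.
    """
    run_dir = runs_root / run_id / stack_src.name
    # copytree raises FileExistsError if run_dir already exists, so a stale
    # directory from an earlier run is never silently reused.
    shutil.copytree(
        stack_src,
        run_dir,
        # Skip local state and provider caches so every run starts clean.
        ignore=shutil.ignore_patterns(".terraform", "*.tfstate", "*.tfstate.backup"),
    )
    return run_dir
```

Because each run gets its own directory, two concurrent runs never read or write the same local state file.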

Derive run-unique GitOps repo paths and cluster names (multi-region e-/w- prefix to dodge the node-SA substr collision), declare the minimum-stack namespace var, and document the kind-task parallel model.

Manifest-only tasks skip cluster provisioning (no shared cluster to collide on); deploy-hello-app uses a run-unique Artifact Registry repo.

Add task_extra_env so both eval arms agree on fixture namespace and target deployment for pre-seeded tasks.

Serve on port 8080 with a CPU-burn workload so the HPA actually scales, install fortio on the bastion, and pin the fixture's deployment/namespace.

Retry the policy apply until the Kyverno admission webhook is serving, so fixture setup no longer flakes with context-deadline-exceeded.
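A retry loop of this kind can be sketched as follows. The function name, the attempt count, the delay, and the use of `RuntimeError` as the failure signal are assumptions made for the example; the real fixture setup may shell out to kubectl and detect failure differently.

```python
import time
from typing import Callable


def retry_until_ready(
    apply: Callable[[], None],
    attempts: int = 10,
    delay_sec: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call apply() until it succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            apply()
            return attempt
        except RuntimeError:
            # Assumed failure signal, e.g. the webhook is not serving yet.
            if attempt == attempts:
                raise
            sleep(delay_sec)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.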

Document clearing stale per-run state before reruns and the cluster-mutation risk when gke-mcp exposes unrelated clusters.

Add a validated field to the task schema and result row; only vetted tasks promote to the leaderboard. Plumb it through results normalization and the site schema/seed data.

Thread a generation_only flag so manifest-only (deployer: noop) tasks aren't penalized by the OutcomeValidity judge for not applying to a cluster, and emit generation_only/validated on result records. Strip the MCP server prefix (server__tool) so expected-tool matching is canonical.
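The prefix stripping might be as small as the sketch below. `normalize_tool_name` and the example tool names are hypothetical; only the `server__tool` shape comes from the commit message.

```python
def normalize_tool_name(name: str) -> str:
    """Drop a leading 'server__' prefix, leaving the bare tool name."""
    _server, sep, tool = name.partition("__")
    # Names without a separator, or with nothing after it, pass through unchanged.
    return tool if sep and tool else name
```

With this, a prefixed call and an unprefixed expected tool compare equal after both are normalized.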

Resolve the target Service's external LB IP and rewrite the action URL to http://<ip>:8080, falling back to port-forward when resolution times out or is skipped in smoke tests; wait for rollout before forwarding. Expose optimize-scale via LoadBalancer so load reaches it at 300+ qps.

Resolve google-vertex/google_vertex to the gemini adapter, and pass the run-scoped KUBECONFIG to MCP servers so they use run credentials.

Fix HPA/computeclass API versions and spec paths, clarify generate-only prompts, convert create/modify-deployment to the noop deployer, and reframe gpu-stress-test-diagnosis as post-incident log analysis.

Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS bucket, Workload Identity KSA, frontend) with seed_mode variants, and point the three tasks at it on e2-standard-4.

Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI task (18TB capacity, L4 GPU nodes) for model-serving storage.

Accept either failover or direct primary recovery (service restored = 2xx) and clarify kind's control-plane upgrade path.

deploy-hello-app creates hello-app-<cluster> in project-global Artifact Registry, which cluster teardown never removes; add a destroy-time null_resource to delete it so it doesn't leak across runs.

Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200, and strip macOS AppleDouble files during sync.

Document kind toolchain/cleanup needs and expanded matrix failure modes, and strengthen stale per-run state guidance to wipe all state before every run.

The tar `--exclude='results'` matched any path component named 'results', so it stripped the `devops_bench/results/` SOURCE module (the rows.json / manifest.json builder) from every sync. The harness's `build_rows` import then failed inside its per-metric/best-effort try-except, so runs silently produced no rows.json/manifest.json (leaderboard rows). The eval-output `results/` dir is already excluded by not being in the synced path allowlist.
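The difference between the two exclusion behaviours can be modelled in a few lines. This is a simplified model written for illustration, not tar's own matching code, and the file names in the usage note are made up.

```python
from pathlib import PurePosixPath


def excluded_by_component(path: str, name: str = "results") -> bool:
    """Unanchored rule: drop the path if any component equals name."""
    return name in PurePosixPath(path).parts


def excluded_by_anchor(path: str, prefix: str = "results") -> bool:
    """Anchored rule: drop only paths at or under the top-level prefix."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] == prefix
```

Under the unanchored rule a path such as `devops_bench/results/build.py` is dropped along with `results/run_1/rows.json`; under the anchored rule only the latter is.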

The parallel matrix runs one task per process, pointing the loader at a single <task-dir>/task.yaml. _load_single_file used the file stem -- the literal "task" -- for both folder and the name fallback, while only the directory loader used the containing dir name. Every emitted leaderboard row therefore carried taskFolder="task", and the dashboard's derive() (which groups tasks by taskFolder) collapsed a whole setup's tasks into a single task. Fix: for a single spec named task.yaml, derive folder and name from the parent directory, mirroring the directory loader; keep the stem fallback for arbitrarily-named single specs.
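The folder derivation described above can be sketched like this. `derive_folder` is a hypothetical stand-in for the relevant part of `_load_single_file`, whose real signature is not shown in this PR.

```python
from pathlib import Path


def derive_folder(spec_path: Path) -> str:
    """Pick the folder label for a single-file task spec."""
    if spec_path.name == "task.yaml":
        # Mirror the directory loader: use the containing directory's name.
        # Fall back to the stem if the path has no parent component.
        return spec_path.parent.name or spec_path.stem
    # Arbitrarily named single specs keep the file-stem fallback.
    return spec_path.stem
```

For `complextasks/optimize-scale/task.yaml` this yields `optimize-scale` instead of the literal `task`.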

The parallel matrix runs one task per process, emitting one rows.json per task, each with a unique runId and its own timestamp t. The dashboard models a run as a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and derive.mjs groups runs by t and shows only the latest run's tasks -- so the per-task files render as many single-task runs and a setup surfaces just one task. Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix so concurrent or repeated matrix runs never collide on the setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried tasks (latest t wins), and write a combined rows.json + per-setup manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals skill.
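A minimal sketch of the aggregation idea follows. It is not the contents of `aggregate.py`: the function names are invented here, and keying the de-dupe on `(setupId, taskFolder, iteration)` is an assumption based on the doc-id shape quoted above.

```python
import os
import time
from typing import Optional


def make_run_id(now: Optional[float] = None, pid: Optional[int] = None) -> str:
    """Build a run_<ts>_<pid> id; arguments exist only to make tests repeatable."""
    ts = time.strftime("%Y%m%d_%H%M%S", time.gmtime(now))
    return f"run_{ts}_{os.getpid() if pid is None else pid}"


def aggregate_rows(rows: list, run_id: str, t: int) -> list:
    """Merge per-task rows into one batch run sharing run_id and t."""
    latest = {}
    for row in rows:
        key = (row["setupId"], row["taskFolder"], row["iteration"])
        # A retried task appears more than once; the latest t wins.
        if key not in latest or row["t"] > latest[key]["t"]:
            latest[key] = row
    # Stamp copies so the caller's input rows are left untouched.
    return [{**row, "runId": run_id, "t": t} for row in latest.values()]
```

After stamping, every surviving row carries the same `runId` and `t`, which is what lets the dashboard treat them as one run.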

load.mjs rejected every runId produced by an isolated/parallel run: the producer makes run ids unique per process by appending a suffix (pid or matrix id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs failed validation and could not be ingested. The timestamp alone is not unique, so the suffix is required to keep the setupId__runId__taskFolder__iteration doc id distinct across parallel runs. Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL contract, and cover the suffixed form in load.test.mjs.
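The shape of the loosened pattern can be shown in Python, keeping in mind that the real `RUN_ID_RE` lives in `load.mjs` and is JavaScript. The character set allowed in the suffix below is an assumption; the PR only says the suffix is a pid or matrix id.

```python
import re

# Old shape: a bare timestamped id only.
STRICT_RUN_ID = re.compile(r"run_\d{8}_\d{6}")
# Loosened shape: the same id with an optional _<suffix> (assumed charset).
LOOSE_RUN_ID = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9-]+)?")


def is_valid_run_id(run_id: str, pattern: re.Pattern = LOOSE_RUN_ID) -> bool:
    """Return True only if the whole string matches the pattern."""
    return pattern.fullmatch(run_id) is not None
```

The bare form stays valid under the loosened pattern, so ids produced before the change would still validate.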

Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and synthesized placeholder metadata (provider "Unknown", default logo) because the model was absent from the catalog, so the leaderboard showed a generic entry. Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a matching "gemini" brand glyph so the leaderboard renders a logo.
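Substring-based catalog lookup could be sketched as below. The field names, the `gemini-3.1-pro` needle, and the placeholder values are assumptions for illustration; the dashboard's real catalog is JavaScript and its schema is not shown here.

```python
# Ordered (needle, entry) pairs; the first needle found in the id wins.
CATALOG = [
    (
        "gemini-3.1-pro",
        {"name": "Gemini 3.1 Pro", "provider": "Google",
         "license": "Proprietary", "logo": "gemini"},
    ),
]


def lookup_model(model_id: str) -> dict:
    """Return the curated entry for a model id, or placeholder metadata."""
    lowered = model_id.lower()
    for needle, entry in CATALOG:
        if needle in lowered:
            return entry
    # Unknown models get a generic entry rather than failing ingest.
    return {"name": model_id, "provider": "Unknown",
            "license": "Unknown", "logo": "default"}
```

Matching on a substring lets the preview id and any versioned variants resolve to the same curated entry.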

complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a combined load (or any cross-tree id check) saw duplicate task ids -- the same collision class flagged for migration-and-upgrade vs lustre-csi. Renumber the four colliding complextasks into the free 17-20 range so the trees no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20 block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19, opa-remediation 5->20). Tasks are selected by path, not numeric id, so this only affects loader ordering/de-dup.
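A cross-tree id check of the kind mentioned above might look like this sketch. `find_duplicate_ids` and the `tasks/example-a` path are hypothetical; only the optimize-scale renumbering from 1 to 17 is taken from the commit message.

```python
from collections import defaultdict


def find_duplicate_ids(tasks: list) -> dict:
    """Map each task id used more than once to the paths that claim it.

    tasks is a list of (task_id, path) pairs gathered from every task tree.
    """
    by_id = defaultdict(list)
    for task_id, path in tasks:
        by_id[task_id].append(path)
    return {tid: paths for tid, paths in by_id.items() if len(paths) > 1}
```

An empty result means the ids are unique across all trees in the combined load.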

pradeepvrd requested review from jessie1111101 and richackard June 28, 2026 15:33

pradeepvrd force-pushed the bench-quality branch from 3606e12 to 936254c Compare June 28, 2026 15:36

jessie1111101 and others added 13 commits June 28, 2026 12:45

feat(deployers): per-run isolation engine (noop deployer + per-run to…

2666753

fix(tf): run-scoped naming to avoid cross-run collisions

7d5b670

fix(tasks): mark manifest-generation tasks deployer: noop

a1c6f85

fix(matrix): pin per-task NAMESPACE/TARGET_DEPLOYMENT_NAME

fb21892

fix(optimize-scale): make chaos load reach the workload

ae8989f

fix(opa-remediation): retry Kyverno webhook apply

21970b3

docs(evals): stale-state rerun cleanup + gke-mcp mutation blast radius

a7c7513

feat(harness): add validated flag for leaderboard gating

c033788

feat(models): add google-vertex aliases; pass KUBECONFIG to MCP servers

d9987a1

fix(tasks): correct manifest schemas and make generate-only tasks noop

a4afd77

feat(tf): add hypercomputer-d1 stack; wire deploy-config/fix-config/g…

b6a1239

pradeepvrd force-pushed the bench-quality branch from dec4ea3 to 153c8ce Compare June 28, 2026 19:49

richackard reviewed Jun 28, 2026

Comment thread complextasks/migration-and-upgrade/task.yaml

pradeepvrd force-pushed the bench-quality branch from 983e8b4 to c1432af Compare June 28, 2026 20:45

pradeepvrd mentioned this pull request Jun 28, 2026

feat(tasks): make all benchmark tasks parallel-run safe #133

Closed

pradeepvrd added 10 commits June 28, 2026 13:55

feat(tasks): replace parallelstore-csi with lustre-csi deployment

5b95d47

fix(complextasks): grade migration-and-upgrade & multi-region-failove…

39bcd85

fix(tf): sweep leaked hello-app Artifact Registry repo on minimum tea…

20692f5

fix(bastion): wire agent +skills to gke-mcp repo; sync hygiene; agent…

ac4176b

docs(evals): bastion kind requirements, parallel-evals failure modes,…

1252dcf

pradeepvrd force-pushed the bench-quality branch from c1432af to c757b09 Compare June 28, 2026 20:57

pradeepvrd mentioned this pull request Jun 28, 2026

refactor(tasks): merge complextasks into tasks/ and group by gcp/kind/noop/common #146

Open

pradeepvrd wants to merge 24 commits into gke-labs:main from pradeepvrd:bench-quality

3 participants
