Benchmark quality: parallel-safety + harness/metrics/task fixes#141
Open
pradeepvrd wants to merge 24 commits into
…fu stack-dir)
Add a noop deployer for manifest-generation tasks and copy each run's OpenTofu stack into a per-run working dir so concurrent matrix runs no longer collide on shared tofu state.
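A minimal sketch of the per-run stack copy described above, assuming each run has a unique id and that state files and provider caches should not be carried over. The names `prepare_run_stack`, `runs_root`, and the ignore patterns are illustrative assumptions, not the harness's actual API.

```python
import shutil
from pathlib import Path


def prepare_run_stack(stack_src: Path, runs_root: Path, run_id: str) -> Path:
    """Copy a tofu stack into a directory owned by exactly one run (hypothetical helper)."""
    run_dir = runs_root / run_id / stack_src.name
    if run_dir.exists():
        # A leftover directory means stale state from an earlier run; refuse to reuse it.
        raise FileExistsError(f"stale per-run stack dir: {run_dir}")
    # Assumption: local state and provider caches are skipped so every run starts clean.
    shutil.copytree(
        stack_src,
        run_dir,
        ignore=shutil.ignore_patterns(".terraform", "*.tfstate", "*.tfstate.backup"),
    )
    return run_dir
```

Two concurrent runs calling this with different `run_id` values get separate working directories, so neither writes state into the other's stack.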
Derive run-unique GitOps repo paths and cluster names (multi-region e-/w- prefix to dodge the node-SA substr collision), declare the minimum-stack namespace var, and document the kind-task parallel model.
Manifest-only tasks skip cluster provisioning (no shared cluster to collide on); deploy-hello-app uses a run-unique Artifact Registry repo.
Add task_extra_env so both eval arms agree on fixture namespace and target deployment for pre-seeded tasks.
Serve on port 8080 with a CPU-burn workload so the HPA actually scales, install fortio on the bastion, and pin the fixture's deployment/namespace.
Retry the policy apply until the Kyverno admission webhook is serving, so fixture setup no longer flakes with context-deadline-exceeded.
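The retry can be sketched as a bounded loop around the apply step. This is an illustration under assumptions: the attempt count, delay, and the helper names `apply_with_retry` and `kubectl_apply` are hypothetical, and the real fixture may detect webhook readiness differently.

```python
import subprocess
import time
from typing import Callable


def apply_with_retry(
    apply: Callable[[], None],
    attempts: int = 10,
    delay_sec: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call apply() until it succeeds; return how many attempts were used."""
    for attempt in range(1, attempts + 1):
        try:
            apply()
            return attempt
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(delay_sec)  # webhook not serving yet; wait and try again
    raise ValueError("attempts must be >= 1")


def kubectl_apply(path: str) -> None:
    """Apply a manifest; raises CalledProcessError on a non-zero exit."""
    subprocess.run(["kubectl", "apply", "-f", path], check=True)
```

A caller would pass something like `lambda: kubectl_apply("policy.yaml")`, where the file name is a placeholder.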
Document clearing stale per-run state before reruns and the cluster-mutation risk when gke-mcp exposes unrelated clusters.
Add a validated field to the task schema and result row; only vetted tasks promote to the leaderboard. Plumb it through results normalization and the site schema/seed data.
Thread a generation_only flag so manifest-only (deployer: noop) tasks aren't penalized by the OutcomeValidity judge for not applying to a cluster, and emit generation_only/validated on result records. Strip the MCP server prefix (server__tool) so expected-tool matching is canonical.
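The prefix stripping might look like the sketch below, assuming the prefix is everything up to the first double underscore. The function name and the example tool names are hypothetical.

```python
def canonical_tool_name(name: str) -> str:
    """Drop a leading '<server>__' prefix, if present, and return the bare tool name."""
    _, sep, tool = name.partition("__")
    # Keep the original name when there is no separator or nothing follows it.
    return tool if sep and tool else name


def used_expected_tools(called: list, expected: list) -> bool:
    """True when every expected tool appears among the calls, ignoring server prefixes."""
    seen = {canonical_tool_name(c) for c in called}
    return all(canonical_tool_name(e) in seen for e in expected)
```

Normalizing both sides means an expected-tool list written with bare names still matches calls recorded with a server prefix.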
…allback
Resolve the target Service's external LB IP and rewrite the action URL to http://<ip>:8080, falling back to port-forward when resolution times out or is skipped in smoke tests; wait for rollout before forwarding. Expose optimize-scale via LoadBalancer so load reaches it at 300+ qps.
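A rough sketch of the resolve-then-fall-back logic, assuming the IP lookup is supplied as a callable that returns `None` until the load balancer has an address. The function name, timeout values, and the fallback URL are illustrative assumptions; only port 8080 comes from the commit message.

```python
import time
from typing import Callable, Optional


def resolve_action_url(
    lookup_ip: Callable[[], Optional[str]],
    fallback_url: str,
    timeout_sec: float = 120.0,
    poll_sec: float = 5.0,
    skip: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return http://<ip>:8080 once the LB IP resolves, else the port-forward URL."""
    if skip:
        # Smoke tests skip resolution and go straight to the fallback.
        return fallback_url
    deadline = time.monotonic() + timeout_sec
    while True:
        ip = lookup_ip()
        if ip:
            return f"http://{ip}:8080"
        if time.monotonic() >= deadline:
            return fallback_url  # resolution timed out
        sleep(poll_sec)
```

Here `fallback_url` would be whatever local address the port-forward exposes, for example `http://127.0.0.1:8080`.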
Resolve google-vertex/google_vertex to the gemini adapter, and pass the run-scoped KUBECONFIG to MCP servers so they use run credentials.
Fix HPA/computeclass API versions and spec paths, clarify generate-only prompts, convert create/modify-deployment to the noop deployer, and reframe gpu-stress-test-diagnosis as post-incident log analysis.
…et-app-architecture
Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS bucket, Workload Identity KSA, frontend) with seed_mode variants, and point the three tasks at it on e2-standard-4.
richackard
reviewed
Jun 28, 2026
Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI task (18TB capacity, L4 GPU nodes) for model-serving storage.
…r on outcome
Accept either failover or direct primary recovery (service restored = 2xx) and clarify kind's control-plane upgrade path.
…rdown
deploy-hello-app creates hello-app-<cluster> in project-global Artifact Registry, which cluster teardown never removes; add a destroy-time null_resource to delete it so it doesn't leak across runs.
… timeout
Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200, and strip macOS AppleDouble files during sync.
… stale-state wipe
Document kind toolchain/cleanup needs and expanded matrix failure modes, and strengthen stale per-run state guidance to wipe all state before every run.
The tar `--exclude='results'` matched any path component named 'results', so it stripped the `devops_bench/results/` SOURCE module (the rows.json / manifest.json builder) from every sync. The harness's `build_rows` import then failed inside its per-metric/best-effort try-except, so runs silently produced no rows.json/manifest.json (leaderboard rows). The eval-output `results/` dir is already excluded by not being in the synced path allowlist.
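To illustrate the failure mode, the sketch below contrasts an exclude that matches any path component with one anchored to a top-level directory. It is a Python model of the matching behavior described above, not the sync script itself; the function names and example paths are hypothetical.

```python
from pathlib import PurePosixPath


def excluded_by_component(rel_path: str, name: str) -> bool:
    """Unanchored exclude: drop the path if ANY component equals name."""
    return name in PurePosixPath(rel_path).parts


def excluded_by_anchored_prefix(rel_path: str, prefix: str) -> bool:
    """Anchored exclude: drop only paths at or under one specific directory."""
    path = PurePosixPath(rel_path)
    root = PurePosixPath(prefix)
    return path == root or root in path.parents


source_module = "devops_bench/results/__init__.py"  # code that must be synced
eval_output = "results/run_1/rows.json"             # output that may be skipped
```

With these inputs the unanchored rule drops both paths, while the anchored rule drops only the eval output and keeps the source module.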
The parallel matrix runs one task per process, pointing the loader at a single <task-dir>/task.yaml. _load_single_file used the file stem -- the literal "task" -- for both folder and the name fallback, while only the directory loader used the containing dir name. Every emitted leaderboard row therefore carried taskFolder="task", and the dashboard's derive() (which groups tasks by taskFolder) collapsed a whole setup's tasks into a single task. Fix: for a single spec named task.yaml, derive folder and name from the parent directory, mirroring the directory loader; keep the stem fallback for arbitrarily-named single specs.
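The folder derivation can be sketched as follows. The function name and signature are assumptions for illustration; the actual `_load_single_file` may structure this differently.

```python
from pathlib import Path
from typing import Optional, Tuple


def folder_and_name(spec_path: Path, declared_name: Optional[str] = None) -> Tuple[str, str]:
    """Derive (folder, name) for a single task spec file."""
    if spec_path.name == "task.yaml":
        # Mirror the directory loader: the containing directory identifies the task.
        # Fall back to the stem if the file has no named parent (e.g. a bare "task.yaml").
        folder = spec_path.parent.name or spec_path.stem
    else:
        # Arbitrarily named single specs keep the stem fallback.
        folder = spec_path.stem
    return folder, declared_name or folder
```

For `tasks/optimize-scale/task.yaml` this yields the folder `optimize-scale` instead of the literal `task`, so rows from different tasks no longer share one `taskFolder`.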
The parallel matrix runs one task per process, emitting one rows.json per task, each with a unique runId and its own timestamp t. The dashboard models a run as a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and derive.mjs groups runs by t and shows only the latest run's tasks -- so the per-task files render as many single-task runs and a setup surfaces just one task. Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix so concurrent or repeated matrix runs never collide on the setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried tasks (latest t wins), and write a combined rows.json + per-setup manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals skill.
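A minimal sketch of the aggregation step, under stated assumptions: rows are dicts carrying the `setupId`, `taskFolder`, `iteration`, `runId`, and `t` fields named above, `t` values are mutually comparable, and a retry is identified by a repeated (`setupId`, `taskFolder`, `iteration`) triple. The function names are hypothetical and this is not the contents of `aggregate.py`.

```python
import os
import time
from typing import Any, Dict, Iterable, List, Optional


def new_batch_run_id(now: Optional[float] = None, pid: Optional[int] = None) -> str:
    """Build run_<ts>_<pid>; the pid suffix keeps concurrent batches distinct."""
    ts = time.strftime("%Y%m%d_%H%M%S", time.gmtime(now))
    return f"run_{ts}_{os.getpid() if pid is None else pid}"


def aggregate_rows(rows: Iterable[Dict[str, Any]], run_id: str, t: Any) -> List[Dict[str, Any]]:
    """Merge per-task rows into one batch run sharing a single runId and t."""
    latest: Dict[tuple, Dict[str, Any]] = {}
    for row in rows:
        key = (row["setupId"], row["taskFolder"], row["iteration"])
        # De-dupe retried tasks: the row with the latest t wins.
        if key not in latest or row["t"] > latest[key]["t"]:
            latest[key] = row
    # Stamp the shared identifiers onto copies so the inputs are left untouched.
    return [{**row, "runId": run_id, "t": t} for row in latest.values()]
```

After this step every surviving row shares one `runId` and one `t`, which is the shape the dashboard's run grouping expects.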
load.mjs rejected every runId produced by an isolated/parallel run: the producer makes run ids unique per process by appending a suffix (pid or matrix id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs failed validation and could not be ingested. The timestamp alone is not unique, so the suffix is required to keep the setupId__runId__taskFolder__iteration doc id distinct across parallel runs. Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL contract, and cover the suffixed form in load.test.mjs.
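The loosened pattern can be illustrated in Python (the real check lives in `load.mjs`, so this is a rendition, not the shipped regex). The allowed suffix characters are an assumption.

```python
import re

# Old contract: a bare timestamped id only.
STRICT_RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}")

# Loosened contract: same prefix plus an optional _<suffix>.
# Assumption: the suffix is alphanumeric with optional hyphens.
RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9-]+)?")


def is_valid_run_id(run_id: str) -> bool:
    """Accept both the bare and the suffixed form; reject anything else."""
    return RUN_ID_RE.fullmatch(run_id) is not None
```

Keeping the suffix optional means ids from older, non-parallel runs still validate.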
Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and synthesized placeholder metadata (provider "Unknown", default logo) because the model was absent from the catalog, so the leaderboard showed a generic entry. Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a matching "gemini" brand glyph so the leaderboard renders a logo.
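A sketch of substring-based catalog lookup, assuming the catalog is keyed by a model-family fragment. The field names, the `gemini-3.1-pro` key, and the versioned example id are illustrative; the dashboard's actual catalog schema may differ.

```python
from typing import Dict, Optional

# Hypothetical catalog: one curated entry covers every id containing the key.
CATALOG: Dict[str, Dict[str, str]] = {
    "gemini-3.1-pro": {
        "name": "Gemini 3.1 Pro",
        "provider": "Google",
        "license": "Proprietary",
        "logo": "gemini",
    },
}


def lookup_model(model_id: str) -> Optional[Dict[str, str]]:
    """Return the curated entry whose key is a substring of model_id, else None."""
    key = model_id.lower()
    for fragment, entry in CATALOG.items():
        if fragment in key:
            return entry
    return None  # caller falls back to placeholder metadata and a warning
```

Substring matching lets the preview id and any versioned variants resolve to the same entry without listing each one.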
complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a combined load (or any cross-tree id check) saw duplicate task ids -- the same collision class flagged for migration-and-upgrade vs lustre-csi. Renumber the four colliding complextasks into the free 17-20 range so the trees no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20 block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19, opa-remediation 5->20). Tasks are selected by path, not numeric id, so this only affects loader ordering/de-dup.
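A cross-tree id check of the kind mentioned above could be sketched like this. The function name is hypothetical, and the `tasks/task-N` paths are placeholders because the commit message does not say which task holds which id; the four complextasks ids come from the renumbering listed above.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_task_ids(ids_by_path: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each task id used more than once to the sorted paths that claim it."""
    owners: Dict[int, List[str]] = defaultdict(list)
    for path, task_id in ids_by_path.items():
        owners[task_id].append(path)
    return {tid: sorted(paths) for tid, paths in owners.items() if len(paths) > 1}


# Placeholder paths for the tasks/ tree, which keeps ids 1-14.
tasks_tree = {f"tasks/task-{i}": i for i in range(1, 15)}

before = {**tasks_tree,
          "complextasks/optimize-scale": 1, "complextasks/secret-rotation": 2,
          "complextasks/cp-recovery": 3, "complextasks/opa-remediation": 5}
after = {**tasks_tree,
         "complextasks/optimize-scale": 17, "complextasks/secret-rotation": 18,
         "complextasks/cp-recovery": 19, "complextasks/opa-remediation": 20}
```

Running the check on `before` reports ids 1, 2, 3, and 5 as shared between the trees, while `after` reports no duplicates.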
Summary
Benchmark quality pass: parallel-run safety plus harness, metrics, task-definition, and leaderboard-ingest fixes. Each fix's details are in its commit message; this lists the top-level commits by author.
Commits
Jessie Liu (7)
- feat(deployers): per-run isolation engine (noop deployer + per-run tofu stack-dir)
- fix(tf): run-scoped naming to avoid cross-run collisions
- fix(tasks): mark manifest-generation tasks `deployer: noop`
- fix(matrix): pin per-task `NAMESPACE`/`TARGET_DEPLOYMENT_NAME`
- fix(optimize-scale): make chaos load reach the workload
- fix(opa-remediation): retry Kyverno webhook apply
- docs(evals): stale-state rerun cleanup + gke-mcp mutation blast radius

pradeepvrd (17)
- feat(harness): add `validated` flag for leaderboard gating
- feat(metrics): generation-only judging + MCP tool-name normalization
- feat(chaos): route load via external LoadBalancer with port-forward fallback
- feat(models): add `google-vertex` aliases; pass `KUBECONFIG` to MCP servers
- fix(tasks): correct manifest schemas and make generate-only tasks noop
- feat(tf): add `hypercomputer-d1` stack; wire deploy-config/fix-config/get-app-architecture
- feat(tasks): replace parallelstore-csi with lustre-csi deployment
- fix(complextasks): grade migration-and-upgrade & multi-region-failover on outcome
- fix(tf): sweep leaked hello-app Artifact Registry repo on minimum teardown
- fix(bastion): wire agent+skills to gke-mcp repo; sync hygiene; agent timeout
- docs(evals): bastion kind requirements, parallel-evals failure modes, stale-state wipe
- fix(bastion): stop sync from excluding `devops_bench/results`
- fix(tasks): report the real task folder for single-file task.yaml loads
- feat(results): aggregate per-task parallel runs into one dashboard run
- fix(ingest): accept a uniqueness suffix on runId
- feat(ingest): add Gemini 3.1 Pro to the dashboard catalog
- fix(tasks): make task ids globally unique across task trees

Notes
`deploy-postgres-web-app`, `troubleshoot-unhealthy-pod`, `gitops-auto-revert`, `debug-crashloop` and their tf stacks) are intentionally excluded so this merges cleanly.

Testing
- `pytest tests/unit`: 744 passed.
- `site_new` vitest `ingest/`: 25 passed.