Benchmark quality: parallel-safety + harness/metrics/task fixes #141

Open
pradeepvrd wants to merge 24 commits into gke-labs:main from pradeepvrd:bench-quality

pradeepvrd commented Jun 28, 2026

Collaborator

Summary

Benchmark quality pass: parallel-run safety plus harness, metrics, task-definition, and leaderboard-ingest fixes. The details of each fix are in its commit message; the list below groups the top-level commits by author.

Commits

Jessie Liu (7)

  • feat(deployers) per-run isolation engine (noop deployer + per-run tofu stack-dir)
  • fix(tf) run-scoped naming to avoid cross-run collisions
  • fix(tasks) mark manifest-generation tasks deployer: noop
  • fix(matrix) pin per-task NAMESPACE/TARGET_DEPLOYMENT_NAME
  • fix(optimize-scale) make chaos load reach the workload
  • fix(opa-remediation) retry Kyverno webhook apply
  • docs(evals) stale-state rerun cleanup + gke-mcp mutation blast radius

pradeepvrd (17)

  • feat(harness) add validated flag for leaderboard gating
  • feat(metrics) generation-only judging + MCP tool-name normalization
  • feat(chaos) route load via external LoadBalancer with port-forward fallback
  • feat(models) add google-vertex aliases; pass KUBECONFIG to MCP servers
  • fix(tasks) correct manifest schemas and make generate-only tasks noop
  • feat(tf) add hypercomputer-d1 stack; wire deploy-config/fix-config/get-app-architecture
  • feat(tasks) replace parallelstore-csi with lustre-csi deployment
  • fix(complextasks) grade migration-and-upgrade & multi-region-failover on outcome
  • fix(tf) sweep leaked hello-app Artifact Registry repo on minimum teardown
  • fix(bastion) wire agent +skills to gke-mcp repo; sync hygiene; agent timeout
  • docs(evals) bastion kind requirements, parallel-evals failure modes, stale-state wipe
  • fix(bastion) stop sync from excluding devops_bench/results
  • fix(tasks) report the real task folder for single-file task.yaml loads
  • feat(results) aggregate per-task parallel runs into one dashboard run
  • fix(ingest) accept a uniqueness suffix on runId
  • feat(ingest) add Gemini 3.1 Pro to the dashboard catalog
  • fix(tasks) make task ids globally unique across task trees

Notes

  • WIP tasks owned by separate PRs (deploy-postgres-web-app, troubleshoot-unhealthy-pod, gitops-auto-revert, debug-crashloop and their tf stacks) are intentionally excluded so this merges cleanly.

Testing

  • pytest tests/unit — 744 passed.
  • site_new vitest ingest/ — 25 passed.

jessie1111101 and others added 13 commits June 28, 2026 12:45
…fu stack-dir)

Add a noop deployer for manifest-generation tasks and copy each run's
OpenTofu stack into a per-run working dir so concurrent matrix runs no
longer collide on shared tofu state.
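
A minimal sketch of the per-run stack-dir idea, assuming the isolation amounts to copying the stack source into a directory keyed by run id before tofu runs there. The helper name `prepare_run_stack`, the `runs_root` layout, and the run ids are hypothetical and are not taken from the harness.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_run_stack(stack_src: Path, runs_root: Path, run_id: str) -> Path:
    """Copy a stack into runs_root/<run_id>/<stack name> and return the copy.

    Hypothetical helper: each run then works in its own directory, so
    concurrent runs never share local state files.
    """
    dest = runs_root / run_id / stack_src.name
    # dirs_exist_ok=False: refuse to reuse a directory left by an earlier run.
    shutil.copytree(stack_src, dest, dirs_exist_ok=False)
    return dest


# Demo with a throwaway stack: two runs get two independent working dirs.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "stacks" / "minimum"
    src.mkdir(parents=True)
    (src / "main.tf").write_text("# placeholder\n")
    dir_a = prepare_run_stack(src, Path(tmp) / "runs", "run_a")
    dir_b = prepare_run_stack(src, Path(tmp) / "runs", "run_b")
    distinct = dir_a != dir_b
    both_copied = (dir_a / "main.tf").is_file() and (dir_b / "main.tf").is_file()
```

Refusing to overwrite an existing per-run directory is a design choice of this sketch: it surfaces stale state instead of silently reusing it.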

Derive run-unique GitOps repo paths and cluster names (multi-region e-/w-
prefix to dodge the node-SA substr collision), declare the minimum-stack
namespace var, and document the kind-task parallel model.

Manifest-only tasks skip cluster provisioning (no shared cluster to
collide on); deploy-hello-app uses a run-unique Artifact Registry repo.

Add task_extra_env so both eval arms agree on fixture namespace and
target deployment for pre-seeded tasks.

Serve on port 8080 with a CPU-burn workload so the HPA actually scales,
install fortio on the bastion, and pin the fixture's deployment/namespace.

Retry the policy apply until the Kyverno admission webhook is serving,
so fixture setup no longer flakes with context-deadline-exceeded.
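
The retry described above can be sketched as a bounded loop around the apply step. This is an illustration under assumptions: `apply` stands in for whatever applies the policy, and the attempt count, delay, and exception types are hypothetical defaults rather than the fixture's actual values.

```python
import time
from typing import Callable, Tuple, Type


def apply_with_retry(
    apply: Callable[[], None],
    attempts: int = 10,
    delay_s: float = 3.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call apply() until it succeeds; return the attempt number that worked.

    Raises TimeoutError (chained to the last failure) once attempts run out.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            apply()
            return attempt
        except retry_on as err:
            last_error = err
            # Only wait when another attempt is still coming.
            if attempt < attempts:
                sleep(delay_s)
    raise TimeoutError(f"apply still failing after {attempts} attempts") from last_error
```

Injecting `sleep` keeps the loop testable without real waiting; a caller would normally leave it at the default.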

Document clearing stale per-run state before reruns and the cluster-
mutation risk when gke-mcp exposes unrelated clusters.

Add a validated field to the task schema and result row; only vetted
tasks promote to the leaderboard. Plumb it through results normalization
and the site schema/seed data.

Thread a generation_only flag so manifest-only (deployer: noop) tasks
aren't penalized by the OutcomeValidity judge for not applying to a
cluster, and emit generation_only/validated on result records. Strip the
MCP server prefix (server__tool) so expected-tool matching is canonical.
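
A minimal sketch of the prefix stripping, assuming tool names arrive as `<server>__<tool>` and that matching compares bare tool names. The function names and the example server/tool names are hypothetical.

```python
from typing import Iterable, Set


def normalize_tool_name(name: str, sep: str = "__") -> str:
    """Drop a leading '<server>__' prefix; leave unprefixed names untouched."""
    server, found, tool = name.partition(sep)
    # Only strip when there is a non-empty server and a non-empty tool part.
    return tool if (found and server and tool) else name


def missing_expected_tools(expected: Iterable[str], called: Iterable[str]) -> Set[str]:
    """Return expected tools that were never called, comparing canonical names."""
    seen = {normalize_tool_name(c) for c in called}
    return {normalize_tool_name(e) for e in expected} - seen
```

For example, `missing_expected_tools(["list_clusters"], ["gke-mcp__list_clusters"])` returns an empty set, because both sides reduce to `list_clusters`.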
…allback

Resolve the target Service's external LB IP and rewrite the action URL to
http://<ip>:8080, falling back to port-forward when resolution times out
or is skipped in smoke tests; wait for rollout before forwarding. Expose
optimize-scale via LoadBalancer so load reaches it at 300+ qps.
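
The resolve-then-fall-back flow can be sketched as follows. This is an illustration only: `get_lb_ip` is a hypothetical callable that returns the Service's external IP or `None` while it is still pending, and the timeout and polling values are assumptions. Returning `None` signals the caller to use port-forward instead.

```python
import time
from typing import Callable, Optional


def pick_load_url(
    get_lb_ip: Callable[[], Optional[str]],
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
    port: int = 8080,
    skip: bool = False,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return http://<ip>:<port>, or None to signal 'fall back to port-forward'."""
    if skip:  # e.g. smoke tests that never provision a LoadBalancer
        return None
    deadline = clock() + timeout_s
    while True:
        ip = get_lb_ip()
        if ip:
            return f"http://{ip}:{port}"
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

Keeping the lookup, clock, and sleep injectable lets the timeout path be exercised without a cluster.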

Resolve google-vertex/google_vertex to the gemini adapter, and pass the
run-scoped KUBECONFIG to MCP servers so they use run credentials.

Fix HPA/computeclass API versions and spec paths, clarify generate-only
prompts, convert create/modify-deployment to the noop deployer, and
reframe gpu-stress-test-diagnosis as post-incident log analysis.

…et-app-architecture

Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS
bucket, Workload Identity KSA, frontend) with seed_mode variants, and
point the three tasks at it on e2-standard-4.

Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI
task (18TB capacity, L4 GPU nodes) for model-serving storage.

…r on outcome

Accept either failover or direct primary recovery (service restored = 2xx)
and clarify kind's control-plane upgrade path.

…rdown

deploy-hello-app creates hello-app-<cluster> in project-global Artifact
Registry, which cluster teardown never removes; add a destroy-time
null_resource to delete it so it doesn't leak across runs.

… timeout

Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge
rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200,
and strip macOS AppleDouble files during sync.

… stale-state wipe

Document kind toolchain/cleanup needs and expanded matrix failure modes,
and strengthen stale per-run state guidance to wipe all state before
every run.

The tar `--exclude='results'` pattern matched any path component named
'results', so it stripped the `devops_bench/results/` source module (the
rows.json / manifest.json builder) from every sync. The harness's `build_rows`
import then failed inside its per-metric, best-effort try/except, so runs
silently produced no rows.json/manifest.json and therefore no leaderboard rows.
The eval-output `results/` dir is already excluded because it is not in the
synced path allowlist.
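
To illustrate the failure mode (not the sync script itself), the sketch below contrasts an exclude that fires on any path component with one anchored to the top-level directory. Both function names and the example paths under the directories are hypothetical.

```python
from pathlib import PurePosixPath


def excluded_any_component(path: str, name: str = "results") -> bool:
    """Mimic the buggy behaviour: drop a path if ANY component equals name."""
    return name in PurePosixPath(path).parts


def excluded_top_level_only(path: str, name: str = "results") -> bool:
    """Drop a path only when its FIRST component equals name."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] == name


source_module = "devops_bench/results/__init__.py"  # must survive the sync
eval_output = "results/run_1/rows.json"             # safe to leave behind

# The unanchored rule removes the source module along with the eval output.
lost_by_unanchored = excluded_any_component(source_module)
lost_by_anchored = excluded_top_level_only(source_module)
```

Here `lost_by_unanchored` is `True` while `lost_by_anchored` is `False`, which is the distinction the commit relies on when it drops the broad exclude.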

The parallel matrix runs one task per process, pointing the loader at a single
<task-dir>/task.yaml. _load_single_file used the file stem -- the literal
"task" -- for both folder and the name fallback, while only the directory
loader used the containing dir name. Every emitted leaderboard row therefore
carried taskFolder="task", and the dashboard's derive() (which groups tasks by
taskFolder) collapsed a whole setup's tasks into a single task.

Fix: for a single spec named task.yaml, derive folder and name from the parent
directory, mirroring the directory loader; keep the stem fallback for
arbitrarily-named single specs.
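
The folder derivation described in this fix can be sketched as below. `derive_folder` is a hypothetical stand-in for the logic inside `_load_single_file`, whose real signature is not shown here.

```python
from pathlib import Path


def derive_folder(spec_path: Path) -> str:
    """Pick the task folder name for a single-file spec.

    A spec literally named task.yaml takes its parent directory's name
    (mirroring the directory loader); any other file keeps its stem.
    """
    if spec_path.name == "task.yaml":
        # Fall back to the stem if there is no parent directory name to use.
        return spec_path.parent.name or spec_path.stem
    return spec_path.stem
```

With this rule, `tasks/deploy-hello-app/task.yaml` yields `deploy-hello-app` instead of the literal `task`, so rows no longer collapse under one taskFolder.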

The parallel matrix runs one task per process, emitting one rows.json per task,
each with a unique runId and its own timestamp t. The dashboard models a run as
a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and
derive.mjs groups runs by t and shows only the latest run's tasks -- so the
per-task files render as many single-task runs and a setup surfaces just one
task.

Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into
one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix
so concurrent or repeated matrix runs never collide on the
setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried
tasks (latest t wins), and write a combined rows.json + per-setup
manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals
skill.
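
A minimal sketch of the aggregation step, not the contents of aggregate.py. It assumes each row is a dict carrying `setupId`, `taskFolder`, `iteration`, `runId`, and a comparable `t`, and that the shared `t` is the latest one seen; both function names are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def make_run_id(now: datetime, pid: int) -> str:
    """Build run_<YYYYMMDD_HHMMSS>_<pid> so repeated or concurrent runs differ."""
    return f"run_{now:%Y%m%d_%H%M%S}_{pid}"


def aggregate_rows(rows: Iterable[Row], run_id: str) -> List[Row]:
    """Merge per-task rows into one batch run that shares runId and t."""
    latest: Dict[Tuple[str, str, int], Row] = {}
    for row in rows:
        key = (row["setupId"], row["taskFolder"], row["iteration"])
        kept = latest.get(key)
        # De-dupe retried tasks: the row with the latest t wins.
        if kept is None or row["t"] > kept["t"]:
            latest[key] = row
    if not latest:
        return []
    shared_t = max(row["t"] for row in latest.values())
    # Copy each row so the caller's input is not mutated.
    return [{**row, "runId": run_id, "t": shared_t} for row in latest.values()]
```

Because every output row shares one `runId` and one `t`, a dashboard that groups by `t` would see a single run whose tasks differ only by `taskFolder`.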

load.mjs rejected every runId produced by an isolated/parallel run: the
producer makes run ids unique per process by appending a suffix (pid or matrix
id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs
failed validation and could not be ingested. The timestamp alone is not unique,
so the suffix is required to keep the setupId__runId__taskFolder__iteration doc
id distinct across parallel runs.

Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL
contract, and cover the suffixed form in load.test.mjs.
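
The loosening can be illustrated with a Python equivalent of the two patterns. The real RUN_ID_RE lives in load.mjs and may differ; in particular, the character set allowed in the suffix here is an assumption.

```python
import re

# Old contract: a bare run_YYYYMMDD_HHMMSS and nothing else.
STRICT_RUN_ID = re.compile(r"run_\d{8}_\d{6}")

# Loosened contract: the same stem plus an optional _<suffix>.
# The suffix alphabet (letters, digits, hyphen) is an assumption of this sketch.
LOOSE_RUN_ID = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9-]+)?")


def is_valid_run_id(run_id: str, pattern: re.Pattern = LOOSE_RUN_ID) -> bool:
    """Return True when the whole string matches the given run-id pattern."""
    return pattern.fullmatch(run_id) is not None
```

Under these patterns a suffixed id such as `run_20260628_124500_4242` fails the strict check and passes the loose one, while the bare form passes both.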

Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and
synthesized placeholder metadata (provider "Unknown", default logo) because the
model was absent from the catalog, so the leaderboard showed a generic entry.

Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via
substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a
matching "gemini" brand glyph so the leaderboard renders a logo.
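
A rough sketch of substring-based catalog lookup with a placeholder fallback. The catalog shape, the field names, and the `gemini-3.1-pro` needle are assumptions made for illustration; the dashboard's actual catalog is written in JavaScript and is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog: (substring to look for, curated entry).
CATALOG: List[Tuple[str, Dict[str, str]]] = [
    (
        "gemini-3.1-pro",
        {
            "name": "Gemini 3.1 Pro",
            "provider": "Google",
            "license": "Proprietary",
            "logo": "gemini",
        },
    ),
]


def lookup_model(model_id: str) -> Dict[str, str]:
    """Return the curated entry whose needle occurs in model_id, else a placeholder."""
    lowered = model_id.lower()
    for needle, entry in CATALOG:
        if needle in lowered:
            return entry
    # Placeholder metadata, analogous to the generic entry described above.
    return {"name": model_id, "provider": "Unknown", "license": "Unknown", "logo": "default"}
```

Matching on a substring means preview and versioned ids resolve to the same curated entry without listing each one.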

complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a
combined load (or any cross-tree id check) saw duplicate task ids -- the same
collision class flagged for migration-and-upgrade vs lustre-csi.

Renumber the four colliding complextasks into the free 17-20 range so the trees
no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20
block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19,
opa-remediation 5->20). Tasks are selected by path, not numeric id, so this
only affects loader ordering/de-dup.
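
A cross-tree id check of the kind mentioned above could look like the sketch below. It assumes ids are available as a mapping from task path to numeric id; `find_duplicate_ids` and the `tasks/example-task` path are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_ids(ids_by_path: Dict[str, int]) -> Dict[int, List[str]]:
    """Return {task id: sorted paths} for every id used by more than one task."""
    paths_by_id: Dict[int, List[str]] = defaultdict(list)
    for path, task_id in ids_by_path.items():
        paths_by_id[task_id].append(path)
    return {tid: sorted(paths) for tid, paths in paths_by_id.items() if len(paths) > 1}


# Before the renumbering: optimize-scale shares id 1 with a task in tasks/.
before = {"tasks/example-task": 1, "complextasks/optimize-scale": 1}
# After the renumbering: optimize-scale moves to 17 and the overlap disappears.
after = {"tasks/example-task": 1, "complextasks/optimize-scale": 17}

collisions_before = find_duplicate_ids(before)
collisions_after = find_duplicate_ids(after)
```

Running such a check over both trees in a unit test would catch a reused id before it reaches a combined load.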

3 participants