refactor(tasks): merge complextasks into tasks/ and group by gcp/kind/noop/common by pradeepvrd · Pull Request #146 · gke-labs/devops-bench

pradeepvrd · 2026-06-28T22:07:21Z

Summary

Merge the separate complextasks/ tree (and tasks/generic/) into a single tasks/ root and group every task by how it runs. Implements the merge mandated by docs/migration/directory-structure.md, plus a gcp/kind/noop/common sub-taxonomy.

Classification = a task's actual provider dependency (not which tf stack exists):

| Bucket | Count | Criterion | Tasks |
|---|---|---|---|
| `gcp/` | 8 | hard GCP dependency | deploy-config, deploy-hello-app, fix-config, get-app-architecture, gpu-stress-test-diagnosis, lustre-csi-deployment, multi-region-failover, secret-rotation |
| `kind/` | 1 | self-managed control plane only | cp-recovery |
| `common/` | 3 | generic k8s, runs on both | optimize-scale, opa-remediation, migration-and-upgrade |
| `noop/` | 8 | `deployer: noop` generate-only | computeclass-active-migration, computeclass-spot-fallback, create-deployment, gateway-cloud-armor, gateway-https-redirect, hpa-metric-filtering, hpa-renamed-metric, modify-deployment |

Notable calls: optimize-scale has no GCP dependency → common; secret-rotation (Cloud Secret Manager) and deploy-hello-app (mandatory Artifact Registry build/publish) stay gcp; cp-recovery does etcd/control-plane surgery impossible on managed GKE → kind.

All moves are pure git renames (history preserved); stack references are repo-root-relative so they're unaffected. Task-path references updated across the matrix lib (two-root enumeration → single tasks/), bastion scripts, Dockerfile, README, skills, docs, and tests.

Stacking

Branched on top of #141 (which renumbers task ids and renames lustre-csi), since that PR edits the same task files. Review/merge #141 first; the diff unique to this PR is the single refactor(tasks): … commit.

Testing

pytest tests/unit — 744 passed.
find tasks -name task.yaml | wc -l → 20; loader loads 20; task ids remain globally unique.

Thread a generation_only flag so manifest-only (deployer: noop) tasks aren't penalized by the OutcomeValidity judge for not applying to a cluster, and emit generation_only/validated on result records. Strip the MCP server prefix (server__tool) so expected-tool matching is canonical.
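The prefix stripping described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `canonical_tool_name` and the example tool names are hypothetical, and it assumes the server prefix is everything before the first `__`.

```python
def canonical_tool_name(name: str, sep: str = "__") -> str:
    """Drop a leading '<server>__' prefix so tool names compare canonically.

    Assumption: the MCP server prefix is whatever precedes the first `sep`.
    Names without a prefix (or with nothing after it) are returned unchanged.
    """
    _server, found, tool = name.partition(sep)
    return tool if found and tool else name


def expected_tools_called(expected: list[str], called: list[str]) -> bool:
    """True when every expected tool appears among the called tools,
    after both sides are reduced to their canonical (unprefixed) form."""
    called_canonical = {canonical_tool_name(c) for c in called}
    return all(canonical_tool_name(e) in called_canonical for e in expected)
```

With this shape, a hypothetical call recorded as `gke__list_clusters` would match an expected tool written as `list_clusters`.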

…allback Resolve the target Service's external LB IP and rewrite the action URL to http://<ip>:8080, falling back to port-forward when resolution times out or is skipped in smoke tests; wait for rollout before forwarding. Expose optimize-scale via LoadBalancer so load reaches it at 300+ qps.

The tar `--exclude='results'` matched any path component named 'results', so it stripped the `devops_bench/results/` SOURCE module (the rows.json / manifest.json builder) from every sync. The harness's `build_rows` import then failed inside its per-metric/best-effort try-except, so runs silently produced no rows.json/manifest.json (leaderboard rows). The eval-output `results/` dir is already excluded by not being in the synced path allowlist.
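The failure mode above comes down to unanchored versus anchored matching. The sketch below models the two behaviors in plain Python rather than invoking tar; the helper names and sample paths are illustrative assumptions.

```python
from pathlib import PurePosixPath


def excluded_unanchored(path: str, name: str = "results") -> bool:
    """Model an unanchored exclude: drop the path if ANY component equals `name`."""
    return name in PurePosixPath(path).parts


def excluded_anchored(path: str, name: str = "results") -> bool:
    """Model an anchored exclude: drop only a top-level directory called `name`."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] == name


# Illustrative paths (assumed layout): a source module and an eval-output file.
SAMPLE_PATHS = [
    "devops_bench/results/__init__.py",  # source module, must be synced
    "results/run_1/rows.json",           # eval output, safe to skip
]
```

Under the unanchored rule both sample paths are dropped, which is how the source module went missing; under the anchored rule only the top-level output directory is.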

The parallel matrix runs one task per process, pointing the loader at a single <task-dir>/task.yaml. _load_single_file used the file stem -- the literal "task" -- for both folder and the name fallback, while only the directory loader used the containing dir name. Every emitted leaderboard row therefore carried taskFolder="task", and the dashboard's derive() (which groups tasks by taskFolder) collapsed a whole setup's tasks into a single task. Fix: for a single spec named task.yaml, derive folder and name from the parent directory, mirroring the directory loader; keep the stem fallback for arbitrarily-named single specs.
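A minimal sketch of the fixed derivation, assuming the loader only needs a folder and a name fallback from the spec path. The function name `derive_folder_and_name` is hypothetical and stands in for whatever `_load_single_file` does internally.

```python
from pathlib import Path


def derive_folder_and_name(spec_path: str) -> tuple[str, str]:
    """Return (folder, name_fallback) for a single task spec.

    For a spec literally named task.yaml, use the containing directory name,
    mirroring the directory loader. For any other file name, keep the stem.
    """
    p = Path(spec_path)
    if p.name == "task.yaml":
        # Fall back to the stem if there is no parent directory component.
        folder = p.parent.name or p.stem
    else:
        folder = p.stem
    return folder, folder
```

For `tasks/common/optimize-scale/task.yaml` this yields `optimize-scale` rather than the literal `task`, so rows from different tasks no longer share one `taskFolder`.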

The parallel matrix runs one task per process, emitting one rows.json per task, each with a unique runId and its own timestamp t. The dashboard models a run as a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and derive.mjs groups runs by t and shows only the latest run's tasks -- so the per-task files render as many single-task runs and a setup surfaces just one task. Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix so concurrent or repeated matrix runs never collide on the setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried tasks (latest t wins), and write a combined rows.json + per-setup manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals skill.
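The aggregation step can be sketched roughly as below. This is not the contents of `aggregate.py`: the function name, the de-dupe key (setup, task folder, iteration), and the integer type of `t` are assumptions. Only the field names `setupId`, `runId`, `taskFolder`, `iteration`, and `t` come from the description above.

```python
import os
import time


def aggregate_rows(per_task_rows, now=None, pid=None):
    """Combine per-task rows into one batch run.

    - De-dupes retried tasks: for each (setupId, taskFolder, iteration),
      the row with the latest `t` wins.
    - Stamps one shared runId of the form run_<ts>_<pid> and one shared `t`.
    """
    now = time.time() if now is None else now
    pid = os.getpid() if pid is None else pid
    stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime(now))
    run_id = f"run_{stamp}_{pid}"

    latest = {}
    for row in per_task_rows:
        key = (row["setupId"], row["taskFolder"], row.get("iteration", 0))
        if key not in latest or row["t"] > latest[key]["t"]:
            latest[key] = row

    shared_t = int(now)
    # Copy each surviving row so the caller's input is not mutated.
    return [{**row, "runId": run_id, "t": shared_t} for row in latest.values()]
```

Because every output row shares one `runId` and one `t`, a dashboard that groups runs by `t` would see a single multi-task run instead of many single-task runs.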

load.mjs rejected every runId produced by an isolated/parallel run: the producer makes run ids unique per process by appending a suffix (pid or matrix id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs failed validation and could not be ingested. The timestamp alone is not unique, so the suffix is required to keep the setupId__runId__taskFolder__iteration doc id distinct across parallel runs. Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL contract, and cover the suffixed form in load.test.mjs.
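The loosened pattern can be illustrated as follows. The real `RUN_ID_RE` lives in `load.mjs`; this is a Python rendering for illustration only, and the character class allowed in the suffix is an assumption.

```python
import re

# Old contract: a bare run_YYYYMMDD_HHMMSS only.
STRICT_RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}")

# Loosened contract: the same stem plus an optional _<suffix>.
# Assumption: the suffix is letters, digits, underscores, or hyphens.
LOOSE_RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9_-]+)?")


def is_valid_run_id(run_id: str, pattern: re.Pattern = LOOSE_RUN_ID_RE) -> bool:
    """Validate the whole string, not just a prefix."""
    return pattern.fullmatch(run_id) is not None
```

A suffixed id such as `run_20260628_124500_4242` passes the loose pattern and fails the strict one, while a bare timestamp id passes both.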

Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and synthesized placeholder metadata (provider "Unknown", default logo) because the model was absent from the catalog, so the leaderboard showed a generic entry. Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a matching "gemini" brand glyph so the leaderboard renders a logo.
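A rough sketch of substring-based catalog resolution, under the assumption that the catalog is a list of (substring, entry) pairs. The data shape, key names, and the `resolve_model` helper are hypothetical; the site's real catalog may be structured differently.

```python
# Hypothetical catalog: each needle is matched as a substring of the model id.
CATALOG = [
    (
        "gemini-3.1-pro",
        {"name": "Gemini 3.1 Pro", "provider": "Google", "license": "Proprietary", "logo": "gemini"},
    ),
]


def resolve_model(model_id: str) -> dict:
    """Return curated metadata when a catalog needle occurs in the id,
    otherwise a placeholder entry like the one the warning path produced."""
    lowered = model_id.lower()
    for needle, entry in CATALOG:
        if needle in lowered:
            return entry
    return {"name": model_id, "provider": "Unknown", "license": "Unknown", "logo": "default"}
```

Matching on the shared substring is what lets the preview id, the stable id, and versioned variants all land on the same curated entry.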

complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a combined load (or any cross-tree id check) saw duplicate task ids -- the same collision class flagged for migration-and-upgrade vs lustre-csi. Renumber the four colliding complextasks into the free 17-20 range so the trees no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20 block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19, opa-remediation 5->20). Tasks are selected by path, not numeric id, so this only affects loader ordering/de-dup.
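The collision can be checked with a small cross-tree id scan. This sketch assumes task ids have already been read into a path-to-id mapping; the helper name and the placeholder path for the `tasks/` entry are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ids(task_ids: dict[str, int]) -> dict[int, list[str]]:
    """Group task paths by numeric id and return only ids used more than once."""
    by_id = defaultdict(list)
    for path, task_id in task_ids.items():
        by_id[task_id].append(path)
    return {i: sorted(paths) for i, paths in by_id.items() if len(paths) > 1}


# Before the renumber: optimize-scale reuses id 1 (the tasks/ path is a placeholder).
before = {"tasks/<task-with-id-1>": 1, "complextasks/optimize-scale": 1}
# After the renumber: optimize-scale moves to 17, so the overlap disappears.
after = {"tasks/<task-with-id-1>": 1, "complextasks/optimize-scale": 17}
```

Running `find_duplicate_ids` over `before` reports id 1 with both paths, and over `after` reports nothing.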

…/noop/common Merge the separate complextasks/ tree and tasks/generic/ into a single tasks/ root (per docs/migration/directory-structure.md) and group every task by its ACTUAL provider dependency, not by which tf stack happens to exist:

- gcp/ (8): hard GCP dependency -- Cloud Secret Manager, Cloud SQL, Managed Lustre, Hypercomputer/Workload Identity, GKE GPUs, multi-region GKE, or a mandatory Artifact Registry build/publish. Cannot run on kind.
- kind/ (1): cp-recovery -- control-plane/etcd surgery, only meaningful on a self-managed cluster (its own header marks it the kind-only exception).
- common/ (3): optimize-scale, opa-remediation, migration-and-upgrade -- generic k8s scenarios runnable on both gcp and kind.
- noop/ (8): deployer:noop generate-only tasks.

Classification is by what the task needs: optimize-scale (no GCP dep) -> common, while secret-rotation (Cloud Secret Manager) and deploy-hello-app (mandatory Artifact Registry) remain gcp.

Moves are pure git renames; stack refs are repo-root-relative and unaffected. All task-path references updated: the matrix enumeration collapses to a single tasks/ root, plus scripts, Dockerfile, skills, docs, and tests.

jessie1111101 and others added 25 commits June 28, 2026 12:45

feat(deployers): per-run isolation engine (noop deployer + per-run tofu stack-dir)

2666753

Add a noop deployer for manifest-generation tasks and copy each run's OpenTofu stack into a per-run working dir so concurrent matrix runs no longer collide on shared tofu state.
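The per-run copy can be sketched as below. This is only an illustration of the idea, not the deployer's code: the function name `stage_stack`, the directory layout, and the choice to skip local state files during the copy are all assumptions.

```python
import shutil
from pathlib import Path


def stage_stack(stack_dir, runs_root, run_id):
    """Copy a stack into <runs_root>/<run_id>/<stack-name> and return that path.

    Assumption: local state and provider caches are left behind so each run
    starts from a clean working dir. copytree raises FileExistsError if the
    destination already exists, which guards against reusing a run dir.
    """
    dest = Path(runs_root) / run_id / Path(stack_dir).name
    shutil.copytree(
        stack_dir,
        dest,
        ignore=shutil.ignore_patterns(".terraform", "*.tfstate", "*.tfstate.*"),
    )
    return dest
```

Each concurrent run would then point its tofu commands at its own returned directory, so no two runs write state into the same place.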

fix(tf): run-scoped naming to avoid cross-run collisions

7d5b670

Derive run-unique GitOps repo paths and cluster names (multi-region e-/w- prefix to dodge the node-SA substr collision), declare the minimum-stack namespace var, and document the kind-task parallel model.

fix(tasks): mark manifest-generation tasks deployer: noop

a1c6f85

Manifest-only tasks skip cluster provisioning (no shared cluster to collide on); deploy-hello-app uses a run-unique Artifact Registry repo.

fix(matrix): pin per-task NAMESPACE/TARGET_DEPLOYMENT_NAME

fb21892

Add task_extra_env so both eval arms agree on fixture namespace and target deployment for pre-seeded tasks.

fix(optimize-scale): make chaos load reach the workload

ae8989f

Serve on port 8080 with a CPU-burn workload so the HPA actually scales, install fortio on the bastion, and pin the fixture's deployment/namespace.

fix(opa-remediation): retry Kyverno webhook apply

21970b3

Retry the policy apply until the Kyverno admission webhook is serving, so fixture setup no longer flakes with context-deadline-exceeded.
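A retry loop of this kind might look like the sketch below. The helper name, attempt count, and delay are assumptions rather than the fixture's real values, and `action` stands in for whatever command applies the policy.

```python
import time


def retry(action, attempts=10, delay_sec=5.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted.

    Any exception counts as "webhook not serving yet"; the last one is
    chained onto the final RuntimeError so the root cause stays visible.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose: this is a generic sketch
            last_error = exc
            if attempt < attempts:
                sleep(delay_sec)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

In a fixture, `action` could wrap a `subprocess.run([...], check=True)` call around the policy apply, so a non-zero exit raises and triggers the next attempt.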

docs(evals): stale-state rerun cleanup + gke-mcp mutation blast radius

a7c7513

Document clearing stale per-run state before reruns and the cluster-mutation risk when gke-mcp exposes unrelated clusters.

feat(harness): add validated flag for leaderboard gating

c033788

Add a validated field to the task schema and result row; only vetted tasks promote to the leaderboard. Plumb it through results normalization and the site schema/seed data.

feat(models): add google-vertex aliases; pass KUBECONFIG to MCP servers

d9987a1

Resolve google-vertex/google_vertex to the gemini adapter, and pass the run-scoped KUBECONFIG to MCP servers so they use run credentials.

fix(tasks): correct manifest schemas and make generate-only tasks noop

a4afd77

Fix HPA/computeclass API versions and spec paths, clarify generate-only prompts, convert create/modify-deployment to the noop deployer, and reframe gpu-stress-test-diagnosis as post-incident log analysis.

feat(tf): add hypercomputer-d1 stack; wire deploy-config/fix-config/get-app-architecture

b6a1239

Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS bucket, Workload Identity KSA, frontend) with seed_mode variants, and point the three tasks at it on e2-standard-4.

feat(tasks): replace parallelstore-csi with lustre-csi deployment

5b95d47

Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI task (18TB capacity, L4 GPU nodes) for model-serving storage.

fix(complextasks): grade migration-and-upgrade & multi-region-failover on outcome

39bcd85

Accept either failover or direct primary recovery (service restored = 2xx) and clarify kind's control-plane upgrade path.

fix(tf): sweep leaked hello-app Artifact Registry repo on minimum teardown

20692f5

deploy-hello-app creates hello-app-<cluster> in project-global Artifact Registry, which cluster teardown never removes; add a destroy-time null_resource to delete it so it doesn't leak across runs.

fix(bastion): wire agent +skills to gke-mcp repo; sync hygiene; agent timeout

ac4176b

Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200, and strip macOS AppleDouble files during sync.

docs(evals): bastion kind requirements, parallel-evals failure modes, stale-state wipe

1252dcf

Document kind toolchain/cleanup needs and expanded matrix failure modes, and strengthen stale per-run state guidance to wipe all state before every run.

pradeepvrd wants to merge 25 commits into gke-labs:main from pradeepvrd:task-reorg
