refactor(tasks): merge complextasks into tasks/ and group by gcp/kind/noop/common #146
Open
pradeepvrd wants to merge 25 commits into
…fu stack-dir) Add a noop deployer for manifest-generation tasks and copy each run's OpenTofu stack into a per-run working dir so concurrent matrix runs no longer collide on shared tofu state.
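A minimal sketch of the per-run stack copy described above. The function name `stage_stack`, the `runs_root/<run_id>/<stack>` layout, and the decision to skip `.terraform` and `*.tfstate*` are assumptions for illustration; the harness's actual layout may differ.

```python
import shutil
from pathlib import Path


def stage_stack(stack_dir, runs_root, run_id):
    """Copy a stack into a directory private to one run (hypothetical helper).

    Each run works on its own copy, so tofu state written by one matrix
    run is never visible to another.
    """
    dest = Path(runs_root) / run_id / Path(stack_dir).name
    # Assumption: leftover local state and provider caches should not be
    # carried into the fresh per-run copy.
    shutil.copytree(
        stack_dir,
        dest,
        ignore=shutil.ignore_patterns(".terraform", "*.tfstate*"),
    )
    return dest
```

Because `copytree` refuses to overwrite an existing destination by default, reusing a run id fails loudly instead of silently sharing state.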
Derive run-unique GitOps repo paths and cluster names (multi-region e-/w- prefix to dodge the node-SA substr collision), declare the minimum-stack namespace var, and document the kind-task parallel model.
Manifest-only tasks skip cluster provisioning (no shared cluster to collide on); deploy-hello-app uses a run-unique Artifact Registry repo.
Add task_extra_env so both eval arms agree on fixture namespace and target deployment for pre-seeded tasks.
Serve on port 8080 with a CPU-burn workload so the HPA actually scales, install fortio on the bastion, and pin the fixture's deployment/namespace.
Retry the policy apply until the Kyverno admission webhook is serving, so fixture setup no longer flakes with context-deadline-exceeded.
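The retry could look roughly like the sketch below. `retry_until_ok` and its parameters are hypothetical; `apply` stands in for whatever call applies the policy and reports success, and the attempt count and delay are placeholder values.

```python
import time
from typing import Callable


def retry_until_ok(
    apply: Callable[[], bool],
    attempts: int = 10,
    delay_sec: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call apply() until it returns True; return the attempts used.

    Raises TimeoutError if every attempt fails, so fixture setup stops
    with a clear error instead of continuing without the policy.
    """
    for attempt in range(1, attempts + 1):
        if apply():
            return attempt
        if attempt < attempts:
            sleep(delay_sec)  # give the admission webhook time to come up
    raise TimeoutError(f"policy apply still failing after {attempts} attempts")
```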
Document clearing stale per-run state before reruns and the cluster-mutation risk when gke-mcp exposes unrelated clusters.
Add a validated field to the task schema and result row; only vetted tasks promote to the leaderboard. Plumb it through results normalization and the site schema/seed data.
Thread a generation_only flag so manifest-only (deployer: noop) tasks aren't penalized by the OutcomeValidity judge for not applying to a cluster, and emit generation_only/validated on result records. Strip the MCP server prefix (server__tool) so expected-tool matching is canonical.
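A sketch of the prefix stripping, assuming the prefix is everything up to the first `__`. The helper name `canonical_tool_name` and the example tool names are illustrative, not taken from the harness.

```python
def canonical_tool_name(name: str, sep: str = "__") -> str:
    """Drop a leading 'server__' prefix and keep only the tool name.

    Names without the separator are returned unchanged.
    """
    _server, found, tool = name.partition(sep)
    return tool if found and tool else name


# Hypothetical tool names, for illustration only.
print(canonical_tool_name("gke-mcp__list_clusters"))
print(canonical_tool_name("kubectl_get"))
```

With both sides normalized this way, an expected tool written without the server prefix compares equal to the prefixed name the MCP client reports.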
…allback Resolve the target Service's external LB IP and rewrite the action URL to http://<ip>:8080, falling back to port-forward when resolution times out or is skipped in smoke tests; wait for rollout before forwarding. Expose optimize-scale via LoadBalancer so load reaches it at 300+ qps.
Resolve google-vertex/google_vertex to the gemini adapter, and pass the run-scoped KUBECONFIG to MCP servers so they use run credentials.
Fix HPA/computeclass API versions and spec paths, clarify generate-only prompts, convert create/modify-deployment to the noop deployer, and reframe gpu-stress-test-diagnosis as post-incident log analysis.
…et-app-architecture Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS bucket, Workload Identity KSA, frontend) with seed_mode variants, and point the three tasks at it on e2-standard-4.
Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI task (18TB capacity, L4 GPU nodes) for model-serving storage.
…r on outcome Accept either failover or direct primary recovery (service restored = 2xx) and clarify kind's control-plane upgrade path.
…rdown deploy-hello-app creates hello-app-<cluster> in project-global Artifact Registry, which cluster teardown never removes; add a destroy-time null_resource to delete it so it doesn't leak across runs.
… timeout Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200, and strip macOS AppleDouble files during sync.
… stale-state wipe Document kind toolchain/cleanup needs and expanded matrix failure modes, and strengthen stale per-run state guidance to wipe all state before every run.
The tar `--exclude='results'` matched any path component named 'results', so it stripped the `devops_bench/results/` SOURCE module (the rows.json / manifest.json builder) from every sync. The harness's `build_rows` import then failed inside its per-metric/best-effort try-except, so runs silently produced no rows.json/manifest.json (leaderboard rows). The eval-output `results/` dir is already excluded by not being in the synced path allowlist.
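The failure mode can be illustrated in Python by contrasting an "any component" match with a top-level-only match. Both predicates are stand-ins written for this illustration (they are not tar's matcher), and the file paths are hypothetical.

```python
from pathlib import PurePosixPath


def excluded_unanchored(path: str, name: str = "results") -> bool:
    """Mimic the reported behaviour: any path component named `name` matches."""
    return name in PurePosixPath(path).parts


def excluded_anchored(path: str, name: str = "results") -> bool:
    """Match only when `name` is the first component of the path."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] == name


# Hypothetical paths: a source module and an eval-output file.
for p in ("devops_bench/results/__init__.py", "results/run_1/rows.json"):
    print(p, excluded_unanchored(p), excluded_anchored(p))
```

The unanchored form drops the source module along with the output directory, which is the silent loss described above.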
The parallel matrix runs one task per process, pointing the loader at a single <task-dir>/task.yaml. _load_single_file used the file stem -- the literal "task" -- for both folder and the name fallback, while only the directory loader used the containing dir name. Every emitted leaderboard row therefore carried taskFolder="task", and the dashboard's derive() (which groups tasks by taskFolder) collapsed a whole setup's tasks into a single task. Fix: for a single spec named task.yaml, derive folder and name from the parent directory, mirroring the directory loader; keep the stem fallback for arbitrarily-named single specs.
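The fix amounts to something like the sketch below. `derive_folder_and_name` and `declared_name` are hypothetical names; the real `_load_single_file` may structure this differently.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def derive_folder_and_name(
    spec_path: str, declared_name: Optional[str] = None
) -> Tuple[str, str]:
    """Return (folder, name) for a single task spec file.

    A spec named task.yaml takes its folder from the containing directory,
    mirroring the directory loader; any other file keeps the stem fallback.
    """
    path = PurePosixPath(spec_path)
    if path.name == "task.yaml" and path.parent.name:
        folder = path.parent.name
    else:
        folder = path.stem
    # The name falls back to the folder only when the spec declares none.
    return folder, declared_name or folder
```

For example, `derive_folder_and_name("tasks/common/optimize-scale/task.yaml")` yields `optimize-scale` for both fields rather than the literal `task`.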
The parallel matrix runs one task per process, emitting one rows.json per task, each with a unique runId and its own timestamp t. The dashboard models a run as a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and derive.mjs groups runs by t and shows only the latest run's tasks -- so the per-task files render as many single-task runs and a setup surfaces just one task. Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix so concurrent or repeated matrix runs never collide on the setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried tasks (latest t wins), and write a combined rows.json + per-setup manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals skill.
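A rough sketch of the aggregation, under stated assumptions: rows are dicts carrying `setupId`, `taskFolder`, `iteration`, and a numeric `t`; the de-dup key and the exact timestamp format are guesses made for illustration. The real `aggregate.py` may differ on all of these.

```python
import os
import time


def aggregate_rows(rows, now=None, pid=None):
    """Combine per-task rows into one batch run (illustrative only)."""
    now = time.time() if now is None else now
    pid = os.getpid() if pid is None else pid
    # One shared run id; the pid suffix keeps concurrent or repeated
    # matrix runs from colliding on the same doc id.
    run_id = time.strftime("run_%Y%m%d_%H%M%S", time.gmtime(now)) + f"_{pid}"
    shared_t = int(now)

    latest = {}
    for row in rows:
        # Assumed de-dup key: one row per setup/task/iteration.
        key = (row["setupId"], row["taskFolder"], row.get("iteration", 0))
        if key not in latest or row["t"] > latest[key]["t"]:
            latest[key] = row  # retried task: latest t wins

    # Stamp every surviving row with the shared runId and t.
    return [{**row, "runId": run_id, "t": shared_t} for row in latest.values()]
```

De-duplication runs before stamping because the original per-task `t` is what decides which retry is the latest.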
load.mjs rejected every runId produced by an isolated/parallel run: the producer makes run ids unique per process by appending a suffix (pid or matrix id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs failed validation and could not be ingested. The timestamp alone is not unique, so the suffix is required to keep the setupId__runId__taskFolder__iteration doc id distinct across parallel runs. Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL contract, and cover the suffixed form in load.test.mjs.
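For illustration, a Python rendering of the loosened pattern. The real regex lives in `load.mjs`, and the suffix character class used here (`[A-Za-z0-9-]+`) is an assumption.

```python
import re

# Bare run_YYYYMMDD_HHMMSS, optionally followed by _<suffix>.
# The allowed suffix characters are assumed, not taken from load.mjs.
RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9-]+)?")


def is_valid_run_id(run_id: str) -> bool:
    """True if run_id is the bare or the suffixed form."""
    return RUN_ID_RE.fullmatch(run_id) is not None
```

Keeping the suffix optional means previously ingested bare run ids still validate.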
Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and synthesized placeholder metadata (provider "Unknown", default logo) because the model was absent from the catalog, so the leaderboard showed a generic entry. Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a matching "gemini" brand glyph so the leaderboard renders a logo.
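A sketch of substring-based catalog lookup. The catalog shape, the key `gemini-3.1-pro`, the field names, and `lookup_model` are all assumptions for illustration; only the preview id and the Google/Proprietary labels come from the description above.

```python
from typing import Optional

# Hypothetical catalog: key is a substring expected in every variant id.
CATALOG = {
    "gemini-3.1-pro": {
        "name": "Gemini 3.1 Pro",
        "provider": "Google",
        "license": "Proprietary",
        "logo": "gemini",
    },
}


def lookup_model(model_id: str) -> Optional[dict]:
    """Return the curated entry whose key occurs in model_id, else None."""
    lowered = model_id.lower()
    for key, entry in CATALOG.items():
        if key in lowered:
            return entry
    return None
```

A `None` result corresponds to the "unknown model" path that previously synthesized placeholder metadata.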
complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a combined load (or any cross-tree id check) saw duplicate task ids -- the same collision class flagged for migration-and-upgrade vs lustre-csi. Renumber the four colliding complextasks into the free 17-20 range so the trees no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20 block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19, opa-remediation 5->20). Tasks are selected by path, not numeric id, so this only affects loader ordering/de-dup.
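A cross-tree uniqueness check could be sketched as below. The function name and the `(task_id, path)` input shape are hypothetical, and the paths in the usage line are placeholders.

```python
from collections import defaultdict


def find_duplicate_ids(tasks):
    """Map each task id used more than once to the paths that claim it.

    `tasks` is an iterable of (task_id, path) pairs.
    """
    seen = defaultdict(list)
    for task_id, path in tasks:
        seen[task_id].append(path)
    return {tid: paths for tid, paths in seen.items() if len(paths) > 1}


# Placeholder paths: id 1 claimed in both trees before the renumbering.
print(find_duplicate_ids([(1, "tasks/a"), (1, "complextasks/optimize-scale"), (2, "tasks/b")]))
```

After the renumbering, the same check over both trees should return an empty dict.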
…/noop/common Merge the separate complextasks/ tree and tasks/generic/ into a single tasks/ root (per docs/migration/directory-structure.md) and group every task by its ACTUAL provider dependency, not by which tf stack happens to exist:
- gcp/ (8): hard GCP dependency -- Cloud Secret Manager, Cloud SQL, Managed Lustre, Hypercomputer/Workload Identity, GKE GPUs, multi-region GKE, or a mandatory Artifact Registry build/publish. Cannot run on kind.
- kind/ (1): cp-recovery -- control-plane/etcd surgery, only meaningful on a self-managed cluster (its own header marks it the kind-only exception).
- common/ (3): optimize-scale, opa-remediation, migration-and-upgrade -- generic k8s scenarios runnable on both gcp and kind.
- noop/ (8): deployer:noop generate-only tasks.
Classification is by what the task needs: optimize-scale (no GCP dep) -> common, while secret-rotation (Cloud Secret Manager) and deploy-hello-app (mandatory Artifact Registry) remain gcp.
Moves are pure git renames; stack refs are repo-root-relative and unaffected.
All task-path references updated: the matrix enumeration collapses to a single tasks/ root, plus scripts, Dockerfile, skills, docs, and tests.
Summary
Merge the separate `complextasks/` tree (and `tasks/generic/`) into a single `tasks/` root and group every task by how it runs. Implements the merge mandated by `docs/migration/directory-structure.md`, plus a `gcp/kind/noop/common` sub-taxonomy.

Classification = a task's actual provider dependency (not which tf stack exists):
- `gcp/` (8): hard GCP dependency; cannot run on kind.
- `kind/` (1): `cp-recovery`, only meaningful on a self-managed cluster.
- `common/` (3): generic k8s scenarios runnable on both gcp and kind.
- `noop/` (8): `deployer: noop` generate-only tasks.

Notable calls:
- `optimize-scale` has no GCP dependency → `common`.
- `secret-rotation` (Cloud Secret Manager) and `deploy-hello-app` (mandatory Artifact Registry build/publish) stay `gcp`.
- `cp-recovery` does etcd/control-plane surgery that is impossible on managed GKE → `kind`.

All moves are pure git renames (history preserved); stack references are repo-root-relative, so they're unaffected. Task-path references are updated across the matrix lib (two-root enumeration → single `tasks/`), bastion scripts, Dockerfile, README, skills, docs, and tests.

Stacking
Branched on top of #141 (which renumbers task ids and renames lustre-csi), since that PR edits the same task files. Review/merge #141 first; the diff unique to this PR is the single `refactor(tasks): …` commit.

Testing
- `pytest tests/unit`: 744 passed.
- `find tasks -name task.yaml | wc -l` → 20; loader loads 20; task ids remain globally unique.