refactor(tasks): merge complextasks into tasks/ and group by gcp/kind/noop/common #146

Open
pradeepvrd wants to merge 25 commits into gke-labs:main from pradeepvrd:task-reorg

Conversation

@pradeepvrd

Collaborator

Summary

Merge the separate complextasks/ tree (and tasks/generic/) into a single tasks/ root and group every task by how it runs. Implements the merge mandated by docs/migration/directory-structure.md, plus a gcp/kind/noop/common sub-taxonomy.

Classification = a task's actual provider dependency (not which tf stack exists):

| Bucket | Count | Criterion | Tasks |
|---|---|---|---|
| gcp/ | 8 | hard GCP dependency | deploy-config, deploy-hello-app, fix-config, get-app-architecture, gpu-stress-test-diagnosis, lustre-csi-deployment, multi-region-failover, secret-rotation |
| kind/ | 1 | self-managed control plane only | cp-recovery |
| common/ | 3 | generic k8s, runs on both | optimize-scale, opa-remediation, migration-and-upgrade |
| noop/ | 8 | deployer: noop generate-only | computeclass-active-migration, computeclass-spot-fallback, create-deployment, gateway-cloud-armor, gateway-https-redirect, hpa-metric-filtering, hpa-renamed-metric, modify-deployment |

Notable calls: optimize-scale has no GCP dependency → common; secret-rotation (Cloud Secret Manager) and deploy-hello-app (mandatory Artifact Registry build/publish) stay gcp; cp-recovery does etcd/control-plane surgery impossible on managed GKE → kind.
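
The classification rule can be read as a small decision function. This is only an illustrative sketch: the field names (`deployer`, `needs_gcp`, `needs_self_managed_control_plane`), the `"tofu"` default, and the order of the checks are assumptions, not part of the task schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTraits:
    """Hypothetical summary of what a task needs in order to run."""
    deployer: str = "tofu"                          # assumed default; only "noop" is named in this PR
    needs_gcp: bool = False                         # e.g. Cloud Secret Manager, Artifact Registry
    needs_self_managed_control_plane: bool = False  # e.g. etcd/control-plane surgery


def bucket_for(traits: TaskTraits) -> str:
    """Return the tasks/ sub-directory a task would be grouped under."""
    if traits.deployer == "noop":
        return "noop"    # generate-only, nothing is deployed
    if traits.needs_gcp:
        return "gcp"     # hard GCP dependency, cannot run on kind
    if traits.needs_self_managed_control_plane:
        return "kind"    # not possible on managed GKE
    return "common"      # generic k8s, runs on both
```

Under these assumptions, optimize-scale (no flags set) lands in `common`, while secret-rotation (`needs_gcp=True`) stays in `gcp`.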

All moves are pure git renames (history preserved); stack references are repo-root-relative so they're unaffected. Task-path references updated across the matrix lib (two-root enumeration → single tasks/), bastion scripts, Dockerfile, README, skills, docs, and tests.

Stacking

Branched on top of #141 (which renumbers task ids and renames lustre-csi), since that PR edits the same task files. Review/merge #141 first; the diff unique to this PR is the single refactor(tasks): … commit.

Testing

  • pytest tests/unit — 744 passed.
  • find tasks -name task.yaml | wc -l → 20; loader loads 20; task ids remain globally unique.

jessie1111101 and others added 25 commits June 28, 2026 12:45
…fu stack-dir)

Add a noop deployer for manifest-generation tasks and copy each run's
OpenTofu stack into a per-run working dir so concurrent matrix runs no
longer collide on shared tofu state.
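
A minimal sketch of the per-run copy idea, assuming a layout of `<runs_root>/<run_id>/<stack name>`. The function name, parameters, and the choice to skip local state files are assumptions for illustration, not the harness's actual code.

```python
import shutil
from pathlib import Path


def prepare_run_stack(stack_dir: Path, runs_root: Path, run_id: str) -> Path:
    """Copy a stack into runs_root/<run_id>/<stack name> and return the copy."""
    dest = runs_root / run_id / stack_dir.name
    # Assumption: local state and provider caches should not follow the copy,
    # so each run starts from a clean working dir.
    ignore = shutil.ignore_patterns(".terraform", "terraform.tfstate*")
    shutil.copytree(stack_dir, dest, ignore=ignore)  # creates parent dirs as needed
    return dest
```

Because every run gets its own directory, two concurrent runs of the same stack never read or write the same state file.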
Derive run-unique GitOps repo paths and cluster names (multi-region e-/w-
prefix to dodge the node-SA substr collision), declare the minimum-stack
namespace var, and document the kind-task parallel model.

Manifest-only tasks skip cluster provisioning (no shared cluster to
collide on); deploy-hello-app uses a run-unique Artifact Registry repo.

Add task_extra_env so both eval arms agree on fixture namespace and
target deployment for pre-seeded tasks.

Serve on port 8080 with a CPU-burn workload so the HPA actually scales,
install fortio on the bastion, and pin the fixture's deployment/namespace.

Retry the policy apply until the Kyverno admission webhook is serving,
so fixture setup no longer flakes with context-deadline-exceeded.
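
The retry described above might look roughly like the following. This is a sketch under stated assumptions: `apply` stands in for whatever applies the policy, failures are modeled as `RuntimeError`, and the attempt count and delay are made-up defaults.

```python
import time
from typing import Callable


def retry_until_ready(apply: Callable[[], None], attempts: int = 10,
                      delay_sec: float = 3.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call apply() until it stops raising; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            apply()
            return attempt
        except RuntimeError:
            if attempt == attempts:
                raise          # out of retries: surface the last error
            sleep(delay_sec)   # give the admission webhook time to start serving
    raise ValueError("attempts must be >= 1")
```

Injecting `sleep` keeps the helper testable without real waiting.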
Document clearing stale per-run state before reruns and the cluster-
mutation risk when gke-mcp exposes unrelated clusters.

Add a validated field to the task schema and result row; only vetted
tasks promote to the leaderboard. Plumb it through results normalization
and the site schema/seed data.

Thread a generation_only flag so manifest-only (deployer: noop) tasks
aren't penalized by the OutcomeValidity judge for not applying to a
cluster, and emit generation_only/validated on result records. Strip the
MCP server prefix (server__tool) so expected-tool matching is canonical.
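
Stripping the `server__tool` prefix can be sketched in a few lines. The function name and the example tool names are hypothetical, and the sketch assumes the first `__` always separates server from tool.

```python
def canonical_tool_name(name: str) -> str:
    """Drop a leading `server__` prefix, if present; otherwise return name unchanged."""
    _, sep, tool = name.partition("__")
    return tool if sep and tool else name
```

With this, `gke-mcp__list_clusters` and a bare `list_clusters` compare equal when matching against a task's expected tools.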
…allback

Resolve the target Service's external LB IP and rewrite the action URL to
http://<ip>:8080, falling back to port-forward when resolution times out
or is skipped in smoke tests; wait for rollout before forwarding. Expose
optimize-scale via LoadBalancer so load reaches it at 300+ qps.
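
A rough sketch of the URL rewrite with its fallback, assuming the caller has already tried to resolve the LoadBalancer IP and passes `None` on timeout or in smoke tests. The function name and signature are illustrative only.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def rewrite_action_url(url: str, lb_ip: Optional[str], port: int = 8080) -> str:
    """Point url at http://<lb_ip>:<port>, keeping path, query, and fragment.

    Returns url unchanged when no IP was resolved, so the caller can fall
    back to a port-forward.
    """
    if not lb_ip:
        return url
    parts = urlsplit(url)
    return urlunsplit(("http", f"{lb_ip}:{port}", parts.path, parts.query, parts.fragment))
```

Keeping the path and query intact means only the host and port of the action change.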
Resolve google-vertex/google_vertex to the gemini adapter, and pass the
run-scoped KUBECONFIG to MCP servers so they use run credentials.

Fix HPA/computeclass API versions and spec paths, clarify generate-only
prompts, convert create/modify-deployment to the noop deployer, and
reframe gpu-stress-test-diagnosis as post-incident log analysis.
…et-app-architecture

Introduce the hypercomputer-d1 prebuilt stack (vLLM backend seed, GCS
bucket, Workload Identity KSA, frontend) with seed_mode variants, and
point the three tasks at it on e2-standard-4.

Swap the deprecated Parallelstore CSI task/stack for a Managed Lustre CSI
task (18TB capacity, L4 GPU nodes) for model-serving storage.
…r on outcome

Accept either failover or direct primary recovery (service restored = 2xx)
and clarify kind's control-plane upgrade path.
…rdown

deploy-hello-app creates hello-app-<cluster> in project-global Artifact
Registry, which cluster teardown never removes; add a destroy-time
null_resource to delete it so it doesn't leak across runs.
… timeout

Point SKILLS_PATHS at the cloned gke-mcp operational skills (not judge
rubrics), clone that repo in vm-setup, raise AGENT_TIMEOUT_SEC to 1200,
and strip macOS AppleDouble files during sync.
… stale-state wipe

Document kind toolchain/cleanup needs and expanded matrix failure modes,
and strengthen stale per-run state guidance to wipe all state before
every run.

The tar `--exclude='results'` matched any path component named 'results',
so it stripped the `devops_bench/results/` SOURCE module (the rows.json /
manifest.json builder) from every sync. The harness's `build_rows` import
then failed inside its per-metric/best-effort try-except, so runs silently
produced no rows.json/manifest.json (leaderboard rows). The eval-output
`results/` dir is already excluded by not being in the synced path allowlist.
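
To make the failure mode concrete, the two predicates below contrast "any component matches" with "only the top-level component matches". They model the matching behavior in plain Python and are not the sync script itself. The anchored variant is shown for comparison only, since this commit relies on the path allowlist instead. The file name `build.py` is made up.

```python
from pathlib import PurePosixPath


def excluded_unanchored(member: str, name: str = "results") -> bool:
    """Mimic the buggy behavior: drop a path if ANY component equals name."""
    return name in PurePosixPath(member).parts


def excluded_anchored(member: str, name: str = "results") -> bool:
    """Drop a path only if its FIRST component equals name."""
    parts = PurePosixPath(member).parts
    return bool(parts) and parts[0] == name
```

The unanchored predicate drops `devops_bench/results/build.py` along with the top-level eval output, which is the collateral damage described above.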
The parallel matrix runs one task per process, pointing the loader at a single
<task-dir>/task.yaml. _load_single_file used the file stem -- the literal
"task" -- for both folder and the name fallback, while only the directory
loader used the containing dir name. Every emitted leaderboard row therefore
carried taskFolder="task", and the dashboard's derive() (which groups tasks by
taskFolder) collapsed a whole setup's tasks into a single task.

Fix: for a single spec named task.yaml, derive folder and name from the parent
directory, mirroring the directory loader; keep the stem fallback for
arbitrarily-named single specs.
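
The fix can be sketched as follows. The function name is hypothetical and the real `_load_single_file` does more, so this isolates only the folder/name derivation described above.

```python
from pathlib import Path
from typing import Tuple


def derive_folder_and_name(spec_path: Path) -> Tuple[str, str]:
    """Return (folder, fallback name) for a single task spec file."""
    if spec_path.name == "task.yaml":
        # Mirror the directory loader: the containing dir identifies the task.
        folder = spec_path.parent.name
    else:
        # Arbitrarily-named single specs keep the file-stem fallback.
        folder = spec_path.stem
    return folder, folder
```

For `tasks/common/optimize-scale/task.yaml` this yields `optimize-scale` rather than the literal `task`.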
The parallel matrix runs one task per process, emitting one rows.json per task,
each with a unique runId and its own timestamp t. The dashboard models a run as
a batch of tasks sharing one runId/t (tasks distinguished by taskFolder), and
derive.mjs groups runs by t and shows only the latest run's tasks -- so the
per-task files render as many single-task runs and a setup surfaces just one
task.

Add devops_bench/results/aggregate.py (+ CLI) to combine the per-task rows into
one batch run: stamp a single shared runId (with a unique run_<ts>_<pid> suffix
so concurrent or repeated matrix runs never collide on the
setupId__runId__taskFolder__iteration doc id) and one shared t, de-dupe retried
tasks (latest t wins), and write a combined rows.json + per-setup
manifests.json. Wire a Phase-6 aggregation step into the run-parallel-evals
skill.
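
A condensed sketch of the aggregation step, not the actual `aggregate.py`. It assumes rows are dicts carrying `setupId`, `taskFolder`, `iteration`, `runId`, and `t`, that `t` values sort chronologically (for example ISO 8601 strings), that retries share the same `(setupId, taskFolder, iteration)` key, and that the batch takes the latest surviving `t`.

```python
import os
import time
from typing import Dict, Iterable, List, Optional, Tuple


def aggregate_rows(rows: Iterable[dict], run_id: Optional[str] = None,
                   t: Optional[str] = None) -> List[dict]:
    """Merge per-task rows into one batch sharing a single runId and t."""
    # De-dupe retried tasks: for each key keep the row with the latest t.
    latest: Dict[Tuple, dict] = {}
    for row in rows:
        key = (row["setupId"], row["taskFolder"], row.get("iteration", 0))
        if key not in latest or row["t"] > latest[key]["t"]:
            latest[key] = row
    kept = list(latest.values())
    if not kept:
        return []
    # One shared timestamp and a run id made unique by a pid suffix.
    shared_t = t or max(r["t"] for r in kept)
    shared_id = run_id or f"run_{time.strftime('%Y%m%d_%H%M%S')}_{os.getpid()}"
    return [{**r, "runId": shared_id, "t": shared_t} for r in kept]
```

After this step every row in the batch carries the same `runId` and `t`, so a grouping by `t` sees one run with many tasks.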
load.mjs rejected every runId produced by an isolated/parallel run: the
producer makes run ids unique per process by appending a suffix (pid or matrix
id) to run_<ts>, but RUN_ID_RE required a bare run_YYYYMMDD_HHMMSS, so real runs
failed validation and could not be ingested. The timestamp alone is not unique,
so the suffix is required to keep the setupId__runId__taskFolder__iteration doc
id distinct across parallel runs.

Fix: loosen RUN_ID_RE to allow an optional _<suffix>, update the PROTOCOL
contract, and cover the suffixed form in load.test.mjs.
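
The loosened pattern can be illustrated with Python's `re` module (the real check lives in `load.mjs`). The suffix alphabet below is an assumption, since the PR only says the suffix is a pid or matrix id.

```python
import re

# Before: a bare run_YYYYMMDD_HHMMSS only.
STRICT_RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}")
# After: the same stem plus an optional _<suffix> (assumed: letters, digits, hyphens).
RUN_ID_RE = re.compile(r"run_\d{8}_\d{6}(?:_[A-Za-z0-9-]+)?")


def is_valid_run_id(run_id: str) -> bool:
    """True when the whole string is a bare or suffixed run id."""
    return RUN_ID_RE.fullmatch(run_id) is not None
```

Using `fullmatch` keeps the check anchored at both ends without relying on `^` and `$`.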
Ingesting gemini-3.1-pro-preview emitted an "unknown model" warning and
synthesized placeholder metadata (provider "Unknown", default logo) because the
model was absent from the catalog, so the leaderboard showed a generic entry.

Map gemini-3.1-pro-preview -- plus the stable id and versioned variants via
substring -- to a curated Gemini 3.1 Pro entry (Google, Proprietary), and add a
matching "gemini" brand glyph so the leaderboard renders a logo.
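
A toy version of the substring lookup is shown below. The catalog shape, the key `gemini-3.1-pro`, and the placeholder values are assumptions for illustration; the site's real catalog format is not shown in this PR.

```python
# Curated entries keyed by a substring of the model id (shape is assumed).
CATALOG = {
    "gemini-3.1-pro": {"name": "Gemini 3.1 Pro", "provider": "Google",
                       "license": "Proprietary", "glyph": "gemini"},
}


def lookup_model(model_id: str) -> dict:
    """Return the curated entry whose key occurs in model_id, else a placeholder."""
    lowered = model_id.lower()
    for needle, entry in CATALOG.items():
        if needle in lowered:
            return entry
    # Unknown models get synthesized placeholder metadata.
    return {"name": model_id, "provider": "Unknown", "license": "Unknown", "glyph": "default"}
```

Substring matching lets preview and versioned ids resolve to one curated entry.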
complextasks/ reused task ids 1, 2, 3, and 5 that tasks/ already assigned, so a
combined load (or any cross-tree id check) saw duplicate task ids -- the same
collision class flagged for migration-and-upgrade vs lustre-csi.

Renumber the four colliding complextasks into the free 17-20 range so the trees
no longer overlap: tasks/ keeps 1-14, complextasks/ becomes a contiguous 15-20
block (optimize-scale 1->17, secret-rotation 2->18, cp-recovery 3->19,
opa-remediation 5->20). Tasks are selected by path, not numeric id, so this
only affects loader ordering/de-dup.
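
A cross-tree id check of the kind mentioned above could be sketched like this. The helper name is hypothetical, and `tasks/some-task` in the usage is a placeholder because the PR does not say which tasks/ entries held ids 1, 2, 3, and 5.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_ids(task_ids: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each task id used more than once to the task paths that claim it."""
    by_id: Dict[int, List[str]] = defaultdict(list)
    for path, task_id in task_ids.items():
        by_id[task_id].append(path)
    return {i: sorted(paths) for i, paths in by_id.items() if len(paths) > 1}
```

Before the renumbering, `{"tasks/some-task": 1, "complextasks/optimize-scale": 1}` reports id 1 as shared. After optimize-scale moves to 17 the result is empty.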
…/noop/common

Merge the separate complextasks/ tree and tasks/generic/ into a single tasks/
root (per docs/migration/directory-structure.md) and group every task by its
ACTUAL provider dependency, not by which tf stack happens to exist:

- gcp/ (8): hard GCP dependency -- Cloud Secret Manager, Cloud SQL, Managed
  Lustre, Hypercomputer/Workload Identity, GKE GPUs, multi-region GKE, or a
  mandatory Artifact Registry build/publish. Cannot run on kind.
- kind/ (1): cp-recovery -- control-plane/etcd surgery, only meaningful on a
  self-managed cluster (its own header marks it the kind-only exception).
- common/ (3): optimize-scale, opa-remediation, migration-and-upgrade -- generic
  k8s scenarios runnable on both gcp and kind.
- noop/ (8): deployer:noop generate-only tasks.

Classification is by what the task needs: optimize-scale (no GCP dep) -> common,
while secret-rotation (Cloud Secret Manager) and deploy-hello-app (mandatory
Artifact Registry) remain gcp. Moves are pure git renames; stack refs are
repo-root-relative and unaffected. All task-path references updated: the matrix
enumeration collapses to a single tasks/ root, plus scripts, Dockerfile, skills,
docs, and tests.

2 participants