feat(tasks): make all benchmark tasks parallel-run safe by jessie1111101 · Pull Request #133 · gke-labs/devops-bench

jessie1111101 · 2026-06-25T19:44:07Z

Stacked on #132 (base: skills/agent-skills).

Makes every benchmark task safe to run concurrently in the Task×Model×AgentConfig
matrix. Each combo provisions its own cluster, so this ensures no cross-run
collisions on shared/global or host-level resources.

Changes

- 6 manifest-gen tasks → deployer: noop (no cluster); legacy factory now honors noop
- optimize-scale → new prebuilt/optimize-scale GKE stack + pre-seeded workload;
  matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
- deploy-hello-app → run-unique Artifact Registry repo name
- per-run tofu stack-dir copy (both arms) → removes the shared .terraform.lock.hcl
  race (resolves the 'Shared OpenTofu working directory' known issue)
- imported + parallel-fixed the merged complex/GKE tasks (migration, opa,
  multi-region, postgres, unhealthy-pod, gitops, debug-crashloop): per-run GitOps
  repo paths, dropped shared-SA container.admin (BYO creds), region-prefixed
  cluster names (avoid node-SA substr collision)
- debug-crashloop → new kind stack that applies the broken fixture (was unsolvable
  on bare prebuilt/kind)
- namespace pinned per pre-seeded task so the prompt's {{NAMESPACE}} matches the
  fixture's namespace on both arms
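
To illustrate the per-run stack-dir copy in the list above, here is a minimal Python sketch, not the harness's actual code. The function name `copy_stack_for_run`, the `run_id` argument, and the choice to skip an existing `.terraform/` directory are assumptions made for illustration. The point is only that each run works in its own copy of the stack, so concurrent runs no longer write the same `.terraform.lock.hcl`.

```python
import shutil
from pathlib import Path


def copy_stack_for_run(stack_dir: Path, run_id: str, work_root: Path) -> Path:
    """Copy a stack directory to a run-private location and return the copy.

    Hypothetical helper: names and layout are illustrative only.
    """
    dest = work_root / f"{stack_dir.name}-{run_id}"
    # Assumption: provider caches under .terraform/ are not carried over, so
    # each run initializes its own copy. The lock file itself is copied, which
    # gives every run a private file to read and rewrite.
    ignore = shutil.ignore_patterns(".terraform")
    shutil.copytree(stack_dir, dest, ignore=ignore)
    return dest
```

Two runs that call this with different `run_id` values get disjoint directories, so any tool that rewrites the lock file only touches that run's copy.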

Addressed from /devops-bench-review

namespace divergence, debug-crashloop fixture, gitops repo path → ~, migration
scratch-cluster naming, doc fixes.

Known follow-ups (out of scope)

- systemic duplicate task_ids (complextasks vs tasks/gcp): only migration renumbered
- direct (non-matrix) runs: namespace pins live in the matrix helper; a harness-level
  NAMESPACE export would also cover direct runs

google-cla · 2026-06-25T19:44:18Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Add the bastion-side matrix orchestration for running parallel evaluations:
- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split into refactored + legacy wrappers; hardened against SSH drops; per-stamp remote runner with pre-created output dirs.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.

…pacity) Concurrent GKE tasks managing the shared VM-SA's project IAM bindings clobber each other on teardown; kind host capacity (disk/inotify/clusters) sums across concurrent runs.

…n runner _matrix_lib.sh now defaults to LOCAL execution (nohup on this host, no ssh/sync, results in ~/matrix-runs/<stamp>); set BENCH_REMOTE=1 for the previous behavior (sync-to-bastion + remote nohup + pull). Gated via a host_exec helper and BENCH_REMOTE branches on launch/poll/pull/resume/sync; runner cd's to REPO_ROOT locally and tolerates a missing .venv/secrets.env. Wrapper headers + docs/bastion.md note the local default.
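
The local-versus-remote gating described in this commit lives in shell (`_matrix_lib.sh`). As a rough illustration of the same branching, here is a hedged Python sketch. `host_exec_argv` and the `BASTION_HOST` variable name are hypothetical; only `BENCH_REMOTE=1` as the opt-in switch comes from the commit message.

```python
import os
import shlex
from typing import List, Mapping, Optional


def host_exec_argv(command: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the argv that runs `command` locally or on the bastion.

    Hypothetical helper mirroring the idea of a host_exec gate.
    """
    env = os.environ if env is None else env
    if env.get("BENCH_REMOTE") == "1":
        # Assumption: the bastion address is supplied in BASTION_HOST.
        host = env["BASTION_HOST"]
        # ssh hands the remaining argument to the remote shell, so the
        # command is quoted once for that shell.
        return ["ssh", host, "bash -lc " + shlex.quote(command)]
    # Default: run on this host, with no ssh and no sync step.
    return ["bash", "-c", command]
```

For example, `host_exec_argv("ls ~/matrix-runs", {})` yields a plain local `bash -c` invocation, while setting `BENCH_REMOTE=1` wraps the same command in `ssh`.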

Collect agent skills in one place so more skills / agent guidelines can be added independently of feature PRs.
- devops-bench-review: comprehensive PR/workspace review (correctness, parallel-safety across the eval matrix axes, task/stack + docs conventions).
- run-parallel-evals: moved here from the parallel-eval feature branch. Its referenced docs (docs/parallel-evals.md, docs/bastion.md) and scripts/bastion/* land with the parallel-eval PR; the skill is dormant until that merges.

Mirror the .agents/skills/<name> sources into .claude/skills/<name> so Claude Code discovers them (same symlink pattern the skill source uses).

Add an explicit Scope & guardrails section: the skill analyzes statically and presents findings; it may run unit tests / ruff lint + format checks, but must NOT run benchmark evals, the matrix, or any infra provisioning/teardown. Parallel hazards are found by reading and reasoning, not by launching concurrent runs.

…ure modes Add a Phase-1 pre-flight that screens the selected matrix tasks against the known per-run isolation gaps (shared $HOME repos, multi-cluster node-SA collapse, shared VM-SA IAM clobber, task_id collisions, host-capacity sum), and a Phase-6 step (+ guardrail) making it part of the run to append any new failure mode to the docs/parallel-evals.md / docs/bastion.md appendices.
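
One of the pre-flight screens named above, task_id collisions, is easy to sketch. The following is an illustration under assumptions, not the skill's implementation: the function name and the `(task_id, path)` input shape are hypothetical, and the other isolation gaps (shared $HOME repos, node-SA collapse, IAM clobber, host capacity) are not covered here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_task_id_collisions(tasks: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return {task_id: [paths]} for every task_id used by more than one task.

    `tasks` is an iterable of (task_id, path) pairs; the shape is assumed.
    """
    by_id: Dict[str, List[str]] = defaultdict(list)
    for task_id, path in tasks:
        by_id[task_id].append(path)
    # Keep only ids that map to two or more task directories.
    return {tid: paths for tid, paths in by_id.items() if len(paths) > 1}
```

A non-empty result would be a reason to stop before launch, since two selected tasks sharing an id could write to the same per-task output location.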

…essive disclosure
- SKILL.md becomes a lean router: a Modes section points to references/ loaded on demand; adds a background-classifier keepalive guardrail.
- references/resilient-monitoring.md: low-tier monitor + mid-tier analyzer subagents, supervisor loop, API-error recovery (re-spawn on durable RESUME_STAMP), and keepalive so a long detached run isn't classified as finished mid-run.
- references/unlimited-mode.md (opt-in): classify flake vs model vs bug; diagnose -> fix in a worktree -> re-sync -> restart failed combos -> repeat, with attempt caps, scoped local commits, self-review, and budget/checkpoint guardrails.

Describe required capabilities (sub-tasks, model tiers, background run, timer/wake, durable state, worktree, keepalive) instead of Claude Code tool names; add a 'Harness portability' map in SKILL.md and name Claude Code/Antigravity/Codex equivalents as examples with inline fallbacks (a bare shell that can ssh the bastion suffices). Generalize devops-bench-review's CLAUDE.md -> agent-instruction files and the verifier sub-agent -> harness-agnostic pass.

…y map Replace the 2-way (Claude Code / other) portability table with a 4-column one that names Antigravity's real primitives — dynamic subagents + 'agy -p --model', the Gemini 3.5 Flash / 3.1 Pro tiers via /models|--model, Background Agents, Scheduled tasks, Artifacts + Knowledge base, project/workspace scoping, request-review approval, and quota-as-governor — alongside Claude Code and a generic fallback.

… docs) Map the Antigravity column to the real snake_case tools from antigravity.google/docs/hooks (run_command (+ RunPersistent), the programmatic subagent tool + browser_subagent, view_file/list_dir/grep_search/codebase_search/view_code_item, write_to_file/replace_file_content/multi_replace_file_content) and add a footnote listing them with the PreToolUse/PostToolUse/PreInvocation/PostInvocation/Stop hook events.

Replace docs-scraped names with the actual agent tool set reported by a running Antigravity instance: invoke_subagent/define_subagent/manage_subagents/send_message (sub-tasks), manage_task (background), schedule (timer/cron), ask_question/ask_permission (user), view_file/list_dir/grep_search + write_to_file/replace_file_content/multi_replace_file_content (files), run_command, search_web/read_url_content. Drops the incorrect codebase_search/view_code_item/browser_subagent/RunPersistent that secondary sources had claimed.

…n mode
- Collapse the portability table to one primitive per cell (invoke_subagent, schedule, …); add a 'harness chains the rest' note; full inventory stays in the footnote.
- Add an Execution mode section: LOCAL is the default (run on this host, no ssh/sync); BENCH_REMOTE=1 (+ BASTION_*) opts into the bastion ssh runner. Phase 1 now always asks local-or-remote; command snippets are the remote form with a 'drop the ssh wrapper locally' convention; references' monitor + unlimited re-sync are mode-aware.

Stacked on #132 (skills/agent-skills). Each matrix combo provisions its own cluster; this makes every task collision-free under concurrent runs:
- 6 manifest-gen tasks -> deployer: noop (no cluster); legacy factory honors noop
- optimize-scale: new prebuilt/optimize-scale GKE stack + pre-seeded workload; matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
- deploy-hello-app: run-unique Artifact Registry repo name
- per-run tofu stack-dir copy (both arms) removes the shared .terraform.lock race (resolves the 'Shared OpenTofu working directory' known-issue)
- import + parallel-fix the merged complex/GKE tasks (#64 migration, #87 opa, #93 multi-region, #86 postgres/unhealthy/gitops, #76 debug-crashloop): per-run GitOps repo paths, dropped shared-SA container.admin (BYO creds), region-prefixed cluster names (avoid node-SA substr collision), unique task_id
- cp-recovery documented as the kind-only exception (docs/bastion.md)
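
As a sketch of what a run-unique Artifact Registry repo name for deploy-hello-app could look like, the snippet below derives the name from a base and a run identifier. It is illustrative only: the function name, the `run_id` format, and the 63-character cap are assumptions, and the base is assumed to start with a letter.

```python
import re
import uuid
from typing import Optional


def run_unique_repo_name(base: str, run_id: Optional[str] = None) -> str:
    """Derive a per-run repository name from `base` (hypothetical helper)."""
    # Fall back to a random suffix when the caller has no run identifier.
    suffix = run_id or uuid.uuid4().hex[:8]
    raw = f"{base}-{suffix}".lower()
    # Collapse anything outside lowercase letters, digits, and hyphens.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    # Assumed length cap; trim a trailing hyphen left by truncation.
    return cleaned[:63].rstrip("-")
```

With a stamp-style run id such as `20260625_A1`, the base `hello-app` becomes `hello-app-20260625-a1`, so two concurrent combos never create or delete the same repository.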

- namespace divergence (#2): pin NAMESPACE in task_extra_env for the pre-seeded tasks (multi-region -> storefront; unhealthy-pod/gitops/postgres/debug-crashloop -> default) so prompt {{NAMESPACE}} matches the fixture namespace on BOTH arms; add unused namespace var to prebuilt/minimum so postgres' pin doesn't warn
- debug-crashloop (#1): new prebuilt/debug-crashloop-kind stack applies the broken frontend fixture (was never applied under bare prebuilt/kind -> task was unsolvable); move fixture into the stack; drop false GKE/us-central1 prompt
- gitops repo path (#3): /app/results/... -> ~/gitops-repo-<cluster> (writable on the bastion, matches the other tasks' ~ convention)
- migration (#4): prompt tells the agent to uniquely name any throwaway validation cluster from {{CLUSTER_NAME}}
- docs (#5,#7): bastion.md lists all kind tasks (not just cp-recovery) + host capacity caveat; multi-region cluster_name description fixed (e-/w- prefix)
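
To make the namespace pin concrete, here is a small Python sketch of how a per-task pin could override a harness default when the prompt's `{{NAMESPACE}}` is filled in. This is an assumption-laden illustration: `render_prompt`, the argument names, and the harness default value `bench` are hypothetical; only the `{{NAMESPACE}}` placeholder, `task_extra_env`, and the `storefront` pin come from the commit message.

```python
import re
from typing import Mapping


def render_prompt(template: str, harness_env: Mapping[str, str],
                  task_extra_env: Mapping[str, str]) -> str:
    """Fill {{NAME}} placeholders, letting task-level pins win over defaults."""
    # Later mappings override earlier ones, so the task pin takes precedence.
    values = {**harness_env, **task_extra_env}

    def substitute(match: "re.Match[str]") -> str:
        # Leave unknown placeholders untouched rather than guessing a value.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"\{\{\s*([A-Z_][A-Z0-9_]*)\s*\}\}", substitute, template)
```

With a pin of `NAMESPACE=storefront`, a prompt containing `{{NAMESPACE}}` names the namespace where the fixture was actually deployed instead of the harness default.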

…ebhook retry; install fortio in vm-setup; doc gke-mcp mutation risk + stale-state reruns
- tf/prebuilt/optimize-scale: workload now listens on 8080 (harness port-forwards deployment to fixed remote 8080; hpa-example:80 made the load spike a silent no-op)
- tf/prebuilt/opa-remediation-kind/setup.sh: retry policy apply past Kyverno webhook readiness race (context deadline exceeded)
- scripts/bastion/vm-setup.sh: install fortio (chaos generate_load dependency)
- skills/run-parallel-evals: clear stale per-run tofu state + root-owned GitOps repos on rerun; fortio prereq; cluster-mutation blast-radius guardrail for manifest-gen
- complextasks/optimize-scale/README.md: new
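
The retry around the policy apply is implemented in `setup.sh`; the Python sketch below only illustrates the general pattern of retrying an action until a not-yet-ready webhook stops rejecting it. The function name, the attempt count, and the fixed delay are assumptions, not values taken from the script.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 5, delay_s: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # On the final attempt, surface the original error to the caller.
            if attempt == attempts:
                raise
            sleep(delay_s)
    raise AssertionError("unreachable")
```

In a setup step, `action` would wrap the apply call that initially fails with "context deadline exceeded"; passing a custom `sleep` keeps the helper testable without real waiting.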

…extra_env Lets all 20 tasks run in one parallel invocation — these two fixtures deploy into dedicated namespaces (matching each stack's namespace default), so without pinning they'd resolve {{NAMESPACE}} to the harness default and miss the fixtures. Closes the last gap for a single mixed parallel batch.

pradeepvrd · 2026-06-28T20:51:26Z

@jessie1111101 I cherry-picked all your comments in #141.

I think we can abandon this PR in favor of that one.

pradeepvrd force-pushed the skills/agent-skills branch from da32df0 to f1c4194 on June 25, 2026 20:29

jessie1111101 force-pushed the parallel-task-fixes-132 branch from 7dd6def to 323f325 on June 25, 2026 20:30

pradeepvrd force-pushed the skills/agent-skills branch from 7588a3e to ed54798 on June 26, 2026 01:11

pradeepvrd force-pushed the parallel-task-fixes-132 branch from 099ef11 to d946a83 on June 26, 2026 01:13

pradeepvrd force-pushed the skills/agent-skills branch 2 times, most recently from 6fef3ff to 121e7fb on June 26, 2026 21:49

pradeepvrd force-pushed the parallel-task-fixes-132 branch from d946a83 to e0b8758 on June 26, 2026 21:49

pradeepvrd and others added 17 commits June 26, 2026 15:21

docs(bastion): append known issues (shared VM-SA IAM clobber, host ca…

bdfebbb


skills: add .claude/skills discovery symlinks

0737127


pradeepvrd force-pushed the skills/agent-skills branch from 121e7fb to 377a5aa on June 26, 2026 22:22

pradeepvrd force-pushed the parallel-task-fixes-132 branch from e0b8758 to 35e8191 on June 26, 2026 22:22

pradeepvrd force-pushed the skills/agent-skills branch from 377a5aa to fdb29d1 on June 27, 2026 01:55

pradeepvrd mentioned this pull request Jun 27, 2026

feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated) #128

Merged

pradeepvrd closed this Jun 28, 2026

jessie1111101 wants to merge 17 commits into skills/agent-skills from parallel-task-fixes-132
