feat(tasks): make all benchmark tasks parallel-run safe #133

Closed. jessie1111101 wants to merge 17 commits into skills/agent-skills from parallel-task-fixes-132

@jessie1111101

Collaborator

Stacked on #132 (base: skills/agent-skills).

Makes every benchmark task safe to run concurrently in the Task×Model×AgentConfig
matrix. Each combo provisions its own cluster, so this ensures no cross-run
collisions on shared/global or host-level resources.

Changes

  • 6 manifest-gen tasks → deployer: noop (no cluster); legacy factory now honors noop
  • optimize-scale → new prebuilt/optimize-scale GKE stack + pre-seeded workload;
    matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
  • deploy-hello-app → run-unique Artifact Registry repo name
  • per-run tofu stack-dir copy (both arms) → removes the shared .terraform.lock.hcl
    race (resolves that known-issue)
  • imported + parallel-fixed the merged complex/GKE tasks (migration, opa,
    multi-region, postgres, unhealthy-pod, gitops, debug-crashloop): per-run GitOps
    repo paths, dropped shared-SA container.admin (BYO creds), region-prefixed
    cluster names (avoid node-SA substr collision)
  • debug-crashloop → new kind stack that applies the broken fixture (was unsolvable
    on bare prebuilt/kind)
  • namespace pinned per pre-seeded task so the prompt's {{NAMESPACE}} matches the
    fixture's namespace on both arms
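The per-run stack-dir copy can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `copy_stack_for_run`, `runs_root`, and `run_id` are hypothetical names, and skipping the `.terraform` cache directory is an assumption about what each run should re-initialize for itself.

```python
import shutil
from pathlib import Path


def copy_stack_for_run(stack_dir: Path, runs_root: Path, run_id: str) -> Path:
    """Copy a shared stack directory into a run-private working directory.

    Hypothetical helper: each run then works on its own copy, so concurrent
    runs no longer write the same .terraform.lock.hcl.
    """
    dest = runs_root / run_id / stack_dir.name
    # Assumption: the provider cache (.terraform) is not copied, so every
    # run initializes its own. copytree raises FileExistsError if dest
    # already exists, which surfaces an accidentally reused run_id.
    shutil.copytree(stack_dir, dest, ignore=shutil.ignore_patterns(".terraform"))
    return dest
```

Failing on an existing destination is deliberate in this sketch: a reused run ID is reported as an error rather than silently sharing state between two runs.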

Addressed from /devops-bench-review

namespace divergence, debug-crashloop fixture, gitops repo path → ~, migration
scratch-cluster naming, doc fixes.

Known follow-ups (out of scope)

  • systemic duplicate task_ids (complextasks vs tasks/gcp) — only migration renumbered
  • direct (non-matrix) runs: namespace pins live in the matrix helper; a harness-level
    NAMESPACE export would also cover direct runs
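A check for the duplicate task_id follow-up could look something like the sketch below. It assumes task definitions have already been loaded into `(path, task_id)` pairs; the function name and the example paths in the usage note are hypothetical, not taken from the repository.

```python
from collections import defaultdict


def find_duplicate_task_ids(tasks):
    """Return {task_id: [paths]} for every task_id defined more than once.

    `tasks` is an iterable of (path, task_id) pairs; how those pairs are
    loaded from the task files is outside this sketch.
    """
    seen = defaultdict(list)
    for path, task_id in tasks:
        seen[task_id].append(path)
    return {tid: paths for tid, paths in seen.items() if len(paths) > 1}
```

Called with pairs such as `("complextasks/a", "t1")` and `("tasks/gcp/b", "t1")`, it would report `t1` together with both paths, which makes the collision visible before a matrix run starts.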

@google-cla

google-cla Bot commented Jun 25, 2026


Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

pradeepvrd force-pushed the skills/agent-skills branch from da32df0 to f1c4194 on June 25, 2026 20:29
jessie1111101 force-pushed the parallel-task-fixes-132 branch from 7dd6def to 323f325 on June 25, 2026 20:30
pradeepvrd force-pushed the skills/agent-skills branch from 7588a3e to ed54798 on June 26, 2026 01:11
pradeepvrd force-pushed the parallel-task-fixes-132 branch from 099ef11 to d946a83 on June 26, 2026 01:13
pradeepvrd force-pushed the skills/agent-skills branch 2 times, most recently from 6fef3ff to 121e7fb on June 26, 2026 21:49
pradeepvrd force-pushed the parallel-task-fixes-132 branch from d946a83 to e0b8758 on June 26, 2026 21:49
pradeepvrd and others added 17 commits June 26, 2026 15:21
Add the bastion-side matrix orchestration for running parallel evaluations:

- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split
  into refactored + legacy wrappers; hardened against SSH drops; per-stamp
  remote runner with pre-created output dirs.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc
  google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.
…pacity)

Concurrent GKE tasks managing the shared VM-SA's project IAM bindings clobber
each other on teardown; kind host capacity (disk/inotify/clusters) sums across
concurrent runs.
…n runner

_matrix_lib.sh now defaults to LOCAL execution (nohup on this host, no ssh/sync,
results in ~/matrix-runs/<stamp>); set BENCH_REMOTE=1 for the previous behavior
(sync-to-bastion + remote nohup + pull). Gated via a host_exec helper and
BENCH_REMOTE branches on launch/poll/pull/resume/sync; runner cd's to REPO_ROOT
locally and tolerates a missing .venv/secrets.env. Wrapper headers + docs/bastion.md
note the local default.
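The local-by-default gating described above can be illustrated with a small sketch. The real helper lives in the shell library `_matrix_lib.sh`; this Python version only mirrors the branching, and `BASTION_HOST` is a hypothetical variable name standing in for whatever the bastion settings are actually called.

```python
import os


def host_exec_argv(command, env=None):
    """Build the argv that runs `command` locally or on the bastion.

    Sketch only: BENCH_REMOTE=1 selects the ssh path; anything else runs
    on this host. BASTION_HOST is an assumed name, not confirmed by the
    source.
    """
    env = os.environ if env is None else env
    if env.get("BENCH_REMOTE") == "1":
        # Remote mode: ssh hands the command string to the remote shell.
        return ["ssh", env["BASTION_HOST"], command]
    # Local mode (the default): no ssh and no sync step.
    return ["bash", "-c", command]
```

The returned list can be passed to `subprocess.run`, so launch, poll, and pull steps can share one code path and differ only in the argv they receive.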

Collect agent skills in one place so more skills / agent guidelines can be
added independently of feature PRs.

- devops-bench-review: comprehensive PR/workspace review (correctness,
  parallel-safety across the eval matrix axes, task/stack + docs conventions).
- run-parallel-evals: moved here from the parallel-eval feature branch.
  Its referenced docs (docs/parallel-evals.md, docs/bastion.md) and
  scripts/bastion/* land with the parallel-eval PR; the skill is dormant
  until that merges.

Mirror the .agents/skills/<name> sources into .claude/skills/<name> so Claude
Code discovers them (same symlink pattern the skill source uses).

Add an explicit Scope & guardrails section: the skill analyzes statically and
presents findings; it may run unit tests / ruff lint + format checks, but must
NOT run benchmark evals, the matrix, or any infra provisioning/teardown. Parallel
hazards are found by reading and reasoning, not by launching concurrent runs.
…ure modes

Add a Phase-1 pre-flight that screens the selected matrix tasks against the known
per-run isolation gaps (shared $HOME repos, multi-cluster node-SA collapse,
shared VM-SA IAM clobber, task_id collisions, host-capacity sum), and a Phase-6
step (+ guardrail) making it part of the run to append any new failure mode to the
docs/parallel-evals.md / docs/bastion.md appendices.
…essive disclosure

- SKILL.md becomes a lean router: a Modes section points to references/ loaded
  on demand; adds a background-classifier keepalive guardrail.
- references/resilient-monitoring.md: low-tier monitor + mid-tier analyzer
  subagents, supervisor loop, API-error recovery (re-spawn on durable RESUME_STAMP),
  and keepalive so a long detached run isn't classified as finished mid-run.
- references/unlimited-mode.md (opt-in): classify flake vs model vs bug; diagnose
  -> fix in a worktree -> re-sync -> restart failed combos -> repeat, with attempt
  caps, scoped local commits, self-review, and budget/checkpoint guardrails.

Describe required capabilities (sub-tasks, model tiers, background run, timer/wake,
durable state, worktree, keepalive) instead of Claude Code tool names; add a
'Harness portability' map in SKILL.md and name Claude Code/Antigravity/Codex
equivalents as examples with inline fallbacks (a bare shell that can ssh the bastion
suffices). Generalize devops-bench-review's CLAUDE.md -> agent-instruction files and
the verifier sub-agent -> harness-agnostic pass.
…y map

Replace the 2-way (Claude Code / other) portability table with a 4-column one
that names Antigravity's real primitives — dynamic subagents + 'agy -p --model',
the Gemini 3.5 Flash / 3.1 Pro tiers via /models|--model, Background Agents,
Scheduled tasks, Artifacts + Knowledge base, project/workspace scoping,
request-review approval, and quota-as-governor — alongside Claude Code and a
generic fallback.
… docs)

Map the Antigravity column to the real snake_case tools from
antigravity.google/docs/hooks — run_command (+ RunPersistent), the programmatic
subagent tool + browser_subagent, view_file/list_dir/grep_search/codebase_search/
view_code_item, write_to_file/replace_file_content/multi_replace_file_content — and
add a footnote listing them with the PreToolUse/PostToolUse/PreInvocation/
PostInvocation/Stop hook events.

Replace docs-scraped names with the actual agent tool set reported by a running
Antigravity instance: invoke_subagent/define_subagent/manage_subagents/send_message
(sub-tasks), manage_task (background), schedule (timer/cron), ask_question/
ask_permission (user), view_file/list_dir/grep_search + write_to_file/
replace_file_content/multi_replace_file_content (files), run_command, search_web/
read_url_content. Drops the incorrect codebase_search/view_code_item/browser_subagent
/RunPersistent that secondary sources had claimed.
…n mode

- Collapse the portability table to one primitive per cell (invoke_subagent,
  schedule, …); add a 'harness chains the rest' note; full inventory stays in the
  footnote.
- Add an Execution mode section: LOCAL is the default (run on this host, no
  ssh/sync); BENCH_REMOTE=1 (+ BASTION_*) opts into the bastion ssh runner. Phase 1
  now always asks local-or-remote; command snippets are the remote form with a
  'drop the ssh wrapper locally' convention; references' monitor + unlimited re-sync
  are mode-aware.

Stacked on #132 (skills/agent-skills). Each matrix combo provisions its own
cluster; this makes every task collision-free under concurrent runs:

- 6 manifest-gen tasks -> deployer: noop (no cluster); legacy factory honors noop
- optimize-scale: new prebuilt/optimize-scale GKE stack + pre-seeded workload;
  matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
- deploy-hello-app: run-unique Artifact Registry repo name
- per-run tofu stack-dir copy (both arms) removes the shared .terraform.lock race
  (resolves the 'Shared OpenTofu working directory' known-issue)
- import + parallel-fix the merged complex/GKE tasks (#64 migration, #87 opa,
  #93 multi-region, #86 postgres/unhealthy/gitops, #76 debug-crashloop):
  per-run GitOps repo paths, dropped shared-SA container.admin (BYO creds),
  region-prefixed cluster names (avoid node-SA substr collision), unique task_id
- cp-recovery documented as the kind-only exception (docs/bastion.md)
- namespace divergence (#2): pin NAMESPACE in task_extra_env for the pre-seeded
  tasks (multi-region->storefront, unhealthy-pod/gitops/postgres/debug-crashloop
  ->default) so prompt {{NAMESPACE}} matches the fixture namespace on BOTH arms;
  add unused namespace var to prebuilt/minimum so postgres' pin doesn't warn
- debug-crashloop (#1): new prebuilt/debug-crashloop-kind stack applies the
  broken frontend fixture (was never applied under bare prebuilt/kind -> task was
  unsolvable); move fixture into the stack; drop false GKE/us-central1 prompt
- gitops repo path (#3): /app/results/... -> ~/gitops-repo-<cluster> (writable on
  the bastion, matches the other tasks' ~ convention)
- migration (#4): prompt tells the agent to uniquely name any throwaway
  validation cluster from {{CLUSTER_NAME}}
- docs (#5,#7): bastion.md lists all kind tasks (not just cp-recovery) + host
  capacity caveat; multi-region cluster_name description fixed (e-/w- prefix)
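The namespace pin can be pictured with the sketch below: a per-task override is merged over the harness defaults before the prompt's `{{NAMESPACE}}` placeholder is filled in. `render_prompt` and `harness_env` are hypothetical names, and the real harness may resolve placeholders differently; only `task_extra_env`, `{{NAMESPACE}}`, `{{CLUSTER_NAME}}`, and the `storefront` pin come from the commit message.

```python
import re

# Matches {{NAME}} placeholders such as {{NAMESPACE}} or {{CLUSTER_NAME}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template, harness_env, task_extra_env):
    """Fill {{VAR}} placeholders, letting task_extra_env override defaults."""
    values = {**harness_env, **task_extra_env}

    def substitute(match):
        # A missing variable raises KeyError instead of leaving a
        # placeholder in the prompt unnoticed.
        return values[match.group(1)]

    return PLACEHOLDER.sub(substitute, template)
```

With a harness default of `NAMESPACE=default` and a task pin of `NAMESPACE=storefront`, the prompt resolves to `storefront`, which is the namespace the multi-region fixture deploys into.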
…ebhook retry; install fortio in vm-setup; doc gke-mcp mutation risk + stale-state reruns

- tf/prebuilt/optimize-scale: workload now listens on 8080 (harness port-forwards
  deployment to fixed remote 8080; hpa-example:80 made the load spike a silent no-op)
- tf/prebuilt/opa-remediation-kind/setup.sh: retry policy apply past Kyverno webhook
  readiness race (context deadline exceeded)
- scripts/bastion/vm-setup.sh: install fortio (chaos generate_load dependency)
- skills/run-parallel-evals: clear stale per-run tofu state + root-owned GitOps repos
  on rerun; fortio prereq; cluster-mutation blast-radius guardrail for manifest-gen
- complextasks/optimize-scale/README.md: new
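The retry around the policy apply follows a common pattern, sketched here in Python even though the actual fix lives in the shell script `setup.sh`. The function name, attempt count, and linear backoff are illustrative assumptions, not the values used in the stack.

```python
import time


def retry(operation, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Sketch of retrying past a readiness race (for example, a webhook that
    is not yet serving): wait a little longer after each failure, then
    re-raise the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # linear backoff
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise in unit tests without real waiting.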
…extra_env

Lets all 20 tasks run in one parallel invocation — these two fixtures deploy
into dedicated namespaces (matching each stack's namespace default), so without
pinning they'd resolve {{NAMESPACE}} to the harness default and miss the
fixtures. Closes the last gap for a single mixed parallel batch.
@pradeepvrd

Collaborator

@jessie1111101 I cherry-picked all your comments in #141

I think we can abandon this PR in favor of that one.

@pradeepvrd pradeepvrd closed this Jun 28, 2026

2 participants