feat(tasks): make all benchmark tasks parallel-run safe #133
Closed
jessie1111101 wants to merge 17 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Add the bastion-side matrix orchestration for running parallel evaluations:
- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split into refactored + legacy wrappers; hardened against SSH drops; per-stamp remote runner with pre-created output dirs.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.
…pacity) Concurrent GKE tasks managing the shared VM-SA's project IAM bindings clobber each other on teardown; kind host capacity (disk/inotify/clusters) sums across concurrent runs.
…n runner _matrix_lib.sh now defaults to LOCAL execution (nohup on this host, no ssh/sync, results in ~/matrix-runs/<stamp>); set BENCH_REMOTE=1 for the previous behavior (sync-to-bastion + remote nohup + pull). Gated via a host_exec helper and BENCH_REMOTE branches on launch/poll/pull/resume/sync; runner cd's to REPO_ROOT locally and tolerates a missing .venv/secrets.env. Wrapper headers + docs/bastion.md note the local default.
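A minimal sketch of the LOCAL-by-default gating this commit describes. Only the host_exec name and the BENCH_REMOTE=1 switch come from the commit message; the function body and the BASTION_HOST variable are assumptions, and the real _matrix_lib.sh may differ.

```shell
# Hypothetical host_exec-style helper (illustrative, not the actual _matrix_lib.sh).
# BENCH_REMOTE=1 routes the command over ssh; any other value runs it on this host.
host_exec() {
  if [ "${BENCH_REMOTE:-0}" = "1" ]; then
    # BASTION_HOST is an assumed variable name for the ssh target.
    # Note: ssh joins the arguments with spaces and the remote shell re-parses
    # them, so arguments containing spaces or quotes need extra quoting.
    ssh "${BASTION_HOST:?set BASTION_HOST when BENCH_REMOTE=1}" "$@"
  else
    # Local default: run the command directly, no ssh and no sync.
    "$@"
  fi
}
```

With a helper of this shape, launch/poll/pull steps can call `host_exec <command>` once and let the environment decide where it runs, instead of branching at every call site.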
Collect agent skills in one place so more skills / agent guidelines can be added independently of feature PRs.
- devops-bench-review: comprehensive PR/workspace review (correctness, parallel-safety across the eval matrix axes, task/stack + docs conventions).
- run-parallel-evals: moved here from the parallel-eval feature branch. Its referenced docs (docs/parallel-evals.md, docs/bastion.md) and scripts/bastion/* land with the parallel-eval PR; the skill is dormant until that merges.
Mirror the .agents/skills/<name> sources into .claude/skills/<name> so Claude Code discovers them (same symlink pattern the skill source uses).
Add an explicit Scope & guardrails section: the skill analyzes statically and presents findings; it may run unit tests / ruff lint + format checks, but must NOT run benchmark evals, the matrix, or any infra provisioning/teardown. Parallel hazards are found by reading and reasoning, not by launching concurrent runs.
…ure modes Add a Phase-1 pre-flight that screens the selected matrix tasks against the known per-run isolation gaps (shared $HOME repos, multi-cluster node-SA collapse, shared VM-SA IAM clobber, task_id collisions, host-capacity sum), and a Phase-6 step (+ guardrail) making it part of the run to append any new failure mode to the docs/parallel-evals.md / docs/bastion.md appendices.
…essive disclosure
- SKILL.md becomes a lean router: a Modes section points to references/ loaded on demand; adds a background-classifier keepalive guardrail.
- references/resilient-monitoring.md: low-tier monitor + mid-tier analyzer subagents, supervisor loop, API-error recovery (re-spawn on durable RESUME_STAMP), and keepalive so a long detached run isn't classified as finished mid-run.
- references/unlimited-mode.md (opt-in): classify flake vs model vs bug; diagnose -> fix in a worktree -> re-sync -> restart failed combos -> repeat, with attempt caps, scoped local commits, self-review, and budget/checkpoint guardrails.
Describe required capabilities (sub-tasks, model tiers, background run, timer/wake, durable state, worktree, keepalive) instead of Claude Code tool names; add a 'Harness portability' map in SKILL.md and name Claude Code/Antigravity/Codex equivalents as examples with inline fallbacks (a bare shell that can ssh the bastion suffices). Generalize devops-bench-review's CLAUDE.md -> agent-instruction files and the verifier sub-agent -> harness-agnostic pass.
…y map Replace the 2-way (Claude Code / other) portability table with a 4-column one that names Antigravity's real primitives — dynamic subagents + 'agy -p --model', the Gemini 3.5 Flash / 3.1 Pro tiers via /models|--model, Background Agents, Scheduled tasks, Artifacts + Knowledge base, project/workspace scoping, request-review approval, and quota-as-governor — alongside Claude Code and a generic fallback.
… docs) Map the Antigravity column to the real snake_case tools from antigravity.google/docs/hooks: run_command (+ RunPersistent), the programmatic subagent tool + browser_subagent, view_file/list_dir/grep_search/codebase_search/view_code_item, write_to_file/replace_file_content/multi_replace_file_content; and add a footnote listing them with the PreToolUse/PostToolUse/PreInvocation/PostInvocation/Stop hook events.
Replace docs-scraped names with the actual agent tool set reported by a running Antigravity instance: invoke_subagent/define_subagent/manage_subagents/send_message (sub-tasks), manage_task (background), schedule (timer/cron), ask_question/ask_permission (user), view_file/list_dir/grep_search + write_to_file/replace_file_content/multi_replace_file_content (files), run_command, search_web/read_url_content. Drops the incorrect codebase_search/view_code_item/browser_subagent/RunPersistent that secondary sources had claimed.
…n mode
- Collapse the portability table to one primitive per cell (invoke_subagent, schedule, …); add a 'harness chains the rest' note; full inventory stays in the footnote.
- Add an Execution mode section: LOCAL is the default (run on this host, no ssh/sync); BENCH_REMOTE=1 (+ BASTION_*) opts into the bastion ssh runner. Phase 1 now always asks local-or-remote; command snippets are the remote form with a 'drop the ssh wrapper locally' convention; references' monitor + unlimited re-sync are mode-aware.
Stacked on #132 (skills/agent-skills). Each matrix combo provisions its own cluster; this makes every task collision-free under concurrent runs:
- 6 manifest-gen tasks -> deployer: noop (no cluster); legacy factory honors noop
- optimize-scale: new prebuilt/optimize-scale GKE stack + pre-seeded workload; matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
- deploy-hello-app: run-unique Artifact Registry repo name
- per-run tofu stack-dir copy (both arms) removes the shared .terraform.lock race (resolves the 'Shared OpenTofu working directory' known-issue)
- import + parallel-fix the merged complex/GKE tasks (#64 migration, #87 opa, #93 multi-region, #86 postgres/unhealthy/gitops, #76 debug-crashloop): per-run GitOps repo paths, dropped shared-SA container.admin (BYO creds), region-prefixed cluster names (avoid node-SA substr collision), unique task_id
- cp-recovery documented as the kind-only exception (docs/bastion.md)
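One way the per-run stack-dir copy could work, sketched under assumptions: prepare_run_stack, RUNS_ROOT, and the run-id argument are hypothetical names, not taken from the harness. Each run gets a private copy of the stack directory, so concurrent tofu invocations never write the same lock file or .terraform directory.

```shell
# Hypothetical helper (names are illustrative, not the harness's own).
# Copies a stack directory into a run-private working dir and prints its path.
prepare_run_stack() {
  local stack_src="$1" run_id="$2"
  # RUNS_ROOT is an assumed override; defaults to a directory under TMPDIR.
  local runs_root="${RUNS_ROOT:-${TMPDIR:-/tmp}/tofu-runs}"
  local dest
  dest="${runs_root}/${run_id}/$(basename "$stack_src")"
  mkdir -p "$dest"
  # The trailing "/." copies the directory contents, including dotfiles.
  cp -R "$stack_src/." "$dest/"
  printf '%s\n' "$dest"
}
```

A caller would then point OpenTofu at the copy rather than the shared source, for example `tofu -chdir="$(prepare_run_stack tf/prebuilt/optimize-scale "$run_id")" init`, where `$run_id` stands for whatever unique identifier the run already has.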
- namespace divergence (#2): pin NAMESPACE in task_extra_env for the pre-seeded tasks (multi-region->storefront, unhealthy-pod/gitops/postgres/debug-crashloop ->default) so prompt {{NAMESPACE}} matches the fixture namespace on BOTH arms; add unused namespace var to prebuilt/minimum so postgres' pin doesn't warn
- debug-crashloop (#1): new prebuilt/debug-crashloop-kind stack applies the broken frontend fixture (was never applied under bare prebuilt/kind -> task was unsolvable); move fixture into the stack; drop false GKE/us-central1 prompt
- gitops repo path (#3): /app/results/... -> ~/gitops-repo-<cluster> (writable on the bastion, matches the other tasks' ~ convention)
- migration (#4): prompt tells the agent to uniquely name any throwaway validation cluster from {{CLUSTER_NAME}}
- docs (#5,#7): bastion.md lists all kind tasks (not just cp-recovery) + host capacity caveat; multi-region cluster_name description fixed (e-/w- prefix)
…ebhook retry; install fortio in vm-setup; doc gke-mcp mutation risk + stale-state reruns
- tf/prebuilt/optimize-scale: workload now listens on 8080 (harness port-forwards deployment to fixed remote 8080; hpa-example:80 made the load spike a silent no-op)
- tf/prebuilt/opa-remediation-kind/setup.sh: retry policy apply past Kyverno webhook readiness race (context deadline exceeded)
- scripts/bastion/vm-setup.sh: install fortio (chaos generate_load dependency)
- skills/run-parallel-evals: clear stale per-run tofu state + root-owned GitOps repos on rerun; fortio prereq; cluster-mutation blast-radius guardrail for manifest-gen
- complextasks/optimize-scale/README.md: new
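The retry around the policy apply might look roughly like this. It is a generic sketch, not the contents of setup.sh: the retry function, its argument order, and the policy.yaml path in the usage note are all assumptions.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or ATTEMPTS tries have failed.
# Hypothetical helper; the real setup.sh may implement its retry differently.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

A setup script could then wrap the apply step as `retry 10 5 kubectl apply -f policy.yaml`, so a 'context deadline exceeded' from a webhook that is not ready yet is retried instead of failing the whole run.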
…extra_env
Lets all 20 tasks run in one parallel invocation — these two fixtures deploy
into dedicated namespaces (matching each stack's namespace default), so without
pinning they'd resolve {{NAMESPACE}} to the harness default and miss the
fixtures. Closes the last gap for a single mixed parallel batch.
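To illustrate why the pin matters, here is a small sketch of placeholder resolution. The render_prompt function and the fallback value "default" are assumptions made for the example; the source only says that an unpinned {{NAMESPACE}} resolves to the harness default.

```shell
# Hypothetical renderer: substitutes {{NAMESPACE}} in a prompt template.
# Falls back to "default" when NAMESPACE is unset (assumed harness default).
render_prompt() {
  local ns="${NAMESPACE:-default}"
  printf '%s\n' "$1" | sed "s/{{NAMESPACE}}/${ns}/g"
}
```

With NAMESPACE unset, a template such as `kubectl -n {{NAMESPACE}} get pods` would point at the fallback namespace; exporting NAMESPACE=storefront before rendering makes the prompt agree with a fixture deployed into storefront.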
Collaborator
@jessie1111101 I cherry-picked all your comments in #141. I think we can abandon this PR in favor of that one.
Stacked on #132 (base: skills/agent-skills).
Makes every benchmark task safe to run concurrently in the Task×Model×AgentConfig matrix. Each combo provisions its own cluster, so this ensures no cross-run collisions on shared/global or host-level resources.
Changes
- 6 manifest-gen tasks -> deployer: noop (no cluster); legacy factory now honors noop
- optimize-scale: new prebuilt/optimize-scale GKE stack + pre-seeded workload; matrix pins TARGET_DEPLOYMENT_NAME/NAMESPACE so both arms agree
- per-run tofu stack-dir copy (both arms) removes the shared .terraform.lock.hcl race (resolves that known-issue)
- import + parallel-fix the merged complex/GKE tasks (migration, opa, multi-region, postgres, unhealthy-pod, gitops, debug-crashloop): per-run GitOps repo paths, dropped shared-SA container.admin (BYO creds), region-prefixed cluster names (avoid node-SA substr collision)
- … on bare prebuilt/kind)
- … {{NAMESPACE}} matches the fixture's namespace on both arms
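The region-prefixed naming can be motivated with a small sketch. It assumes the node service-account ID is derived by truncating the cluster name to a fixed length; the source only says 'substr collision', so the sa_id_for function, the 20-character limit, and the example cluster names are all hypothetical.

```shell
# Assumed derivation: keep the first MAX_LEN characters of the cluster name,
# as a substr()-style truncation would. Illustrative only.
sa_id_for() {
  local cluster="$1" max_len="${2:-20}"
  printf '%s\n' "$cluster" | cut -c1-"$max_len"
}

# A region suffix falls past the cut, so both clusters map to one ID.
sa_id_for bench-multi-region-run42-east
sa_id_for bench-multi-region-run42-west

# A region prefix (e-/w-) stays inside the cut, so the IDs differ.
sa_id_for e-bench-multi-region-run42
sa_id_for w-bench-multi-region-run42
```

Under that assumption, putting the distinguishing part at the front of the name keeps it inside any truncated prefix, which is why a prefix is safer than a suffix here.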
Addressed from /devops-bench-review
namespace divergence, debug-crashloop fixture, gitops repo path → ~, migration scratch-cluster naming, doc fixes.
Known follow-ups (out of scope)
- … task_ids (complextasks vs tasks/gcp); only migration renumbered
- … NAMESPACE export would also cover direct runs