feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated) by pradeepvrd · Pull Request #128 · gke-labs/devops-bench

pradeepvrd · 2026-06-25T05:08:59Z

Consolidated PR for the parallel-evaluation work — formerly the #128–#132 stack, now merged down
into this branch. Six logical commits on top of main:

- feat: per-run isolation (refactored pipeline). Adds devops_bench/core/run_env.py plus wiring so concurrent runs on one host never collide (per-run TF state, free chaos port, openclaw state, run-id'd result dir).
- feat: per-run isolation (legacy pkg pipeline). Mirrors the same isolation onto pkg/ for legacy-vs-refactored comparison.
- feat(secret-rotation): random-suffix GCP names + BYO runner credentials. Makes the task parallel/matrix-safe.
- feat(bastion): parallel Task×Model×AgentConfig eval matrix + Vertex auth. Covers bastion orchestration (run_matrix.sh / _matrix_lib.sh), local-default + BENCH_REMOTE, Vertex ADC, and vm-setup.
- docs(parallel-evals): parallel evaluation runbook + known issues. Adds docs/parallel-evals.md.
- skills: agent skills home. Contains devops-bench-review, run-parallel-evals, and run-eval.

Builds on #123 (bastion + comparison harness, already merged). #133 (parallel-task safety fixes)
stacks on top and is the remaining PR after this one.

…line) Add a per-run isolation primitive (devops_bench/core/run_env.py) and thread it through the refactored pipeline so concurrent runs on one host never collide:

- OpenTofu state isolated per run via TF_DATA_DIR (written beside it, not inside).
- Chaos port-forward binds a free local port; the workload's remote port is fixed.
- OpenClaw agent state isolated per run; per-run --model; configurable agent id.
- Run id appended to the result dir; run environment snapshotted into the report.

Wires the primitive into the harness, deployers, chaos fault, agent, and CLI.
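The isolation rules above can be sketched roughly as follows. This is an illustrative sketch only, not the contents of devops_bench/core/run_env.py: the `RunEnv` class, `find_free_port`, `make_run_env`, and the directory layout are all hypothetical names chosen for the example. The only detail taken from the commit message is that OpenTofu state is redirected per run through the `TF_DATA_DIR` environment variable.

```python
import os
import socket
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunEnv:
    """Hypothetical per-run isolation record (not the real run_env.py API)."""

    run_id: str
    tf_data_dir: Path
    result_dir: Path
    chaos_local_port: int

    def subprocess_env(self) -> dict[str, str]:
        """Copy of the current environment with OpenTofu state redirected."""
        env = dict(os.environ)
        env["TF_DATA_DIR"] = str(self.tf_data_dir)
        return env


def find_free_port() -> int:
    """Ask the OS for an unused local TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def make_run_env(results_root: Path, state_root: Path) -> RunEnv:
    """Create per-run directories keyed by a fresh run id (assumed layout)."""
    run_id = uuid.uuid4().hex[:8]
    tf_data_dir = state_root / run_id / ".terraform"
    result_dir = results_root / f"results-{run_id}"
    tf_data_dir.mkdir(parents=True, exist_ok=True)
    result_dir.mkdir(parents=True, exist_ok=True)
    return RunEnv(run_id, tf_data_dir, result_dir, find_free_port())
```

One caveat with this approach: the port is released before the port-forward binds it, so another process could claim it in between. A caller would typically retry on a bind failure.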

Mirror the per-run isolation onto the legacy pkg/ evaluator pipeline so it can also run in parallel for legacy-vs-refactored comparison:

- pkg/runenv.py per-run isolation primitive; unique cluster name per run.
- Legacy TF state isolated per run; per-run gcp create marker.
- Chaos port-forward binds a free local port (pkg manager/evaluator).
- OpenClaw runner state isolated per run; per-run --model; configurable agent.
- Run id appended to the result dir.

Make the secret-rotation task safe for parallel/matrix runs: random-suffix the GCP resource names so concurrent provisions don't collide, and support bring-your-own runner credentials. Adds the task's agent-rules brief.

Bastion-side orchestration for parallel evaluations:

- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split into refactored + legacy wrappers (run_matrix_legacy.sh); hardened against SSH drops; per-stamp remote runner with pre-created output dirs.
- Local by default; BENCH_REMOTE=1 opts into the ssh/bastion runner.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.
- docs/bastion.md: usage + known issues (shared VM-SA IAM clobber, host capacity).
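The Task×Model×AgentConfig matrix is a Cartesian product of the three axes, with each cell run independently. The PR implements this in shell (run_matrix.sh / _matrix_lib.sh); the Python below is only a conceptual sketch of the same shape. `Cell`, `expand_matrix`, `run_matrix`, and the `max_parallel` cap are hypothetical names, not part of the harness.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Cell(NamedTuple):
    """One point in the evaluation matrix (hypothetical type)."""

    task: str
    model: str
    agent_config: str


def expand_matrix(
    tasks: list[str], models: list[str], agent_configs: list[str]
) -> list[Cell]:
    """Enumerate every Task x Model x AgentConfig combination."""
    return [Cell(*combo) for combo in itertools.product(tasks, models, agent_configs)]


def run_matrix(
    cells: list[Cell], run_cell: Callable[[Cell], str], max_parallel: int = 4
) -> dict[Cell, str]:
    """Run each cell with bounded parallelism and collect results per cell."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so zip pairs each cell with its result.
        return dict(zip(cells, pool.map(run_cell, cells)))
```

Bounding the worker count is one way to respect the host-capacity issue that docs/bastion.md lists; the right cap depends on the host.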

… run-eval) A dedicated home for agent skills so they evolve independently of feature PRs:

- devops-bench-review (new): a review-only skill covering correctness, parallel-safety across the eval matrix axes (Task × Model × AgentConfig), task/stack conventions, and docs conventions; runs unit tests / ruff only, never evals or infra.
- run-parallel-evals: relocated here so all skills sit together; harness-agnostic with an Antigravity portability map and local/remote execution modes.
- run-eval (new): drive a single Task × Model × AgentConfig run end to end (a 1×1×1 matrix); reuses run-parallel-evals' wrappers and recovery/reference files.

Each skill is a source dir under .agents/skills/<name>/ plus a .claude/skills/ discovery symlink (force-added; .agents/.claude are git-excluded).
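The source-dir-plus-symlink layout described above could be created along these lines. This is a sketch under assumptions: `link_skill` is a hypothetical helper, and the use of a relative symlink target is a choice made for the example; the PR does not say how the links were created.

```python
import os
from pathlib import Path


def link_skill(repo_root: Path, name: str) -> Path:
    """Create .claude/skills/<name> as a symlink to .agents/skills/<name>."""
    source = repo_root / ".agents" / "skills" / name
    link = repo_root / ".claude" / "skills" / name
    source.mkdir(parents=True, exist_ok=True)
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link rather than failing
    # A relative target (assumed) keeps the link valid if the checkout moves.
    target = os.path.relpath(source, start=link.parent)
    link.symlink_to(target, target_is_directory=True)
    return link
```

Because the commit notes that .agents and .claude are git-excluded, files created this way would still need `git add -f` to be tracked.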

Add docs/parallel-evals.md: the end-to-end parallel-evaluation runbook (matrix CUJs, parallel-safety rules, resume-after-drop, Vertex setup), known issues from review findings, and the local-default / BENCH_REMOTE execution note. Docs-only; the run-parallel-evals skill lives in the skills PR (#132).

pradeepvrd mentioned this pull request Jun 25, 2026

docs(parallel-evals): parallel evaluation runbook + known issues #126

Merged

pradeepvrd force-pushed the feat/run-isolation branch from 5ae8b93 to b452d15 on June 25, 2026 16:17

pradeepvrd force-pushed the refactor/comparison branch 2 times, most recently from 807c52f to f0c6bdf on June 26, 2026 01:07

pradeepvrd force-pushed the feat/run-isolation branch from b452d15 to 38d1eca on June 26, 2026 01:09

pradeepvrd mentioned this pull request Jun 26, 2026

feat(harness): default harness wiring + env-snapshot reporting #120

Merged

pradeepvrd force-pushed the refactor/comparison branch from f0c6bdf to 66e296b on June 26, 2026 21:49

pradeepvrd force-pushed the feat/run-isolation branch from 38d1eca to 5857abc on June 26, 2026 21:49

pradeepvrd force-pushed the refactor/comparison branch from 66e296b to bd508a9 on June 26, 2026 22:22

pradeepvrd force-pushed the feat/run-isolation branch from 5857abc to f088e1c on June 26, 2026 22:22

pradeepvrd marked this pull request as ready for review June 27, 2026 01:20

pradeepvrd changed the title ~~feat: per-run isolation for parallel evaluation runs (refactored pipeline)~~ feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated) Jun 27, 2026

pradeepvrd changed the base branch from refactor/comparison to main June 27, 2026 02:04

pradeepvrd added 6 commits June 26, 2026 19:05

pradeepvrd force-pushed the feat/run-isolation branch from 0d26d08 to eb8b318 on June 27, 2026 02:06

AishSundar approved these changes Jun 27, 2026

pradeepvrd merged commit a0ba40e into main Jun 27, 2026
11 checks passed

pradeepvrd merged 6 commits into main from feat/run-isolation

2 participants
