feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated) #128
Merged
…line)

Add a per-run isolation primitive (devops_bench/core/run_env.py) and thread it through the refactored pipeline so concurrent runs on one host never collide:
- OpenTofu state isolated per run via TF_DATA_DIR (written beside it, not inside).
- Chaos port-forward binds a free local port; the workload's remote port is fixed.
- OpenClaw agent state isolated per run; per-run --model; configurable agent id.
- Run id appended to the result dir; run environment snapshotted into the report.

Wires the primitive into the harness, deployers, chaos fault, agent, and CLI.
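A minimal sketch of what such a per-run isolation record could look like. It is not the contents of devops_bench/core/run_env.py: the names `RunEnv`, `make_run_env`, and `find_free_port`, the directory layout, and the 8-character run id are all assumptions made for illustration. The only external fact it relies on is that OpenTofu reads the `TF_DATA_DIR` environment variable to relocate its per-working-directory data.

```python
import os
import socket
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunEnv:
    """Hypothetical per-run isolation record (not the real run_env.py API)."""

    run_id: str
    tf_data_dir: Path
    chaos_local_port: int
    result_dir: Path

    def tofu_env(self) -> dict:
        # Copy the parent environment and point OpenTofu at this run's data dir.
        return {**os.environ, "TF_DATA_DIR": str(self.tf_data_dir)}


def find_free_port() -> int:
    # Binding to port 0 asks the OS for an unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def make_run_env(work_dir: Path, result_root: Path) -> RunEnv:
    # Assumed layout: every per-run path embeds the run id so runs never share it.
    run_id = uuid.uuid4().hex[:8]
    return RunEnv(
        run_id=run_id,
        tf_data_dir=work_dir / f".terraform-{run_id}",
        chaos_local_port=find_free_port(),
        result_dir=result_root / f"results-{run_id}",
    )
```

Note that the port returned by `find_free_port` is released before the caller binds it, so another process could in principle claim it in between; a real implementation may need a retry around the port-forward.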
Mirror the per-run isolation onto the legacy pkg/ evaluator pipeline so it can also run in parallel for legacy-vs-refactored comparison:
- pkg/runenv.py per-run isolation primitive; unique cluster name per run.
- Legacy TF state isolated per run; per-run gcp create marker.
- Chaos port-forward binds a free local port (pkg manager/evaluator).
- OpenClaw runner state isolated per run; per-run --model; configurable agent.
- Run id appended to the result dir.
Make the secret-rotation task safe for parallel/matrix runs: random-suffix the GCP resource names so concurrent provisions don't collide, and support bring-your-own runner credentials. Adds the task's agent-rules brief.
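The random-suffix idea can be sketched as follows. This is an illustration, not the task's actual naming code: the helper `suffixed_name`, the 6-character hex suffix, and the 63-character cap are assumptions (the cap matches the limit many GCP resource types apply, but the real limit depends on the resource).

```python
import secrets


def suffixed_name(base: str, suffix_len: int = 6, max_len: int = 63) -> str:
    """Return `base` plus a random lowercase-hex suffix, capped at max_len chars."""
    # token_hex(n) yields 2*n hex characters; request enough, then trim.
    suffix = secrets.token_hex(suffix_len // 2 + 1)[:suffix_len]
    # Leave room for the hyphen and suffix; avoid a doubled hyphen after truncation.
    head = base[: max_len - suffix_len - 1].rstrip("-")
    return f"{head}-{suffix}"
```

Two concurrent provisions that both call `suffixed_name("rotation-secret")` then get different resource names with overwhelming probability, which is what keeps them from colliding.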
Bastion-side orchestration for parallel evaluations:
- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split into refactored + legacy wrappers (run_matrix_legacy.sh); hardened against SSH drops; per-stamp remote runner with pre-created output dirs.
- Local by default; BENCH_REMOTE=1 opts into the ssh/bastion runner.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.
- docs/bastion.md: usage + known issues (shared VM-SA IAM clobber, host capacity).
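The matrix shape can be illustrated in Python, independent of the shell wrappers. This sketch does not reproduce run_matrix.sh: `build_matrix`, `run_cell`, `run_matrix`, and the `max_parallel` knob are hypothetical names, and `run_cell` is a placeholder that only labels a cell where a real runner would launch one isolated evaluation.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def build_matrix(tasks, models, agent_configs):
    # One cell per (task, model, agent_config) combination.
    return list(itertools.product(tasks, models, agent_configs))


def run_cell(cell):
    task, model, agent_config = cell
    # Placeholder: a real runner would start one isolated evaluation here.
    return f"{task}/{model}/{agent_config}"


def run_matrix(cells, max_parallel=2):
    # Bound concurrency so the host is not oversubscribed; map() keeps cell order.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_cell, cells))
```

A single run is then just the 1×1×1 case: `run_matrix(build_matrix(["task"], ["model"], ["config"]))`.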
… run-eval)

A dedicated home for agent skills so they evolve independently of feature PRs:
- devops-bench-review (new): review-only review across correctness, parallel-safety across the eval matrix axes (Task × Model × AgentConfig), task/stack conventions, and docs conventions; runs unit tests / ruff only, never evals or infra.
- run-parallel-evals: relocated here so all skills sit together; harness-agnostic with an Antigravity portability map and local/remote execution modes.
- run-eval (new): drive a single Task × Model × AgentConfig run end to end (a 1×1×1 matrix); reuses run-parallel-evals' wrappers and recovery/reference files.

Each skill is a source dir under .agents/skills/<name>/ plus a .claude/skills/ discovery symlink (force-added; .agents/.claude are git-excluded).
Add docs/parallel-evals.md: the end-to-end parallel-evaluation runbook (matrix CUJs, parallel-safety rules, resume-after-drop, Vertex setup), known issues from review findings, and the local-default / BENCH_REMOTE execution note. Docs-only; the run-parallel-evals skill lives in the skills PR (#132).
AishSundar approved these changes on Jun 27, 2026
Consolidated PR for the parallel-evaluation work (formerly the #128–#132 stack, now merged down into this branch). Six logical commits on top of main:

- devops_bench/core/run_env.py + wiring so concurrent runs on one host never collide (per-run TF state, free chaos port, openclaw state, run-id'd result dir).
- The same per-run isolation mirrored onto pkg/ for legacy-vs-refactored comparison.
- Secret-rotation task made parallel/matrix-safe.
- Bastion orchestration (run_matrix.sh / _matrix_lib.sh), local-default + BENCH_REMOTE, Vertex ADC, vm-setup.
- Docs: docs/parallel-evals.md.
- Skills: devops-bench-review, run-parallel-evals, run-eval.

Builds on #123 (bastion + comparison harness, already merged). #133 (parallel-task safety fixes) stacks on top and is the remaining PR after this one.