feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated)#128

Merged
pradeepvrd merged 6 commits into main from feat/run-isolation
Jun 27, 2026

pradeepvrd commented Jun 25, 2026

Collaborator

Consolidated PR for the parallel-evaluation work (formerly the #128-#132 stack, now merged down
into this branch). Six logical commits on top of main:

  1. feat: per-run isolation (refactored pipeline) - devops_bench/core/run_env.py + wiring so
    concurrent runs on one host never collide (per-run TF state, free chaos port, openclaw state,
    run-id'd result dir).
  2. feat: per-run isolation (legacy pkg pipeline) — mirrors the same onto pkg/ for
    legacy-vs-refactored comparison.
  3. feat(secret-rotation): random-suffix GCP names + BYO runner credentials — makes the task
    parallel/matrix-safe.
  4. feat(bastion): parallel Task×Model×AgentConfig eval matrix + Vertex auth — bastion
    orchestration (run_matrix.sh / _matrix_lib.sh), local-default + BENCH_REMOTE, Vertex ADC,
    vm-setup.
  5. docs(parallel-evals): parallel evaluation runbook + known issues - docs/parallel-evals.md.
  6. skills: agent skills home - devops-bench-review, run-parallel-evals, run-eval.

Builds on #123 (bastion + comparison harness, already merged). #133 (parallel-task safety fixes)
stacks on top and is the remaining PR after this one.

pradeepvrd force-pushed the feat/run-isolation branch from 5ae8b93 to b452d15 Compare June 25, 2026 16:17
pradeepvrd force-pushed the refactor/comparison branch 2 times, most recently from 807c52f to f0c6bdf Compare June 26, 2026 01:07
pradeepvrd force-pushed the feat/run-isolation branch from b452d15 to 38d1eca Compare June 26, 2026 01:09
pradeepvrd force-pushed the refactor/comparison branch from f0c6bdf to 66e296b Compare June 26, 2026 21:49
pradeepvrd force-pushed the feat/run-isolation branch from 38d1eca to 5857abc Compare June 26, 2026 21:49
pradeepvrd force-pushed the refactor/comparison branch from 66e296b to bd508a9 Compare June 26, 2026 22:22
pradeepvrd force-pushed the feat/run-isolation branch from 5857abc to f088e1c Compare June 26, 2026 22:22
pradeepvrd marked this pull request as ready for review June 27, 2026 01:20
pradeepvrd changed the title feat: per-run isolation for parallel evaluation runs (refactored pipeline) feat: parallel evaluation harness — isolation, bastion matrix, skills & docs (consolidated) Jun 27, 2026
pradeepvrd changed the base branch from refactor/comparison to main June 27, 2026 02:04
…line)

Add a per-run isolation primitive (devops_bench/core/run_env.py) and thread it
through the refactored pipeline so concurrent runs on one host never collide:

- OpenTofu state isolated per run via TF_DATA_DIR (written beside it, not inside).
- Chaos port-forward binds a free local port; the workload's remote port is fixed.
- OpenClaw agent state isolated per run; per-run --model; configurable agent id.
- Run id appended to the result dir; run environment snapshotted into the report.

Wires the primitive into the harness, deployers, chaos fault, agent, and CLI.
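
To make the isolation bullets above concrete, here is a minimal Python sketch of what a per-run primitive could look like. It is not the actual contents of devops_bench/core/run_env.py: the `RunEnv` class, `find_free_port`, `make_run_env`, `child_env`, and the directory naming scheme are all hypothetical. The only name taken from real tooling is the `TF_DATA_DIR` environment variable, which OpenTofu reads to locate its working data directory.

```python
import os
import socket
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunEnv:
    """Hypothetical per-run isolation record (not the real run_env.py API)."""

    run_id: str
    tf_data_dir: Path
    result_dir: Path
    chaos_local_port: int


def find_free_port() -> int:
    """Ask the OS for an unused local TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def make_run_env(stack_dir: Path, results_root: Path) -> RunEnv:
    """Derive every per-run path from one short random run id."""
    run_id = uuid.uuid4().hex[:8]
    return RunEnv(
        run_id=run_id,
        # Assumed layout: one data dir per run, next to the stack sources.
        tf_data_dir=stack_dir / f".terraform-{run_id}",
        # Assumed layout: the run id is appended to the result dir name.
        result_dir=results_root / f"results-{run_id}",
        chaos_local_port=find_free_port(),
    )


def child_env(run: RunEnv) -> dict[str, str]:
    """Environment for a tofu subprocess, pointing it at this run's data dir."""
    env = dict(os.environ)
    env["TF_DATA_DIR"] = str(run.tf_data_dir)
    return env
```

Probing for a free port and then releasing it leaves a small window in which another process can claim the same port, so a real implementation may prefer to let the port-forward tool pick the port itself where that is supported.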
Mirror the per-run isolation onto the legacy pkg/ evaluator pipeline so it can
also run in parallel for legacy-vs-refactored comparison:

- pkg/runenv.py per-run isolation primitive; unique cluster name per run.
- Legacy TF state isolated per run; per-run gcp create marker.
- Chaos port-forward binds a free local port (pkg manager/evaluator).
- OpenClaw runner state isolated per run; per-run --model; configurable agent.
- Run id appended to the result dir.

Make the secret-rotation task safe for parallel/matrix runs: random-suffix the
GCP resource names so concurrent provisions don't collide, and support
bring-your-own runner credentials. Adds the task's agent-rules brief.
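
As an illustration of the random-suffix idea, the sketch below appends a short random hex suffix to a base resource name. The `suffixed_name` helper, its defaults, and the normalisation rule are assumptions made for this example rather than the task's actual code. The 63-character default reflects a common GCP name limit, but limits differ per resource type, so it is left as a parameter.

```python
import re
import secrets


def suffixed_name(base: str, max_len: int = 63, suffix_bytes: int = 3) -> str:
    """Return `base` plus a random hex suffix, trimmed to fit `max_len`.

    Hypothetical helper: lowercases the base and replaces any run of
    characters outside [a-z0-9-] with a single hyphen.
    """
    suffix = secrets.token_hex(suffix_bytes)  # two hex characters per byte
    cleaned = re.sub(r"[^a-z0-9-]+", "-", base.lower()).strip("-")
    room = max_len - len(suffix) - 1  # reserve space for "-<suffix>"
    trimmed = cleaned[:max(room, 0)].rstrip("-")
    if not trimmed:
        raise ValueError(f"no room for a base name in {max_len} characters")
    return f"{trimmed}-{suffix}"
```

Two concurrent provisions that both start from the base name `rotate-secret` would then get names such as `rotate-secret-<6 hex chars>`, which differ unless the random suffixes happen to coincide.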
Bastion-side orchestration for parallel evaluations:
- run_matrix.sh + _matrix_lib.sh: parallel Task×Model×AgentConfig matrix, split into
  refactored + legacy wrappers (run_matrix_legacy.sh); hardened against SSH drops;
  per-stamp remote runner with pre-created output dirs.
- Local by default; BENCH_REMOTE=1 opts into the ssh/bastion runner.
- BENCH_VERTEX mode via VM-SA ADC; configure-oc.sh --vertex registers the oc
  google-vertex provider; portable oc Vertex ADC auth across isolated runs.
- vm-setup / sync-to-bastion install the gemini CLI and support parallel runs.
- docs/bastion.md: usage + known issues (shared VM-SA IAM clobber, host capacity).
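
The real orchestration lives in shell (run_matrix.sh / _matrix_lib.sh). The Python sketch below only illustrates the shape of a Task×Model×AgentConfig matrix and the local-by-default switch. `Cell`, `build_matrix`, `use_remote_runner`, and `run_matrix` are hypothetical names, and the exact-match check on `BENCH_REMOTE=1` is an assumption about how the flag is read.

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Mapping, NamedTuple, Optional


class Cell(NamedTuple):
    """One matrix cell: a single Task x Model x AgentConfig combination."""

    task: str
    model: str
    agent_config: str


def build_matrix(
    tasks: Iterable[str], models: Iterable[str], agent_configs: Iterable[str]
) -> list[Cell]:
    """Expand the three axes into their full cross product."""
    return [Cell(*combo) for combo in itertools.product(tasks, models, agent_configs)]


def use_remote_runner(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Local by default; only BENCH_REMOTE=1 selects the remote runner."""
    environ = os.environ if environ is None else environ
    return environ.get("BENCH_REMOTE") == "1"


def run_matrix(
    cells: Iterable[Cell], run_cell: Callable[[Cell], int], max_parallel: int = 4
) -> dict[Cell, int]:
    """Run every cell with bounded parallelism and collect exit codes."""
    cell_list = list(cells)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        exit_codes = list(pool.map(run_cell, cell_list))
    return dict(zip(cell_list, exit_codes))
```

In this sketch `run_cell` would be whatever launches one isolated evaluation, and a 1×1×1 matrix is simply a single run.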
… run-eval)

A dedicated home for agent skills so they evolve independently of feature PRs:
- devops-bench-review (new): a review-only skill covering correctness, parallel-safety
  across the eval matrix axes (Task × Model × AgentConfig), task/stack conventions,
  and docs conventions; runs unit tests / ruff only, never evals or infra.
- run-parallel-evals: relocated here so all skills sit together; harness-agnostic with
  an Antigravity portability map and local/remote execution modes.
- run-eval (new): drive a single Task × Model × AgentConfig run end to end (a 1×1×1
  matrix); reuses run-parallel-evals' wrappers and recovery/reference files.
Each skill is a source dir under .agents/skills/<name>/ plus a .claude/skills/ discovery
symlink (force-added; .agents/.claude are git-excluded).
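
The source-dir-plus-symlink layout described above could be set up along these lines. The `link_skill` helper is hypothetical, and the choice of a relative link target is an assumption made so the link stays valid when the repository is checked out at a different path.

```python
import os
from pathlib import Path


def link_skill(repo_root: Path, name: str) -> Path:
    """Create .claude/skills/<name> as a symlink to .agents/skills/<name>."""
    source = repo_root / ".agents" / "skills" / name
    if not source.is_dir():
        raise FileNotFoundError(source)
    link = repo_root / ".claude" / "skills" / name
    link.parent.mkdir(parents=True, exist_ok=True)
    # Relative target (assumption): ../../.agents/skills/<name>
    target = os.path.relpath(source, start=link.parent)
    if link.is_symlink():
        link.unlink()  # replace a stale link rather than failing
    link.symlink_to(target, target_is_directory=True)
    return link
```

Because both directories are git-excluded according to the commit message, the link and the skill sources would still need to be force-added to be tracked.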
Add docs/parallel-evals.md: the end-to-end parallel-evaluation runbook (matrix CUJs,
parallel-safety rules, resume-after-drop, Vertex setup), known issues from review
findings, and the local-default / BENCH_REMOTE execution note. Docs-only; the
run-parallel-evals skill lives in the skills PR (#132).
pradeepvrd force-pushed the feat/run-isolation branch from 0d26d08 to eb8b318 Compare June 27, 2026 02:06
pradeepvrd merged commit a0ba40e into main Jun 27, 2026
11 checks passed

2 participants