test(compare): results comparison harness (pkg evaluator vs devops_bench) by pradeepvrd · Pull Request #25 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-20T20:10:04Z

A droppable comparison harness that runs the same task through both entrypoints (`pkg/evaluator/evaluate.py` and `python -m devops_bench`) and diffs the two `results.json` files, so the refactored pipeline can be checked against the prior one. Local scaffolding, not intended to land upstream.

- `scripts/compare_legacy_vs_refactor.sh` + `scripts/compare_results.py`: a deterministic mock-Ollama regression gate that diffs the two result files and classifies each delta as matched / intended / regression.
- `scripts/run_compare.sh`: a task- and agent-agnostic real-infra comparison (task file as `$1`; agent selected entirely by env) used to compare gemini or openclaw on real GKE.
- `complextasks/secret-rotation/agent-rules.md`: operator brief consumed by the secret-rotation comparison.
- `pkg/agents/runner/{gcli,openclaw}.py`: the prior openclaw path now uses `oc`'s built-in `main` agent so it runs on a stock `oc` (the `operator` agent isn't configured there).
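The matched / intended / regression classification described above can be illustrated with a minimal sketch. This is not the code in `scripts/compare_results.py`, which is not shown here; the `results.json` schema, the `INTENDED_KEYS` allowlist, and the helper names `flatten`, `classify`, and `load_results` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any

# Hypothetical allowlist: leaf keys that are expected to differ between runs.
INTENDED_KEYS = frozenset({"run_id", "duration_s"})

_MISSING = object()  # sentinel for a path present in only one file


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into a dotted-path -> leaf-value mapping."""
    if isinstance(obj, dict) and obj:
        items = list(obj.items())
    elif isinstance(obj, list) and obj:
        items = list(enumerate(obj))
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def classify(legacy: Any, refactor: Any,
             intended: frozenset = INTENDED_KEYS) -> dict[str, str]:
    """Label every path as 'matched', 'intended', or 'regression'."""
    a, b = flatten(legacy), flatten(refactor)
    report: dict[str, str] = {}
    for path in sorted(a.keys() | b.keys()):
        left, right = a.get(path, _MISSING), b.get(path, _MISSING)
        if left is not _MISSING and right is not _MISSING and left == right:
            report[path] = "matched"
        elif path.rsplit(".", 1)[-1] in intended:
            report[path] = "intended"  # difference is on the allowlist
        else:
            report[path] = "regression"  # unexpected difference or missing path
    return report


def load_results(path: str) -> Any:
    """Read one results.json file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Inline sample data (invented) standing in for the two results.json files.
legacy = {"task": "secret-rotation", "score": 1.0, "meta": {"run_id": "a"}}
refactor = {"task": "secret-rotation", "score": 0.5, "meta": {"run_id": "b"}}
report = classify(legacy, refactor)
```

A gate built on this idea would exit non-zero whenever any path is labeled `regression`, which lets known differences through without hiding unexpected ones.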


pradeepvrd force-pushed the refactor/comparison branch from 9cd0842 to 8aadaa7 on June 20, 2026 20:18

pradeepvrd force-pushed the refactor/entrypoint branch from 448e337 to 46d99e1 on June 20, 2026 21:31

pradeepvrd force-pushed the refactor/comparison branch from 8aadaa7 to 7f69e79 on June 20, 2026 21:31

pradeepvrd force-pushed the refactor/entrypoint branch from 46d99e1 to f28d2ec on June 21, 2026 01:31

pradeepvrd force-pushed the refactor/comparison branch from 7f69e79 to 6203aa5 on June 21, 2026 01:31

pradeepvrd mentioned this pull request Jun 21, 2026

feat(agents): wire MCP servers + skills into the Gemini and OpenClaw CLI agents #34

Draft

pradeepvrd force-pushed the refactor/comparison branch from 6203aa5 to 1171433 on June 21, 2026 03:09

pradeepvrd changed the base branch from refactor/entrypoint to refactor/gemini-capabilities on June 21, 2026 03:09

pradeepvrd force-pushed the refactor/gemini-capabilities branch from 7cd085e to cbd1afd on June 22, 2026 01:53

pradeepvrd force-pushed the refactor/comparison branch from 81b4c2b to 709ecaf on June 22, 2026 01:53

pradeepvrd force-pushed the refactor/gemini-capabilities branch from cbd1afd to 940c4ea on June 23, 2026 05:04

pradeepvrd force-pushed the refactor/comparison branch from 709ecaf to 85970d2 on June 23, 2026 05:04

pradeepvrd mentioned this pull request Jun 23, 2026

feat(bastion): static GCE bastion for running the eval harness #35

Open

pradeepvrd force-pushed the refactor/comparison branch from 85970d2 to 60c3eaa on June 23, 2026 05:41

pradeepvrd changed the base branch from refactor/gemini-capabilities to feat/eval-bastion on June 23, 2026 05:41

pradeepvrd changed the title from ~~test(refactor): legacy-vs-refactor results comparison harness~~ to test(compare): results comparison harness (pkg evaluator vs devops_bench) on Jun 23, 2026

pradeepvrd force-pushed the feat/eval-bastion branch from 86406d0 to 49c1641 on June 23, 2026 06:09

pradeepvrd force-pushed the refactor/comparison branch from 60c3eaa to b452941 on June 23, 2026 06:09

pradeepvrd force-pushed the feat/eval-bastion branch from 49c1641 to cfb9cd4 on June 23, 2026 06:37

pradeepvrd force-pushed the refactor/comparison branch from b452941 to aed5db4 on June 23, 2026 06:37

pradeepvrd force-pushed the feat/eval-bastion branch from cfb9cd4 to 6b4e72e on June 23, 2026 07:55

pradeepvrd force-pushed the refactor/comparison branch from aed5db4 to 5669d3c on June 23, 2026 07:57

pradeepvrd force-pushed the feat/eval-bastion branch from 6b4e72e to e8a9ad5 on June 23, 2026 08:22

pradeepvrd force-pushed the refactor/comparison branch from 5669d3c to 4b99b21 on June 23, 2026 08:22

pradeepvrd force-pushed the feat/eval-bastion branch from e8a9ad5 to 55b37f2 on June 23, 2026 18:30

pradeepvrd force-pushed the refactor/comparison branch from 4b99b21 to d30c251 on June 23, 2026 18:32

pradeepvrd force-pushed the feat/eval-bastion branch from 55b37f2 to 829ef83 on June 23, 2026 18:36

pradeepvrd force-pushed the refactor/comparison branch from d30c251 to b79f541 on June 23, 2026 18:36

pradeepvrd wants to merge 1 commit into feat/eval-bastion from refactor/comparison