test(compare): results comparison harness (pkg evaluator vs devops_bench) #25

Draft
pradeepvrd wants to merge 1 commit into feat/eval-bastion from refactor/comparison

pradeepvrd (Owner) commented Jun 20, 2026

A droppable comparison harness that runs the same task through both entrypoints, `pkg/evaluator/evaluate.py` and `python -m devops_bench`, and diffs the two `results.json` files, so the refactored pipeline can be checked against the prior one. Local scaffolding, not intended to land upstream.

  • `scripts/compare_legacy_vs_refactor.sh` + `scripts/compare_results.py`: a deterministic mock-Ollama regression gate that diffs the two result files and classifies each delta as matched / intended / regression.
  • `scripts/run_compare.sh`: a task- and agent-agnostic real-infra comparison (task file as `$1`; agent selected entirely by env) used to compare gemini or openclaw on real GKE.
  • `complextasks/secret-rotation/agent-rules.md`: operator brief consumed by the secret-rotation comparison.
  • `pkg/agents/runner/{gcli,openclaw}.py`: the prior openclaw path now uses `oc`'s built-in `main` agent so it runs on a stock `oc` (the `operator` agent isn't configured there).
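
For readers unfamiliar with the matched / intended / regression split, here is a minimal sketch of how such a classification could work. It is not the code in `scripts/compare_results.py`; the function names, the `INTENDED_KEYS` allow-list, and the sample result fields are all hypothetical and chosen only for illustration.

```python
# Hypothetical allow-list: leaf field names that are expected to differ
# between the two pipelines (assumption, not taken from the PR).
INTENDED_KEYS = {"duration_s", "run_id"}


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {dotted.path: leaf_value} mapping."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def classify(legacy, refactor, intended=INTENDED_KEYS):
    """Label every leaf path as 'matched', 'intended', or 'regression'."""
    old, new = flatten(legacy), flatten(refactor)
    report = {}
    for path in sorted(old.keys() | new.keys()):
        if path in old and path in new and old[path] == new[path]:
            report[path] = "matched"
        elif path.split(".")[-1] in intended:
            # Differs (or is missing on one side), but the field is allow-listed.
            report[path] = "intended"
        else:
            report[path] = "regression"
    return report


# Made-up sample data standing in for the two results.json payloads.
legacy = {"score": 1.0, "duration_s": 12.5,
          "checks": [{"name": "rotated", "passed": True}]}
refactor = {"score": 1.0, "duration_s": 9.1,
            "checks": [{"name": "rotated", "passed": False}]}

report = classify(legacy, refactor)
for path, label in report.items():
    print(f"{label:10} {path}")
```

In practice the two inputs would come from `json.load` on each `results.json`, and a gate script could exit non-zero whenever any path is labeled `regression`.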

pradeepvrd force-pushed the refactor/comparison branch from 9cd0842 to 8aadaa7 on June 20, 2026 20:18
pradeepvrd force-pushed the refactor/entrypoint branch from 448e337 to 46d99e1 on June 20, 2026 21:31
pradeepvrd force-pushed the refactor/comparison branch from 8aadaa7 to 7f69e79 on June 20, 2026 21:31
pradeepvrd force-pushed the refactor/entrypoint branch from 46d99e1 to f28d2ec on June 21, 2026 01:31
pradeepvrd force-pushed the refactor/comparison branch from 7f69e79 to 6203aa5 on June 21, 2026 01:31
pradeepvrd force-pushed the refactor/comparison branch from 6203aa5 to 1171433 on June 21, 2026 03:09
pradeepvrd changed the base branch from refactor/entrypoint to refactor/gemini-capabilities on June 21, 2026 03:09
pradeepvrd force-pushed the refactor/gemini-capabilities branch from 7cd085e to cbd1afd on June 22, 2026 01:53
pradeepvrd force-pushed the refactor/comparison branch from 81b4c2b to 709ecaf on June 22, 2026 01:53
pradeepvrd force-pushed the refactor/gemini-capabilities branch from cbd1afd to 940c4ea on June 23, 2026 05:04
pradeepvrd force-pushed the refactor/comparison branch from 709ecaf to 85970d2 on June 23, 2026 05:04
pradeepvrd force-pushed the refactor/comparison branch from 85970d2 to 60c3eaa on June 23, 2026 05:41
pradeepvrd changed the base branch from refactor/gemini-capabilities to feat/eval-bastion on June 23, 2026 05:41
pradeepvrd changed the title from "test(refactor): legacy-vs-refactor results comparison harness" to "test(compare): results comparison harness (pkg evaluator vs devops_bench)" on Jun 23, 2026
pradeepvrd force-pushed the refactor/comparison branch from 60c3eaa to b452941 on June 23, 2026 06:09
pradeepvrd force-pushed the refactor/comparison branch from b452941 to aed5db4 on June 23, 2026 06:37
pradeepvrd force-pushed the refactor/comparison branch from aed5db4 to 5669d3c on June 23, 2026 07:57
pradeepvrd force-pushed the refactor/comparison branch from 5669d3c to 4b99b21 on June 23, 2026 08:22
pradeepvrd force-pushed the refactor/comparison branch from 4b99b21 to d30c251 on June 23, 2026 18:32