test(compare): results comparison harness (pkg evaluator vs devops_bench) #25
Draft
pradeepvrd wants to merge 1 commit into
A droppable comparison harness that runs the same task through both entrypoints (`pkg/evaluator/evaluate.py` and `python -m devops_bench`) and diffs the two `results.json` files, so the refactored pipeline can be checked against the prior one. This is local scaffolding and is not intended to land upstream.
- `scripts/compare_legacy_vs_refactor.sh` + `scripts/compare_results.py`: a deterministic mock-Ollama regression gate that diffs the two result files and classifies each delta as matched / intended / regression.
- `scripts/run_compare.sh`: a task- and agent-agnostic real-infra comparison (task file as `$1`; agent selected entirely by env) used to compare gemini or openclaw on real GKE.
- `complextasks/secret-rotation/agent-rules.md`: operator brief consumed by the secret-rotation comparison.
- `pkg/agents/runner/{gcli,openclaw}.py`: the prior openclaw path now uses `oc`'s built-in `main` agent so it runs on a stock `oc` (the `operator` agent isn't configured there).
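
To illustrate the matched / intended / regression classification described for `scripts/compare_results.py`, here is a minimal sketch. It is not the script from this PR: the flattening scheme, the `INTENDED_PATHS` allowlist, and the key names in the sample data are all hypothetical, since the actual `results.json` schema is not shown here.

```python
from typing import Any

# Hypothetical allowlist: dotted paths whose differences are expected
# after the refactor. The real harness may decide this differently.
INTENDED_PATHS = {"metadata.runner"}


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {dotted.path: leaf_value}."""
    if isinstance(obj, dict) and obj:
        items = list(obj.items())
    elif isinstance(obj, list) and obj:
        items = list(enumerate(obj))
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def classify(legacy: dict, refactor: dict,
             intended: set[str] = INTENDED_PATHS) -> dict[str, str]:
    """Label every leaf path as matched, intended, or regression."""
    old, new = flatten(legacy), flatten(refactor)
    report: dict[str, str] = {}
    for path in sorted(old.keys() | new.keys()):
        if path in old and path in new and old[path] == new[path]:
            report[path] = "matched"
        elif path in intended:
            report[path] = "intended"
        else:
            # Any unexplained difference, including a missing key.
            report[path] = "regression"
    return report


# Made-up sample data, only to exercise the three outcomes.
legacy = {"task": "secret-rotation", "score": 1.0,
          "metadata": {"runner": "evaluate.py"}}
refactor = {"task": "secret-rotation", "score": 0.5,
            "metadata": {"runner": "devops_bench"}}
report = classify(legacy, refactor)
```

With this sample input, `task` is matched, `metadata.runner` is intended, and `score` is a regression. A gate built this way would presumably exit non-zero whenever any path is labeled `regression`.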