feat(metrics): metrics suite + bundled skills #32
Draft
pradeepvrd wants to merge 1 commit into
pradeepvrd
commented
Jun 20, 2026
Previously the metrics suite was built inline in `pkg/evaluator/evaluate.py` (GEval construction, checklist parsing, and scoring in `evaluate_metrics_batch`). This PR extracts it into `devops_bench/metrics/` as registry-driven evaluators (`METRICS`) that operate on typed `MetricContext`/`MetricScore` objects, routes judges through the `models` layer, and ships judge skills as package data under `devops_bench/skills/`.
**Behavior changes**
- Each metric declares its own `applies()` and yields typed scores via the `METRICS` registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
- Judges are instantiated through the `models` layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
- Judge skill markdown moves from a hard-coded `skills/` filesystem path to `devops_bench/skills/` package data, so it resolves under `pip install` / wheels.
- The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.
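For reviewers who want a mental model of the registry shape, here is a minimal sketch. It is not the code in this PR: the field names on `MetricContext`, the `register` decorator, the `evaluate` helper, the `ChecklistCoverage` metric, and the `TOOL_ITEM` wording are all hypothetical, and a substring match stands in for the real judge call.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Protocol

# Hypothetical wording of the tool-invocation checklist item.
TOOL_ITEM = "invoked the expected tools"


@dataclass(frozen=True)
class MetricContext:
    # Assumed fields; the real MetricContext is not shown in this description.
    output: str
    checklist: List[str] = field(default_factory=list)
    mcp_enabled: bool = False


@dataclass(frozen=True)
class MetricScore:
    name: str
    value: float


class Metric(Protocol):
    name: str

    def applies(self, ctx: MetricContext) -> bool: ...

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]: ...


# Registry keyed by metric name; downstream packages add entries via register().
METRICS: Dict[str, Metric] = {}


def register(metric_cls):
    """Class decorator (hypothetical) that instantiates and registers a metric."""
    METRICS[metric_cls.name] = metric_cls()
    return metric_cls


@register
class ChecklistCoverage:
    name = "checklist_coverage"

    def applies(self, ctx: MetricContext) -> bool:
        return bool(ctx.checklist)

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]:
        # Drop the tool-invocation item when MCP is disabled.
        items = [i for i in ctx.checklist if ctx.mcp_enabled or i != TOOL_ITEM]
        if not items:
            return
        # Toy stand-in for a judge: an item counts as met if it appears in the output.
        met = sum(1 for i in items if i.lower() in ctx.output.lower())
        yield MetricScore(self.name, met / len(items))


def evaluate(ctx: MetricContext) -> List[MetricScore]:
    """Run every registered metric that declares itself applicable."""
    return [s for m in METRICS.values() if m.applies(ctx) for s in m.score(ctx)]
```

In this sketch the pipeline only iterates `METRICS`, so adding a metric means registering a new class rather than editing `evaluate`.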
**Bugs fixed**
- Checklist parsing uses `lstrip("- ")` instead of `strip("- ")`. `strip("- ")` removes hyphens and spaces from both ends of the line, so trailing hyphens in requirement text were being truncated.
- Document-retrieval scoring now guards against empty `doc_name`/`url` fields. An empty field previously matched all text, which inflated the retrieval rate.
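Both bugs can be reproduced in a few lines. The sketch below is illustrative only: the function names, the dict-shaped documents, and the use of substring matching are assumptions, since the actual scoring code is not shown in this description.

```python
def parse_item_old(line: str) -> str:
    # strip() treats "- " as a character set and trims both ends.
    return line.strip("- ")


def parse_item_new(line: str) -> str:
    # lstrip() trims only the leading bullet marker.
    return line.lstrip("- ")


def retrieval_rate_unguarded(docs, text: str) -> float:
    # Assumed pre-fix behavior: "" is a substring of every string, so empty fields always hit.
    hits = sum(1 for d in docs if d["doc_name"] in text or d["url"] in text)
    return hits / len(docs) if docs else 0.0


def retrieval_rate_guarded(docs, text: str) -> float:
    # Only non-empty fields take part in the match.
    hits = 0
    for d in docs:
        keys = [k for k in (d.get("doc_name"), d.get("url")) if k]
        if any(k in text for k in keys):
            hits += 1
    return hits / len(docs) if docs else 0.0


line = "- Pod uses the image tag v1.2-"
print(parse_item_old(line))  # Pod uses the image tag v1.2
print(parse_item_new(line))  # Pod uses the image tag v1.2-

docs = [
    {"doc_name": "runbook.md", "url": ""},
    {"doc_name": "", "url": ""},
]
text = "The agent consulted runbook.md before restarting the pod."
print(retrieval_rate_unguarded(docs, text))  # 1.0
print(retrieval_rate_guarded(docs, text))    # 0.5
```

Note that `lstrip("- ")` leaves trailing whitespace in place, and that this sketch keeps documents with no usable fields in the denominator. Whether the PR does the same is not stated here.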