feat(metrics): metrics suite + bundled skills #32

Draft
pradeepvrd wants to merge 1 commit into submit/6-chaos from submit/7-metrics
@pradeepvrd pradeepvrd commented Jun 20, 2026

Owner

The metrics suite used to be built inline in pkg/evaluator/evaluate.py (GEval construction, checklist parsing, scoring in evaluate_metrics_batch). This PR extracts it into devops_bench/metrics/ as registry-driven evaluators (METRICS) that operate on typed MetricContext/MetricScore objects. Judges are routed through the models layer, and judge skills ship as package data under devops_bench/skills/.
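
To illustrate what shipping judge skills as package data can look like, here is a minimal sketch using the standard library's importlib.resources. The helper name load_skill, the one-markdown-file-per-skill layout, and the .md suffix are assumptions for illustration; the PR's actual loader may differ.

```python
from importlib.resources import files


def load_skill(name: str, package: str = "devops_bench.skills") -> str:
    """Read a judge skill's markdown from package data.

    Assumption: each skill is a single `<name>.md` file inside `package`.
    Resolving through the import system (rather than a filesystem path
    relative to the working directory) keeps this working for installed
    distributions such as wheels.
    """
    resource = files(package).joinpath(f"{name}.md")
    return resource.read_text(encoding="utf-8")
```

The package argument is exposed only so the helper can be pointed at a different package in tests.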

Behavior changes

  • Each metric declares its own applies() and yields typed scores via the METRICS registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
  • Judges are instantiated through the models layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
  • Judge skill markdown moves from a hard-coded skills/ filesystem path to devops_bench/skills/ package data, so it resolves under pip install / wheels.
  • The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.
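
The registry pattern described above can be sketched roughly as follows. Only the names METRICS, MetricContext, MetricScore, and applies() come from this PR; the fields, the register() helper, the evaluate() method, run_metrics(), and both toy metrics are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Protocol


@dataclass(frozen=True)
class MetricContext:
    # Hypothetical fields; the real context likely carries much more.
    response: str
    expected: Optional[str] = None


@dataclass(frozen=True)
class MetricScore:
    name: str
    value: float


class Metric(Protocol):
    name: str

    def applies(self, ctx: MetricContext) -> bool: ...

    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]: ...


# Registry keyed by metric name; insertion order is preserved.
METRICS: Dict[str, Metric] = {}


def register(metric: Metric) -> Metric:
    """Add a metric to the registry (hypothetical helper)."""
    METRICS[metric.name] = metric
    return metric


class NonEmptyMetric:
    """Toy metric that applies to every run."""

    name = "non_empty"

    def applies(self, ctx: MetricContext) -> bool:
        return True

    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if ctx.response.strip() else 0.0)


class ExactMatchMetric:
    """Toy metric that only applies when an expected answer exists."""

    name = "exact_match"

    def applies(self, ctx: MetricContext) -> bool:
        return ctx.expected is not None

    def evaluate(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if ctx.response == ctx.expected else 0.0)


register(NonEmptyMetric())
register(ExactMatchMetric())


def run_metrics(ctx: MetricContext) -> List[MetricScore]:
    """Run every registered metric whose applies() accepts the context."""
    return [
        score
        for metric in METRICS.values()
        if metric.applies(ctx)
        for score in metric.evaluate(ctx)
    ]
```

With this shape, a downstream package adds a metric by calling register() on its own class; run_metrics() itself does not change.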

Bugs fixed

  • Checklist parsing uses lstrip("- ") instead of strip("- "), so trailing hyphens in requirement text are no longer truncated.
  • Document-retrieval scoring guards empty doc_name/url fields, which previously matched all text and inflated the rate.
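
Both fixes come down to standard Python string behavior, which the sketch below reproduces in isolation. The function names, the dict-shaped documents, and the substring-matching rule are assumptions for illustration, not the PR's actual code.

```python
from typing import Dict, List


def parse_item_old(line: str) -> str:
    # strip() removes '-' and ' ' from BOTH ends, so trailing hyphens are lost.
    return line.strip("- ")


def parse_item_new(line: str) -> str:
    # lstrip() only removes the leading bullet characters.
    return line.lstrip("- ")


def unguarded_rate(docs: List[Dict[str, str]], text: str) -> float:
    # Bug: "" in text is always True, so empty fields count as hits.
    hits = sum(
        1
        for doc in docs
        if any(doc.get(key, "") in text for key in ("doc_name", "url"))
    )
    return hits / len(docs) if docs else 0.0


def retrieval_rate(docs: List[Dict[str, str]], text: str) -> float:
    # Guarded: only non-empty fields are allowed to match.
    hits = 0
    for doc in docs:
        keys = [doc.get(key) for key in ("doc_name", "url")]
        if any(key and key in text for key in keys):
            hits += 1
    return hits / len(docs) if docs else 0.0


docs = [
    {"doc_name": "", "url": ""},
    {"doc_name": "runbook.md", "url": ""},
]
text = "see runbook.md for the rollback steps"
```

Here parse_item_old("- Pass the flag --") returns "Pass the flag", while parse_item_new keeps the trailing "--". On the sample data, unguarded_rate(docs, text) returns 1.0 and retrieval_rate(docs, text) returns 0.5. One caveat: lstrip("- ") still removes leading hyphens that belong to the requirement itself, so an item beginning with a flag such as "--force" would lose its dashes.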
