feat(metrics): metrics suite + bundled skills by pradeepvrd · Pull Request #32 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-20T21:05:44Z

The metrics suite was previously built inline in `pkg/evaluator/evaluate.py` (GEval construction, checklist parsing, and scoring in `evaluate_metrics_batch`). This PR extracts it into `devops_bench/metrics/` as registry-driven evaluators (`METRICS`) over typed `MetricContext`/`MetricScore` objects. Judges are routed through the `models` layer, and judge skills ship as package data under `devops_bench/skills/`.
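As a rough illustration of what shipping skills as package data can look like, the sketch below reads a skill file through `importlib.resources` rather than a filesystem path. The helper name `load_skill` and the `<name>.md` file layout are assumptions made for this example, not taken from the PR.

```python
from importlib.resources import files


def load_skill(name: str, package: str = "devops_bench") -> str:
    """Read skills/<name>.md from an installed package (hypothetical helper).

    Resolving through the package, not the current working directory,
    is what lets the lookup work from a wheel or a pip install.
    """
    resource = files(package) / "skills" / f"{name}.md"
    return resource.read_text(encoding="utf-8")
```

The markdown files also have to be declared as package data in the build configuration; otherwise they are left out of the wheel and the lookup fails at runtime.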

**Behavior changes**

- Each metric declares its own `applies()` and yields typed scores via the `METRICS` registry, replacing the monolithic batch function; downstream packages can add metrics without editing the pipeline.
- Judges are instantiated through the `models` layer instead of constructing provider SDK clients directly (provider-agnostic scoring).
- Judge skill markdown moves from a hard-coded `skills/` filesystem path to `devops_bench/skills/` package data, so it resolves under `pip install` / wheels.
- The tool-invocation checklist item is dropped when MCP is disabled, so non-MCP runs are not scored against an inapplicable check.
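The registry pattern described in the first bullet can be sketched roughly as follows. Only the names `METRICS`, `MetricContext`, `MetricScore`, and `applies()` come from this PR; the fields, the `register` decorator, the `evaluate` loop, and both example metrics are hypothetical. `ToolInvocation` only illustrates gating on MCP: the PR itself drops a checklist item, not a whole metric.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class MetricContext:
    """Inputs a metric sees (fields are assumptions for this sketch)."""
    transcript: str
    mcp_enabled: bool = False


@dataclass(frozen=True)
class MetricScore:
    """Typed result of one metric (fields are assumptions for this sketch)."""
    name: str
    value: float


class Metric(Protocol):
    name: str

    def applies(self, ctx: MetricContext) -> bool: ...
    def score(self, ctx: MetricContext) -> Iterator[MetricScore]: ...


METRICS: dict[str, Metric] = {}


def register(metric_cls):
    """Class decorator (hypothetical): instantiate and add to the registry."""
    METRICS[metric_cls.name] = metric_cls()
    return metric_cls


@register
class ToolInvocation:
    name = "tool_invocation"

    def applies(self, ctx: MetricContext) -> bool:
        # Skipped entirely for runs without MCP.
        return ctx.mcp_enabled

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if "tool_call" in ctx.transcript else 0.0)


@register
class NonEmptyAnswer:
    name = "non_empty_answer"

    def applies(self, ctx: MetricContext) -> bool:
        return True

    def score(self, ctx: MetricContext) -> Iterator[MetricScore]:
        yield MetricScore(self.name, 1.0 if ctx.transcript.strip() else 0.0)


def evaluate(ctx: MetricContext) -> list[MetricScore]:
    """Run every registered metric that applies to this context."""
    scores: list[MetricScore] = []
    for metric in METRICS.values():
        if metric.applies(ctx):
            scores.extend(metric.score(ctx))
    return scores
```

With this shape, a downstream package adds a metric by defining a class and registering it; the `evaluate` loop does not change.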

**Bugs fixed**

- Checklist parsing uses `lstrip("- ")` instead of `strip("- ")`, so trailing hyphens in requirement text are no longer truncated.
- Document-retrieval scoring guards empty `doc_name`/`url` fields, which previously matched all text and inflated the rate.
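Both fixes come down to standard Python string behavior, which the following sketch reproduces. The function names and the shape of the document records are assumptions for illustration; the real parsing and scoring code may be structured differently.

```python
def parse_checklist(raw: str) -> list[str]:
    """Strip the leading bullet marker only (hypothetical helper).

    str.strip("- ") removes '-' and ' ' characters from BOTH ends,
    so a requirement ending in "--" would lose its trailing hyphens.
    """
    items = []
    for line in raw.splitlines():
        item = line.lstrip("- ")
        if item:
            items.append(item)
    return items


def retrieval_rate(docs: list[dict], text: str) -> float:
    """Fraction of docs whose name or URL appears in text (hypothetical helper).

    Empty fields are skipped: `"" in text` is always True in Python,
    so an unguarded check counts every doc with a blank field as a hit.
    """
    if not docs:
        return 0.0
    hits = 0
    for doc in docs:
        keys = [k for k in (doc.get("doc_name"), doc.get("url")) if k]
        if any(k in text for k in keys):
            hits += 1
    return hits / len(docs)
```

For example, `"- use kubectl rollout undo --".strip("- ")` drops the trailing `--`, while `lstrip("- ")` keeps it.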


pradeepvrd commented Jun 20, 2026

Comment thread devops_bench/metrics/__init__.py Outdated

Comment thread devops_bench/metrics/pipeline.py Outdated

pradeepvrd force-pushed the submit/6-chaos branch from eec518a to ef062ee on June 21, 2026 01:30

pradeepvrd force-pushed the submit/7-metrics branch from fd6fd3c to 0fc8f78 on June 21, 2026 01:30

pradeepvrd force-pushed the submit/6-chaos branch from ef062ee to 8e1f4f0 on June 22, 2026 01:53

pradeepvrd force-pushed the submit/7-metrics branch from 0fc8f78 to 396ce1f on June 22, 2026 01:53

pradeepvrd force-pushed the submit/6-chaos branch from 8e1f4f0 to 281df06 on June 23, 2026 05:04

pradeepvrd force-pushed the submit/7-metrics branch from 396ce1f to 148323c on June 23, 2026 05:04

pradeepvrd force-pushed the submit/6-chaos branch from 281df06 to bff43c5 on June 23, 2026 06:09

pradeepvrd force-pushed the submit/7-metrics branch from 148323c to 6cbbf71 on June 23, 2026 06:09

pradeepvrd force-pushed the submit/6-chaos branch from bff43c5 to 550cec5 on June 23, 2026 06:37

pradeepvrd force-pushed the submit/7-metrics branch from 6cbbf71 to cb3145e on June 23, 2026 06:37

pradeepvrd force-pushed the submit/6-chaos branch from 550cec5 to 9db9100 on June 23, 2026 07:28

pradeepvrd force-pushed the submit/7-metrics branch from cb3145e to 617f5e4 on June 23, 2026 07:33

pradeepvrd force-pushed the submit/6-chaos branch from 9db9100 to ba65ba9 on June 23, 2026 08:21

pradeepvrd force-pushed the submit/7-metrics branch from 617f5e4 to e349d59 on June 23, 2026 08:22

pradeepvrd force-pushed the submit/6-chaos branch from ba65ba9 to 4be11fb on June 23, 2026 18:16

pradeepvrd force-pushed the submit/7-metrics branch from e349d59 to 5eb3685 on June 23, 2026 18:18

pradeepvrd force-pushed the submit/6-chaos branch from 4be11fb to cc4d588 on June 23, 2026 18:35

pradeepvrd force-pushed the submit/7-metrics branch from 5eb3685 to 96891b5 on June 23, 2026 18:35

pradeepvrd wants to merge 1 commit into submit/6-chaos from submit/7-metrics
