feat(metrics): LLM-as-judge metrics pipeline (Stage 2d) by pradeepvrd · Pull Request #8 · pradeepvrd/devops-bench

pradeepvrd · 2026-06-18T07:06:45Z

Extracts the LLM-as-judge scoring code from the legacy monolith into devops_bench/metrics/ (← scoring parts of pkg/evaluator/evaluate.py).

- Modules: pipeline.py, geval.py, outcome_validity.py, tool_invocation.py, grounding.py, chaos_metrics.py.
- Model-agnostic: a single ModelLayerJudge(DeepEvalBaseLLM) wraps an LLMClient from devops_bench.models (selected via JUDGE_PROVIDER/JUDGE_MODEL); no provider SDK is imported. Skills are read from repo-root skills/.
- Tests live under tests/unit/metrics/.

Stacked draft PR — part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.

… (2d)

Modules moved/refactored:
- scoring code extracted from pkg/evaluator/evaluate.py -> devops_bench/metrics/{pipeline,geval,outcome_validity,tool_invocation,grounding,chaos_metrics}.py
- pipeline.py: evaluate_metrics_batch + extract_checklist_items (the post-run scoring loop)
- geval.py: ModelLayerJudge (DeepEvalBaseLLM) + get_judge_model
- outcome_validity.py / tool_invocation.py: GEval factories + skill-criteria loaders
- grounding.py: calculate_doc_retrieval_rate + evaluate_documentation_grounding
- chaos_metrics.py: evaluate_chaos_metrics (DiagnosisAccuracy/GracefulRecovery + perf numbers)
- the orchestration main loop (data loading, agent execution, deployer/ScenarioManager, results I/O) is intentionally left behind for the 3a orchestrator.

Bugs fixed vs legacy:
- none (faithful extraction); correctness fixes follow in a dedicated fix commit.

Improvements vs legacy:
- model-agnostic judge: the legacy GeminiDeepEvalModel/AnthropicDeepEvalModel/OllamaDeepEvalModel wrappers instantiated provider SDKs directly; replaced by a single ModelLayerJudge that routes all generation through devops_bench.models (get_model/LLMClient), so no provider SDK is imported in metrics code (JUDGE_PROVIDER/JUDGE_MODEL select the backend).
- lazy package surface: metrics/__init__.py re-exports via module __getattr__ so importing the package never eagerly pulls in deepeval or any provider SDK.
- structured logging via devops_bench.core.get_logger instead of print().
- pure, unit-testable helpers (extract_checklist_items, _record_metrics) split out of the monolithic loop, with mocked-judge tests under tests/unit/metrics/.
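The lazy package surface mentioned above relies on module-level `__getattr__` (PEP 562). The following is a minimal sketch of that technique, not the contents of metrics/__init__.py: the export table, the `make_lazy_package` helper, and the `demo_metrics` name are hypothetical, and stdlib modules stand in for deepeval-backed submodules so the example runs on its own. In a real package the `__getattr__` function would simply be defined at the top level of `__init__.py`.

```python
import importlib
import types

# Hypothetical export table: public name -> (module to import, attribute).
# Stdlib modules stand in for heavy, deepeval-backed submodules.
_LAZY_EXPORTS = {
    "dumps": ("json", "dumps"),
    "Fraction": ("fractions", "Fraction"),
}


def make_lazy_package(name, exports):
    """Build a module object whose exports are resolved on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_name, target = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), target)
        setattr(pkg, attr, value)  # cache so later lookups skip this hook
        return value

    pkg.__getattr__ = __getattr__
    pkg.__all__ = sorted(exports)
    return pkg


demo = make_lazy_package("demo_metrics", _LAZY_EXPORTS)
```

Accessing `demo.dumps` triggers the import of its backing module at that point and caches the attribute on the package; unknown names still raise `AttributeError`, so `hasattr` and `from ... import` behave as usual.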

Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).

Bugs fixed vs legacy:
- grounding: duplicate constraint texts across guides inflated the denominator (total = len(metrics)) while the numerator counted the deduped map, making a perfect GroundingAccuracy of 5.0 unreachable. Constraints are now deduplicated by text before metrics are built, so applied == total is achievable.
- grounding: calculate_doc_retrieval_rate crashed on a missing/None doc_name or url (doc["doc_name"].lower()). Now reads via (doc.get(...) or "").lower(); an empty name/url is also guarded so it does not spuriously match every step ("" in s is always True).
- pipeline: extract_checklist_items used line.strip("- "), which also stripped trailing hyphens and spaces and corrupted items like "...staging-". Switched to line.lstrip("- ").strip() so only the leading bullet marker is removed.
- pipeline: the " [GEval]" suffix DeepEval appends to GEval metric names was only stripped for the dynamic checklist metrics, leaving keys like "OutcomeValidity [GEval]"/"ToolInvocation [GEval]". _record_metrics now strips the suffix uniformly so all score keys are clean.
- geval: ModelLayerJudge.generate called asyncio.run unconditionally, which raises when an event loop is already running on the calling thread. It is now loop-aware: asyncio.run when no loop is running, else the coroutine is run on a worker thread and awaited via its result.

Improvements vs legacy:
- none (correctness only); see the dedicated feat commit for skills packaging.

Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).
- relocate the judge skill markdown into the package:
  - skills/outcome-validity-checklist.md -> devops_bench/skills/outcome-validity-checklist.md
  - skills/tool-invocation-skill.md -> devops_bench/skills/tool-invocation-skill.md
- add devops_bench/skills/__init__.py (makes the data an importable package) and devops_bench/metrics/_skills.py (shared load_skill_text loader).

Bugs fixed vs legacy:
- none (packaging/refactor only).

Improvements vs legacy:
- pip-installability: the skills were loaded via a repo-relative path (_REPO_ROOT = Path(__file__).parents[2] / "skills"), which does not exist in a built wheel / pip install. They are now loaded with importlib.resources.files("devops_bench.skills") and shipped as wheel package data (pyproject: [tool.hatch.build.targets.wheel] artifacts = ["devops_bench/skills/*.md"]), so the judge resolves its criteria regardless of the working directory or install mode.
- outcome_validity/tool_invocation now delegate to metrics._skills.load_skill_text, removing the duplicated path-resolution logic; a clear FileNotFoundError is raised when a packaged skill is absent.
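A loader along the lines described above might look like the sketch below. The signature of `load_skill_text` is an assumption (the commit message only names the function), and the throwaway `demo_skills` package exists purely so the example runs without devops_bench installed.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def load_skill_text(filename: str, package: str = "devops_bench.skills") -> str:
    """Read a packaged skill file via importlib.resources (Python 3.9+).

    Assumed signature; the real helper may differ.
    """
    resource = resources.files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(
            f"packaged skill {filename!r} not found in {package!r}"
        )
    return resource.read_text(encoding="utf-8")


# Demo: build a temporary package that ships one markdown file.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_skills"  # hypothetical package name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "tool-invocation-skill.md").write_text(
        "# criteria\n", encoding="utf-8"
    )
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        skill_text = load_skill_text(
            "tool-invocation-skill.md", package="demo_skills"
        )
        try:
            load_skill_text("missing.md", package="demo_skills")
            missing_raised = False
        except FileNotFoundError:
            missing_raised = True
    finally:
        sys.path.remove(tmp)
```

Because the lookup goes through the import system rather than a path relative to `__file__`, it resolves the same way from a source checkout and from an installed wheel, provided the .md files are actually included as package data.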

Modules moved/refactored:
- see base move commit; tests only, no source change.

Bugs fixed vs legacy:
- none (test hardening).

Improvements vs legacy:
- the evaluate() mocks asserted sequential call order and fed un-suffixed metric names, which is exactly what hid the [GEval]-suffix bug. They are now call-order-agnostic: each result is derived from the metric's real name (the pipeline always calls evaluate([tc], [m]) with one metric) and reported with a " [GEval]" suffix appended, so the uniform suffix-strip is actually tested (helpers _evaluate_by_metric_name / _evaluate_by_outcome / _named_geval; built outcome/tool metrics carry real names instead of bare MagicMocks).
- skills tests mock importlib.resources (a fake package traversable) instead of reading the real .md files, so they no longer depend on on-disk skill data.
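The call-order-agnostic mock idea can be illustrated as follows. This is a sketch, not the test helpers from this PR: `make_fake_evaluate` and `record_scores` are hypothetical names, and the `metrics_data`/`name`/`score` result shape is a simplified stand-in rather than DeepEval's actual return type.

```python
from types import SimpleNamespace

GEVAL_SUFFIX = " [GEval]"


def make_fake_evaluate(scores_by_name):
    """Return a stand-in for evaluate() keyed on the metric's own name."""

    def fake_evaluate(test_cases, metrics):
        # The pipeline is described as passing exactly one metric per call.
        (metric,) = metrics
        data = SimpleNamespace(
            name=metric.name + GEVAL_SUFFIX,  # mimic the appended suffix
            score=scores_by_name[metric.name],
        )
        return SimpleNamespace(metrics_data=[data])

    return fake_evaluate


def record_scores(result, scores):
    """Store each score under its metric name with the suffix removed."""
    for data in result.metrics_data:
        scores[data.name.removesuffix(GEVAL_SUFFIX)] = data.score


fake = make_fake_evaluate({"OutcomeValidity": 0.9, "ToolInvocation": 0.4})
scores = {}
# Call order is deliberately different from the score table's order.
for name in ("ToolInvocation", "OutcomeValidity"):
    metric = SimpleNamespace(name=name)
    record_scores(fake([object()], [metric]), scores)
```

Since each result is looked up by metric name, reordering the calls cannot change the recorded scores, and any key that still carried the suffix would show up immediately in an assertion on `scores`.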

pradeepvrd · 2026-06-20T08:07:30Z

Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.

pradeepvrd force-pushed the feat/devops-bench-metrics branch from e78bf66 to 4c09a17 on June 18, 2026 07:57

pradeepvrd added 4 commits June 18, 2026 01:13

pradeepvrd force-pushed the feat/devops-bench-metrics branch from 4c09a17 to 71c5daf on June 18, 2026 08:23

pradeepvrd closed this Jun 20, 2026

pradeepvrd wants to merge 4 commits into integration/devops-bench-stage1 from feat/devops-bench-metrics
