feat(metrics): LLM-as-judge metrics pipeline (Stage 2d)#8

Closed
pradeepvrd wants to merge 4 commits into integration/devops-bench-stage1 from feat/devops-bench-metrics

@pradeepvrd

Extracts the LLM-as-judge scoring code from the legacy monolith (the scoring parts of pkg/evaluator/evaluate.py) into devops_bench/metrics/.

  • Modules: pipeline.py, geval.py, outcome_validity.py, tool_invocation.py, grounding.py, chaos_metrics.py.
  • Model-agnostic: a single ModelLayerJudge(DeepEvalBaseLLM) wraps an LLMClient from devops_bench.models (selected via JUDGE_PROVIDER/JUDGE_MODEL); no provider SDK is imported. Skills are read from repo-root skills/.
  • Tests live under tests/unit/metrics/.

Stacked draft PR — part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.

@pradeepvrd force-pushed the feat/devops-bench-metrics branch from e78bf66 to 4c09a17 on June 18, 2026 07:57
… (2d)

Modules moved/refactored:
- scoring code extracted from pkg/evaluator/evaluate.py -> devops_bench/metrics/{pipeline,geval,outcome_validity,tool_invocation,grounding,chaos_metrics}.py
- pipeline.py: evaluate_metrics_batch + extract_checklist_items (the post-run scoring loop)
- geval.py: ModelLayerJudge (DeepEvalBaseLLM) + get_judge_model
- outcome_validity.py / tool_invocation.py: GEval factories + skill-criteria loaders
- grounding.py: calculate_doc_retrieval_rate + evaluate_documentation_grounding
- chaos_metrics.py: evaluate_chaos_metrics (DiagnosisAccuracy/GracefulRecovery + perf numbers)
- the orchestration main loop (data loading, agent execution, deployer/ScenarioManager,
  results I/O) is intentionally left behind for the 3a orchestrator.

Bugs fixed vs legacy:
- none (faithful extraction); correctness fixes follow in a dedicated fix commit.

Improvements vs legacy:
- model-agnostic judge: the legacy GeminiDeepEvalModel/AnthropicDeepEvalModel/OllamaDeepEvalModel
  wrappers instantiated provider SDKs directly; replaced by a single ModelLayerJudge that routes
  all generation through devops_bench.models (get_model/LLMClient), so no provider SDK is imported
  in metrics code (JUDGE_PROVIDER/JUDGE_MODEL select the backend).
- lazy package surface: metrics/__init__.py re-exports via module __getattr__ so importing the
  package never eagerly pulls in deepeval or any provider SDK.
- structured logging via devops_bench.core.get_logger instead of print().
- pure, unit-testable helpers (extract_checklist_items, _record_metrics) split out of the
  monolithic loop, with mocked-judge tests under tests/unit/metrics/.
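
For reviewers unfamiliar with the lazy package surface, here is a minimal sketch of re-exporting through a module-level `__getattr__` (PEP 562). It is not the code in metrics/__init__.py: the `make_lazy_module` helper and the export table are hypothetical, and stdlib modules stand in for the deepeval-backed submodules. In a real package the hook would simply be a top-level `def __getattr__(name)` in `__init__.py`.

```python
import importlib
import types

# Hypothetical export table: public name -> (module path, attribute name).
# Stdlib modules stand in for heavy, deepeval-backed submodules.
_LAZY_EXPORTS = {
    "dumps": ("json", "dumps"),
    "Fraction": ("fractions", "Fraction"),
}


def make_lazy_module(name: str, exports: dict) -> types.ModuleType:
    """Build a module whose exports are imported on first attribute access."""
    module = types.ModuleType(name)

    def _lazy_getattr(attr: str):
        try:
            target_module, target_attr = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target_module), target_attr)
        # Cache the resolved object so the hook is not hit again for this name.
        setattr(module, attr, value)
        return value

    # PEP 562: a "__getattr__" entry in the module namespace is called only
    # when normal attribute lookup on the module fails.
    module.__getattr__ = _lazy_getattr
    module.__all__ = sorted(exports)
    return module


lazy = make_lazy_module("lazy_demo", _LAZY_EXPORTS)
```

Accessing `lazy.dumps` triggers the import of `json` and caches the function on the module; names that are never touched are never imported.
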
Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).

Bugs fixed vs legacy:
- grounding: duplicate constraint texts across guides inflated the denominator
  (total = len(metrics)) while the numerator counted the deduped map, making a
  perfect GroundingAccuracy of 5.0 unreachable. Constraints are now deduplicated
  by text before metrics are built, so applied == total is achievable.
- grounding: calculate_doc_retrieval_rate crashed on a missing/None doc_name or
  url (doc["doc_name"].lower()). Now reads via (doc.get(...) or "").lower(); an
  empty name/url is also guarded so it does not spuriously match every step
  ("" in s is always True).
- pipeline: extract_checklist_items used line.strip("- "), which also stripped
  trailing hyphens and spaces and corrupted items like "...staging-". Switched
  to line.lstrip("- ").strip() so only the leading bullet marker is removed.
- pipeline: the " [GEval]" suffix DeepEval appends to GEval metric names was only
  stripped for the dynamic checklist metrics, leaving keys like
  "OutcomeValidity [GEval]"/"ToolInvocation [GEval]". _record_metrics now strips
  the suffix uniformly so all score keys are clean.
- geval: ModelLayerJudge.generate called asyncio.run unconditionally, which
  raises RuntimeError when an event loop is already running on the calling
  thread. It is now loop-aware: it uses asyncio.run when no loop is running;
  otherwise it runs the coroutine on a worker thread and blocks on that
  thread's result.
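
The grounding and pipeline fixes listed above reduce to a few small, pure helpers. The sketch below illustrates the described behaviour only: the function names are hypothetical (not the ones in grounding.py or pipeline.py), and the shape of `steps` as a list of strings is an assumption.

```python
GEVAL_SUFFIX = " [GEval]"


def dedupe_constraints(constraints: list) -> list:
    """Drop repeated constraint texts, keeping first-seen order."""
    return list(dict.fromkeys(constraints))


def doc_was_retrieved(doc: dict, steps: list) -> bool:
    """True if the doc's name or URL occurs in any step (case-insensitive)."""
    name = (doc.get("doc_name") or "").lower()
    url = (doc.get("url") or "").lower()
    # Skip empty values: "" in s is always True and would match every step.
    needles = [n for n in (name, url) if n]
    return any(n in step.lower() for step in steps for n in needles)


def clean_checklist_line(line: str) -> str:
    """Remove only the leading bullet marker; keep trailing hyphens."""
    return line.lstrip("- ").strip()


def clean_metric_name(name: str) -> str:
    """Strip the ' [GEval]' suffix if present (str.removesuffix, Python 3.9+)."""
    return name.removesuffix(GEVAL_SUFFIX)
```

For example, the legacy `"- deploy to staging-".strip("- ")` yields `"deploy to staging"`, while `clean_checklist_line` keeps the trailing hyphen.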

Improvements vs legacy:
- none (correctness only); see the dedicated feat commit for skills packaging.
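
The loop-aware behaviour described for ModelLayerJudge.generate can be sketched as a standalone helper. This is an assumption-laden illustration, not the code in geval.py: `run_coroutine_sync` and `fake_generate` are hypothetical names, and the real method may differ in how it manages the worker thread.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_coroutine_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running on this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here and asyncio.run would raise.
    # Run the coroutine on a fresh loop in a worker thread and block on it.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM client call."""
    await asyncio.sleep(0)
    return prompt.upper()
```

One trade-off of this approach: when called from inside a running loop, the helper blocks that loop until the worker thread finishes, which is acceptable for a synchronous `generate` but not a substitute for a true async path.
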
Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).
- relocate the judge skill markdown into the package:
  skills/outcome-validity-checklist.md -> devops_bench/skills/outcome-validity-checklist.md
  skills/tool-invocation-skill.md      -> devops_bench/skills/tool-invocation-skill.md
- add devops_bench/skills/__init__.py (makes the data an importable package) and
  devops_bench/metrics/_skills.py (shared load_skill_text loader).

Bugs fixed vs legacy:
- none (packaging/refactor only).

Improvements vs legacy:
- pip-installability: the skills were loaded via a repo-relative path
  (_REPO_ROOT = Path(__file__).parents[2] / "skills"), which does not exist in a
  built wheel / pip install. They are now loaded with
  importlib.resources.files("devops_bench.skills") and shipped as wheel package
  data (pyproject: [tool.hatch.build.targets.wheel] artifacts =
  ["devops_bench/skills/*.md"]), so the judge resolves its criteria regardless of
  the working directory or install mode.
- outcome_validity/tool_invocation now delegate to metrics._skills.load_skill_text,
  removing the duplicated path-resolution logic; a clear FileNotFoundError is
  raised when a packaged skill is absent.
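
A minimal sketch of the packaged-resource lookup described above, using `importlib.resources.files` (Python 3.9+). The signature shown for `load_skill_text` is an assumption; only the function name, the `devops_bench.skills` package, and the FileNotFoundError behaviour come from the commit message.

```python
from importlib import resources


def load_skill_text(filename: str, package: str = "devops_bench.skills") -> str:
    """Read a skill markdown file shipped as package data.

    The `package` parameter is an assumption added so the sketch can be
    exercised against any importable package.
    """
    resource = resources.files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(
            f"Packaged skill {filename!r} not found in {package!r}"
        )
    return resource.read_text(encoding="utf-8")
```

Because the lookup goes through the import system rather than `Path(__file__)`, it resolves the same way from a source checkout, an editable install, or a built wheel.
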
Modules moved/refactored:
- see base move commit; tests only, no source change.

Bugs fixed vs legacy:
- none (test hardening).

Improvements vs legacy:
- the evaluate() mocks asserted sequential call order and fed un-suffixed metric
  names, which is exactly what hid the [GEval]-suffix bug. They are now
  call-order-agnostic: each result is derived from the metric's real name
  (the pipeline always calls evaluate([tc], [m]) with one metric) and reported
  with a " [GEval]" suffix appended, so the uniform suffix-strip is actually
  tested (helpers _evaluate_by_metric_name / _evaluate_by_outcome / _named_geval;
  built outcome/tool metrics carry real names instead of bare MagicMocks).
- skills tests mock importlib.resources (a fake package traversable) instead of
  reading the real .md files, so they no longer depend on on-disk skill data.
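
The call-order-agnostic mocking described above can be sketched with `unittest.mock`. This is a simplified illustration: `make_evaluate_by_metric_name` is a hypothetical helper (not the `_evaluate_by_metric_name` in the test suite), and the returned object is a stand-in that does not reproduce DeepEval's real result type.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

GEVAL_SUFFIX = " [GEval]"


def make_evaluate_by_metric_name(scores: dict) -> MagicMock:
    """Build a fake evaluate() keyed on the metric's name, not on call order."""

    def _fake_evaluate(test_cases, metrics):
        # Assumes one metric per call, as in evaluate([tc], [m]).
        (metric,) = metrics
        return SimpleNamespace(
            # Report the name with the suffix appended, so a missing
            # suffix-strip in the code under test shows up as a wrong key.
            name=metric.name + GEVAL_SUFFIX,
            score=scores[metric.name],
        )

    return MagicMock(side_effect=_fake_evaluate)


fake_evaluate = make_evaluate_by_metric_name(
    {"OutcomeValidity": 0.9, "ToolInvocation": 0.4}
)
```

A test can then call the pipeline with metrics in any order and assert on the cleaned keys, instead of relying on a fixed sequence of return values.
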
@pradeepvrd force-pushed the feat/devops-bench-metrics branch from 4c09a17 to 71c5daf on June 18, 2026 08:23
@pradeepvrd

Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.

@pradeepvrd closed this on Jun 20, 2026