feat(metrics): LLM-as-judge metrics pipeline (Stage 2d) #8
Closed
pradeepvrd wants to merge 4 commits into
… (2d)
Modules moved/refactored:
- scoring code extracted from pkg/evaluator/evaluate.py -> devops_bench/metrics/{pipeline,geval,outcome_validity,tool_invocation,grounding,chaos_metrics}.py
- pipeline.py: evaluate_metrics_batch + extract_checklist_items (the post-run scoring loop)
- geval.py: ModelLayerJudge (DeepEvalBaseLLM) + get_judge_model
- outcome_validity.py / tool_invocation.py: GEval factories + skill-criteria loaders
- grounding.py: calculate_doc_retrieval_rate + evaluate_documentation_grounding
- chaos_metrics.py: evaluate_chaos_metrics (DiagnosisAccuracy/GracefulRecovery + perf numbers)
- the orchestration main loop (data loading, agent execution, deployer/ScenarioManager,
results I/O) is intentionally left behind for the 3a orchestrator.
Bugs fixed vs legacy:
- none (faithful extraction); correctness fixes follow in a dedicated fix commit.
Improvements vs legacy:
- model-agnostic judge: the legacy GeminiDeepEvalModel/AnthropicDeepEvalModel/OllamaDeepEvalModel
wrappers instantiated provider SDKs directly; replaced by a single ModelLayerJudge that routes
all generation through devops_bench.models (get_model/LLMClient), so no provider SDK is imported
in metrics code (JUDGE_PROVIDER/JUDGE_MODEL select the backend).
- lazy package surface: metrics/__init__.py re-exports via module __getattr__ so importing the
package never eagerly pulls in deepeval or any provider SDK.
- structured logging via devops_bench.core.get_logger instead of print().
- pure, unit-testable helpers (extract_checklist_items, _record_metrics) split out of the
monolithic loop, with mocked-judge tests under tests/unit/metrics/.
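The lazy package surface described above can be sketched with a module-level __getattr__ (PEP 562). This is only an illustration under stated assumptions, not the code in metrics/__init__.py: make_lazy_module and the export map are hypothetical, and stdlib modules stand in for the real metrics submodules so the snippet runs on its own.

```python
import importlib
import types


def make_lazy_module(name: str, exports: dict[str, str]) -> types.ModuleType:
    """Build a module whose public names are resolved on first access.

    `exports` maps an attribute name to the module that defines it.
    (Hypothetical helper; a real package would define __getattr__
    directly in its __init__.py.)
    """
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        try:
            target = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        # The defining module is imported here, not when the package loads.
        value = getattr(importlib.import_module(target), attr)
        setattr(module, attr, value)  # cache so __getattr__ is not hit again
        return value

    # PEP 562: a __getattr__ stored in the module namespace is consulted
    # only when normal attribute lookup fails.
    module.__getattr__ = __getattr__
    return module


# Stand-in exports: stdlib modules play the role of the heavy submodules.
lazy = make_lazy_module("lazy_demo", {"sqrt": "math", "dumps": "json"})
print(lazy.sqrt(9.0))  # first access goes through __getattr__
```

After the first access the resolved object is cached in the module namespace, so later lookups take the normal attribute path.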
Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).
Bugs fixed vs legacy:
- grounding: duplicate constraint texts across guides inflated the denominator
(total = len(metrics)) while the numerator counted the deduped map, making a
perfect GroundingAccuracy of 5.0 unreachable. Constraints are now deduplicated
by text before metrics are built, so applied == total is achievable.
- grounding: calculate_doc_retrieval_rate crashed on a missing/None doc_name or
url (doc["doc_name"].lower()). Now reads via (doc.get(...) or "").lower(); an
empty name/url is also guarded so it does not spuriously match every step
("" in s is always True).
- pipeline: extract_checklist_items used line.strip("- "), which also stripped
trailing hyphens and spaces and corrupted items like "...staging-". Switched
to line.lstrip("- ").strip() so only the leading bullet marker is removed.
- pipeline: the " [GEval]" suffix DeepEval appends to GEval metric names was only
stripped for the dynamic checklist metrics, leaving keys like
"OutcomeValidity [GEval]"/"ToolInvocation [GEval]". _record_metrics now strips
the suffix uniformly so all score keys are clean.
- geval: ModelLayerJudge.generate called asyncio.run unconditionally, which
raises RuntimeError when an event loop is already running on the calling
thread. It is now loop-aware: it uses asyncio.run when no loop is running;
otherwise it runs the coroutine on a worker thread and returns that thread's
result.
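Two of the fixes above rest on small string behaviours that are easy to check in isolation. The sketch below is an assumption-laden illustration, not the project's code: extract_item and doc_matches_step are hypothetical names, and only the doc_name/url keys come from the commit message.

```python
def extract_item(line: str) -> str:
    """Remove only the leading bullet marker from a checklist line."""
    return line.lstrip("- ").strip()


def doc_matches_step(doc: dict, step: str) -> bool:
    """Return True if the doc's name or URL occurs in the step text."""
    name = (doc.get("doc_name") or "").lower()
    url = (doc.get("url") or "").lower()
    haystack = step.lower()
    # Skip empty keys: "" in haystack is always True and would match every step.
    return any(key in haystack for key in (name, url) if key)


line = "- roll back the release on staging-"
old_item = line.strip("- ")    # legacy behaviour: the trailing "-" is lost
new_item = extract_item(line)  # fixed behaviour: the trailing "-" survives

print(repr(old_item))
print(repr(new_item))
print(doc_matches_step({"doc_name": None, "url": ""}, "kubectl get pods"))
```

Both str.strip and str.lstrip treat their argument as a set of characters, so lstrip("- ") still removes every leading hyphen and space, not just one bullet marker.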
Improvements vs legacy:
- none (correctness only); see the dedicated feat commit for skills packaging.
Modules moved/refactored:
- see base move commit (devops_bench/metrics extraction from pkg/evaluator/evaluate.py).
- relocate the judge skill markdown into the package:
skills/outcome-validity-checklist.md -> devops_bench/skills/outcome-validity-checklist.md
skills/tool-invocation-skill.md -> devops_bench/skills/tool-invocation-skill.md
- add devops_bench/skills/__init__.py (makes the data an importable package) and
devops_bench/metrics/_skills.py (shared load_skill_text loader).
Bugs fixed vs legacy:
- none (packaging/refactor only).
Improvements vs legacy:
- pip-installability: the skills were loaded via a repo-relative path
(_REPO_ROOT = Path(__file__).parents[2] / "skills"), which does not exist in a
built wheel / pip install. They are now loaded with
importlib.resources.files("devops_bench.skills") and shipped as wheel package
data (pyproject: [tool.hatch.build.targets.wheel] artifacts =
["devops_bench/skills/*.md"]), so the judge resolves its criteria regardless of
the working directory or install mode.
- outcome_validity/tool_invocation now delegate to metrics._skills.load_skill_text,
removing the duplicated path-resolution logic; a clear FileNotFoundError is
raised when a packaged skill is absent.
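A loader along the lines described above might look like the sketch below. The function name load_skill_text and the package name devops_bench.skills come from the commit message; the signature, the package parameter, and the error text are assumptions made for illustration.

```python
from importlib.resources import files


def load_skill_text(name: str, package: str = "devops_bench.skills") -> str:
    """Read a packaged skill file as text.

    Resolves the file through importlib.resources, so the lookup does not
    depend on the working directory or a repo-relative path.
    (The `package` parameter is an assumption added to keep this testable.)
    """
    resource = files(package).joinpath(name)
    if not resource.is_file():
        raise FileNotFoundError(
            f"packaged skill {name!r} not found in {package!r}"
        )
    return resource.read_text(encoding="utf-8")
```

With the package installed, a call such as load_skill_text("tool-invocation-skill.md") would return the markdown text, provided the .md files are actually shipped as package data.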
Modules moved/refactored:
- see base move commit; tests only, no source change.
Bugs fixed vs legacy:
- none (test hardening).
Improvements vs legacy:
- the evaluate() mocks asserted sequential call order and fed un-suffixed metric
names, which is exactly what hid the [GEval]-suffix bug. They are now
call-order-agnostic: each result is derived from the metric's real name (the
pipeline always calls evaluate([tc], [m]) with one metric) and reported with a
" [GEval]" suffix appended, so the uniform suffix-strip is actually tested
(helpers _evaluate_by_metric_name / _evaluate_by_outcome / _named_geval; built
outcome/tool metrics carry real names instead of bare MagicMocks).
- skills tests mock importlib.resources (a fake package traversable) instead of
reading the real .md files, so they no longer depend on on-disk skill data.
Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.
Extracts the LLM-as-judge scoring code from the legacy monolith into devops_bench/metrics/ (← scoring parts of pkg/evaluator/evaluate.py).
- Modules: pipeline.py, geval.py, outcome_validity.py, tool_invocation.py, grounding.py, chaos_metrics.py.
- Judge: ModelLayerJudge(DeepEvalBaseLLM) wraps an LLMClient from devops_bench.models (JUDGE_PROVIDER/JUDGE_MODEL); no provider SDK. Skills read from repo-root skills/.
- Tests: tests/unit/metrics/.

Stacked draft PR, part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.