feat(verification): discriminated-union spec + deadline runner + typed result (Phase 2A/2B)#12
Merged
Merged
Conversation
…d result Phase 2A (refactor: PR #6 rework). Replaces the loose list/dict spec arms with a hand-maintained discriminated union (`SequenceSpec` / `ParallelSpec` + type-tagged leaves), swaps `VerificationResult.details` for typed `children` and `raw` fields, and rewrites `VerifierAgent` around a single monotonic deadline (no per-node budget arithmetic, one clock source). - `verification/spec.py`: `VerificationNode = Annotated[Union, ...]`. Bare leaves are valid whole specs; bare list/dict roots are rejected. The union is hand-maintained at Phase A — structured so a Wave 2 registry-driven `parse_node` can swap the parsing without reshaping the runner's isinstance dispatch. - `verification/base.py`: pydantic `VerificationResult` with `success`, `elapsed_time`, `reason`, optional `name`, `children`, `raw`. `BaseVerifier` gains an optional `name` for result labeling. - `verification/runner.py`: `VerifierAgent.wait_for_condition(spec, timeout_sec=120)` computes one `time.monotonic()` deadline at entry. Parallel children each see the full remaining deadline; sequences are fail-fast (later steps recorded as skipped). - `verification/verifiers/{pod_healthy,scaling_complete}.py`: leaves consume `k8s.kubectl.get_json` / `k8s.conditions.poll_until`; both null-guard `.status` on raw kubectl returns. Result `details=` -> `raw=`, `name` threaded through. Phase 2B authoring contract: - `verification/schema.py`: `json_schema()` emits the pydantic JSON Schema for editor autocomplete; `validate_spec(data)` returns the parsed leaf node and propagates `pydantic.ValidationError` on bad input. - Regression test pins a literal of the migrated `optimize-scale` verification (`type: parallel` with `pod_healthy` + `scaling_complete` children); both leaf `verify` methods are stubbed so it runs without a cluster. Test bar (`tests/unit/verification/`): spec parsing/discrimination; result shape (`details` is gone); deadline semantics (parallel sees full remaining deadline, sequence fail-fast, one monotonic clock); leaf null-status guards on pods AND deployment; bare list/dict spec rejected; lightweight-import guard asserts `import devops_bench.verification` pulls no provider SDK / deepeval / mcp. `ruff check` and `pytest tests/` are green (345 passed). Harness/task-file migration (§7c) is intentionally out of scope — Wave 4 harness PR.
…sub-second leaf floor Review nits from PR #12 (APPROVE-WITH-NITS). - _run_parallel: convert an unexpected leaf exception into a failed child result instead of letting it abort the whole group (catch around ``f.result()``). Pre-fill children with timed-out defaults so a never-completed future stays accounted for without a second materialize pass. - _run_parallel: ``ex.shutdown(wait=False, cancel_futures=True)`` so a deadline-exhausted group does not block on queued workers we no longer want. - _run_leaf: short-circuit when remaining < ``_MIN_LEAF_BUDGET_SECONDS`` (1s) to avoid issuing useless ``kubectl wait --timeout=0.001s`` at the tail of the deadline. - test_runner: add a parallel-with-exhausted-deadline test that uses the REAL ``PodHealthyVerifier`` / ``ScalingCompleteVerifier`` classes (kubectl primitives patched to raise) and asserts all children come back timed-out with ``elapsed_time == 0.0``. Add an unhandled-exception test asserting one bad leaf does not abort the parallel group. pytest: 347 passed (was 345, +2 new). ruff clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 2A + 2B of the verification redesign (PR #6 rework). Lands the
type-safe spec, typed result, and single-deadline runner; ships the JSON
Schema /
validate_specauthoring contract and a regression test that pinsthe real
optimize-scaleliteral.Phase 2A — engine
verification/spec.py: hand-maintainedAnnotated[Union, Field(discriminator=\"type\")]ofPodHealthyVerifier | ScalingCompleteVerifier | SequenceSpec | ParallelSpec.nameis metadata, recursion ischecks: list[VerificationNode]. Bare list/dict roots are rejected; a bare leaf is a valid whole spec. Structured so a Wave 2 registry-drivenparse_nodecan swap only the parsing without reshaping the runner'sisinstancedispatch.verification/base.py: pydanticVerificationResultwithsuccess,elapsed_time,reason, optionalname,children(compound),raw(leaf). The legacy loosedetailsis gone.verification/runner.py:VerifierAgent.wait_for_condition(spec, timeout_sec=120)(public signature preserved) computes onetime.monotonic()deadline at entry. Parallel children each see the full remaining deadline; sequences are fail-fast (later steps recorded as skipped). The runner is the only clock source — verifiers also usetime.monotonic.verification/verifiers/{pod_healthy,scaling_complete}.py: consumek8s.kubectl.get_json/k8s.conditions.poll_until; both null-guard.statuson the raw kubectl JSON return.details=→raw=;nameis echoed onto the result.Phase 2B — authoring contract
verification/schema.py:json_schema()(pydantic-derived JSON Schema for editor autocomplete) andvalidate_spec(data)(returns the concrete parsed node, propagatesValidationError).tests/unit/verification/test_optimize_scale_regression.py: pins a literal of the migratedoptimize-scaleverification (type: parallelwithpod_healthy+scaling_completechildren); both leafverifymethods are stubbed so the test runs without a cluster. This was the gap that hid the original parse bug.Out of scope (Wave 4 harness PR): the task-file migration of
complextasks/optimize-scale/task.yamlto native YAML, and theharness/scenario.py/harness/default.pyregistry-lookup change. The engine + authoring contract live independently here; the task-file change can be made against a single component PR.Key decisions / notes for reviewers
VerificationSpec, an already-parsedSequenceSpec/ParallelSpecnode, an existing leaf with averifymethod, or a raw mapping. This keepsscenario.pycallers free to pre-parse (mapping-registry path in §7c) without forcing them to._run_parallelhands each child the full remaining deadline (no rebudgeting). Workers are bounded by_MAX_PARALLEL_WORKERS = 8and by the deadline-boundedverifycalls, so a slowkubectl waitdoes not linger past the deadline._run_sequenceskips remaining children when the deadline elapses between iterations, and also when the previous child failed (fail-fast); skipped children land inchildrenwithsuccess=False.time.monotonicforstart_timeso the runner and leaves share one clock (wastime.timein the legacy code).verification/__init__.pykeeps the import light — no provider SDKs, nodeepeval, nomcpare pulled at package import.test_package_import.pyenforces this in a subprocess.Test plan
uv run ruff check devops_bench/ tests/unit/verification/— cleanuv run pytest tests/ -q— 345 passed (303 baseline + 42 new), 12 unrelatedDeprecationWarningsuv run python -c \"from devops_bench.verification import VerificationSpec, VerifierAgent\"— no SDK / deepeval / mcp pulledoptimize-scalespec parses + dispatches (test_optimize_scale_regression.py).status: nullis null-guarded on both pods and deployment leavestyperejected at parse timeWhat reviewers should scrutinize
wait_for_conditionaccepting four input shapes (VerificationSpec, parsed compound node, leaf withverify, raw mapping) is the right boundary, or whether we should narrow it.timeoutoverrides may only ever tighten this — the documented Phase-A extension point._skipped/_timed_outhelpers carryelapsed_time=0.0; the legacy code usedtime.time() - startfor skipped members. The new value matches the "not run" semantics; flag if you want a different convention.VerificationNodeis a Phase-A hand-maintained union. The Wave 2 metrics PR will introduce aVERIFIERSregistry and swap only the parsing — the runner'sisinstancedispatch stays intact.Blockers
None.