Skip to content

feat(verification): discriminated-union spec + deadline runner + typed result (Phase 2A/2B)#12

Merged
pradeepvrd merged 2 commits into
refactor/integrationfrom
refactor/verification
Jun 20, 2026
Merged

feat(verification): discriminated-union spec + deadline runner + typed result (Phase 2A/2B)#12
pradeepvrd merged 2 commits into
refactor/integrationfrom
refactor/verification

Conversation

@pradeepvrd

Copy link
Copy Markdown
Owner

Summary

Phase 2A + 2B of the verification redesign (PR #6 rework). Lands the
type-safe spec, typed result, and single-deadline runner; ships the JSON
Schema / validate_spec authoring contract and a regression test that pins
the real optimize-scale literal.

Phase 2A — engine

  • verification/spec.py: hand-maintained Annotated[Union, Field(discriminator=\"type\")] of PodHealthyVerifier | ScalingCompleteVerifier | SequenceSpec | ParallelSpec. name is metadata, recursion is checks: list[VerificationNode]. Bare list/dict roots are rejected; a bare leaf is a valid whole spec. Structured so a Wave 2 registry-driven parse_node can swap only the parsing without reshaping the runner's isinstance dispatch.
  • verification/base.py: pydantic VerificationResult with success, elapsed_time, reason, optional name, children (compound), raw (leaf). The legacy loose details is gone.
  • verification/runner.py: VerifierAgent.wait_for_condition(spec, timeout_sec=120) (public signature preserved) computes one time.monotonic() deadline at entry. Parallel children each see the full remaining deadline; sequences are fail-fast (later steps recorded as skipped). The runner is the only clock source — verifiers also use time.monotonic.
  • verification/verifiers/{pod_healthy,scaling_complete}.py: consume k8s.kubectl.get_json / k8s.conditions.poll_until; both null-guard .status on the raw kubectl JSON return. details=raw=; name is echoed onto the result.

Phase 2B — authoring contract

  • verification/schema.py: json_schema() (pydantic-derived JSON Schema for editor autocomplete) and validate_spec(data) (returns the concrete parsed node, propagates ValidationError).
  • tests/unit/verification/test_optimize_scale_regression.py: pins a literal of the migrated optimize-scale verification (type: parallel with pod_healthy + scaling_complete children); both leaf verify methods are stubbed so the test runs without a cluster. This was the gap that hid the original parse bug.

Out of scope (Wave 4 harness PR): the task-file migration of complextasks/optimize-scale/task.yaml to native YAML, and the harness/scenario.py / harness/default.py registry-lookup change. The engine + authoring contract live independently here; the task-file change can be made against a single component PR.

Key decisions / notes for reviewers

  • The runner accepts a VerificationSpec, an already-parsed SequenceSpec / ParallelSpec node, an existing leaf with a verify method, or a raw mapping. This keeps scenario.py callers free to pre-parse (mapping-registry path in §7c) without forcing them to.
  • _run_parallel hands each child the full remaining deadline (no rebudgeting). Workers are bounded by _MAX_PARALLEL_WORKERS = 8 and by the deadline-bounded verify calls, so a slow kubectl wait does not linger past the deadline.
  • _run_sequence skips remaining children when the deadline elapses between iterations, and also when the previous child failed (fail-fast); skipped children land in children with success=False.
  • Leaf verifiers swapped to time.monotonic for start_time so the runner and leaves share one clock (was time.time in the legacy code).
  • verification/__init__.py keeps the import light — no provider SDKs, no deepeval, no mcp are pulled at package import. test_package_import.py enforces this in a subprocess.

Test plan

  • uv run ruff check devops_bench/ tests/unit/verification/ — clean
  • uv run pytest tests/ -q — 345 passed (303 baseline + 42 new), 12 unrelated DeprecationWarnings
  • uv run python -c \"from devops_bench.verification import VerificationSpec, VerifierAgent\" — no SDK / deepeval / mcp pulled
  • Regression: literal of migrated optimize-scale spec parses + dispatches (test_optimize_scale_regression.py)
  • .status: null is null-guarded on both pods and deployment leaves
  • Bare list / bare dict-without-type rejected at parse time

What reviewers should scrutinize

  • Whether the runner's wait_for_condition accepting four input shapes (VerificationSpec, parsed compound node, leaf with verify, raw mapping) is the right boundary, or whether we should narrow it.
  • The parallel-deadline semantics: each child gets the full remaining deadline (intentionally not rebudgeted). Documented as the invariant; per-node timeout overrides may only ever tighten this — the documented Phase-A extension point.
  • The _skipped / _timed_out helpers carry elapsed_time=0.0; the legacy code used time.time() - start for skipped members. The new value matches the "not run" semantics; flag if you want a different convention.
  • VerificationNode is a Phase-A hand-maintained union. The Wave 2 metrics PR will introduce a VERIFIERS registry and swap only the parsing — the runner's isinstance dispatch stays intact.

Blockers

None.

…d result

Phase 2A (refactor: PR #6 rework). Replaces the loose list/dict spec arms with
a hand-maintained discriminated union (`SequenceSpec` / `ParallelSpec` +
type-tagged leaves), swaps `VerificationResult.details` for typed `children`
and `raw` fields, and rewrites `VerifierAgent` around a single monotonic
deadline (no per-node budget arithmetic, one clock source).

- `verification/spec.py`: `VerificationNode = Annotated[Union, ...]`. Bare
  leaves are valid whole specs; bare list/dict roots are rejected. The union
  is hand-maintained at Phase A — structured so a Wave 2 registry-driven
  `parse_node` can swap the parsing without reshaping the runner's
  isinstance dispatch.
- `verification/base.py`: pydantic `VerificationResult` with `success`,
  `elapsed_time`, `reason`, optional `name`, `children`, `raw`. `BaseVerifier`
  gains an optional `name` for result labeling.
- `verification/runner.py`: `VerifierAgent.wait_for_condition(spec,
  timeout_sec=120)` computes one `time.monotonic()` deadline at entry.
  Parallel children each see the full remaining deadline; sequences are
  fail-fast (later steps recorded as skipped).
- `verification/verifiers/{pod_healthy,scaling_complete}.py`: leaves consume
  `k8s.kubectl.get_json` / `k8s.conditions.poll_until`; both null-guard
  `.status` on raw kubectl returns. Result `details=` -> `raw=`, `name`
  threaded through.

Phase 2B authoring contract:
- `verification/schema.py`: `json_schema()` emits the pydantic JSON Schema for
  editor autocomplete; `validate_spec(data)` returns the parsed leaf node and
  propagates `pydantic.ValidationError` on bad input.
- Regression test pins a literal of the migrated `optimize-scale` verification
  (`type: parallel` with `pod_healthy` + `scaling_complete` children); both
  leaf `verify` methods are stubbed so it runs without a cluster.

Test bar (`tests/unit/verification/`): spec parsing/discrimination; result
shape (`details` is gone); deadline semantics (parallel sees full remaining
deadline, sequence fail-fast, one monotonic clock); leaf null-status guards
on pods AND deployment; bare list/dict spec rejected; lightweight-import
guard asserts `import devops_bench.verification` pulls no provider SDK /
deepeval / mcp.

`ruff check` and `pytest tests/` are green (345 passed). Harness/task-file
migration (§7c) is intentionally out of scope — Wave 4 harness PR.
…sub-second leaf floor

Review nits from PR #12 (APPROVE-WITH-NITS).

- _run_parallel: convert an unexpected leaf exception into a failed child
  result instead of letting it abort the whole group (catch around
  ``f.result()``). Pre-fill children with timed-out defaults so a never-completed
  future stays accounted for without a second materialize pass.
- _run_parallel: ``ex.shutdown(wait=False, cancel_futures=True)`` so a
  deadline-exhausted group does not block on queued workers we no longer want.
- _run_leaf: short-circuit when remaining < ``_MIN_LEAF_BUDGET_SECONDS`` (1s)
  to avoid issuing useless ``kubectl wait --timeout=0.001s`` at the tail of
  the deadline.
- test_runner: add a parallel-with-exhausted-deadline test that uses the REAL
  ``PodHealthyVerifier`` / ``ScalingCompleteVerifier`` classes (kubectl
  primitives patched to raise) and asserts all children come back timed-out
  with ``elapsed_time == 0.0``. Add an unhandled-exception test asserting
  one bad leaf does not abort the parallel group.

pytest: 347 passed (was 345, +2 new). ruff clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant