
feat(conductor): mismatchContext + RemediationApproval gate + seam webhook enable#43

Merged
ontave merged 30 commits into main from feature/post-migration-wis on May 21, 2026

Conversation

ontave commented May 21, 2026

Summary

  • mismatchContext population: pack_pod_health_loop.go now populates all 5 KBCL fields in DriftSignal.spec.mismatchContext when emitting RuntimeDrift signals -- perceivedState, realizableConstraintRef, governanceSnapshotRevision (read from PermissionSnapshot), kbclLayer=realization, selectionAttempt. Unblocks TC-MC-21.
  • RemediationApproval gate: runtime_drift_handler.go gates destructive remediation Job submission on a RemediationApproval CR when automaticRedeployment=false. Emits WaitingForRemediationApproval Event on PackInstalled when blocked; marks approval status.acted=true after consuming it. INV-007 enforced. Unblocks TC-MC-26.
  • Seam webhook enable: compile_enable.go Phase 3 now generates seam-service.yaml + seam-lineage-webhooks.yaml + webhook-certs.yaml (3 ValidatingWebhookConfigurations for lineage immutability, authorship, and domainref). Unblocks TC-MC-10, TC-MC-11.
  • Also includes: DriftSignalHandler race condition fix (skips RuntimeDrift signals to avoid race with RuntimeDriftHandler), OperatorContext watcher + autonomy gate for action dispatchers.
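For reviewers who want a concrete picture of the mismatchContext bullet above, the five fields could be modelled roughly as follows. This is only a sketch: the Go type name, the field types, the JSON tags, and the helper `newRuntimeDriftContext` are assumptions for illustration. Only the five field names and the `kbclLayer=realization` value come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MismatchContext is a hypothetical Go shape for DriftSignal.spec.mismatchContext.
// Field names follow the PR summary; the types are assumptions.
type MismatchContext struct {
	PerceivedState             string `json:"perceivedState"`
	RealizableConstraintRef    string `json:"realizableConstraintRef"`
	GovernanceSnapshotRevision string `json:"governanceSnapshotRevision"`
	KBCLLayer                  string `json:"kbclLayer"`
	SelectionAttempt           int    `json:"selectionAttempt"`
}

// newRuntimeDriftContext (hypothetical helper) fills all five fields for a
// RuntimeDrift signal. kbclLayer is fixed to "realization" as stated in the PR.
func newRuntimeDriftContext(perceived, constraintRef, snapshotRev string, attempt int) MismatchContext {
	return MismatchContext{
		PerceivedState:             perceived,
		RealizableConstraintRef:    constraintRef,
		GovernanceSnapshotRevision: snapshotRev,
		KBCLLayer:                  "realization",
		SelectionAttempt:           attempt,
	}
}

func main() {
	// All argument values here are placeholders, not values from the PR.
	mc := newRuntimeDriftContext("CrashLoopBackOff", "example-constraint", "example-revision", 1)
	out, err := json.MarshalIndent(mc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```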

Test plan

  • go test ./internal/agent/... -- all pass (mismatchContext + RemediationApproval gate unit tests)
  • Live TC-MC-10: seam authorship webhook blocks LineageRecord immutability patch -- PASS
  • Live TC-MC-11: seam authorship webhook blocks human LineageRecord creation -- PASS
  • Live TC-MC-26: RemediationApproval gate blocks Job, approval unblocks within 35s -- PASS

ontave added 30 commits May 7, 2026 09:01
talosImage was being set to the raw version string from the UpgradePolicy
(e.g., "v1.12.7") and passed directly to TalosClient.Upgrade, which then
tried to pull "docker.io/library/v1.12.7:latest". talosUpgradeHandler
correctly builds "ghcr.io/siderolabs/installer:<version>"; stack handler
now follows the same pattern.

Rename talosImage to talosVersion when reading from the UpgradePolicy,
then compute talosImage := "ghcr.io/siderolabs/installer:" + talosVersion.

Discovered during live ccs-dev stack upgrade (session/25d).
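The rename described above can be sketched as below. The function name `installerImage` and the empty-version check are assumptions added for illustration; the repository prefix is the one quoted in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// installerRepo is the installer repository named in the commit message.
const installerRepo = "ghcr.io/siderolabs/installer"

// installerImage (hypothetical helper) turns a raw version string such as
// "v1.12.7" into a full installer image reference. Rejecting an empty
// version is an assumption, not something the commit states.
func installerImage(talosVersion string) (string, error) {
	v := strings.TrimSpace(talosVersion)
	if v == "" {
		return "", errors.New("talos version is empty")
	}
	return installerRepo + ":" + v, nil
}

func main() {
	talosVersion := "v1.12.7" // as read from the UpgradePolicy
	talosImage, err := installerImage(talosVersion)
	if err != nil {
		panic(err)
	}
	fmt.Println(talosImage)
}
```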
Stage=true left both Talos and kubelet changes sitting on disk indefinitely;
nodes required manual reboots to apply them. New behaviour mirrors
talosUpgradeHandler: per node, stage the kubelet image (staged mode so it
co-applies on the Talos reboot), then trigger Talos upgrade with stage=false
(immediate reboot), then wait for recovery before moving to the next node.

Drop talosconfig-path node enumeration in favour of TalosClient.Nodes()
(same source; cleaner and already tested via the stub). Require at least one
node (validation failure otherwise).

Tests: rename TestStackUpgrade_RunsBothUpgradeSteps to two tests --
  TestStackUpgrade_NoNodesReturnsValidationFailure
  TestStackUpgrade_RollingUpgrade_AllNodes (verifies per-node loop,
  upgradeCallCount == node count, all ApplyConfiguration calls use staged mode)
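The per-node ordering described above might look like the following sketch. The `NodeUpgrader` interface, its method names and signatures, and the stub are all hypothetical stand-ins. Only `Nodes()`, the staged kubelet step, the `stage=false` upgrade, the recovery wait, and the no-nodes validation failure are taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeUpgrader is a hypothetical subset of the Talos client used by the loop.
type NodeUpgrader interface {
	Nodes() []string
	StageKubelet(node, image string) error        // staged mode: co-applies on reboot
	Upgrade(node, image string, stage bool) error // stage=false reboots immediately
	WaitForRecovery(node string) error
}

var errNoNodes = errors.New("validation failure: no nodes to upgrade")

// rollingUpgrade upgrades one node at a time and stops at the first error.
func rollingUpgrade(c NodeUpgrader, talosImage, kubeletImage string) error {
	nodes := c.Nodes()
	if len(nodes) == 0 {
		return errNoNodes
	}
	for _, n := range nodes {
		if err := c.StageKubelet(n, kubeletImage); err != nil {
			return fmt.Errorf("stage kubelet on %s: %w", n, err)
		}
		if err := c.Upgrade(n, talosImage, false); err != nil {
			return fmt.Errorf("upgrade %s: %w", n, err)
		}
		if err := c.WaitForRecovery(n); err != nil {
			return fmt.Errorf("wait for %s: %w", n, err)
		}
	}
	return nil
}

// stubClient records calls so the ordering can be inspected.
type stubClient struct {
	nodes []string
	calls []string
}

func (s *stubClient) Nodes() []string { return s.nodes }
func (s *stubClient) StageKubelet(node, image string) error {
	s.calls = append(s.calls, "stage-kubelet:"+node)
	return nil
}
func (s *stubClient) Upgrade(node, image string, stage bool) error {
	s.calls = append(s.calls, fmt.Sprintf("upgrade:%s:stage=%t", node, stage))
	return nil
}
func (s *stubClient) WaitForRecovery(node string) error {
	s.calls = append(s.calls, "wait:"+node)
	return nil
}

func main() {
	s := &stubClient{nodes: []string{"node-a", "node-b"}}
	if err := rollingUpgrade(s, "installer-image", "kubelet-image"); err != nil {
		panic(err)
	}
	for _, c := range s.calls {
		fmt.Println(c)
	}
}
```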
…stry

ghcr.io is not accessible from lab nodes. The docker.io registry mirror
(docker.io → 10.20.0.1:5000) is the only configured mirror. Using
docker.io/ image references allows Talos to resolve installer and kubelet
images through the local registry mirror during node upgrades.

Affects talos-upgrade, kube-upgrade, and stack-upgrade capabilities.

All imports of github.com/ontai-dev/conductor/pkg/runnerlib updated to
github.com/ontai-dev/conductor-sdk/runnerlib across 37 files. Internal
pkg/runnerlib deleted. go.mod updated with replace directive pointing to
../conductor-sdk and require entry. go mod tidy completed. All unit tests
pass: go build ./... and go test ./test/unit/... green before deletion.
…patcher types

Update all GVR references, scheme registrations, and import paths in
conductor to consume the migrated dispatcher types from wrapper/api/seam:
PackDelivery (was InfrastructureClusterPack), PackExecution, PackInstalled
(was InfrastructurePackInstance), PackReceipt, PackLog (was PackOperationResult).

packDeliveryRef field replaces clusterPackRef in pack_receipt_drift_loop.go
and all associated tests. compileLaunchBundle now embeds wrapper CRDs via
wrappercrd.FS so agents receive the seam.ontai.dev CRD bundle at startup.

Updates all dynamic-client GVR references from infrastructure.ontai.dev/
infrastructuretalosclusters to seam.ontai.dev/talosclusters. Updates kind
strings from InfrastructureTalosCluster to TalosCluster. Updates pack
execution GVR to seam.ontai.dev/packexecutions. All tests updated to match.

Replace seam-core -> seam and wrapper -> dispatcher in go.mod
replace/require. Update all Go import paths accordingly. Add seam-sdk
replace + require. Update conductor RunnerConfigSpec references and
compile_launch.go/test assertions for post-MIGRATION-3.8 CRD names
(lineagerecords, runnerconfigs under seam.ontai.dev).
…tories

Replace ../seam-core with ../seam and ../wrapper with ../dispatcher
following the seam-core -> seam and wrapper -> dispatcher filesystem
renames. Module paths were already updated in Phase 4.
…onductor

Update all guardian.ontai.dev API group references in conductor:
- compile_enable.go, compile_launch.go: enable bundle apiVersion strings, webhook names
- catalog.go and all 5 catalog YAML entries: apiVersion strings in rendered RBACProfiles
- capability/guardian.go, adapters.go: GVR Group fields for snapshot/profile/policy
- agent pull loops (rbacpolicy, rbacprofile, receipt, signing): GVR Group fields
- All unit, integration, and e2e test fixtures: GVR/GVK Group strings and apiVersion values
…ductor-sdk

- Dockerfile.compiler/execute/agent: seam-core/ -> seam/, wrapper/ -> dispatcher/
- Add COPY conductor-sdk/ and seam-sdk/ to all three builder stages
- cmd/conductor/main.go: fix stale "seam-core scheme" panic message to "seam"
- docs/conductor-schema.md: update InfrastructureRunnerConfig -> RunnerConfig,
  infrastructure.ontai.dev -> seam.ontai.dev throughout

Steps 6.1, 6.3, 6.4 were already complete (single binary entrypoint at
cmd/conductor/, single build target, go.mod already imports conductor-sdk).

Fresh documentation from current codebase. runner.ontai.dev claim removed
(conductor owns no API group). pkg/runnerlib replaced with conductor-sdk
reference. seam-core replaced with seam. All three image modes documented
accurately. Capability table rebuilt from conductor-sdk/runnerlib/constants.go.
…am-sdk/conductor-sdk); fix integration test GVR and CRD for RunnerConfig post-migration
…fig path

Stage upgrade with stage=true then call Reboot explicitly so nodes
reboot immediately after staging rather than waiting for an organic
restart cycle. Previously the upgrade was staged but no reboot was
forced, leaving the desired version un-applied until the next natural
reboot.

Kubeconfig path in execute mode corrected from the directory-style
/var/run/secrets/kubeconfig/value to the file-style
/var/run/secrets/kubeconfig, matching the SubPath mount applied in the
dispatcher Job template.
…el injection, e2e stubs

Adds tenant-mode PackPodHealthLoop: watches pods by pack-name label, tracks consecutive
failures per pack/reason, emits RuntimeDrift DriftSignals to management cluster on threshold.
Adds management-mode RuntimeDriftHandler: reads RemediationPolicy, increments PackLog attempts,
escalates to HumanInterventionRequired event or annotates PackInstalled for auto-redeploy.
Injects seam.ontai.dev/pack-name label into Deployment/StatefulSet/DaemonSet pod templates
before SSA apply in all three apply paths. Adds RemediationPolicy/RemediationApproval CRD stubs.
Six e2e stubs T-CW-38 through T-CW-43 added. All unit tests passing (T-CW-21 through T-CW-43).
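The "consecutive failures per pack/reason" tracking described above can be illustrated with a small counter. This is a sketch only: the type and method names are hypothetical, and both the reset-on-healthy behaviour and the emit-exactly-once-at-threshold behaviour are assumptions rather than documented behaviour of PackPodHealthLoop.

```go
package main

import "fmt"

// failureKey identifies one failure streak: a pack plus a failure reason.
type failureKey struct{ pack, reason string }

// failureTracker counts consecutive failures per pack/reason.
type failureTracker struct {
	threshold int
	counts    map[failureKey]int
}

func newFailureTracker(threshold int) *failureTracker {
	return &failureTracker{threshold: threshold, counts: make(map[failureKey]int)}
}

// observeFailure records one failed check and reports whether the streak has
// just reached the threshold, i.e. whether a drift signal should be emitted.
func (t *failureTracker) observeFailure(pack, reason string) bool {
	k := failureKey{pack, reason}
	t.counts[k]++
	return t.counts[k] == t.threshold
}

// observeHealthy clears every streak for the pack (assumed reset behaviour).
func (t *failureTracker) observeHealthy(pack string) {
	for k := range t.counts {
		if k.pack == pack {
			delete(t.counts, k)
		}
	}
}

func main() {
	tr := newFailureTracker(3)
	for i := 1; i <= 4; i++ {
		emit := tr.observeFailure("nginx", "CrashLoopBackOff")
		fmt.Printf("failure %d: emit=%t\n", i, emit)
	}
}
```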
Rename wrapper->dispatcher and seam-core->seam throughout compile_enable.go
and compile_enable_test.go. Switch all RBAC rules from infrastructure.ontai.dev
to seam.ontai.dev. Update resource names to post-migration CRD plurals:
runnerconfigs, lineagerecords, driftsignals, seammemberships, packlogs,
packdeliveries, packexecutions, packinstalleds. Replace packoperationresults
with packlogs. Fix platform-executor RBAC to use configmaps (not a CRD).
Update SeamMembership apiVersion to seam.ontai.dev/v1alpha1. Update DSNS
governance annotations to governance.seam.ontai.dev/owner. Delete stale
local runnerconfigs CRD (seam repo is authoritative). All compiler unit
tests pass.

All three import-path branches (PKI extraction, talosconfig secret
emission, TalosCluster mode) checked only importExistingCluster bool,
which is not set when cluster-input.yaml uses mode: import without
the legacy field. Extend each check to also trigger on Mode == "import"
so ccs-mgmt and future mode: import clusters generate mode=import
TalosCluster CRs and emit the talosconfig secret on bootstrap.
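The extended check amounts to a single predicate shared by all three branches. In the sketch below, the struct name `clusterInput`, its Go field names, and the method `isImport` are hypothetical. Only the `importExistingCluster` flag and the `mode: import` value come from the commit message.

```go
package main

import "fmt"

// clusterInput is a hypothetical subset of the parsed cluster-input.yaml.
type clusterInput struct {
	Mode                  string // e.g. "import"
	ImportExistingCluster bool   // legacy field
}

// isImport reports whether the import code paths should run: either the
// legacy flag is set or the newer mode field says "import".
func (c clusterInput) isImport() bool {
	return c.ImportExistingCluster || c.Mode == "import"
}

func main() {
	legacy := clusterInput{ImportExistingCluster: true}
	modeOnly := clusterInput{Mode: "import"}
	neither := clusterInput{}
	fmt.Println(legacy.isImport(), modeOnly.isImport(), neither.isImport())
}
```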
…rnals

Replace infrastructure.ontai.dev with seam.ontai.dev in all functional
runtime code: capability_publisher RunnerConfig GVR and DriftSignal apiVersion,
pack_receipt_drift_loop DriftSignal apiVersion, pack_pod_health_loop DriftSignal
apiVersion, packinstance_pull_loop apiVersion, talos/kubernetes version drift loop
DriftSignal apiVersions, receipt_reconciler annotation keys, guardian capability
annotation key. All remaining infrastructure.ontai.dev references are in
comments only and do not affect runtime behavior.
… names

conductor/internal/capability/wrapper.go: pack-deploy handler listed
PackDelivery in seam-tenant-{clusterRef} but all PackDeliveries live in
seam-system; fixes ValidationFailure "ClusterPack has no registryRef".

conductor/cmd/compiler/compile_enable.go: operator table used pre-migration
names (wrapper, seam-core); updated to dispatcher and seam with correct
lease, ServiceAccount, and webhook secret names.

wrapper.go: applyParsedManifest was missing Force=true on SSA patch; SSA
conflicts with kubectl-client-side-apply field manager caused pack-deploy
to fail on resources previously applied by kubectl. All manifest apply
calls now set Force=true.
…fig capabilities to []string

PackDeliveries live in seam-tenant-{clusterRef} alongside PackExecutions. Reverts
the unauthorized change that looked up PackDeliveries in seam-system. Removes the
dispatcher-runner-pack-reader cross-namespace Role+RoleBinding from compile_enable.go
(it was added to paper over the wrong namespace lookup). Updates compile_enable_test.go
and wrapper_runner_rbac_test.go to reflect post-migration names and correct namespace
semantics.

RunnerConfig status.capabilities is now []string (capability names only) instead of
[]CapabilityEntry. The Publish method extracts names before patching status, so the
capability_publisher no longer writes version/mode fields into the RunnerConfig CRD.
… PackLog lookup, namespace

TC-MC-5: rawCompilePackBuild was concatenating YAML files without document separators,
causing the last document in one file to corrupt the first document in the next when
parsed by guardian. Added --- separator before each file in the loop. Regression test
added (TestRawCompilePackBuild_MultiFileDocumentSeparation).
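The separator fix for TC-MC-5 can be sketched as follows. The function name `joinYAMLDocuments` is hypothetical, and ensuring a trailing newline after each file is an extra safeguard assumed here, not something the commit message states.

```go
package main

import (
	"fmt"
	"strings"
)

// joinYAMLDocuments concatenates file contents, writing a "---" document
// separator before each file so the last document of one file cannot merge
// with the first document of the next.
func joinYAMLDocuments(files []string) string {
	var b strings.Builder
	for _, content := range files {
		b.WriteString("---\n")
		b.WriteString(content)
		// Assumed safeguard: keep the next separator on its own line.
		if !strings.HasSuffix(content, "\n") {
			b.WriteString("\n")
		}
	}
	return b.String()
}

func main() {
	files := []string{"kind: ConfigMap\n", "kind: Secret"}
	fmt.Print(joinYAMLDocuments(files))
}
```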

TC-MC-6: Five fixes in the remediation pipeline:
- pack_pod_health_loop: affectedPackInstalledRef.namespace was "seam-"+clusterRef
  instead of l.mgmtTenantNS ("seam-tenant-"+clusterRef); DriftSignal pointed at
  wrong namespace, blocking management conductor from finding the PackInstalled.
- runtime_drift_handler: PackLog lookup used PackInstalled name directly (e.g.
  nginx-ccs-mgmt) but actual names are pack-deploy-result-{exec}-r{N}. Added
  resolvePackExecName() and readPackLogAttempts() helpers that resolve via
  ownerReference chain and label selector ontai.dev/pack-execution={execName}.
- groupversion_info.go: missing +groupName=conductor.ontai.dev marker caused
  controller-gen to emit _.yaml with empty group. Added marker and regenerated.
- config/crd/embed.go: was embedding nothing (go:embed *.yaml on empty dir).
  Now embeds generated RemediationPolicy and RemediationApproval CRDs.
- compile_launch.go: conductor CRD package added to the launch bundle so
  RemediationPolicy/RemediationApproval CRDs are applied at bootstrap.
…ispatchers

Implements Decision 16 B-selection constraint via OperatorContext CR polling:
- OperatorContextWatcher polls seam.ontai.dev/v1alpha1/operatorcontexts in ont-system, caches autonomyLevel and mode with RWMutex
- IsAutonomousActionsAllowed() returns false for observe-only and suggest-only levels
- RuntimeDriftHandler: gates Kueue Job submission on watcher; logs refusal with level under observe-only
- PackPodHealthLoop: gates DriftSignal emission on watcher; same pattern
- kernel/agent.go: constructs watcher, wires into both dispatchers, starts goroutine in onLeaderStart
- CNPG_SECRET_NAME changed to guardian-cnpg-app (CNPG auto-generated Secret, no manual creation needed)
- 9 OperatorContextWatcher unit tests all green

Unblocks TC-MC-22 (observe-only autonomy gate verification).
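A minimal sketch of the cached autonomy check described above, assuming a poller calls `update` with the latest values. The type name `operatorContextCache`, the `update` method, and the example level "autonomous" are hypothetical. Treating every level other than observe-only and suggest-only (including an empty, not-yet-polled level) as allowed is also an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// operatorContextCache holds the last polled OperatorContext values.
type operatorContextCache struct {
	mu            sync.RWMutex
	autonomyLevel string
	mode          string
}

// update is called by the poller after each successful read.
func (c *operatorContextCache) update(level, mode string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.autonomyLevel = level
	c.mode = mode
}

// IsAutonomousActionsAllowed returns false for the two levels named in the
// commit message. Allowing any other value is an assumption of this sketch.
func (c *operatorContextCache) IsAutonomousActionsAllowed() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	switch c.autonomyLevel {
	case "observe-only", "suggest-only":
		return false
	}
	return true
}

func main() {
	cache := &operatorContextCache{}
	for _, level := range []string{"observe-only", "suggest-only", "autonomous"} {
		cache.update(level, "example-mode")
		fmt.Printf("%s: allowed=%t\n", level, cache.IsAutonomousActionsAllowed())
	}
}
```

Dispatchers would consult `IsAutonomousActionsAllowed()` before submitting a Job or emitting a signal and log a refusal when it returns false.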
…igration in tests

Sweeps remaining references to the old infrastructure.ontai.dev API group across
tests and the capability publisher. Updates capabilities format from []string to
[]{name, version} objects to match RunnerConfig status schema.
…generates this name)

guardian-cnpg-app does not exist in seam-system; the actual auto-generated Secret
is guardian-db-app. Corrects an erroneous rename introduced in the prior session.
…ce with RuntimeDriftHandler

DriftSignalHandler was processing all pending DriftSignals regardless of signalKind,
racing with RuntimeDriftHandler and advancing RuntimeDrift signals to queued before
attempt counting could occur. Added a signalKind==RuntimeDrift guard and moved the existing
TalosCluster guard next to it for clarity. Adds TestDriftSignalHandler_RuntimeDrift_Skipped.
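The guard can be pictured as a filter over pending signals. In this sketch the `driftSignal` type and the function `signalsForDriftSignalHandler` are hypothetical, "VersionDrift" is a placeholder kind, and the TalosCluster guard is not modelled because its condition is not described here.

```go
package main

import "fmt"

// driftSignal is a hypothetical, trimmed-down view of a pending DriftSignal.
type driftSignal struct {
	Name       string
	SignalKind string
}

// signalsForDriftSignalHandler drops RuntimeDrift signals so that only the
// RuntimeDriftHandler advances them, leaving attempt counting intact.
func signalsForDriftSignalHandler(pending []driftSignal) []driftSignal {
	var out []driftSignal
	for _, s := range pending {
		if s.SignalKind == "RuntimeDrift" {
			continue // owned by RuntimeDriftHandler
		}
		out = append(out, s)
	}
	return out
}

func main() {
	pending := []driftSignal{
		{Name: "sig-a", SignalKind: "RuntimeDrift"},
		{Name: "sig-b", SignalKind: "VersionDrift"}, // placeholder kind
	}
	for _, s := range signalsForDriftSignalHandler(pending) {
		fmt.Println(s.Name)
	}
}
```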
…e + seam webhook enable

- pack_pod_health_loop: populate DriftSignal.spec.mismatchContext with all 5
  KBCL fields (perceivedState, realizableConstraintRef, governanceSnapshotRevision,
  kbclLayer=realization, selectionAttempt); read governanceSnapshotRevision from
  PermissionSnapshot snapshot-management in seam-system (unblocks TC-MC-21)

- runtime_drift_handler: gate destructive remediation Job submission on
  RemediationApproval CR presence when autoRedeployment=false; write
  WaitingForRemediationApproval Event on PackInstalled when blocked;
  mark approval acted after consuming (INV-007, unblocks TC-MC-26)

- compile_enable: generate seam-service.yaml + seam-lineage-webhooks.yaml in
  Phase 3 bundle; three ValidatingWebhookConfigurations for lineage immutability,
  authorship, and domainref enforcement (unblocks TC-MC-10, TC-MC-11)
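The approval gate in the second bullet above can be sketched as a pure decision function. The type `remediationApproval`, the function `consumeApproval`, and the rule that already-acted approvals are ignored are assumptions for illustration. The real handler reads RemediationApproval CRs and emits a WaitingForRemediationApproval Event when blocked.

```go
package main

import "fmt"

// remediationApproval is a hypothetical in-memory view of a
// RemediationApproval CR; Acted mirrors status.acted.
type remediationApproval struct {
	Name  string
	Acted bool
}

// consumeApproval decides whether a destructive remediation Job may be
// submitted. With automatic redeployment enabled no approval is needed.
// Otherwise the first unconsumed approval is marked acted and returned.
func consumeApproval(automaticRedeployment bool, approvals []*remediationApproval) (bool, *remediationApproval) {
	if automaticRedeployment {
		return true, nil
	}
	for _, a := range approvals {
		if a != nil && !a.Acted {
			a.Acted = true
			return true, a
		}
	}
	// Blocked: the caller would emit WaitingForRemediationApproval here.
	return false, nil
}

func main() {
	allowed, _ := consumeApproval(false, nil)
	fmt.Println("no approval:", allowed)

	approval := &remediationApproval{Name: "example-approval"}
	allowed, used := consumeApproval(false, []*remediationApproval{approval})
	fmt.Println("with approval:", allowed, used.Name, used.Acted)
}
```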
…n-wis

# Conflicts:
#	cmd/compiler/compile_enable.go
#	cmd/compiler/compile_enable_test.go
#	cmd/compiler/compile_launch.go
#	config/crd/embed.go
#	config/crd/seam.ontai.dev_runnerconfigs.yaml
#	internal/agent/capability_publisher.go
#	internal/agent/capability_publisher_test.go
#	internal/capability/wrapper.go
#	test/integration/signing/signing_integration_test.go
#	test/unit/agent/capability_publisher_test.go
#	test/unit/agent/talos_version_drift_loop_test.go
ontave merged commit 7928056 into main May 21, 2026
ontave deleted the feature/post-migration-wis branch May 21, 2026 19:01