feat(conductor): mismatchContext + RemediationApproval gate + seam webhook enable #43
Merged
talosImage was being set to the raw version string from the UpgradePolicy (e.g., "v1.12.7") and passed directly to TalosClient.Upgrade, which then tried to pull "docker.io/library/v1.12.7:latest". talosUpgradeHandler correctly builds "ghcr.io/siderolabs/installer:<version>"; stack handler now follows the same pattern. Rename talosImage to talosVersion when reading from the UpgradePolicy, then compute talosImage := "ghcr.io/siderolabs/installer:" + talosVersion. Discovered during live ccs-dev stack upgrade (session/25d).
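The fix in this commit amounts to deriving the installer image from the bare version tag before calling Upgrade. A minimal sketch, assuming a hypothetical helper named installerImage (the real handler may inline the concatenation, and the empty-version check is an addition for illustration only):

```go
package main

import (
	"fmt"
	"strings"
)

// installerRepo is the repository named in the commit message above.
const installerRepo = "ghcr.io/siderolabs/installer"

// installerImage is a hypothetical helper: it turns a bare version tag such as
// "v1.12.7" into a full image reference instead of passing the tag on as-is.
func installerImage(talosVersion string) (string, error) {
	v := strings.TrimSpace(talosVersion)
	if v == "" {
		// Assumption: an empty version is rejected rather than producing "repo:".
		return "", fmt.Errorf("talos version is empty")
	}
	return installerRepo + ":" + v, nil
}

func main() {
	talosImage, err := installerImage("v1.12.7")
	if err != nil {
		panic(err)
	}
	fmt.Println(talosImage)
}
```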
Stage=true left both Talos and kubelet changes sitting on disk indefinitely; nodes required manual reboots to apply them. New behaviour mirrors talosUpgradeHandler: per node, stage the kubelet image (staged mode so it co-applies on the Talos reboot), then trigger Talos upgrade with stage=false (immediate reboot), then wait for recovery before moving to the next node. Drop talosconfig-path node enumeration in favour of TalosClient.Nodes() (same source; cleaner and already tested via the stub). Require at least one node (validation failure otherwise). Tests: replace TestStackUpgrade_RunsBothUpgradeSteps with two tests, TestStackUpgrade_NoNodesReturnsValidationFailure and TestStackUpgrade_RollingUpgrade_AllNodes (the latter verifies the per-node loop, upgradeCallCount == node count, and that all ApplyConfiguration calls use staged mode).
…stry ghcr.io is not accessible from lab nodes. The docker.io registry mirror (docker.io → 10.20.0.1:5000) is the only configured mirror. Using docker.io/ image references allows Talos to resolve installer and kubelet images through the local registry mirror during node upgrades. Affects talos-upgrade, kube-upgrade, and stack-upgrade capabilities.
All imports of github.com/ontai-dev/conductor/pkg/runnerlib updated to github.com/ontai-dev/conductor-sdk/runnerlib across 37 files. Internal pkg/runnerlib deleted. go.mod updated with replace directive pointing to ../conductor-sdk and require entry. go mod tidy completed. All unit tests pass: go build ./... and go test ./test/unit/... green before deletion.
…patcher types Update all GVR references, scheme registrations, and import paths in conductor to consume the migrated dispatcher types from wrapper/api/seam: PackDelivery (was InfrastructureClusterPack), PackExecution, PackInstalled (was InfrastructurePackInstance), PackReceipt, PackLog (was PackOperationResult). packDeliveryRef field replaces clusterPackRef in pack_receipt_drift_loop.go and all associated tests. compileLaunchBundle now embeds wrapper CRDs via wrappercrd.FS so agents receive the seam.ontai.dev CRD bundle at startup.
Updates all dynamic-client GVR references from infrastructure.ontai.dev/ infrastructuretalosclusters to seam.ontai.dev/talosclusters. Updates kind strings from InfrastructureTalosCluster to TalosCluster. Updates pack execution GVR to seam.ontai.dev/packexecutions. All tests updated to match.
…day-2 operation records
Replace seam-core -> seam and wrapper -> dispatcher in go.mod replace/require. Update all Go import paths accordingly. Add seam-sdk replace + require. Update conductor RunnerConfigSpec references and compile_launch.go/test assertions for post-MIGRATION-3.8 CRD names (lineagerecords, runnerconfigs under seam.ontai.dev).
…tories Replace ../seam-core with ../seam and ../wrapper with ../dispatcher following the seam-core -> seam and wrapper -> dispatcher filesystem renames. Module paths were already updated in Phase 4.
…onductor
Update all guardian.ontai.dev API group references in conductor:
- compile_enable.go, compile_launch.go: enable bundle apiVersion strings, webhook names
- catalog.go and all 5 catalog YAML entries: apiVersion strings in rendered RBACProfiles
- capability/guardian.go, adapters.go: GVR Group fields for snapshot/profile/policy
- agent pull loops (rbacpolicy, rbacprofile, receipt, signing): GVR Group fields
- All unit, integration, and e2e test fixtures: GVR/GVK Group strings and apiVersion values
…ductor-sdk
- Dockerfile.compiler/execute/agent: seam-core/ -> seam/, wrapper/ -> dispatcher/
- Add COPY conductor-sdk/ and seam-sdk/ to all three builder stages
- cmd/conductor/main.go: fix stale "seam-core scheme" panic message to "seam"
- docs/conductor-schema.md: update InfrastructureRunnerConfig -> RunnerConfig, infrastructure.ontai.dev -> seam.ontai.dev throughout
Steps 6.1, 6.3, 6.4 were already complete (single binary entrypoint at cmd/conductor/, single build target, go.mod already imports conductor-sdk).
Fresh documentation from current codebase. runner.ontai.dev claim removed (conductor owns no API group). pkg/runnerlib replaced with conductor-sdk reference. seam-core replaced with seam. All three image modes documented accurately. Capability table rebuilt from conductor-sdk/runnerlib/constants.go.
…am-sdk/conductor-sdk); fix integration test GVR and CRD for RunnerConfig post-migration
…bership (agent mode only)
…fig path Stage upgrade with stage=true then call Reboot explicitly so nodes reboot immediately after staging rather than waiting for an organic restart cycle. Previously the upgrade was staged but no reboot was forced, leaving the desired version un-applied until the next natural reboot. Kubeconfig path in execute mode corrected from the directory-style /var/run/secrets/kubeconfig/value to the file-style /var/run/secrets/kubeconfig, matching the SubPath mount applied in the dispatcher Job template.
…el injection, e2e stubs Adds tenant-mode PackPodHealthLoop: watches pods by pack-name label, tracks consecutive failures per pack/reason, emits RuntimeDrift DriftSignals to management cluster on threshold. Adds management-mode RuntimeDriftHandler: reads RemediationPolicy, increments PackLog attempts, escalates to HumanInterventionRequired event or annotates PackInstalled for auto-redeploy. Injects seam.ontai.dev/pack-name label into Deployment/StatefulSet/DaemonSet pod templates before SSA apply in all three apply paths. Adds RemediationPolicy/RemediationApproval CRD stubs. Six e2e stubs T-CW-38 through T-CW-43 added. All unit tests passing (T-CW-21 through T-CW-43).
Rename wrapper->dispatcher and seam-core->seam throughout compile_enable.go and compile_enable_test.go. Switch all RBAC rules from infrastructure.ontai.dev to seam.ontai.dev. Update resource names to post-migration CRD plurals: runnerconfigs, lineagerecords, driftsignals, seammemberships, packlogs, packdeliveries, packexecutions, packinstalleds. Replace packoperationresults with packlogs. Fix platform-executor RBAC to use configmaps (not a CRD). Update SeamMembership apiVersion to seam.ontai.dev/v1alpha1. Update DSNS governance annotations to governance.seam.ontai.dev/owner. Delete stale local runnerconfigs CRD (seam repo is authoritative). All compiler unit tests pass.
All three import-path branches (PKI extraction, talosconfig secret emission, TalosCluster mode) checked only importExistingCluster bool, which is not set when cluster-input.yaml uses mode: import without the legacy field. Extend each check to also trigger on Mode == "import" so ccs-mgmt and future mode: import clusters generate mode=import TalosCluster CRs and emit the talosconfig secret on bootstrap.
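The extended check can be pictured as a single predicate shared by all three branches. This is a sketch only: the clusterInput struct and the isImport method are hypothetical, while the two conditions (the legacy importExistingCluster bool and Mode == "import") are taken from the commit message.

```go
package main

import "fmt"

// clusterInput is a hypothetical, trimmed-down view of cluster-input.yaml.
type clusterInput struct {
	Mode                  string // e.g. "import"
	ImportExistingCluster bool   // legacy field
}

// isImport reports whether the cluster should take the import path.
// It triggers on either the legacy bool or the newer mode field.
func (c clusterInput) isImport() bool {
	return c.ImportExistingCluster || c.Mode == "import"
}

func main() {
	legacy := clusterInput{ImportExistingCluster: true}
	modeOnly := clusterInput{Mode: "import"}
	neither := clusterInput{}
	fmt.Println(legacy.isImport(), modeOnly.isImport(), neither.isImport())
}
```

Routing PKI extraction, talosconfig secret emission, and TalosCluster mode through one predicate would keep the three branches from drifting apart again.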
…rnals Replace infrastructure.ontai.dev with seam.ontai.dev in all functional runtime code: capability_publisher RunnerConfig GVR and DriftSignal apiVersion, pack_receipt_drift_loop DriftSignal apiVersion, pack_pod_health_loop DriftSignal apiVersion, packinstance_pull_loop apiVersion, talos/kubernetes version drift loop DriftSignal apiVersions, receipt_reconciler annotation keys, guardian capability annotation key. All remaining infrastructure.ontai.dev references are in comments only and do not affect runtime behavior.
… names
conductor/internal/capability/wrapper.go: pack-deploy handler listed
PackDelivery in seam-tenant-{clusterRef} but all PackDeliveries live in
seam-system; fixes ValidationFailure "ClusterPack has no registryRef".
conductor/cmd/compiler/compile_enable.go: operator table used pre-migration
names (wrapper, seam-core); updated to dispatcher and seam with correct
lease, ServiceAccount, and webhook secret names.
wrapper.go: applyParsedManifest was missing Force=true on SSA patch; SSA conflicts with kubectl-client-side-apply field manager caused pack-deploy to fail on resources previously applied by kubectl. All manifest apply calls now set Force=true.
…fig capabilities to []string
PackDeliveries live in seam-tenant-{clusterRef} alongside PackExecutions. Reverts
the unauthorized change that looked up PackDeliveries in seam-system. Removes the
dispatcher-runner-pack-reader cross-namespace Role+RoleBinding from compile_enable.go
(it was added to paper over the wrong namespace lookup). Updates compile_enable_test.go
and wrapper_runner_rbac_test.go to reflect post-migration names and correct namespace
semantics.
RunnerConfig status.capabilities is now []string (capability names only) instead of
[]CapabilityEntry. The Publish method extracts names before patching status, so the
capability_publisher no longer writes version/mode fields into the RunnerConfig CRD.
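Extracting names before the status patch is a small projection over the capability entries. A sketch under stated assumptions: the CapabilityEntry field set and the capabilityNames helper are hypothetical (the commit only says that version/mode fields are no longer written), and the version and mode values below are placeholders.

```go
package main

import "fmt"

// CapabilityEntry is a hypothetical shape for the richer entry type.
type CapabilityEntry struct {
	Name    string
	Version string
	Mode    string
}

// capabilityNames keeps only the names, preserving order, so that the
// status patch carries a plain []string.
func capabilityNames(entries []CapabilityEntry) []string {
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name)
	}
	return names
}

func main() {
	entries := []CapabilityEntry{
		{Name: "talos-upgrade", Version: "example-version", Mode: "example-mode"},
		{Name: "kube-upgrade", Version: "example-version", Mode: "example-mode"},
	}
	fmt.Println(capabilityNames(entries))
}
```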
… PackLog lookup, namespace
TC-MC-5: rawCompilePackBuild was concatenating YAML files without document separators,
causing the last document in one file to corrupt the first document in the next when
parsed by guardian. Added --- separator before each file in the loop. Regression test
added (TestRawCompilePackBuild_MultiFileDocumentSeparation).
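The TC-MC-5 fix can be illustrated with a small concatenation helper. The function name joinYAMLFiles is hypothetical and the real rawCompilePackBuild reads files from disk; the sketch only shows writing a --- separator before each file, plus a trailing-newline guard that is an assumption of this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// joinYAMLFiles concatenates file contents, writing a "---" document
// separator before each one so that documents from adjacent files
// cannot run together.
func joinYAMLFiles(files [][]byte) []byte {
	var buf bytes.Buffer
	for _, content := range files {
		buf.WriteString("---\n")
		buf.Write(content)
		// Assumption: ensure the file ends with a newline so the next
		// separator starts on its own line.
		if len(content) > 0 && !bytes.HasSuffix(content, []byte("\n")) {
			buf.WriteByte('\n')
		}
	}
	return buf.Bytes()
}

func main() {
	out := joinYAMLFiles([][]byte{
		[]byte("kind: ConfigMap"),
		[]byte("kind: Deployment\n"),
	})
	fmt.Print(string(out))
}
```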
TC-MC-6: Four bugs fixed in the remediation pipeline, plus one launch-bundle change:
- pack_pod_health_loop: affectedPackInstalledRef.namespace was "seam-"+clusterRef
instead of l.mgmtTenantNS ("seam-tenant-"+clusterRef); DriftSignal pointed at
wrong namespace, blocking management conductor from finding the PackInstalled.
- runtime_drift_handler: PackLog lookup used PackInstalled name directly (e.g.
nginx-ccs-mgmt) but actual names are pack-deploy-result-{exec}-r{N}. Added
resolvePackExecName() and readPackLogAttempts() helpers that resolve via
ownerReference chain and label selector ontai.dev/pack-execution={execName}.
- groupversion_info.go: missing +groupName=conductor.ontai.dev marker caused
controller-gen to emit _.yaml with empty group. Added marker and regenerated.
- config/crd/embed.go: was embedding nothing (go:embed *.yaml on empty dir).
Now embeds generated RemediationPolicy and RemediationApproval CRDs.
- compile_launch.go: conductor CRD package added to the launch bundle so
RemediationPolicy/RemediationApproval CRDs are applied at bootstrap.
…ispatchers
Implements Decision 16 B-selection constraint via OperatorContext CR polling:
- OperatorContextWatcher polls seam.ontai.dev/v1alpha1/operatorcontexts in ont-system, caches autonomyLevel and mode with RWMutex
- IsAutonomousActionsAllowed() returns false for observe-only and suggest-only levels
- RuntimeDriftHandler: gates Kueue Job submission on watcher; logs refusal with level under observe-only
- PackPodHealthLoop: gates DriftSignal emission on watcher; same pattern
- kernel/agent.go: constructs watcher, wires into both dispatchers, starts goroutine in onLeaderStart
- CNPG_SECRET_NAME changed to guardian-cnpg-app (CNPG auto-generated Secret, no manual creation needed)
- 9 OperatorContextWatcher unit tests all green
Unblocks TC-MC-22 (observe-only autonomy gate verification).
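The caching side of the watcher might be structured as below. This is a sketch, not the actual OperatorContextWatcher: the type name operatorContextCache and the update method are hypothetical, the polling loop is omitted, and treating a not-yet-populated cache as "not allowed" is an assumption of this example. Only the RWMutex-guarded cache and the two refusing levels come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// operatorContextCache holds the last values read from the OperatorContext CR.
type operatorContextCache struct {
	mu            sync.RWMutex
	autonomyLevel string
	mode          string
}

// update is what a polling goroutine would call after each successful read.
func (c *operatorContextCache) update(autonomyLevel, mode string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.autonomyLevel = autonomyLevel
	c.mode = mode
}

// IsAutonomousActionsAllowed returns false for observe-only and suggest-only.
// Assumption: an empty level (nothing polled yet) is also refused.
func (c *operatorContextCache) IsAutonomousActionsAllowed() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	switch c.autonomyLevel {
	case "", "observe-only", "suggest-only":
		return false
	default:
		return true
	}
}

func main() {
	cache := &operatorContextCache{}
	fmt.Println(cache.IsAutonomousActionsAllowed())

	cache.update("observe-only", "example-mode")
	fmt.Println(cache.IsAutonomousActionsAllowed())

	// "example-autonomous-level" is a placeholder, not a level named in the source.
	cache.update("example-autonomous-level", "example-mode")
	fmt.Println(cache.IsAutonomousActionsAllowed())
}
```

Readers take only the read lock, so the two dispatchers can consult the gate on every signal without contending with each other; only the poller takes the write lock.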
…igration in tests
Sweeps remaining references to the old infrastructure.ontai.dev API group across
tests and the capability publisher. Updates capabilities format from []string to
[]{name, version} objects to match RunnerConfig status schema.
…generates this name) guardian-cnpg-app does not exist in seam-system; the actual auto-generated Secret is guardian-db-app. Corrects an erroneous rename introduced in the prior session.
…ce with RuntimeDriftHandler DriftSignalHandler was processing all pending DriftSignals regardless of signalKind, racing with RuntimeDriftHandler and advancing RuntimeDrift signals to queued before attempt counting could occur. Added signalKind==RuntimeDrift guard (and existing TalosCluster guard is now adjacent for clarity). Adds TestDriftSignalHandler_RuntimeDrift_Skipped.
…e + seam webhook enable
- pack_pod_health_loop: populate DriftSignal.spec.mismatchContext with all 5 KBCL fields (perceivedState, realizableConstraintRef, governanceSnapshotRevision, kbclLayer=realization, selectionAttempt); read governanceSnapshotRevision from PermissionSnapshot snapshot-management in seam-system (unblocks TC-MC-21)
- runtime_drift_handler: gate destructive remediation Job submission on RemediationApproval CR presence when autoRedeployment=false; write WaitingForRemediationApproval Event on PackInstalled when blocked; mark approval acted after consuming (INV-007, unblocks TC-MC-26)
- compile_enable: generate seam-service.yaml + seam-lineage-webhooks.yaml in Phase 3 bundle; three ValidatingWebhookConfigurations for lineage immutability, authorship, and domainref enforcement (unblocks TC-MC-10, TC-MC-11)
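The approval gate in runtime_drift_handler reduces to a small decision function. The sketch below is illustrative: the type and function names are hypothetical, the Kubernetes client calls are left out, and treating an already-acted approval as absent is an assumption. The event name and the "mark acted after consuming" step follow the commit message.

```go
package main

import "fmt"

// remediationApproval is a hypothetical in-memory view of the CR.
type remediationApproval struct {
	Acted bool // corresponds to marking the approval as acted
}

// gateDecision says whether to submit the remediation Job and which
// Event, if any, to write on the PackInstalled.
type gateDecision struct {
	SubmitJob bool
	Event     string
}

// gateRemediation applies the gate: with auto-redeployment enabled the Job
// goes ahead; otherwise an unconsumed approval is required and is marked
// acted once used.
func gateRemediation(autoRedeploy bool, approval *remediationApproval) gateDecision {
	if autoRedeploy {
		return gateDecision{SubmitJob: true}
	}
	// Assumption: an approval that was already acted on cannot be reused.
	if approval == nil || approval.Acted {
		return gateDecision{Event: "WaitingForRemediationApproval"}
	}
	approval.Acted = true
	return gateDecision{SubmitJob: true}
}

func main() {
	fmt.Printf("%+v\n", gateRemediation(false, nil))

	approval := &remediationApproval{}
	fmt.Printf("%+v\n", gateRemediation(false, approval))
	fmt.Printf("%+v\n", gateRemediation(false, approval))
}
```

Marking the approval as acted in the same step that releases the Job means one approval authorises at most one destructive remediation in this sketch.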
…n-wis
# Conflicts:
# cmd/compiler/compile_enable.go
# cmd/compiler/compile_enable_test.go
# cmd/compiler/compile_launch.go
# config/crd/embed.go
# config/crd/seam.ontai.dev_runnerconfigs.yaml
# internal/agent/capability_publisher.go
# internal/agent/capability_publisher_test.go
# internal/capability/wrapper.go
# test/integration/signing/signing_integration_test.go
# test/unit/agent/capability_publisher_test.go
# test/unit/agent/talos_version_drift_loop_test.go
Summary
- pack_pod_health_loop.go now populates all 5 KBCL fields in DriftSignal.spec.mismatchContext when emitting RuntimeDrift signals -- perceivedState, realizableConstraintRef, governanceSnapshotRevision (read from PermissionSnapshot), kbclLayer=realization, selectionAttempt. Unblocks TC-MC-21.
- runtime_drift_handler.go gates destructive remediation Job submission on a RemediationApproval CR when automaticRedeployment=false. Emits WaitingForRemediationApproval Event on PackInstalled when blocked; marks approval status.acted=true after consuming it. INV-007 enforced. Unblocks TC-MC-26.
- compile_enable.go Phase 3 now generates seam-service.yaml + seam-lineage-webhooks.yaml + webhook-certs.yaml (3 ValidatingWebhookConfigurations for lineage immutability, authorship, and domainref). Unblocks TC-MC-10, TC-MC-11.
Test plan
- go test ./internal/agent/... -- all pass (mismatchContext + RemediationApproval gate unit tests)