Stabilize shadow pod / wstunnel / full-mesh lifecycle
Summary
The current shadow pod logic is working and widely used, but there are several reliability issues in the wstunnel/full-mesh flow that can make the system fragile in production. This issue tracks improvements to make shadow pod creation, cleanup, endpoint generation, and full-mesh networking more stable and deterministic.
Issues
- Full mesh may render the wrong manifest
Full-mesh setup populates WireGuard-related template data, but the default renderer still uses the standard wstunnel-template.yaml. The WireGuard-specific template exists but is not currently selected by default.
Proposal: explicitly select the WireGuard/full-mesh template when Network.FullMesh is enabled, or merge both templates behind a clear mode flag.
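A minimal sketch of the explicit selection, assuming a hypothetical `NetworkConfig` struct and a WireGuard template file name (`wstunnel-wireguard-template.yaml`) that is not taken from the codebase:

```go
package main

import "fmt"

// NetworkConfig is a hypothetical stand-in for the Network config section.
type NetworkConfig struct {
	FullMesh bool
}

const (
	defaultTemplate = "wstunnel-template.yaml"
	// Assumed file name; the real WireGuard template may be named differently.
	fullMeshTemplate = "wstunnel-wireguard-template.yaml"
)

// selectTemplate picks the manifest template from the mode flag alone,
// so full-mesh data is never rendered through the standard template.
func selectTemplate(n NetworkConfig) string {
	if n.FullMesh {
		return fullMeshTemplate
	}
	return defaultTemplate
}

func main() {
	fmt.Println(selectTemplate(NetworkConfig{FullMesh: true}))
	fmt.Println(selectTemplate(NetworkConfig{}))
}
```

Keeping the decision in one function gives tests a single place to assert which template each mode renders.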
- Ingress host and client endpoint must be aligned
The generated client endpoint should match the ingress host exactly.
Correct endpoint format:
{{name}}-{{namespace}}.{{wildcardDNS}}
Not:
ws-{{name}}.{{wildcardDNS}}
Proposal: centralize endpoint generation and use the same value for ingress host, client annotation, full-mesh script, and docs/tests.
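One possible shape for the shared helper, as a sketch. The function name `tunnelHost` is hypothetical. The 63-character check reflects the DNS limit on a single label, which applies to the `{{name}}-{{namespace}}` part:

```go
package main

import "fmt"

// maxDNSLabelLen is the DNS limit for one label (the part before the first dot).
const maxDNSLabelLen = 63

// tunnelHost builds the one host name that the ingress, the client
// annotation, and the full-mesh script should all use.
func tunnelHost(name, namespace, wildcardDNS string) (string, error) {
	label := name + "-" + namespace
	if len(label) > maxDNSLabelLen {
		return "", fmt.Errorf("host label %q is %d characters, limit is %d",
			label, len(label), maxDNSLabelLen)
	}
	return label + "." + wildcardDNS, nil
}

func main() {
	host, err := tunnelHost("job1", "default", "tunnel.example.com")
	if err != nil {
		panic(err)
	}
	// Both consumers read the same value, so they cannot drift apart.
	fmt.Println("ingress host:   ", host)
	fmt.Println("client endpoint:", host)
}
```

Returning an error instead of truncating keeps the ingress host and the client endpoint identical even for long names.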
- Cleanup uses inconsistent resource names/namespaces
Failure cleanup currently uses names derived differently from creation, which can leave shadow Deployments, Services, Ingresses, or ConfigMaps behind.
Proposal: compute shadow resource identity once and reuse it for create, wait, timeout cleanup, failure cleanup, and delete.
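A rough sketch of what "compute once, reuse everywhere" could look like. The `ShadowIdentity` type, the `-shadow` suffix, and the assumption that all four objects share one name are illustrative, not taken from the current code:

```go
package main

import "fmt"

// ShadowIdentity is a hypothetical value computed once per shadow pod and
// passed to create, wait, timeout cleanup, failure cleanup, and delete.
type ShadowIdentity struct {
	Namespace string
	Name      string
}

// newShadowIdentity derives the identity. The "-shadow" suffix is an assumption.
func newShadowIdentity(podName, targetNamespace string) ShadowIdentity {
	return ShadowIdentity{Namespace: targetNamespace, Name: podName + "-shadow"}
}

// resourceKeys lists every object owned by this identity, so creation and
// cleanup iterate over exactly the same set.
func (id ShadowIdentity) resourceKeys() []string {
	kinds := []string{"Deployment", "Service", "Ingress", "ConfigMap"}
	keys := make([]string, 0, len(kinds))
	for _, kind := range kinds {
		keys = append(keys, fmt.Sprintf("%s/%s/%s", kind, id.Namespace, id.Name))
	}
	return keys
}

func main() {
	id := newShadowIdentity("job1", "default")
	for _, key := range id.resourceKeys() {
		fmt.Println("create and later clean up:", key)
	}
}
```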
- Namespace creation can diverge from the computed namespace
The code checks/creates one namespace, then later may apply manifests to a computed/truncated namespace.
Proposal: compute the final target namespace before namespace creation, validate it, and create/check exactly that namespace. Same-namespace mode should never truncate the real namespace.
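A sketch of resolving the namespace before anything is created. The `-wstunnel` suffix for the separate-namespace mode is purely an assumption for illustration. The 63-character limit and the label pattern follow the Kubernetes rules for namespace names:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Kubernetes namespace names must be valid DNS-1123 labels.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// targetNamespace returns the single namespace that is both created/checked
// and used for the manifests. Same-namespace mode never truncates.
func targetNamespace(podNamespace string, sameNamespace bool) (string, error) {
	ns := podNamespace
	if !sameNamespace {
		ns = podNamespace + "-wstunnel" // assumed derivation, illustration only
		if len(ns) > 63 {
			ns = strings.TrimRight(ns[:63], "-")
		}
	}
	if len(ns) > 63 || !dns1123Label.MatchString(ns) {
		return "", fmt.Errorf("invalid target namespace %q", ns)
	}
	return ns, nil
}

func main() {
	for _, same := range []bool{true, false} {
		ns, err := targetNamespace("team-a", same)
		fmt.Println(same, ns, err)
	}
}
```

The caller would create or check exactly the returned value and pass the same string to the manifest renderer.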
- Manifest decode errors are only logged
If one generated manifest object fails to decode, the flow continues. This can create partial infrastructure and report success even though networking is broken.
Proposal: treat decode failures as fatal and clean up already-created resources.
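One way to make decode failures fatal, sketched with hypothetical types. `object`, `fakeDecode`, and the `create`/`remove` callbacks stand in for the real decoder and Kubernetes client. The sketch decodes every document before creating anything, and rolls back if a later create fails:

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical stand-in for a decoded manifest object.
type object struct{ Kind, Name string }

var errBadDocument = errors.New("bad document")

// fakeDecode parses "Kind/Name" and fails on anything else.
func fakeDecode(doc string) (object, error) {
	for i := 1; i < len(doc)-1; i++ {
		if doc[i] == '/' {
			return object{Kind: doc[:i], Name: doc[i+1:]}, nil
		}
	}
	return object{}, errBadDocument
}

// applyAll decodes all documents first, then creates them in order.
// Any failure is returned; objects already created are removed.
func applyAll(docs []string, decode func(string) (object, error),
	create, remove func(object) error) error {
	objs := make([]object, 0, len(docs))
	for i, doc := range docs {
		o, err := decode(doc)
		if err != nil {
			return fmt.Errorf("decode manifest %d: %w", i, err)
		}
		objs = append(objs, o)
	}
	var created []object
	for _, o := range objs {
		if err := create(o); err != nil {
			for i := len(created) - 1; i >= 0; i-- {
				_ = remove(created[i]) // best-effort rollback
			}
			return fmt.Errorf("create %s/%s: %w", o.Kind, o.Name, err)
		}
		created = append(created, o)
	}
	return nil
}

func main() {
	var live []string
	create := func(o object) error { live = append(live, o.Kind+"/"+o.Name); return nil }
	remove := func(o object) error { return nil }
	err := applyAll([]string{"Service/shadow", "not-a-manifest"}, fakeDecode, create, remove)
	fmt.Println("error:", err)
	fmt.Println("objects created:", len(live))
}
```

Decoding up front means a malformed document never leaves partial infrastructure behind, and the rollback loop covers the remaining case where the API rejects an object mid-way.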
- Shadow pod wait logic only checks for PodIP
The current wait path can accept a pod that has an IP but is not ready, or that is stale, terminating, or owned by an old ReplicaSet.
Proposal: wait for the current Deployment-owned pod, Ready=True, Deployment availability, and optionally populated Service endpoints.
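The stricter condition could be expressed roughly as below. `podView` is a hypothetical, flattened view of the fields the wait loop would read from the real Pod, Deployment, and Endpoints objects. In this sketch the Service endpoint check is mandatory rather than optional:

```go
package main

import "fmt"

// podView is a hypothetical summary of one pod as seen by the wait loop.
type podView struct {
	IP          string
	Ready       bool   // Ready condition is True
	Terminating bool   // deletion timestamp is set
	ReplicaSet  string // owning ReplicaSet name
}

// podUsable rejects pods that merely have an IP.
func podUsable(p podView, currentReplicaSet string) bool {
	return p.IP != "" && p.Ready && !p.Terminating && p.ReplicaSet == currentReplicaSet
}

// shadowReady requires an available Deployment, at least one populated
// Service endpoint, and one usable pod from the current ReplicaSet.
func shadowReady(pods []podView, currentReplicaSet string, availableReplicas, serviceEndpoints int) bool {
	if availableReplicas < 1 || serviceEndpoints < 1 {
		return false
	}
	for _, p := range pods {
		if podUsable(p, currentReplicaSet) {
			return true
		}
	}
	return false
}

func main() {
	pods := []podView{
		{IP: "10.0.0.4", Ready: true, Terminating: true, ReplicaSet: "shadow-old"},
		{IP: "10.0.0.5", Ready: false, ReplicaSet: "shadow-new"},
	}
	fmt.Println("ready:", shadowReady(pods, "shadow-new", 1, 1))
	pods[1].Ready = true
	fmt.Println("ready:", shadowReady(pods, "shadow-new", 1, 1))
}
```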
- Pod annotation patch failures are swallowed
If the generated wstunnel command or full-mesh pre-exec annotation cannot be persisted to the Kubernetes pod, the remote execution path may become inconsistent.
Proposal: return patch errors to the creation flow, or mark the pod failed when required annotations cannot be persisted.
- Full-mesh key/script generation should be idempotent
Retries can regenerate keys and prepend duplicate pre-exec scripts. Private keys are also currently logged.
Proposal: persist generated key material safely, avoid duplicate pre-exec injection, and never log private keys.
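A sketch of the three ideas together, under stated assumptions. The in-memory `store` map stands in for whatever safe persistence is chosen (a Kubernetes Secret would be one option). The marker comment, the `privateKey` type, and the example commands are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// privateKey redacts itself when printed, so accidental logging does not leak it.
type privateKey string

func (privateKey) String() string   { return "[REDACTED]" } // used by %v and %s
func (privateKey) GoString() string { return "[REDACTED]" } // used by %#v

// preExecMarker tags the injected script so a retry can detect it.
const preExecMarker = "# full-mesh pre-exec (managed)"

// injectPreExec prepends the script once; repeated calls are no-ops.
func injectPreExec(existing, script string) string {
	if strings.Contains(existing, preExecMarker) {
		return existing
	}
	return preExecMarker + "\n" + script + "\n" + existing
}

// ensureKey returns the stored key for id, generating it only on first use.
func ensureKey(store map[string]privateKey, id string, generate func() privateKey) privateKey {
	if k, ok := store[id]; ok {
		return k
	}
	k := generate()
	store[id] = k
	return k
}

func main() {
	store := map[string]privateKey{}
	calls := 0
	gen := func() privateKey { calls++; return privateKey(fmt.Sprintf("key-%d", calls)) }

	first := ensureKey(store, "default/job1", gen)
	second := ensureKey(store, "default/job1", gen)
	fmt.Println("same key on retry:", first == second)
	fmt.Println("logged as:", first)

	cmd := injectPreExec("./job.sh", "wg-quick up wg0")
	fmt.Println("second injection is a no-op:", injectPreExec(cmd, "wg-quick up wg0") == cmd)
}
```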
TLS ingress improvement
Some environments require or strongly prefer ingress traffic over port 443. The default shadow ingress should support optional TLS generation.
Proposed config:
Network:
  EnableTunnel: true
  WildcardDNS: "tunnel.example.com"
  IngressTLS: true
  IngressClusterIssuer: "lets-issuer"
When enabled, the generated ingress should include:
metadata:
  annotations:
    cert-manager.io/cluster-issuer: lets-issuer
spec:
  tls:
    - hosts:
        - {{.Name}}-{{.Namespace}}.{{.WildcardDNS}}
      secretName: {{.Name}}-tls
With TLS enabled, the generated client command should connect over TLS on port 443. When TLS is disabled, it should keep the current non-TLS behavior.
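As an illustration only, the server URL passed to the wstunnel client could be derived from the same flag. The `wss://` and `ws://` schemes are the ones wstunnel accepts. The explicit ports (443 and 80) and the helper name `clientURL` are assumptions, since the current non-TLS command is not reproduced here:

```go
package main

import "fmt"

// clientURL returns the server URL for the wstunnel client.
// Ports are assumptions: 443 for TLS, 80 for the non-TLS path.
func clientURL(host string, ingressTLS bool) string {
	if ingressTLS {
		return fmt.Sprintf("wss://%s:443", host)
	}
	return fmt.Sprintf("ws://%s:80", host)
}

func main() {
	host := "job1-default.tunnel.example.com"
	fmt.Println(clientURL(host, true))
	fmt.Println(clientURL(host, false))
}
```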
Acceptance criteria
- Full-mesh mode renders the correct WireGuard-capable manifest.
- Ingress host and generated client endpoint always match.
- Cleanup uses the same computed names as creation.
- Manifest decode failures fail the creation flow and trigger cleanup.
- Wait logic checks readiness, not only PodIP.
- Annotation patch failures are surfaced.
- Full-mesh key/pre-exec generation is idempotent.
- Optional TLS ingress support is covered by tests.