Stabilize shadow pod networking lifecycle and add optional TLS ingress support #532

Description

@Bianco95

Stabilize shadow pod / wstunnel / full-mesh lifecycle

Summary

The current shadow pod logic works and is widely used, but several reliability issues in the wstunnel/full-mesh flow can make it fragile in production. This issue tracks improvements to make shadow pod creation, cleanup, endpoint generation, and full-mesh networking more stable and deterministic.

Issues

  1. Full mesh may render the wrong manifest

    Full-mesh setup populates WireGuard-related template data, but the default renderer still uses the standard wstunnel-template.yaml. The WireGuard-specific template exists but is not currently selected by default.

    Proposal: explicitly select the WireGuard/full-mesh template when Network.FullMesh is enabled, or merge both templates behind a clear mode flag.
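A minimal sketch of the explicit selection proposed in item 1, assuming a boolean `FullMesh` field on the network config. The `NetworkConfig` struct, the `selectTemplate` helper, and the WireGuard template file name are hypothetical placeholders, not names taken from the codebase.

```go
package main

// NetworkConfig is a hypothetical subset of the Network configuration.
type NetworkConfig struct {
	EnableTunnel bool
	FullMesh     bool
}

const (
	// Default template named in this issue.
	defaultTemplate = "wstunnel-template.yaml"
	// Placeholder file name for the WireGuard-specific template (assumption).
	fullMeshTemplate = "wstunnel-wireguard-template.yaml"
)

// selectTemplate returns the template to render for the given network mode,
// so full-mesh data is never rendered through the plain wstunnel template.
func selectTemplate(n NetworkConfig) string {
	if n.FullMesh {
		return fullMeshTemplate
	}
	return defaultTemplate
}
```

Keeping the choice in one function means the renderer cannot silently fall back to the default template when full mesh is enabled.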

  2. Ingress host and client endpoint must be aligned

    The generated client endpoint should match the ingress host exactly.

    Correct endpoint format:

    {{name}}-{{namespace}}.{{wildcardDNS}}
    

    Not:

    ws-{{name}}.{{wildcardDNS}}
    

    Proposal: centralize endpoint generation and use the same value for ingress host, client annotation, full-mesh script, and docs/tests.
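The centralization proposed in item 2 could look roughly like the sketch below. `shadowEndpoint` is a hypothetical helper name; the only thing taken from this issue is the `{{name}}-{{namespace}}.{{wildcardDNS}}` format.

```go
package main

import "fmt"

// shadowEndpoint builds the single host name that the ingress host, the
// client annotation, and the full-mesh script should all share.
func shadowEndpoint(name, namespace, wildcardDNS string) string {
	return fmt.Sprintf("%s-%s.%s", name, namespace, wildcardDNS)
}
```

For example, `shadowEndpoint("web", "default", "tunnel.example.com")` yields `web-default.tunnel.example.com`. Callers would format nothing themselves, which removes the chance of a stray `ws-` prefix.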

  3. Cleanup uses inconsistent resource names/namespaces

    Failure cleanup currently derives resource names and namespaces differently from the creation path, which can leave shadow Deployments, Services, Ingresses, or ConfigMaps behind.

    Proposal: compute shadow resource identity once and reuse it for create, wait, timeout cleanup, failure cleanup, and delete.
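One way to picture the proposal in item 3 is a small value type computed once and passed to every code path. The type, the helper names, and the `-shadow` suffix below are illustrative assumptions; the real naming scheme is whatever the creation path uses today.

```go
package main

// shadowIdentity holds the names used for every shadow resource of one pod.
type shadowIdentity struct {
	Namespace string
	Name      string
}

// newShadowIdentity derives the identity once.
// The "-shadow" suffix is a placeholder naming scheme (assumption).
func newShadowIdentity(podName, targetNamespace string) shadowIdentity {
	return shadowIdentity{Namespace: targetNamespace, Name: podName + "-shadow"}
}

// resourceKeys lists the objects that create, wait, timeout cleanup,
// failure cleanup, and delete must all address.
func (id shadowIdentity) resourceKeys() []string {
	kinds := []string{"Deployment", "Service", "Ingress", "ConfigMap"}
	keys := make([]string, 0, len(kinds))
	for _, k := range kinds {
		keys = append(keys, k+"/"+id.Namespace+"/"+id.Name)
	}
	return keys
}
```

Because cleanup iterates the same `resourceKeys()` as creation, it cannot target a differently derived name.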

  4. Namespace creation can diverge from the computed namespace

    The code checks for (and, if needed, creates) one namespace, but may later apply manifests to a different namespace that was computed and possibly truncated.

    Proposal: compute the final target namespace before namespace creation, validate it, and create/check exactly that namespace. Same-namespace mode should never truncate the real namespace.
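A sketch of the ordering proposed in item 4: resolve and validate the final namespace first, then create or check exactly that value. Kubernetes namespace names must be valid DNS-1123 labels of at most 63 characters; the function name, its parameters, and the truncation rule (cut to 63 characters, then trim trailing hyphens) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Kubernetes namespace names are DNS-1123 labels of at most 63 characters.
const maxNamespaceLen = 63

var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// resolveTargetNamespace returns the namespace that must be both
// created/checked and used when applying manifests.
func resolveTargetNamespace(podNamespace, computed string, sameNamespace bool) (string, error) {
	ns := computed
	if sameNamespace {
		// Same-namespace mode uses the real namespace and never truncates it.
		ns = podNamespace
	} else if len(ns) > maxNamespaceLen {
		// Hypothetical truncation rule: cut, then drop trailing hyphens.
		ns = strings.TrimRight(ns[:maxNamespaceLen], "-")
	}
	if len(ns) > maxNamespaceLen || !dns1123Label.MatchString(ns) {
		return "", fmt.Errorf("invalid target namespace %q", ns)
	}
	return ns, nil
}
```

The returned value is the only namespace later steps should see, so namespace creation and manifest application cannot diverge.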

  5. Manifest decode errors are only logged

    If one generated manifest object fails to decode, the flow continues. This can create partial infrastructure and report success even though networking is broken.

    Proposal: treat decode failures as fatal and clean up already-created resources.
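The fail-fast behaviour proposed in item 5 can be sketched without any client library by passing the decode, create, and cleanup steps in as callbacks. Everything here is hypothetical scaffolding: `applyAll` and `toyDecode` are illustrative names, and `toyDecode` merely stands in for the real manifest decoder by rejecting empty documents.

```go
package main

import "fmt"

// toyDecode stands in for the real decoder: an empty document is undecodable.
func toyDecode(doc string) (string, error) {
	if doc == "" {
		return "", fmt.Errorf("empty manifest document")
	}
	return doc, nil
}

// applyAll decodes and creates each document in order. On the first decode
// or create failure it removes what was already created (newest first) and
// returns the error instead of logging it and continuing.
func applyAll(
	docs []string,
	decode func(string) (string, error),
	create func(string) error,
	cleanup func(string),
) error {
	var created []string
	rollback := func() {
		for i := len(created) - 1; i >= 0; i-- {
			cleanup(created[i])
		}
	}
	for i, doc := range docs {
		obj, err := decode(doc)
		if err != nil {
			rollback()
			return fmt.Errorf("decode manifest %d: %w", i, err)
		}
		if err := create(obj); err != nil {
			rollback()
			return fmt.Errorf("create %s: %w", obj, err)
		}
		created = append(created, obj)
	}
	return nil
}
```

With this shape, a bad second document leaves no partial infrastructure behind and the caller sees a non-nil error.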

  6. Shadow pod wait logic only checks for PodIP

    The current wait path can accept a pod that has an IP but is not Ready, or one that is stale, terminating, or owned by an old ReplicaSet.

    Proposal: wait for the current Deployment-owned pod, Ready=True, Deployment availability, and optionally populated Service endpoints.
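The stricter wait condition from item 6 can be expressed as a single predicate. The structs below are simplified stand-ins, not the Kubernetes API types: their fields correspond to `status.podIP`, the pod's `Ready` condition, `metadata.deletionTimestamp`, and the owning ReplicaSet, and all names are assumptions for illustration.

```go
package main

// podView is a simplified view of the fields the wait loop should inspect.
type podView struct {
	IP              string // status.podIP
	Ready           bool   // Ready condition is True
	Terminating     bool   // metadata.deletionTimestamp is set
	OwnerReplicaSet string // name of the owning ReplicaSet
}

// deploymentView is a simplified view of the shadow Deployment and Service.
type deploymentView struct {
	CurrentReplicaSet string // ReplicaSet for the current revision
	AvailableReplicas int    // status.availableReplicas
	EndpointAddresses int    // ready addresses behind the Service
}

// shadowReady reports whether the wait loop may stop.
func shadowReady(p podView, d deploymentView, requireEndpoints bool) bool {
	if p.IP == "" || !p.Ready || p.Terminating {
		return false
	}
	if p.OwnerReplicaSet != d.CurrentReplicaSet {
		return false // pod belongs to an old ReplicaSet
	}
	if d.AvailableReplicas < 1 {
		return false
	}
	if requireEndpoints && d.EndpointAddresses < 1 {
		return false
	}
	return true
}
```

A pod that only has an IP fails this check, which is the case the current wait path accepts.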

  7. Pod annotation patch failures are swallowed

    If the generated wstunnel command or full-mesh pre-exec annotation cannot be persisted to the Kubernetes pod, the error is currently swallowed and the remote execution path may become inconsistent.

    Proposal: return patch errors to the creation flow, or mark the pod failed when required annotations cannot be persisted.

  8. Full-mesh key/script generation should be idempotent

    Retries can regenerate keys and prepend duplicate pre-exec scripts. Private keys are also currently logged.

    Proposal: persist generated key material safely, avoid duplicate pre-exec injection, and never log private keys.
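For the duplicate pre-exec part of item 8, one possible approach is to wrap the injected script in marker comments and skip injection when the marker is already present. The marker strings and function names are hypothetical, and the sketch assumes the pre-exec annotation holds a shell script where `#` starts a comment. Safe persistence of key material is out of scope here.

```go
package main

import "strings"

const (
	// Hypothetical markers delimiting the injected full-mesh block.
	beginMarker = "# BEGIN full-mesh pre-exec"
	endMarker   = "# END full-mesh pre-exec"
)

// injectPreExec prepends script to existing exactly once. A retry that
// calls it again with already-injected content returns it unchanged.
func injectPreExec(existing, script string) string {
	if strings.Contains(existing, beginMarker) {
		return existing
	}
	return beginMarker + "\n" + script + "\n" + endMarker + "\n" + existing
}

// redactKey returns a fixed placeholder so that log statements never
// receive private key material.
func redactKey(_ string) string {
	return "[REDACTED]"
}
```

Calling `injectPreExec` twice with the same inputs produces the same result as calling it once, which is the idempotency property the retries need.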

TLS ingress improvement

Some environments require or strongly prefer ingress traffic over port 443. The default shadow ingress should support optional TLS generation.

Proposed config:

Network:
  EnableTunnel: true
  WildcardDNS: "tunnel.example.com"
  IngressTLS: true
  IngressClusterIssuer: "lets-issuer"

When enabled, the generated ingress should include:

metadata:
  annotations:
    cert-manager.io/cluster-issuer: lets-issuer
spec:
  tls:
  - hosts:
    - {{.Name}}-{{.Namespace}}.{{.WildcardDNS}}
    secretName: {{.Name}}-tls
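As a rough illustration of how the optional TLS section could be gated, the sketch below renders an ingress fragment with Go's `text/template` and an `if` on the `IngressTLS` flag. The `ingressData` struct, the `renderIngress` helper, and the fragment itself are assumptions; the real template contains more fields than shown here.

```go
package main

import (
	"bytes"
	"text/template"
)

// ingressData is a hypothetical subset of the template data.
type ingressData struct {
	Name                 string
	Namespace            string
	WildcardDNS          string
	IngressTLS           bool
	IngressClusterIssuer string
}

// ingressFragment is an illustrative fragment, not the project's template.
const ingressFragment = `metadata:
  name: {{.Name}}
  namespace: {{.Namespace}}
{{- if .IngressTLS}}
  annotations:
    cert-manager.io/cluster-issuer: {{.IngressClusterIssuer}}
{{- end}}
spec:
{{- if .IngressTLS}}
  tls:
  - hosts:
    - {{.Name}}-{{.Namespace}}.{{.WildcardDNS}}
    secretName: {{.Name}}-tls
{{- end}}
  rules:
  - host: {{.Name}}-{{.Namespace}}.{{.WildcardDNS}}
`

// renderIngress renders the fragment; TLS parts appear only when enabled.
func renderIngress(d ingressData) (string, error) {
	tmpl, err := template.New("ingress").Parse(ingressFragment)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

With `IngressTLS` set to false the output contains neither the cert-manager annotation nor the `tls` section, so the non-TLS manifest stays as it is today.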

The generated client command should use:

wss://{{endpoint}}:443

When TLS is disabled, it should keep the current non-TLS behavior:

ws://{{endpoint}}:80
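The two client URL forms above can be produced by one small helper driven by the TLS flag. `clientURL` is a hypothetical name, and `endpoint` is assumed to be the same host value used for the ingress.

```go
package main

// clientURL returns the URL the generated wstunnel client command should
// target: wss on port 443 when ingress TLS is enabled, ws on port 80 otherwise.
func clientURL(endpoint string, tlsEnabled bool) string {
	if tlsEnabled {
		return "wss://" + endpoint + ":443"
	}
	return "ws://" + endpoint + ":80"
}
```

For example, `clientURL("web-default.tunnel.example.com", true)` returns `wss://web-default.tunnel.example.com:443`.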

Acceptance criteria

  • Full-mesh mode renders the correct WireGuard-capable manifest.
  • Ingress host and generated client endpoint always match.
  • Cleanup uses the same computed names as creation.
  • Manifest decode failures fail the creation flow and trigger cleanup.
  • Wait logic checks readiness, not only PodIP.
  • Annotation patch failures are surfaced.
  • Full-mesh key/pre-exec generation is idempotent.
  • Optional TLS ingress support is covered by tests.

Labels: bug, enhancement
