DEVOP-579: add NetworkPolicy egress rollout plan (doc only) #7

Open
srt0422 wants to merge 6 commits into allora-network:main from srt0422:devop-579-networkpolicy-rollout

Conversation


@srt0422 srt0422 commented May 13, 2026

Summary

Adds tickets/devop-579-network-policy-rollout.md — a staged plan for rolling default-deny-egress NetworkPolicies across our 13 clusters.

Why doc-first

NetworkPolicy egress hardening is a 3-engineer-week project where the bulk of effort is discovery, not deployment. default-deny-egress silently breaks every workload that has an un-enumerated outbound dependency, so rushing it is production-impacting. Capturing the plan now (Phases 0-4, rollback procedure, dependencies on DEVOP-588/589) means subsequent loop runs or human owners can pick up execution without redoing the planning.

This PR adds only the plan document. No NetworkPolicy is deployed.

Test plan

  • Document renders, internal links resolve.
  • Reviewer: confirm CNI assumption (Calico/Cilium across all 13 clusters?) and adjust Phase 0 if wrong.
  • Reviewer: confirm priority ordering of namespaces in Phase 1 matches actual blast-radius.

Related

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary by cubic

Adds a phased plan to roll out default-deny-egress and parallel default-deny-ingress Kubernetes NetworkPolicies across all 13 clusters per DEVOP-579; documentation only (tickets/devop-579-network-policy-rollout.md), no policies are deployed. The plan pins policy names for enforcement/rollback, fixes Phase 0 with correct per‑CNI flow logging and DNS capture via CoreDNS dnstap (full) or Cilium L7 DNS joined to L3/L4 flow logs, requires 48‑hour soak windows with a clean gate, adds a suspect‑egress checklist, and updates SECURITY-RUNBOOK.md with both rollback commands.

  • Dependencies
    • Hard: DEVOP-589 (Harbor proxy-cache) must land before Phase 2.
    • Soft: DEVOP-588 (Kyverno on all clusters) for Phase 4 enforcement.

Written for commit ee15fee. Summary will update on new commits.


@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tickets/devop-579-network-policy-rollout.md">

<violation number="1" location="tickets/devop-579-network-policy-rollout.md:33">
P2: The discovery checklist omits the DEVOP-579 requirement to explicitly flag suspect egress destinations (webhook/pastebin/ngrok/169.254.169.254).</violation>

<violation number="2" location="tickets/devop-579-network-policy-rollout.md:52">
P1: Phase 3 uses 24-hour soak windows, but linked Linear issue DEVOP-579 specifies 48-hour soaks for staged rollout.</violation>

<violation number="3" location="tickets/devop-579-network-policy-rollout.md:64">
P2: Phase 4 is missing the DEVOP-579 requirement to document the rollout/policies in SECURITY-RUNBOOK.md.</violation>

<violation number="4" location="tickets/devop-579-network-policy-rollout.md:74">
P2: This plan marks ingress NetworkPolicies as out of scope, but linked Linear issue DEVOP-579 requires default-deny for both egress and ingress.</violation>
</file>
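
The 48-hour soak requirement in violation 2 can be expressed as a simple gate. This is a minimal sketch, not the plan's actual tooling: the function name, its arguments, and the definition of "clean" (no unexpected policy-denied flows since the policy was applied) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Soak length specified by DEVOP-579.
SOAK_WINDOW = timedelta(hours=48)


def soak_is_clean(applied_at, now, denied_flow_times):
    """Return True once a full soak window has elapsed with no denied flows.

    applied_at        -- when the NetworkPolicy was applied (hypothetical input)
    now               -- current time
    denied_flow_times -- timestamps of unexpected policy-denied flows
                         (assumed to come from the Phase 0 flow logs)
    """
    if now - applied_at < SOAK_WINDOW:
        return False  # the window has not elapsed yet, do not advance
    # Any denied flow between apply time and now makes the soak dirty.
    return not any(applied_at <= t <= now for t in denied_flow_times)


if __name__ == "__main__":
    applied = datetime(2026, 5, 1, tzinfo=timezone.utc)  # hypothetical date
    print(soak_is_clean(applied, applied + timedelta(hours=48), []))
```

A stage would only advance when this returns True; a dirty soak restarts the window after the allowlist is fixed.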
Architecture diagram
sequenceDiagram
    participant K8s as Kubernetes Clusters (13)
    participant CNI as CNI Plugin (Calico/Cilium)
    participant Hubble as Hubble/Flow Logs
    participant Harbor as Harbor Registry
    participant Kyverno as Kyverno Policy Engine
    participant Runbook as Rollback Runbook

    Note over K8s,Runbook: Phase 0 — Pre-flight Assessment
    K8s->>CNI: Confirm NetworkPolicy support
    alt CNI supports NetworkPolicy
        CNI-->>K8s: Calico, Cilium, or Antrea confirmed
    else Flannel without --network-policy
        CNI-->>K8s: Need to migrate CNI first
    end
    K8s->>Hubble: Enable flow logs on staging cluster
    Note over Hubble: Capture 7 days baseline traffic

    Note over K8s,Runbook: Phase 1 — Discovery (per namespace)
    loop For each namespace in priority order
        K8s->>Hubble: Query egress flow logs (7 days)
        Hubble-->>K8s: Destination CIDRs, DNS, ports
        K8s->>K8s: Categorize traffic (internal/infra/vendor/registries/customer)
        K8s->>K8s: Document in network-policies/discovery/<namespace>.md
    end

    Note over K8s,Runbook: Phase 2 — Allowlist Authoring
    K8s->>K8s: Create default-deny.yaml (deny all egress except DNS)
    K8s->>K8s: Create allowlist.yaml (derived from Phase 1)
    Note over K8s: DNS to kube-dns/coredns (53/udp, 53/tcp)
    Note over K8s: NTP always allowed (123/udp)
    Note over K8s: Cluster-internal pod-to-pod allowed by default
    K8s->>Harbor: Dependency on DEVOP-589 (Harbor proxy-cache)
    alt DEVOP-589 landed
        Note over K8s: Allowlists reference Harbor proxy instead of direct registries
    else Not yet landed
        Note over K8s: Allowlists must allow direct ghcr.io, docker.io, etc.
    end

    Note over K8s,Runbook: Phase 3 — Staged Rollout
    K8s->>K8s: Day 1: Apply to 1 staging namespace, observe 24h
    K8s->>K8s: Day 2: Apply to all staging namespaces, observe 24h
    K8s->>K8s: Day 3: Apply to 1 production namespace (lowest risk), observe 24h
    K8s->>K8s: Days 4-5: Roll forward remaining namespaces (lowest blast-radius first)
    alt Egress broken for workload
        K8s->>Runbook: kubectl delete networkpolicy default-deny -n <ns>
        Runbook-->>K8s: Egress restored immediately
    end

    Note over K8s,Runbook: Phase 4 — Steady State
    alt DEVOP-588 landed (Kyverno on all clusters)
        Kyverno->>K8s: Auto-flag new namespaces without default-deny
        K8s->>K8s: Monthly review of discovery documents
    else Not yet landed
        Note over K8s: Manual enforcement only
    end

Comment thread tickets/devop-579-network-policy-rollout.md Outdated
Comment thread tickets/devop-579-network-policy-rollout.md
Comment thread tickets/devop-579-network-policy-rollout.md Outdated
Comment thread tickets/devop-579-network-policy-rollout.md Outdated
@srt0422 srt0422 added the shai-hulud label (Shai-Hulud supply-chain defense work) May 13, 2026
srt0422 and others added 2 commits May 21, 2026 17:07
NetworkPolicy egress hardening is a 3-engineer-week project that
must NOT be rushed — `default-deny-egress` silently breaks every
workload that has an un-enumerated outbound dependency. The bulk of
the work is discovery (7 days of baseline flow logs per namespace),
not deployment.

This doc captures the staged rollout plan so subsequent loop runs
(or whoever picks up execution) don't redo the planning work. Covers:

- Phase 0: pre-flight (CNI compat, flow log enablement).
- Phase 1: discovery (per-namespace egress enumeration).
- Phase 2: allowlist authoring.
- Phase 3: staged rollout (1 staging → 1 prod → fan out).
- Phase 4: steady-state (Kyverno schema enforcement, monthly review).

Dependencies:
- DEVOP-589 (Harbor proxy-cache) must land before Phase 2 or the
  allowlists will churn.
- DEVOP-588 (Kyverno on all clusters) is a soft dep for Phase 4.

This PR adds the doc only. No NetworkPolicy is deployed.

Linear: https://linear.app/alloralabs/issue/DEVOP-579

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ook hook, ingress in scope

Four findings from cubic addressed:

1. tickets/devop-579-network-policy-rollout.md:33 (P2) — Phase 1
   discovery checklist now explicitly enumerates suspect egress
   destinations to flag for incident review (webhook receivers,
   pastebins, ngrok/tunnel services, 169.254.169.254 / cloud
   metadata, residential dynamic-DNS). Each flagged destination
   gets an owner-review gate before allowlist inclusion.

2. tickets/devop-579-network-policy-rollout.md:52 (P1) — Phase 3
   staged rollout soak windows changed from 24h to the 48h spec'd
   by DEVOP-579, and now require a clean soak before advancing.

3. tickets/devop-579-network-policy-rollout.md:64 (P2) — Phase 4
   steady-state now mandates documenting the rollout, allowlist
   layout, rollback command, and on-call escalation path in
   SECURITY-RUNBOOK.md (DEVOP-571).

4. tickets/devop-579-network-policy-rollout.md:74 (P2) — Ingress
   default-deny is no longer out-of-scope. Added a dedicated
   section laying out the parallel ingress cohort (same Phases 0–4
   shape with ingress-specific discovery, allowlist patterns,
   slower production rollout because ingress blast-radius is
   higher, and Kyverno asserting both directions in Phase 4).
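
The suspect-destination check from finding 1 could look roughly like the sketch below. The suffix lists are illustrative examples only, not the lists in the plan document, and `classify_destination` is a hypothetical helper name.

```python
import ipaddress

# Illustrative suffixes only; the authoritative lists belong in the plan doc.
SUSPECT_SUFFIXES = {
    "tunnel": ("ngrok.io", "ngrok-free.app"),
    "pastebin": ("pastebin.com",),
    "webhook-receiver": ("webhook.site",),
    "dynamic-dns": ("duckdns.org",),
}
# Link-local cloud metadata endpoint named in DEVOP-579.
METADATA_IP = ipaddress.ip_address("169.254.169.254")


def classify_destination(dest):
    """Return a suspect category for an FQDN or IP string, or None."""
    try:
        ip = ipaddress.ip_address(dest)
    except ValueError:
        ip = None  # not an IP literal, treat as a hostname
    if ip is not None:
        return "cloud-metadata" if ip == METADATA_IP else None
    host = dest.lower().rstrip(".")  # tolerate trailing-dot FQDNs
    for category, suffixes in SUSPECT_SUFFIXES.items():
        # Match the domain itself or any subdomain, never a bare substring.
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return category
    return None


if __name__ == "__main__":
    for d in ("169.254.169.254", "abc123.ngrok.io", "ghcr.io"):
        print(d, "->", classify_destination(d))
```

Anything that returns a category would go to the owner-review gate rather than straight into an allowlist.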

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@srt0422 srt0422 force-pushed the devop-579-networkpolicy-rollout branch from f377778 to 7b93a18 Compare May 22, 2026 00:07
Comment thread tickets/devop-579-network-policy-rollout.md
…d DNS-log enablement + join step

@gh-allora flagged that Hubble/Calico egress flow logs are L3/L4 only,
so the Phase 1 line "enumerate destination CIDRs, DNS names, and ports"
can't be satisfied from flow logs alone. Confirmed: Hubble flow records
and Calico flow logs surface src/dst IP, port, and protocol — DNS names
require either a CoreDNS query log feed or Cilium's L7 DNS visibility
(which routes pod DNS through the proxy and records resolved FQDNs).

Fix is structural, not cosmetic:

- Phase 0 now has an explicit "enable verbose DNS query logging" step
  alongside flow log enablement, with concrete options for CoreDNS
  (`log` plugin) and Cilium (L7 DNS via `hubble observe --type=dns`),
  plus a retention check so the 7-day baseline is actually queryable
  before Phase 1 starts.
- Phase 1 line 33 is split into two checklist items: enumerate CIDRs +
  ports from flow logs (the only fields they carry), then resolve to
  FQDNs by joining flow records against the Phase 0 DNS logs on
  (srcPodIP, dstIP) within a short window. Destinations with no DNS
  match (hard-coded IPs, 169.254.169.254, raw cloud-metadata) are
  carried through as IP-only and fall into the existing suspect-
  destination review.
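
The first Phase 1 checklist item (enumerate destinations and ports from flow logs) amounts to a group-and-count. A minimal sketch, assuming flow records have already been parsed into dicts; the field names `dst_ip`, `dst_port`, and `protocol` are hypothetical and will differ per CNI export format.

```python
from collections import Counter


def enumerate_egress(flows):
    """Count flows per (dst_ip, dst_port, protocol), most frequent first.

    Flow logs are L3/L4 only, so these three fields are all that can be
    enumerated before the DNS join.
    """
    counts = Counter((f["dst_ip"], f["dst_port"], f["protocol"]) for f in flows)
    # Sort by descending count, then by tuple for a stable, readable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = [  # made-up records for illustration
        {"dst_ip": "10.0.0.1", "dst_port": 443, "protocol": "TCP"},
        {"dst_ip": "10.0.0.1", "dst_port": 443, "protocol": "TCP"},
        {"dst_ip": "8.8.8.8", "dst_port": 53, "protocol": "UDP"},
    ]
    for key, n in enumerate_egress(sample):
        print(key, n)
```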

review-fix-loop iteration 1
reviewer(s): gh-allora (human PR thread)
file: tickets/devop-579-network-policy-rollout.md:17,33

Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file (changes from recent commits).

Comment thread tickets/devop-579-network-policy-rollout.md Outdated
srt0422 and others added 3 commits May 30, 2026 07:33
…ith real per-CNI enablement

Two problems in the Phase 0 checklist that would have wasted an
engineer's day before they figured out the doc was wrong:

1. `network-policy-engine (Calico)` and `Cilium's native NPL` are not
   real component names. Felix is Calico's per-node policy enforcer;
   Cilium ships NetworkPolicy enforcement built in (no separate "NPL";
   NPL is Antrea's NodePortLocal feature, unrelated to
   NetworkPolicy). The flannel-fallback bullet now correctly says the
   only path forward on flannel-without-policy is a CNI migration to
   Calico or Cilium, since flannel itself cannot enforce
   NetworkPolicies.

2. `calicoctl flow logs enable` is not a calicoctl subcommand. Calico
   OSS flow logs are turned on via the FelixConfiguration CR
   (`spec.flowLogsFileEnabled: true`), and the resulting files land
   under `/var/log/calico/flowlogs/` on each node. Also called out
   that OSS file-based flow logs cover allow/deny only — for richer
   flow context the team needs Calico Enterprise / Calico Cloud, and
   the recommendation is to prefer the Cilium staging cluster for
   baseline capture if the option exists. Antrea enablement (Flow
   Exporter feature gate + flow-aggregator) added for completeness
   since one of our clusters is on Antrea.

review-fix-loop iteration 1
reviewer(s): review-fix-loop (correctness lens)
file: tickets/devop-579-network-policy-rollout.md:15-17

Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
…ok now matches actual resource names

The rollback runbook command `kubectl delete networkpolicy default-deny
-n <ns>` would no-op (NotFound) once ingress lands, because the ingress
section calls the ingress policy `default-deny-ingress` while the egress
section never pinned the egress resource name. So:

- An engineer authoring `default-deny.yaml` could legitimately name the
  resource `default-deny-egress`, `egress-default-deny`, or anything
  else. The runbook would silently fail to delete it in an incident.
- Once both directions are deployed, the runbook needs both rollback
  commands, not one.
- The Phase 4 Kyverno asserter needs to grep on a deterministic
  resource name to enforce "every namespace has both default-deny
  policies".

Fix is structural: Phase 2 now contains a pinned naming convention
table that the rollback runbook (Phase 3) and the Kyverno asserter
(Phase 4) both reference by exact `metadata.name`. As a side effect of
pinning, also split the egress baseline allows (DNS/NTP) into a
separate generated policy (`egress-baseline-allow`) so the per-namespace
`egress-allowlist` only contains workload-specific rules — resolves
the Phase 2 ambiguity over which baseline rules live in default-deny
vs allowlist.
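
To make the pinned names concrete, here is a sketch that generates three of the manifests as JSON, which `kubectl apply -f -` accepts. It is an illustration, not the files in the PR: the function names are hypothetical, and the DNS selector assumes CoreDNS runs in `kube-system` with the conventional `k8s-app: kube-dns` label.

```python
import json


def default_deny(direction):
    """Build default-deny-egress or default-deny-ingress for one namespace."""
    if direction not in ("egress", "ingress"):
        raise ValueError("direction must be 'egress' or 'ingress'")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"default-deny-{direction}"},
        # Selecting all pods with a policyType but no rules denies that direction.
        "spec": {"podSelector": {}, "policyTypes": [direction.capitalize()]},
    }


def egress_baseline_allow():
    """Build egress-baseline-allow: DNS (53/udp, 53/tcp) and NTP (123/udp)."""
    dns_peer = {  # assumed labels, verify per cluster
        "namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}},
        "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "egress-baseline-allow"},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {"to": [dns_peer],
                 "ports": [{"protocol": "UDP", "port": 53},
                           {"protocol": "TCP", "port": 53}]},
                # No "to" block: NTP is allowed to any destination.
                {"ports": [{"protocol": "UDP", "port": 123}]},
            ],
        },
    }


if __name__ == "__main__":
    for manifest in (default_deny("egress"), default_deny("ingress"), egress_baseline_allow()):
        print(json.dumps(manifest, indent=2))
```

The namespace is left out of `metadata` on purpose so the same output can be applied with `-n <ns>` during the staged rollout.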

Changes:
- New Phase 2 naming-convention table mapping filename ↔ metadata.name
  ↔ purpose for all five policy kinds (3 egress + 2 ingress).
- Rollback runbook now lists both `default-deny-egress` and
  `default-deny-ingress` commands and calls out drift as an incident.
- Phase 4 SECURITY-RUNBOOK hook now references both rollback commands.
- Phase 4 Kyverno bullet now matches by exact metadata.name from the
  pinned table.
- Ingress section's Phase 2 substitution now references the same table
  for both file name and resource name.
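
The exact-name matching described above can be illustrated outside Kyverno. The sketch below is a plain Python stand-in for the Phase 4 check and the two-command rollback, not a Kyverno policy; the function names and the namespace inventory are hypothetical.

```python
# Pinned metadata.name values from the Phase 2 naming table.
REQUIRED = ("default-deny-egress", "default-deny-ingress")


def missing_default_deny(policies_by_namespace):
    """Return {namespace: [missing required policy names]} for non-compliant namespaces."""
    gaps = {}
    for ns, names in policies_by_namespace.items():
        missing = [r for r in REQUIRED if r not in names]
        if missing:
            gaps[ns] = missing
    return gaps


def rollback_commands(namespace):
    """Return both rollback commands, matching the pinned resource names."""
    return [f"kubectl delete networkpolicy {name} -n {namespace}" for name in REQUIRED]


if __name__ == "__main__":
    inventory = {  # made-up namespaces for illustration
        "payments": ["default-deny-egress", "default-deny-ingress", "egress-allowlist"],
        "batch": ["default-deny-egress"],
    }
    print(missing_default_deny(inventory))
    print("\n".join(rollback_commands("batch")))
```

Because both functions read the same `REQUIRED` tuple, a renamed policy shows up as a gap instead of a silent NotFound during an incident.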

review-fix-loop iteration 1
reviewer(s): review-fix-loop (reliability lens)
file: tickets/devop-579-network-policy-rollout.md:52,80,87,112,122

Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
…witch to dnstap (full)

cubic flagged that my iter-1 Phase 0 DNS-log instruction was broken:
the CoreDNS `log` plugin emits client IP + query name + response code
but NOT the answer-section A/AAAA IPs, so the
`(srcPodIP, dstIP)` join described in Phase 1 has nothing on the DNS
side to match `dstIP` against. Confirmed — `log`'s format is per the
CoreDNS docs, and resolved IPs only appear in the actual DNS message
response (the answer section).

Fix is to use the `dnstap` plugin with the `full` flag, which streams
wire-format DNS messages (request + response, including the answer
section) to a Unix socket or TCP collector. A dnstap collector
(`golang-dnstap`, `dnstap-receiver`) decodes those into
`(timestamp, client_pod_ip, query_name, response_ips[])` records that
can actually be joined against flow-log destinations. The Cilium
`hubble observe --type=dns` path was already correct because Hubble
records FQDN and answer IPs together.

Changes:
- Phase 0 DNS-capture bullet now specifies `dnstap ... full` for
  CoreDNS, names the collector requirement, and calls out explicitly
  that the query-only `log` plugin is insufficient (so a future
  reader who has read the old docs doesn't reach for it).
- Phase 1 resolve-to-FQDN bullet now describes the join key
  accurately: `srcPodIP == DNS client IP, dstIP ∈ DNS response answer
  IPs`, instead of pretending `log` output has the answer IPs.
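
The corrected join key can be sketched as follows. This assumes both feeds have already been decoded into dicts: DNS records shaped like the collector output described above, and flow records with hypothetical `ts`, `src_ip`, and `dst_ip` fields. The 5-minute window is an assumed value for "short window", and all addresses in the sample are made up.

```python
from datetime import datetime, timedelta


def resolve_flows(flows, dns_records, window=timedelta(minutes=5)):
    """Attach FQDNs to flows where srcPodIP == DNS client IP and
    dstIP is in the DNS response answer IPs, within `window` after the answer."""
    # Index DNS answers by (client_ip, answer_ip) for constant-time lookup.
    index = {}
    for rec in dns_records:
        for ip in rec["response_ips"]:
            index.setdefault((rec["client_ip"], ip), []).append((rec["ts"], rec["query_name"]))

    resolved = []
    for flow in flows:
        candidates = index.get((flow["src_ip"], flow["dst_ip"]), [])
        # Only answers seen at or before the flow, and no older than the window.
        names = sorted({name for ts, name in candidates
                        if timedelta(0) <= flow["ts"] - ts <= window})
        # An empty list means IP-only: route to suspect-destination review.
        resolved.append({**flow, "fqdns": names})
    return resolved


if __name__ == "__main__":
    t0 = datetime(2026, 5, 1, 12, 0, 0)
    dns = [{"ts": t0, "client_ip": "10.1.0.7", "query_name": "registry.example.com.",
            "response_ips": ["203.0.113.10"]}]
    flows = [
        {"ts": t0 + timedelta(seconds=2), "src_ip": "10.1.0.7", "dst_ip": "203.0.113.10"},
        {"ts": t0 + timedelta(seconds=3), "src_ip": "10.1.0.7", "dst_ip": "169.254.169.254"},
    ]
    for row in resolve_flows(flows, dns):
        print(row["dst_ip"], row["fqdns"])
```

A fixed window will miss connections made from a long-lived cached answer, so a real implementation would probably widen the window to the record TTL where the collector exposes it.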

review-fix-loop iteration 2
reviewer(s): cubic-dev-ai (PR thread PRRT_kwDOLZ5Xss6F4Gnj)
file: tickets/devop-579-network-policy-rollout.md:18-21,38

Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>

Labels

needs-human-review, shai-hulud (Shai-Hulud supply-chain defense work)

2 participants