Background
Gossip was added to kOps in v1.7 (July 2017) as a way to run a cluster without depending on Route53 or any external DNS provider. The cluster name's .k8s.local suffix was the trigger; protokube on every node would discover peers via cloud-provider tag lookups and use a gossip overlay to converge on the live set of control-plane and api-server IPs. That set was written to /etc/hosts so workers could resolve api.internal.<cluster> without ever touching DNS.
It was a pragmatic and welcome feature. For years it was the easiest way to spin up a kOps cluster: no zone, no NS records, no special access for the DNS provider. A lot of CI fleets, dev clusters, and air-gapped-ish setups have lived on it.
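To make the mechanism above concrete, here is a minimal sketch of the two pieces described: the .k8s.local name check and the rendering of /etc/hosts lines for api.internal.<cluster>. It is an illustration only, not the protokube code; usesGossipName and renderHostsEntries are hypothetical names, and the peer IPs are assumed to have already been agreed on by the gossip layer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gossipSuffix is the cluster-name suffix that selected gossip mode.
const gossipSuffix = ".k8s.local"

// usesGossipName reports whether a cluster name carries the gossip suffix.
// Hypothetical helper for illustration; not the actual kOps predicate.
func usesGossipName(clusterName string) bool {
	return strings.HasSuffix(clusterName, gossipSuffix)
}

// renderHostsEntries turns an already-converged set of API server IPs
// into /etc/hosts lines for api.internal.<cluster>.
// The IPs are sorted so the output is stable across nodes.
func renderHostsEntries(clusterName string, ips []string) string {
	sorted := append([]string(nil), ips...) // copy; leave the caller's slice untouched
	sort.Strings(sorted)

	host := "api.internal." + clusterName
	var b strings.Builder
	for _, ip := range sorted {
		fmt.Fprintf(&b, "%s\t%s\n", ip, host)
	}
	return b.String()
}
```

Calling renderHostsEntries("dev.k8s.local", []string{"10.0.0.2", "10.0.0.1"}) yields one line per IP, each mapping the address to api.internal.dev.k8s.local. Peer discovery and the gossip exchange itself are deliberately left out.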
A condensed history of how we got here:
- v1.7 (July 2017): gossip introduced as the path for "no real DNS available." Initial implementation backed by weaveworks/mesh.
- v1.16 (February 2020): a second gossip implementation (memberlistmesh) shipped as an alternative to weaveworks/mesh, partly to give users a choice and partly for resilience against either implementation going unmaintained.
- v1.26 (March 2023): --dns=none introduced as a topology choice (Hetzner first).
- v1.28 (September 2023): release notes started telling users that --dns=none is the path forward and pointed at the gossip-deactivation.
- v1.29 (May 2024): kops create cluster started defaulting to dns=none for non-AWS/non-GCE clouds, and started emitting an explicit deprecation warning in CLI output: "Gossip is deprecated, using None DNS instead". This was the first time the message reached operators at the command line rather than only in release notes.
- v1.29.1 (July 2024): the dns=none default extended to AWS and GCE too. From this point on, every new cluster on every cloud is None-DNS by default.
- v1.30 (August 2024): the IsGossip() predicate was renamed to UsesLegacyGossip() to signal long-term direction without breaking anyone.
- v1.35 (March 2026): kops create cluster no longer creates a gossip cluster at all; even the .k8s.local naming convention now produces a None-DNS cluster. Existing gossip clusters keep updating; new ones cannot be created via the standard flow.
This issue tracks finishing the removal.
Why now
Two reasons, both mostly outside the project's control.
Unmaintained upstream dependencies
Both gossip implementations are pulling in code that nobody's shipping fixes for:
- github.com/weaveworks/mesh - last commit 2019. Weaveworks itself wound down in 2024; the repository is archived in spirit if not in label. Anything CVE-relevant in its transitive graph (e.g. older golang.org/x/* versions, older crypto helpers) is on us to vendor and patch.
- github.com/jacksontj/memberlistmesh - personal fork from 2019 of HashiCorp's memberlist. Same unmaintained-vendor problem; the upstream memberlist has moved on, the fork has not.
Each of these brings a chunk of code that runs on every kOps node, in a privileged daemon (protokube), that we don't really get to update.
Security pressure beyond just maintenance
protokube in gossip mode needs broad cloud-provider permissions, on every node, including workers, just to discover its peers. Workers in a None-DNS cluster don't need any of that. The kOps default permissions reflect this gap: gossip workers carry permissions that expose considerable cluster topology to anyone who can read the node's instance role.
We have been chipping at this for releases and have always been comfortable with a long deprecation runway. That's changing for a specific reason: AI-powered security review tools are now surfacing both of the points above prominently: the unmaintained mesh dependencies on every node and the over-broad worker permissions. Whatever a Claude / Codex / Copilot-style scanner finds quickly and consistently is, by construction, a low-friction discovery for someone with malicious intent. The same automation that makes the security posture obvious to defenders makes it obvious to attackers. The deprecation can no longer be paced by what's comfortable for us; it has to be paced by the realistic window before this surface is actively exploited.
We would rather not rush this. Gossip enabled real work for a long time and we don't take user-facing breakage lightly. But the cost of holding the line is now higher than the cost of finishing the migration.
Proposal
Open for discussion, not a decision yet:
- kOps 1.36: ship a simple hybrid mode that lets a gossip cluster keep gossip on the control plane while bootstrapping workers off the API load balancer. Gives operators a single kops reconcile path to take protokube (and the unmaintained mesh dependencies, and the over-broad worker IAM) off worker nodes without flipping topology in the same window. New gossip clusters remain refused by kops create cluster; existing ones keep updating.
- kOps 1.37: remove gossip code paths entirely. protokube ships without weaveworks/mesh or memberlistmesh.
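As a discussion aid, the intent of the two steps above can be sketched as a small decision function: which nodes would still join the gossip mesh in each mode. This is a hypothetical model of the proposal, not existing kOps code; the Mode and Role types, their values, and runsGossip are all names invented for illustration.

```go
package main

// Role is the kind of node. Hypothetical type for illustration.
type Role string

const (
	ControlPlane Role = "control-plane"
	Worker       Role = "worker"
)

// Mode is how the cluster locates its API servers. Hypothetical type.
type Mode string

const (
	ModeGossip Mode = "gossip" // legacy: every node joins the mesh
	ModeHybrid Mode = "hybrid" // proposed 1.36 bridge: control plane only
	ModeNone   Mode = "none"   // None-DNS: no mesh at all
)

// runsGossip reports whether a node of the given role would join the
// gossip mesh under the given mode, as the proposal describes it.
func runsGossip(mode Mode, role Role) bool {
	switch mode {
	case ModeGossip:
		return true
	case ModeHybrid:
		// Workers bootstrap off the API load balancer instead.
		return role == ControlPlane
	default:
		return false
	}
}
```

Under this model, the 1.36 step changes only the worker row (gossip off, control plane unchanged), and the 1.37 step removes the remaining true cases.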
A 1.36→1.37 gap of one minor is short, but the warning has been in front of operators since v1.29 (May 2024). The proposal here is the final step of a deprecation that has been visible in kops create cluster output for over 2 years.
If a longer runway is needed for specific operator groups (long-lived gossip clusters that can't easily acquire an API load balancer, or environments where the hybrid bridge in 1.36 is insufficient), please say so on this issue with concrete details. Counter-proposals welcome.
CC @justinsb @rifelpet @ameukam