Fix server ip collision#23533
Conversation
|
Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement Learn more about why HashiCorp requires a CLA and what the CLA includes Have you signed the CLA already but the status is still pending? Recheck it. |
There was a problem hiding this comment.
Pull request overview
This PR addresses a failure mode where ServerLookup can lose track of the current (live) server entry when two “generations” of a server share lookup keys (Raft ID and/or IP:port), e.g. after a hostname rename with the same address/ID, leading to ErrLeaderNotTracked / “Raft leader not found in server lookup mapping”.
Changes:
- Make
ServerLookup.RemoveServera compare-and-delete operation so stale removals don’t evict a newer live entry occupying the same key(s). - Add regression tests reproducing the hostname-rename event interleaving and locking in the intended “stale removal must not evict live” contract (plus a “live removal still removes” case).
- Extend an existing Serf member test helper to allow setting
serf.Member.Addrfor realistic metadata construction.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
agent/consul/server_lookup.go |
Prevents stale RemoveServer calls from deleting live entries by verifying (ID, Name, Addr) generation match before delete. |
agent/consul/server_serf_rename_test.go |
Adds tests reproducing the rename interleaving and validating the compare-and-delete behavior across rename/relocation/address-reuse cases. |
agent/consul/merge_test.go |
Updates the makeTestNode helper to support setting member IP address for tests that depend on it. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Description
Hey friends long time no see. We had an incident a little bit ago where we removed a consul server for maintenance, renamed it, and when we rejoined it it looked okay! But it wasn't. You'll see things like
To my knowledge there's a couple options if you do find yourself in this situation:
I tried to add tests demonstrating how it can happen.
This happened while I was out, but it looks like it's still live so hopefully this helps. Happy to share our incident notes if it's helpful but tried to extract enough to have this make sense.
Happy to also move this to an issue to discuss further, I was just curious how looking at the code would go.
Testing & Reproduction steps
Links
PR Checklist
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.
Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.