Harden service discovery and game-server readiness handling #28

Closed
VG-prog wants to merge 3 commits into walkline:master from VG-prog:vg/tc9-discovery

Conversation

VG-prog commented May 22, 2026

Summary

This PR hardens ToCloud9 service discovery for clustered game-server ownership and player placement.

The important distinction is that a temporarily degraded world loop is not the same as process death, but it also should not be treated as a normal target for new players.

If the sidecar health probe returns 503/504 or times out because the world loop did not process the lightweight probe in time, the registry now keeps the game server registered so existing ownership and owner lookups do not disappear. At the same time, it marks that node as non-admitting, so new player placement skips it and prefers healthy candidates. Actual transport or process liveness failures still remove the server and allow reassignment.
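The three-way decision described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ProbeOutcome`, `ClassifyProbe`, `Registry`, and `ApplyProbe` are hypothetical names, the probe is assumed to surface an HTTP-style status code, and a timeout is assumed to arrive as `context.DeadlineExceeded`. How status codes other than 2xx, 503, and 504 are handled is also an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// ProbeOutcome is a hypothetical three-way classification of a health probe.
type ProbeOutcome int

const (
	Healthy  ProbeOutcome = iota // probe answered in time
	Degraded                     // process is alive, world loop missed the probe window
	Dead                         // transport or process liveness failure
)

// Sample errors used only by the demo in main; both are illustrative.
var (
	errProbeTimeout = fmt.Errorf("health probe: %w", context.DeadlineExceeded)
	errConnRefused  = errors.New("dial tcp: connection refused")
)

// ClassifyProbe maps a probe result to an outcome (names and rules assumed).
func ClassifyProbe(status int, err error) ProbeOutcome {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return Degraded // timed out: world loop did not answer in time
	case err != nil:
		return Dead // any other error is treated as transport/process death
	case status == http.StatusServiceUnavailable, status == http.StatusGatewayTimeout:
		return Degraded
	case status >= 200 && status < 300:
		return Healthy
	default:
		// Assumption: the process answered, so it is not treated as dead.
		return Degraded
	}
}

// GameServer is a simplified registry row.
type GameServer struct {
	Address   string
	Admitting bool // false means new player placement should skip this node
}

// Registry is a toy in-memory registry, not safe for concurrent use.
type Registry struct {
	servers map[string]*GameServer
}

func NewRegistry() *Registry {
	return &Registry{servers: make(map[string]*GameServer)}
}

func (r *Registry) Register(addr string) {
	r.servers[addr] = &GameServer{Address: addr, Admitting: true}
}

func (r *Registry) Lookup(addr string) (GameServer, bool) {
	s, ok := r.servers[addr]
	if !ok {
		return GameServer{}, false
	}
	return *s, true
}

// ApplyProbe updates the row: degraded nodes stay registered but stop
// admitting, healthy nodes admit again, dead nodes are removed.
func (r *Registry) ApplyProbe(addr string, outcome ProbeOutcome) {
	s, ok := r.servers[addr]
	if !ok {
		return
	}
	switch outcome {
	case Healthy:
		s.Admitting = true
	case Degraded:
		s.Admitting = false
	case Dead:
		delete(r.servers, addr)
	}
}

func main() {
	r := NewRegistry()
	r.Register("gs-1")

	r.ApplyProbe("gs-1", ClassifyProbe(0, errProbeTimeout))
	s, ok := r.Lookup("gs-1")
	fmt.Println("after timeout: registered =", ok, "admitting =", s.Admitting)

	r.ApplyProbe("gs-1", ClassifyProbe(http.StatusOK, nil))
	s, ok = r.Lookup("gs-1")
	fmt.Println("after recovery: registered =", ok, "admitting =", s.Admitting)

	r.ApplyProbe("gs-1", ClassifyProbe(0, errConnRefused))
	_, ok = r.Lookup("gs-1")
	fmt.Println("after connection refused: registered =", ok)
}
```

Keeping the admitting flag on the registry row, rather than deleting and re-adding the row, is what lets owner lookups keep resolving while the node is drained.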

What changed

  • Added game-server readiness and map ownership handling needed by clustered routing.
  • Differentiated degraded sidecar health from transport/process death.
  • Kept live game-server ownership intact when the health probe returns a degraded (503/504-style) result.
  • Marked degraded game servers as non-admitting so new player placement does not select a world loop that failed its probe window.
  • Cleared the non-admitting state automatically once a later health probe succeeds.
  • Added a healthy all-map fallback when an assigned map owner is degraded, allowing new placement to move away from a slow owner without deleting that owner's registry row.
  • Added matchmaking health monitoring so gateways can react when queue services disappear.
  • Added registry-side cleanup and GUID allocation support used by the rest of the cluster.
  • Tuned metrics/health timeout handling to avoid creating unrelated client-side timeout noise.
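The placement behavior in the list above (skip non-admitting nodes, fall back to a healthy all-map node when the assigned owner is degraded) might look something like the sketch below. `Node`, `AllMaps`, and `PickForNewPlayer` are hypothetical names chosen for illustration, and returning "no candidate" when nothing is admitting is an assumption; the PR does not say what happens in that case.

```go
package main

import "fmt"

// Node is a simplified view of a registered game server (hypothetical shape).
type Node struct {
	Address   string
	AllMaps   bool // node can host any map
	Admitting bool // false while the node is drained after a degraded probe
}

// PickForNewPlayer prefers the assigned map owner when it is admitting.
// Otherwise it falls back to the first admitting all-map node. The owner's
// registry row is never touched here; only the placement decision changes.
func PickForNewPlayer(owner *Node, candidates []Node) (Node, bool) {
	if owner != nil && owner.Admitting {
		return *owner, true
	}
	for _, n := range candidates {
		if n.AllMaps && n.Admitting {
			return n, true
		}
	}
	// Assumption: report that no healthy target exists and let the caller decide.
	return Node{}, false
}

func main() {
	owner := Node{Address: "gs-owner", Admitting: false} // degraded owner
	pool := []Node{
		{Address: "gs-a", AllMaps: true, Admitting: false},
		{Address: "gs-b", AllMaps: true, Admitting: true},
	}

	if n, ok := PickForNewPlayer(&owner, pool); ok {
		fmt.Println("placing new player on", n.Address)
	}

	owner.Admitting = true // a later probe succeeded
	if n, ok := PickForNewPlayer(&owner, pool); ok {
		fmt.Println("placing new player on", n.Address)
	}
}
```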

Why this matters

Dropping a live owner during world-loop pressure can split authoritative in-memory state for LFG, battleground, arena, or crossrealm owners. Sending new players to that same overloaded node is also not desirable.

This PR separates those two decisions: degraded nodes stay registered for owner continuity, but they are drained from new placement until they recover. Real process or transport death still removes the node.

Validation

  • git diff --check origin/master..HEAD
  • env GOCACHE=/tmp/tc9-go-build GOFLAGS=-buildvcs=false go build ./...
  • env GOCACHE=/tmp/tc9-go-build GOFLAGS=-buildvcs=false make install

VG-prog added 3 commits May 22, 2026 21:49

  • Add the shared protobuf, generated runtime code, events, GUID helpers, auth identity helpers, and configuration contracts used by clustered gateway, registry, group, guild, matchmaking, and sidecar flows. Tests, local tooling, and broad documentation are intentionally excluded from this upstream-focused scope.
  • Wire service discovery, map readiness, stale-safe health and metrics observers, degraded game-server health handling, gateway-scoped cleanup, and shared GUID allocation support. Registry and health code now distinguish world-loop degraded state from process or transport death while preserving live map ownership.
  • Keep world-loop degraded game servers registered for ownership and existing lookups, but mark them as non-admitting so new player placement skips them. Clear the drain state on successful health recovery and fall back to healthy all-map nodes when an assigned owner is degraded.

VG-prog commented May 22, 2026

Closing this draft because it was opened from a stacked branch while targeting master, so the GitHub diff includes earlier slices and is not an isolated review target.

Replacement: #37

That replacement PR presents the current integration honestly as one review surface. Sorry for the review noise.

VG-prog closed this May 22, 2026