temporal-server: drop hardcoded postgres IP, survive HAProxy/patroni failover #145

Description

@afreidah

Where

  • nomad/jobs/infrastructure/temporal/files/temporal-env.tpl - uses consul-template to read service/munchbox-postgres/leader, regex-extracts the leader's hostname from its patroni member-JSON conn_url, and feeds that to POSTGRES_SEEDS. Restarts temporal on every leader change (change_mode = restart on the template).
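For readers unfamiliar with the extraction step, here is a rough Python sketch of what "pull the leader's host out of conn_url" amounts to. It is not the template itself: the real file does this with a regex inside consul-template, whereas this sketch swaps in `urllib.parse` for clarity. The sample JSON value, the address `10.0.0.11`, and the function name `leader_host` are all made up for illustration.

```python
import json
from urllib.parse import urlsplit


def leader_host(member_json: str) -> str:
    """Return the host part of the conn_url field in a patroni member-JSON value."""
    member = json.loads(member_json)
    host = urlsplit(member["conn_url"]).hostname
    if host is None:
        raise ValueError("conn_url has no host component")
    return host


# Hypothetical KV value; the real one carries more fields than shown here.
example = '{"conn_url": "postgres://10.0.0.11:5432/postgres", "state": "running"}'
print(leader_host(example))
```

Whatever this returns is a single address frozen at render time, which is why the template has to re-render (and restart temporal) on each leader change.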

Issue

Two layered problems:

  1. HAProxy is silently bypassed. The env template sets POSTGRES_PORT=5433 (HAProxy port). The temporalio/server:1.29.1 config template reads DB_PORT, not POSTGRES_PORT, so the port falls through to the default 5432 and temporal connects straight to patroni. HAProxy's whole "detect role change + kill stale TCP + transparent reconnect" superpower (introduced in commit 7cb39eb for exactly this) is never engaged.

  2. The leader-tracking workaround is what bypasses it. Because the temporal config template takes only a single POSTGRES_SEEDS value (no multi-host parsing), the operator can't point it at haproxy-postgres.service.consul directly, since that name resolves to multiple A records. The current workaround manually picks one leader IP via the KV regex dance, then restarts temporal on every patroni promotion. Each restart adds 30-60s of downtime per failover, where a transparent reconnect via HAProxy would add 2-5s.
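The port fall-through in problem 1 can be modeled in a few lines. This is only a sketch of the lookup behavior described above (read DB_PORT, default to 5432), not the actual temporalio/server config template; `resolve_db_port` and the seed address are hypothetical.

```python
def resolve_db_port(env: dict[str, str], default: str = "5432") -> str:
    """Mimic a config template that reads only DB_PORT and otherwise uses the default."""
    return env.get("DB_PORT", default)


# What temporal-env.tpl renders today: the port is set under a name nobody reads.
current = {"POSTGRES_SEEDS": "10.0.0.11", "POSTGRES_PORT": "5433"}
# The same environment after renaming the variable.
fixed = {"POSTGRES_SEEDS": "10.0.0.11", "DB_PORT": "5433"}

print(resolve_db_port(current))  # POSTGRES_PORT is ignored, so the default applies
print(resolve_db_port(fixed))    # DB_PORT is honored
```

Under this model the current environment resolves to the patroni port and the renamed one resolves to the HAProxy port, with no error or warning in either case, which is why the bypass goes unnoticed.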

The visible failover symptom: temporal stays glued to whichever IP was rendered into its environment, regardless of patroni promotions. When patroni demotes that node to replica, writes fail and a restart+reconnect cycle is required.

Fix direction

Smallest viable:

  • Rename POSTGRES_PORT=5433 to DB_PORT=5433 in temporal-env.tpl so temporal actually hits HAProxy on the leader's node.
  • Once HAProxy is in the path, transparent patroni failover works on that side. The KV-leader extraction can then be simplified or dropped, depending on appetite (a static IP of a healthy HAProxy is enough if you trust HAProxy to route).

Full HA (separate follow-up):

  • Add a keepalived VIP that follows whichever node has a healthy HAProxy alloc; point temporal at the VIP. Covers the case where the HAProxy host itself dies, not just patroni primary-flip.

Related

#35 - prior temporal DNS cleanup, different scope.
