Where
nomad/jobs/infrastructure/temporal/files/temporal-env.tpl - uses consul-template to read service/munchbox-postgres/leader, regex-extracts the leader's hostname from its patroni member-JSON conn_url, and feeds that to POSTGRES_SEEDS. Restarts temporal on every leader change (change_mode = restart on the template).
Issue
Two layered problems:
- HAProxy is silently bypassed. The env template sets POSTGRES_PORT=5433 (HAProxy port). The temporalio/server:1.29.1 config template reads DB_PORT, not POSTGRES_PORT, so the port falls through to the default 5432 and temporal connects straight to patroni. HAProxy's whole "detect role change + kill stale TCP + transparent reconnect" superpower (introduced in commit 7cb39eb for exactly this) is never engaged.
- The leader-tracking workaround is what bypasses it. Because the temporal config template only takes a single POSTGRES_SEEDS value (no multi-host parsing), the operator can't point at haproxy-postgres.service.consul directly (multi-A). The current workaround manually picks one leader IP via the KV regex dance, then restarts on every patroni promotion. The restart adds 30-60s of downtime per failover where transparent reconnect via HAProxy would add 2-5s.
The visible failover symptom: temporal is glued to whichever IP rendered, regardless of patroni promotions. When patroni demotes that node to replica, writes fail; restart+reconnect cycle is required.
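The port fallthrough in the first problem can be illustrated with plain shell parameter expansion. This is only an analogy for a "read DB_PORT, default to 5432" lookup, not the server's actual template code; resolve_port is a hypothetical helper introduced for the illustration.

```shell
# Hypothetical helper: mimics a lookup that consults DB_PORT only
# and falls back to 5432 when it is unset. POSTGRES_PORT is never read.
resolve_port() {
  printf '%s\n' "${DB_PORT:-5432}"
}

# Current env: POSTGRES_PORT is set, DB_PORT is not, so the default wins.
( unset DB_PORT; POSTGRES_PORT=5433; resolve_port )   # prints 5432

# After the rename: DB_PORT is set, so the HAProxy port is used.
( DB_PORT=5433; resolve_port )                        # prints 5433
```

Setting POSTGRES_PORT has no effect on the result, which is why the misnamed variable fails silently instead of raising an error.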
Fix direction
Smallest viable:
- Rename POSTGRES_PORT=5433 to DB_PORT=5433 in temporal-env.tpl so temporal actually hits HAProxy on the leader's node.
- Once HAProxy is in the path, transparent patroni failover works on that side. The KV-leader extraction can be simplified or dropped depending on appetite (a static healthy HAProxy IP is enough if you trust HAProxy to route).
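A minimal sketch of the smallest viable change, assuming the port is set as a plain KEY=value line in temporal-env.tpl. The rest of the template, including the POSTGRES_SEEDS rendering, is not shown and stays as it is.

```shell
# temporal-env.tpl (excerpt, sketch): only the variable name changes.
# Before (not read by the server config template):
#   POSTGRES_PORT=5433
# After (HAProxy port, picked up as the DB port):
DB_PORT=5433
```

The value is unchanged; only the name moves to the one the config template reads.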
Full HA (separate follow-up):
- Add a keepalived VIP that follows whichever node has a healthy HAProxy alloc; point temporal at the VIP. Covers the case where the HAProxy host itself dies, not just patroni primary-flip.
Related
#35 - prior temporal DNS cleanup, different scope.