Global messages on XX timed out, subscriber thread never self-heals after half-open TCP connection

When a Unicorn/Puma worker logs:

> Global messages on XX timed out, message bus is no longer functioning correctly


the process never recovers on its own. Real-time features silently break for all users on that worker until the app server is restarted manually.

The MessageBus library only logs this warning; it has no mechanism to revive the subscriber thread.
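For illustration only, a revive mechanism could look roughly like the sketch below. It is not MessageBus code and not the fix from the fork: `SubscriberWatchdog`, `touch`, `stalled?` and `check!` are hypothetical names, and the sketch assumes that killing and recreating the subscriber thread is safe for the backend in use.

```ruby
# Hypothetical sketch: SubscriberWatchdog and its methods are illustrative
# names, not part of the MessageBus API.
class SubscriberWatchdog
  attr_reader :restarts

  # timeout: seconds of silence after which the subscriber counts as stalled.
  # clock:   callable returning the current time in seconds (injectable for tests).
  # block:   starts the subscriber and returns its Thread.
  def initialize(timeout:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, &start_subscriber)
    @timeout = timeout
    @clock = clock
    @start_subscriber = start_subscriber
    @lock = Mutex.new
    @restarts = 0
    @last_seen = @clock.call
    @thread = @start_subscriber.call
  end

  # Record that a message (or keepalive) was just received.
  def touch
    @lock.synchronize { @last_seen = @clock.call }
  end

  # True when nothing has been received for longer than the timeout.
  def stalled?
    @lock.synchronize { @clock.call - @last_seen > @timeout }
  end

  # Restart the subscriber if it is stalled. Returns true when a restart happened.
  def check!
    @lock.synchronize do
      return false unless @clock.call - @last_seen > @timeout

      @thread.kill if @thread.respond_to?(:kill)
      @thread = @start_subscriber.call
      @last_seen = @clock.call
      @restarts += 1
      true
    end
  end
end
```

In a real process, `touch` would be called from the message handler and `check!` from a periodic timer, so a silent subscriber is replaced instead of only being reported.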

I run MessageBus with Puma in 2 different applications, deployed on 2 different AWS accounts in 2 different regions. When it happens in one application, it happens in the other as well. That is why I suspect the cause is at the network layer (for example, a half-open TCP connection).
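To make the half-open case concrete: when the peer disappears without closing the connection, a blocking read simply waits forever, so the subscriber looks alive but never receives anything. A minimal sketch of one general way to bound that wait is below. `read_with_timeout` and `ReadTimeout` are hypothetical names, not part of MessageBus, and this is not a claim about how the library reads from its backend.

```ruby
# Hypothetical helper: turns "no data for N seconds" into an error
# instead of blocking forever on a silent connection.
class ReadTimeout < StandardError; end

def read_with_timeout(io, timeout)
  # IO.select returns nil when nothing became readable within the timeout.
  ready = IO.select([io], nil, nil, timeout)
  raise ReadTimeout, "no data received for #{timeout}s" unless ready

  io.read_nonblock(4096)
end
```

A caller could rescue `ReadTimeout`, close the socket and reconnect, which is the kind of recovery a silent half-open connection otherwise never triggers.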

I know that Discourse uses Unicorn and has a programmatic way to revive things when something goes wrong. Or maybe you don't deploy on AWS and have never hit the network issue that puts the library in a "dead" state. On my side, with Puma on AWS EC2 servers, I have had to restart the web server manually every time it happens. (Most of the time this is during the night, and by the time I wake up I already have many tickets complaining about live updates 😄.) It is also quite random: it can go weeks without happening, then happen 2 days in a row.

I made a fix and tested it for a long time (using my fork of MessageBus in the Gemfile) to make sure there were no side effects. I waited for the warning to appear again and for the fix to "revive" the message bus. Over a couple of days of monitoring it happened 2 times, and both times it recovered cleanly.

Pull request coming.

Related to these 2 Discourse threads:

- https://meta.discourse.org/t/message-bus-is-no-longer-functioning-correctly/237209
- https://meta.discourse.org/t/after-config-change-some-settings-not-in-sync-on-sibling-nodes-in-multi-node-discourse-setup/372573/4?tl=en
