Global messages on XX timed out, subscriber thread never self-heals after half-open TCP connection #387

Description

@Mubramaj

When a Unicorn/Puma worker logs:

Global messages on timed out, message bus is no longer functioning correctly

the process never recovers on its own. Real-time features silently break for all users on that worker until the app server is restarted manually.

The MessageBus library only logs this warning; it has no mechanism to revive the subscriber thread.

I run MessageBus with Puma in two separate applications, deployed to two different AWS accounts in two different regions. When it happens in one application, it happens in the other as well. That is why I suspect a cause at the network layer, such as a half-open TCP connection.
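For context, a half-open connection is one where the peer has gone away but the local side never receives a FIN or RST, so a blocking read simply waits forever. The sketch below is only an illustration of that symptom, not MessageBus code: `read_with_deadline` is a hypothetical helper, and an `IO.pipe` stands in for the real connection. It shows how a read with a deadline can tell "silent" apart from "data arrived".

```ruby
# Hypothetical helper, not part of MessageBus: wait up to `timeout` seconds
# for data on `io`. Returns the data read, or :silent if nothing arrived.
def read_with_deadline(io, timeout)
  ready, = IO.select([io], nil, nil, timeout)
  return :silent if ready.nil? # IO.select returns nil when the timeout expires

  io.read_nonblock(1024, exception: false)
end

# A pipe stands in for the connection between the app and its backend.
app_side, peer_side = IO.pipe

# The peer sends nothing, as on a half-open connection: the read gives up
# after the deadline instead of blocking forever.
first = read_with_deadline(app_side, 0.1)

# Once the peer writes, the same call returns the data.
peer_side.syswrite("ping")
second = read_with_deadline(app_side, 0.1)
```

A plain blocking read in place of `read_with_deadline` would hang on the first call, which matches the "never recovers" behaviour described above if the half-open theory is right.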

I know that Discourse uses Unicorn and has a programmatic way to revive things when something goes wrong. Or maybe you don't deploy on AWS and have never hit this network issue that leaves the library in a "dead" state. On my side, with Puma on AWS EC2 servers, I have had to restart the web server manually every time it happens. (Most of the time this is during the night, and by the time I wake up I already have many tickets complaining that live updates are broken 😄.) It is also fairly random: it can go weeks without happening, then happen two days in a row.

I made a fix and tested it for a long time, running my fork of MessageBus from the Gemfile, to make sure it had no side effects. I waited for the warning to appear again so I could confirm that the fix "revives" the message bus. Over a couple of days of monitoring it happened twice, and both times it recovered cleanly.
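To make the idea of "reviving" concrete, here is a minimal sketch of a generic watchdog pattern. It is not the patch from my fork and it does not use MessageBus's API: `SubscriberWatchdog`, `touch`, `check`, and `stale_after` are all hypothetical names, and the subscriber is simulated with a block whose first run hangs the way a read on a dead connection would.

```ruby
# Hypothetical watchdog, NOT MessageBus's API and NOT the actual fix.
# It restarts a subscriber thread that has stopped reporting in.
class SubscriberWatchdog
  attr_reader :restarts

  def initialize(stale_after:, &subscriber)
    @stale_after = stale_after
    @subscriber = subscriber
    @mutex = Mutex.new
    @restarts = 0
    @last_seen = now
    @thread = spawn_subscriber
  end

  # The subscriber calls this whenever a message or keepalive arrives.
  def touch
    @mutex.synchronize { @last_seen = now }
  end

  def stale?
    @mutex.synchronize { now - @last_seen > @stale_after }
  end

  # Call periodically. Replaces the subscriber thread if it has gone quiet
  # and returns true; returns false if the subscriber looks healthy.
  def check
    return false unless stale?

    @thread.kill
    @thread.join
    @restarts += 1
    touch
    @thread = spawn_subscriber
    true
  end

  def stop
    @thread.kill
    @thread.join
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def spawn_subscriber
    Thread.new { @subscriber.call(self) }
  end
end

# Simulated subscriber: the first run blocks forever without reporting in,
# later runs report in every 10 ms.
runs = 0
watchdog = SubscriberWatchdog.new(stale_after: 0.2) do |wd|
  runs += 1
  if runs == 1
    sleep
  else
    loop do
      wd.touch
      sleep 0.01
    end
  end
end

sleep 0.3
revived = watchdog.check        # first run went quiet, so it is replaced
sleep 0.3
healthy_check = watchdog.check  # second run keeps reporting in
watchdog.stop
```

In a real deployment the restart step would also have to close and reopen the underlying connection, since killing the thread alone does not clear a dead socket.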

Pull request coming.

Related to these 2 Discourse threads:
