When a Unicorn/Puma worker logs:
Global messages on timed out, message bus is no longer functioning correctly
the process never recovers on its own. Real-time features silently break for all users on that worker until the app server is restarted manually.
The MessageBus library only logs this warning; it has no mechanism to revive itself.
I run MessageBus with Puma in 2 different applications, deployed on 2 different AWS accounts in 2 different regions, and when it happens in one application it happens in the other as well. That is why I suspect the cause is at the network layer (something like a half-open TCP connection).
I know that Discourse uses Unicorn and has a programmatic way to revive things when something goes wrong. Or maybe you don't deploy on AWS and have never hit the network issue that puts the library in this "dead" state. On my side, with Puma on AWS EC2 servers, I have had to restart the web server manually every time it happens. (Most of the time this is during the night, and by the time I wake up I already have many tickets complaining about live updates 😄.) It is also pretty random: it can go weeks without happening, then happen 2 days in a row.
I made a fix and tested it for a long time (using my fork of MessageBus in the Gemfile) to make sure there were no side effects. I waited for the warning to occur again and for the fix to "revive" the message bus. I monitored it for a couple of days, during which it happened 2 times, and both times it recovered cleanly.
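To illustrate what a revive mechanism of this kind could look like in general, here is a minimal watchdog sketch. It is an illustration only: it is not the code from the fork, and it does not use any MessageBus API. `Watchdog`, `beat`, `check`, and the revive block are hypothetical names, and the 60-second timeout is an arbitrary example value. The assumption is that something calls `beat` whenever a keepalive message arrives and calls `check` on a timer.

```ruby
# Hypothetical watchdog, NOT part of MessageBus: it only shows the idea of
# "detect a timeout, then run a revive action instead of just logging".
class Watchdog
  attr_reader :revivals

  # timeout: seconds without a heartbeat before the subscriber counts as dead.
  # clock:   injectable time source (monotonic by default) so it can be tested.
  # on_dead: block that performs the actual revive (assumed to be supplied
  #          by the caller, e.g. reconnecting and resubscribing).
  def initialize(timeout:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, &on_dead)
    @timeout = timeout
    @clock = clock
    @on_dead = on_dead
    @last_beat = @clock.call
    @revivals = 0
    @lock = Mutex.new
  end

  # Call whenever a keepalive message is received.
  def beat
    @lock.synchronize { @last_beat = @clock.call }
  end

  # Call periodically. Returns true if the revive action was triggered.
  def check
    dead = @lock.synchronize do
      next false if @clock.call - @last_beat <= @timeout

      @last_beat = @clock.call # reset so the action does not fire on every check
      @revivals += 1
      true
    end
    @on_dead.call if dead && @on_dead
    dead
  end
end

# Usage with a fake clock, so no real waiting is needed.
now = 0.0
dog = Watchdog.new(timeout: 60, clock: -> { now }) { puts "reviving subscriber" }
now = 30.0
dog.check # within the timeout, nothing happens
now = 61.0
dog.check # past the timeout, the revive block runs
```

Resetting the timestamp after a revive keeps the action from firing on every subsequent check while the connection is being re-established.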
Pull request coming.
Related to these 2 Discourse threads: