TL;DR
We need to update our RabbitMQ version. This will require careful planning, management and testing.
Description
We are using a very old (legacy) version of RabbitMQ. I recently ran into a RabbitMQ-related corruption issue on staging and had to delete the pod and the PVC (which deleted the queue data) to fix it. What probably happened: (1) RabbitMQ, which runs on a preemptible node on staging, was terminated abruptly; (2) a write to the write-ahead log (WAL) was interrupted, leaving the WAL file corrupted; (3) each time the RabbitMQ pod restarted it attempted WAL recovery, which failed because the corrupted file was still on the persistent volume, leaving the pod in CrashLoopBackOff.
Deleting the PVC removed the corrupted persisted state, but also deleted the queue data (PV Reclaim Policy: Delete on both staging and production).
The risk of this happening on production is lower because the RabbitMQ pod is not on a preemptible node there, but it is still possible any time the pod is interrupted mid-write. Upgrading our RabbitMQ version should help because newer versions are more stable and better at recovering from this kind of corruption.
Plan
For any production release prior to the RabbitMQ upgrade, I will snapshot the PV first, so that the queue data can be restored if I have to delete the PVC/PV.
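As a rough illustration of the pre-release snapshot, the sketch below builds a Kubernetes VolumeSnapshot manifest for the RabbitMQ PVC. It assumes the cluster has a CSI driver with snapshot support and a VolumeSnapshotClass installed, which is not confirmed here. The PVC name, namespace, snapshot class and label are hypothetical placeholders. A snapshot taken while RabbitMQ is running is only crash-consistent, so treat this as a starting point rather than the final procedure.

```python
import json
from datetime import datetime, timezone


def build_volume_snapshot(pvc_name, namespace, snapshot_class, label):
    """Return a VolumeSnapshot manifest (as a dict) for the given PVC.

    All argument values are supplied by the caller; nothing here is read
    from the cluster.
    """
    # Timestamp suffix keeps snapshot names unique across releases.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {
            "name": f"{pvc_name}-{label}-{timestamp}",
            "namespace": namespace,
        },
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


if __name__ == "__main__":
    # Hypothetical names: replace with the real PVC, namespace and class.
    manifest = build_volume_snapshot(
        pvc_name="rabbitmq-data",
        namespace="production",
        snapshot_class="csi-snapclass",
        label="pre-release",
    )
    # kubectl accepts JSON as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f <file>` before each production release.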
Once we are ready to upgrade RabbitMQ, I will:
- Do the upgrade on staging
- Test Celery tasks following site build/deploy and RabbitMQ pod deploy/restarts
- Upgrade on production with the following precautions: (1) back up the RabbitMQ PV before the deploy, and (2) scale down the workers before the deploy, then scale them back up immediately afterwards, to reduce the chance of file corruption
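The scale-down/scale-up precaution in the last step could be wrapped roughly as follows. This is a minimal sketch, assuming the Celery workers run as a Kubernetes Deployment; the deployment name, namespace and replica count are hypothetical. The workers are scaled back up even if the deploy step raises, which is a design choice that may or may not be what we want after a failed upgrade.

```python
import subprocess
from contextlib import contextmanager


def scale_command(deployment, namespace, replicas):
    """Build the kubectl command that sets a Deployment's replica count."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "--namespace", namespace,
    ]


@contextmanager
def workers_scaled_down(deployment, namespace, replicas,
                        run=subprocess.check_call):
    """Scale workers to zero for the duration of the block, then restore.

    `run` is the function that executes a command; it is injectable so the
    sequence can be dry-run or tested without touching a cluster.
    """
    run(scale_command(deployment, namespace, 0))
    try:
        yield
    finally:
        # Restore the original replica count even if the deploy failed.
        run(scale_command(deployment, namespace, replicas))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def show(cmd):
        print(" ".join(cmd))

    with workers_scaled_down("celery-worker", "production", 3, run=show):
        print("# deploy the upgraded RabbitMQ here")
```

Running the script as-is only prints the two `kubectl scale` commands around the placeholder deploy step; dropping the `run=show` argument would execute them for real.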