Summary
Pre-prod (and potentially prod) real-time features are unreliable: feeder /ws/nodes/ and UI /ws/messages/ connections flap every ~3–5 seconds while HTTP packet ingest continues to succeed (POST /api/packets/.../ingest/ → 201).
API logs show:
redis.exceptions.TimeoutError: Timeout reading from redis:6379
Stack: Daphne → Channels await_many_dispatch → channel_receive (blocking read on channels_redis / Redis DB 0).
Environment
| Item |
Value |
| Stack |
Portainer meshflow-preprod (same pattern in prod) |
| Redis image |
redis:8-alpine (not Redis Stack) |
| Redis topology |
Single instance; logical DBs per docs/REDIS.md (DB 0 Channels, DB 1 Celery broker+results, DB 2 cache, DB 3 RF engine) |
| API runtime |
Python 3.14 in container logs (Channels/Django test matrix risk) |
| Clients during incident |
Several feeder bots uploading + holding /ws/nodes/; no UI browsers open for pre-prod |
| Investigation notes |
Local copy under tmp/redis-issue/ (not committed) |
Symptoms
ERROR daphne.server: Exception inside application: Timeout reading from redis:6379
WSDISCONNECT on /ws/messages/ or /ws/nodes/
WSCONNECT again within ~2–5 s (reconnect storm)
- Feeders reconnect; traceroute commands / live messaging unreliable
- Often preceded by
asyncio.exceptions.CancelledError (wrapped in asgi_client_disconnect.py)
Example concurrent feeders (from logs):
- Two feeders on one API key (warning:
API key linked to 2 nodes; use feeder_pubkey_prefix for MC)
- Meshtastic group
node_1127973616 (!433b82f0)
- MeshCore group
node_mc_{internal_id}
What this is not
Key finding: DB 1 cleanup did not fix it
During investigation, Redis DB 1 held ~10k celery-task-meta-* keys (Celery result backend, ~24h TTL). That was a plausible cause of single-threaded Redis delaying DB 0 Channels replies.
After clearing DB 1 (DBSIZE → 4), WebSocket timeouts continued. So either:
- DB 1 result volume was a contributing factor but not sufficient to explain the outage, or
- The primary cause is elsewhere (see theories below).
Celery result hygiene (CELERY_RESULT_EXPIRES, ignore_result on beat tasks) remains worthwhile but is not proven as the fix for this incident.
Diagnostics snapshot (pre-clear)
| DB |
Purpose |
PING |
DBSIZE |
| 0 |
Channels |
~12 ms |
4 |
| 1 |
Celery broker + results |
~2 ms |
~9866 |
| 2 |
Django cache |
~1 ms |
3 |
connected_clients ~33
blocked_clients ~3 # expected: Celery BRPOP + Channels BZPOPMIN
Redis server logs: no errors, no slowlog entries in captured window; BGSAVE ~30–100 ms. Host warning: vm.overcommit_memory not enabled.
Important: socket.connect_ex(('redis', 6379)) only proves the port is open — failures are command read timeouts, not connect failures.
Working theories (ranked, post DB-1-clear)
-
Client read timeout vs channels_redis blocking receive
channels_redis uses BZPOPMIN with a 5s server-side timeout per receive loop. If redis.asyncio client socket_timeout is too low (or Redis is slow to service DB 0 while busy), consumers raise TimeoutError → WS disconnect → feeders reconnect → more group_add / channel traffic → repeat every few seconds.
-
Feeder WebSocket reconnect storm (no UI required)
Multiple bots on /ws/nodes/ with reconnect loops amplify Redis Channels load on DB 0 even without browser tabs.
-
Many Redis connections (~33)
API + Celery workers + beat + RF worker + Site Planner (DB 3) + Channels pools + reconnect churn.
-
Single Redis process / single event loop
Logical DB indices do not isolate CPU — all workloads share one thread. A second Redis process on the same host would still separate Channels from Celery/cache/RF.
-
Python 3.14 in API image
Ahead of typical Channels/Django/redis-py test matrices.
-
Split-host dev (check)
Local API or bot pointing at pre-prod Redis DB 0 doubles groups — see text-message-channels.md.
Deprioritized after DB 1 clear: Celery result key count as primary cause. Incorrect for current env: Redis Stack modules (env is redis:8-alpine).
Settings gaps (current main)
From Meshflow/Meshflow/settings/base.py:
CHANNEL_LAYERS → channels_redis on DB 0 via URL string only — no explicit socket_timeout / connect tuning
CELERY_RESULT_BACKEND = CELERY_BROKER_URL (DB 1)
- No explicit
CELERY_RESULT_EXPIRES in Django settings (Celery default ~24h on results)
Proposed code mitigations (local branch, not deployed)
CHANNEL_REDIS_SOCKET_TIMEOUT default 30s on Channels Redis host dict
CELERY_RESULT_EXPIRES default 3600s
ignore_result=True on high-frequency beat tasks (dispatch_pending_traceroutes every 15s, etc.)
docs/REDIS.md ops notes
Deploy these for defense-in-depth; validate on pre-prod whether WS stability improves after DB 1 clear failed.
Infrastructure option: second Redis instance
Recommendation: prefer two standalone Redis instances on the same host — not Redis Cluster.
| Approach |
Verdict |
| Logical DBs only (current) |
Does not fix single-threaded contention — all commands share one Redis core. |
| Redis Cluster |
High operational cost; channels_redis / Celery need cluster-aware URLs and hash tags; little benefit for one physical host and our key patterns. |
| Two Redis processes (recommended) |
True isolation: Channels blocking reads are not queued behind Celery/RF work on another process. Same host is fine if CPU is not saturated. |
Suggested split
| Instance |
Port (example) |
Consumers |
redis-channels |
6379 |
Django Channels only (CHANNEL_LAYERS → DB 0) |
redis-ops |
6380 |
Celery broker + result backend, Django cache (DB 2), RF Site Planner (REDIS_URL / DB 3) |
Env / compose changes (illustrative):
REDIS_CHANNELS_URL or CHANNEL_LAYERS host → redis-channels:6379/0
CELERY_BROKER_URL → redis-ops:6380/1 (or /0 on dedicated instance)
CACHES → redis-ops:6380/2
- RF engine
REDIS_URL → redis-ops:6380/3
Document in docs/REDIS.md + Portainer stack. Roll out pre-prod first; compare redis-cli --latency on channels instance during feeder load vs combined instance.
Debugging checklist (during incident)
# Per-DB size
redis-cli -a "$REDIS_PASSWORD" -n 0 DBSIZE
redis-cli -a "$REDIS_PASSWORD" -n 1 DBSIZE
# Blocked clients
redis-cli -a "$REDIS_PASSWORD" CLIENT LIST | grep -i block
redis-cli -a "$REDIS_PASSWORD" SLOWLOG GET 20
redis-cli -a "$REDIS_PASSWORD" --latency
# From API container
python -c "
import os, time, redis
pw = os.environ['REDIS_PASSWORD']
for db in (0, 1, 2):
r = redis.Redis(host='redis', port=6379, password=pw, db=db, socket_timeout=10)
t = time.time(); r.ping(); print(f'DB{db} PING', round(time.time()-t, 3), 's')
"
Count feeder WS connections in API logs (WSCONNECT / WSDISCONNECT on /ws/nodes/).
Acceptance criteria
Related
Summary
Pre-prod (and potentially prod) real-time features are unreliable: feeder
/ws/nodes/and UI/ws/messages/connections flap every ~3–5 seconds while HTTP packet ingest continues to succeed (POST /api/packets/.../ingest/→ 201).API logs show:
Stack: Daphne → Channels
await_many_dispatch→channel_receive(blocking read onchannels_redis/ Redis DB 0).Environment
meshflow-preprod(same pattern in prod)redis:8-alpine(not Redis Stack)/ws/nodes/; no UI browsers open for pre-prodtmp/redis-issue/(not committed)Symptoms
ERROR daphne.server: Exception inside application: Timeout reading from redis:6379WSDISCONNECTon/ws/messages/or/ws/nodes/WSCONNECTagain within ~2–5 s (reconnect storm)asyncio.exceptions.CancelledError(wrapped inasgi_client_disconnect.py)Example concurrent feeders (from logs):
API key linked to 2 nodes; use feeder_pubkey_prefix for MC)node_1127973616(!433b82f0)node_mc_{internal_id}What this is not
REDIS_PASSWORD(would tend to fail fast withNOAUTH)LLEN celery= 0)Key finding: DB 1 cleanup did not fix it
During investigation, Redis DB 1 held ~10k
celery-task-meta-*keys (Celery result backend, ~24h TTL). That was a plausible cause of single-threaded Redis delaying DB 0 Channels replies.After clearing DB 1 (
DBSIZE→ 4), WebSocket timeouts continued. So either:Celery result hygiene (
CELERY_RESULT_EXPIRES,ignore_resulton beat tasks) remains worthwhile but is not proven as the fix for this incident.Diagnostics snapshot (pre-clear)
Redis server logs: no errors, no slowlog entries in captured window; BGSAVE ~30–100 ms. Host warning:
vm.overcommit_memorynot enabled.Important:
socket.connect_ex(('redis', 6379))only proves the port is open — failures are command read timeouts, not connect failures.Working theories (ranked, post DB-1-clear)
Client read timeout vs
channels_redisblocking receivechannels_redisusesBZPOPMINwith a 5s server-side timeout per receive loop. Ifredis.asyncioclientsocket_timeoutis too low (or Redis is slow to service DB 0 while busy), consumers raiseTimeoutError→ WS disconnect → feeders reconnect → moregroup_add/ channel traffic → repeat every few seconds.Feeder WebSocket reconnect storm (no UI required)
Multiple bots on
/ws/nodes/with reconnect loops amplify Redis Channels load on DB 0 even without browser tabs.Many Redis connections (~33)
API + Celery workers + beat + RF worker + Site Planner (DB 3) + Channels pools + reconnect churn.
Single Redis process / single event loop
Logical DB indices do not isolate CPU — all workloads share one thread. A second Redis process on the same host would still separate Channels from Celery/cache/RF.
Python 3.14 in API image
Ahead of typical Channels/Django/redis-py test matrices.
Split-host dev (check)
Local API or bot pointing at pre-prod Redis DB 0 doubles groups — see text-message-channels.md.
Deprioritized after DB 1 clear: Celery result key count as primary cause. Incorrect for current env: Redis Stack modules (env is
redis:8-alpine).Settings gaps (current
main)From
Meshflow/Meshflow/settings/base.py:CHANNEL_LAYERS→channels_redison DB 0 via URL string only — no explicitsocket_timeout/ connect tuningCELERY_RESULT_BACKEND = CELERY_BROKER_URL(DB 1)CELERY_RESULT_EXPIRESin Django settings (Celery default ~24h on results)Proposed code mitigations (local branch, not deployed)
CHANNEL_REDIS_SOCKET_TIMEOUTdefault 30s on Channels Redis host dictCELERY_RESULT_EXPIRESdefault 3600signore_result=Trueon high-frequency beat tasks (dispatch_pending_traceroutesevery 15s, etc.)docs/REDIS.mdops notesDeploy these for defense-in-depth; validate on pre-prod whether WS stability improves after DB 1 clear failed.
Infrastructure option: second Redis instance
Recommendation: prefer two standalone Redis instances on the same host — not Redis Cluster.
channels_redis/ Celery need cluster-aware URLs and hash tags; little benefit for one physical host and our key patterns.Suggested split
redis-channelsCHANNEL_LAYERS→ DB 0)redis-opsREDIS_URL/ DB 3)Env / compose changes (illustrative):
REDIS_CHANNELS_URLorCHANNEL_LAYERShost →redis-channels:6379/0CELERY_BROKER_URL→redis-ops:6380/1(or/0on dedicated instance)CACHES→redis-ops:6380/2REDIS_URL→redis-ops:6380/3Document in
docs/REDIS.md+ Portainer stack. Roll out pre-prod first; compareredis-cli --latencyon channels instance during feeder load vs combined instance.Debugging checklist (during incident)
Count feeder WS connections in API logs (
WSCONNECT/WSDISCONNECTon/ws/nodes/).Acceptance criteria
/ws/nodes/stays connected for ≥15 min under normal multi-feeder ingest withoutTimeout reading from redisRelated