Skip to content

Pre-prod/prod: WebSocket flapping — Redis read timeouts on Channels (DB 0) with feeder bots #382

@pskillen

Description

@pskillen

Summary

Pre-prod (and potentially prod) real-time features are unreliable: feeder /ws/nodes/ and UI /ws/messages/ connections flap every ~3–5 seconds while HTTP packet ingest continues to succeed (POST /api/packets/.../ingest/ → 201).

API logs show:

redis.exceptions.TimeoutError: Timeout reading from redis:6379

Stack: Daphne → Channels await_many_dispatchchannel_receive (blocking read on channels_redis / Redis DB 0).

Environment

Item Value
Stack Portainer meshflow-preprod (same pattern in prod)
Redis image redis:8-alpine (not Redis Stack)
Redis topology Single instance; logical DBs per docs/REDIS.md (DB 0 Channels, DB 1 Celery broker+results, DB 2 cache, DB 3 RF engine)
API runtime Python 3.14 in container logs (Channels/Django test matrix risk)
Clients during incident Several feeder bots uploading + holding /ws/nodes/; no UI browsers open for pre-prod
Investigation notes Local copy under tmp/redis-issue/ (not committed)

Symptoms

  1. ERROR daphne.server: Exception inside application: Timeout reading from redis:6379
  2. WSDISCONNECT on /ws/messages/ or /ws/nodes/
  3. WSCONNECT again within ~2–5 s (reconnect storm)
  4. Feeders reconnect; traceroute commands / live messaging unreliable
  5. Often preceded by asyncio.exceptions.CancelledError (wrapped in asgi_client_disconnect.py)

Example concurrent feeders (from logs):

  • Two feeders on one API key (warning: API key linked to 2 nodes; use feeder_pubkey_prefix for MC)
  • Meshtastic group node_1127973616 (!433b82f0)
  • MeshCore group node_mc_{internal_id}

What this is not

Key finding: DB 1 cleanup did not fix it

During investigation, Redis DB 1 held ~10k celery-task-meta-* keys (Celery result backend, ~24h TTL). That was a plausible cause of single-threaded Redis delaying DB 0 Channels replies.

After clearing DB 1 (DBSIZE4), WebSocket timeouts continued. So either:

  • DB 1 result volume was a contributing factor but not sufficient to explain the outage, or
  • The primary cause is elsewhere (see theories below).

Celery result hygiene (CELERY_RESULT_EXPIRES, ignore_result on beat tasks) remains worthwhile but is not proven as the fix for this incident.

Diagnostics snapshot (pre-clear)

DB Purpose PING DBSIZE
0 Channels ~12 ms 4
1 Celery broker + results ~2 ms ~9866
2 Django cache ~1 ms 3
connected_clients ~33
blocked_clients ~3   # expected: Celery BRPOP + Channels BZPOPMIN

Redis server logs: no errors, no slowlog entries in captured window; BGSAVE ~30–100 ms. Host warning: vm.overcommit_memory not enabled.

Important: socket.connect_ex(('redis', 6379)) only proves the port is open — failures are command read timeouts, not connect failures.

Working theories (ranked, post DB-1-clear)

  1. Client read timeout vs channels_redis blocking receive
    channels_redis uses BZPOPMIN with a 5s server-side timeout per receive loop. If redis.asyncio client socket_timeout is too low (or Redis is slow to service DB 0 while busy), consumers raise TimeoutError → WS disconnect → feeders reconnect → more group_add / channel traffic → repeat every few seconds.

  2. Feeder WebSocket reconnect storm (no UI required)
    Multiple bots on /ws/nodes/ with reconnect loops amplify Redis Channels load on DB 0 even without browser tabs.

  3. Many Redis connections (~33)
    API + Celery workers + beat + RF worker + Site Planner (DB 3) + Channels pools + reconnect churn.

  4. Single Redis process / single event loop
    Logical DB indices do not isolate CPU — all workloads share one thread. A second Redis process on the same host would still separate Channels from Celery/cache/RF.

  5. Python 3.14 in API image
    Ahead of typical Channels/Django/redis-py test matrices.

  6. Split-host dev (check)
    Local API or bot pointing at pre-prod Redis DB 0 doubles groups — see text-message-channels.md.

Deprioritized after DB 1 clear: Celery result key count as primary cause. Incorrect for current env: Redis Stack modules (env is redis:8-alpine).

Settings gaps (current main)

From Meshflow/Meshflow/settings/base.py:

  • CHANNEL_LAYERSchannels_redis on DB 0 via URL string only — no explicit socket_timeout / connect tuning
  • CELERY_RESULT_BACKEND = CELERY_BROKER_URL (DB 1)
  • No explicit CELERY_RESULT_EXPIRES in Django settings (Celery default ~24h on results)

Proposed code mitigations (local branch, not deployed)

  • CHANNEL_REDIS_SOCKET_TIMEOUT default 30s on Channels Redis host dict
  • CELERY_RESULT_EXPIRES default 3600s
  • ignore_result=True on high-frequency beat tasks (dispatch_pending_traceroutes every 15s, etc.)
  • docs/REDIS.md ops notes

Deploy these for defense-in-depth; validate on pre-prod whether WS stability improves after DB 1 clear failed.

Infrastructure option: second Redis instance

Recommendation: prefer two standalone Redis instances on the same host — not Redis Cluster.

Approach Verdict
Logical DBs only (current) Does not fix single-threaded contention — all commands share one Redis core.
Redis Cluster High operational cost; channels_redis / Celery need cluster-aware URLs and hash tags; little benefit for one physical host and our key patterns.
Two Redis processes (recommended) True isolation: Channels blocking reads are not queued behind Celery/RF work on another process. Same host is fine if CPU is not saturated.

Suggested split

Instance Port (example) Consumers
redis-channels 6379 Django Channels only (CHANNEL_LAYERS → DB 0)
redis-ops 6380 Celery broker + result backend, Django cache (DB 2), RF Site Planner (REDIS_URL / DB 3)

Env / compose changes (illustrative):

  • REDIS_CHANNELS_URL or CHANNEL_LAYERS host → redis-channels:6379/0
  • CELERY_BROKER_URLredis-ops:6380/1 (or /0 on dedicated instance)
  • CACHESredis-ops:6380/2
  • RF engine REDIS_URLredis-ops:6380/3

Document in docs/REDIS.md + Portainer stack. Roll out pre-prod first; compare redis-cli --latency on channels instance during feeder load vs combined instance.

Debugging checklist (during incident)

# Per-DB size
redis-cli -a "$REDIS_PASSWORD" -n 0 DBSIZE
redis-cli -a "$REDIS_PASSWORD" -n 1 DBSIZE

# Blocked clients
redis-cli -a "$REDIS_PASSWORD" CLIENT LIST | grep -i block

redis-cli -a "$REDIS_PASSWORD" SLOWLOG GET 20
redis-cli -a "$REDIS_PASSWORD" --latency

# From API container
python -c "
import os, time, redis
pw = os.environ['REDIS_PASSWORD']
for db in (0, 1, 2):
    r = redis.Redis(host='redis', port=6379, password=pw, db=db, socket_timeout=10)
    t = time.time(); r.ping(); print(f'DB{db} PING', round(time.time()-t, 3), 's')
"

Count feeder WS connections in API logs (WSCONNECT / WSDISCONNECT on /ws/nodes/).

Acceptance criteria

  • Pre-prod: /ws/nodes/ stays connected for ≥15 min under normal multi-feeder ingest without Timeout reading from redis
  • Traceroute dispatch over WS works reliably with multiple feeders
  • Root cause documented (code vs infra) with evidence from latency/slowlog during repro
  • Fix deployed: code tuning and/or split Redis instances + updated ops docs

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions