Pre-prod/prod: WebSocket flapping — Redis read timeouts on Channels (DB 0) with feeder bots

## Summary

Pre-prod (and potentially prod) **real-time features are unreliable**: feeder `/ws/nodes/` and UI `/ws/messages/` connections flap every ~3–5 seconds while **HTTP packet ingest continues to succeed** (`POST /api/packets/.../ingest/` → 201).

API logs show:

```
redis.exceptions.TimeoutError: Timeout reading from redis:6379
```

Stack: Daphne → Channels `await_many_dispatch` → `channel_receive` (blocking read on `channels_redis` / Redis DB 0).

## Environment

| Item | Value |
|------|--------|
| Stack | Portainer `meshflow-preprod` (same pattern in prod) |
| Redis image | **`redis:8-alpine`** (not Redis Stack) |
| Redis topology | Single instance; logical DBs per [docs/REDIS.md](docs/REDIS.md) (DB 0 Channels, DB 1 Celery broker+results, DB 2 cache, DB 3 RF engine) |
| API runtime | Python **3.14** in container logs (Channels/Django test matrix risk) |
| Clients during incident | **Several feeder bots** uploading + holding `/ws/nodes/`; **no UI browsers** open for pre-prod |
| Investigation notes | Local copy under `tmp/redis-issue/` (not committed) |

## Symptoms

1. `ERROR daphne.server: Exception inside application: Timeout reading from redis:6379`
2. `WSDISCONNECT` on `/ws/messages/` or `/ws/nodes/`
3. `WSCONNECT` again within ~2–5 s (reconnect storm)
4. Feeders reconnect; traceroute commands / live messaging unreliable
5. Often preceded by `asyncio.exceptions.CancelledError` (wrapped in `asgi_client_disconnect.py`)

Example concurrent feeders (from logs):

- Two feeders on one API key (warning: `API key linked to 2 nodes; use feeder_pubkey_prefix for MC`)
- Meshtastic group `node_1127973616` (`!433b82f0`)
- MeshCore group `node_mc_{internal_id}`

## What this is **not**

- Redis TCP down (port 6379 accepts connections)
- Wrong `REDIS_PASSWORD` (would tend to fail fast with `NOAUTH`)
- Channels key explosion on DB 0 (DBSIZE was **4** during investigation)
- Celery broker backlog (`LLEN celery` = **0**)
- MeshCore canonical channels (#379) — no evidence in logs

## Key finding: DB 1 cleanup did **not** fix it

During investigation, Redis DB 1 held ~**10k** `celery-task-meta-*` keys (Celery result backend, ~24h TTL). That was a plausible cause of **single-threaded** Redis delaying DB 0 Channels replies.

**After clearing DB 1** (`DBSIZE` → **4**), **WebSocket timeouts continued**. So either:

- DB 1 result volume was a contributing factor but **not sufficient** to explain the outage, or
- The primary cause is elsewhere (see theories below).

Celery result hygiene (`CELERY_RESULT_EXPIRES`, `ignore_result` on beat tasks) remains worthwhile but is **not proven** as the fix for this incident.

## Diagnostics snapshot (pre-clear)

| DB | Purpose | PING | DBSIZE |
|----|---------|------|--------|
| 0 | Channels | ~12 ms | 4 |
| 1 | Celery broker + results | ~2 ms | ~9866 |
| 2 | Django cache | ~1 ms | 3 |

```
connected_clients ~33
blocked_clients ~3   # expected: Celery BRPOP + Channels BZPOPMIN
```

Redis server logs: no errors, no slowlog entries in captured window; BGSAVE ~30–100 ms. Host warning: `vm.overcommit_memory` not enabled.

**Important:** `socket.connect_ex(('redis', 6379))` only proves the port is open — failures are **command read timeouts**, not connect failures.

## Working theories (ranked, post DB-1-clear)

1. **Client read timeout vs `channels_redis` blocking receive**  
   `channels_redis` uses `BZPOPMIN` with a **5s** server-side timeout per receive loop. If `redis.asyncio` **client** `socket_timeout` is too low (or Redis is slow to service DB 0 while busy), consumers raise `TimeoutError` → WS disconnect → feeders reconnect → more `group_add` / channel traffic → repeat every few seconds.

2. **Feeder WebSocket reconnect storm (no UI required)**  
   Multiple bots on `/ws/nodes/` with reconnect loops amplify Redis Channels load on DB 0 even without browser tabs.

3. **Many Redis connections (~33)**  
   API + Celery workers + beat + RF worker + Site Planner (DB 3) + Channels pools + reconnect churn.

4. **Single Redis process / single event loop**  
   Logical DB indices do **not** isolate CPU — all workloads share one thread. A second Redis **process** on the same host would still separate Channels from Celery/cache/RF.

5. **Python 3.14 in API image**  
   Ahead of typical Channels/Django/redis-py test matrices.

6. **Split-host dev** (check)  
   Local API or bot pointing at pre-prod Redis DB 0 doubles groups — see [text-message-channels.md](docs/features/meshcore/text-message-channels.md).

**Deprioritized after DB 1 clear:** Celery result key count as *primary* cause. **Incorrect for current env:** Redis Stack modules (env is `redis:8-alpine`).

## Settings gaps (current `main`)

From `Meshflow/Meshflow/settings/base.py`:

- `CHANNEL_LAYERS` → `channels_redis` on DB 0 via URL string only — **no** explicit `socket_timeout` / connect tuning
- `CELERY_RESULT_BACKEND = CELERY_BROKER_URL` (DB 1)
- No explicit `CELERY_RESULT_EXPIRES` in Django settings (Celery default ~24h on results)

### Proposed code mitigations (local branch, not deployed)

- `CHANNEL_REDIS_SOCKET_TIMEOUT` default **30s** on Channels Redis host dict
- `CELERY_RESULT_EXPIRES` default **3600s**
- `ignore_result=True` on high-frequency beat tasks (`dispatch_pending_traceroutes` every 15s, etc.)
- `docs/REDIS.md` ops notes

Deploy these for defense-in-depth; **validate on pre-prod whether WS stability improves** after DB 1 clear failed.

## Infrastructure option: second Redis instance

**Recommendation: prefer two standalone Redis instances on the same host — not Redis Cluster.**

| Approach | Verdict |
|----------|--------|
| **Logical DBs only** (current) | Does **not** fix single-threaded contention — all commands share one Redis core. |
| **Redis Cluster** | High operational cost; `channels_redis` / Celery need cluster-aware URLs and hash tags; little benefit for one physical host and our key patterns. |
| **Two Redis processes (recommended)** | True isolation: Channels blocking reads are not queued behind Celery/RF work on another process. Same host is fine if CPU is not saturated. |

### Suggested split

| Instance | Port (example) | Consumers |
|----------|----------------|-----------|
| **`redis-channels`** | 6379 | Django Channels only (`CHANNEL_LAYERS` → DB 0) |
| **`redis-ops`** | 6380 | Celery broker + result backend, Django cache (DB 2), RF Site Planner (`REDIS_URL` / DB 3) |

Env / compose changes (illustrative):

- `REDIS_CHANNELS_URL` or `CHANNEL_LAYERS` host → `redis-channels:6379/0`
- `CELERY_BROKER_URL` → `redis-ops:6380/1` (or `/0` on dedicated instance)
- `CACHES` → `redis-ops:6380/2`
- RF engine `REDIS_URL` → `redis-ops:6380/3`

Document in `docs/REDIS.md` + Portainer stack. Roll out **pre-prod first**; compare `redis-cli --latency` on channels instance during feeder load vs combined instance.

## Debugging checklist (during incident)

```sh
# Per-DB size
redis-cli -a "$REDIS_PASSWORD" -n 0 DBSIZE
redis-cli -a "$REDIS_PASSWORD" -n 1 DBSIZE

# Blocked clients
redis-cli -a "$REDIS_PASSWORD" CLIENT LIST | grep -i block

redis-cli -a "$REDIS_PASSWORD" SLOWLOG GET 20
redis-cli -a "$REDIS_PASSWORD" --latency

# From API container
python -c "
import os, time, redis
pw = os.environ['REDIS_PASSWORD']
for db in (0, 1, 2):
    r = redis.Redis(host='redis', port=6379, password=pw, db=db, socket_timeout=10)
    t = time.time(); r.ping(); print(f'DB{db} PING', round(time.time()-t, 3), 's')
"
```

Count feeder WS connections in API logs (`WSCONNECT` / `WSDISCONNECT` on `/ws/nodes/`).

## Acceptance criteria

- [ ] Pre-prod: `/ws/nodes/` stays connected for ≥15 min under normal multi-feeder ingest without `Timeout reading from redis`
- [ ] Traceroute dispatch over WS works reliably with multiple feeders
- [ ] Root cause documented (code vs infra) with evidence from latency/slowlog during repro
- [ ] Fix deployed: code tuning and/or split Redis instances + updated ops docs

## Related

- [docs/REDIS.md](docs/REDIS.md)
- [docs/features/meshcore/text-message-channels.md](docs/features/meshcore/text-message-channels.md) — feeder WS, split-host Redis
- #142 — optional WS presence for traceroute source eligibility (related but separate)

Approach	Verdict
Logical DBs only (current)	Does not fix single-threaded contention — all commands share one Redis core.
Redis Cluster	High operational cost; `channels_redis` / Celery need cluster-aware URLs and hash tags; little benefit for one physical host and our key patterns.
Two Redis processes (recommended)	True isolation: Channels blocking reads are not queued behind Celery/RF work on another process. Same host is fine if CPU is not saturated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-prod/prod: WebSocket flapping — Redis read timeouts on Channels (DB 0) with feeder bots #382

Summary

Environment

Symptoms

What this is not

Key finding: DB 1 cleanup did not fix it

Diagnostics snapshot (pre-clear)

Working theories (ranked, post DB-1-clear)

Settings gaps (current `main`)

Proposed code mitigations (local branch, not deployed)

Infrastructure option: second Redis instance

Suggested split

Debugging checklist (during incident)

Acceptance criteria

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Item	Value
Stack	Portainer `meshflow-preprod` (same pattern in prod)
Redis image	`redis:8-alpine` (not Redis Stack)
Redis topology	Single instance; logical DBs per docs/REDIS.md (DB 0 Channels, DB 1 Celery broker+results, DB 2 cache, DB 3 RF engine)
API runtime	Python 3.14 in container logs (Channels/Django test matrix risk)
Clients during incident	Several feeder bots uploading + holding `/ws/nodes/`; no UI browsers open for pre-prod
Investigation notes	Local copy under `tmp/redis-issue/` (not committed)

DB	Purpose	PING	DBSIZE
0	Channels	~12 ms	4
1	Celery broker + results	~2 ms	~9866
2	Django cache	~1 ms	3

Instance	Port (example)	Consumers
`redis-channels`	6379	Django Channels only (`CHANNEL_LAYERS` → DB 0)
`redis-ops`	6380	Celery broker + result backend, Django cache (DB 2), RF Site Planner (`REDIS_URL` / DB 3)

Pre-prod/prod: WebSocket flapping — Redis read timeouts on Channels (DB 0) with feeder bots #382

Description

Summary

Environment

Symptoms

What this is not

Key finding: DB 1 cleanup did not fix it

Diagnostics snapshot (pre-clear)

Working theories (ranked, post DB-1-clear)

Settings gaps (current main)

Proposed code mitigations (local branch, not deployed)

Infrastructure option: second Redis instance

Suggested split

Debugging checklist (during incident)

Acceptance criteria

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Settings gaps (current `main`)