Fix stuck faststream-stomp consumers (ping=True but no progress)#196

Merged: vrslev merged 13 commits into main from fix/stuck-consumers on Jun 9, 2026
Conversation

@lesnik512 lesnik512 commented Jun 9, 2026

Member

Summary

In production, faststream-stomp consumers stop processing messages while broker.ping() keeps returning True. Queue depth grows on the broker (ActiveMQ Artemis), the DLQ also grows, and only a pod restart recovers.

This PR addresses every failure mode that can lead to that state:

  • Listener safety net: wrap every per-message handler so that a single handler crash can no longer cancel the read loop. Previously, an unhandled exception inside a handler would cancel the inner TaskGroup and silently kill the listen task.
  • is_alive grace fix: ActiveConnectionState now tracks connected_at; when last_read_time is None, aliveness is bounded by the same server_heartbeat × factor threshold instead of returning True indefinitely. This closes the indefinite-True window after a reconnect where nothing reads.
  • Listen-task death surfaces in is_alive: Client.is_alive() returns False once _listen_task.done(), so dead listeners can no longer report healthy.
  • Unhandled listen-task exceptions are logged: _handle_listen_task_done now logs every exception other than cancellation and FailedAllConnectAttemptsError at ERROR with the traceback. Previously these were silently swallowed.
  • max_concurrent_handlers knob: an opt-in semaphore (default None/unbounded) bounds in-flight handler tasks. When the semaphore is full the read loop pauses, which (combined with the is_alive fix above) converts an invisible stuck state into a detectable one.
  • Visibility for ack/nack drops: ack-skip-after-reconnect → WARNING; nack-skip-after-reconnect → ERROR. Frames dropped inside maybe_write_frame are also logged with a severity that depends on the frame type (NACK=ERROR, ACK=WARNING, other=INFO), so DLQ growth becomes diagnosable from logs alone.
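
The two is_alive changes in the list above can be pictured with a minimal sketch. This is not the library's code: `ConnectionStateSketch`, `client_is_alive`, the `server_heartbeat_s` field (assumed to be in seconds), and the default `factor` of 3.0 are all hypothetical names and values chosen for illustration; only `connected_at`, `last_read_time`, and the heartbeat × factor threshold come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionStateSketch:
    """Hypothetical stand-in for ActiveConnectionState (illustration only)."""

    server_heartbeat_s: float  # assumed unit: seconds
    connected_at: float  # timestamp taken when the connection was established
    factor: float = 3.0  # assumed multiplier; the real value is not stated in the PR
    last_read_time: Optional[float] = None  # None until the first frame is read

    def is_alive(self, now: float) -> bool:
        # Before the first read, measure silence from connected_at, so the
        # state cannot report True forever after a reconnect with no reads.
        if self.last_read_time is None:
            reference = self.connected_at
        else:
            reference = self.last_read_time
        return (now - reference) <= self.server_heartbeat_s * self.factor


def client_is_alive(state: ConnectionStateSketch, listen_task_done: bool, now: float) -> bool:
    # A finished listen task means nothing is reading anymore, so the client
    # is reported as not alive regardless of the connection timestamps.
    if listen_task_done:
        return False
    return state.is_alive(now)
```

With a 1-second heartbeat and a factor of 3, a connection that has never read anything is considered alive for at most 3 seconds after `connected_at`; a single read resets the window.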

Test plan

  • just lint
  • just check-types
  • just test-fast — 244 passed / 1 skipped
  • just test — full suite (requires Docker brokers); run before merge

Notes for reviewers

  • Default behavior is preserved: max_concurrent_handlers is None by default; existing call sites do not need changes.

lesnik512 and others added 12 commits June 9, 2026 09:16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Catching BaseException would swallow SystemExit, KeyboardInterrupt, and
GeneratorExit raised inside a handler, which should propagate. Handlers
that legitimately fail raise Exception subclasses; only those need to
be contained at the listener boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
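
A minimal sketch of the containment described in the commit message above, assuming a hypothetical `run_handler_safely` wrapper (the name and signature are illustrative, not taken from the library). It catches `Exception` only, so `SystemExit`, `KeyboardInterrupt`, `GeneratorExit`, and `asyncio.CancelledError` (all `BaseException` subclasses) still propagate.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger("listener_sketch")


async def run_handler_safely(handler: Callable[[str], Awaitable[None]], message: str) -> bool:
    """Run one handler; return True on success, False if it raised an Exception."""
    try:
        await handler(message)
    except Exception:
        # Ordinary handler failures are logged and contained here, so the
        # caller's read loop keeps going. BaseException is deliberately not caught.
        logger.exception("handler failed for message %r", message)
        return False
    return True


async def demo() -> List[bool]:
    async def good(message: str) -> None:
        return None

    async def bad(message: str) -> None:
        raise ValueError("bad payload")

    # The failing handler in the middle does not stop the third call.
    return [await run_handler_safely(h, "msg") for h in (good, bad, good)]
```

Running `asyncio.run(demo())` returns `[True, False, True]`: the failure is logged with its traceback, and processing continues.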
The note pointing to PR #117 discussion is still relevant after the
logging change — the SystemExit branch remains hard to test directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 force-pushed the fix/stuck-consumers branch from d4ba5b2 to 0e36964 Compare June 9, 2026 06:17
@lesnik512 lesnik512 requested a review from vrslev June 9, 2026 06:20
@lesnik512 lesnik512 self-assigned this Jun 9, 2026
vrslev commented Jun 9, 2026

Collaborator

Perhaps we should set a sensible default for max_concurrent_handlers?

Bound in-flight handler tasks by default so a stuck handler can no
longer accumulate unbounded create_task entries. 100 is high enough
that fast-handler workloads never hit it; the cap matters only when
handlers are slow, where it makes the stuck state detectable: the
read loop pauses, last_read_time goes stale, and is_alive turns False.

Set max_concurrent_handlers=None to restore the prior unbounded
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
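
The effect of the cap described in the commit message above can be sketched with a plain `asyncio.Semaphore`. This is an illustration under assumptions, not the library's read loop: `read_loop` and `peak_in_flight` are hypothetical helpers, and messages are modelled as a simple iterable.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional


async def read_loop(
    messages: Iterable[int],
    handler: Callable[[int], Awaitable[None]],
    max_concurrent_handlers: Optional[int],
) -> None:
    # None means unbounded, matching the opt-out described in the commit message.
    semaphore = asyncio.Semaphore(max_concurrent_handlers) if max_concurrent_handlers is not None else None
    tasks = set()

    async def run(message: int) -> None:
        try:
            await handler(message)
        finally:
            if semaphore is not None:
                semaphore.release()  # free a slot once the handler finishes

    for message in messages:
        if semaphore is not None:
            # When every slot is taken, the loop waits here and stops reading.
            await semaphore.acquire()
        task = asyncio.create_task(run(message))
        tasks.add(task)  # keep a strong reference until the task completes
        task.add_done_callback(tasks.discard)
    await asyncio.gather(*tasks)


async def peak_in_flight(limit: Optional[int]) -> int:
    in_flight = 0
    peak = 0

    async def slow_handler(message: int) -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1

    await read_loop(range(10), slow_handler, limit)
    return peak
```

In this sketch a limit of 2 keeps at most two handlers running at once, while `None` lets all ten start together. In the real client the paused read loop is what lets last_read_time go stale, so a stuck handler eventually shows up in is_alive.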
@vrslev vrslev merged commit 1d92c0a into main Jun 9, 2026
6 checks passed
@vrslev vrslev deleted the fix/stuck-consumers branch June 9, 2026 09:59