Gateway connection layer lacks protocol version negotiation and a resilient auto-reconnect/resume client #2130

@MervinPraison

Summary

The gateway is positioned as the always-on front door to agents (Telegram/Discord/Slack/WhatsApp bots, web clients, schedulers). For that role, the client↔server connection lifecycle is a first-class robustness surface — and today it is thin in two ways that bite in production:

  1. No protocol version negotiation. Client and server never exchange a wire-protocol version on connect. If the message shape changes between releases, a newer client against an older gateway (or vice-versa) fails silently or with an opaque error instead of negotiating a common version or rejecting cleanly. For a fleet that rolls out gateway and client updates independently, this is a latent breakage.
  2. No resilient reconnecting client. The server already supports cursor-based event replay on reconnect, but no shipped client actually reconnects: integrators must hand-roll the socket loop, exponential backoff, re-join, and cursor bookkeeping themselves. There is also no gap-detection signal, so a client that reconnects after the bounded event window cannot tell it missed events.

A world-class gateway should make "connect, lose the network, transparently resume where you left off, across compatible versions" the default — not a DIY exercise for every integrator.

Note: this is complementary to and distinct from #2103. That issue covers server-side durability of the pending inbox / in-flight execution and graceful drain on shutdown. This issue covers the wire-protocol contract and client resilience: version negotiation, gap detection, and a built-in auto-reconnect/resume client.

Current behaviour

Version is informational only — never negotiated. The only "version" on the wire is a static string on the /info endpoint:

# src/praisonai/praisonai/gateway/server.py:547
"version": "1.0.0",

The join handshake carries no protocol version, and the core protocol contract declares none:

# src/praisonai/praisonai/gateway/server.py:954
if msg_type == "join":
    ...
    since_cursor = data.get("since")   # cursor for replay, but no protocol version exchanged

# src/praisonai-agents/praisonaiagents/gateway/protocols.py
# grep protocol_version | min_protocol | max_protocol | negotiate | hello-version  -> no matches

Server-side resume exists but is incomplete. A client may pass a since cursor and the server replays missed events:

# src/praisonai/praisonai/gateway/server.py:1486-1488
if since_cursor is not None:
    replay_events = session.get_events_since(since_cursor)
    logger.info(f"Replaying {len(replay_events)} events since cursor {since_cursor}")

However:

  • The per-session event history is bounded (persisted history is clamped to the last ~100 events), so a client reconnecting after the window cannot detect which events it missed — there is no gap-detection signal back to the client.
  • The resume payload restores message/event history only; it carries no presence/health snapshot, so a reconnecting client cannot re-sync gateway state in one round trip.
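
To make the missing gap signal concrete, here is a minimal sketch of how a bounded per-session log could report eviction when a client resumes from a cursor older than the retained window. `BoundedEventLog`, `replay_since`, and the `gap` field are hypothetical names for illustration only; this is not the existing `session.get_events_since` implementation, and the 100-event default simply mirrors the clamp described above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional, Tuple


@dataclass
class BoundedEventLog:
    """Hypothetical per-session log that keeps only the newest `max_events` events."""

    max_events: int = 100
    _events: Deque[Tuple[int, Dict[str, Any]]] = field(default_factory=deque)
    _next_seq: int = 1

    def append(self, payload: Dict[str, Any]) -> int:
        """Store an event under the next monotonic sequence number and return it."""
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, payload))
        while len(self._events) > self.max_events:
            self._events.popleft()  # oldest events fall out of the window
        return seq

    def replay_since(self, cursor: int) -> Dict[str, Any]:
        """Return events with seq > cursor, plus a gap marker if some were evicted."""
        oldest = self._events[0][0] if self._events else self._next_seq
        gap: Optional[Dict[str, int]] = None
        if cursor + 1 < oldest:
            # The client expected cursor + 1, but the first event we still hold is `oldest`.
            gap = {"expected": cursor + 1, "received": oldest}
        events = [{**payload, "seq": seq} for seq, payload in self._events if seq > cursor]
        return {"events": events, "gap": gap}
```

With a window of 3 and five appended events, resuming from cursor 1 would return events 3 to 5 together with `{"expected": 2, "received": 3}`, which is enough for a client to decide on a full re-sync.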

No client actually reconnects. The TypeScript gateway client is an interface plus two unused config fields — there is no reconnect loop, no backoff, no cursor tracking:

// src/praisonai-ts/src/gateway/index.ts  (450 lines, interface only)
connect(): Promise<void>;
disconnect(): Promise<void>;
retryAttempts?: number;   // declared, never used to drive reconnection
retryDelay?: number;      // declared, never used

There is no equivalent reconnecting client in the Python wrapper either. (gateway/supervisor.py's reconnect is for channel bot adapters, not the gateway WebSocket client.)

Desired behaviour

  • Version negotiation on connect: client advertises a supported protocol range; server replies with the agreed version (or a clear, typed version_unsupported rejection). Mismatched client/server degrade or refuse cleanly instead of failing opaquely.
  • First-class reconnecting client (TS + Python): automatic reconnect with bounded exponential backoff + jitter, automatic re-join with the last cursor, and a connected/reconnecting/disconnected state callback — so integrators get durable connectivity for free.
  • Gap detection: events carry a monotonic sequence; the client surfaces an on_gap(expected, received) signal when it detects a missed range (e.g. reconnect beyond the replay window), so callers can trigger a full re-sync rather than silently dropping events.
  • Resume snapshot completeness: the reconnect/joined payload includes a compact presence/health snapshot alongside history, so one round trip restores client state.
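
As a rough illustration of the negotiation bullet above, the sketch below picks the highest version inside both ranges and otherwise returns a typed `version_unsupported` rejection. The `min_protocol` / `max_protocol` field names, the `handle_join` helper, the reply shape, and the rule that a client sending no range is treated as version 1 are all assumptions made for this example, not the existing wire format.

```python
from typing import Any, Dict, Optional


def negotiate_version(client_min: int, client_max: int,
                      server_min: int, server_max: int) -> Optional[int]:
    """Return the highest version both sides support, or None if the ranges are disjoint."""
    if client_min > client_max or server_min > server_max:
        raise ValueError("min version must not exceed max version")
    agreed = min(client_max, server_max)
    return agreed if agreed >= max(client_min, server_min) else None


def handle_join(data: Dict[str, Any], server_min: int = 1, server_max: int = 2) -> Dict[str, Any]:
    """Build the reply to a join message that may carry a protocol range.

    Assumption: a legacy client that sends no range speaks version 1 only.
    """
    client_min = int(data.get("min_protocol", 1))
    client_max = int(data.get("max_protocol", client_min))
    agreed = negotiate_version(client_min, client_max, server_min, server_max)
    if agreed is None:
        # Typed rejection: the client learns what the server supports instead of failing opaquely.
        return {
            "type": "error",
            "code": "version_unsupported",
            "supported": {"min": server_min, "max": server_max},
        }
    return {"type": "joined", "protocol_version": agreed}
```

Choosing the highest common version keeps older clients working against newer gateways while letting both sides use the newest shape they share.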

Layer placement

  • Primary layer: core (src/praisonai-agents/praisonaiagents/gateway/protocols.py) — the version-negotiation handshake, the sequence/gap contract, and the resume-cursor shape are protocol contracts that belong in the protocol-driven core, with no heavy imports.
  • Why not core (for the implementation): only the contract lives in core; the socket and backoff mechanics live in the wrapper, as the secondary touch.
  • Why not wrapper (as primary): if the negotiation/sequence/gap contract is defined only in the wrapper server, third-party clients have nothing portable to implement against; the contract must be core so every client (TS, Python, external) agrees on it.
  • Why not tools: connection lifecycle is gateway transport, never an agent-callable integration.
  • Why not plugins: this is intrinsic gateway runtime continuity, not an optional cross-cutting lifecycle policy.
  • Secondary touch: wrapper — implement the negotiated handshake + gap/sequence emission in gateway/server.py, and ship reconnecting clients in src/praisonai-ts/src/gateway/ and a Python gateway client.
  • 3-way surface (CLI + YAML + Python): partial at most. This is a protocol + SDK contract rather than a user feature toggle; the only user-facing knobs are optional (e.g. gateway.protocol_strict to reject unsupported versions, client reconnect/backoff options).

Proposed approach

  • Extension point: protocol contract in core + handshake/sequence implementation in the wrapper server + a reusable reconnecting client.
  • Minimal API sketch:
# core: praisonaiagents/gateway/protocols.py  (additive, backward-compatible)
from typing import TypedDict

PROTOCOL_VERSION = 1
MIN_PROTOCOL_VERSION = 1

class HelloOk(TypedDict):
    protocol_version: int          # negotiated result
    cursor: int                    # resume point
    presence: list[dict]           # snapshot for one-round-trip re-sync
    health: dict

# wrapper: a reconnecting client integrators can just use (proposed API, not yet implemented)
client = GatewayClient(url, reconnect=True, backoff=Backoff(initial=1, max=30, jitter=0.2))
client.on_gap(lambda expected, received: client.resync())
await client.connect()             # negotiates version, auto-resumes via cursor on every reconnect
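
For the `Backoff(initial=1, max=30, jitter=0.2)` option in the sketch above, one possible reading is capped exponential growth with proportional jitter. The `factor` field, the `delay` method, and the choice to apply the cap again after jitter are assumptions of this example; the issue itself only names `initial`, `max`, and `jitter`.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backoff:
    """Hypothetical bounded exponential backoff with +/- proportional jitter."""

    initial: float = 1.0   # seconds before the first retry
    max: float = 30.0      # upper bound on any single delay
    jitter: float = 0.2    # fraction of the base delay used as random spread
    factor: float = 2.0    # growth per attempt (assumed; not specified in the issue)

    def delay(self, attempt: int, rng: Optional[random.Random] = None) -> float:
        """Seconds to wait before reconnect attempt `attempt` (0-based)."""
        if attempt < 0:
            raise ValueError("attempt must be >= 0")
        rng = rng or random.Random()
        # Clamp the exponent so very long outages cannot overflow the float power.
        base = min(self.max, self.initial * self.factor ** min(attempt, 64))
        spread = base * self.jitter
        # Jitter de-synchronises clients; the final min keeps the delay bounded by `max`.
        return min(self.max, base + rng.uniform(-spread, spread))
```

With jitter disabled the delays run 1, 2, 4, 8, 16, 30 seconds; with `jitter=0.2` each delay lands within 20% of that base, which spreads out reconnect storms after a gateway restart.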

Resolution sketch

# Before (today): integrator hand-rolls everything
ws = await websockets.connect(url)
await ws.send(json.dumps({"type": "join", "since": last_cursor}))
# no version check; on socket drop the integrator must write their own
# reconnect loop, backoff, re-join and cursor tracking; missed events
# beyond the ~100-event window vanish with no signal.

# After (proposed): durable by default, version-safe
client = GatewayClient(url, reconnect=True)
await client.connect()               # client and server agree on a protocol version or reject cleanly
async for event in client.events():  # transparently survives disconnects, resumes from cursor
    handle(event)                    # on_gap fires if a range was missed -> client.resync()
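
The client-side half of gap detection could be as small as the tracker sketched here, assuming every event carries a monotonic integer sequence. `SequenceTracker` and `accept` are hypothetical names; the `on_gap(expected, received)` callback shape follows the signature proposed in this issue.

```python
from typing import Callable, Optional

GapHandler = Callable[[int, int], None]


class SequenceTracker:
    """Hypothetical helper: tracks the last seen sequence and reports missed ranges."""

    def __init__(self, on_gap: Optional[GapHandler] = None, last_seq: int = 0) -> None:
        self.last_seq = last_seq  # doubles as the cursor to send on the next re-join
        self._on_gap = on_gap

    def accept(self, seq: int) -> bool:
        """Return True if the event is new and should be delivered to the caller."""
        if seq <= self.last_seq:
            return False  # duplicate, e.g. replay overlapping what we already saw
        expected = self.last_seq + 1
        if seq > expected and self._on_gap is not None:
            # Events expected..seq-1 never arrived; let the caller trigger a re-sync.
            self._on_gap(expected, seq)
        self.last_seq = seq
        return True
```

A reconnecting client would call `accept(event["seq"])` before yielding each event and send `last_seq` as the `since` cursor when it re-joins, so duplicates from replay are dropped and missed ranges are surfaced rather than silently lost.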

Severity

High — for a component whose entire value proposition is being the robust, always-on entry point to agents, the connection layer currently offers no version safety across releases and forces every integrator to re-implement reconnect/resume. Both are table stakes for a world-class gateway, and the server already does the hard half (cursor replay) — the contract and client just aren't there.

Validation

  • gateway/server.py:547 exposes a static "version": "1.0.0" on /info only; the join handshake (server.py:954) exchanges no protocol version, and praisonaiagents/gateway/protocols.py declares no version/negotiation field (repo-wide grep for protocol_version/negotiate/min_protocol returns nothing).
  • Server-side cursor replay confirmed at gateway/server.py:959,966 (since accepted) and :1486-1488 (get_events_since), but the persisted event history is clamped (~100 events) and no gap-detection signal is returned to the client; the resume payload carries no presence/health snapshot.
  • src/praisonai-ts/src/gateway/index.ts (450 lines) is an interface declaring connect()/disconnect() plus unused retryAttempts/retryDelay; no reconnect loop, backoff, or cursor tracking exists, and there is no Python reconnecting client. gateway/supervisor.py's reconnect applies to channel bot adapters, not the gateway WebSocket client.
  • Differentiated from #2103, "Gateway loses in-flight execution and queued messages on disconnect, restart, and shutdown: no durable session continuity or graceful drain" (server-side pending-inbox/in-flight durability + graceful drain): that issue does not address wire-protocol versioning, gap detection, or a reconnecting client.
