
Standardize failed outbound delivery recovery #7

Description

@dimavrem22

Problem

Outbound SMS, iMessage, and email sends can be queued successfully, then fail later via asynchronous delivery webhooks after the agent loop has already completed and gone idle. When the failed outbound message belongs to an active/known thread, the plugin should wake the agent and tell it exactly what failed so it can recover gracefully.

Current state in this repo

  • Handlers for text.delivery_failed, imessage.delivery_failed, message.bounced, and message.failed exist and run if those webhooks arrive.
  • The setup/subscription path is narrower than the handler set: it does not currently subscribe to those failure events.
  • The wake-up path uses a consult/side-effect turn, so the agent's final text is not automatically sent back on the original channel/thread. The agent must manually use tools for visible recovery.
  • Existing handling does not consistently preserve the original SMS/iMessage conversation id or email thread id in the recovery route.

Fleet standard

Implement the same behavior across Claude Code, Codex, Hermes, and OpenClaw plugins:

  1. Wake the agent only for hard failed outbound delivery events:
    • SMS: text.delivery_failed
    • iMessage: imessage.delivery_failed
    • Email: message.bounced, message.failed
  2. Do not wake on text.delivery_unconfirmed; that is telemetry/status uncertainty, not a hard failed-delivery recovery signal.
  3. Track outbound delivery context when a send queues successfully, keyed by provider/Inkbox message id where available. Store channel, contact/session key, recipient, original body snippet, SMS/iMessage conversation id, and email thread/subject metadata.
  4. When a failure webhook arrives, correlate it to the original thread using the tracked outbound context first, then fall back to the contact, thread, or recipient fields in the webhook payload. If no usable thread/session can be resolved, log and do not wake.
  5. Wake the agent with a synthetic recovery turn whose final response is sent on the same channel/thread by default.
  6. The prompt must explain that the previous outbound message failed, include reason/error details and the failed message body when available, and tell the agent it may modify/shorten/retry, use tools to switch channel, or reply exactly [SILENT] to do nothing visible.
  7. Deduplicate repeated failure webhooks by channel + event type + message id, with a payload-hash fallback when no id is present.
  8. Add loop protection so failed recovery sends do not cause unbounded retry loops.
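
A minimal Python sketch of how items 1-4, 7, and 8 could fit together is shown here. It is illustrative only, not the plugins' actual code: the class and method names, the payload field names (message_id, session_key, thread_id, recipient, body), the in-memory stores, and the retry cap of 2 are all assumptions.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

# Hard failure events from item 1, mapped to their channel.
HARD_FAILURE_EVENTS = {
    "text.delivery_failed": "sms",
    "imessage.delivery_failed": "imessage",
    "message.bounced": "email",
    "message.failed": "email",
}

MAX_RECOVERY_DEPTH = 2  # hypothetical cap used for loop protection (item 8)


@dataclass
class OutboundContext:
    channel: str
    session_key: str
    recipient: str
    body_snippet: str
    thread_id: Optional[str] = None  # SMS/iMessage conversation id or email thread id
    subject: Optional[str] = None
    recovery_depth: int = 0  # 0 for a normal send, +1 for each recovery send


class FailureRecovery:
    def __init__(self) -> None:
        self._outbound: dict[str, OutboundContext] = {}
        self._seen: set[str] = set()

    def track_outbound(self, message_id: str, ctx: OutboundContext) -> None:
        """Record context when a send queues successfully (item 3)."""
        self._outbound[message_id] = ctx

    @staticmethod
    def dedupe_key(event_type: str, payload: dict) -> str:
        """channel + event type + message id, with a payload-hash fallback (item 7)."""
        channel = HARD_FAILURE_EVENTS.get(event_type, "unknown")
        message_id = payload.get("message_id")
        if not message_id:
            raw = json.dumps(payload, sort_keys=True, default=str).encode()
            message_id = "hash:" + hashlib.sha256(raw).hexdigest()
        return f"{channel}:{event_type}:{message_id}"

    def resolve(self, event_type: str, payload: dict) -> Optional[OutboundContext]:
        """Tracked outbound context first, then webhook fields (item 4)."""
        ctx = self._outbound.get(payload.get("message_id"))
        if ctx is not None:
            return ctx
        session_key = payload.get("session_key")
        if not session_key:
            return None  # nothing usable to route to
        return OutboundContext(
            channel=HARD_FAILURE_EVENTS[event_type],
            session_key=session_key,
            recipient=payload.get("recipient", ""),
            body_snippet=payload.get("body", ""),
            thread_id=payload.get("thread_id"),
        )

    def handle(self, event_type: str, payload: dict) -> Optional[OutboundContext]:
        """Return the context to wake the agent on, or None to stay idle."""
        if event_type not in HARD_FAILURE_EVENTS:
            return None  # e.g. text.delivery_unconfirmed (item 2)
        key = self.dedupe_key(event_type, payload)
        if key in self._seen:
            return None  # duplicate webhook
        self._seen.add(key)
        ctx = self.resolve(event_type, payload)
        if ctx is None:
            return None  # a real handler would log here
        if ctx.recovery_depth >= MAX_RECOVERY_DEPTH:
            return None  # loop protection: stop retrying failed recoveries
        return ctx
```

In this sketch a recovery send would be tracked with recovery_depth set to the failed message's depth plus one, so a chain of failing recovery messages stops once the cap is reached. A real plugin would also need to expire entries in both stores.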

Acceptance criteria

  • Subscriptions include the hard failure events listed above.
  • Unit tests cover SMS, iMessage, and email failure webhooks.
  • Tests prove a correlated failure wakes the right session/thread.
  • Tests prove recovery output is delivered on the same channel/thread by default.
  • Tests prove exact [SILENT] suppresses visible delivery.
  • Tests prove text.delivery_unconfirmed does not wake the agent.
  • Tests prove duplicate failure webhooks do not trigger duplicate recovery turns.
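
The [SILENT] and same-channel criteria could be exercised against something like the following Python sketch. The function names and the send callback are hypothetical; the callback stands in for whatever sends on the original channel/thread. Stripping surrounding whitespace before comparing to [SILENT] is an assumption, since the issue only says "exactly".

```python
from typing import Callable, Optional

SILENT = "[SILENT]"


def build_recovery_prompt(channel: str, reason: Optional[str],
                          failed_body: Optional[str] = None) -> str:
    """Build the synthetic recovery turn text described in the fleet standard."""
    lines = [
        f"Your previous outbound {channel} message failed to deliver.",
        f"Reason: {reason or 'unknown'}",
    ]
    if failed_body:
        lines.append(f"Failed message body: {failed_body}")
    lines.append(
        "You may modify, shorten, or retry the message, use tools to switch "
        f"channel, or reply exactly {SILENT} to do nothing visible."
    )
    return "\n".join(lines)


def deliver_recovery_output(final_text: str, send: Callable[[str], None]) -> bool:
    """Send the agent's final text on the original thread unless it is [SILENT].

    Returns True if something was sent, False if delivery was suppressed.
    """
    if final_text.strip() == SILENT:  # assumption: ignore surrounding whitespace
        return False
    send(final_text)
    return True
```

A test can pass a list's append method as send and assert on what was captured: a normal reply is delivered once, a bare [SILENT] is suppressed, and a reply that merely contains [SILENT] alongside other text is still delivered.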

Notes

This should be coordinated with the fleet standardization branch and tracked in admin-console/docs/PLUGIN_FLEET.md.
