feat: add retry with backoff and job timeouts to event worker #38

@rorybyrne

Summary

Add retry logic and timeout handling to the event processing worker. This builds on #35 (event processing redesign) to improve reliability.

Features

1. Retry with Exponential Backoff

Failed events should automatically retry with increasing delays:

Attempt 1: fails → wait 1 min → retry
Attempt 2: fails → wait 5 min → retry
Attempt 3: fails → wait 30 min → retry
Attempt N: fails → move to dead letter (or stop retrying)

Schema changes:

ALTER TABLE events ADD COLUMN attempts INTEGER DEFAULT 0;
ALTER TABLE events ADD COLUMN max_attempts INTEGER DEFAULT 5;
ALTER TABLE events ADD COLUMN next_retry_at TIMESTAMP WITH TIME ZONE;

Query change:

-- fetch_pending() becomes:
SELECT * FROM events
WHERE delivery_status = 'pending'
   OR (delivery_status = 'failed' AND next_retry_at < now() AND attempts < max_attempts)
LIMIT :batch_size
FOR UPDATE SKIP LOCKED
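
The selection rule in the query above can be mirrored in plain Python, which may help when unit-testing retry eligibility without a database. This is only a sketch: `EventRow` and `is_fetchable` are hypothetical names, not part of the existing codebase; only the column names come from the schema changes above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EventRow:
    """Hypothetical in-memory stand-in for a row of the events table."""
    delivery_status: str
    attempts: int = 0
    max_attempts: int = 5
    next_retry_at: Optional[datetime] = None


def is_fetchable(row: EventRow, now: datetime) -> bool:
    """Mirror of the WHERE clause in fetch_pending()."""
    if row.delivery_status == "pending":
        return True
    return (
        row.delivery_status == "failed"
        and row.next_retry_at is not None  # in SQL, NULL < now() is never true
        and row.next_retry_at < now
        and row.attempts < row.max_attempts
    )


now = datetime.now(timezone.utc)
due = EventRow("failed", attempts=2, next_retry_at=now - timedelta(minutes=1))
not_due = EventRow("failed", attempts=2, next_retry_at=now + timedelta(minutes=1))
exhausted = EventRow("failed", attempts=5, next_retry_at=now - timedelta(minutes=1))
```

Here `due` would be picked up again, while `not_due` (retry time still in the future) and `exhausted` (attempts already at max_attempts) would be skipped.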

Backoff calculation:

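from datetime import UTC, datetime, timedelta  # UTC requires Python 3.11+

def calculate_next_retry(attempts: int) -> datetime:
    # Assumes `attempts` counts failures recorded before this one (0 on the
    # first failure), so the first retry waits 1 min as in the schedule above.
    # Exponential backoff: 1min, 5min, 30min, 2hr, 12hr (last value repeats)
    delays = [60, 300, 1800, 7200, 43200]
    delay = delays[min(attempts, len(delays) - 1)]
    return datetime.now(UTC) + timedelta(seconds=delay)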

2. Job Timeouts

Jobs that run longer than the configured timeout should be cancelled and marked as failed:

async def _dispatch(self, event: Event) -> None:
    try:
        # asyncio.timeout() (Python 3.11+) cancels the handler once the deadline passes
        async with asyncio.timeout(self._job_timeout):
            await listener.handle(event)
    except asyncio.TimeoutError:
        await outbox.mark_failed(event.id, "Job timed out")
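
To illustrate the timeout behaviour end to end, here is a minimal runnable sketch. `FakeOutbox`, `slow_handler`, and the free function `dispatch` are hypothetical stand-ins for the real worker pieces. It uses `asyncio.wait_for` instead of `asyncio.timeout` so that it also runs on Python versions before 3.11; the effect on the handler is the same cancellation.

```python
import asyncio


class FakeOutbox:
    """Hypothetical stand-in that just records failures in memory."""

    def __init__(self) -> None:
        self.failed = {}

    async def mark_failed(self, event_id: str, reason: str) -> None:
        self.failed[event_id] = reason


async def slow_handler(event_id: str) -> None:
    await asyncio.sleep(10)  # simulates a job that hangs


async def dispatch(event_id: str, outbox: FakeOutbox, job_timeout: float) -> None:
    try:
        # wait_for cancels the handler coroutine once job_timeout elapses.
        await asyncio.wait_for(slow_handler(event_id), timeout=job_timeout)
    except asyncio.TimeoutError:
        await outbox.mark_failed(event_id, "Job timed out")


outbox = FakeOutbox()
asyncio.run(dispatch("evt-1", outbox, job_timeout=0.05))
print(outbox.failed)
```

Note that cancellation is only delivered at an `await` point. A handler that blocks the event loop with synchronous work would not be interrupted by either `asyncio.timeout` or `asyncio.wait_for`.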

Configuration:

class WorkerConfig:
    job_timeout: int = 300  # seconds (5 minutes default)
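
One possible shape for the configuration once retries are included, sketched as a frozen dataclass. The `max_attempts` field and the validation in `__post_init__` are assumptions based on the task list, not existing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    job_timeout: int = 300  # seconds; 5 minutes, as in the default above
    max_attempts: int = 5   # assumed to mirror the max_attempts column default

    def __post_init__(self) -> None:
        # Reject values that would disable the timeout or prevent any attempt.
        if self.job_timeout <= 0:
            raise ValueError("job_timeout must be positive")
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")


config = WorkerConfig(job_timeout=60)
```

Making the config immutable is a design choice for this sketch; it keeps a running worker from seeing its limits change mid-batch.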

3. Dead Letter Handling

Events that exceed max_attempts should be moved to a dead letter state:

if event.attempts >= event.max_attempts:
    await outbox.mark_dead_letter(event.id)

This could be a new delivery_status value ('dead_letter') or a separate table, so that dead-lettered events remain available for inspection.
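
Putting the retry and dead letter rules together, the state transition on a failed attempt might look like the sketch below. It works on an in-memory object rather than the database; `Event` here and `record_failure` are hypothetical names, and the sketch assumes the `'dead_letter'` status option rather than a separate table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DELAYS = [60, 300, 1800, 7200, 43200]  # seconds: 1min, 5min, 30min, 2hr, 12hr


@dataclass
class Event:
    """Hypothetical in-memory event; field names follow the schema changes."""
    id: str
    delivery_status: str = "pending"
    attempts: int = 0
    max_attempts: int = 5
    next_retry_at: Optional[datetime] = None


def record_failure(event: Event, now: datetime) -> None:
    """Apply one failed attempt: schedule a retry or dead-letter the event."""
    # Index by the number of earlier failures, so the first retry waits 1 min.
    delay = DELAYS[min(event.attempts, len(DELAYS) - 1)]
    event.attempts += 1
    if event.attempts >= event.max_attempts:
        event.delivery_status = "dead_letter"
        event.next_retry_at = None
    else:
        event.delivery_status = "failed"
        event.next_retry_at = now + timedelta(seconds=delay)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed timestamp
event = Event(id="evt-1", max_attempts=3)
history = []
for _ in range(3):
    record_failure(event, now)
    history.append((event.attempts, event.delivery_status))
```

With `max_attempts=3`, the first two failures leave the event in `failed` with a retry scheduled, and the third moves it to `dead_letter`.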

Tasks

  • Add attempts, max_attempts, next_retry_at columns (migration)
  • Update fetch_pending() to include retriable failed events
  • Add calculate_next_retry() backoff logic
  • Update mark_failed() to set next_retry_at and increment attempts
  • Add timeout wrapper in worker dispatch
  • Add dead letter status/handling
  • Add configuration for timeout and max_attempts
  • Update tests

Depends On

Acceptance Criteria

  • Failed events automatically retry up to max_attempts
  • Retry delays increase exponentially
  • Jobs exceeding timeout are killed and marked failed
  • Events exceeding max_attempts stop retrying (dead letter)
  • Retry behavior is configurable
