Summary
Add retry logic and timeout handling to the event processing worker. This builds on top of #35 (event processing redesign) to improve reliability.
Features
1. Retry with Exponential Backoff
Failed events should automatically retry with increasing delays:
Attempt 1: fails → wait 1 min → retry
Attempt 2: fails → wait 5 min → retry
Attempt 3: fails → wait 30 min → retry
Attempt N: fails → move to dead letter (or stop retrying)
Schema changes:
ALTER TABLE events ADD COLUMN attempts INTEGER DEFAULT 0;
ALTER TABLE events ADD COLUMN max_attempts INTEGER DEFAULT 5;
ALTER TABLE events ADD COLUMN next_retry_at TIMESTAMP WITH TIME ZONE;
Query change:
-- fetch_pending() becomes:
SELECT * FROM events
WHERE delivery_status = 'pending'
   OR (delivery_status = 'failed' AND next_retry_at <= now() AND attempts < max_attempts)
LIMIT :batch_size
FOR UPDATE SKIP LOCKED;
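To see which rows the new predicate picks up, here is a small sketch that runs the same WHERE clause against an in-memory SQLite table. It is only an illustration under several assumptions: SQLite has no FOR UPDATE SKIP LOCKED, so the locking clause is omitted; timestamps are stored as integer epoch seconds rather than TIMESTAMP WITH TIME ZONE; and the `fetch_pending_ids` helper, the ORDER BY, and the sample rows are hypothetical.

```python
import sqlite3

# Same predicate as the query above, minus the locking clause (SQLite has none).
# now() is passed in as a parameter and ORDER BY is added so the demo is deterministic.
FETCH_SQL = """
SELECT id FROM events
WHERE delivery_status = 'pending'
   OR (delivery_status = 'failed' AND next_retry_at < :now AND attempts < max_attempts)
ORDER BY id
LIMIT :batch_size
"""

def fetch_pending_ids(conn: sqlite3.Connection, now: int, batch_size: int) -> list:
    rows = conn.execute(FETCH_SQL, {"now": now, "batch_size": batch_size}).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, delivery_status TEXT, "
    "attempts INTEGER DEFAULT 0, max_attempts INTEGER DEFAULT 5, "
    "next_retry_at INTEGER)"
)
conn.executemany(
    "INSERT INTO events (id, delivery_status, attempts, max_attempts, next_retry_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "pending", 0, 5, None),    # never attempted: eligible
        (2, "failed", 1, 5, 900),      # retry time has passed: eligible
        (3, "failed", 1, 5, 2000),     # retry time still in the future: skipped
        (4, "failed", 5, 5, 900),      # attempts exhausted: skipped
        (5, "delivered", 1, 5, None),  # already delivered: skipped
    ],
)

eligible = fetch_pending_ids(conn, now=1000, batch_size=10)
print(eligible)  # prints [1, 2]
```

Event 4 shows why the `attempts < max_attempts` guard matters: without it, an exhausted event would be fetched again on every poll.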
Backoff calculation:
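from datetime import UTC, datetime, timedelta  # datetime.UTC requires Python 3.11+

def calculate_next_retry(attempts: int) -> datetime:
    # Backoff schedule in seconds: 1 min, 5 min, 30 min, 2 hr, 12 hr.
    # To match the schedule above, `attempts` must be the count before it is
    # incremented (0 on the first failure). The last delay is reused once the
    # schedule runs out.
    delays = [60, 300, 1800, 7200, 43200]
    delay = delays[min(attempts, len(delays) - 1)]
    return datetime.now(UTC) + timedelta(seconds=delay)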
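As a quick check of the clamping behaviour, the sketch below restates the function and lists the delay for each attempt count. It assumes `attempts` is the number of failures recorded before the current one (an assumption, since the issue does not say when the counter is incremented). The `delay_seconds` helper and the explicit `now` parameter are hypothetical additions that keep the example deterministic.

```python
from datetime import datetime, timedelta, timezone

DELAYS = [60, 300, 1800, 7200, 43200]  # 1 min, 5 min, 30 min, 2 hr, 12 hr

def delay_seconds(attempts: int) -> int:
    # Hypothetical helper: clamp the index so attempt counts beyond the
    # schedule reuse the last (12 hr) delay instead of raising IndexError.
    return DELAYS[min(attempts, len(DELAYS) - 1)]

def calculate_next_retry(attempts: int, now: datetime) -> datetime:
    # `now` is passed in rather than read from the clock, for repeatable output.
    return now + timedelta(seconds=delay_seconds(attempts))

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
schedule = [delay_seconds(n) for n in range(7)]
print(schedule)
# prints [60, 300, 1800, 7200, 43200, 43200, 43200]
print(calculate_next_retry(0, now).isoformat())
# prints 2024-01-01T12:01:00+00:00
print(calculate_next_retry(6, now).isoformat())
# prints 2024-01-02T00:00:00+00:00
```

With the default max_attempts of 5, the 12 hr entry is only reached if max_attempts is raised, because the fifth failure already exhausts the event.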
2. Job Timeouts
Jobs that run longer than the configured timeout should be cancelled and marked as failed:
import asyncio

async def _dispatch(self, event: Event) -> None:
    try:
        # asyncio.timeout() (Python 3.11+) cancels the handler once the timeout expires.
        async with asyncio.timeout(self._job_timeout):
            await listener.handle(event)
    except asyncio.TimeoutError:
        await outbox.mark_failed(event.id, "Job timed out")
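The timeout path can be exercised in isolation with a deliberately slow handler. The sketch below is self-contained and rests on several assumptions: `InMemoryOutbox`, `slow_handler`, `fast_handler`, and the module-level `dispatch` function are hypothetical stand-ins for the worker's real collaborators, and it uses `asyncio.wait_for` rather than `asyncio.timeout` so that it also runs on Python versions before 3.11.

```python
import asyncio

class InMemoryOutbox:
    """Hypothetical stand-in that records failures instead of writing to the database."""

    def __init__(self) -> None:
        self.failed = {}

    async def mark_failed(self, event_id: int, reason: str) -> None:
        self.failed[event_id] = reason

async def slow_handler(event_id: int) -> None:
    # Simulates a listener that takes far longer than the allowed timeout.
    await asyncio.sleep(10)

async def fast_handler(event_id: int) -> None:
    # Simulates a listener that finishes well within the timeout.
    await asyncio.sleep(0)

async def dispatch(handler, event_id: int, outbox: InMemoryOutbox, job_timeout: float) -> None:
    try:
        # wait_for cancels the handler coroutine once job_timeout elapses.
        await asyncio.wait_for(handler(event_id), timeout=job_timeout)
    except asyncio.TimeoutError:
        await outbox.mark_failed(event_id, "Job timed out")

async def main() -> InMemoryOutbox:
    outbox = InMemoryOutbox()
    await dispatch(fast_handler, 1, outbox, job_timeout=0.05)
    await dispatch(slow_handler, 2, outbox, job_timeout=0.05)
    return outbox

outbox = asyncio.run(main())
print(outbox.failed)  # prints {2: 'Job timed out'}
```

Cancellation here is cooperative: it takes effect at the handler's next await point, so a handler that blocks the event loop with synchronous work would not be interrupted by either `asyncio.timeout` or `asyncio.wait_for`.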
Configuration:
class WorkerConfig:
    job_timeout: int = 300  # seconds (5 minutes by default)
3. Dead Letter Handling
Events that reach max_attempts should be moved to a dead letter state:
if event.attempts >= event.max_attempts:
    await outbox.mark_dead_letter(event.id)
This could be a delivery_status = 'dead_letter' value or a separate table for inspection.
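One way to tie the retry and dead-letter rules together is a single failure transition. The sketch below is an assumption about how `mark_failed()` and `mark_dead_letter()` might combine, not the worker's actual code: the `Event` dataclass, the `record_failure` function, and the `'dead_letter'` status value are hypothetical, and the counter is incremented after the delay is looked up so that the first failure waits 1 minute.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DELAYS = [60, 300, 1800, 7200, 43200]  # seconds, same schedule as calculate_next_retry()

@dataclass
class Event:
    id: int
    delivery_status: str = "pending"
    attempts: int = 0
    max_attempts: int = 5
    next_retry_at: Optional[datetime] = None

def record_failure(event: Event, now: datetime) -> Event:
    # Look up the delay with the pre-increment count, then count this attempt.
    delay = DELAYS[min(event.attempts, len(DELAYS) - 1)]
    event.attempts += 1
    if event.attempts >= event.max_attempts:
        # Out of attempts: park the event for inspection instead of retrying.
        event.delivery_status = "dead_letter"
        event.next_retry_at = None
    else:
        event.delivery_status = "failed"
        event.next_retry_at = now + timedelta(seconds=delay)
    return event

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
event = Event(id=1, max_attempts=3)
history = []
for _ in range(3):
    record_failure(event, now)
    history.append((event.attempts, event.delivery_status, event.next_retry_at))

for attempts, status, retry_at in history:
    print(attempts, status, retry_at)
```

With max_attempts set to 3, the first two failures schedule retries 1 minute and 5 minutes out, and the third moves the event to the dead letter state with no further retry time.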
Tasks
- Add attempts, max_attempts, next_retry_at columns (migration)
- Update fetch_pending() to include retriable failed events
- Implement calculate_next_retry() backoff logic
- Update mark_failed() to set next_retry_at and increment attempts
Depends On
Acceptance Criteria