## Summary
The dead-letter queue (DLQ) defines a size alert threshold, but crossing it only writes to the application log. In production, operators will not notice DLQ growth until users report missing deposits or withdrawals.
## Problem
`DLQ_ALERT_THRESHOLD` is documented in `.env.example` and parsed in `src/config/env.ts`, but `src/stellar/dlq.ts` uses a hardcoded constant:

    const SIZE_ALERT_THRESHOLD = 50

    private static checkSizeAlert(size: number): void {
      if (size >= SIZE_ALERT_THRESHOLD) {
        logger.error(`[DLQ ALERT] Dead-letter queue size is critically high: ${size} events...`)
      }
    }

## Proposed solution
1. Replace the hardcoded `SIZE_ALERT_THRESHOLD` with `config.dlq.alertThreshold` (or equivalent).
2. Add an `AlertingService` abstraction with pluggable channels:
   - `LOG` (default, always on)
   - `SLACK_WEBHOOK_URL` (optional)
   - `PAGERDUTY_ROUTING_KEY` (optional)
3. Emit an alert payload including:
   - current DLQ size
   - count by status (`PENDING`, `RETRIED`, `RESOLVED`)
   - oldest pending event age
   - link to the admin DLQ inspect endpoint
4. Add alert deduplication/cooldown (e.g. re-alert only every 15 minutes while above threshold).
5. Expose DLQ alert state as a Prometheus gauge (e.g. `dlq_alert_active`).
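
To make the proposal concrete, here is a minimal sketch of how the threshold check, pluggable channels, and cooldown could fit together. Everything in it is an assumption for discussion: `AlertChannel`, `DlqAlertPayload`, the `check` method, and the injected `now` clock are hypothetical names, not existing code in this repo. The `active` flag is only a stand-in for whatever would back the `dlq_alert_active` gauge.

```typescript
// Hypothetical sketch: none of these types exist in the codebase yet.
type DlqStatus = "PENDING" | "RETRIED" | "RESOLVED";

interface DlqAlertPayload {
  size: number;
  countByStatus: Record<DlqStatus, number>;
  oldestPendingAgeMs: number | null; // null when nothing is pending
  inspectUrl: string;
}

interface AlertChannel {
  name: string;
  send(payload: DlqAlertPayload): Promise<void>;
}

class AlertingService {
  // True while the latest check saw the queue at or above the threshold.
  active = false;

  private lastAlertAt: number | null = null;
  private readonly channels: AlertChannel[];
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    channels: AlertChannel[],
    threshold: number,
    cooldownMs: number,
    now: () => number = Date.now, // injectable clock so tests can control time
  ) {
    this.channels = channels;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Resolves to true when an alert was dispatched, false when it was skipped.
  async check(payload: DlqAlertPayload): Promise<boolean> {
    this.active = payload.size >= this.threshold;
    if (!this.active) {
      // Dropping below the threshold resets the cooldown, so the next
      // crossing alerts immediately.
      this.lastAlertAt = null;
      return false;
    }

    const t = this.now();
    if (this.lastAlertAt !== null && t - this.lastAlertAt < this.cooldownMs) {
      return false; // still inside the cooldown window
    }
    this.lastAlertAt = t;

    // One failing channel must not prevent the others from being notified.
    for (const channel of this.channels) {
      try {
        await channel.send(payload);
      } catch (err) {
        console.error(`[DLQ ALERT] channel ${channel.name} failed`, err);
      }
    }
    return true;
  }
}
```

Under these assumptions, `dlq.ts` would build the service once with `config.dlq.alertThreshold` (or equivalent) and a cooldown of `15 * 60 * 1000` ms, then call `check(...)` wherever `checkSizeAlert` is called today. Whether dropping below the threshold should reset the cooldown is a design choice worth settling in review.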
## Acceptance criteria
- `DLQ_ALERT_THRESHOLD` from env is the single source of truth
- Crossing the threshold triggers at least one external notification when configured
- Alert includes actionable metadata (size, status breakdown, oldest pending age)
- Cooldown prevents duplicate alerts within the configured window
- Unit tests for threshold logic and cooldown behavior
- Integration test stubs the external webhook without hitting real services
- README/runbook section: “What to do when DLQ alert fires” (inspect → dry-run retry → resolve)
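
For the webhook-stubbing criterion, one possible approach is to inject the HTTP call into the Slack channel so tests can swap in a recorder. This is a sketch under assumptions: `SlackWebhookChannel`, `PostFn`, `DlqAlertSummary`, and `makeRecordingStub` are hypothetical names, and the message format is illustrative. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
// Hypothetical sketch: these names are illustrative, not existing code.
interface DlqAlertSummary {
  size: number;
  pending: number;
  oldestPendingAgeMs: number | null;
  inspectUrl: string;
}

// Minimal HTTP seam: production code could wrap fetch, tests pass a stub.
type PostFn = (url: string, body: string) => Promise<{ ok: boolean; status: number }>;

function formatSlackText(s: DlqAlertSummary): string {
  const age =
    s.oldestPendingAgeMs === null ? "n/a" : `${Math.round(s.oldestPendingAgeMs / 1000)}s`;
  return `[DLQ ALERT] size=${s.size} pending=${s.pending} oldestPending=${age} inspect=${s.inspectUrl}`;
}

class SlackWebhookChannel {
  private readonly url: string;
  private readonly post: PostFn;

  constructor(url: string, post: PostFn) {
    this.url = url;
    this.post = post;
  }

  async send(summary: DlqAlertSummary): Promise<void> {
    const body = JSON.stringify({ text: formatSlackText(summary) });
    const res = await this.post(this.url, body);
    if (!res.ok) {
      throw new Error(`Slack webhook responded with status ${res.status}`);
    }
  }
}

// Test helper: records every request instead of performing network I/O.
function makeRecordingStub(ok: boolean = true): {
  post: PostFn;
  calls: Array<{ url: string; body: string }>;
} {
  const calls: Array<{ url: string; body: string }> = [];
  const post: PostFn = async (url, body) => {
    calls.push({ url, body });
    return { ok, status: ok ? 200 : 500 };
  };
  return { post, calls };
}
```

A test would construct the channel with `makeRecordingStub().post`, call `send(...)`, and assert on the recorded `calls`. In production the same seam could be filled by a thin wrapper around the global `fetch` (Node 18+) that issues a `POST` with a `Content-Type: application/json` header.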