Merged
31 commits
8bd372e
add retry module with exponential backoff for transient errors
Mosas2000 May 17, 2026
47ac78c
add tests for retry module (isRetryable, calculateBackoff, withRetry)
Mosas2000 May 17, 2026
8c31125
wrap PostgresEventStore queries with retry on transient errors
Mosas2000 May 17, 2026
ff5ea64
reflect storage health state in /health endpoint status and status code
Mosas2000 May 17, 2026
4fc0ec1
expand classifyError to cover all retryable postgres and network erro…
Mosas2000 May 17, 2026
7cf196a
add storage retry integration tests covering recovery and exhaustion …
Mosas2000 May 17, 2026
cf99c2c
add health endpoint tests verifying storage state is reflected in res…
Mosas2000 May 17, 2026
8998bbc
add EPIPE and ENETUNREACH to classifyError and extend error classific…
Mosas2000 May 17, 2026
fbe1971
add parseRetryConfig for env-based retry tuning with tests
Mosas2000 May 17, 2026
5fde25a
wire retryConfig into PostgresEventStore and PostgresScheduledTipStor…
Mosas2000 May 17, 2026
3c64ce4
document DB_RETRY_* environment variables in .env.example
Mosas2000 May 17, 2026
0faf525
add connection retry runbook with configuration and behavior document…
Mosas2000 May 17, 2026
eb95f5a
log retry configuration at service startup
Mosas2000 May 17, 2026
3803dc5
track db retry attempts, successes, and exhaustions in metrics
Mosas2000 May 17, 2026
89866ae
add tests for retry metrics tracking (attempts, successes, exhausted)
Mosas2000 May 17, 2026
1472ca1
verify retry counters are exposed in /metrics endpoint
Mosas2000 May 17, 2026
b971240
add unit tests for Metrics.recordDbRetry and toJSON retry fields
Mosas2000 May 17, 2026
2f22344
export sleep helper from retry module for testability
Mosas2000 May 17, 2026
e9d21e2
add edge case tests for isRetryable and withRetry boundary conditions
Mosas2000 May 17, 2026
271c928
add calculateBackoff tests for zero base delay and cap boundary
Mosas2000 May 17, 2026
7cb02e6
add parseRetryConfig tests for large values and float truncation
Mosas2000 May 17, 2026
d9b70ed
add tests for custom shouldRetry predicate in withRetry
Mosas2000 May 17, 2026
55612cd
add PostgresEventStore constructor validation tests
Mosas2000 May 17, 2026
a5a717e
pass retryOptions through createEventStore and createScheduledTipStor…
Mosas2000 May 17, 2026
cb36499
add createEventStore factory tests for memory mode and health checks
Mosas2000 May 17, 2026
605dcac
add retry tests for pool exhaustion, deadlock, and serialization fail…
Mosas2000 May 17, 2026
0de6441
add toErrorResponse tests for storage and connection error classifica…
Mosas2000 May 17, 2026
5efa18a
document retry metrics counters in CONNECTION_RETRY runbook
Mosas2000 May 17, 2026
3a436d9
add changelog for connection retry implementation
Mosas2000 May 17, 2026
fe57bbf
add retry attempt count boundary tests for maxAttempts 1 and 2
Mosas2000 May 17, 2026
929878d
resolve merge conflict in server.js startup log
Mosas2000 May 17, 2026
8 changes: 8 additions & 0 deletions chainhook/.env.example
@@ -17,6 +17,14 @@ DB_POOL_IDLE_TIMEOUT_MS=30000
DB_POOL_CONNECTION_TIMEOUT_MS=5000
DB_STATEMENT_TIMEOUT_MS=30000

# Database Connection Retry Configuration
# Maximum number of attempts before giving up on a transient failure (default: 5)
DB_RETRY_MAX_ATTEMPTS=5
# Base delay in milliseconds for exponential backoff (default: 200)
DB_RETRY_BASE_DELAY_MS=200
# Maximum delay cap in milliseconds between retry attempts (default: 30000)
DB_RETRY_MAX_DELAY_MS=30000

# CORS Security - Comma-separated list of allowed origins
CORS_ALLOWED_ORIGINS=http://localhost:3000,http://localhost:3001

85 changes: 85 additions & 0 deletions chainhook/CHANGELOG_CONNECTION_RETRY.md
@@ -0,0 +1,85 @@
# Changelog: Connection Retry Logic (Issue #400)

## Summary

Implements automatic retry with exponential backoff for transient database
connection failures in the chainhook service. The service no longer crashes
or requires a manual restart when the database is temporarily unavailable.

## Changes

### New Files

- `chainhook/retry.js` — Core retry module with `withRetry`, `isRetryable`,
`calculateBackoff`, and `parseRetryConfig` exports.
- `chainhook/retry.test.js` — 59 tests covering all retry module functions.
- `chainhook/storage-retry.test.js` — 35 integration tests for storage-level
retry behavior including recovery, exhaustion, and custom predicates.
- `chainhook/health-retry.test.js` — 8 tests for the health endpoint and
metrics endpoint retry counter fields.
- `chainhook/metrics-retry.test.js` — 8 unit tests for `Metrics.recordDbRetry`.
- `chainhook/CONNECTION_RETRY.md` — Operator runbook covering behavior,
configuration, logging, and troubleshooting.

### Modified Files

- `chainhook/storage.js`
  - Added `withRetry` wrapper to all `PostgresEventStore` query methods.
  - Added `withRetry` wrapper to all `PostgresScheduledTipStore` query methods.
  - Added connection probe in `#initialize()` for both store classes.
  - `health()` now returns `{ healthy: false }` instead of throwing when the
    database is unreachable.
  - `PostgresEventStore` and `PostgresScheduledTipStore` constructors accept
    `retryOptions` to override defaults per-instance.
  - `createEventStore` and `createScheduledTipStore` factories pass
    `retryOptions` through to the store constructors.
  - Imported `parseRetryConfig` to read retry settings from environment.

- `chainhook/errors.js`
  - Extended `classifyError` to cover all retryable PostgreSQL error codes
    (`08000`, `08001`, `08003`, `08004`, `08006`, `40001`, `40P01`) and
    Node.js codes (`EPIPE`, `ENETUNREACH`).
  - Added message-pattern matching for `connection terminated`, `connection
    reset`, `too many connections`, `client checkout timed out`, `idle timeout`.

- `chainhook/server.js`
  - `/health` endpoint returns HTTP `503` with `status: "degraded"` when
    storage health check fails, instead of always returning `200`.
  - Startup log now includes `db_retry_max_attempts` and `db_retry_base_delay_ms`.

- `chainhook/metrics.js`
  - Added `dbRetryAttempts`, `dbRetrySuccesses`, `dbRetryExhausted` counters.
  - Added `recordDbRetry(outcome)` method.
  - `toJSON()` includes `db_retry_attempts`, `db_retry_successes`,
    `db_retry_exhausted` fields.

- `chainhook/.env.example`
  - Added `DB_RETRY_MAX_ATTEMPTS`, `DB_RETRY_BASE_DELAY_MS`,
    `DB_RETRY_MAX_DELAY_MS` with documentation.

### Test Coverage

| File | Tests |
|-------------------------------|-------|
| retry.test.js | 59 |
| storage-retry.test.js | 35 |
| health-retry.test.js | 8 |
| metrics-retry.test.js | 8 |
| errors.test.js (additions) | 20 |
| storage.test.js (additions) | 5 |
| **Total new/updated tests** | **135** |

All 346 tests pass.

## Acceptance Criteria

- [x] Implement retry logic — `withRetry` in `retry.js`, applied to all
Postgres query methods in `storage.js`
- [x] Exponential backoff (max 5 retries) — configurable via `DB_RETRY_*`
env vars, defaults to 5 attempts with 200ms base delay
- [x] Log retry attempts — `WARN` log on each retry, `ERROR` on exhaustion
- [x] Update health endpoint — returns `503` with `healthy: false` when
storage is unreachable
- [x] Graceful error handling — service continues running, returns `503` to
callers instead of crashing
- [x] Add tests for retry scenarios — 135 new/updated tests across 6 files
226 changes: 226 additions & 0 deletions chainhook/CONNECTION_RETRY.md
@@ -0,0 +1,226 @@
# Connection Retry Logic

## Overview

The chainhook service implements automatic retry with exponential backoff for transient database connection failures. When a PostgreSQL operation fails due to a network or connection error, the service retries the operation up to a configurable maximum before propagating the error.

This eliminates the need for manual restarts when the database is temporarily unavailable (e.g., during a rolling restart, network blip, or connection pool exhaustion).

## Behavior

### Retry Strategy

- **Algorithm**: Exponential backoff with additive jitter
- **Default max attempts**: 5
- **Default base delay**: 200ms
- **Default max delay**: 30 seconds
- **Jitter**: Up to 100ms added to each delay to prevent thundering herd

The delay for index `n` (zero-indexed, so `n = attempt - 1` for the attempt numbers in the table below) is:

```
delay = min(baseDelay * 2^n, maxDelay) + random(0, jitter)
```

Example delays with defaults (no jitter):

| Attempt | Delay |
|---------|---------|
| 1 | 200ms |
| 2 | 400ms |
| 3 | 800ms |
| 4 | 1600ms |
| 5 | 3200ms |
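
The formula can be sketched in a few lines of JavaScript. This is an illustration only: the option names (`baseDelayMs`, `maxDelayMs`, `jitterMs`) and the injectable `random` parameter are assumptions, and the actual `calculateBackoff` in `retry.js` may use a different signature.

```javascript
// Hypothetical sketch of the backoff formula; not the real retry.js code.
function calculateBackoff(
  n, // zero-indexed, as in the formula above
  { baseDelayMs = 200, maxDelayMs = 30000, jitterMs = 100 } = {},
  random = Math.random, // injectable so tests can make jitter deterministic
) {
  // Exponential growth, capped at maxDelayMs.
  const exponential = Math.min(baseDelayMs * 2 ** n, maxDelayMs);
  // Additive jitter in the range [0, jitterMs).
  const jitter = Math.floor(random() * jitterMs);
  return exponential + jitter;
}

// With jitter disabled, the delays match the table above.
const delays = [0, 1, 2, 3, 4].map((n) => calculateBackoff(n, { jitterMs: 0 }));
console.log(delays); // [ 200, 400, 800, 1600, 3200 ]
```

Because the jitter is added after the cap is applied, a delay in this sketch can exceed `maxDelayMs` by up to `jitterMs`.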

### Retryable Errors

The following errors trigger a retry:

**Node.js network errors:**
- `ECONNREFUSED` — database not accepting connections
- `ECONNRESET` — connection dropped mid-operation
- `ETIMEDOUT` — connection attempt timed out
- `EPIPE` — broken pipe on an established connection
- `EHOSTUNREACH` — host unreachable
- `ENETUNREACH` — network unreachable

**PostgreSQL error codes:**
- `08000` — connection exception
- `08001` — client unable to establish connection
- `08003` — connection does not exist
- `08004` — server rejected connection
- `08006` — connection failure
- `57P03` — cannot connect now (database starting up)
- `53300` — too many connections
- `40001` — serialization failure (safe to retry)
- `40P01` — deadlock detected (safe to retry)

**Message patterns:**
- `connection refused`
- `connection terminated`
- `connection reset`
- `cannot connect`
- `too many connections`
- `client checkout timed out`
- `idle timeout`

### Non-Retryable Errors

Errors that indicate a programming or data problem are not retried:

- Constraint violations (`23505` duplicate key, etc.)
- Syntax errors (`42601`)
- Invalid input (`22003`, `22P02`)
- Permission errors (`42501`)
- Any error not matching the retryable patterns above
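
The classification above could be expressed roughly as follows. This is a sketch built only from the lists in this section; the constant names are made up for illustration, and the real `isRetryable` in `retry.js` may be structured differently.

```javascript
// Hypothetical sketch; codes and patterns are copied from the lists above.
const RETRYABLE_CODES = new Set([
  // Node.js network errors
  'ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'EPIPE', 'EHOSTUNREACH', 'ENETUNREACH',
  // PostgreSQL error codes
  '08000', '08001', '08003', '08004', '08006', '57P03', '53300', '40001', '40P01',
]);

const RETRYABLE_PATTERNS = [
  'connection refused',
  'connection terminated',
  'connection reset',
  'cannot connect',
  'too many connections',
  'client checkout timed out',
  'idle timeout',
];

function isRetryable(error) {
  if (!error) return false;
  if (RETRYABLE_CODES.has(error.code)) return true;
  // Fall back to case-insensitive message matching.
  const message = String(error.message || '').toLowerCase();
  return RETRYABLE_PATTERNS.some((pattern) => message.includes(pattern));
}

console.log(isRetryable({ code: 'ECONNREFUSED' })); // true
console.log(isRetryable({ code: '23505', message: 'duplicate key value' })); // false
```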

### Health Check

The `/health` endpoint uses a reduced retry budget (2 attempts) to avoid blocking health checks during extended outages. When the database is unreachable after retries, the endpoint returns:

```json
{
"status": "degraded",
"storage": {
"healthy": false,
"error": "connect ECONNREFUSED 127.0.0.1:5432"
}
}
```

with HTTP status `503`.

When healthy:

```json
{
"status": "healthy",
"storage": {
"healthy": true,
"storage_mode": "postgres",
"total_events": 1234
}
}
```

with HTTP status `200`.
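
The mapping from storage state to HTTP response might look like the sketch below. The function name `buildHealthResponse` is hypothetical and the framework wiring is omitted; the actual handler in `server.js` may include more fields.

```javascript
// Hypothetical sketch: turn the result of the storage health() call into
// the status code and body shown above.
function buildHealthResponse(storageHealth) {
  const healthy = storageHealth?.healthy === true;
  return {
    statusCode: healthy ? 200 : 503,
    body: {
      status: healthy ? 'healthy' : 'degraded',
      // Treat a missing result as unhealthy rather than throwing.
      storage: storageHealth ?? { healthy: false },
    },
  };
}

const down = buildHealthResponse({
  healthy: false,
  error: 'connect ECONNREFUSED 127.0.0.1:5432',
});
console.log(down.statusCode, down.body.status); // 503 degraded
```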

## Configuration

Retry behavior is tunable via environment variables:

| Variable | Default | Description |
|------------------------|---------|--------------------------------------------------|
| `DB_RETRY_MAX_ATTEMPTS`| `5` | Maximum total attempts (1 = no retries) |
| `DB_RETRY_BASE_DELAY_MS`| `200` | Base delay in ms for exponential backoff |
| `DB_RETRY_MAX_DELAY_MS`| `30000` | Maximum delay cap in ms between attempts |
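
A minimal sketch of how these variables could be parsed is shown below. The returned field names and the fallback rules (invalid or out-of-range values fall back to the default, fractional values are truncated) are assumptions; the real `parseRetryConfig` may validate differently.

```javascript
// Hypothetical sketch of env-based retry configuration.
function parseIntAtLeast(value, min, fallback) {
  // parseInt truncates "500.9" to 500 and yields NaN for missing or junk input.
  const parsed = Number.parseInt(value, 10);
  return Number.isFinite(parsed) && parsed >= min ? parsed : fallback;
}

function parseRetryConfig(env = process.env) {
  return {
    maxAttempts: parseIntAtLeast(env.DB_RETRY_MAX_ATTEMPTS, 1, 5),
    baseDelayMs: parseIntAtLeast(env.DB_RETRY_BASE_DELAY_MS, 0, 200),
    maxDelayMs: parseIntAtLeast(env.DB_RETRY_MAX_DELAY_MS, 0, 30000),
  };
}

console.log(parseRetryConfig({}));
// { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 30000 }
```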

### Tuning for Different Environments

**Development** (fast feedback):
```bash
DB_RETRY_MAX_ATTEMPTS=2
DB_RETRY_BASE_DELAY_MS=50
DB_RETRY_MAX_DELAY_MS=500
```

**Production** (resilient to longer outages):
```bash
DB_RETRY_MAX_ATTEMPTS=5
DB_RETRY_BASE_DELAY_MS=200
DB_RETRY_MAX_DELAY_MS=30000
```

**High-availability** (tolerate rolling restarts):
```bash
DB_RETRY_MAX_ATTEMPTS=8
DB_RETRY_BASE_DELAY_MS=500
DB_RETRY_MAX_DELAY_MS=60000
```

## Logging

Each retry attempt is logged at `WARN` level:

```json
{
"level": "WARN",
"message": "Retrying operation after transient error",
"operation": "postgres_insert_events",
"attempt": 2,
"max_attempts": 5,
"delay_ms": 400,
"error_code": "ECONNREFUSED",
"error_message": "connect ECONNREFUSED 127.0.0.1:5432"
}
```

When all attempts are exhausted, the final failure is logged at `ERROR` level:

```json
{
"level": "ERROR",
"message": "Operation failed after retries",
"operation": "postgres_insert_events",
"attempts": 5
}
```
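
To show where these log entries would come from, here is a simplified retry loop. It is a sketch, not the code in `retry.js`: the option names, the default `shouldRetry` predicate, and the `log` callback are assumptions, jitter is omitted, and `attempt` is assumed to be the 1-indexed attempt that just failed.

```javascript
// Hypothetical sketch of a withRetry loop that emits the WARN/ERROR entries above.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(operation, {
  name = 'operation',
  maxAttempts = 5,
  baseDelayMs = 200,
  maxDelayMs = 30000,
  shouldRetry = (error) => ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT'].includes(error?.code),
  log = (entry) => console.log(JSON.stringify(entry)),
} = {}) {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      // Non-retryable errors propagate immediately.
      if (!shouldRetry(error)) throw error;
      if (attempt >= maxAttempts) {
        log({ level: 'ERROR', message: 'Operation failed after retries', operation: name, attempts: attempt });
        throw error;
      }
      const delayMs = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
      log({
        level: 'WARN',
        message: 'Retrying operation after transient error',
        operation: name,
        attempt,
        max_attempts: maxAttempts,
        delay_ms: delayMs,
        error_code: error.code,
        error_message: error.message,
      });
      await sleep(delayMs);
    }
  }
}

// Usage: an operation that fails twice with ECONNREFUSED, then succeeds.
let demoCalls = 0;
withRetry(async () => {
  demoCalls += 1;
  if (demoCalls < 3) {
    throw Object.assign(new Error('connect ECONNREFUSED 127.0.0.1:5432'), { code: 'ECONNREFUSED' });
  }
  return 'inserted';
}, { name: 'postgres_insert_events', baseDelayMs: 10 }).then((result) => console.log(result));
```

In this sketch no delay is scheduled after the final attempt, so `maxAttempts: 5` produces at most four waits.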

## Graceful Degradation

When the database is unavailable and retries are exhausted:

1. **Event ingestion** (`POST /api/chainhook/events`): Returns `503 Service Unavailable` with `Retry-After: 30` header. The Chainhook node will retry delivery.
2. **Read endpoints** (`GET /api/tips`, etc.): Return `503 Service Unavailable`.
3. **Health endpoint** (`GET /health`): Returns `503` with `status: "degraded"`.
4. **Metrics endpoint** (`GET /metrics`): Returns `503`.

The service does not crash. It continues accepting requests and retrying database operations as configured.

## Implementation

The retry logic lives in `chainhook/retry.js` and is used by:

- `chainhook/storage.js` — all `PostgresEventStore` and `PostgresScheduledTipStore` methods
- `chainhook/errors.js` — `classifyError` maps connection errors to `StorageUnavailableError`
- `chainhook/server.js` — health endpoint reflects storage health state

## Testing

```bash
# Run all retry-related tests
cd chainhook
npm test -- retry.test.js
npm test -- storage-retry.test.js
npm test -- health-retry.test.js

# Run the full suite
npm test
```

Test coverage includes:

- `isRetryable` — all retryable and non-retryable error codes
- `calculateBackoff` — exponential growth, cap, and jitter
- `withRetry` — success, recovery, non-retryable bail-out, exhaustion
- `parseRetryConfig` — env var parsing and defaults
- Storage integration — recovery from transient failures in query operations
- Health endpoint — 200/503 based on storage state

## Metrics

Retry activity is tracked in the `/metrics` endpoint:

```json
{
"db_retry_attempts": 12,
"db_retry_successes": 10,
"db_retry_exhausted": 2
}
```

- `db_retry_attempts` — total number of retry attempts made (not counting the first try)
- `db_retry_successes` — operations that eventually succeeded after one or more retries
- `db_retry_exhausted` — operations that failed after exhausting all retry attempts

A healthy service should have `db_retry_exhausted` at or near zero. A non-zero value indicates the database was unreachable for longer than the configured retry window.
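
The counters could be maintained along the lines of the sketch below. The class name and the outcome strings (`'attempt'`, `'success'`, `'exhausted'`) are assumptions for illustration; only the counter and JSON field names come from this PR.

```javascript
// Hypothetical sketch of retry counters; the real Metrics class tracks more than this.
class RetryMetricsSketch {
  constructor() {
    this.dbRetryAttempts = 0;
    this.dbRetrySuccesses = 0;
    this.dbRetryExhausted = 0;
  }

  // Assumed outcome values: 'attempt', 'success', 'exhausted'.
  recordDbRetry(outcome) {
    if (outcome === 'attempt') this.dbRetryAttempts += 1;
    else if (outcome === 'success') this.dbRetrySuccesses += 1;
    else if (outcome === 'exhausted') this.dbRetryExhausted += 1;
    // Unknown outcomes are ignored in this sketch.
  }

  toJSON() {
    return {
      db_retry_attempts: this.dbRetryAttempts,
      db_retry_successes: this.dbRetrySuccesses,
      db_retry_exhausted: this.dbRetryExhausted,
    };
  }
}

const metrics = new RetryMetricsSketch();
metrics.recordDbRetry('attempt');
metrics.recordDbRetry('attempt');
metrics.recordDbRetry('success');
console.log(JSON.stringify(metrics));
// {"db_retry_attempts":2,"db_retry_successes":1,"db_retry_exhausted":0}
```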
10 changes: 8 additions & 2 deletions chainhook/errors.js
@@ -74,11 +74,17 @@ export function classifyError(error) {
   const code = error?.code;
   const message = lowerMessage;
   if (
-    ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'EHOSTUNREACH', 'ENOTFOUND', '57P03', '53300'].includes(code) ||
+    ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'EHOSTUNREACH', 'ENOTFOUND', 'EPIPE', 'ENETUNREACH',
+     '57P03', '53300', '08000', '08003', '08006', '08001', '08004', '40001', '40P01'].includes(code) ||
     message.includes('postgres') ||
     message.includes('database') ||
     message.includes('connection refused') ||
-    message.includes('cannot connect')
+    message.includes('connection terminated') ||
+    message.includes('connection reset') ||
+    message.includes('cannot connect') ||
+    message.includes('too many connections') ||
+    message.includes('client checkout timed out') ||
+    message.includes('idle timeout')
   ) {
     return new StorageUnavailableError(error?.message || 'storage unavailable', {
       reason: error?.message || String(error),