Feat/multi region replication monitoring #715

Open
TheCreatorNode wants to merge 2 commits into SoroScan:main from TheCreatorNode:feat/multi-region-replication-monitoring

@TheCreatorNode

Contributor

fixes #537

- Create prune_events command to delete old ContractEvents
- Support configurable retention period via --retention-days or EVENT_RETENTION_DAYS setting
- Include --dry-run flag to preview deletions without executing
- Add comprehensive test suite covering all functionality
- Log number of deleted records with colored output

Fixes SoroScan#366

## Description
Implements comprehensive replication lag monitoring for multi-region database deployments.

## Changes

### Core Implementation
- **soroscan/ingest/replication.py**: New ReplicationLagMonitor class with:
  - LSN-based lag measurement (fast method)
  - Write-test lag measurement (accurate method)
  - Replica health status checks
  - Configurable threshold-based alerting
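
A minimal sketch of the arithmetic behind an LSN-based check, assuming the database is PostgreSQL (the PR does not name the engine). On PostgreSQL, `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica return LSNs as text like `16/B374D848`, two hexadecimal halves of a 64-bit byte position. The helper names below are hypothetical and are not taken from `ReplicationLagMonitor`.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_byte_lag(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not replayed yet (clamped at zero)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))
```

This yields lag in bytes, not seconds. One common way to get a time-based figure on a PostgreSQL replica is `now() - pg_last_xact_replay_timestamp()`; whether the PR uses that query is not stated.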

### Prometheus Metrics
- soroscan_replication_lag_seconds: Current lag in seconds (Gauge)
- soroscan_replication_lag_checks_total: Total checks performed (Counter)
- soroscan_replication_status: Health status 1=healthy, 0=unhealthy (Gauge)
- soroscan_replication_alerts_total: Total alerts triggered (Counter)
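
To make the metric types concrete, here is a dependency-free sketch that renders the four metrics in the Prometheus text exposition format. It is a hand-rolled stand-in, not the PR's code, which presumably uses a client library. The `region` label and the `render_metrics` function are assumptions for illustration.

```python
def render_metrics(region: str, lag_seconds: float, healthy: bool,
                   checks_total: int, alerts_total: int) -> str:
    """Render the four replication metrics as Prometheus exposition text."""
    samples = [
        ("soroscan_replication_lag_seconds", "gauge",
         "Current lag in seconds", lag_seconds),
        ("soroscan_replication_lag_checks_total", "counter",
         "Total checks performed", checks_total),
        ("soroscan_replication_status", "gauge",
         "Health status 1=healthy, 0=unhealthy", 1 if healthy else 0),
        ("soroscan_replication_alerts_total", "counter",
         "Total alerts triggered", alerts_total),
    ]
    lines = []
    for name, kind, help_text, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        # The region label is hypothetical; the PR does not list label names.
        lines.append(f'{name}{{region="{region}"}} {value}')
    return "\n".join(lines) + "\n"
```

Gauges can go up or down between scrapes, which suits lag and health status; counters only increase, which suits running totals of checks and alerts.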

### Celery Tasks
- monitor_replication_lag(): Periodic lag measurement and alerting
- check_replica_health(): Comprehensive replica status validation
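
Periodic tasks like these are usually registered with Celery beat. The fragment below is a sketch of what a schedule could look like in Django settings, assuming the Celery app loads settings with the `CELERY` namespace. The dotted task paths and the intervals are hypothetical; the PR states neither.

```python
# Hypothetical module paths and intervals, shown only to illustrate the shape
# of a beat schedule. A plain number as "schedule" means seconds between runs.
CELERY_BEAT_SCHEDULE = {
    "monitor-replication-lag": {
        "task": "soroscan.ingest.tasks.monitor_replication_lag",
        "schedule": 30.0,
    },
    "check-replica-health": {
        "task": "soroscan.ingest.tasks.check_replica_health",
        "schedule": 300.0,
    },
}
```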

### CLI Tool
- check_replication_lag command with:
  - Single check or continuous monitoring modes
  - Configurable check intervals
  - LSN or write-test measurement methods
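
The single-check and continuous modes can be sketched with plain `argparse`, outside Django, so the example runs on its own. Only `--continuous` appears in this PR; `--interval` and `--method` are hypothetical flag names, and `measure` is a stand-in callable for whatever the real command invokes.

```python
import argparse
import time
from typing import Callable, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="check_replication_lag")
    parser.add_argument("--continuous", action="store_true",
                        help="keep checking until interrupted")
    # Hypothetical flags: the PR describes these options but not their names.
    parser.add_argument("--interval", type=float, default=10.0,
                        help="seconds between checks in continuous mode")
    parser.add_argument("--method", choices=["lsn", "write-test"], default="lsn")
    return parser


def run(args: argparse.Namespace,
        measure: Callable[[str], float],
        sleep: Callable[[float], None] = time.sleep,
        max_checks: Optional[int] = None) -> List[float]:
    """Take one reading, or loop when --continuous is set.

    max_checks exists only so the loop can be bounded in tests.
    """
    readings: List[float] = []
    while True:
        readings.append(measure(args.method))
        done = max_checks is not None and len(readings) >= max_checks
        if not args.continuous or done:
            return readings
        sleep(args.interval)
```

Passing `measure` and `sleep` in as arguments keeps the loop testable without a database or real delays.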

### Monitoring & Alerting
- Grafana dashboard panels for lag visualization
- Alert rules for warning/critical thresholds
- Health status monitoring
- Check failure detection

### Documentation
- Comprehensive configuration guide
- Usage instructions
- Troubleshooting guide
- Performance impact analysis

## Acceptance Criteria ✅
- [x] Replication lag measured
- [x] Metrics exported to Prometheus
- [x] Dashboard shows lag over time
- [x] Alerts on lag > threshold

## Configuration
Set environment variables:
- REPLICA_DB_ALIAS: Database alias for replica connection
- REPLICATION_LAG_THRESHOLD_SECONDS: Warning threshold (default: 5s)
- REPLICATION_LAG_ALERT_THRESHOLD_SECONDS: Critical threshold (default: 10s)
- REGION_NAME: Region identifier for multi-region deployments
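
A sketch of how the two thresholds might be read and applied, using the variable names and defaults listed above. The function names and the `ok`/`warning`/`critical` labels are assumptions, and the strict `>` comparison follows the "lag > threshold" wording in the acceptance criteria.

```python
import os
from typing import Mapping, Tuple


def load_thresholds(env: Mapping[str, str] = os.environ) -> Tuple[float, float]:
    """Return (warning, critical) thresholds in seconds, with the documented defaults."""
    warning = float(env.get("REPLICATION_LAG_THRESHOLD_SECONDS", "5"))
    critical = float(env.get("REPLICATION_LAG_ALERT_THRESHOLD_SECONDS", "10"))
    return warning, critical


def classify_lag(lag_seconds: float, warning: float, critical: float) -> str:
    """Map a lag reading to a severity label (labels are hypothetical)."""
    if lag_seconds > critical:
        return "critical"
    if lag_seconds > warning:
        return "warning"
    return "ok"
```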

## Testing
- Run single check: python manage.py check_replication_lag
- Run continuous: python manage.py check_replication_lag --continuous
- View metrics: curl http://localhost:8000/metrics | grep replication

@drips-wave

drips-wave Bot commented Jun 1, 2026

@TheCreatorNode Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Development

Successfully merging this pull request may close these issues.

Multi-Region Replication Monitoring
