Feat/multi region replication monitoring #715
Open
TheCreatorNode wants to merge 2 commits into
- Create prune_events command to delete old ContractEvents
- Support configurable retention period via --retention-days or EVENT_RETENTION_DAYS setting
- Include --dry-run flag to preview deletions without executing
- Add comprehensive test suite covering all functionality
- Log number of deleted records with colored output

Fixes SoroScan#366
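The retention and dry-run behaviour listed above can be sketched in plain Python. This is a minimal illustration, not the command in this commit: the `prune_events` function, the dict-based events, and the `timestamp` key are hypothetical stand-ins for the ContractEvents model and its queryset.

```python
from datetime import datetime, timedelta, timezone


def prune_events(events, retention_days, dry_run=False, now=None):
    """Return (remaining_events, expired_count) for a retention window.

    `events` is a list of dicts with a timezone-aware "timestamp" key
    (a hypothetical stand-in for ContractEvents rows).
    """
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")

    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)

    # Events strictly older than the cutoff are candidates for deletion.
    expired = [e for e in events if e["timestamp"] < cutoff]

    if dry_run:
        # Preview only: report the count, leave the data untouched.
        return list(events), len(expired)

    kept = [e for e in events if e["timestamp"] >= cutoff]
    return kept, len(expired)
```

In a dry run the caller gets the same count it would see after a real run, so the preview and the actual deletion cannot disagree about which records are affected.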
## Description

Implements comprehensive replication lag monitoring for multi-region database deployments.

## Changes

### Core Implementation

- **soroscan/ingest/replication.py**: New ReplicationLagMonitor class with:
  - LSN-based lag measurement (fast method)
  - Write-test lag measurement (accurate method)
  - Replica health status checks
  - Configurable threshold-based alerting

### Prometheus Metrics

- soroscan_replication_lag_seconds: Current lag in seconds (Gauge)
- soroscan_replication_lag_checks_total: Total checks performed (Counter)
- soroscan_replication_status: Health status 1=healthy, 0=unhealthy (Gauge)
- soroscan_replication_alerts_total: Total alerts triggered (Counter)

### Celery Tasks

- monitor_replication_lag(): Periodic lag measurement and alerting
- check_replica_health(): Comprehensive replica status validation

### CLI Tool

- check_replication_lag command with:
  - Single check or continuous monitoring modes
  - Configurable check intervals
  - LSN or write-test measurement methods

### Monitoring & Alerting

- Grafana dashboard panels for lag visualization
- Alert rules for warning/critical thresholds
- Health status monitoring
- Check failure detection

### Documentation

- Comprehensive configuration guide
- Usage instructions
- Troubleshooting guide
- Performance impact analysis

## Acceptance Criteria ✅

- [x] Replication lag measured
- [x] Metrics exported to Prometheus
- [x] Dashboard shows lag over time
- [x] Alerts on lag > threshold

## Configuration

Set environment variables:

- REPLICA_DB_ALIAS: Database alias for replica connection
- REPLICATION_LAG_THRESHOLD_SECONDS: Warning threshold (default: 5s)
- REPLICATION_LAG_ALERT_THRESHOLD_SECONDS: Critical threshold (default: 10s)
- REGION_NAME: Region identifier for multi-region deployments

## Testing

- Run single check: python manage.py check_replication_lag
- Run continuous: python manage.py check_replication_lag --continuous
- View metrics: curl http://localhost:8000/metrics | grep replication
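As a rough illustration of the threshold-based alerting described above, the sketch below reads the two threshold variables with their documented defaults and classifies a measured lag. The function names `load_thresholds` and `classify_lag` and the "ok"/"warning"/"critical" labels are assumptions made for this example; they are not taken from ReplicationLagMonitor.

```python
import os


def load_thresholds(env=None):
    """Read warning/critical thresholds (seconds) from the environment.

    Defaults follow the documented values: 5s warning, 10s critical.
    """
    env = os.environ if env is None else env
    warning = float(env.get("REPLICATION_LAG_THRESHOLD_SECONDS", "5"))
    critical = float(env.get("REPLICATION_LAG_ALERT_THRESHOLD_SECONDS", "10"))
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    return warning, critical


def classify_lag(lag_seconds, warning, critical):
    """Map a measured lag to a severity label (hypothetical labels).

    Uses strict comparison, matching the "lag > threshold" criterion.
    """
    if lag_seconds > critical:
        return "critical"
    if lag_seconds > warning:
        return "warning"
    return "ok"
```

A lag exactly equal to a threshold is treated as below it here, which follows the "Alerts on lag > threshold" wording; a deployment that wants inclusive thresholds would switch to `>=`.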
@TheCreatorNode Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
fixes #537