add prometheus integration #172
Conversation
📝 WalkthroughWalkthroughThis PR adds Prometheus metrics support: new env vars to enable/configure metrics, a Changes
Sequence DiagramsequenceDiagram
actor Main as Application Main
participant Provider as Metrics Provider
participant Services as Services Layer
participant DB as Database
participant HTTP as HTTP Server (/metrics)
participant Prom as Prometheus Scraper
Main->>Provider: InitMetrics(env, db, servs)
alt Metrics Enabled
Provider->>Prom: Register DB Stats Collector
Provider->>HTTP: Start HTTP Server on PBW_METRICS_LISTEN_HOST:PORT
Provider->>Provider: Start Background Goroutine (1min interval)
loop Every 1 Minute
Provider->>Services: Query databases, destinations, backups
Services->>DB: run queries
DB-->>Services: health & status data
Services-->>Provider: health & status data
Provider->>Provider: Update HealthStatus, ResourceCounts, BackupTasksStatus
end
rect rgba(100,150,200,0.5)
Note over HTTP,Prom: Prometheus scrapes /metrics
Prom->>HTTP: GET /metrics
HTTP-->>Prom: Exposition response
end
else Metrics Disabled
Provider-->>Main: Return (no-op)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
internal/metrics/metrics_test.go (1)
14-45: Prefer table-driven subtests (t.Run) for this_test.gofile.This test should follow the project’s test structure convention, even with a single case.
As per coding guidelines, "Write table-driven tests using t.Run with a slice of structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/metrics/metrics_test.go` around lines 14 - 45, Convert TestMetricsExposure into a table-driven test that uses t.Run even for the single case: create a slice of test structs (e.g., {name, labels/setup func, expected []string}) and iterate over them, calling t.Run(tc.name, func(t *testing.T) { ... }) to perform the metric updates (using BackupsRunning, BackupsTotal, LastBackupDuration, LastBackupStatus, HealthStatus, HealthyResourcesCount, TotalResourcesCount, BackupTasksStatus), invoke promhttp.Handler().ServeHTTP on an httptest request, assert the response contains the expected strings, and perform cleanup (e.g., BackupsRunning.Dec()) inside the subtest; update TestMetricsExposure to reference the test case fields so the test structure follows the project's table-driven convention.internal/metrics/provider/provider.go (1)
29-96: Metrics collector loop requires cancellation context and operation timeouts.The infinite loop runs with
context.Background()andtime.Sleep(1 * time.Minute)only. Service calls (GetAllDatabases,GetAllDestinations,GetAllBackups) have no deadline, causing potential goroutine hangs during shutdown or when dependencies stall. Use a context with cancellation and apply per-operation timeouts to all service calls.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/metrics/provider/provider.go` around lines 29 - 96, The goroutine uses context.Background() and time.Sleep, causing un-cancelable long-running operations; change it to accept/derive a cancellable parent context and replace time.Sleep with a time.Ticker + select so the loop can stop on ctx.Done(); for each service call (servs.DatabasesService.GetAllDatabases, servs.DestinationsService.GetAllDestinations, servs.BackupsService.GetAllBackups) wrap the call with a short per-operation timeout via context.WithTimeout (and defer cancel) to avoid hangs; ensure you check ctx.Done() between operations and cancel per-call contexts to prevent leaks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.env.example:
- Around line 24-30: The new .env.example entries PBW_ENABLE_METRICS,
PBW_METRICS_LISTEN_HOST, and PBW_METRICS_LISTEN_PORT currently use surrounding
quotes which triggers dotenv-linter's QuoteCharacter rule; edit the .env.example
to remove the surrounding double quotes so the variables are unquoted (e.g.,
PBW_ENABLE_METRICS=false, PBW_METRICS_LISTEN_HOST=0.0.0.0,
PBW_METRICS_LISTEN_PORT=2112) to satisfy the linter.
In `@cmd/app/main.go`:
- Line 49: The call to provider.InitMetrics in main currently swallows startup
failures; change the InitMetrics signature in
internal/metrics/provider/provider.go to return an error for startup-time
failures (e.g., bind/listen errors) and update its internals to return those
errors instead of only logging, then update main to check the returned error
from provider.InitMetrics and call logger.FatalError() (or equivalent fatal
logger) to fail fast when metrics are explicitly enabled; reference
provider.InitMetrics and logger.FatalError() when making these changes.
In `@internal/metrics/metrics_test.go`:
- Around line 16-45: Test mutates global metrics (BackupsRunning, BackupsTotal,
LastBackupDuration, LastBackupStatus, HealthStatus, HealthyResourcesCount,
TotalResourcesCount, BackupTasksStatus) with fixed labels causing flakiness; fix
by making labels unique per test (e.g., append t.Name() or a random suffix) or
instantiate a fresh Prometheus registry for the test and register local metric
instances, and ensure cleanup by calling metric.DeleteLabelValues(...) for every
label combination used or registering t.Cleanup to remove them (or use
registry.Unregister/reset) so no state is left between runs.
In `@internal/metrics/provider/provider.go`:
- Around line 105-108: The metrics startup currently logs an unstructured info
string and treats ListenAndServe errors as non-fatal; update the startup to use
structured logging and fail fast: replace the logger.Info("metrics server
started...") with a structured logger.Info call including logger.KV{"address":
address}, and when http.ListenAndServe(address, mux) returns an error call
logger.FatalError(...) with structured context logger.KV{"address": address,
"error": err} (replace the current logger.Error usage). Locate the relevant
calls to logger.Info and http.ListenAndServe in provider.go to apply these
changes.
In `@internal/service/executions/run_execution.go`:
- Around line 60-95: The updateExec closure currently triggers webhooks, metrics
and logs (s.webhooksService.RunExecutionSuccess/RunExecutionFailed,
metrics.BackupsTotal, metrics.LastBackupDuration, metrics.LastBackupStatus,
logger.Info/Error) before calling s.dbgen.ExecutionsServiceUpdateExecution;
change the order so you call s.dbgen.ExecutionsServiceUpdateExecution(ctx,
params) first and return any error immediately, and only emit the
webhooks/metrics/logs after the DB update succeeds (use the same params and
context), ensuring side effects run only when persistence succeeds.
In `@README.md`:
- Around line 169-177: The metrics list in README.md is missing three exported
metrics; update the "Metrics Exposed" section to include
`pgbackweb_healthy_resources_count`, `pgbackweb_total_resources_count`, and
`pgbackweb_backup_tasks_status` with short descriptions (healthy resources
count, total resources count, and per-task backup status respectively) so the
docs match what the app emits and operators aren't misled.
---
Nitpick comments:
In `@internal/metrics/metrics_test.go`:
- Around line 14-45: Convert TestMetricsExposure into a table-driven test that
uses t.Run even for the single case: create a slice of test structs (e.g.,
{name, labels/setup func, expected []string}) and iterate over them, calling
t.Run(tc.name, func(t *testing.T) { ... }) to perform the metric updates (using
BackupsRunning, BackupsTotal, LastBackupDuration, LastBackupStatus,
HealthStatus, HealthyResourcesCount, TotalResourcesCount, BackupTasksStatus),
invoke promhttp.Handler().ServeHTTP on an httptest request, assert the response
contains the expected strings, and perform cleanup (e.g., BackupsRunning.Dec())
inside the subtest; update TestMetricsExposure to reference the test case fields
so the test structure follows the project's table-driven convention.
In `@internal/metrics/provider/provider.go`:
- Around line 29-96: The goroutine uses context.Background() and time.Sleep,
causing un-cancelable long-running operations; change it to accept/derive a
cancellable parent context and replace time.Sleep with a time.Ticker + select so
the loop can stop on ctx.Done(); for each service call
(servs.DatabasesService.GetAllDatabases,
servs.DestinationsService.GetAllDestinations,
servs.BackupsService.GetAllBackups) wrap the call with a short per-operation
timeout via context.WithTimeout (and defer cancel) to avoid hangs; ensure you
check ctx.Done() between operations and cancel per-call contexts to prevent
leaks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 019c958b-60e1-468b-b920-a4d3c2d0372c
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (12)
.env.exampleREADME.mdcmd/app/main.godocker/Dockerfiledocker/Dockerfile.devgo.modinternal/config/env.gointernal/metrics/metrics.gointernal/metrics/metrics_test.gointernal/metrics/provider/provider.gointernal/service/executions/run_execution.gointernal/service/executions/run_execution.sql
| ### Metrics Exposed | ||
|
|
||
| - `pgbackweb_backups_total`: Total number of backups executed (success/failed). | ||
| - `pgbackweb_backups_running`: Number of currently active backup operations. | ||
| - `pgbackweb_last_backup_duration_seconds`: Duration of the most recent backup in seconds. | ||
| - `pgbackweb_last_backup_status`: Status of the last backup (1 for success, 0 for failure). | ||
| - `pgbackweb_health_status`: Health status of databases and destinations (1 for healthy, 0 for unhealthy). | ||
| - `pgbackweb_db_stats_*`: Standard database connection pool statistics. | ||
|
|
There was a problem hiding this comment.
Metrics list is incomplete compared to what the app exports.
The “Metrics Exposed” list omits three emitted metrics (pgbackweb_healthy_resources_count, pgbackweb_total_resources_count, pgbackweb_backup_tasks_status), which can mislead operators.
📝 Proposed doc addition
- `pgbackweb_health_status`: Health status of databases and destinations (1 for healthy, 0 for unhealthy).
+- `pgbackweb_healthy_resources_count`: Number of healthy databases/destinations.
+- `pgbackweb_total_resources_count`: Total number of databases/destinations.
+- `pgbackweb_backup_tasks_status`: Number of active/inactive backup tasks.
- `pgbackweb_db_stats_*`: Standard database connection pool statistics.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### Metrics Exposed | |
| - `pgbackweb_backups_total`: Total number of backups executed (success/failed). | |
| - `pgbackweb_backups_running`: Number of currently active backup operations. | |
| - `pgbackweb_last_backup_duration_seconds`: Duration of the most recent backup in seconds. | |
| - `pgbackweb_last_backup_status`: Status of the last backup (1 for success, 0 for failure). | |
| - `pgbackweb_health_status`: Health status of databases and destinations (1 for healthy, 0 for unhealthy). | |
| - `pgbackweb_db_stats_*`: Standard database connection pool statistics. | |
| ### Metrics Exposed | |
| - `pgbackweb_backups_total`: Total number of backups executed (success/failed). | |
| - `pgbackweb_backups_running`: Number of currently active backup operations. | |
| - `pgbackweb_last_backup_duration_seconds`: Duration of the most recent backup in seconds. | |
| - `pgbackweb_last_backup_status`: Status of the last backup (1 for success, 0 for failure). | |
| - `pgbackweb_health_status`: Health status of databases and destinations (1 for healthy, 0 for unhealthy). | |
| - `pgbackweb_healthy_resources_count`: Number of healthy databases/destinations. | |
| - `pgbackweb_total_resources_count`: Total number of databases/destinations. | |
| - `pgbackweb_backup_tasks_status`: Number of active/inactive backup tasks. | |
| - `pgbackweb_db_stats_*`: Standard database connection pool statistics. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 169 - 177, The metrics list in README.md is missing
three exported metrics; update the "Metrics Exposed" section to include
`pgbackweb_healthy_resources_count`, `pgbackweb_total_resources_count`, and
`pgbackweb_backup_tasks_status` with short descriptions (healthy resources
count, total resources count, and per-task backup status respectively) so the
docs match what the app emits and operators aren't misled.
There was a problem hiding this comment.
🧹 Nitpick comments (4)
docker-compose.yml (3)
40-48: Development-only configuration: hardcoded credentials and encryption key.The hardcoded encryption key and embedded database credentials are appropriate for local development but must not be used in production. Consider adding a comment to make this clear to future maintainers.
📝 Proposed comment addition
pgbackweb: build: context: . dockerfile: docker/Dockerfile container_name: pgbackweb_app + # WARNING: This configuration is for LOCAL DEVELOPMENT ONLY. + # Do not use these hardcoded credentials or encryption keys in production. environment: PBW_ENCRYPTION_KEY: "test_encryption_key_32_chars_long!" PBW_POSTGRES_CONN_STRING: "postgresql://pgbackweb:pgbackweb_password@postgres-pgbackweb:5432/pgbackweb?sslmode=disable"Note: The static analysis warning (CKV_SECRET_4) about credentials in the connection string is a false positive for development docker-compose files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker-compose.yml` around lines 40 - 48, Add a clear development-only comment at the top of the environment block in docker-compose.yml indicating the hardcoded PBW_ENCRYPTION_KEY and PBW_POSTGRES_CONN_STRING are for local dev only and must not be used in production; reference the exact environment variables PBW_ENCRYPTION_KEY, PBW_POSTGRES_CONN_STRING, PBW_LISTEN_HOST, PBW_LISTEN_PORT, PBW_PATH_PREFIX, PBW_ENABLE_METRICS, PBW_METRICS_LISTEN_HOST, and PBW_METRICS_LISTEN_PORT so maintainers know which values are dev-only, and mention that production should use secrets/secret manager or CI-injected environment variables (and note the static analysis CKV_SECRET_4 warning is a false positive for this dev compose file).
5-17: Consider adding a volume for database persistence.The
postgres-pgbackwebservice has no volume mount, so all database data will be lost when the container is removed or recreated. For development continuity, consider adding a named volume.💾 Proposed addition for data persistence
postgres-pgbackweb: image: postgres:18 container_name: pgbackweb_postgres_main environment: POSTGRES_USER: pgbackweb POSTGRES_PASSWORD: pgbackweb_password POSTGRES_DB: pgbackweb ports: - "5432:5432" + volumes: + - pgbackweb_data:/var/lib/postgresql/data command: - "postgres" - "-c" - "log_statement=all"Then add at the end of the file:
volumes: pgbackweb_data: test_data:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker-compose.yml` around lines 5 - 17, The postgres-pgbackweb service has no persistent volume so its data is ephemeral; add a named volume and mount it into the container by updating the postgres-pgbackweb service to include a volumes entry that maps a named volume (e.g., pgbackweb_data) to the Postgres data directory (/var/lib/postgresql/data) and then declare that named volume under the top-level volumes section (e.g., pgbackweb_data:). Reference the service name "postgres-pgbackweb" and the suggested volume name "pgbackweb_data" when making the changes.
35-54: Consider adding a health check for the application service.Adding a health check would allow docker-compose to monitor the application's availability and could help with service orchestration.
🏥 Proposed health check addition
pgbackweb: build: context: . dockerfile: docker/Dockerfile container_name: pgbackweb_app environment: PBW_ENCRYPTION_KEY: "test_encryption_key_32_chars_long!" PBW_POSTGRES_CONN_STRING: "postgresql://pgbackweb:pgbackweb_password@postgres-pgbackweb:5432/pgbackweb?sslmode=disable" PBW_LISTEN_HOST: "0.0.0.0" PBW_LISTEN_PORT: "8085" PBW_PATH_PREFIX: "" PBW_ENABLE_METRICS: "true" PBW_METRICS_LISTEN_HOST: "0.0.0.0" PBW_METRICS_LISTEN_PORT: "2112" ports: - "8085:8085" - "2112:2112" depends_on: - postgres-pgbackweb + healthcheck: + test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8085/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 40s restart: unless-stopped🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker-compose.yml` around lines 35 - 54, Add a Docker healthcheck for the pgbackweb service (pgbackweb / container_name pgbackweb_app) so docker-compose can monitor app availability; update the pgbackweb service block to include a healthcheck stanza that periodically runs a lightweight HTTP probe against the app listen host/port (e.g., curl or wget against http://localhost:8085/health or another ready endpoint), and set sensible interval, timeout, retries and start_period values so health transitions are reliable.internal/service/executions/run_execution.go (1)
60-100: Past review concern addressed - DB persistence now happens before side effects.The ordering is now correct: the database update (lines 62-65) is performed first, and webhooks/metrics/logs are only emitted after successful persistence.
Minor optimization: compute duration once per branch.
time.Since(startTime).Seconds()is called twice in each branch (for metrics and logging). Consider computing it once for consistency and minor efficiency.♻️ Optional refactor to compute duration once
// Only emit side effects if database update succeeded + durationSec := time.Since(startTime).Seconds() if params.Status.String == "success" { s.webhooksService.RunExecutionSuccess(backupID) metrics.BackupsTotal.WithLabelValues("success", back.BackupName, back.DatabaseName).Inc() - metrics.LastBackupDuration.WithLabelValues(back.BackupName, back.DatabaseName).Set(time.Since(startTime).Seconds()) + metrics.LastBackupDuration.WithLabelValues(back.BackupName, back.DatabaseName).Set(durationSec) metrics.LastBackupStatus.WithLabelValues(back.BackupName, back.DatabaseName).Set(1.0) logger.Info("execution completed successfully", logger.KV{ "backup_id": backupID.String(), "execution_id": ex.ID.String(), "backup_name": back.BackupName, "database_name": back.DatabaseName, - "duration_sec": time.Since(startTime).Seconds(), + "duration_sec": durationSec, "file_size": params.FileSize.Int64, "path": params.Path.String, }) } if params.Status.String == "failed" { s.webhooksService.RunExecutionFailed(backupID) metrics.BackupsTotal.WithLabelValues("failed", back.BackupName, back.DatabaseName).Inc() - metrics.LastBackupDuration.WithLabelValues(back.BackupName, back.DatabaseName).Set(time.Since(startTime).Seconds()) + metrics.LastBackupDuration.WithLabelValues(back.BackupName, back.DatabaseName).Set(durationSec) metrics.LastBackupStatus.WithLabelValues(back.BackupName, back.DatabaseName).Set(0.0) logger.Error("execution failed", logger.KV{ "backup_id": backupID.String(), "execution_id": ex.ID.String(), "backup_name": back.BackupName, "database_name": back.DatabaseName, "error_message": params.Message.String, - "duration_sec": time.Since(startTime).Seconds(), + "duration_sec": durationSec, }) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/service/executions/run_execution.go` around lines 60 - 100, In updateExec (the closure used to update execution state) compute the elapsed duration once inside each branch instead of calling time.Since(startTime).Seconds() multiple times: declare a local variable (e.g., duration := time.Since(startTime).Seconds()) at the start of the "if params.Status.String == \"success\"" block and at the start of the "if params.Status.String == \"failed\"" block, then use duration for metrics.LastBackupDuration.Set(...) and the logger fields to avoid duplicate calls and ensure consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docker-compose.yml`:
- Around line 40-48: Add a clear development-only comment at the top of the
environment block in docker-compose.yml indicating the hardcoded
PBW_ENCRYPTION_KEY and PBW_POSTGRES_CONN_STRING are for local dev only and must
not be used in production; reference the exact environment variables
PBW_ENCRYPTION_KEY, PBW_POSTGRES_CONN_STRING, PBW_LISTEN_HOST, PBW_LISTEN_PORT,
PBW_PATH_PREFIX, PBW_ENABLE_METRICS, PBW_METRICS_LISTEN_HOST, and
PBW_METRICS_LISTEN_PORT so maintainers know which values are dev-only, and
mention that production should use secrets/secret manager or CI-injected
environment variables (and note the static analysis CKV_SECRET_4 warning is a
false positive for this dev compose file).
- Around line 5-17: The postgres-pgbackweb service has no persistent volume so
its data is ephemeral; add a named volume and mount it into the container by
updating the postgres-pgbackweb service to include a volumes entry that maps a
named volume (e.g., pgbackweb_data) to the Postgres data directory
(/var/lib/postgresql/data) and then declare that named volume under the
top-level volumes section (e.g., pgbackweb_data:). Reference the service name
"postgres-pgbackweb" and the suggested volume name "pgbackweb_data" when making
the changes.
- Around line 35-54: Add a Docker healthcheck for the pgbackweb service
(pgbackweb / container_name pgbackweb_app) so docker-compose can monitor app
availability; update the pgbackweb service block to include a healthcheck stanza
that periodically runs a lightweight HTTP probe against the app listen host/port
(e.g., curl or wget against http://localhost:8085/health or another ready
endpoint), and set sensible interval, timeout, retries and start_period values
so health transitions are reliable.
In `@internal/service/executions/run_execution.go`:
- Around line 60-100: In updateExec (the closure used to update execution state)
compute the elapsed duration once inside each branch instead of calling
time.Since(startTime).Seconds() multiple times: declare a local variable (e.g.,
duration := time.Since(startTime).Seconds()) at the start of the "if
params.Status.String == \"success\"" block and at the start of the "if
params.Status.String == \"failed\"" block, then use duration for
metrics.LastBackupDuration.Set(...) and the logger fields to avoid duplicate
calls and ensure consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1741ba75-1a1c-45de-a3db-770b2d4c3e9c
📒 Files selected for processing (7)
.env.exampleREADME.mdcmd/app/main.godocker-compose.ymlinternal/metrics/metrics_test.gointernal/metrics/provider/provider.gointernal/service/executions/run_execution.go
✅ Files skipped from review due to trivial changes (2)
- .env.example
- README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/app/main.go
- internal/metrics/metrics_test.go
- internal/metrics/provider/provider.go
This pull request adds native Prometheus monitoring support to PG Back Web. It introduces a configurable
/metricsendpoint exposing detailed backup and health metrics, updates the documentation and configuration, and integrates metric collection into the backup execution workflow. The implementation is modular and includes automated tests for metrics exposure.Prometheus Metrics Integration
internal/metrics/metrics.go) defining Prometheus metrics for backups, health status, and resources, and registered them for collection.internal/metrics/provider/provider.go) that starts a metrics HTTP server (configurable via environment variables), periodically updates health/resource metrics, and exposes database connection pool stats.Configuration and Documentation
PBW_ENABLE_METRICS,PBW_METRICS_LISTEN_HOST,PBW_METRICS_LISTEN_PORT) in.env.exampleand documented usage and available metrics inREADME.md. [1] [2] [3]Dependency and Compatibility Updates
go.mod. [1] [2] [3]Testing
/metricsendpoint works as expected (internal/metrics/metrics_test.go).see #166
Summary by CodeRabbit
New Features
Documentation
Bug Fixes / Improvements
Tests
Chores