Catch PostgreSQL storage problems before they slow you down
A Kubernetes operator that detects storage-level performance issues in PostgreSQL — like declining HOT ratios and dead tuple accumulation — and can safely remediate them (VACUUM, online REINDEX) under an explicit, opt-in safety policy before they impact production.
Most Postgres tools optimize queries. This tool prevents storage-level slowdowns:
- Tables getting slower even when queries don't change
- Write performance degrading after adding indexes
- Autovacuum falling behind on high-update tables
Prerequisites:
- Kubernetes cluster
- Crunchy Postgres Operator installed
- A PostgreSQL cluster managed by PGO
Install:
# Install CRDs and operator
kubectl apply -k github.com/mathewjustin/postgres-storage-advisor/config/default
# Create analysis for your database
kubectl apply -f - <<EOF
apiVersion: storage-advisor.postgres.dev/v1alpha1
kind: StorageAnalysis
metadata:
name: my-db-analysis
namespace: default
spec:
postgresClusterRef:
name: my-postgres-cluster # Your PGO cluster name
analysisInterval: 5m
EOF
# Watch for findings
kubectl get storageanalysis my-db-analysis -o yamlProblem: Indexes blocking efficient updates
Impact: 4-10x write slowdown
Example: Table with 90% HOT ratio drops to 60% after new index added
Problem: Autovacuum can't keep up with updates
Impact: 1.5-2x query slowdown, storage bloat
Example: 35% of table rows are dead but not yet cleaned
- Table/index bloat growth
- Vacuum lag detection
- Write amplification analysis
Detection is only half the value — Postgres already has autovacuum. The operator can close the loop on the things autovacuum is bad at, and do it safely:
| Finding | Remediation | Safety tier |
|---|---|---|
dead-tuple-accumulation |
VACUUM (ANALYZE) <table> (no exclusive lock) |
safe |
declining-hot-ratio |
REINDEX TABLE CONCURRENTLY <table> (online, PG12+) |
requires-testing |
Only this fixed, vetted SQL catalog can ever run — recommendation free-text is never executed, and all table identifiers are sanitized.
Remediation is off by default. An action executes only when all of these hold, evaluated in order:
remediation.enabled: true- the action's safety tier is in
remediation.allowedSafety - the current time is inside
remediation.maintenanceWindow(if set) remediation.dryRun: false
Otherwise the action is planned and recorded in
status.remediationHistory (newest first, last 20) with an outcome of
dry-run, skipped-safety, or skipped-window — so you can see exactly what
would have happened before enabling it. An unparseable maintenance window
fails closed.
Storage Advisor Operator
├─ Watches StorageAnalysis CRs
├─ Connects to Postgres via PGO secrets (or a plain Secret, see below)
├─ Collects pg_stat_* metrics (external, no DB code)
├─ Detects patterns from historical trends
├─ Plans remediation, gated by the safety policy
└─ Updates CRD status with findings + remediation history
Analysis never writes to your database. Remediation only writes when you have explicitly enabled it and turned off dry-run (see the safety model above).
- PGO (default):
postgresClusterRefreads the Crunchy Postgres Operator secret/TLS convention ({cluster}-pguser-{cluster}+{cluster}-cluster-cert). - Plain Secret (
localConnection): for local dev or non-PGO Postgres, point at a plainSecretwith keyshost, port, dbname, user, password(non-TLS). When set, it takes precedence over the PGO path.
status:
overallRisk: high
findings:
- table: events
severity: high
issue: declining-hot-ratio
explanation: |
HOT ratio dropped from 89% to 62% (-27 points).
Updates now require index maintenance, slowing writes by ~4x.
Check for recently added indexes.Local dev uses kind + Tilt with a vendored local image registry and a
plain postgres:16-alpine (no PGO/TLS needed):
# 1. Create the kind cluster + local registry (idempotent).
# The registry is required because snap-packaged Docker cannot
# `docker save` to the /tmp path Tilt's kind image-loader uses.
./scripts/setup-local-dev.sh
# 2. Bring everything up (operator + postgres + sample CR, live-reload).
make tilt-up # or: tilt up / tilt ci
# Inspect
kubectl get storageanalysis -n postgres-operator-system
kubectl get pods -n postgres-operator-system
# Run tests
make testThe Tilt stack deploys a sample StorageAnalysis that uses localConnection
and ships remediation in dryRun mode, so the full plan→gate path is
exercised without touching the database.
apiVersion: storage-advisor.postgres.dev/v1alpha1
kind: StorageAnalysis
metadata:
name: production-db
spec:
postgresClusterRef:
name: prod-cluster
analysisInterval: 5m
database: postgres # Optional, defaults to cluster namespec:
# Only analyze specific tables
includeTables:
- public.users
- public.events
# Exclude tables
excludeTables:
- public.audit_logspec:
postgresClusterRef:
name: prod-cluster
remediation:
enabled: true # master switch (default false)
dryRun: false # set true to plan-only (default true)
allowedSafety: # only these tiers may execute
- safe # VACUUM (ANALYZE)
# - requires-testing # REINDEX TABLE CONCURRENTLY (opt in deliberately)
maintenanceWindow: # UTC; optional. Outside it -> skipped-window
days: ["Sat", "Sun"] # empty/omitted = every day
start: "02:00"
end: "04:00"spec:
postgresClusterRef:
name: unused-locally # still required by the schema
database: testdb
localConnection: # takes precedence over PGO when set
secretName: postgres-credentials
# secretNamespace: defaults to the StorageAnalysis namespace- Language: Go 1.21
- Framework: Kubebuilder / controller-runtime
- Storage: ConfigMaps (30-day retention)
- No external dependencies (no Redis, no TimescaleDB)
See docs/architecture.puml for detailed diagrams.
Add custom detection patterns by implementing the Pattern interface:
type MyPattern struct{}
func (p *MyPattern) Detect(snapshots []TableSignals) *Finding {
// Your logic here
return &Finding{...}
}Register in cmd/operator/main.go:
patterns := []analyzer.Pattern{
analyzer.NewDecliningHOTPattern(),
analyzer.NewDeadTuplePattern(),
analyzer.NewMyPattern(), // Your pattern
}✅ Implemented:
- CRD definitions with validation
- Pattern detection (2 patterns)
- PGO (TLS) + plain-Secret (
localConnection) connection modes - Safety-gated automated remediation (VACUUM, online REINDEX) with dry-run default, safety tiers, maintenance window, and audit history
- Time-series storage (30-day retention)
- RBAC manifests, health/readiness probes, unit tests
- kind + Tilt local dev environment
🚧 Roadmap:
- Additional pattern detectors (bloat growth, vacuum lag)
- More remediation actions (autovacuum per-table tuning,
pg_repack) - Prometheus metrics export
- Webhook/Slack notifications
- LLM-powered explanations (optional)
Apache 2.0