Skip to content

mathewjustin/postgres-storage-advisor

Repository files navigation

Postgres Storage Advisor

Catch PostgreSQL storage problems before they slow you down

A Kubernetes operator that detects storage-level performance issues in PostgreSQL — like declining HOT ratios and dead tuple accumulation — and can safely remediate them (VACUUM, online REINDEX) under an explicit, opt-in safety policy before they impact production.

Why This Exists

Most Postgres tools optimize queries. This tool prevents storage-level slowdowns:

  • Tables getting slower even when queries don't change
  • Write performance degrading after adding indexes
  • Autovacuum falling behind on high-update tables

Quick Start

Prerequisites:

Install:

# Install CRDs and operator
kubectl apply -k github.com/mathewjustin/postgres-storage-advisor/config/default

# Create analysis for your database
kubectl apply -f - <<EOF
apiVersion: storage-advisor.postgres.dev/v1alpha1
kind: StorageAnalysis
metadata:
  name: my-db-analysis
  namespace: default
spec:
  postgresClusterRef:
    name: my-postgres-cluster  # Your PGO cluster name
  analysisInterval: 5m
EOF

# Watch for findings
kubectl get storageanalysis my-db-analysis -o yaml

What It Detects

1. Declining HOT Update Ratio

Problem: Indexes blocking efficient updates
Impact: 4-10x write slowdown
Example: Table with 90% HOT ratio drops to 60% after new index added

2. Dead Tuple Accumulation

Problem: Autovacuum can't keep up with updates
Impact: 1.5-2x query slowdown, storage bloat
Example: 35% of table rows are dead but not yet cleaned

Coming Soon

  • Table/index bloat growth
  • Vacuum lag detection
  • Write amplification analysis

What It Fixes (Automated Remediation)

Detection is only half the value — Postgres already has autovacuum. The operator can close the loop on the things autovacuum is bad at, and do it safely:

Finding Remediation Safety tier
dead-tuple-accumulation VACUUM (ANALYZE) <table> (no exclusive lock) safe
declining-hot-ratio REINDEX TABLE CONCURRENTLY <table> (online, PG12+) requires-testing

Only this fixed, vetted SQL catalog can ever run — recommendation free-text is never executed, and all table identifiers are sanitized.

Safety model (nothing runs by accident)

Remediation is off by default. An action executes only when all of these hold, evaluated in order:

  1. remediation.enabled: true
  2. the action's safety tier is in remediation.allowedSafety
  3. the current time is inside remediation.maintenanceWindow (if set)
  4. remediation.dryRun: false

Otherwise the action is planned and recorded in status.remediationHistory (newest first, last 20) with an outcome of dry-run, skipped-safety, or skipped-window — so you can see exactly what would have happened before enabling it. An unparseable maintenance window fails closed.

How It Works

Storage Advisor Operator
    ├─ Watches StorageAnalysis CRs
    ├─ Connects to Postgres via PGO secrets (or a plain Secret, see below)
    ├─ Collects pg_stat_* metrics (external, no DB code)
    ├─ Detects patterns from historical trends
    ├─ Plans remediation, gated by the safety policy
    └─ Updates CRD status with findings + remediation history

Analysis never writes to your database. Remediation only writes when you have explicitly enabled it and turned off dry-run (see the safety model above).

Connecting to Postgres

  • PGO (default): postgresClusterRef reads the Crunchy Postgres Operator secret/TLS convention ({cluster}-pguser-{cluster} + {cluster}-cluster-cert).
  • Plain Secret (localConnection): for local dev or non-PGO Postgres, point at a plain Secret with keys host, port, dbname, user, password (non-TLS). When set, it takes precedence over the PGO path.

Example Output

status:
  overallRisk: high
  findings:
  - table: events
    severity: high
    issue: declining-hot-ratio
    explanation: |
      HOT ratio dropped from 89% to 62% (-27 points).
      Updates now require index maintenance, slowing writes by ~4x.
      Check for recently added indexes.

Development

Local dev uses kind + Tilt with a vendored local image registry and a plain postgres:16-alpine (no PGO/TLS needed):

# 1. Create the kind cluster + local registry (idempotent).
#    The registry is required because snap-packaged Docker cannot
#    `docker save` to the /tmp path Tilt's kind image-loader uses.
./scripts/setup-local-dev.sh

# 2. Bring everything up (operator + postgres + sample CR, live-reload).
make tilt-up                         # or: tilt up   /   tilt ci

# Inspect
kubectl get storageanalysis -n postgres-operator-system
kubectl get pods -n postgres-operator-system

# Run tests
make test

The Tilt stack deploys a sample StorageAnalysis that uses localConnection and ships remediation in dryRun mode, so the full plan→gate path is exercised without touching the database.

Configuration

Basic Example

apiVersion: storage-advisor.postgres.dev/v1alpha1
kind: StorageAnalysis
metadata:
  name: production-db
spec:
  postgresClusterRef:
    name: prod-cluster
  analysisInterval: 5m
  database: postgres  # Optional, defaults to cluster name

Advanced Options

spec:
  # Only analyze specific tables
  includeTables:
    - public.users
    - public.events
  
  # Exclude tables
  excludeTables:
    - public.audit_log

Remediation Example

spec:
  postgresClusterRef:
    name: prod-cluster
  remediation:
    enabled: true            # master switch (default false)
    dryRun: false            # set true to plan-only (default true)
    allowedSafety:           # only these tiers may execute
      - safe                 # VACUUM (ANALYZE)
      # - requires-testing   # REINDEX TABLE CONCURRENTLY (opt in deliberately)
    maintenanceWindow:       # UTC; optional. Outside it -> skipped-window
      days: ["Sat", "Sun"]   # empty/omitted = every day
      start: "02:00"
      end: "04:00"

Plain-Secret / Local Connection

spec:
  postgresClusterRef:
    name: unused-locally     # still required by the schema
  database: testdb
  localConnection:           # takes precedence over PGO when set
    secretName: postgres-credentials
    # secretNamespace: defaults to the StorageAnalysis namespace

Architecture

  • Language: Go 1.21
  • Framework: Kubebuilder / controller-runtime
  • Storage: ConfigMaps (30-day retention)
  • No external dependencies (no Redis, no TimescaleDB)

See docs/architecture.puml for detailed diagrams.

Extending with New Patterns

Add custom detection patterns by implementing the Pattern interface:

type MyPattern struct{}

func (p *MyPattern) Detect(snapshots []TableSignals) *Finding {
    // Your logic here
    return &Finding{...}
}

Register in cmd/operator/main.go:

patterns := []analyzer.Pattern{
    analyzer.NewDecliningHOTPattern(),
    analyzer.NewDeadTuplePattern(),
    analyzer.NewMyPattern(),  // Your pattern
}

Status

Implemented:

  • CRD definitions with validation
  • Pattern detection (2 patterns)
  • PGO (TLS) + plain-Secret (localConnection) connection modes
  • Safety-gated automated remediation (VACUUM, online REINDEX) with dry-run default, safety tiers, maintenance window, and audit history
  • Time-series storage (30-day retention)
  • RBAC manifests, health/readiness probes, unit tests
  • kind + Tilt local dev environment

🚧 Roadmap:

  • Additional pattern detectors (bloat growth, vacuum lag)
  • More remediation actions (autovacuum per-table tuning, pg_repack)
  • Prometheus metrics export
  • Webhook/Slack notifications
  • LLM-powered explanations (optional)

License

Apache 2.0

About

Kubernetes operator for detecting PostgreSQL storage issues like declining HOT updates, dead tuples, and bloat signals

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors