fix: repair DB maintenance retention and add monthly call retention #50

Open
LumenPrima wants to merge 13 commits into master from fix/maintenance-retention

Summary

Three related fixes to the daily DB maintenance run. Reported by J-Man; verified against prod and a real Postgres 17.

Bug 1 — raw partitions never dropped (the ~2-month buildup)

DropOldWeeklyPartitions parsed the partition upper bound with time.Parse("2006-01-02", …), but mqtt_raw_messages is partitioned on a timestamptz column, so pg_get_expr renders bounds like 2026-05-18 00:00:00+00. The date-only layout errored and the code silently continued past every partition — nothing was ever dropped. Verified on prod: 13 stale weekly partitions sitting undropped.

Fix: a shared expiredPartitions helper casts the bound ::timestamptz in SQL and scans into a nullable *time.Time (NULL = DEFAULT/MAXVALUE bound → never dropped).

Bug 2 — timeout starvation

The whole run shared one 5*time.Minute context. A slow step (e.g. state-table decimation) could leave later cleanup steps with an already-expired context; those failures were only logged at Warn, silently skipping retention.

Fix: maintenanceStep runs each step under its own timeout derived from the pipeline context (decimation and purge get a separate budget per table), so no step can starve another. Shutdown cancellation still propagates.

Feature — monthly call retention (opt-in)

New RETENTION_CALLS knob, default 0 = keep forever (non-destructive on upgrade). When set, DropOldCallPartitions removes the FK-coupled call family per a calendar-month whole-partition policy, in FK-safe order:

  1. drop call_frequencies / call_transmissions partitions
  2. delete transcriptions below the partition boundary (not the raw cutoff — so transcripts for a not-yet-expired boundary month are preserved)
  3. DETACH + DROP the calls partitions

The order was validated empirically: a plain DROP of a referenced calls partition is refused by PostgreSQL because the child FK constraints depend on it; DETACH then DROP works once children are gone. Wired through the full config/override/locked/status plumbing, openapi.yaml, sample.env, and CLAUDE.md.

Tests

  • internal/ingest/maintenance_test.go — unit tests for maintenanceStep (independent budgets, shutdown propagation).
  • internal/database/maintenance_integration_test.go — real-Postgres integration tests (skip without TEST_DATABASE_URL): call-family drop + FK ordering, boundary preservation, disabled no-op, Bug 1 timestamptz parsing + DEFAULT-partition safety.
  • Mutation-tested: swapping boundary for cutoff makes TestCallRetention fail, confirming the boundary guard is real.

All pass against postgres:17-alpine; full go build/go vet/go test ./... green.

Notes for reviewers

  • RETENTION_CALLS is disabled by default — no behavior change until an operator opts in.
  • Scope is the call family only; unit_events (still "permanent") and trunking_messages (already row-purged) are deliberately left out.

🤖 Generated with Claude Code

LumenPrima and others added 13 commits May 10, 2026 00:43
- storage: LocalStore tests (save/open round-trip, nested dirs, atomic writes,
  path traversal security, Exists, LocalPath, URL, Type, Dir, NotFound)
- trconfig: LoadConfig, LoadVolumeMap, Discover tests
- 2 new test files, 0 regressions across full suite
Bubble Scatter, Calendar Heatmap, Daily Overview (treemap), Emergency Log,
Recorder Gauges, TG Sunburst, Traffic & Patterns, Unit Tracker — all
registered via card-title/description/order meta tags.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
migrating-auth.md covers the v0.9.8 auth simplification (AUTH_ENABLED
deprecated, three-mode auto-detection). glossary-research.md captures
PostgreSQL-based phonetic/fuzzy lookup strategy for ASR post-correction.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ards

All GET /api/v1/{systems,calls,recorders} responses are wrapped objects
({systems:[],total} etc.), not plain arrays — pages were iterating the
object directly and displaying no data.

Bugs fixed per page:
- bubble-scatter: unwrap systems/calls; tg_alpha_tag (was tgid_alpha_tag); tg_tag (was tag)
- calendar-heatmap: unwrap systems/calls
- daily-overview: unwrap systems/calls; fix unconditional j-- causing infinite loop;
  d.depth===2 for TG leaves (was d.height===2 which selected root); tg_alpha_tag; tg_tag
- tg-sunburst: unwrap systems/calls; tg_alpha_tag (was tgid_alpha_tag)
- recorder-gauges: unwrap recorders and systems
- emergency-log: fix JS syntax error (bare var(--accent) as expression); system name
  field (name not system_name); wire up trend chart on init
- traffic-patterns: system name field; call-heatmap uses days param (not hours, max 90)
- unit-tracker: unit_alpha_tag (was alpha_tag); var declaration in strict mode

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add POST /debug-report (was implemented but missing from spec)
- Add DELETE /users/all with danger notice (bulk delete, admin only)
- Fix /admin/transcribe-backfill path param: {job_id} -> {id} to match code
- go test -race passes clean (zero DATA RACE)
- golangci-lint v1.64.8 with govet/staticcheck/errcheck/unused
- gates new code only via --new-from-rev HEAD~1
- GET /api/v1/recent-events: UNION of calls + unit_events with filters
- GET /api/v1/dashboard/summary: aggregated health/stats/active/top-TG
- Audio Range support (200/206/416) via http.ServeContent
- OpenAPI specs for all three endpoints
- Tests for handlers and audio status codes
…nto UI

- Call history: transcript search wired to server-side /transcriptions/search
- Call history: draggable seek handle with buffered range visualization
- Events page: pre-populate from /recent-events before SSE connects
- Signal Flow: surface query 403 errors instead of silent degradation
- POST /api/v1/query: explicit QueryAdminOnly middleware with clear 403
…and Playwright smoke tests

- docs/quickstart.md: Docker Compose → ingest → UI in 5 minutes
- docs/tr-timing-behavior.md: call ID shifts, unit_event:end lag, mitigations
- docs/release-notes.md: v0.10.0 release notes with features/fixes/upgrade notes
- tests/: Playwright smoke suite (8 tests) against deployed instance
W1: Refresh token rotation — JTI-based rotation with reuse detection
  - New migration adds refresh_token_jti column to users table
  - Refresh handler validates JTI, rejects reused tokens with 401
  - Logout clears stored JTI before expiring cookie

W2: rewriteInstanceID() — replace fragile byte-level JSON with
  json.Unmarshal/json.Marshal round-trip, preventing false matches
  on keys like my_instance_id

W3: AGENTS.md — fix fuzzy match tolerance documentation (±5s → ±10s)

W4: AuthRateLimiter — fix goroutine leak via time.NewTicker with
  ctx.Done() select, ShutdownCtx propagated from main

W5: Middleware chain — swap RateLimiter/Recoverer order so panics
  in rate limiter are caught by Recoverer
Three related fixes to the daily maintenance run.

Bug 1 — raw partitions never dropped. DropOldWeeklyPartitions parsed the
extracted partition upper bound with time.Parse("2006-01-02", ...), but
mqtt_raw_messages is partitioned on a timestamptz column, so pg_get_expr
renders bounds like '2026-05-18 00:00:00+00'. The date-only layout errored
and the code silently skipped every partition, so nothing was ever dropped
(~2 months of stale raw partitions on prod). Fix: a shared expiredPartitions
helper casts the bound ::timestamptz in SQL and scans into a nullable
*time.Time (NULL = DEFAULT/MAXVALUE bound, never dropped).

Bug 2 — timeout starvation. The whole run shared one 5-minute context, so a
slow step (e.g. state-table decimation) could leave later cleanup steps with
an already-expired context; those failures were only logged at Warn, silently
skipping retention. Fix: maintenanceStep runs each step under its own timeout
derived from the pipeline context (decimation/purge per-table), so no step can
starve another; shutdown cancellation still propagates.

Feature — monthly call retention (opt-in via RETENTION_CALLS, default 0 =
keep forever, so non-destructive on upgrade). DropOldCallPartitions removes
the FK-coupled call family per a calendar-month whole-partition policy in
foreign-key-safe order: drop call_frequencies/call_transmissions partitions,
delete transcriptions below the partition boundary (not the raw cutoff, so
transcripts for a not-yet-expired boundary month are preserved), then
DETACH+DROP the calls partitions (a plain DROP is refused while child FK
constraints depend on the partition). Wired through the full
config/override/locked/status plumbing, openapi.yaml, sample.env, CLAUDE.md.

Tests: unit tests for maintenanceStep budgets/shutdown; real-Postgres
integration tests (skip without TEST_DATABASE_URL) for the call-family drop,
the boundary preservation subtlety, the disabled no-op, and Bug 1 timestamptz
bound parsing + DEFAULT-partition safety. Verified against postgres:17-alpine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>