fix: repair DB maintenance retention and add monthly call retention#50
Open
LumenPrima wants to merge 13 commits into
- storage: LocalStore tests (save/open round-trip, nested dirs, atomic writes, path traversal security, Exists, LocalPath, URL, Type, Dir, NotFound)
- trconfig: LoadConfig, LoadVolumeMap, Discover tests
- 2 new test files, 0 regressions across full suite
Bubble Scatter, Calendar Heatmap, Daily Overview (treemap), Emergency Log, Recorder Gauges, TG Sunburst, Traffic & Patterns, Unit Tracker — all registered via card-title/description/order meta tags. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
migrating-auth.md covers the v0.9.8 auth simplification (AUTH_ENABLED deprecated, three-mode auto-detection). glossary-research.md captures PostgreSQL-based phonetic/fuzzy lookup strategy for ASR post-correction. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ards
All GET /api/v1/{systems,calls,recorders} responses are wrapped objects
({systems:[],total} etc.), not plain arrays — pages were iterating the
object directly and displaying no data.
Bugs fixed per page:
- bubble-scatter: unwrap systems/calls; tg_alpha_tag (was tgid_alpha_tag); tg_tag (was tag)
- calendar-heatmap: unwrap systems/calls
- daily-overview: unwrap systems/calls; fix unconditional j-- causing infinite loop;
d.depth===2 for TG leaves (was d.height===2 which selected root); tg_alpha_tag; tg_tag
- tg-sunburst: unwrap systems/calls; tg_alpha_tag (was tgid_alpha_tag)
- recorder-gauges: unwrap recorders and systems
- emergency-log: fix JS syntax error (bare var(--accent) as expression); system name
field (name not system_name); wire up trend chart on init
- traffic-patterns: system name field; call-heatmap uses days param (not hours, max 90)
- unit-tracker: unit_alpha_tag (was alpha_tag); var declaration in strict mode
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add POST /debug-report (was implemented but missing from spec)
- Add DELETE /users/all with danger notice (bulk delete, admin only)
- Fix /admin/transcribe-backfill path param: {job_id} -> {id} to match code
- go test -race passes clean (zero DATA RACE)
- golangci-lint v1.64.8 with govet/staticcheck/errcheck/unused
- gates new code only via --new-from-rev HEAD~1
- GET /api/v1/recent-events: UNION of calls + unit_events with filters
- GET /api/v1/dashboard/summary: aggregated health/stats/active/top-TG
- Audio Range support (200/206/416) via http.ServeContent
- OpenAPI specs for all three endpoints
- Tests for handlers and audio status codes
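The Range behaviour named in this commit (200/206/416 via http.ServeContent) can be reproduced with the standard library alone. The sketch below is not the project's handler: the `statusFor` helper, the `/audio` path, the `call.wav` name, and the 10-byte payload are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusFor serves a fixed 10-byte payload through http.ServeContent and
// returns the status code produced for the given Range header value.
// Hypothetical helper: the real audio handler is not shown in the PR text.
func statusFor(rangeHeader string) int {
	audio := []byte("0123456789")
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ServeContent needs an io.ReadSeeker so it can seek to the range.
		http.ServeContent(w, r, "call.wav", time.Time{}, bytes.NewReader(audio))
	})
	req := httptest.NewRequest(http.MethodGet, "/audio", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""))           // full body
	fmt.Println(statusFor("bytes=0-3"))  // satisfiable range
	fmt.Println(statusFor("bytes=100-")) // range starts past the end
}
```

Delegating to http.ServeContent keeps the range parsing and the Content-Range header out of handler code, which is presumably why the commit relies on it.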
…nto UI
- Call history: transcript search wired to server-side /transcriptions/search
- Call history: draggable seek handle with buffered range visualization
- Events page: pre-populate from /recent-events before SSE connects
- Signal Flow: surface query 403 errors instead of silent degradation
- POST /api/v1/query: explicit QueryAdminOnly middleware with clear 403
…and Playwright smoke tests
- docs/quickstart.md: Docker Compose → ingest → UI in 5 minutes
- docs/tr-timing-behavior.md: call ID shifts, unit_event:end lag, mitigations
- docs/release-notes.md: v0.10.0 release notes with features/fixes/upgrade notes
- tests/: Playwright smoke suite (8 tests) against deployed instance
W1: Refresh token rotation - JTI-based rotation with reuse detection
- New migration adds refresh_token_jti column to users table
- Refresh handler validates JTI, rejects reused tokens with 401
- Logout clears stored JTI before expiring cookie
W2: rewriteInstanceID() - replace fragile byte-level JSON with json.Unmarshal/json.Marshal round-trip, preventing false matches on keys like my_instance_id
W3: AGENTS.md - fix fuzzy match tolerance documentation (±5s → ±10s)
W4: AuthRateLimiter - fix goroutine leak via time.NewTicker with ctx.Done() select, ShutdownCtx propagated from main
W5: Middleware chain - swap RateLimiter/Recoverer order so panics in rate limiter are caught by Recoverer
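To illustrate the W2 point, here is a minimal sketch of a JSON round-trip that rewrites only an exact top-level instance_id key. The function name `rewriteInstanceIDSketch` and the choice of map[string]json.RawMessage are assumptions; the project's actual rewriteInstanceID() is not shown in this PR text and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rewriteInstanceIDSketch replaces the value of the exact top-level key
// "instance_id" and leaves every other field untouched. Decoding into
// json.RawMessage values avoids re-interpreting the other fields.
// Hypothetical stand-in for the project's rewriteInstanceID().
func rewriteInstanceIDSketch(payload []byte, newID string) ([]byte, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(payload, &obj); err != nil {
		return nil, err
	}
	if _, ok := obj["instance_id"]; !ok {
		return payload, nil // nothing to rewrite
	}
	id, err := json.Marshal(newID)
	if err != nil {
		return nil, err
	}
	obj["instance_id"] = id
	return json.Marshal(obj) // map keys are emitted in sorted order
}

func main() {
	in := []byte(`{"my_instance_id":"a","instance_id":"b"}`)
	out, err := rewriteInstanceIDSketch(in, "x")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"instance_id":"x","my_instance_id":"a"}
}
```

A byte-level search for instance_id would also hit the tail of my_instance_id; keyed lookup on the decoded object cannot. One trade-off of this approach is that key order in the output is not preserved.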
Three related fixes to the daily maintenance run.
Bug 1 — raw partitions never dropped. DropOldWeeklyPartitions parsed the
extracted partition upper bound with time.Parse("2006-01-02", ...), but
mqtt_raw_messages is partitioned on a timestamptz column, so pg_get_expr
renders bounds like '2026-05-18 00:00:00+00'. The date-only layout errored
and the code silently skipped every partition, so nothing was ever dropped
(~2 months of stale raw partitions on prod). Fix: a shared expiredPartitions
helper casts the bound ::timestamptz in SQL and scans into a nullable
*time.Time (NULL = DEFAULT/MAXVALUE bound, never dropped).
Bug 2 — timeout starvation. The whole run shared one 5-minute context, so a
slow step (e.g. state-table decimation) could leave later cleanup steps with
an already-expired context; those failures were only logged at Warn, silently
skipping retention. Fix: maintenanceStep runs each step under its own timeout
derived from the pipeline context (decimation/purge per-table), so no step can
starve another; shutdown cancellation still propagates.
Feature — monthly call retention (opt-in via RETENTION_CALLS, default 0 =
keep forever, so non-destructive on upgrade). DropOldCallPartitions removes
the FK-coupled call family per a calendar-month whole-partition policy in
foreign-key-safe order: drop call_frequencies/call_transmissions partitions,
delete transcriptions below the partition boundary (not the raw cutoff, so
transcripts for a not-yet-expired boundary month are preserved), then
DETACH+DROP the calls partitions (a plain DROP is refused while child FK
constraints depend on the partition). Wired through the full
config/override/locked/status plumbing, openapi.yaml, sample.env, CLAUDE.md.
Tests: unit tests for maintenanceStep budgets/shutdown; real-Postgres
integration tests (skip without TEST_DATABASE_URL) for the call-family drop,
the boundary preservation subtlety, the disabled no-op, and Bug 1 timestamptz
bound parsing + DEFAULT-partition safety. Verified against postgres:17-alpine.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Three related fixes to the daily DB maintenance run. Reported by J-Man; verified against prod and a real Postgres 17.
Bug 1: raw partitions never dropped (the ~2-month buildup)
DropOldWeeklyPartitions parsed the partition upper bound with time.Parse("2006-01-02", …), but mqtt_raw_messages is partitioned on a timestamptz column, so pg_get_expr renders bounds like 2026-05-18 00:00:00+00. The date-only layout errored and the code silently continued past every partition, so nothing was ever dropped. Verified on prod: 13 stale weekly partitions sitting undropped.
Fix: a shared expiredPartitions helper casts the bound ::timestamptz in SQL and scans into a nullable *time.Time (NULL = DEFAULT/MAXVALUE bound → never dropped).
Bug 2: timeout starvation
The whole run shared one 5*time.Minute context. A slow step (e.g. state-table decimation) could leave later cleanup steps with an already-expired context; those failures were only logged at Warn, silently skipping retention.
Fix: maintenanceStep runs each step under its own timeout derived from the pipeline context (decimation/purge per-table), so no step can starve another. Shutdown cancellation still propagates.
Feature: monthly call retention (opt-in)
New RETENTION_CALLS knob, default 0 = keep forever (non-destructive on upgrade). When set, DropOldCallPartitions removes the FK-coupled call family per a calendar-month whole-partition policy, in FK-safe order:
1. Drop call_frequencies / call_transmissions partitions
2. Delete transcriptions below the partition boundary (not the raw cutoff, so transcripts for a not-yet-expired boundary month are preserved)
3. DETACH + DROP the calls partitions
The order was validated empirically: a plain DROP of a referenced calls partition is refused by PostgreSQL because the child FK constraints depend on it; DETACH then DROP works once children are gone. Wired through the full config/override/locked/status plumbing, openapi.yaml, sample.env, and CLAUDE.md.
Tests
- internal/ingest/maintenance_test.go: unit tests for maintenanceStep (independent budgets, shutdown propagation).
- internal/database/maintenance_integration_test.go: real-Postgres integration tests (skip without TEST_DATABASE_URL): call-family drop + FK ordering, boundary preservation, disabled no-op, Bug 1 timestamptz parsing + DEFAULT-partition safety.
- Changing boundary→cutoff makes TestCallRetention fail, confirming the boundary guard is real.
All pass against postgres:17-alpine; full go build / go vet / go test ./... green.
Notes for reviewers
- RETENTION_CALLS is disabled by default, so there is no behavior change until an operator opts in.
- unit_events (still "permanent") and trunking_messages (already row-purged) are deliberately left out.
🤖 Generated with Claude Code