KurodaKayn · Jun 28, 2026
diff --git a/doc/plan/database-optimization.md b/doc/plan/database-optimization.md
@@ -11,15 +11,15 @@ Status definitions:
 - `Not started`: no clear implementation has been found yet.
 - `Deferred`: not recommended for the current business stage; only trigger conditions are retained.
 
-Current overall progress: about `69%`. This number is manually estimated by phase weight and can be adjusted later according to actual completed items.
+Current overall progress: about `71%`. This number is manually estimated by phase weight and can be adjusted later according to actual completed items.
 
 | Phase | Weight | Current completion | Status | Completed | Not done / next steps |
 | ----- | ------ | ------------------ | ------ | --------- | --------------------- |
 | Phase 0: Data-layer baseline inventory | 10% | 100% | Done | GORM query observability, `mpp_db_*` metrics, dashboard query-plan audit script, `pg_stat_statements`, database baseline audit script, PostgreSQL exporter table-level health and 24h row-growth panels, and read/write consistency classification | Continue implementing code routing, partitioning, and archiving according to this checklist in later phases |
 | Phase 1: Single-database connection pool, indexing, pagination, and lifecycle governance | 15% | 100% | Done | backend/publish-worker/collab-service application connection pools, Redis client connection pool, PgBouncer writer pool, composite indexes, keyset list pagination, list queries avoiding the large `source_content` field, event/session history retention periods, and R2/S3 cold-event archive worker | None; Phase 2 is complete, and later work moves into Phase 3/4 read replicas, partitioning, and recovery flows |
 | Phase 2: Read models and cache first | 15% | 100% | Done | Redis and Asynq dependencies are reusable; admin dashboard stats, admin project list, and dashboard account summary have short-TTL Redis cache; stats/project list/account cache misses are merged with singleflight; project, prepublish, publish, and account write paths invalidate the related dashboard cache; `workspace_dashboard_stats` and `project_list_summaries` read models are in place, and APIs prefer read models when coverage is complete; async refresh triggers after project save, platform sync, publish completion, and member changes; admin rebuild API, Asynq queue, and worker support full read-model rebuild | None |
 | Phase 3: Read/write splitting | 15% | 100% | Done | Optional `DB_READER_*` connection, application-level DB Router, signed sticky writer, consistency routing for project/stats/workspace/platform_account/publish/prepublish/mediaasset/browser_session/extension, consistency-level inventories for dashboard/publish/collab-service, self-hosted PostgreSQL read replica, managed `postgres-reader` entry point, PgBouncer reader pool, replica lag monitoring and automatic fallback to writer when over threshold | None; Phase 4 continues partitioning, archiving, and recovery flows |
-| Phase 4: Single-database partitioning, archiving, and hot/cold tiering | 15% | 85% | In progress | Collaborative editing already has state + update batch + compaction foundation; `collab_document_update_batches` has PostgreSQL `document_id` hash partition target schema; event and terminal-session history already have row-level R2/S3 archive worker; `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and `remote_browser_sessions` have PostgreSQL monthly partition target schema; the archive worker exports whole cold monthly partitions to R2/S3 before detaching and dropping them | Archive recovery flow |
+| Phase 4: Single-database partitioning, archiving, and hot/cold tiering | 15% | 100% | Done | Collaborative editing already has state + update batch + compaction foundation; `collab_document_update_batches` has PostgreSQL `document_id` hash partition target schema; event and terminal-session history already have row-level R2/S3 archive worker; `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and `remote_browser_sessions` have PostgreSQL monthly partition target schema; the archive worker exports whole cold monthly partitions to R2/S3 before detaching and dropping them; archive recovery procedure is documented | None |
 | Phase 5: Citus preparation | 20% | 5% | Not started | Workspace model, `projects.workspace_id`, and personal workspace ID already exist | Global `workspace_id`, Citus distribution column/colocation design, unique constraint and foreign-key review |
 | Phase 6: Citus distributed PostgreSQL operation | 10% | 0% | Deferred | None | Future Citus cluster design, worker/coordinator monitoring and backup, large-tenant isolation strategy |
 
@@ -64,7 +64,7 @@ Atomic commit guidance:
 | Dashboard read models | Done | Added `workspace_dashboard_stats` and `project_list_summaries` read models, idempotently recomputed from fact tables by a centralized readmodel service; async refresh is triggered after project save, platform sync, publish completion, and member changes; admin stats and admin project list prefer read models when coverage is complete; admin rebuild API enqueues through Asynq, and API/worker processes can start readmodel workers for full rebuild from fact tables | None | `backend/internal/models/models.go`, `backend/internal/services/readmodel/service.go`, `backend/internal/services/readmodel/queue.go`, `backend/internal/services/readmodel/service_test.go`, `backend/internal/services/readmodel/queue_test.go`, `backend/internal/services/stats/overview.go`, `backend/internal/services/project/lifecycle.go`, `backend/internal/handlers/dashboard.go`, `backend/cmd/api/main.go`, `backend/cmd/publish-worker/main.go` |
 | Redis read cache | Done | Redis is already used for queues, locks, OAuth, browser sessions, and short-term coordination; admin dashboard stats, admin project list, and dashboard account summary use 15s TTL cache and bypass scoped/sticky-writer strong-consistency paths; stats/project list/account cache misses use singleflight to prevent process-local stampede; stats and account caches use versioned payloads and semantic validation, and Redis read-error fallback is also merged into one DB computation per key; project create/edit/platform save, prepublish sync/draft update, publish queue/execute/fail, and platform account write paths invalidate the related dashboard cache; full read-model rebuild reuses the Redis/Asynq queue | None | `backend/internal/services/stats/overview.go`, `backend/internal/services/stats/overview_test.go`, `backend/internal/services/project/list_cache.go`, `backend/internal/services/project/list_cache_test.go`, `backend/internal/services/prepublish/drafts.go`, `backend/internal/services/publish/service.go`, `backend/internal/services/publish/queue.go`, `backend/internal/services/publish/publication_flow_test.go`, `backend/internal/services/publish/queue_test.go`, `backend/internal/services/platform_account/account_cache.go`, `backend/internal/services/platform_account/account_cache_test.go`, `backend/internal/services/browser_session/complete.go`, `backend/internal/services/browser_session/service_test.go`, `backend/internal/services/readmodel/queue.go` |
 | Read/write splitting | Done | Supports optional `DB_READER_*` read-replica connection, `DefaultRouter`, and signed sticky writer; project/stats/workspace/platform_account/publish/prepublish/mediaasset/browser_session/extension are wired to strong/eventual/writer routing; dashboard, publish, and collab-service consistency-level inventories are complete, with collab-service online path kept writer-only; writer/reader pools are in self-hosted Kubernetes, and managed overlay provides a `postgres-reader` ExternalName entry point; `DB_READER_MAX_REPLICA_LAG` configures the replica lag threshold, eventual/analytics reads automatically fall back to writer when over threshold or lag is unknown, and `mpp_db_replica_lag_seconds` and `mpp_db_replica_healthy` metrics are exposed | None | `backend/internal/db/db.go`, `backend/internal/db/router.go`, `backend/internal/db/replica_lag.go`, `backend/internal/services/publish/service.go`, `backend/internal/services/prepublish/service.go`, `backend/internal/services/mediaasset/service.go`, `backend/internal/services/browser_session/service.go`, `backend/internal/services/extension/service.go`, `backend/internal/app/runtime.go`, `deploy/kubernetes/data-services/self-hosted/postgres.yaml`, `deploy/kubernetes/data-services/self-hosted/pgbouncer.yaml`, `deploy/kubernetes/data-services/managed/services.yaml`, `script/kubernetes/validation/data_services.rb` |
-| Event-table partitioning and archiving | In progress | `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and terminal `remote_browser_sessions` have default retention periods; the `archive` worker can batch-export JSONL to R2/S3 and delete old hot-table rows after successful upload; PostgreSQL schema initialization now creates monthly `created_at` partitions for `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and `remote_browser_sessions`, with partition-compatible `(id, created_at)` primary keys and rolling partition creation; the archive worker exports whole cold monthly partitions as JSONL to R2/S3, then detaches and drops the partition after successful upload; PostgreSQL browser-session active-row fallback uses a scoped advisory transaction lock because partitioned unique constraints must include the partition key | Archive recovery flow is not implemented | `backend/internal/db/monthly_partitions.go`, `backend/internal/db/db.go`, `backend/internal/models/models.go`, `backend/internal/db/db_test.go`, `backend/internal/services/browser_session/start.go`, `backend/internal/services/browser_session/cleanup.go`, `backend/internal/services/archive/worker.go`, `backend/internal/services/archive/partitions.go`, `backend/internal/services/archive/worker_test.go`, `backend/internal/services/archive/partitions_test.go` |
+| Event-table partitioning and archiving | Done | `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and terminal `remote_browser_sessions` have default retention periods; the `archive` worker can batch-export JSONL to R2/S3 and delete old hot-table rows after successful upload; PostgreSQL schema initialization now creates monthly `created_at` partitions for `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and `remote_browser_sessions`, with partition-compatible `(id, created_at)` primary keys and rolling partition creation; the archive worker exports whole cold monthly partitions as JSONL to R2/S3, then detaches and drops the partition after successful upload; PostgreSQL browser-session active-row fallback uses a scoped advisory transaction lock because partitioned unique constraints must include the partition key; the archive recovery procedure defines inspection, staging restore, optional hot-table reinsertion, and audit checks | None | `backend/internal/db/monthly_partitions.go`, `backend/internal/db/db.go`, `backend/internal/models/models.go`, `backend/internal/db/db_test.go`, `backend/internal/services/browser_session/start.go`, `backend/internal/services/browser_session/cleanup.go`, `backend/internal/services/archive/worker.go`, `backend/internal/services/archive/partitions.go`, `backend/internal/services/archive/worker_test.go`, `backend/internal/services/archive/partitions_test.go`, Phase 4 archive recovery procedure in this document |
 | Collaboration batch governance | In progress | `collab_document_states`, `collab_document_update_batches`, and compaction/retention foundations exist; PostgreSQL schema initialization creates a 16-way `document_id` hash-partitioned `collab_document_update_batches` target table and migrates existing regular-table rows into it | Cold archiving is not implemented | `backend/internal/db/hash_partitions.go`, `backend/internal/db/db.go`, `backend/internal/models/collab.go`, `backend/internal/db/db_test.go`, `collab-service/src/persistence/document-persistence.ts` |
 | Outbox/CDC/event stream | In progress | The publishing queue path has a transactional Outbox: `EnqueuePublishProject` writes `outbox_events` in the same transaction and dispatches immediately after commit; publish worker starts an outbox dispatcher and supports retries for failed/stale processing records; Asynq continues to serve as the task-execution queue, and `PublishEvent` continues to serve as publishing audit | Currently covers only `publish.job_requested`; general business-event outbox, Debezium, and Redpanda/Kafka CDC are not implemented | `backend/internal/services/publish/queue.go`, `backend/internal/services/publish/outbox.go`, `backend/internal/models/models.go` |
 | Citus target state | Not started | Confirmed `workspace_id` as the most suitable distribution-column direction | Citus distributed tables, reference tables, and colocation are not implemented | Phase 5/6 in this document |
@@ -121,7 +121,7 @@ Atomic commit guidance:
 - [x] Partition `remote_browser_sessions` by time or expiration time. Verification entry point: `backend/internal/db/monthly_partitions.go`, `backend/internal/models/models.go`, `backend/internal/services/browser_session/start.go`, `backend/internal/db/db_test.go`.
 - [x] Hash partition `collab_document_update_batches` by `document_id`. Verification entry point: `backend/internal/db/hash_partitions.go`, `backend/internal/db/db.go`, `backend/internal/models/collab.go`, `backend/internal/db/db_test.go`.
 - [x] Export cold partitions to R2/S3. Verification entry point: `backend/internal/services/archive/partitions.go`, `backend/internal/services/archive/partitions_test.go`, `backend/internal/services/archive/worker.go`.
-- [ ] Write archive recovery procedure.
+- [x] Write archive recovery procedure. Verification entry point: Phase 4 archive recovery procedure in this document.
 
 #### Phase 5: Citus Preparation
 
@@ -463,6 +463,17 @@ Costs:
 - GORM AutoMigrate is not suitable for directly managing complex partitioning; design partition DDL separately when this phase becomes real.
 - Queries must include time or tenant filters, otherwise they still scan many partitions.
 
+Archive recovery procedure:
+
+1. Define the recovery request before touching production. Record the table name, object key or prefix, required time range, requester, reason, and whether the goal is inspection only or hot-table reinsertion. Supported archived domains are `publish_events`, `extension_execution_events`, `project_activities`, `workspace_activities`, and terminal `remote_browser_sessions`; `collab_document_update_batches` is not part of the archive worker until a separate collaboration archive flow exists.
+2. Freeze the relevant archive inputs. Keep `EVENT_ARCHIVE_ENABLED=false` in the target restore environment, or pause the publish-worker archive process, so the same records are not re-archived while recovery is in progress. Confirm `EVENT_ARCHIVE_OBJECT_PREFIX`; the default prefix is `archives/database`.
+3. Locate archive objects in R2/S3. Whole-partition archives use the key pattern `<prefix>/<table>/partitions/partition_start=YYYY-MM-DD/partition_end=YYYY-MM-DD/<partition>-<archived_at_unix_nano>.jsonl`. Row-level fallback batches use `<prefix>/<table>/cutoff_date=YYYY-MM-DD/batch-<archived_at_unix_nano>-NNNN.jsonl`.
+4. Restore into a staging database or scratch schema first. Download only the required objects, parse newline-delimited JSON, and load the `row` payload into a temporary table that mirrors the target table schema. Preserve the envelope fields `schema_version`, `table`, `archived_at`, `retention_cutoff`, and, for partition archives, `partition`, `partition_start`, and `partition_end` in a separate manifest table or recovery log.
+5. Validate before reinsertion. Check that every JSONL line has `schema_version=1` and the expected `table`, compare object line counts with temporary-table row counts, verify that `created_at` falls inside the requested range, and sample important foreign keys such as `project_id`, `workspace_id`, `publication_id`, or `session_id` against the current database. For `remote_browser_sessions`, reinsert only terminal rows unless the recovery request explicitly targets offline investigation.
+6. Choose the recovery destination. For audit or support inspection, keep the restored rows in the scratch schema and query them there. For hot-table reinsertion, restore through the partitioned parent table, not a detached child table, so PostgreSQL routes rows into the correct monthly partition. Create missing monthly partitions before insert if the retention window no longer has them.
+7. Reinsert idempotently when production restoration is required. Use a transaction, insert only reviewed rows, and use the table's primary-key conflict handling, for example `ON CONFLICT DO NOTHING`, so replaying the same object is safe. Recreate dependent rows only when the archive flow deleted them; for current domains this mainly means `extension_execution_event_claims`, which are intentionally removed before archived extension-event partitions are dropped.
+8. Verify and close the recovery. Compare restored IDs and counts against the staging manifest, run the targeted application queries that need the recovered rows, and record the object keys and SQL used. Re-enable the archive worker only after confirming that the recovered rows should remain hot or are exempted from immediate re-archiving by the retention policy.
+
 ### Phase 5: Citus Preparation
 
 When to apply:
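
Note on step 5 of the archive recovery procedure: the per-line checks (`schema_version=1`, expected `table`, `created_at` inside the requested range) could be sketched in Go roughly as follows. This is a minimal sketch, not the archive worker's code. The envelope field names come from the procedure above, but the Go types, the RFC 3339 encoding of `created_at` inside `row`, and the names `ValidateArchive`, `envelope`, `rowTimestamp`, and `mustTime` are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"time"
)

// envelope holds the envelope fields this check needs. Field names follow the
// recovery procedure; the Go types are assumptions.
type envelope struct {
	SchemaVersion int             `json:"schema_version"`
	Table         string          `json:"table"`
	Row           json.RawMessage `json:"row"`
}

// rowTimestamp reads only created_at from the row payload.
// Assumption: created_at is serialized as an RFC 3339 string.
type rowTimestamp struct {
	CreatedAt time.Time `json:"created_at"`
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// ValidateArchive checks every non-empty JSONL line in data and returns the
// number of rows that passed. It stops at the first invalid line.
// The range is inclusive of from and exclusive of to.
func ValidateArchive(data []byte, table string, from, to time.Time) (int, error) {
	sc := bufio.NewScanner(bytes.NewReader(data))
	// Raise the per-line limit; archived rows can exceed the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)

	count, line := 0, 0
	for sc.Scan() {
		line++
		raw := bytes.TrimSpace(sc.Bytes())
		if len(raw) == 0 {
			continue
		}
		var env envelope
		if err := json.Unmarshal(raw, &env); err != nil {
			return count, fmt.Errorf("line %d: invalid JSON: %w", line, err)
		}
		if env.SchemaVersion != 1 {
			return count, fmt.Errorf("line %d: schema_version=%d, want 1", line, env.SchemaVersion)
		}
		if env.Table != table {
			return count, fmt.Errorf("line %d: table=%q, want %q", line, env.Table, table)
		}
		var ts rowTimestamp
		if err := json.Unmarshal(env.Row, &ts); err != nil {
			return count, fmt.Errorf("line %d: invalid row payload: %w", line, err)
		}
		if ts.CreatedAt.Before(from) || !ts.CreatedAt.Before(to) {
			return count, fmt.Errorf("line %d: created_at %s outside requested range",
				line, ts.CreatedAt.Format(time.RFC3339))
		}
		count++
	}
	if err := sc.Err(); err != nil {
		return count, fmt.Errorf("scan: %w", err)
	}
	return count, nil
}

func main() {
	// Hypothetical single-line archive object for demonstration only.
	sample := []byte(`{"schema_version":1,"table":"publish_events","row":{"id":1,"created_at":"2026-01-15T10:00:00Z"}}` + "\n")
	n, err := ValidateArchive(sample, "publish_events",
		mustTime("2026-01-01T00:00:00Z"), mustTime("2026-02-01T00:00:00Z"))
	if err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("valid rows:", n)
}
```

The returned count is meant to be compared with the temporary-table row count after loading, as step 5 requires. Failing on the first bad line is a deliberate choice for this sketch: a partially valid object should be reviewed by a person before any reinsertion.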