[Issue #1327] pixels retina recovery protocol #1328
Draft
gengdy1545 wants to merge 8 commits into
Pixels Retina Recovery Protocol
Summary
This PR introduces a recovery protocol for Pixels Retina that defines how the system recovers internal consistency after process crashes, machine restarts, CDC crashes, or simultaneous CDC + Retina crashes. On startup, Retina first cleanses the RG-visibility snapshot in the latest recovery checkpoint into a self-consistent internal recovery starting point; CDC then replays the source-side events that were dropped, starting from the recovery replay timestamp derived from that checkpoint. The system finally converges to the pre-crash state under `READY`. Queries are fully fail-closed during `RECOVERING` and only become available after a successful `MarkReady` barrier.

This document describes the final recovery-capable design: there is no online dual-track, no fallback to the legacy recovery path, and no runtime migration path for legacy catalog file states. New deployments initialize a fresh catalog directly with the final recovery-capable schema; any legacy `TEMPORARY` row is treated as an invalid catalog state rather than upgraded by Retina runtime code.

Note on metadata layout: ingest-path fields (`tableId`, `virtualNodeId`, `firstBlockId`, `fileMinCommitTs`, `fileMaxCommitTs`) are not persisted in the catalog `FILES` table. Their runtime source of truth is the new in-process `IngestFileMetadataRegistry`; their cross-process source of truth is the recovery checkpoint body's `FileEntry`. The catalog `FILES` table is only extended with the four-state file type enum and `FILE_CLEANUP_AT`.

Task List
The implementation is organized as a sequence of commit-level tasks. The numbering only encodes dependency and suggested merge order, not priority. No intermediate commit is allowed to let an unfinished recovery-capable path enter a queryable `READY`.

C00 — Baseline hardening of the current code

- `MetadataService.addFiles / updateFile / deleteFiles` return `true` on success; treat `false`/exception as publish-barrier failure in all callers
- `MainIndexBuffer` snapshot → SQLite transaction commit → drop buffer/cache only after commit; `row_id_ranges` writes idempotent on retry
- Fix the `PixelsWriteBuffer.addRow` RowLocation race (capture `currentMemTable` at append time)
- Stop `FileWriterManager.finish()` from publishing `REGULAR` directly; centralize publish in `PixelsWriteBuffer`
- Add `getRegularFiles(pathId)` and route every query-visible call site through it; non-`REGULAR` subsets are reachable only via `getFilesByType(pathId, types)`

C01 — Metadata schema and enumeration APIs
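The fail-fast `FILE_CLEANUP_AT` rule this stage introduces can be sketched as follows; the class and method names here are illustrative, not the actual Pixels DAO API:

```java
// Sketch of the C01 catalog invariant: FILE_CLEANUP_AT is meaningful only on
// RETIRED rows and must be NULL on all other file types. The DAO enforces it
// fail-fast so caller bugs never leak into recovery, rollback or sweep logic.
// All names are hypothetical stand-ins for the real metadata DAO.
class FileCleanupAtInvariant {
    enum FileType { TEMPORARY_INGEST, TEMPORARY_GC, REGULAR, RETIRED }

    /** Throws if the (type, cleanupAt) pair violates the catalog invariant. */
    static void validateCleanupAt(FileType type, Long cleanupAt) {
        if (type == FileType.RETIRED && cleanupAt == null) {
            throw new IllegalArgumentException("RETIRED row requires FILE_CLEANUP_AT");
        }
        if (type != FileType.RETIRED && cleanupAt != null) {
            throw new IllegalArgumentException("FILE_CLEANUP_AT must be NULL on a non-RETIRED row");
        }
    }

    /** Convenience predicate over the same rule. */
    static boolean isValid(FileType type, Long cleanupAt) {
        try {
            validateCleanupAt(type, cleanupAt);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```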
- Extend the file type enum to `TEMPORARY_INGEST / TEMPORARY_GC / REGULAR / RETIRED` and add `FILE_CLEANUP_AT`. This is the final extension point of the catalog `FILES` schema: no ingest-path columns (`tableId`, `virtualNodeId`, `firstBlockId`, `fileMinCommitTs`, `fileMaxCommitTs`) are added, and no `rowIdStart`/`rowCount` columns or `RowIdRange` side table are added; the multi-segment range source of truth lives in the `LocalIndexService` backend, and ingest-path metadata flows through `IngestFileMetadataRegistry` + checkpoint body
- Add `getFilesByType(Long pathId, Set<File.Type> types)` as the single SQL-layer enumeration entry point and expose the client-wrapper composites `getRegularFiles` / `listTemporaryFilesDue(long ttlMs)` / `listRetiredFilesDue()` (TEMP TTL is sweeper-injected and the deadline anchor is the filename-embedded UTC timestamp; `FILE_CLEANUP_AT` is `RETIRED`-only, enforced fail-fast in the DAO)
- Query-visible enumeration goes through `getRegularFiles`; non-`REGULAR` subsets reachable only via `getFilesByType` (diagnostic / admin / RPC fan-out only). This stage only constrains catalog enumeration semantics; query-facing transaction context wiring is deferred to C05.5/C05.6
- `atomicSwapFiles(newFileId, oldFileIds, cleanupAt)` in a single metadata transaction
- Legacy `TEMPORARY` is not supported and is treated as an invalid catalog state

C02 — Ingest publish ordering and write-buffer prefix visibility
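The publish ordering this stage centralizes can be sketched as a strictly ordered barrier chain; all names below are hypothetical stand-ins for the real `PixelsWriteBuffer` publisher:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Sketch of the C02 publish order: physical close -> index flush -> single
// atomic catalog update -> registry registration -> (object cleanup).
// The publisher fails closed at the first failing step, so no later barrier
// state (e.g. registryAdmitted) can ever be reached out of order.
class IngestPublishSketch {
    /** Runs the publish steps strictly in order; returns the barrier states
     *  that completed before the first failure (all four on success). */
    static List<String> publish(BooleanSupplier physicalClose,
                                BooleanSupplier indexFlush,
                                BooleanSupplier catalogCommit,
                                BooleanSupplier registryRegister) {
        String[] names = {"physicalClosed", "indexFlushed", "metadataRegular", "registryAdmitted"};
        BooleanSupplier[] steps = {physicalClose, indexFlush, catalogCommit, registryRegister};
        List<String> completed = new ArrayList<>();
        for (int i = 0; i < steps.length; i++) {
            if (!steps[i].getAsBoolean()) {
                return completed; // fail closed: later steps never run
            }
            completed.add(names[i]);
        }
        return completed;
    }
}
```

The point of the single ordered chain is that the catalog can never observe `REGULAR` without the registry registration following immediately after the commit returns.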
- `FileWriterManager.finish()` only does object-block flush, `writer.close()`, footer/length/checksum check
- Per-file publish barrier states (`physicalClosed` / `indexFlushed` / `metadataRegular` / `registryAdmitted` …)
- Track `fileMinCommitTs / fileMaxCommitTs` at the write boundary; footer hidden-timestamp stats are asserted equal to the in-memory truth at the close barrier and act as a physical audit / cross-validation fallback only
- `AppendSegmentState.visibleSize`, batch-level append handle, and `publishPendingAppend(handle, visible|hidden)`
- Per-stream ordering by `firstBlockId`; out-of-order physical close / index flush, but in-order `TEMPORARY_INGEST → REGULAR` publish
- Centralized publisher in `PixelsWriteBuffer`: physical close → `LocalIndexService.flushIndexEntriesOfFile` → single atomic catalog metadata update (file type + `FILE_MIN_ROW_ID`/`FILE_MAX_ROW_ID` hull only; commit ts not written to catalog) → `IngestFileMetadataRegistry.register(fileId, tableId, virtualNodeId, firstBlockId, fileMinCommitTs, fileMaxCommitTs)` → object cleanup
- Route `PixelsWriteBuffer.close()` paths through the same publisher; neither path may bypass the registry registration step
- `IngestFileMetadataRegistry`: `register` invoked synchronously after the catalog `REGULAR` commit returns; `unregister` invoked synchronously after `RETIRED` marking / Storage GC `atomicSwapFiles` returns; any fileId present in `RGVisibilityIndex` must be present in the registry, otherwise fail-closed; rebuilt from the checkpoint body's `FileEntry` after restart; consumed only by the checkpoint generator, recovery coordinator and file publisher / retire path, never exposed via RPC

C03 — LocalIndexService staging and Visibility replay semantics
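The tri-state `resolvePrimary` contract at the heart of this stage can be sketched as follows (hypothetical types; the real primitive runs against the `LocalIndexService` backend):

```java
import java.util.Map;
import java.util.Set;

// Sketch of the C03 tri-state resolution against the current baseline visible
// file set. Missing keys, tombstones, and RowLocations pointing into
// non-baseline (retired / cleansed-out) files all collapse into
// NOT_FOUND_OR_ORPHAN; only a backend failure may fail the request.
// Names and the flattened in-memory index are illustrative.
class ResolvePrimarySketch {
    enum Resolution { FOUND, NOT_FOUND_OR_ORPHAN, BACKEND_ERROR }

    static Resolution resolvePrimary(Map<String, Long> primaryIndex, // key -> fileId
                                     Set<Long> baselineVisibleFiles,
                                     String key,
                                     boolean backendHealthy) {
        if (!backendHealthy) {
            return Resolution.BACKEND_ERROR; // the only outcome allowed to fail the request
        }
        Long fileId = primaryIndex.get(key);
        if (fileId == null || !baselineVisibleFiles.contains(fileId)) {
            return Resolution.NOT_FOUND_OR_ORPHAN; // missing, tombstoned, or cleansed-out
        }
        return Resolution.FOUND;
    }
}
```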
- Staged primitives: `resolvePrimary`, `putMainIndexEntries`, `putPrimaryEntriesOnly`, `tombstonePrimaryResolved`, `updatePrimaryResolved`, `restorePrimaryEntries`, `deleteMainIndexRange`
- `resolvePrimary` returns tri-state `FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR` against the current baseline visible file set
- DELETE order: `resolvePrimary` → Visibility delete → `tombstonePrimaryResolved`
- INSERT order: append pending → `putMainIndexEntries` → `putPrimaryEntriesOnly` → `publishPendingAppend(visible)`; failure path compensates via Visibility delete + primary tombstone + publish-hidden, or fails closed
- Assert `UpdateDataIndexKey.timestamp == TableUpdateData.timestamp` and use it everywhere
- In `RGVisibility` / JNI / native `TileVisibility`, fork DELETE by `T <= baseTimestamp` (COW fold into `baseBitmap`) vs `T > baseTimestamp` (append deletion chain)

C04 — Recovery checkpoint and startup cleansing
- Checkpoint header (`retinaNodeId`, `checkpointId`, `writerEpoch`, `writeTime`, `checkpointAppliedTs`, `topologyHash`) + `fileEntries[]` / `ingestSegmentEntries[]` / `rgEntries[]`. The body is the recovery source of truth for the ingest-path fields of each kept `FileEntry`; the catalog `FILES` table is not consulted for these fields
- `current`/`previous` slots under `/pixels/retina/recovery/checkpoint/${encodedNodeId}/`; publish order: write body + fsync → atomic etcd swap → async cleanup of replaced body
- The generator computes `checkpointAppliedTs = HWM - 1`, then dumps RGVisibilityIndex + ingest-path metadata of checkpoint-admitted REGULAR files from `IngestFileMetadataRegistry` + pending/open ingest segment snapshot from `PixelsWriteBuffer`; it does not read the catalog `FILES` table to recover ingest fields and does not read back Pixels footers
- Startup validation of `retinaNodeId` / `topologyHash` / `checkpointAppliedTs`
- Cleanse `FileEntry / VisibilityEntry / IngestSegmentEntry` with the body as the ingest-path source of truth, cross-validated against catalog `FILES` (type / hull), `pathId → layoutId → tableId` resolution, `FILE_NAME → virtualNodeId` parsing, and footer `hiddenColumnStats` as best-effort fallback. Rebuild RGVisibilityIndex and MainIndex baseline; produce the baseline visible file set; re-admit kept `FileEntry`s back into `IngestFileMetadataRegistry`
- Derive `scopeReplayFromTs / vnodeReplayFromTs / nodeReplayFromTs` per §3.4; degrade to `MIN_REPLAY_TS` when the segment chain is untrustworthy
- Mark orphan `REGULAR` files (not covered by checkpoint, not protected by Storage GC journal) as `RETIRED` and enqueue cleanup; `FAILED` if no checkpoint exists but the catalog has `REGULAR`

C05 — Lifecycle, query gate and startup gating
- `RetinaLifecycleState` and a lifecycle coordinator publishing `RECOVERING / READY / FAILED` to a leased `/pixels/retina/lifecycle/<host:retinaPort>` key
- Replace `RetinaServerImpl` constructor logic with the recovery coordinator; remove metadata full-preload from the recovery-capable path
- Move the `RetinaResourceManager` constructor into the lifecycle `READY` hook
- Gate `QueryVisibility / GetWriteBuffer / RegisterOffload / UpdateRecord / StreamUpdateRecord / AddVisibility / AddWriteBuffer` on lifecycle state
- Wire `QueryAvailabilityGate` to `TransServiceImpl.beginTrans / beginTransBatch` for `readOnly=true`
- `markReady(...)` skeleton; production paths must never observe `READY` until C06

C06 — CDC recovery replay and MarkReady barrier
- Extend `proto/retina.proto` with `GetRetinaStatus / GetRecoveryReplayTs / MarkReady`, the state enum, `recoveryAttemptToken / recoveryEpoch / checkpointId / replayTsReady / vnodeReplayFromTs / nodeReplayFromTs`
- Add an optional `RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode)` to `UpdateRecordRequest`
- In `RECOVERING`, accept only DELETE / CDC-replace requests with matching context and `timestamp >= replayFromTs`; reject standalone INSERT
- Replay writes are accepted until the `MarkReady` barrier starts
- `markReady(recoveryAttemptToken, checkpointId)`: validate → close entry → drain in-flight → switch to `READY` → start GC scheduler → invalidate split/cache → unblock queries
- `READY` is reachable only via the `MarkReady` path; no shortcut allowed

C07 — Storage GC recovery
- GC rewrite files are `TEMPORARY_GC`; candidate scan via `getRegularFiles`; rewrite cutoff = `TransService.getSafeGcTimestamp()`
- Rewrite folds `delete_ts <= safeGcTs`; new-file Visibility initialized with `baseTimestamp = safeGcTs` and chain items for `delete_ts > safeGcTs`
- Primary switch via `LocalIndexService` update
- After `atomicSwapFiles`, the journal becomes `SWAPPED_NOT_CHECKPOINTED`; promote to `CHECKPOINTED` only after a durable recovery checkpoint baseline accepts the new file
- On recovery, `INDEX_SWITCHING` rolls back; `SWAPPED_NOT_CHECKPOINTED` either promotes or rolls back depending on checkpoint baseline acceptance; a missing journal with the primary pointing to a non-baseline newRowId is `FAILED`

C08 — Background cleanup, testing and operations
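The filename-anchored TTL deadline used by the sweepers in this stage can be sketched as below; the exact filename layout (a leading `yyyyMMddHHmmss` segment) and the method names are assumptions for illustration:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of the C08 TTL filter: the deadline anchor is the UTC yyyyMMddHHmmss
// segment embedded in the file name, and the TTL is injected by the sweeper.
// The assumption that the segment leads the name is illustrative only.
class TempFileTtlSketch {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Parses the assumed leading UTC timestamp segment to epoch millis;
     *  returns -1 on parse failure (the sweeper WARNs and skips such names). */
    static long extractCreateTimeMillis(String fileName) {
        if (fileName.length() < 14) {
            return -1;
        }
        try {
            LocalDateTime t = LocalDateTime.parse(fileName.substring(0, 14), FMT);
            return t.toInstant(ZoneOffset.UTC).toEpochMilli();
        } catch (Exception e) {
            return -1;
        }
    }

    /** A temporary file is due when createTime + ttl <= now. */
    static boolean isDue(String fileName, long ttlMs, long nowMs) {
        long created = extractCreateTimeMillis(fileName);
        return created >= 0 && created + ttlMs <= nowMs;
    }
}
```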
- `listTemporaryFilesDue(ttlMs)` (a client-wrapper convenience method composed over `getFilesByType([TEMPORARY_INGEST, TEMPORARY_GC])` + a client-side `PixelsFileNameUtils.extractCreateTimeMillis(name) + ttlMs <= now` filter; createTime is the UTC `yyyyMMddHHmmss` segment embedded by `DateUtil.getCurTime`, TTL is injected by the sweeper rather than read inside the metadata client; filename-parse failure is C-1 WARN+skip) for long-hanging temporary files; `TEMPORARY_GC` physical cleanup still requires Storage GC journal validation, the TTL is only a long-hanging safety net
- `listRetiredFilesDue()` (a client-wrapper convenience method composed over `getFilesByType([RETIRED])` + a client-side `cleanupAt <= now` filter) cleaning old Visibility → old MainIndex range → physical file → retired catalog (via the generic `deleteFiles(fileIds)` RPC), in that order
- Testing and operations coverage around the `recoveryAttemptToken` flow

Detailed Design
1. Goals and boundaries
The recovery protocol covers four crash scenarios: Retina process crash, machine restart, CDC crash, and simultaneous CDC + Retina crash.
After Retina restarts, it loads the latest valid recovery checkpoint, cleanses its RG-visibility snapshot against the catalog, and rebuilds an internal recovery starting point. CDC then replays source events from the replay timestamp derived from that checkpoint, with inclusive semantics. Once CDC has acknowledged completion via `MarkReady`, Retina runs an internal barrier that closes the replay write entry, drains in-flight replay requests, and switches to `READY`, the only edge where queries become visible.

Recovery is bound to a fixed Retina topology. The expected node set comes from the static configuration `$PIXELS_HOME/etc/retina`, and each checkpoint records a `topologyHash`. Within one recovery attempt, the expected Retina set, `retina.server.port`, `node.virtual.num` and the vnode-to-Retina mapping must stay constant; otherwise the attempt fails closed or restarts.

`READY` only guarantees per-effect visibility (a single INSERT or single DELETE). CDC replace's DELETE + INSERT pair does not provide combined atomic scan visibility; the short-lived intermediate state (DELETE applied, INSERT not yet, or vice versa) is normal `READY` freshness behavior, not a recovery-correctness violation.

2. Lifecycle and file states
Retina has three external lifecycle states:
- `RECOVERING`
- `READY`
- `FAILED`

`RECOVERING` is the only externally visible recovery state. Internally it is split into checkpoint cleansing, waiting for CDC recovery replay, and the `MarkReady` barrier. Even after cleansing completes and the replay timestamp is computed, queries remain rejected; otherwise the same read timestamp could observe different snapshots as replay progresses.

File states in the catalog are extended to four:
- `REGULAR`: published, query-visible data file;
- `TEMPORARY_INGEST`: pre-allocated or in-progress ingest file;
- `TEMPORARY_GC`: Storage GC rewrite file, governed by the GC journal;
- `RETIRED`: old `REGULAR` retired by GC swap or recovery cleansing, with `FILE_CLEANUP_AT` driving delayed cleanup.

`FILE_CLEANUP_AT` is meaningful only on `RETIRED` rows; `TEMPORARY_INGEST`, `REGULAR` and `TEMPORARY_GC` must persist `NULL`. The DAO write path enforces this invariant fail-fast (throws on a stray `cleanupAt` on a non-`RETIRED` row or a missing `cleanupAt` on a `RETIRED` row) so that caller bugs never leak into recovery, rollback or sweep logic. Long-hanging deadlines for `TEMPORARY_INGEST` / `TEMPORARY_GC` are derived from the filename-embedded UTC `yyyyMMddHHmmss` segment plus a sweeper-supplied TTL, not from `FILE_CLEANUP_AT`.

Query-visible enumeration is done only through the client-wrapper convenience method `getRegularFiles(pathId)` (which internally invokes `getFilesByType(pathId, [REGULAR])`); non-`REGULAR` files are reachable only through admin/diagnostic APIs (cross-path `getFilesByType` for sweepers and the client-wrapper composites `listTemporaryFilesDue(long ttlMs)` / `listRetiredFilesDue()`). Admin/diagnostic enumeration is forbidden on any query / planner / cache / Storage GC candidate scan path. This C01-level rule only constrains catalog enumeration semantics; the query-facing transaction context wiring is implemented together with the C05 lifecycle/query gate.

The catalog `FILES` table holds only `fileId / name / type / numRowGroup / pathId / minRowId / maxRowId / cleanupAt`. Ingest-path fields (`tableId`, `virtualNodeId`, `firstBlockId`, `fileMinCommitTs`, `fileMaxCommitTs`) are kept entirely outside the catalog; their runtime source of truth is the in-process `IngestFileMetadataRegistry`, and their cross-process source of truth is the recovery checkpoint body's `FileEntry`.

3. Recovery checkpoint
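One consistent reading of the replay-timestamp derivation described later in this section can be sketched as follows; the exact combination rule is an assumption for illustration, and the names are hypothetical:

```java
import java.util.OptionalLong;

// Sketch of the per-scope replay starting point: replay begins at the earliest
// timestamp not guaranteed durable, i.e. the min of the visibility cut
// (checkpointAppliedTs) and the first unsafe pending/open ingest segment's
// minCommitTs. All-DELETE scopes pass an empty OptionalLong (+INF), so replay
// starts from checkpointAppliedTs and never falls back to 0. An untrustworthy
// segment chain degrades conservatively to MIN_REPLAY_TS = 0.
class ReplayTsSketch {
    static final long MIN_REPLAY_TS = 0L;

    static long scopeReplayFromTs(long checkpointAppliedTs,
                                  OptionalLong earliestUnsafeInsertTs,
                                  boolean segmentChainTrustworthy) {
        if (!segmentChainTrustworthy) {
            return MIN_REPLAY_TS; // degrade: replay more, never less
        }
        long insertBound = earliestUnsafeInsertTs.orElse(Long.MAX_VALUE); // +INF
        return Math.min(checkpointAppliedTs, insertBound);
    }
}
```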
A recovery checkpoint is not a query baseline; it is the only source of truth Retina uses to bootstrap its internal recovery state. The contract is as follows.
Each checkpoint body captures, for one Retina node:
- the `RGVisibilityIndex` snapshot (`(fileId, rgId) -> bitmap`);
- `checkpointAppliedTs = HWM - 1` (the visibility-applied cut for DELETE / UPDATE-old-row);
- the per-`(tableId, virtualNodeId)` ingest segment chain (REGULAR-admitted, pending, open). REGULAR-admitted segments are dumped from `IngestFileMetadataRegistry`; pending and open segments are taken from `PixelsWriteBuffer`. The catalog `FILES` table is not consulted for any ingest-path fields, and Pixels footers are not read back during this dump;
- `topologyHash`, `retinaNodeId`, `writerEpoch` / `leaseId`, `checksum`, `length`.

The commit protocol uses an immutable body object + a per-node etcd two-slot pointer: write body + fsync, then atomically swap `current`/`previous` in a single etcd transaction. Recovery only reads its own node's `current` then `previous`; it never lists storage directly.

`checkpointAppliedTs` requires a co-design with the write path: a write transaction can only enter committed state (and be covered by HWM) after Retina has synchronously completed its visibility / index / write-buffer effects and acked CDC. This invariant is what makes `HWM - 1` safe to use directly.

Recovery replay timestamps are derived, not stored:
`earliestUnsafeInsertTs` is the `minCommitTs` of the first non-empty pending/open ingest segment after the last checkpoint-admitted REGULAR segment in that scope. All-DELETE scopes produce +INF here, so the replay starts from `checkpointAppliedTs` and never falls back to 0. Untrustworthy segment chains degrade to a conservative `MIN_REPLAY_TS = 0`; degradation only ever increases the replay range, never reduces it.

CDC consumes either VNODE mode (`vnodeReplayFromTs[v]`) or NODE mode (`nodeReplayFromTs`). Both are inclusive.

4. Cleansing rules
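The per-entry classification this section specifies can be sketched as a pure decision function; the names and flattened parameters are illustrative, not the actual recovery coordinator API:

```java
// Sketch of the §4 cleansing verdict: a checkpoint entry is kept only if the
// catalog still shows the file as REGULAR with a matching record count and
// FILE_MIN_ROW_ID/FILE_MAX_ROW_ID hull; anything else discards the entry.
class CleansingSketch {
    enum Verdict { KEEP, DISCARD }
    enum CatalogType { TEMPORARY_INGEST, TEMPORARY_GC, REGULAR, RETIRED }

    static Verdict cleanse(boolean inCatalog, CatalogType catalogType,
                           long entryRecordNum, long catalogRecordNum,
                           long entryMinRowId, long catalogMinRowId,
                           long entryMaxRowId, long catalogMaxRowId) {
        if (!inCatalog) {
            return Verdict.DISCARD;                  // fileId not in catalog
        }
        if (catalogType != CatalogType.REGULAR) {
            return Verdict.DISCARD;                  // FILE_TYPE != REGULAR
        }
        if (entryRecordNum != catalogRecordNum
                || entryMinRowId != catalogMinRowId
                || entryMaxRowId != catalogMaxRowId) {
            return Verdict.DISCARD;                  // recordNum or row-id hull mismatch
        }
        return Verdict.KEEP;                         // REGULAR and matches
    }
}
```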
For every entry in the loaded checkpoint body:
- `fileId` not in catalog → discard the entry;
- `fileId` exists but `FILE_TYPE != REGULAR` → discard the entry;
- `fileId` is `REGULAR` but `recordNum` or the `FILE_MIN_ROW_ID`/`FILE_MAX_ROW_ID` hull mismatches the entry → discard the entry;
- `fileId` is `REGULAR` and matches → keep the entry.

For each kept `FileEntry`, the body is also cross-validated against (a) `pathId → layoutId → tableId` resolution (must match `FileEntry.tableId`), (b) `FILE_NAME → virtualNodeId` parsing via `PixelsFileNameUtils` (must match `FileEntry.virtualNodeId`), and (c) footer `hiddenColumnStats` (its commit-ts min/max must be contained in `[fileMinCommitTs, fileMaxCommitTs]`). Mismatches discard the whole group with a WARN. A footer-read failure only emits a WARN without forcing the discard, so recovery does not become hard-dependent on large-file footer IO.

Catalog files that are `REGULAR` but have no entry in the checkpoint body and are not protected by the Storage GC journal must be atomically marked `RETIRED` (`FILE_CLEANUP_AT = now`) before `READY`; their source events are redone by CDC replay. This avoids long-lived catalog-only `REGULAR` orphans being picked up by `getRegularFiles`.

Discarding individual entries does not invalidate the checkpoint. Only header / checksum / format errors invalidate the entire body and force a fallback to `previous`.

5. Subsystem recovery
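The Visibility DELETE fork described in this section can be sketched in plain Java; the real path is JNI-backed, and the names and the `long[]{timestamp, rowId}` chain encoding here are illustrative:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the §5 replay DELETE fork on a rebuilt RGVisibility:
//   T <= baseTimestamp -> copy-on-write fold into baseBitmap, no chain item;
//   T >  baseTimestamp -> append to the deletion chain at the original T.
class RgVisibilitySketch {
    final long baseTimestamp;
    final BitSet baseBitmap;                              // COW baseline copy
    final List<long[]> deletionChain = new ArrayList<>(); // {timestamp, rowId}

    RgVisibilitySketch(long baseTimestamp, BitSet baseBitmap) {
        this.baseTimestamp = baseTimestamp;
        this.baseBitmap = (BitSet) baseBitmap.clone();    // empty chain on rebuild
    }

    /** Applies one CDC replay DELETE at source timestamp t for rowId. */
    void applyReplayDelete(long t, int rowId) {
        if (t <= baseTimestamp) {
            baseBitmap.set(rowId);                        // fold, no chain growth
        } else {
            deletionChain.add(new long[]{t, rowId});      // keep original T
        }
    }
}
```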
Visibility. Each kept entry rebuilds an `RGVisibility` with `baseTimestamp` and `baseBitmap` taken from the body and an empty deletion chain. CDC replay DELETEs are forked at the apply path: `T <= baseTimestamp` → COW fold into `baseBitmap`, no chain item; `T > baseTimestamp` → append to the deletion chain at the original `T`. The replay start time has no required ordering with `baseTimestamp`.

WriteBuffer. Memtables are dropped; pre-allocated `TEMPORARY_INGEST` objects are best-effort deleted; new empty buffers are created. To cleanly handle in-flight appends across crashes, every memtable carries an `AppendSegmentState` separating the physical `rowBatch.size` from a query-visible, monotonically increasing `visibleSize`. `GetWriteBuffer` and object flush only consume the `[0, visibleSize)` prefix; failed appends are compensated via Visibility delete + primary tombstone and then `publishPendingAppend(handle, hidden)`, or the writer fails closed.

Ingest publish. A file becomes `REGULAR` only after physical close + footer/length/checksum check + MainIndex durable flush + becoming `nextCommitFirstBlockId` in its stream. The catalog metadata update atomically persists `FILE_TYPE = REGULAR` and the `FILE_MIN_ROW_ID`/`FILE_MAX_ROW_ID` hull only; commit ts and other ingest-path fields are not written to the catalog. After the catalog update returns, the file publisher synchronously calls `IngestFileMetadataRegistry.register(fileId, tableId, virtualNodeId, firstBlockId, fileMinCommitTs, fileMaxCommitTs)`. The close barrier asserts that footer `hiddenColumnStats` equals the in-memory append-order truth, so the footer acts as a physical persistent audit source. The conservative replay rule is that a file is `REGULAR_ADMITTED` only after a durable recovery checkpoint has captured it; otherwise its source events are redone by replay.

Index. Recovery rebuilds the MainIndex baseline from the kept `(fileId)` set by preferring `LocalIndexService`'s per-file durable `row_id_ranges` + `row_id_range_flush_markers`, with a sanity cross-check against the catalog `FILE_MIN_ROW_ID`/`FILE_MAX_ROW_ID` hull and `recordNum`; when the `LocalIndexService` data is missing or inconsistent with the flush marker, it falls back to the Pixels footer's `RowGroupInformation`, combined with the catalog hull, to derive per-RG `rgRowOffsetStart/End`. All write paths (normal writes, recovery replay, Storage GC) go through `LocalIndexService` only. The service exposes staged primitives (`resolvePrimary`, `putMainIndexEntries`, `putPrimaryEntriesOnly`, `tombstonePrimaryResolved`, `updatePrimaryResolved`, `restorePrimaryEntries`, `deleteMainIndexRange`). `resolvePrimary` returns a tri-state `FOUND / NOT_FOUND_OR_ORPHAN / BACKEND_ERROR`, where `NOT_FOUND_OR_ORPHAN` covers missing keys, primary tombstones, and RowLocations into non-baseline / retired / cleansed-out files. Only `BACKEND_ERROR` may fail the request. The secondary index is explicitly out of scope for recovery correctness at this stage.

Unified write order.

- DELETE: `resolvePrimary` → Visibility delete → `tombstonePrimaryResolved`
- INSERT: append pending → `putMainIndexEntries` → `putPrimaryEntriesOnly` → `publishPendingAppend(visible)`
- UPDATE: `resolvePrimary` → append pending → `putMainIndexEntries` → Visibility delete → `updatePrimaryResolved` → `publishPendingAppend(visible)`

CDC replace = DELETE + INSERT at the same source timestamp `T`, in the same all-or-nothing request, in the same stream, DELETE first.

IngestFileMetadataRegistry. An in-process map keyed by `fileId`, holding `(tableId, virtualNodeId, firstBlockId, fileMinCommitTs, fileMaxCommitTs)`, with a reverse index by `(tableId, virtualNodeId)` to enumerate a stream's published REGULAR files in order. Lifecycle:

- `register` is invoked synchronously after the catalog `REGULAR` commit returns, by the `PixelsWriteBuffer` file publisher. A failed register is fail-closed; the catalog must never observe `REGULAR` without a corresponding registry entry.
- `unregister` is invoked synchronously after `RETIRED` marking returns (covering both Storage GC `atomicSwapFiles` and recovery-time retire paths).
- Any fileId present in `RGVisibilityIndex` must be present in the registry. The checkpoint generator fails closed on a lookup miss, since this would indicate a publisher / retire ordering bug.
- The registry is rebuilt from kept `FileEntry`s during cleansing. REGULAR files captured by no checkpoint are demoted to `RETIRED` (§4) instead of being re-registered.

Storage GC. Candidate scan and rewrite cutoff use the runtime `TransService.getSafeGcTimestamp()`, decoupled from the recovery checkpoint. New-file Visibility starts at `baseTimestamp = safeGcTs`. Storage GC writes a write-ahead rollback journal (`INDEX_SWITCHING / SWAPPED_NOT_CHECKPOINTED / CHECKPOINTED / ABORTED`) before the primary switch. After `atomicSwapFiles(newFileId, oldFileIds, cleanupAt)`, the new file is online but the journal stays at `SWAPPED_NOT_CHECKPOINTED` until a durable recovery checkpoint baseline accepts it. Recovery decides per-task: keep + advance, roll back to the old-file baseline, or fail closed if rollback anchors are unavailable.

6. CDC ↔ Retina protocol
New RPCs on the Retina side:
- `GetRetinaStatus` returns the lifecycle state, `recoveryAttemptToken` (CSPRNG, in-memory only), `checkpointId`, `recoveryEpoch` (diagnostic), and `replayTsReady`;
- `GetRecoveryReplayTs(token, checkpointId, mode=VNODE|NODE)` returns the replay starting points;
- `MarkReady(token, checkpointId)` finishes the recovery cycle.

Recovery replay writes carry an optional `RecoveryReplayContext(recoveryAttemptToken, checkpointId, replayMode)`. In `RECOVERING`, Retina accepts only requests matching the current context with `timestamp >= replayFromTs` (per the chosen mode). Standalone INSERTs are protocol errors; CDC must always emit DELETE + INSERT.
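The admission rule above can be sketched as a pure function; the names are illustrative, not the actual gRPC handler:

```java
// Sketch of the RECOVERING-state write gate: accept only DELETE / CDC-replace
// requests carrying the current (token, checkpointId) context with
// timestamp >= replayFromTs; standalone INSERTs are protocol errors.
class ReplayWriteGateSketch {
    enum Decision { ACCEPT, REJECT }

    static Decision admit(String currentToken, long currentCheckpointId, long replayFromTs,
                          String reqToken, long reqCheckpointId, long reqTimestamp,
                          boolean isStandaloneInsert) {
        if (isStandaloneInsert) {
            return Decision.REJECT; // protocol error: CDC must emit DELETE + INSERT
        }
        if (!currentToken.equals(reqToken) || currentCheckpointId != reqCheckpointId) {
            return Decision.REJECT; // stale or foreign recovery attempt
        }
        // Inclusive replay semantics: the replay starting point itself is accepted.
        return reqTimestamp >= replayFromTs ? Decision.ACCEPT : Decision.REJECT;
    }
}
```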
MarkReadyperforms an internal barrier:recoveryAttemptTokenandcheckpointIdagainst the current attempt;RECOVERING;READY;CDC unilateral failures never push Retina into
RECOVERING; backlog catchup happens entirely underREADYwith the same DELETE + INSERT idempotent encoding.7. Query gate
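The fail-closed admission rule this section specifies can be sketched as follows; the names are hypothetical, and the real snapshot is maintained from an etcd watch:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the §7 query gate: read-only transactions are admitted only when
// every expected Retina node publishes exactly READY. A missing, malformed,
// RECOVERING or FAILED state (and an empty expected set, treated here as a
// misconfiguration) all fail closed.
class QueryGateSketch {
    /** expectedNodes comes from $PIXELS_HOME/etc/retina; lifecycle maps
     *  host:retinaPort -> last observed lifecycle value. */
    static boolean queriesAllowed(Set<String> expectedNodes, Map<String, String> lifecycle) {
        if (expectedNodes.isEmpty()) {
            return false; // fail closed on empty membership
        }
        for (String node : expectedNodes) {
            if (!"READY".equals(lifecycle.get(node))) {
                return false; // missing or non-READY node -> fail closed
            }
        }
        return true;
    }
}
```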
The query gate's authoritative boundary is `TransServiceImpl.beginTrans / beginTransBatch` for `readOnly = true`. With `retina.enable = true`:

- the expected Retina membership is loaded from the static configuration `$PIXELS_HOME/etc/retina`;
- a `QueryAvailabilitySnapshot` is built from the static expected membership and a watch on `/pixels/retina/lifecycle/<host:retinaPort>`;
- queries are admitted only when every expected node is `READY`. Anything else (`RECOVERING`, `FAILED`, missing, stale, malformed, or watch-not-yet-initialized) fails closed.

`getRegularFiles(pathId)` only expresses the catalog rule that query-visible files are `REGULAR`; it is not the lifecycle gate by itself. Planner / cache / Trino connector / Turbo planner must produce file lists, splits, cache fills, or scan inputs only under a successful gated read-only transaction once C05.5/C05.6 lands. Bypassing the gate to call metadata enumeration is then a recovery-correctness violation, not an ordinary protocol bug.

8. Failure scenarios covered
- CDC crashes alone: Retina stays `READY`; CDC replays under normal backlog catchup.
- Retina crashes or the machine restarts: Retina starts in `RECOVERING`, cleanses the checkpoint, exposes replay timestamps, and waits for CDC `MarkReady`.
- CDC and Retina crash simultaneously: CDC rediscovers the current recovery attempt via `GetRetinaStatus`.
- A crash during recovery restarts the attempt in `RECOVERING`; all cleanup steps are idempotent.
- A crash during checkpoint publish is covered by the atomic etcd swap, so a valid `current` is used.
- A `REGULAR` file not captured by any durable checkpoint is demoted from `REGULAR` and is redone by CDC replay.

9. Invariants and acceptance
The acceptance section in the design enumerates ~50 invariants. Highlights:
- Queries are never served in `RECOVERING` or `FAILED`.
- The `MarkReady` barrier is the only edge where queries are unblocked.
- `getRegularFiles(pathId)` only returns `REGULAR` after `READY`; non-`REGULAR` files are unreachable from query paths and are reachable only through `getFilesByType(pathId, types)` (forbidden on query / planner / cache / Storage GC candidate scan paths).
- The catalog `FILES` table does not persist ingest-path fields (`tableId`, `virtualNodeId`, `firstBlockId`, `fileMinCommitTs`, `fileMaxCommitTs`); the runtime source of truth is `IngestFileMetadataRegistry`, and the cross-process source of truth is the recovery checkpoint body's `FileEntry`.
- `IngestFileMetadataRegistry`–`RGVisibilityIndex` coupling invariant: any fileId present in `RGVisibilityIndex` must be present in the registry; the checkpoint generator fails closed on a lookup miss.
- `READY` does not provide combined atomic scan visibility for CDC replace; short windows where DELETE is applied but INSERT is not (or vice versa) are normal live freshness, not bugs.
- The query gate and the `MarkReady` barrier ship as a single closed loop; no production binary in between is allowed to expose `READY`.
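The registry/visibility coupling invariant above can be sketched as the check the checkpoint generator would run before dumping a body; the types here are hypothetical:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the coupling-invariant check: every fileId in RGVisibilityIndex
// must have an IngestFileMetadataRegistry entry. A non-empty result indicates
// a publisher / retire ordering bug, and the checkpoint generator fails closed
// instead of dumping a body.
class RegistryCouplingSketch {
    static Set<Long> missingRegistryEntries(Set<Long> visibilityFileIds,
                                            Map<Long, Object> registry) {
        Set<Long> missing = new HashSet<>();
        for (long fileId : visibilityFileIds) {
            if (!registry.containsKey(fileId)) {
                missing.add(fileId);
            }
        }
        return missing; // non-empty -> fail closed before checkpointing
    }
}
```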