Background
The unstable log structure in pkg/raft holds log entries until they have been written to storage and fsync-ed.
After the introduction of async log writes, the flow of entries from memory to Storage is:
1. Entries are appended to unstable.
2. On handleRaftReady, the unstable entries are extracted and paired with a MsgStorageAppend message.
3. The batch of entries is written to Pebble.
   - If async log writes are enabled and the batch qualifies for an async write, the batch is written to Pebble but not synced.
   - If the write doesn't qualify for an async write, the entries are written and synced.
4. When the entries have been synced, we/Pebble invoke a callback which sends a MsgStorageAppendResp response back to the raft instance.
5. When handling the append response, raft removes the entries from unstable [1].
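The flow above can be illustrated with a toy model. This is only a sketch: unstableLog, storage, and all of their methods are hypothetical stand-ins invented for illustration, not the actual pkg/raft or Pebble APIs, and it ignores terms, log truncation, and leadership changes.

```go
package main

import "fmt"

// Entry is a toy stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  string
}

// unstableLog is a hypothetical, simplified model of the unstable structure.
type unstableLog struct {
	entries []Entry
}

// appendEntries models step 1: new entries land in unstable.
func (u *unstableLog) appendEntries(es ...Entry) {
	u.entries = append(u.entries, es...)
}

// ready models step 2: the entries are handed out for writing,
// but nothing is removed from unstable yet.
func (u *unstableLog) ready() []Entry {
	return append([]Entry(nil), u.entries...)
}

// stableTo models step 5: drop entries with Index <= index.
func (u *unstableLog) stableTo(index uint64) {
	i := 0
	for i < len(u.entries) && u.entries[i].Index <= index {
		i++
	}
	u.entries = u.entries[i:]
}

// storage is a hypothetical model of the log Storage. Written entries are
// readable immediately; synced tracks the highest durable index.
type storage struct {
	written []Entry
	synced  uint64
}

// write models step 3 for an async write: written, not yet synced.
func (s *storage) write(es []Entry) {
	s.written = append(s.written, es...)
}

// sync models step 4: everything written so far becomes durable.
// It returns the highest durable index.
func (s *storage) sync() uint64 {
	if n := len(s.written); n > 0 {
		s.synced = s.written[n-1].Index
	}
	return s.synced
}

func main() {
	u, s := &unstableLog{}, &storage{}
	u.appendEntries(Entry{1, "a"}, Entry{2, "b"}, Entry{3, "c"}) // step 1
	batch := u.ready()                                             // step 2
	s.write(batch)                                                 // step 3, not synced
	fmt.Println("after write: unstable =", len(u.entries), "storage =", len(s.written))
	u.stableTo(s.sync()) // steps 4 and 5
	fmt.Println("after sync ack: unstable =", len(u.entries), "storage =", len(s.written))
}
```

In this model, the entries exist both in unstable and in storage from the write until the sync acknowledgement arrives.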
Improvement
In this flow, there is a period of time (between steps 3 and 5) when an entry has already been written to Pebble and sits in memtables, but still resides in the unstable struct. When async writes are enabled, this period can span multiple Ready iterations. Holding these entries in unstable is not strictly necessary, because they are already readable from the log Storage. We should clear them in step (3). This effectively turns step (3) into a "transfer" of entries from unstable to Storage.
In Replication AC, entry tokens are also admitted and returned to the leader in step (3). Clearing the unstable entries at this point effectively includes them in the replication token "lifetime", and protects the node from OOMs caused by unstable build-ups.
The modification will be along the lines of adding a new method or message that tells raft that some or all entries in unstable have been written (non-durably), so that raft can clear them. The interaction with the async writes protocol may introduce some complications.
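One possible shape for such a signal is sketched below. The names logState, ackWritten, and ackSynced are hypothetical and not part of pkg/raft. The sketch assumes that memory release and durability can be tracked by two separate indices, and it ignores the complications of entries being overwritten after a leadership change.

```go
package main

import "fmt"

// Entry is a toy stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  string
}

// logState is a hypothetical model that tracks memory and durability separately.
type logState struct {
	unstable []Entry // entries not yet written to storage
	written  uint64  // highest index written to storage, possibly not synced
	durable  uint64  // highest index known to be synced
}

// appendEntries adds new entries to unstable.
func (l *logState) appendEntries(es ...Entry) {
	l.unstable = append(l.unstable, es...)
}

// ackWritten is the hypothetical new signal: entries up to index have been
// written non-durably, so they can be dropped from memory. It does not
// advance the durable index.
func (l *logState) ackWritten(index uint64) {
	i := 0
	for i < len(l.unstable) && l.unstable[i].Index <= index {
		i++
	}
	l.unstable = l.unstable[i:]
	if index > l.written {
		l.written = index
	}
}

// ackSynced models the existing sync acknowledgement: it only advances
// the durable index.
func (l *logState) ackSynced(index uint64) {
	if index > l.durable {
		l.durable = index
	}
}

func main() {
	l := &logState{}
	l.appendEntries(Entry{1, "a"}, Entry{2, "b"}, Entry{3, "c"})
	l.ackWritten(2) // entries 1-2 are in storage, not yet synced
	fmt.Println("unstable =", len(l.unstable), "written =", l.written, "durable =", l.durable)
	l.ackSynced(2) // the sync callback fires later
	fmt.Println("unstable =", len(l.unstable), "written =", l.written, "durable =", l.durable)
}
```

The point of keeping two indices is that memory is released as soon as the write lands, while anything that depends on durability keeps waiting for the sync acknowledgement.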
Alternatively, we can fully adopt the "transfer" semantics and remove entries from unstable as soon as Ready returns them. We would still need to deliver "acks" to raft when the entries are synced.
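A minimal sketch of this alternative follows. The names raftLog, takeReady, and ackSynced are hypothetical. The sketch assumes the caller takes ownership of the returned entries and remains responsible for writing them, and it does not model failed writes or overwritten entries.

```go
package main

import "fmt"

// Entry is a toy stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  string
}

// raftLog is a hypothetical model of the "transfer" semantics.
type raftLog struct {
	unstable []Entry
	inflight uint64 // highest index handed off through takeReady
	durable  uint64 // highest index acknowledged as synced
}

// appendEntries adds new entries to unstable.
func (l *raftLog) appendEntries(es ...Entry) {
	l.unstable = append(l.unstable, es...)
}

// takeReady transfers ownership of all unstable entries to the caller
// and clears them from memory immediately.
func (l *raftLog) takeReady() []Entry {
	out := l.unstable
	l.unstable = nil
	if n := len(out); n > 0 {
		l.inflight = out[n-1].Index
	}
	return out
}

// ackSynced is still required: raft learns about durability only from acks.
func (l *raftLog) ackSynced(index uint64) {
	if index > l.durable {
		l.durable = index
	}
}

func main() {
	l := &raftLog{}
	l.appendEntries(Entry{1, "a"}, Entry{2, "b"})
	batch := l.takeReady()
	fmt.Println("batch =", len(batch), "unstable =", len(l.unstable), "durable =", l.durable)
	l.ackSynced(batch[len(batch)-1].Index)
	fmt.Println("inflight =", l.inflight, "durable =", l.durable)
}
```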
Jira issue: CRDB-37890
Epic CRDB-37515
Footnotes
[1] Some entries may have already been cleared from unstable by this time, e.g. if the leadership changed and the new leader has overwritten some entries. We only remove the entries that are guaranteed to be matched by storage and that have no in-flight appends overwriting them. See this comment for some details.