kv: make disk reads asynchronous with respect to Raft state machine #105850

Description

@nvb

This issue is the "disk read" counterpart to #17500, which was addressed by etcd-io/raft#8 and #94165. To contextualize this issue, it may be helpful to get re-familiarized with those, optionally with this presentation.

The raft state machine loop (handleRaftReady) is responsible for writing raft entries to the durable raft log, applying committed log entries to the state machine, and sending messages to peers. This event loop is the heart of the raft protocol and each raft write traverses it multiple times between proposal time and ack time. It is therefore important to keep the latency of this loop down, so that a slow iteration does not block writes in the pipeline and create cross-write interference.

To that end, #94165 made raft log writes non-blocking in this loop, so that slow log writes (which must fsync) do not block other raft proposals.

Another case where the event loop may synchronously touch disk is when constructing the list of committed entries to apply. In the common case, this pulls from the raft entry cache, so it is fast. However, on raft entry cache misses, this reads from pebble. Reads from pebble can be slow (relative to a cache hit), which can slow down the event loop because they are performed inline. The effect of this can be seen directly on raft scheduling tail latency.
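The blocking behavior described above can be illustrated with a minimal sketch. Everything here is hypothetical and simplified: `Entry`, `syncLog`, and `newSyncLog` are stand-ins invented for illustration, the maps stand in for the raft entry cache and pebble, and the sleep simulates disk latency. It is not the actual CockroachDB code path.

```go
package main

import (
	"fmt"
	"time"
)

// Entry is a simplified stand-in for a raft log entry (hypothetical).
type Entry struct {
	Index uint64
	Data  string
}

// syncLog models a cache in front of a slower durable log (hypothetical).
type syncLog struct {
	cache    map[uint64]Entry // stand-in for the raft entry cache
	disk     map[uint64]Entry // stand-in for pebble
	diskCost time.Duration    // simulated latency of one disk read
	misses   int              // number of cache misses served from disk
}

func newSyncLog(cached, durable []Entry, diskCost time.Duration) *syncLog {
	l := &syncLog{cache: map[uint64]Entry{}, disk: map[uint64]Entry{}, diskCost: diskCost}
	for _, e := range cached {
		l.cache[e.Index] = e
	}
	for _, e := range durable {
		l.disk[e.Index] = e
	}
	return l
}

// Entries returns the entries in [lo, hi). Cache misses are read from
// "disk" inline, so the caller (the event loop) waits for each one.
func (l *syncLog) Entries(lo, hi uint64) []Entry {
	out := make([]Entry, 0, hi-lo)
	for i := lo; i < hi; i++ {
		if e, ok := l.cache[i]; ok {
			out = append(out, e)
			continue
		}
		l.misses++
		time.Sleep(l.diskCost) // blocking read on the event loop's goroutine
		out = append(out, l.disk[i])
	}
	return out
}

func main() {
	durable := []Entry{{1, "a"}, {2, "b"}, {3, "c"}}
	// Only entry 3 is cached, so entries 1 and 2 are cache misses.
	l := newSyncLog(durable[2:], durable, 5*time.Millisecond)
	start := time.Now()
	ents := l.Entries(1, 4)
	fmt.Printf("read %d entries, %d misses, loop blocked for %v\n",
		len(ents), l.misses, time.Since(start))
}
```

In this toy model, every miss adds its full read latency to the loop iteration, which is the effect that shows up in raft scheduling tail latency.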

Example graphs

Entry cache hit rates

[Screenshot: raft entry cache accesses and hits per node]
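| Node | Accesses | Hits   | Hit Rate |
|------|----------|--------|----------|
| n1   | 314468   | 308334 | 98.1%    |
| n2   | 276748   | 260645 | 94.2%    |
| n3   | 271915   | 255306 | 93.9%    |
| n4   | 325052   | 320766 | 98.7%    |
| n5   | 326403   | 321934 | 98.6%    |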

Raft scheduler latencies

[Screenshot: raft scheduler latencies across nodes]

High raft entry cache hit rate (n4)

[Screenshot: raft scheduler latency on n4]

Low raft entry cache hit rate (n3)

[Screenshot: raft scheduler latency on n3]

An alternative design would be to make these disk reads asynchronous on raft entry cache misses. Instead of blocking on the log iteration, raft.Storage.Entries could support returning a new ErrEntriesTemporarilyUnavailable error, which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or from some other data structure that has no risk of eviction before the read is retried).

This would drive down tail latency for raft writes in cases where the raft entry cache has a less than perfect hit rate.

Jira issue: CRDB-29234

Labels

A-kv-replication: Relating to Raft, consensus, and coordination.
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
T-kv: KV Team