kv: make disk reads asynchronous with respect to Raft state machine

This issue is the "disk read" counterpart to https://github.com/cockroachdb/cockroach/issues/17500, which was addressed by https://github.com/etcd-io/raft/pull/8 and https://github.com/cockroachdb/cockroach/pull/94165. To contextualize this issue, it may be helpful to get re-familiarized with those, optionally with [this presentation](https://docs.google.com/presentation/d/1owtj5S38Qky8yWz91sfs7x0pkbrLYTwhSlSJ3QMvADw/edit?usp=sharing).

The raft state machine loop (`handleRaftReady`) is responsible for writing raft entries to the durable raft log, applying committed log entries to the state machine, and sending messages to peers. This event loop is the heart of the raft protocol and each raft write traverses it multiple times between proposal time and ack time. It is therefore important to keep the latency of this loop down, so that a slow iteration does not block writes in the pipeline and create cross-write interference.

To that end, https://github.com/cockroachdb/cockroach/pull/94165 made raft log writes non-blocking in this loop, so that slow log writes (which must fsync) do not block other raft proposals.

Another case where the event loop may synchronously touch disk is when constructing the list of committed entries to apply. In the common case, this [pulls from the raft entry cache](https://github.com/cockroachdb/cockroach/blob/9b1753366e0307c83eb59a6351a844467b6cf9d3/pkg/kv/kvserver/logstore/logstore.go#L522), so it is fast. However, on raft entry cache misses, this [reads from pebble](https://github.com/cockroachdb/cockroach/blob/9b1753366e0307c83eb59a6351a844467b6cf9d3/pkg/kv/kvserver/logstore/logstore.go#L573). Reads from pebble can be slow (relative to a cache hit), which can slow down the event loop because they are performed inline.  The effect of this can be seen directly on raft scheduling tail latency.

<details>
  <summary>Example graphs</summary>

### Entry cache hit rates

<img width="985" alt="Screenshot 2023-06-29 at 2 09 16 AM" src="https://github.com/cockroachdb/cockroach/assets/5438456/8fcaf306-5bb8-46b6-ba9e-3eacbe7edf82">

|       | Accesses | Hits   | Hit Rate |
| ----- | -------- | ------ | -------- |
| n1    | 314468   | 308334 | 98.1%    |
| n2    | 276748   | 260645 | 94.2%    |
| n3    | 271915   | 255306 | 93.9%    |
| n4    | 325052   | 320766 | 98.7%    |
| n5    | 326403   | 321934 | 98.6%    |

### Raft scheduler latencies

<img width="1281" alt="Screenshot 2023-06-29 at 2 13 39 AM" src="https://github.com/cockroachdb/cockroach/assets/5438456/91036033-b3f0-4501-9540-ed1083aae858">
  
### High raft entry cache hit rate (n4)

<img width="1777" alt="Screenshot 2023-06-29 at 2 44 13 AM" src="https://github.com/cockroachdb/cockroach/assets/5438456/aee63acb-505e-4526-a3dd-added5e7432c">

### Low raft entry cache hit rate (n3)

<img width="1759" alt="Screenshot 2023-06-29 at 2 04 40 AM" src="https://github.com/cockroachdb/cockroach/assets/5438456/7cfe7852-9ab5-4195-8930-e85f4e03390d">

</details>

An alternate design would be to make these disk reads async on raft entry cache misses. Instead of blocking on the log iteration, `raft.Storage.Entries` could support returning a new `ErrEntriesTemporarilyUnavailable` error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried).
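To make the proposal concrete, here is a minimal, self-contained sketch of the retry flow. It is not the etcd/raft or CockroachDB implementation: `asyncLog`, `Entry`, `span`, the `staged` map and `runDemo` are hypothetical names invented for illustration, in-memory maps stand in for the raft entry cache and for pebble, and `ErrEntriesTemporarilyUnavailable` is the error proposed above, which does not exist in etcd/raft today.

```go
package main

import (
	"errors"
	"sync"
)

// ErrEntriesTemporarilyUnavailable is the error proposed in this issue.
// It is declared locally because it is not part of etcd/raft.
var ErrEntriesTemporarilyUnavailable = errors.New("entries temporarily unavailable")

// Entry is a simplified stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  []byte
}

// span identifies a half-open index range [lo, hi).
type span struct{ lo, hi uint64 }

// asyncLog is a hypothetical storage wrapper whose Entries method never
// blocks on disk.
type asyncLog struct {
	mu      sync.Mutex
	cache   map[uint64]Entry // stand-in for the (evictable) raft entry cache
	staged  map[uint64]Entry // completed async reads, kept until consumed
	pending map[span]bool    // reads currently in flight
	disk    map[uint64]Entry // stand-in for pebble; read-only in this sketch
	notify  func()           // wakes the event loop when a read completes
}

func newAsyncLog(disk map[uint64]Entry, notify func()) *asyncLog {
	return &asyncLog{
		cache:   map[uint64]Entry{},
		staged:  map[uint64]Entry{},
		pending: map[span]bool{},
		disk:    disk,
		notify:  notify,
	}
}

// Entries returns entries in [lo, hi) if they are all in memory. On a miss
// it schedules a background read and asks the caller to retry later.
func (l *asyncLog) Entries(lo, hi uint64) ([]Entry, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	var ents []Entry
	for i := lo; i < hi; i++ {
		if e, ok := l.cache[i]; ok {
			ents = append(ents, e)
		} else if e, ok := l.staged[i]; ok {
			ents = append(ents, e)
		} else {
			l.startReadLocked(span{lo, hi})
			return nil, ErrEntriesTemporarilyUnavailable
		}
	}
	// The retry succeeded, so the staged copies can be released.
	for i := lo; i < hi; i++ {
		delete(l.staged, i)
	}
	return ents, nil
}

// startReadLocked starts at most one background read per range.
// The sketch assumes every requested index exists on disk.
func (l *asyncLog) startReadLocked(r span) {
	if l.pending[r] {
		return
	}
	l.pending[r] = true
	go func() {
		var read []Entry
		for i := r.lo; i < r.hi; i++ { // the slow part, off the event loop
			if e, ok := l.disk[i]; ok {
				read = append(read, e)
			}
		}
		l.mu.Lock()
		for _, e := range read {
			l.staged[e.Index] = e
		}
		delete(l.pending, r)
		l.mu.Unlock()
		l.notify()
	}()
}

// runDemo plays the role of the event loop: the first call misses, the loop
// waits for the notification, and the retry is served from memory.
func runDemo() (ents []Entry, firstErr, secondErr error) {
	ready := make(chan struct{}, 1)
	disk := map[uint64]Entry{1: {Index: 1}, 2: {Index: 2}}
	l := newAsyncLog(disk, func() {
		select {
		case ready <- struct{}{}:
		default:
		}
	})
	_, firstErr = l.Entries(1, 3)
	<-ready // a real loop would keep processing other work here
	ents, secondErr = l.Entries(1, 3)
	return ents, firstErr, secondErr
}
```

Calling `runDemo()` from a `main` function exercises the miss-then-retry path. The `staged` map plays the role of the "other data structure" mentioned above: entries placed there are not subject to cache eviction, so the retry cannot miss again for the same range.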

This would drive down tail latency for raft writes in cases where the raft entry cache has a less than perfect hit rate.

Jira issue: CRDB-29234
