|
| 1 | +# KEP-5966: etcd RangeStream |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Summary](#summary) |
| 5 | +- [Motivation](#motivation) |
| 6 | + - [Goals](#goals) |
| 7 | + - [Non-Goals](#non-goals) |
| 8 | +- [Proposal](#proposal) |
| 9 | + - [Notes/Constraints/Caveats](#notesconstraintscaveats) |
| 10 | +- [Design Details](#design-details) |
| 11 | + - [API Changes](#api-changes) |
| 12 | + - [Stream Message Layout](#stream-message-layout) |
| 13 | + - [Supported Options](#supported-options) |
| 14 | + - [Chunk Sizing](#chunk-sizing) |
| 15 | + - [Unsupported Pass Through](#unsupported-pass-through) |
| 16 | + - [Implementation Changes](#implementation-changes) |
| 17 | + - [Test Plan](#test-plan) |
| 18 | + - [Unit tests](#unit-tests) |
| 19 | + - [Integration tests](#integration-tests) |
| 20 | + - [Kubernetes API Server Integration](#kubernetes-api-server-integration) |
| 21 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 22 | +- [Implementation History](#implementation-history) |
| 23 | +<!-- /toc --> |
| 24 | + |
| 25 | +## Summary |
| 26 | + |
| 27 | +The unary Range RPC in etcd builds the entire response in memory before |
| 28 | +sending. For large result sets this causes server-side memory spikes (the KV |
| 29 | +slice, serialized protobuf, and gRPC send buffer all coexist) and redundant |
| 30 | +work when clients paginate (repeated Range calls with increasing keys |
| 31 | +recompute the total count on every page by walking the full B-tree index). |
| 32 | + |
| 33 | +This KEP proposes a new server-streaming `RangeStream` RPC that reuses |
| 34 | +`RangeRequest` and returns results in chunks. The server performs pagination |
| 35 | +internally with adaptive chunk sizing and pins to a single MVCC revision for |
| 36 | +consistency. The total count is derived from the running tally of streamed |
| 37 | +keys, eliminating the separate index traversal required by paginated Range. |
| 38 | + |
| 39 | +## Motivation |
| 40 | + |
| 41 | +The current unary Range RPC has two key problems at scale: |
| 42 | + |
| 43 | +1. **Server-side memory spikes** — the entire response (KV slice, serialized |
| 44 | + protobuf, gRPC send buffer) must coexist in memory before sending. |
| 45 | +2. **Redundant work with client-side pagination** — each paginated Range call |
| 46 | + recomputes the total count by walking the full B-tree index, turning |
| 47 | + per-page cost into O(total_keys) instead of O(limit). |
| 48 | + |
| 49 | +### Goals |
| 50 | + |
| 51 | +- Reduce server-side memory usage for large Range responses by streaming |
| 52 | + results in chunks instead of buffering the entire response. |
| 53 | +- Eliminate redundant count computation across paginated requests by deriving |
| 54 | + the total from the keys visited during streaming. |
| 55 | + |
| 56 | +### Non-Goals |
| 57 | + |
| 58 | +- Supporting custom sort orders in streaming mode. Clients that need |
| 59 | + non-default sort order should use the existing unary `Range` RPC. |
| 60 | + |
| 61 | +## Proposal |
| 62 | + |
| 63 | +Add a server-streaming `RangeStream` RPC to the etcd KV service that accepts |
| 64 | +the existing `RangeRequest` and returns a stream of `RangeStreamResponse` |
| 65 | +messages. The server handles pagination internally, pins to a single MVCC |
| 66 | +revision for snapshot consistency, and uses adaptive chunk sizing to |
| 67 | +auto-tune for different value sizes. The merged stream produces identical |
| 68 | +results to a single unary `Range()` call. |
| 69 | + |
| 70 | +If the pinned revision is compacted during streaming, the server closes the |
| 71 | +stream with `ErrCompacted`. Clients receive this error from `Recv()` and |
| 72 | +should retry the request. |
| 73 | + |
| 74 | +### Notes/Constraints/Caveats |
| 75 | + |
| 76 | +- Clients should not depend on the internal structure of the stream message |
| 77 | + layout (which chunks contain which fields). The contract is that |
| 78 | + `proto.Merge()` across all chunks produces a result identical to a single |
| 79 | + `Range()` call. Clients must merge chunks sequentially in stream order. |
| 80 | +- The server opens a new short-lived bbolt read transaction for each chunk |
| 81 | + rather than holding a single long-running transaction for the entire stream. |
| 82 | + Consistency is maintained by pinning the MVCC revision after the first chunk |
| 83 | + and reusing it for all subsequent chunks. If the pinned revision has been |
| 84 | + compacted by the time a later chunk is read, the server returns |
| 85 | + `ErrCompacted` and the client retries. This per-chunk transaction model |
| 86 | + avoids the bbolt caveat where long-running read transactions can block write |
| 87 | + transactions when the database needs to remap/allocate new pages. |
| 88 | + |
| 89 | +## Design Details |
| 90 | + |
| 91 | +### API Changes |
| 92 | + |
| 93 | +A new `RangeStream` RPC is added to the KV service, along with a new |
| 94 | +`RangeStreamResponse` wrapper message: |
| 95 | + |
| 96 | +```protobuf |
| 97 | +service KV { |
| 98 | + rpc RangeStream(RangeRequest) returns (stream RangeStreamResponse) {} |
| 99 | +} |
| 100 | +
|
| 101 | +message RangeStreamResponse { |
| 102 | + option (versionpb.etcd_version_msg) = "3.7"; |
| 103 | + RangeResponse range_response = 1; |
| 104 | +} |
| 105 | +``` |
| 106 | + |
| 107 | +`RangeStreamResponse` wraps `RangeResponse` so that `proto.Merge()` across |
| 108 | +all chunks produces the same result as a single `Range()` call. The wrapper |
| 109 | +also leaves room for future streaming-specific fields (e.g., progress, |
| 110 | +mid-stream errors). |
| 111 | + |
| 112 | +### Stream Message Layout |
| 113 | + |
| 114 | +| Message | Contents | |
| 115 | +|----------------------|-------------------------------------------------------| |
| 116 | +| Header | ClusterId, MemberId, RaftTerm (sent immediately from v3rpc layer) | |
| 117 | +| First chunk | Revision, Kvs | |
| 118 | +| Intermediate chunks | Kvs only | |
| 119 | +| Final chunk | Kvs, Count, More | |
| 120 | + |
| 121 | +Count is included in the final chunk because the server already visits |
| 122 | +every key during streaming, so the total is a free byproduct of the |
| 123 | +stream itself—no additional tree traversal is needed. Placing count on |
| 124 | +the first chunk would require an upfront O(total_keys) index walk before |
| 125 | +any data flows. Revision is only in the first data chunk. Clients |
| 126 | +reassemble by merging all messages. |
| 127 | + |
| 128 | +Clients should not depend on the structure of this layout—it should be |
| 129 | +treated as internal design. The defined contract is that the merged |
| 130 | +`RangeResponse` produces identical results as `proto.Merge`. |
| 131 | + |
| 132 | +### Supported Options |
| 133 | + |
| 134 | +`RangeStream` accepts the full `RangeRequest` message and supports all |
| 135 | +fields (e.g., `limit`, `keys_only`, `count_only`, `min_mod_revision`, |
| 136 | +`max_mod_revision`, `min_create_revision`, `max_create_revision`) |
| 137 | +except non-default sort order. Requests with non-default sort order |
| 138 | +require server-side post-processing that defeats streaming. The server |
| 139 | +returns `InvalidArgument` for these requests and clients should use |
| 140 | +the unary `Range` RPC instead. |
| 141 | + |
| 142 | +### Chunk Sizing |
| 143 | + |
| 144 | +A new `MaxBytes` field is added to `RangeOptions`. The streaming loop passes |
| 145 | +a byte budget (derived from `MaxRequestBytes`) on each MVCC range call. The |
| 146 | +value-read loop in `kvstore_txn.go` already iterates one-by-one via the |
| 147 | +backend cursor — it accumulates byte size and breaks early when the budget |
| 148 | +is exceeded. This lets MVCC determine the chunk size based on actual data |
| 149 | +size rather than requiring the caller to estimate a key count limit. |
| 150 | + |
| 151 | +### Unsupported Pass Through |
| 152 | + |
| 153 | +- **leasing/kv**: Falls back to unary `Range`. |
| 154 | +- **grpcproxy**: Falls back to unary `Range`. |
| 155 | + |
| 156 | +### Implementation Changes |
| 157 | + |
| 158 | +The following components are modified: |
| 159 | + |
| 160 | +- **Proto** (`api/etcdserverpb/rpc.proto`): New `RangeStream` RPC on the KV |
| 161 | + service. New `RangeStreamResponse` message wrapping `RangeResponse`. |
| 162 | +- **v3rpc** (`server/etcdserver/api/v3rpc/key.go`): `kvServer.RangeStream` — |
| 163 | + validates the request, sends the header message immediately, delegates to |
| 164 | + `EtcdServer.RangeStream`. |
| 165 | +- **EtcdServer** (`server/etcdserver/v3_server.go`): `RangeStream` — same auth |
| 166 | + and linearizability path as unary Range. `rangeStream` — the chunking loop: |
| 167 | + adaptive sizing, revision pinning, cursor advancement via |
| 168 | + `append(lastKey, '\x00')`. |
| 169 | +- **MVCC** (`server/storage/mvcc/`): `treeIndex.Revisions()` skips count |
| 170 | + computation for RangeStream calls, enabling early exit at the limit |
| 171 | + (`index.go`). The total count is derived from the running tally of |
| 172 | + streamed keys and emitted on the final chunk at no extra cost. |
| 173 | +- **Client** (`client/v3/kv.go`): `RangeStreamToRangeResponse` — reassembles |
| 174 | + a stream into a single `RangeResponse` so callers can transparently switch |
| 175 | + between unary and streaming. |
| 176 | +- **gRPC Proxy** (`server/proxy/grpcproxy/`): `kvs2kvc.RangeStream` adapter |
| 177 | + using channel-based `pipeStream` to bridge server/client stream interfaces. |
| 178 | + |
| 179 | +### Test Plan |
| 180 | + |
| 181 | +##### Unit tests |
| 182 | + |
| 183 | +- MVCC microbenchmarks (`server/storage/mvcc/kvstore_range_bench_test.go`) — |
| 184 | + `BenchmarkRangeUnary` vs `BenchmarkRangeStream` |
| 185 | + |
| 186 | +##### Integration tests |
| 187 | + |
| 188 | +- Integration tests (`tests/integration/v3_grpc_test.go`) — every existing |
| 189 | + Range test case also calls `RangeStream` and diffs the reassembled response |
| 190 | + against the unary result. |
| 191 | +- Transparent transform from `Get` to use `RangeStream` to add subtests |
| 192 | + for all existing Get tests (`tests/integration/cache_test.go`, |
| 193 | + `tests/common/kv_test.go`). |
| 194 | +- Robustness tests (Note: `tests/robustness/coverage` in the etcd repository |
| 195 | + will need updating once Kubernetes actually starts making calls, as monitored |
| 196 | + in [ci-etcd-k8s-coverage-amd64](https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-k8s-coverage-amd64)). |
| 197 | + |
| 198 | +### Kubernetes API Server Integration |
| 199 | + |
| 200 | +RangeStream is gated behind a `RangeStream` feature gate in kube-apiserver |
| 201 | +(Alpha in 1.37, default disabled). |
| 202 | + |
| 203 | +A new `ListStream` method is added to the etcd `kubernetes.Interface` |
| 204 | +as a thin wrapper around the etcd client's `KV.GetStream()`. This |
| 205 | +returns a channel of chunks so callers receive key-value pairs as they |
| 206 | +arrive from the server, keeping the `kubernetes.Interface` abstraction |
| 207 | +consistent rather than reaching into `client.KV` directly. |
| 208 | + |
| 209 | +The primary integration point is the watch cache initialization path. |
| 210 | +When the feature gate is enabled, the watch cache `sync()` uses |
| 211 | +`ListStream` to receive chunks incrementally. Each chunk's key-value |
| 212 | +pairs are converted to synthetic "created" events and queued inline |
| 213 | +without assembling the full list response in memory. |
| 214 | + |
| 215 | +For direct `GetList` calls (e.g., from controllers or when WatchList is |
| 216 | +disabled), the store consumes `ListStream` and decodes each chunk's |
| 217 | +key-value pairs inline as they arrive, overlapping network I/O with |
| 218 | +decode. When the server returns `Unimplemented` or the feature gate is |
| 219 | +disabled, the store falls back to paginated `List` with a conservative |
| 220 | +limit. The `storage.Interface` is unchanged. |
| 221 | + |
| 222 | +### Upgrade / Downgrade Strategy |
| 223 | + |
| 224 | +RangeStream is a new server-side RPC. Older clients that do not call |
| 225 | +`RangeStream` are completely unaffected. On downgrade to an etcd version |
| 226 | +without `RangeStream`, clients calling the RPC will receive an |
| 227 | +`Unimplemented` gRPC error and should fall back to unary `Range`. |
| 228 | + |
| 229 | +## Implementation History |
| 230 | + |
| 231 | +- 2026-03-18: KEP created |
| 232 | + |
0 commit comments