Commit d17ccb7: Merge pull request #5967 from Jefftree/etcd-rangestream ([KEP-5966] etcd RangeStream)

# KEP-5966: etcd RangeStream

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
- [Design Details](#design-details)
  - [API Changes](#api-changes)
  - [Stream Message Layout](#stream-message-layout)
  - [Supported Options](#supported-options)
  - [Chunk Sizing](#chunk-sizing)
  - [Unsupported Pass Through](#unsupported-pass-through)
  - [Implementation Changes](#implementation-changes)
  - [Test Plan](#test-plan)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Kubernetes API Server Integration](#kubernetes-api-server-integration)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

The unary Range RPC in etcd builds the entire response in memory before
sending. For large result sets this causes server-side memory spikes (the KV
slice, serialized protobuf, and gRPC send buffer all coexist) and redundant
work when clients paginate (repeated Range calls with increasing keys
recompute the total count on every page by walking the full B-tree index).

This KEP proposes a new server-streaming `RangeStream` RPC that reuses
`RangeRequest` and returns results in chunks. The server performs pagination
internally with adaptive chunk sizing and pins to a single MVCC revision for
consistency. The total count is derived from the running tally of streamed
keys, eliminating the separate index traversal required by paginated Range.

## Motivation

The current unary Range RPC has two key problems at scale:

1. **Server-side memory spikes** — the entire response (KV slice, serialized
   protobuf, gRPC send buffer) must coexist in memory before sending.
2. **Redundant work with client-side pagination** — each paginated Range call
   recomputes the total count by walking the full B-tree index, turning
   per-page cost into O(total_keys) instead of O(limit).

### Goals

- Reduce server-side memory usage for large Range responses by streaming
  results in chunks instead of buffering the entire response.
- Eliminate redundant count computation across paginated requests by deriving
  the total from the keys visited during streaming.

### Non-Goals

- Supporting custom sort orders in streaming mode. Clients that need
  non-default sort order should use the existing unary `Range` RPC.

## Proposal

Add a server-streaming `RangeStream` RPC to the etcd KV service that accepts
the existing `RangeRequest` and returns a stream of `RangeStreamResponse`
messages. The server handles pagination internally, pins to a single MVCC
revision for snapshot consistency, and uses adaptive chunk sizing to
auto-tune for different value sizes. The merged stream produces identical
results to a single unary `Range()` call.

If the pinned revision is compacted during streaming, the server closes the
stream with `ErrCompacted`. Clients receive this error from `Recv()` and
should retry the request.

### Notes/Constraints/Caveats

- Clients should not depend on the internal structure of the stream message
  layout (which chunks contain which fields). The contract is that
  `proto.Merge()` across all chunks produces a result identical to a single
  `Range()` call. Clients must merge chunks sequentially in stream order.
- The server opens a new short-lived bbolt read transaction for each chunk
  rather than holding a single long-running transaction for the entire stream.
  Consistency is maintained by pinning the MVCC revision after the first chunk
  and reusing it for all subsequent chunks. If the pinned revision has been
  compacted by the time a later chunk is read, the server returns
  `ErrCompacted` and the client retries. This per-chunk transaction model
  avoids the bbolt caveat where long-running read transactions can block write
  transactions when the database needs to remap/allocate new pages.

## Design Details
### API Changes

A new `RangeStream` RPC is added to the KV service, along with a new
`RangeStreamResponse` wrapper message:

```protobuf
service KV {
  rpc RangeStream(RangeRequest) returns (stream RangeStreamResponse) {}
}

message RangeStreamResponse {
  option (versionpb.etcd_version_msg) = "3.7";
  RangeResponse range_response = 1;
}
```

`RangeStreamResponse` wraps `RangeResponse` so that `proto.Merge()` across
all chunks produces the same result as a single `Range()` call. The wrapper
also leaves room for future streaming-specific fields (e.g., progress,
mid-stream errors).

### Stream Message Layout

| Message             | Contents                                                          |
|---------------------|-------------------------------------------------------------------|
| Header              | ClusterId, MemberId, RaftTerm (sent immediately from v3rpc layer) |
| First chunk         | Revision, Kvs                                                     |
| Intermediate chunks | Kvs only                                                          |
| Final chunk         | Kvs, Count, More                                                  |

Count is included in the final chunk because the server already visits
every key during streaming, so the total is a free byproduct of the
stream itself—no additional tree traversal is needed. Placing count on
the first chunk would require an upfront O(total_keys) index walk before
any data flows. Revision is only in the first data chunk. Clients
reassemble by merging all messages.

Clients should not depend on the structure of this layout—it should be
treated as internal design. The defined contract is that merging all
chunks with `proto.Merge` yields a `RangeResponse` identical to the
unary `Range()` result.

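The merge contract can be simulated with a plain struct standing in for `RangeResponse`; a real client would call `proto.Merge` on the generated protobuf message, but the semantics shown here (repeated fields concatenated, scalars overwritten when set) are the same:

```go
package main

import "fmt"

// rangeResponse is a plain stand-in for etcd's RangeResponse message.
type rangeResponse struct {
	Revision int64
	Kvs      []string
	Count    int64
	More     bool
}

// merge mimics proto.Merge for this struct: the repeated Kvs field is
// concatenated, scalar fields are overwritten when the chunk sets them.
func merge(dst *rangeResponse, chunk rangeResponse) {
	dst.Kvs = append(dst.Kvs, chunk.Kvs...)
	if chunk.Revision != 0 {
		dst.Revision = chunk.Revision
	}
	if chunk.Count != 0 {
		dst.Count = chunk.Count
	}
	if chunk.More {
		dst.More = chunk.More
	}
}

func main() {
	// Chunks arrive in stream order: first carries Revision, the
	// final one carries Count, and all carry Kvs.
	chunks := []rangeResponse{
		{Revision: 42, Kvs: []string{"a", "b"}}, // first chunk
		{Kvs: []string{"c"}},                    // intermediate chunk
		{Kvs: []string{"d"}, Count: 4},          // final chunk
	}
	var out rangeResponse
	for _, c := range chunks {
		merge(&out, c)
	}
	fmt.Printf("%+v\n", out)
}
```

Merging in stream order matters: it keeps `Kvs` in the server's sorted key order without any client-side re-sorting.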
### Supported Options

`RangeStream` accepts the full `RangeRequest` message and supports all
fields (e.g., `limit`, `keys_only`, `count_only`, `min_mod_revision`,
`max_mod_revision`, `min_create_revision`, `max_create_revision`)
except non-default sort order. Requests with non-default sort order
require server-side post-processing that defeats streaming. The server
returns `InvalidArgument` for these requests and clients should use
the unary `Range` RPC instead.

### Chunk Sizing

A new `MaxBytes` field is added to `RangeOptions`. The streaming loop passes
a byte budget (derived from `MaxRequestBytes`) on each MVCC range call. The
value-read loop in `kvstore_txn.go` already iterates one-by-one via the
backend cursor — it accumulates byte size and breaks early when the budget
is exceeded. This lets MVCC determine the chunk size based on actual data
size rather than requiring the caller to estimate a key count limit.

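A sketch of this byte-budget cut, using plain structs and illustrative sizes in place of the real MVCC read loop:

```go
package main

import "fmt"

type kv struct {
	key, value string
}

// rangeChunk models the budget check in the value-read loop: it walks
// keys in order, accumulates response size, and cuts the chunk once the
// budget is exceeded. It always returns at least one key past start so
// the stream makes progress even when a single value exceeds MaxBytes.
// Names and size accounting are illustrative, not etcd's.
func rangeChunk(kvs []kv, start int, maxBytes int) (end int) {
	bytes := 0
	for i := start; i < len(kvs); i++ {
		bytes += len(kvs[i].key) + len(kvs[i].value)
		if bytes > maxBytes && i > start {
			return i // budget exceeded: cut the chunk before this key
		}
	}
	return len(kvs)
}

func main() {
	data := []kv{{"a", "xxxx"}, {"b", "xxxx"}, {"c", "xxxx"}, {"d", "xx"}}
	for start := 0; start < len(data); {
		end := rangeChunk(data, start, 10)
		fmt.Println("chunk:", data[start:end])
		start = end
	}
}
```

Driving the cut by accumulated bytes rather than a key count is what makes the sizing adaptive: small values yield large chunks, large values yield small ones, with no estimation by the caller.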
### Unsupported Pass Through

- **leasing/kv**: Falls back to unary `Range`.
- **grpcproxy**: Falls back to unary `Range`.

### Implementation Changes
157+
158+
The following components are modified:
159+
160+
- **Proto** (`api/etcdserverpb/rpc.proto`): New `RangeStream` RPC on the KV
161+
service. New `RangeStreamResponse` message wrapping `RangeResponse`.
162+
- **v3rpc** (`server/etcdserver/api/v3rpc/key.go`): `kvServer.RangeStream`
163+
validates the request, sends the header message immediately, delegates to
164+
`EtcdServer.RangeStream`.
165+
- **EtcdServer** (`server/etcdserver/v3_server.go`): `RangeStream` — same auth
166+
and linearizability path as unary Range. `rangeStream` — the chunking loop:
167+
adaptive sizing, revision pinning, cursor advancement via
168+
`append(lastKey, '\x00')`.
169+
- **MVCC** (`server/storage/mvcc/`): `treeIndex.Revisions()` skips count
170+
computation for RangeStream calls, enabling early exit at the limit
171+
(`index.go`). The total count is derived from the running tally of
172+
streamed keys and emitted on the final chunk at no extra cost.
173+
- **Client** (`client/v3/kv.go`): `RangeStreamToRangeResponse` — reassembles
174+
a stream into a single `RangeResponse` so callers can transparently switch
175+
between unary and streaming.
176+
- **gRPC Proxy** (`server/proxy/grpcproxy/`): `kvs2kvc.RangeStream` adapter
177+
using channel-based `pipeStream` to bridge server/client stream interfaces.
178+
179+
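The cursor advancement and per-chunk reads described above can be sketched as follows; the `chunk` helper is a stand-in for a per-chunk MVCC read at the pinned revision, not etcd's actual code:

```go
package main

import "fmt"

// nextChunkStart advances the cursor past lastKey. Appending a zero
// byte produces the smallest byte string strictly greater than lastKey
// in lexicographic order, so the next chunk neither skips nor repeats
// a key.
func nextChunkStart(lastKey []byte) []byte {
	next := make([]byte, len(lastKey)+1)
	copy(next, lastKey)
	next[len(lastKey)] = 0x00
	return next
}

// chunk stands in for one per-chunk read at the pinned revision: it
// returns up to n keys at or after start from a sorted key index.
func chunk(keys []string, start string, n int) []string {
	var out []string
	for _, k := range keys {
		if k >= start {
			out = append(out, k)
			if len(out) == n {
				break
			}
		}
	}
	return out
}

func main() {
	keys := []string{"a", "ab", "b", "c"} // sorted key index
	start := "a"
	for {
		got := chunk(keys, start, 2)
		if len(got) == 0 {
			break // range exhausted: emit final chunk with Count/More
		}
		fmt.Println("chunk:", got)
		start = string(nextChunkStart([]byte(got[len(got)-1])))
	}
}
```

The `'\x00'` trick works because byte strings compare lexicographically: `"ab" < "ab\x00" < "abX"` for any non-empty suffix `X`, so the appended zero byte is the tightest possible exclusive-to-inclusive boundary.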
### Test Plan

##### Unit tests

- MVCC microbenchmarks (`server/storage/mvcc/kvstore_range_bench_test.go`) —
  `BenchmarkRangeUnary` vs `BenchmarkRangeStream`

##### Integration tests

- Integration tests (`tests/integration/v3_grpc_test.go`) — every existing
  Range test case also calls `RangeStream` and diffs the reassembled response
  against the unary result.
- Transparently rewrite `Get` to use `RangeStream`, adding subtests for all
  existing Get tests (`tests/integration/cache_test.go`,
  `tests/common/kv_test.go`).
- Robustness tests (Note: `tests/robustness/coverage` in the etcd repository
  will need updating once Kubernetes actually starts making calls, as monitored
  in [ci-etcd-k8s-coverage-amd64](https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-k8s-coverage-amd64)).

### Kubernetes API Server Integration

RangeStream is gated behind a `RangeStream` feature gate in kube-apiserver
(Alpha in 1.37, default disabled).

A new `ListStream` method is added to the etcd `kubernetes.Interface`
as a thin wrapper around the etcd client's `KV.GetStream()`. This
returns a channel of chunks so callers receive key-value pairs as they
arrive from the server, keeping the `kubernetes.Interface` abstraction
consistent rather than reaching into `client.KV` directly.

The primary integration point is the watch cache initialization path.
When the feature gate is enabled, the watch cache `sync()` uses
`ListStream` to receive chunks incrementally. Each chunk's key-value
pairs are converted to synthetic "created" events and queued inline
without assembling the full list response in memory.

For direct `GetList` calls (e.g., from controllers or when WatchList is
disabled), the store consumes `ListStream` and decodes each chunk's
key-value pairs inline as they arrive, overlapping network I/O with
decode. When the server returns `Unimplemented` or the feature gate is
disabled, the store falls back to paginated `List` with a conservative
limit. The `storage.Interface` is unchanged.

### Upgrade / Downgrade Strategy

RangeStream is a new server-side RPC. Older clients that do not call
`RangeStream` are completely unaffected. On downgrade to an etcd version
without `RangeStream`, clients calling the RPC will receive an
`Unimplemented` gRPC error and should fall back to unary `Range`.

## Implementation History

- 2026-03-18: KEP created

---

The second file in the commit is the KEP metadata file:

```yaml
title: etcd RangeStream
kep-number: 5966
authors:
  - "@Jefftree"
owning-sig: sig-etcd
participating-sigs:
  - sig-api-machinery
status: provisional
creation-date: 2026-03-18
reviewers:
  - "@ahrtr"
  - "@fuweid"
  - "@serathius"
approvers:
  - "@ahrtr"
  - "@fuweid"
  - "@serathius"
```
