perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode#49513
perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode#49513xinlian12 wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR reduces Azure Cosmos DB Java client routing-map memory usage by compacting cached PartitionKeyRange instances to retain only the fields actually consumed by the routing map (id, minInclusive, maxExclusive, and non-empty parents), while adding unit tests and documenting the change in the Cosmos SDK changelog.
Changes:
- Compact cached
PartitionKeyRangeobjects inRxPartitionKeyRangeCache.updateRoutingMapby stripping unused fields before building the routing map. - Add unit tests ensuring stripped fields are removed from cached ranges and that non-empty
parentsis preserved. - Add a release note entry describing the memory optimization.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java | Strips unused fields from PartitionKeyRange instances before caching in the routing map; adds documentation for the optimization. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the routing-map memory optimization in the Unreleased notes. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java | Adds unit tests validating unused fields are stripped and parents is preserved when non-empty. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 4
67cd5c9 to
bb929d6
Compare
|
@sdkReviewAgent |
… from ObjectNode Reduce per-instance memory footprint of every PartitionKeyRange the SDK deserializes from a service response by removing fields the SDK does not need to retain in heap. Implementation moved into PartitionKeyRange(ObjectNode) -- the universal constructor funnel used by JsonSerializable#instantiateFromObjectNodeAndType for every FeedResponse<PartitionKeyRange> page (and the change-feed deserialization path). All ingress paths that matter to memory pressure flow through this ctor, so the strip happens once at the boundary and the RxPartitionKeyRangeCache requires no change. Dropped fields are kept identical to the equivalent Python optimization in Azure/azure-sdk-for-python#46297 to keep cross-SDK behaviour aligned: Dropped: _rid, _etag, ridPrefix, _self, ownedArchivalPKRangeIds, _ts, lsn Kept: id, minInclusive, maxExclusive, parents, throughputFraction, status, _lsn, and any future field the service starts returning (deny-list, not allow-list). Cross-SDK consistency matters here: a field that Java does not consume today may become useful tomorrow, and aligning with Python's audited deny-list ensures both SDKs make the same memory-vs-future-flexibility trade-off in one place. The ObjectNode argument is mutated in place rather than deep-copied because every production caller obtains it fresh from Jackson deserialization and does not retain another reference. Tests that need to preserve a fully-populated source object can use ObjectNode#deepCopy before handing it to the constructor. The (String) ctor is intentionally untouched -- it is used only by tests/samples that may want full-fidelity round-trip JSON, and a unit test pins that behaviour. PartitionKeyRange#equals / #hashCode already key on id/min/max only, so slim instances remain value-equal to manually-constructed ones. No public-API or wire change. PartitionKeyRange itself is in the com.azure.cosmos.implementation package. New unit tests in PartitionKeyRangeTest cover: - all 7 deny-listed fields are removed via the ObjectNode ctor - throughputFraction, status, _lsn, parents all preserved - unknown future field passes through (deny-list semantics) - non-empty parents preserved (split-merge bookkeeping) - equals / hashCode / toRange behaviour unchanged - null ObjectNode is tolerated (pre-existing contract) - String ctor retains full fidelity - JsonSerializable.instantiateFromObjectNodeAndType (FeedResponse funnel) routes through the stripped ctor Full mvn -P unit verify on azure-cosmos-tests: 2536 tests, 0 failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bb929d6 to
8bd4d3a
Compare
|
✅ Review complete (31:57) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (33:49) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (36:58) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (36:44) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Replace the deny-list of 7 known service fields with an allow-list of the 6 fields the SDK actually needs: id, minInclusive, maxExclusive, parents, status, throughputFraction. This mirrors the slots of Python's PKRange namedtuple in Azure/azure-sdk-for-python#46297 and bounds per-instance heap against any future server-side payload growth. Previously the deny-list let new or undocumented fields (e.g. _lsn, or anything the service may add later) silently inflate the cache. Tests updated to pin allow-list semantics, including a test that an unknown future field is dropped rather than passed through. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@sdkReviewAgent |
…s header - Remove duplicate '#### Other Changes' header (Copilot review comment). - Shorten the PR entry per xinlian12's suggestion, linking to the PR instead of restating the design in the changelog. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Status update after pushing Review comments
CI failures (both look unrelated to this change):
Will let the rest of the pipeline finish to confirm. |
|
✅ Review complete (33:47) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (35:22) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (36:35) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (37:27) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, test-coverage | ⊘ synthesis |
Reviewer @xinlian12 (blocking) and the Kafka CI failure (CosmosSinkTaskTest.retryOnServiceUnavailable -> 'INVALID resource id null') both point at the same root cause: AddressResolver.isSameCollection() calls getResourceId() on PartitionKeyRange instances during target-change detection on a 410/Gone retry. Without _rid on the allow-list, ResourceId.parse(null) throws. Add _rid (Constants.Properties.R_ID) to KEPT_FIELDS. This is a Java-specific deviation from Python's PKRange namedtuple (Python's address-resolution path does not have the equivalent dependency); the rest of the allow-list still mirrors Python. Tests: - Update KEPT_FIELDS / STRIPPED_FIELDS in PartitionKeyRangeTest to reflect _rid is now retained. - Add the recommended getResourceId() assertion on a stripped instance. - Adjust the parents test to assert _rid survives instead of being stripped. Also revises the javadoc on KEPT_FIELDS to call out the Java-specific _rid retention with the exact AddressResolver reason, so future readers don't try to remove it again. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed Root cause confirmed. The reviewer is right and I was wrong to dismiss the earlier Kafka CI failure as flaky. When I grepped earlier I only looked for the literal string Fix: added Tests:
Other comments:
|
The single-arg JSON constructor was only used by the now-removed stringConstructor_isUnaffectedByStrip test. No production code calls it (the deserialization funnel uses PartitionKeyRange(ObjectNode) via JsonSerializable.instantiateFromObjectNodeAndType). Removing it eliminates a path that bypasses the allow-list strip. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - spark |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - kafka |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Summary
Reduce per-instance memory footprint of every
PartitionKeyRangethe SDK deserializes from a service response by stripping fields the SDK does not need to retain. This is the Java port of the per-record half of Azure/azure-sdk-for-python#46297 (item #2 — "Strip unused fields → compact PKRange").Background
A typical
PartitionKeyRangereturned by Cosmos DB carries 13 fields:{ "_rid" : "90t-ALzvP44CAAAAAAAAUA==", "id" : "0", "_etag" : "\"00001e02-0000-0800-0000-6a2c41690000\"", "minInclusive" : "", "maxExclusive" : "FF", "ridPrefix" : 0, "_self" : "dbs/90t-AA==/colls/90t-ALzvP44=/pkranges/90t-ALzvP44CAAAAAAAAUA==/", "throughputFraction" : 1, "status" : "online", "parents" : [ ], "ownedArchivalPKRangeIds" : [ ], "_ts" : 1781285225, "lsn" : 87, "_lsn" : 87 }In Java, every
PartitionKeyRangestores those fields in a JacksonObjectNode(JsonSerializable.propertyBag). Each unused string field carries entry +TextNode+Stringoverhead — the large strings (_selfpath, base64_rid, quoted_etag) are the main cost. Across many physical partitions × manyCosmosClientinstances this becomes material per-JVM heap.Change
The strip happens in
PartitionKeyRange(ObjectNode)— the single constructor funnel every server-deserialized partition key range flows through (viaJsonSerializable.instantiateFromObjectNodeAndType, whichFeedResponse<PartitionKeyRange>and the change-feed path both use). One source of truth; no cache-side helper needed.Compatibility
PartitionKeyRangeis incom.azure.cosmos.implementation— internal, not public.equals/hashCodesemantics unchanged.(String)constructor still returns a full-fidelity instance for test/sample code that needs it.