Skip to content

perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode#49513

Open
xinlian12 wants to merge 5 commits into
Azure:mainfrom
xinlian12:perf/pkrange-cache-strip-fields
Open

perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode#49513
xinlian12 wants to merge 5 commits into
Azure:mainfrom
xinlian12:perf/pkrange-cache-strip-fields

Conversation

@xinlian12

@xinlian12 xinlian12 commented Jun 15, 2026

Copy link
Copy Markdown
Member

Summary

Reduce per-instance memory footprint of every PartitionKeyRange the SDK deserializes from a service response by stripping fields the SDK does not need to retain. This is the Java port of the per-record half of Azure/azure-sdk-for-python#46297 (item #2 — "Strip unused fields → compact PKRange").

Background

A typical PartitionKeyRange returned by Cosmos DB carries 13 fields:

{
  "_rid" : "90t-ALzvP44CAAAAAAAAUA==",
  "id" : "0",
  "_etag" : "\"00001e02-0000-0800-0000-6a2c41690000\"",
  "minInclusive" : "",
  "maxExclusive" : "FF",
  "ridPrefix" : 0,
  "_self" : "dbs/90t-AA==/colls/90t-ALzvP44=/pkranges/90t-ALzvP44CAAAAAAAAUA==/",
  "throughputFraction" : 1,
  "status" : "online",
  "parents" : [ ],
  "ownedArchivalPKRangeIds" : [ ],
  "_ts" : 1781285225,
  "lsn" : 87,
  "_lsn" : 87
}

In Java, every PartitionKeyRange stores those fields in a Jackson ObjectNode (JsonSerializable.propertyBag). Each unused string field carries entry + TextNode + String overhead — the large strings (_self path, base64 _rid, quoted _etag) are the main cost. Across many physical partitions × many CosmosClient instances this becomes material per-JVM heap.

Change

The strip happens in PartitionKeyRange(ObjectNode) — the single constructor funnel every server-deserialized partition key range flows through (via JsonSerializable.instantiateFromObjectNodeAndType, which FeedResponse<PartitionKeyRange> and the change-feed path both use). One source of truth; no cache-side helper needed.

private static final String[] DROPPED_FIELDS = new String[] {
    "_rid", "_etag", "ridPrefix", "_self",
    "ownedArchivalPKRangeIds", "_ts", "lsn"
};

public PartitionKeyRange(ObjectNode objectNode) {
    super(stripUnusedFields(objectNode));
}

private static ObjectNode stripUnusedFields(ObjectNode objectNode) {
    if (objectNode != null) {
        for (String field : DROPPED_FIELDS) {
            objectNode.remove(field);
        }
    }
    return objectNode;
}

Compatibility

  • PartitionKeyRange is in com.azure.cosmos.implementation — internal, not public.
  • No wire-protocol or serialization change.
  • equals / hashCode semantics unchanged.
  • (String) constructor still returns a full-fidelity instance for test/sample code that needs it.

Copilot AI review requested due to automatic review settings June 15, 2026 16:43
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners June 15, 2026 16:43

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces Azure Cosmos DB Java client routing-map memory usage by compacting cached PartitionKeyRange instances to retain only the fields actually consumed by the routing map (id, minInclusive, maxExclusive, and non-empty parents), while adding unit tests and documenting the change in the Cosmos SDK changelog.

Changes:

  • Compact cached PartitionKeyRange objects in RxPartitionKeyRangeCache.updateRoutingMap by stripping unused fields before building the routing map.
  • Add unit tests ensuring stripped fields are removed from cached ranges and that non-empty parents is preserved.
  • Add a release note entry describing the memory optimization.
Show a summary per file
File Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java Strips unused fields from PartitionKeyRange instances before caching in the routing map; adds documentation for the optimization.
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the routing-map memory optimization in the Unreleased notes.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java Adds unit tests validating unused fields are stripped and parents is preserved when non-empty.

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 4

Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated
@xinlian12 xinlian12 force-pushed the perf/pkrange-cache-strip-fields branch from 67cd5c9 to bb929d6 Compare June 15, 2026 16:56
@xinlian12 xinlian12 changed the title perf(cosmos): strip unused fields from cached PartitionKeyRange to reduce routing-map memory [NO REVIEW]perf(cosmos): strip unused fields from cached PartitionKeyRange to reduce routing-map memory Jun 15, 2026
@xinlian12

Copy link
Copy Markdown
Member Author

@sdkReviewAgent

… from ObjectNode

Reduce per-instance memory footprint of every PartitionKeyRange the SDK
deserializes from a service response by removing fields the SDK does not
need to retain in heap.

Implementation moved into PartitionKeyRange(ObjectNode) -- the universal
constructor funnel used by JsonSerializable#instantiateFromObjectNodeAndType
for every FeedResponse<PartitionKeyRange> page (and the change-feed
deserialization path). All ingress paths that matter to memory pressure
flow through this ctor, so the strip happens once at the boundary and
the RxPartitionKeyRangeCache requires no change.

Dropped fields are kept identical to the equivalent Python optimization
in Azure/azure-sdk-for-python#46297 to keep
cross-SDK behaviour aligned:

  Dropped: _rid, _etag, ridPrefix, _self, ownedArchivalPKRangeIds,
           _ts, lsn
  Kept:    id, minInclusive, maxExclusive, parents, throughputFraction,
           status, _lsn, and any future field the service starts
           returning (deny-list, not allow-list).

Cross-SDK consistency matters here: a field that Java does not consume
today may become useful tomorrow, and aligning with Python's audited
deny-list ensures both SDKs make the same memory-vs-future-flexibility
trade-off in one place.

The ObjectNode argument is mutated in place rather than deep-copied
because every production caller obtains it fresh from Jackson
deserialization and does not retain another reference. Tests that need
to preserve a fully-populated source object can use ObjectNode#deepCopy
before handing it to the constructor. The (String) ctor is intentionally
untouched -- it is used only by tests/samples that may want full-fidelity
round-trip JSON, and a unit test pins that behaviour.

PartitionKeyRange#equals / #hashCode already key on id/min/max only, so
slim instances remain value-equal to manually-constructed ones. No
public-API or wire change. PartitionKeyRange itself is in the
com.azure.cosmos.implementation package.

New unit tests in PartitionKeyRangeTest cover:
  - all 7 deny-listed fields are removed via the ObjectNode ctor
  - throughputFraction, status, _lsn, parents all preserved
  - unknown future field passes through (deny-list semantics)
  - non-empty parents preserved (split-merge bookkeeping)
  - equals / hashCode / toRange behaviour unchanged
  - null ObjectNode is tolerated (pre-existing contract)
  - String ctor retains full fidelity
  - JsonSerializable.instantiateFromObjectNodeAndType (FeedResponse
    funnel) routes through the stripped ctor

Full mvn -P unit verify on azure-cosmos-tests: 2536 tests, 0 failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 force-pushed the perf/pkrange-cache-strip-fields branch from bb929d6 to 8bd4d3a Compare June 15, 2026 19:31
@xinlian12 xinlian12 changed the title [NO REVIEW]perf(cosmos): strip unused fields from cached PartitionKeyRange to reduce routing-map memory perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode Jun 15, 2026
Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated
@xinlian12 xinlian12 changed the title perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode [NO REVIEW]perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode Jun 15, 2026
@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (31:57)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (33:49)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (36:58)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (36:44)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Replace the deny-list of 7 known service fields with an allow-list of
the 6 fields the SDK actually needs: id, minInclusive, maxExclusive,
parents, status, throughputFraction.

This mirrors the slots of Python's PKRange namedtuple in
Azure/azure-sdk-for-python#46297 and bounds per-instance heap against
any future server-side payload growth. Previously the deny-list let new
or undocumented fields (e.g. _lsn, or anything the service may add
later) silently inflate the cache.

Tests updated to pin allow-list semantics, including a test that an
unknown future field is dropped rather than passed through.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12

Copy link
Copy Markdown
Member Author

@sdkReviewAgent

…s header

- Remove duplicate '#### Other Changes' header (Copilot review comment).
- Shorten the PR entry per xinlian12's suggestion, linking to the PR
  instead of restating the design in the changelog.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12

Copy link
Copy Markdown
Member Author

Status update after pushing 6901e6e2ec1:

Review comments

  • @xinlian12 (CHANGELOG suggestion) — applied; shortened the entry and linked back to this PR.
  • ✅ Copilot "duplicate #### Other Changes header" — fixed; entry now lives under the single existing header.
  • ⚠️ The other three Copilot comments (RxPartitionKeyRangeCache.java JavaDoc, stripUnusedFields visibility, RxPartitionKeyRangeCacheTest.java comment) are stale — they reference an earlier revision that did the strip inside RxPartitionKeyRangeCache. The current PR does the strip inside PartitionKeyRange(ObjectNode) and does not touch RxPartitionKeyRangeCache at all, so those points no longer apply.

CI failures (both look unrelated to this change):

  • Build Test windows2022_18_NotFromSource_TestsOnly — Scala plugin failed with scala-maven-plugin:4.8.1: A required class was missing while executing ... xsbt/ExtractAPI during the -am install of azure-cosmos-spark_3-3_2-12. That's a Maven plugin/classpath issue in the build infrastructure, not a code failure.
  • Test Emulator windows2022_KafkaIntegrationTeststargetingCosmosEmulatorJava17 — single failure in CosmosSinkTaskTest.retryOnServiceUnavailable[false] with CosmosWriteException: INVALID resource id null. This is a fault-injection test (forces GONE 503 then expects retry success). The only ResourceId.parse(...) call sites in the SDK parse the resource id off request/collection/sessionToken — never off a PartitionKeyRange instance — so the field strip cannot make this id null. The previous deny-list revision of this PR also stripped _rid and this test passed; the allow-list revision strips the same plus _lsn/unknown future fields, none of which feed into ResourceId.parse. Flagging as flaky.

Will let the rest of the pipeline finish to confirm.

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (33:47)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (35:22)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated
@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (36:35)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12

Copy link
Copy Markdown
Member Author

Review complete (37:27)

No new comments — existing review coverage is sufficient.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, test-coverage | ⊘ synthesis

Reviewer @xinlian12 (blocking) and the Kafka CI failure
(CosmosSinkTaskTest.retryOnServiceUnavailable -> 'INVALID resource id null')
both point at the same root cause: AddressResolver.isSameCollection() calls
getResourceId() on PartitionKeyRange instances during target-change
detection on a 410/Gone retry. Without _rid on the allow-list,
ResourceId.parse(null) throws.

Add _rid (Constants.Properties.R_ID) to KEPT_FIELDS. This is a Java-specific
deviation from Python's PKRange namedtuple (Python's address-resolution path
does not have the equivalent dependency); the rest of the allow-list still
mirrors Python.

Tests:
- Update KEPT_FIELDS / STRIPPED_FIELDS in PartitionKeyRangeTest to reflect
  _rid is now retained.
- Add the recommended getResourceId() assertion on a stripped instance.
- Adjust the parents test to assert _rid survives instead of being stripped.

Also revises the javadoc on KEPT_FIELDS to call out the Java-specific
_rid retention with the exact AddressResolver reason, so future readers
don't try to remove it again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12

Copy link
Copy Markdown
Member Author

Pushed c62ed2022be to address the blocking finding.

Root cause confirmed. The reviewer is right and I was wrong to dismiss the earlier Kafka CI failure as flaky. AddressResolver.isSameCollection() at AddressResolver.java:122-123 calls getResourceId() on PartitionKeyRange instances during retry target-change detection. With _rid stripped, ResourceId.parse(null) throws "INVALID resource id null" — which is exactly the CosmosSinkTaskTest.retryOnServiceUnavailable failure (fault-injection forces a 410/Gone → retry → isSameCollection → parse(null)).

When I grepped earlier I only looked for the literal string getResourceId adjacent to a pkRange/partitionKeyRange variable name and missed the AddressResolver call site where the two arguments are bound to parameters named initiallyResolved / newlyResolved. That's the bug.

Fix: added Constants.Properties.R_ID (_rid) to KEPT_FIELDS. The allow-list is now Python's PKRange slots + _rid (Java-specific). Javadoc updated to call out the AddressResolver reason so this isn't re-removed later.

Tests:

  • STRIPPED_FIELDS no longer contains _rid; KEPT_FIELDS adds it.
  • Added the recommended getResourceId() assertion on a stripped instance.
  • Adjusted objectNodeConstructor_preservesNonEmptyParents to assert _rid survives (was asserting it was stripped).

Other comments:

  • ✅ Duplicate #### Other Changes header — fixed in prior commit 6901e6e2ec1.
  • ✅ Testing recommendation — added getResourceId() assertion.

The single-arg JSON constructor was only used by the now-removed
stringConstructor_isUnaffectedByStrip test. No production code calls it
(the deserialization funnel uses PartitionKeyRange(ObjectNode) via
JsonSerializable.instantiateFromObjectNodeAndType). Removing it
eliminates a path that bypasses the allow-list strip.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - tests

@xinlian12

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - kafka

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 xinlian12 changed the title [NO REVIEW]perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode perf(cosmos): strip unused fields when constructing PartitionKeyRange from ObjectNode Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants