Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations by laxman-ch · Pull Request #171 · linkedin/helix

laxman-ch · 2026-05-06T21:55:07Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Internal LinkedIn ticket: CICP-2994 (HelixACM pipeline crashes from helix-rest call timeouts).

Description

Here are some details about my PR, including screenshots of any UI changes:

What: Replace the single atomic deleteRecursivelyAtomic([instance, config]) call in ZKHelixAdmin.dropInstance with a two-phase batched approach.

Why: Instances that accumulated large MESSAGES queues (~13K) caused deleteRecursivelyAtomic to bundle every descendant znode into one ZooKeeper multi() packet ~3.75 MB in size. Crossing the 4 MB jute.maxbuffer limit, ZK rejected the packet as CONNECTIONLOSS. The default 24-hour ZkClient retry timeout pinned the Helix REST Jetty thread pool, taking down the embedded REST endpoint until the service was restarted (1,880 threads observed stuck across thread dumps from lor1-0006434).

How:

Phase 1: delete /CONFIGS/PARTICIPANT/{instance} first (single op). Once gone the instance is non-Assignable, so the controller stops generating new state-transition messages while the subtree delete is in flight.
Phase 2: BFS-walk the /INSTANCES/{instance} subtree children-first, then delete in multi() batches of DROP_INSTANCE_DELETE_BATCH_SIZE = 1000 ops. Each on-wire packet is ~240 KB, well under jute.maxbuffer.
NoNode results inside a batch are tolerated, keeping retries idempotent.
NotEmpty results bubble up wrapped as ZkClientException -> ZkException -> KeeperException.NotEmptyException, the same shape the legacy code threw, so the existing 3-retry loop in dropInstancePathsRecursively continues to handle the racy ParticipantHistory write case.
dropInstance's pre-condition is relaxed: it now succeeds when either InstanceConfig or the subtree exists, so a retry after a partial drop can complete cleanup instead of erroring on "does not exist in config".

Tests

The following tests are written for this issue:

In helix-core/src/test/java/org/apache/helix/manager/zk/TestZkHelixAdmin.java:

testZkHelixAdmin (existing, updated): mock-based 3-retry path migrated to OpResult.ErrorResult NOTEMPTY shape.
testDropInstance (existing): regression baseline.
testDropInstanceWithLargeMessageSubtree: 2,500 messages spanning multiple batches.
testDropInstanceResumesAfterPartialDelete: config gone, subtree remains -> resume succeeds.
testDropInstanceWithDeepSubtreeShape: depth>=2 paths under CURRENTSTATES + ERRORS, verifies BFS children-first ordering.
testDropInstanceFitsInSingleBatch: small subtree, single-batch loop iteration.
testDropInstanceFailsFastOnNonRetryableMultiError: non-NotEmpty OpResult error fails fast (multi() invoked exactly once, elapsed < 2s) - directly guards against re-creating the 1,880-stuck-thread incident.
testDropInstanceFailsFastWhenMultiThrows: thrown exception fails fast (no spurious retries).
The following is the result of the "mvn test" command on the appropriate module:

```
mvn -pl helix-core test -Dtest=TestZkHelixAdmin
...
[INFO] Tests run: 28, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Changes that Break Backward Compatibility (Optional)

ZKHelixAdmin.dropInstance is no longer atomic across (InstanceConfig, /INSTANCES/{instance}). A crash between Phase 1 and Phase 2 leaves a stale /INSTANCES/{instance} subtree; the next dropInstance call resumes cleanup. Callers that depended on "either everything is deleted or nothing is" will see partial state during the (typically sub-second) window of Phase 2.
dropInstance no longer throws \"... does not exist in instances for cluster ...\" as a distinct error. The unified pre-condition error message is \"... does not exist in config for cluster ...\" whenever neither InstanceConfig nor the subtree is present.

Documentation (Optional)

N/A

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message".

Code Quality

My diff has been formatted using helix-style.xml

Local Review

Local code review completed

The previous deleteRecursivelyAtomic([instance, config]) bundled every descendant znode into a single ZooKeeper multi() packet. Instances that accumulated large MESSAGES queues (~13K) produced ~3.75 MB packets, crossing the 4 MB jute.maxbuffer limit. ZK rejected these as CONNECTIONLOSS, and the default 24 h ZkClient retry timeout pinned the Helix REST Jetty thread pool until the service was restarted. Replace with a two-phase drop: Phase 1: delete InstanceConfig first (single-op). Makes the instance non-Assignable so the controller stops generating new state- transition messages while the subtree delete is in flight. Phase 2: BFS-walk /INSTANCES/{instance} children-first and delete in multi() batches of 1000 ops (~240 KB packets, well below jute.maxbuffer). NoNode results are tolerated for retry idempotency; NotEmpty results bubble up in the legacy exception shape so the existing 3-retry loop continues to handle the racy ParticipantHistory write case. Trade-offs: - Drop is no longer atomic. A crash between phases leaves a stale /INSTANCES/{instance} subtree; the next dropInstance call resumes cleanup. dropInstance's existence check is relaxed to allow the resume case where InstanceConfig is already gone. - Subtree size no longer affects packet size; works for any instance. Tests (helix-core/.../TestZkHelixAdmin): - testZkHelixAdmin: existing mock-based 3-retry path migrated to OpResult.ErrorResult NOTEMPTY shape. - testDropInstanceWithLargeMessageSubtree: 2,500 messages spanning multiple batches. - testDropInstanceResumesAfterPartialDelete: config gone, subtree remains -> resume succeeds. - testDropInstanceWithDeepSubtreeShape: depth>=2 paths under CURRENTSTATES + ERRORS, verifies BFS children-first ordering. - testDropInstanceFitsInSingleBatch: small subtree, single-batch loop iteration. - testDropInstanceFailsFastOnNonRetryableMultiError: non-NotEmpty OpResult error fails fast (multi() invoked exactly once, elapsed < 2s) - directly guards against re-creating the 1880-stuck-thread incident. - testDropInstanceFailsFastWhenMultiThrows: thrown exception fails fast (no spurious retries). Full TestZkHelixAdmin (28 tests) passes.

laxman-ch force-pushed the vchekka/batched-drop-instance branch from 959fa9a to 6dad074 Compare May 6, 2026 22:21

laxman-ch mentioned this pull request May 7, 2026

Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning #173

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations#171

Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations#171
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-drop-instance

laxman-ch commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

laxman-ch commented May 6, 2026

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Local Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant