Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations #171
Draft
laxman-ch wants to merge 1 commit into
The previous deleteRecursivelyAtomic([instance, config]) bundled every
descendant znode into a single ZooKeeper multi() packet. Instances that
accumulated large MESSAGES queues (~13K) produced ~3.75 MB packets,
crossing the 4 MB jute.maxbuffer limit. ZK rejected these as
CONNECTIONLOSS, and the default 24 h ZkClient retry timeout pinned the
Helix REST Jetty thread pool until the service was restarted.
Replace with a two-phase drop:
Phase 1: delete InstanceConfig first (single-op). Makes the instance
non-Assignable so the controller stops generating new state-
transition messages while the subtree delete is in flight.
Phase 2: BFS-walk /INSTANCES/{instance} children-first and delete in
multi() batches of 1000 ops (~240 KB packets, well below
jute.maxbuffer). NoNode results are tolerated for retry idempotency;
NotEmpty results bubble up in the legacy exception shape so the
existing 3-retry loop continues to handle the racy
ParticipantHistory write case.
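The Phase 2 ordering and batching can be sketched against an in-memory tree. This is a minimal illustration, not the code in this PR: the class and method names, the map-based tree, and the instance name are all hypothetical, and only the batch size of 1000 comes from the description above. Reversing a BFS visit order is one simple way to guarantee that every child is deleted before its parent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; no ZooKeeper client is involved.
final class BatchedSubtreeDelete {
  // Batch size quoted in the description; everything else here is illustrative.
  static final int BATCH_SIZE = 1000;

  /** BFS from root, then reverse, so deeper paths come first and the root comes last. */
  static List<String> childrenFirstOrder(String root, Map<String, List<String>> children) {
    List<String> visited = new ArrayList<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(root);
    while (!queue.isEmpty()) {
      String path = queue.poll();
      visited.add(path);
      for (String child : children.getOrDefault(path, Collections.emptyList())) {
        queue.add(path + "/" + child);
      }
    }
    Collections.reverse(visited);
    return visited;
  }

  /** Splits the ordered paths into consecutive chunks of at most batchSize entries. */
  static List<List<String>> toBatches(List<String> paths, int batchSize) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < paths.size(); i += batchSize) {
      batches.add(new ArrayList<>(paths.subList(i, Math.min(i + batchSize, paths.size()))));
    }
    return batches;
  }

  public static void main(String[] args) {
    String root = "/INSTANCES/host_1"; // hypothetical instance name
    Map<String, List<String>> tree = new HashMap<>();
    tree.put(root, Collections.singletonList("MESSAGES"));
    List<String> messages = new ArrayList<>();
    for (int i = 0; i < 2500; i++) {
      messages.add("msg_" + i);
    }
    tree.put(root + "/MESSAGES", messages);

    // 2,502 paths (root + MESSAGES + 2,500 messages) -> batches of 1000, 1000, 502.
    for (List<String> batch : toBatches(childrenFirstOrder(root, tree), BATCH_SIZE)) {
      System.out.println("batch of " + batch.size() + " deletes");
    }
  }
}
```

Because the batches are submitted in order, a parent can only appear in the same batch as its children or in a later one, so no delete should hit a non-empty node unless something writes into the subtree concurrently.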
Trade-offs:
- Drop is no longer atomic. A crash between phases leaves a stale
/INSTANCES/{instance} subtree; the next dropInstance call resumes
cleanup. dropInstance's existence check is relaxed to allow the
resume case where InstanceConfig is already gone.
- Subtree size no longer affects packet size; works for any instance.
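The relaxed existence check can be read as a small decision table. The sketch below is only an interpretation of the trade-off described above; the class, enum, and method names are hypothetical and do not come from ZKHelixAdmin.

```java
// Hypothetical sketch of the relaxed pre-condition; not the actual ZKHelixAdmin code.
final class DropPrecondition {
  enum Action {
    REJECT,         // neither InstanceConfig nor the /INSTANCES/{instance} subtree exists
    FULL_DROP,      // InstanceConfig exists: run Phase 1, then Phase 2
    RESUME_CLEANUP  // InstanceConfig already gone, subtree remains: run Phase 2 only
  }

  static Action decide(boolean configExists, boolean subtreeExists) {
    if (!configExists && !subtreeExists) {
      return Action.REJECT;
    }
    return configExists ? Action.FULL_DROP : Action.RESUME_CLEANUP;
  }

  public static void main(String[] args) {
    // State left behind by a crash between the two phases.
    System.out.println(decide(false, true));
  }
}
```

Under this reading, a retry after a crash between phases lands in the RESUME_CLEANUP branch instead of failing the existence check.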
Tests (helix-core/.../TestZkHelixAdmin):
- testZkHelixAdmin: existing mock-based 3-retry path migrated to
OpResult.ErrorResult NOTEMPTY shape.
- testDropInstanceWithLargeMessageSubtree: 2,500 messages spanning
multiple batches.
- testDropInstanceResumesAfterPartialDelete: config gone, subtree
remains -> resume succeeds.
- testDropInstanceWithDeepSubtreeShape: depth>=2 paths under
CURRENTSTATES + ERRORS, verifies BFS children-first ordering.
- testDropInstanceFitsInSingleBatch: small subtree, single-batch
loop iteration.
- testDropInstanceFailsFastOnNonRetryableMultiError: non-NotEmpty
OpResult error fails fast (multi() invoked exactly once,
elapsed < 2s) - directly guards against re-creating the
1880-stuck-thread incident.
- testDropInstanceFailsFastWhenMultiThrows: thrown exception fails
fast (no spurious retries).
Full TestZkHelixAdmin (28 tests) passes.
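The retry behavior these tests pin down can be sketched as a small policy: NoNode is tolerated, NotEmpty is retried, and anything else fails on the first attempt. This is an assumption-laden model, not the PR's code. The enums, the Supplier standing in for a multi() submission, and the reading of "3-retry" as three total attempts are all hypothetical, and the sketch deliberately does not model how ZooKeeper reports per-op results for a failed multi().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of the retry policy; OpOutcome stands in for ZooKeeper's per-op results.
final class DropRetryPolicy {
  enum OpOutcome { OK, NO_NODE, NOT_EMPTY, OTHER_ERROR }
  enum Verdict { DONE, RETRY, FAIL_FAST }

  // Assumed attempt budget; the source only says "3-retry loop".
  static final int MAX_ATTEMPTS = 3;

  /** NO_NODE counts as success; NOT_EMPTY is retryable; any other error is terminal. */
  static Verdict classify(List<OpOutcome> results) {
    boolean sawNotEmpty = false;
    for (OpOutcome outcome : results) {
      if (outcome == OpOutcome.OTHER_ERROR) {
        return Verdict.FAIL_FAST;
      }
      if (outcome == OpOutcome.NOT_EMPTY) {
        sawNotEmpty = true;
      }
    }
    return sawNotEmpty ? Verdict.RETRY : Verdict.DONE;
  }

  /** Returns the number of attempts used; throws once the batch cannot succeed. */
  static int runWithRetries(Supplier<List<OpOutcome>> submitBatch) {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      Verdict verdict = classify(submitBatch.get());
      if (verdict == Verdict.DONE) {
        return attempt;
      }
      if (verdict == Verdict.FAIL_FAST) {
        throw new IllegalStateException("non-retryable error on attempt " + attempt);
      }
    }
    throw new IllegalStateException("still NotEmpty after " + MAX_ATTEMPTS + " attempts");
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // First submission races with a concurrent write (NOT_EMPTY); the second succeeds.
    int attempts = runWithRetries(() -> {
      calls[0]++;
      return calls[0] == 1
          ? Arrays.asList(OpOutcome.OK, OpOutcome.NOT_EMPTY)
          : Arrays.asList(OpOutcome.OK, OpOutcome.NO_NODE);
    });
    System.out.println("attempts used: " + attempts);
  }
}
```

The key property for the incident described above is the FAIL_FAST branch: a non-retryable error ends the call after a single submission instead of being retried.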
Issues
Internal LinkedIn ticket: CICP-2994 (HelixACM pipeline crashes from helix-rest call timeouts).
Description
What: Replace the single atomic `deleteRecursivelyAtomic([instance, config])` call in `ZKHelixAdmin.dropInstance` with a two-phase batched approach.
Why: Instances that accumulated large MESSAGES queues (~13K) caused `deleteRecursivelyAtomic` to bundle every descendant znode into one ZooKeeper `multi()` packet of ~3.75 MB. Crossing the 4 MB `jute.maxbuffer` limit, ZK rejected the packet as `CONNECTIONLOSS`. The default 24-hour `ZkClient` retry timeout pinned the Helix REST Jetty thread pool, taking down the embedded REST endpoint until the service was restarted (1,880 threads observed stuck across thread dumps from `lor1-0006434`).
How:
- Phase 1: delete `/CONFIGS/PARTICIPANT/{instance}` first (single op). Once it is gone the instance is non-Assignable, so the controller stops generating new state-transition messages while the subtree delete is in flight.
- Phase 2: BFS-walk the `/INSTANCES/{instance}` subtree children-first, then delete in `multi()` batches of `DROP_INSTANCE_DELETE_BATCH_SIZE = 1000` ops. Each on-wire packet is ~240 KB, well under `jute.maxbuffer`.
- `NoNode` results inside a batch are tolerated, keeping retries idempotent.
- `NotEmpty` results bubble up wrapped as `ZkClientException -> ZkException -> KeeperException.NotEmptyException`, the same shape the legacy code threw, so the existing 3-retry loop in `dropInstancePathsRecursively` continues to handle the racy `ParticipantHistory` write case.
- `dropInstance`'s pre-condition is relaxed: it now succeeds when either InstanceConfig or the subtree exists, so a retry after a partial drop can complete cleanup instead of erroring on "does not exist in config".
Tests
In `helix-core/src/test/java/org/apache/helix/manager/zk/TestZkHelixAdmin.java`:
- `testZkHelixAdmin` (existing, updated): mock-based 3-retry path migrated to `OpResult.ErrorResult` NOTEMPTY shape.
- `testDropInstance` (existing): regression baseline.
- `testDropInstanceWithLargeMessageSubtree`: 2,500 messages spanning multiple batches.
- `testDropInstanceResumesAfterPartialDelete`: config gone, subtree remains -> resume succeeds.
- `testDropInstanceWithDeepSubtreeShape`: depth>=2 paths under CURRENTSTATES + ERRORS, verifies BFS children-first ordering.
- `testDropInstanceFitsInSingleBatch`: small subtree, single-batch loop iteration.
- `testDropInstanceFailsFastOnNonRetryableMultiError`: non-NotEmpty OpResult error fails fast (`multi()` invoked exactly once, elapsed < 2s) - directly guards against re-creating the 1,880-stuck-thread incident.
- `testDropInstanceFailsFastWhenMultiThrows`: thrown exception fails fast (no spurious retries).
The following is the result of the "mvn test" command on the appropriate module:
```
mvn -pl helix-core test -Dtest=TestZkHelixAdmin
...
[INFO] Tests run: 28, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```
Changes that Break Backward Compatibility (Optional)
- `ZKHelixAdmin.dropInstance` is no longer atomic across `(InstanceConfig, /INSTANCES/{instance})`. A crash between Phase 1 and Phase 2 leaves a stale `/INSTANCES/{instance}` subtree; the next `dropInstance` call resumes cleanup. Callers that depended on "either everything is deleted or nothing is" will see partial state during the (typically sub-second) window of Phase 2.
- `dropInstance` no longer throws "... does not exist in instances for cluster ..." as a distinct error. The unified pre-condition error message is "... does not exist in config for cluster ..." whenever neither InstanceConfig nor the subtree is present.
Documentation (Optional)
Commits
Code Quality
Local Review