Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173
Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173laxman-ch wants to merge 1 commit into
Conversation
The previous deleteRecursively implementation issued one ZK delete per
znode sequentially. While jute.maxbuffer-safe (each op is its own
packet), it is O(N) in network round-trips: a 13K-message subtree
takes ~40s while pinning the calling thread, and a million-znode
cluster drop can run for tens of minutes. The companion
deleteRecursivelyAtomic was even worse - it bundled the entire
subtree into a single multi() call that crossed jute.maxbuffer at
~13K znodes, which surfaced as CONNECTIONLOSS, then the default 24h
ZkClient retry pinned the Helix REST Jetty pool until restart.
Replace deleteRecursively's recursion+single-op with BFS-collect +
batched multi(). Each multi() carries up to
DELETE_RECURSIVE_BATCH_SIZE = 1000 ops (~240 KB on the wire, well
below the 4 MB jute.maxbuffer default), so this method now works for
any subtree size in O(N / batch_size) round-trips. Roughly 100x
speedup at large N.
Public contract preserved:
- Still non-atomic (the javadoc never promised atomicity).
- Still returns success on a missing path.
- Same exception type (ZkClientException) and same wrapping shape
for the racy NotEmpty case (ZkClientException -> ZkException ->
KeeperException.NotEmptyException) so callers' existing retry
loops continue to work.
- NoNode results inside a batch are tolerated for retry idempotency.
ZKHelixAdmin.dropInstance:
Refactor to use the now-batched deleteRecursively and drop the
ad-hoc batched-delete helpers (~80 LoC removed). Keep the two-phase
drop semantics:
Phase 1: delete InstanceConfig first (instance becomes
non-Assignable, controller stops generating state-transition
messages).
Phase 2: _zkClient.deleteRecursively(instancePath) - batching is
now the ZkClient's responsibility.
dropInstance's pre-condition is relaxed to allow the resume case
where InstanceConfig is already gone but the subtree remains.
Tests:
zookeeper-api/.../TestRawZkClient (4 new):
- testDeleteRecursivelyLargeSubtree: 2,500 children -> spans 3
batches.
- testDeleteRecursivelyDeepSubtree: 3 groups x 100 leaves =
depth-3 paths, verifies BFS children-first ordering across
batch boundaries.
- testDeleteRecursivelyOnMissingPathIsNoOp: idempotent on
missing path.
- testDeleteRecursivelyIdempotentAfterPartialDelete: out-of-band
child delete doesn't fail the batch (NoNode tolerated).
helix-core/.../TestZkHelixAdmin (4 new):
- testDropInstanceWithLargeMessageSubtree: 2,500 messages
end-to-end.
- testDropInstanceResumesAfterPartialDelete: config gone,
subtree remains -> resume succeeds.
- testDropInstanceWithDeepSubtreeShape: depth>=2 paths under
CURRENTSTATES + ERRORS.
- testDropInstanceFailsFastOnNonRetryableError: non-NotEmpty
ZkClientException from deleteRecursively must NOT trigger the
3-retry loop; mock verifies invoked exactly once and elapsed
< 2s. Directly guards against re-creating the 1880-stuck-thread
incident.
Existing testZkHelixAdmin's mock migrated from
deleteRecursivelyAtomic boundary to the new deleteRecursively
boundary (same retry-3-times semantics).
All 27 TestRawZkClient and 26 TestZkHelixAdmin pass.
There was a problem hiding this comment.
Nice work! the batching itself is clean and I am aligned with the approach. Also confirmed Phase-1-first ordering is actually doing what your description claims: deleting
/CONFIGS/PARTICIPANT/{instance} drops the instance from _assignableInstanceConfigMap
(BaseControllerDataProvider.java:404-422), which every rebalancer gates on, and also kicks it out of the updateOfflineInstanceHistory iteration (BaseControllerDataProvider.java:1114-1118). So the "controller stops writing while Phase 2 is in flight" story checks out.
Two things bugging me before I'm comfortable shipping it. Both stem from the same thing: dropInstance is no longer atomic across (config, subtree), but nobody told addInstance about it.
1. Stale subtree poisons the next addInstance
Phase 1 succeeds, JVM dies, subtree is left full of MESSAGES / CURRENTSTATES / ERRORS. Operator (or auto-join) eventually calls addInstance for the same name. Today this "works": the config existence check at ZKHelixAdmin.java:218 passes (config really is gone), and then createPersistent(..., true) is called for each subdir. The catch: that overload silently eats ZkNodeExistsException (ZkClient.java:612-628). Stale children survive. Only HISTORY gets clobbered (line 260).
The painful thing can be that stale CURRENTSTATES are keyed by old session IDs, and the controller reads them across sessions, so it thinks the freshly-registered participant is still hosting whatever the dead incarnation was hosting.
2. Same-name re-add can race inside the Phase 1 -> Phase 2 window
No lock between drop and add (TODO at line 322 already says as much). With the old atomic delete this was fine; with the new two-phase flow, this interleaving is reachable:
T1: dropInstance Phase 1 done (config gone)
T2: addInstance runs, precondition passes, creates fresh config
+ 7 standard subdirs
T1: Phase 2 deleteRecursively(/INSTANCES/{instance})
--> wipes T2's freshly created subdirs
You end up with a ghost: config present, subtree gone. Next handleNewSession loops in isInstanceSetup=false (ZKUtil.java:149-179). And this isn't theoretical -- it can fire from REST DELETE racing REST PUT, from ParticipantDeregistrationStage racing participant auto-join (ParticipantManager.java:245), and from purgeOfflineInstances racing auto-join (the purgeInstance path at line 341-344 doesn't even check LiveInstance).
Suggested fix: one extra precondition in addInstance
Both of these collapse into the same root cause, addInstance assumes (config absent) => (subtree absent), and the PR just made that no longer true. Fix it on the observer side, not the mutator side:
// after the existing config-exists check at ZKHelixAdmin.java:218
String instancePath = PropertyPathBuilder.instance(clusterName, nodeId);
if (_zkClient.exists(instancePath)) {
throw new HelixException("Node " + nodeId + " has a residual subtree at "
+ instancePath + " without an InstanceConfig. Indicates a partial or "
+ "in-flight dropInstance. Run dropInstance to complete cleanup, then retry.");
}This fixes both:
- Issue 1: stale-after-crash now throws loudly. Recovery is one call to
dropInstance(relaxed precondition you already added resumes Phase 2 cleanly), thenaddInstancesucceeds. - Issue 2: every racy interleaving where Phase 2 hasn't finished yet leaves the subtree visible to T2's check, so T2 throws and the caller retries. T2 can only succeed once T1 is fully done. Serialization at the observer.
| */ | ||
| public void deleteRecursively(String path) throws ZkClientException { | ||
| List<String> children; | ||
| if (!exists(path)) { |
There was a problem hiding this comment.
nit: checked the zk side code of this, The if (!exists(path)) return; here is an extra ZK round-trip for a case collectSubtreeChildrenFirst already handles via its ZkNoNodeException catch on the initial getChildren. On the no-op path you pay 2 round-trips instead of 1.
Issues
Internal LinkedIn ticket: CICP-2994. Companion architectural fix to PR #171 (which is the contained dropInstance-only fix). Either can land; this one is preferred.
Description
What: Replace `ZkClient.deleteRecursively`'s single-op recursion with BFS-collect + batched `multi()`. Refactor `ZKHelixAdmin.dropInstance` to consume it.
Why: Two interlocking problems with the old recursive-delete machinery in helix-zookeeper-api:
How:
`ZkClient.deleteRecursively`: BFS-collect children-first into an in-memory list, then issue `multi()` batches of `DELETE_RECURSIVE_BATCH_SIZE = 1000` ops (~240 KB packets, well under `jute.maxbuffer`). Roughly 100× speedup at large N. The public contract is preserved: still non-atomic, still no-op on missing path, same `ZkClientException` type, same `ZkClientException → ZkException → KeeperException.NotEmptyException` shape so callers' existing NotEmpty retry loops keep working. NoNode within a batch is tolerated for idempotency.
`ZKHelixAdmin.dropInstance`: drop the previous atomic-delete-of-(instance, config) approach entirely. Two-phase drop:
The pre-condition check is relaxed to allow the resume case where InstanceConfig is already gone but the subtree remains.
Beneficiaries beyond dropInstance: 13 production callers of `deleteRecursively` get the speedup automatically — including `ZKHelixAdmin.dropCluster`, `addCluster` reset, `addStateModelDef` reset, `ZkBaseDataAccessor.deleteRecursively`, `ParticipantManager.cleanup`, the admin-webapp REST resources, and the ZKDumper CLI.
Tests
In `zookeeper-api/.../TestRawZkClient` (4 new):
In `helix-core/.../TestZkHelixAdmin` (4 new):
Plus the existing `testZkHelixAdmin` mock migrated from the old `deleteRecursivelyAtomic` boundary to the new `deleteRecursively` boundary (same 3-retry-then-fail semantics).
```
mvn -pl zookeeper-api test -Dtest=TestRawZkClient
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
mvn -pl helix-core test -Dtest=TestZkHelixAdmin
[INFO] Tests run: 26, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```
Changes that Break Backward Compatibility (Optional)
Documentation (Optional)
Commits
Code Quality
Local Review