Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning by laxman-ch · Pull Request #173 · linkedin/helix

laxman-ch · 2026-05-07T16:09:28Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Internal LinkedIn ticket: CICP-2994. Companion architectural fix to PR #171 (which is the contained dropInstance-only fix). Either can land; this one is preferred.

Description

Here are some details about my PR, including screenshots of any UI changes:

What: Replace `ZkClient.deleteRecursively`'s single-op recursion with BFS-collect + batched `multi()`. Refactor `ZKHelixAdmin.dropInstance` to consume it.

Why: Two interlocking problems with the old recursive-delete machinery in helix-zookeeper-api:

`deleteRecursivelyAtomic` exceeds `jute.maxbuffer`. It bundles every descendant znode into one `multi()`. On instances that accumulated ~13K MESSAGES, the packet was ~3.75 MB and ZK rejected it as `CONNECTIONLOSS`. The default 24h `ZkClient` retry timeout pinned 1,880 Helix REST Jetty threads (observed in the `lor1-0006434` thread dumps), and helixacm pipelines collapsed (CICP-2994).
`deleteRecursively` was jute.maxbuffer-safe but glacially slow. One ZK round-trip per znode means a 13K-znode dropInstance ≈ 40 s pinning a thread, and a million-znode `dropCluster` ≈ tens of minutes. Same thread-exhaustion failure mode, slower onset.

How:

`ZkClient.deleteRecursively`: BFS-collect children-first into an in-memory list, then issue `multi()` batches of `DELETE_RECURSIVE_BATCH_SIZE = 1000` ops (~240 KB packets, well under `jute.maxbuffer`). Roughly 100× speedup at large N. The public contract is preserved: still non-atomic, still no-op on missing path, same `ZkClientException` type, same `ZkClientException → ZkException → KeeperException.NotEmptyException` shape so callers' existing NotEmpty retry loops keep working. NoNode within a batch is tolerated for idempotency.
`ZKHelixAdmin.dropInstance`: drop the previous atomic-delete-of-(instance, config) approach entirely. Two-phase drop:
- Phase 1: delete InstanceConfig first (single op). The instance becomes non-Assignable, so the controller stops generating new state-transition messages while Phase 2 is in flight.
- Phase 2: `_zkClient.deleteRecursively(instancePath)`. Batching is the ZkClient's responsibility.
The pre-condition check is relaxed to allow the resume case where InstanceConfig is already gone but the subtree remains.

Beneficiaries beyond dropInstance: 13 production callers of `deleteRecursively` get the speedup automatically — including `ZKHelixAdmin.dropCluster`, `addCluster` reset, `addStateModelDef` reset, `ZkBaseDataAccessor.deleteRecursively`, `ParticipantManager.cleanup`, the admin-webapp REST resources, and the ZKDumper CLI.

Tests

The following tests are written for this issue:

In `zookeeper-api/.../TestRawZkClient` (4 new):

`testDeleteRecursivelyLargeSubtree` — 2,500 children spanning 3 batches.
`testDeleteRecursivelyDeepSubtree` — depth-3 paths, verifies BFS children-first ordering crosses batch boundaries correctly.
`testDeleteRecursivelyOnMissingPathIsNoOp` — idempotency contract preserved.
`testDeleteRecursivelyIdempotentAfterPartialDelete` — out-of-band child delete tolerated (NoNode within a batch).

In `helix-core/.../TestZkHelixAdmin` (4 new):

`testDropInstanceWithLargeMessageSubtree` — 2,500 messages end-to-end against real ZK.
`testDropInstanceResumesAfterPartialDelete` — config gone, subtree remains → resume succeeds.
`testDropInstanceWithDeepSubtreeShape` — depth≥2 paths under CURRENTSTATES + ERRORS.
`testDropInstanceFailsFastOnNonRetryableError` — directly guards against re-creating the 1,880-stuck-thread incident: non-NotEmpty errors from `deleteRecursively` must NOT trigger the 3-retry loop. Mock verifies `deleteRecursively` invoked exactly once and elapsed < 2 s.

Plus the existing `testZkHelixAdmin` mock migrated from the old `deleteRecursivelyAtomic` boundary to the new `deleteRecursively` boundary (same 3-retry-then-fail semantics).

The following is the result of the "mvn test" command on the appropriate module:

```
mvn -pl zookeeper-api test -Dtest=TestRawZkClient
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

mvn -pl helix-core test -Dtest=TestZkHelixAdmin
[INFO] Tests run: 26, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Changes that Break Backward Compatibility (Optional)

`ZkClient.deleteRecursively` is now ~100× faster for large subtrees. Any caller that depended on the slow O(N) round-trip pacing as implicit rate-limiting will see this. Considered a feature, not a bug — but flagging for awareness.
`ZKHelixAdmin.dropInstance` is no longer atomic across (InstanceConfig, /INSTANCES/{instance}). A crash between Phase 1 and Phase 2 leaves a stale subtree; the next `dropInstance` call resumes cleanup. Callers that depended on "either everything is deleted or nothing is" will see partial state during the (typically sub-second) Phase 2 window.
`dropInstance` no longer throws `"... does not exist in instances for cluster ..."` as a distinct error message. The unified pre-condition error is `"... does not exist in config for cluster ..."` whenever neither InstanceConfig nor the subtree is present.

Documentation (Optional)

N/A. Updated the `deleteRecursively` javadoc to document the new behavior, atomicity contract, and exception shape.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message".

Code Quality

My diff has been formatted using helix-style.xml

Local Review

Local code review completed

The previous deleteRecursively implementation issued one ZK delete per znode sequentially. While jute.maxbuffer-safe (each op is its own packet), it is O(N) in network round-trips: a 13K-message subtree takes ~40s while pinning the calling thread, and a million-znode cluster drop can run for tens of minutes. The companion deleteRecursivelyAtomic was even worse - it bundled the entire subtree into a single multi() call that crossed jute.maxbuffer at ~13K znodes, which surfaced as CONNECTIONLOSS, then the default 24h ZkClient retry pinned the Helix REST Jetty pool until restart. Replace deleteRecursively's recursion+single-op with BFS-collect + batched multi(). Each multi() carries up to DELETE_RECURSIVE_BATCH_SIZE = 1000 ops (~240 KB on the wire, well below the 4 MB jute.maxbuffer default), so this method now works for any subtree size in O(N / batch_size) round-trips. Roughly 100x speedup at large N. Public contract preserved: - Still non-atomic (the javadoc never promised atomicity). - Still returns success on a missing path. - Same exception type (ZkClientException) and same wrapping shape for the racy NotEmpty case (ZkClientException -> ZkException -> KeeperException.NotEmptyException) so callers' existing retry loops continue to work. - NoNode results inside a batch are tolerated for retry idempotency. ZKHelixAdmin.dropInstance: Refactor to use the now-batched deleteRecursively and drop the ad-hoc batched-delete helpers (~80 LoC removed). Keep the two-phase drop semantics: Phase 1: delete InstanceConfig first (instance becomes non-Assignable, controller stops generating state-transition messages). Phase 2: _zkClient.deleteRecursively(instancePath) - batching is now the ZkClient's responsibility. dropInstance's pre-condition is relaxed to allow the resume case where InstanceConfig is already gone but the subtree remains. Tests: zookeeper-api/.../TestRawZkClient (4 new): - testDeleteRecursivelyLargeSubtree: 2,500 children -> spans 3 batches. - testDeleteRecursivelyDeepSubtree: 3 groups x 100 leaves = depth-3 paths, verifies BFS children-first ordering across batch boundaries. - testDeleteRecursivelyOnMissingPathIsNoOp: idempotent on missing path. - testDeleteRecursivelyIdempotentAfterPartialDelete: out-of-band child delete doesn't fail the batch (NoNode tolerated). helix-core/.../TestZkHelixAdmin (4 new): - testDropInstanceWithLargeMessageSubtree: 2,500 messages end-to-end. - testDropInstanceResumesAfterPartialDelete: config gone, subtree remains -> resume succeeds. - testDropInstanceWithDeepSubtreeShape: depth>=2 paths under CURRENTSTATES + ERRORS. - testDropInstanceFailsFastOnNonRetryableError: non-NotEmpty ZkClientException from deleteRecursively must NOT trigger the 3-retry loop; mock verifies invoked exactly once and elapsed < 2s. Directly guards against re-creating the 1880-stuck-thread incident. Existing testZkHelixAdmin's mock migrated from deleteRecursivelyAtomic boundary to the new deleteRecursively boundary (same retry-3-times semantics). All 27 TestRawZkClient and 26 TestZkHelixAdmin pass.

LZD-PratyushBhatt

.

LZD-PratyushBhatt

Nice work! the batching itself is clean and I am aligned with the approach. Also confirmed Phase-1-first ordering is actually doing what your description claims: deleting
/CONFIGS/PARTICIPANT/{instance} drops the instance from _assignableInstanceConfigMap
(BaseControllerDataProvider.java:404-422), which every rebalancer gates on, and also kicks it out of the updateOfflineInstanceHistory iteration (BaseControllerDataProvider.java:1114-1118). So the "controller stops writing while Phase 2 is in flight" story checks out.

Two things bugging me before I'm comfortable shipping it. Both stem from the same thing: dropInstance is no longer atomic across (config, subtree), but nobody told addInstance about it.

1. Stale subtree poisons the next `addInstance`

Phase 1 succeeds, JVM dies, subtree is left full of MESSAGES / CURRENTSTATES / ERRORS. Operator (or auto-join) eventually calls addInstance for the same name. Today this "works": the config existence check at ZKHelixAdmin.java:218 passes (config really is gone), and then createPersistent(..., true) is called for each subdir. The catch: that overload silently eats ZkNodeExistsException (ZkClient.java:612-628). Stale children survive. Only HISTORY gets clobbered (line 260).

The painful thing can be that stale CURRENTSTATES are keyed by old session IDs, and the controller reads them across sessions, so it thinks the freshly-registered participant is still hosting whatever the dead incarnation was hosting.

2. Same-name re-add can race inside the Phase 1 -> Phase 2 window

No lock between drop and add (TODO at line 322 already says as much). With the old atomic delete this was fine; with the new two-phase flow, this interleaving is reachable:

T1: dropInstance Phase 1 done (config gone)
T2: addInstance runs, precondition passes, creates fresh config
    + 7 standard subdirs
T1: Phase 2 deleteRecursively(/INSTANCES/{instance})
    --> wipes T2's freshly created subdirs

You end up with a ghost: config present, subtree gone. Next handleNewSession loops in isInstanceSetup=false (ZKUtil.java:149-179). And this isn't theoretical -- it can fire from REST DELETE racing REST PUT, from ParticipantDeregistrationStage racing participant auto-join (ParticipantManager.java:245), and from purgeOfflineInstances racing auto-join (the purgeInstance path at line 341-344 doesn't even check LiveInstance).

Suggested fix: one extra precondition in `addInstance`

Both of these collapse into the same root cause, addInstance assumes (config absent) => (subtree absent), and the PR just made that no longer true. Fix it on the observer side, not the mutator side:

// after the existing config-exists check at ZKHelixAdmin.java:218
String instancePath = PropertyPathBuilder.instance(clusterName, nodeId);
if (_zkClient.exists(instancePath)) {
  throw new HelixException("Node " + nodeId + " has a residual subtree at "
      + instancePath + " without an InstanceConfig. Indicates a partial or "
      + "in-flight dropInstance. Run dropInstance to complete cleanup, then retry.");
}

This fixes both:

Issue 1: stale-after-crash now throws loudly. Recovery is one call to dropInstance (relaxed precondition you already added resumes Phase 2 cleanly), then addInstance succeeds.
Issue 2: every racy interleaving where Phase 2 hasn't finished yet leaves the subtree visible to T2's check, so T2 throws and the caller retries. T2 can only succeed once T1 is fully done. Serialization at the observer.

LZD-PratyushBhatt · 2026-05-27T12:15:40Z

   */
  public void deleteRecursively(String path) throws ZkClientException {
-    List<String> children;
+    if (!exists(path)) {


nit: checked the zk side code of this, The if (!exists(path)) return; here is an extra ZK round-trip for a case collectSubtreeChildrenFirst already handles via its ZkNoNodeException catch on the initial getChildren. On the no-op path you pay 2 round-trips instead of 1.

laxman-ch marked this pull request as ready for review May 7, 2026 16:15

laxman-ch requested review from LZD-PratyushBhatt, PranaviAncha, arkmish, kabragaurav, ngngwr, proud-parselmouth and thestreak101 as code owners May 7, 2026 16:15

LZD-PratyushBhatt requested changes May 27, 2026

View reviewed changes

LZD-PratyushBhatt reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173

Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-delete-recursively

laxman-ch commented May 7, 2026

Uh oh!

LZD-PratyushBhatt left a comment •

edited

Loading

Uh oh!

LZD-PratyushBhatt left a comment •

edited

Loading

Uh oh!

LZD-PratyushBhatt May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

laxman-ch commented May 7, 2026

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Local Review

Uh oh!

LZD-PratyushBhatt left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

1. Stale subtree poisons the next addInstance

2. Same-name re-add can race inside the Phase 1 -> Phase 2 window

Suggested fix: one extra precondition in addInstance

Uh oh!

LZD-PratyushBhatt May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LZD-PratyushBhatt left a comment •

edited

Loading

LZD-PratyushBhatt left a comment •

edited

Loading

1. Stale subtree poisons the next `addInstance`

Suggested fix: one extra precondition in `addInstance`