Skip to content

Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173

Open
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-delete-recursively
Open

Batch ZkClient.deleteRecursively to avoid jute.maxbuffer pinning#173
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-delete-recursively

Conversation

@laxman-ch
Copy link
Copy Markdown
Collaborator

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Internal LinkedIn ticket: CICP-2994. Companion architectural fix to PR #171 (which is the contained dropInstance-only fix). Either can land; this one is preferred.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

What: Replace `ZkClient.deleteRecursively`'s single-op recursion with BFS-collect + batched `multi()`. Refactor `ZKHelixAdmin.dropInstance` to consume it.

Why: Two interlocking problems with the old recursive-delete machinery in helix-zookeeper-api:

  1. `deleteRecursivelyAtomic` exceeds `jute.maxbuffer`. It bundles every descendant znode into one `multi()`. On instances that accumulated ~13K MESSAGES, the packet was ~3.75 MB and ZK rejected it as `CONNECTIONLOSS`. The default 24h `ZkClient` retry timeout pinned 1,880 Helix REST Jetty threads (observed in the `lor1-0006434` thread dumps), and helixacm pipelines collapsed (CICP-2994).
  2. `deleteRecursively` was jute.maxbuffer-safe but glacially slow. One ZK round-trip per znode means a 13K-znode dropInstance ≈ 40 s pinning a thread, and a million-znode `dropCluster` ≈ tens of minutes. Same thread-exhaustion failure mode, slower onset.

How:

  • `ZkClient.deleteRecursively`: BFS-collect children-first into an in-memory list, then issue `multi()` batches of `DELETE_RECURSIVE_BATCH_SIZE = 1000` ops (~240 KB packets, well under `jute.maxbuffer`). Roughly 100× speedup at large N. The public contract is preserved: still non-atomic, still no-op on missing path, same `ZkClientException` type, same `ZkClientException → ZkException → KeeperException.NotEmptyException` shape so callers' existing NotEmpty retry loops keep working. NoNode within a batch is tolerated for idempotency.

  • `ZKHelixAdmin.dropInstance`: drop the previous atomic-delete-of-(instance, config) approach entirely. Two-phase drop:

    • Phase 1: delete InstanceConfig first (single op). The instance becomes non-Assignable, so the controller stops generating new state-transition messages while Phase 2 is in flight.
    • Phase 2: `_zkClient.deleteRecursively(instancePath)`. Batching is the ZkClient's responsibility.

    The pre-condition check is relaxed to allow the resume case where InstanceConfig is already gone but the subtree remains.

Beneficiaries beyond dropInstance: 13 production callers of `deleteRecursively` get the speedup automatically — including `ZKHelixAdmin.dropCluster`, `addCluster` reset, `addStateModelDef` reset, `ZkBaseDataAccessor.deleteRecursively`, `ParticipantManager.cleanup`, the admin-webapp REST resources, and the ZKDumper CLI.

Tests

  • The following tests are written for this issue:

In `zookeeper-api/.../TestRawZkClient` (4 new):

  • `testDeleteRecursivelyLargeSubtree` — 2,500 children spanning 3 batches.
  • `testDeleteRecursivelyDeepSubtree` — depth-3 paths, verifies BFS children-first ordering crosses batch boundaries correctly.
  • `testDeleteRecursivelyOnMissingPathIsNoOp` — idempotency contract preserved.
  • `testDeleteRecursivelyIdempotentAfterPartialDelete` — out-of-band child delete tolerated (NoNode within a batch).

In `helix-core/.../TestZkHelixAdmin` (4 new):

  • `testDropInstanceWithLargeMessageSubtree` — 2,500 messages end-to-end against real ZK.
  • `testDropInstanceResumesAfterPartialDelete` — config gone, subtree remains → resume succeeds.
  • `testDropInstanceWithDeepSubtreeShape` — depth≥2 paths under CURRENTSTATES + ERRORS.
  • `testDropInstanceFailsFastOnNonRetryableError` — directly guards against re-creating the 1,880-stuck-thread incident: non-NotEmpty errors from `deleteRecursively` must NOT trigger the 3-retry loop. Mock verifies `deleteRecursively` invoked exactly once and elapsed < 2 s.

Plus the existing `testZkHelixAdmin` mock migrated from the old `deleteRecursivelyAtomic` boundary to the new `deleteRecursively` boundary (same 3-retry-then-fail semantics).

  • The following is the result of the "mvn test" command on the appropriate module:

```
mvn -pl zookeeper-api test -Dtest=TestRawZkClient
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

mvn -pl helix-core test -Dtest=TestZkHelixAdmin
[INFO] Tests run: 26, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Changes that Break Backward Compatibility (Optional)

  • `ZkClient.deleteRecursively` is now ~100× faster for large subtrees. Any caller that depended on the slow O(N) round-trip pacing as implicit rate-limiting will see this. Considered a feature, not a bug — but flagging for awareness.
  • `ZKHelixAdmin.dropInstance` is no longer atomic across (InstanceConfig, /INSTANCES/{instance}). A crash between Phase 1 and Phase 2 leaves a stale subtree; the next `dropInstance` call resumes cleanup. Callers that depended on "either everything is deleted or nothing is" will see partial state during the (typically sub-second) Phase 2 window.
  • `dropInstance` no longer throws `"... does not exist in instances for cluster ..."` as a distinct error message. The unified pre-condition error is `"... does not exist in config for cluster ..."` whenever neither InstanceConfig nor the subtree is present.

Documentation (Optional)

  • N/A. Updated the `deleteRecursively` javadoc to document the new behavior, atomicity contract, and exception shape.

Commits

Code Quality

  • My diff has been formatted using helix-style.xml

Local Review

  • Local code review completed

The previous deleteRecursively implementation issued one ZK delete per
znode sequentially. While jute.maxbuffer-safe (each op is its own
packet), it is O(N) in network round-trips: a 13K-message subtree
takes ~40s while pinning the calling thread, and a million-znode
cluster drop can run for tens of minutes. The companion
deleteRecursivelyAtomic was even worse - it bundled the entire
subtree into a single multi() call that crossed jute.maxbuffer at
~13K znodes, which surfaced as CONNECTIONLOSS, then the default 24h
ZkClient retry pinned the Helix REST Jetty pool until restart.

Replace deleteRecursively's recursion+single-op with BFS-collect +
batched multi(). Each multi() carries up to
DELETE_RECURSIVE_BATCH_SIZE = 1000 ops (~240 KB on the wire, well
below the 4 MB jute.maxbuffer default), so this method now works for
any subtree size in O(N / batch_size) round-trips. Roughly 100x
speedup at large N.

Public contract preserved:
  - Still non-atomic (the javadoc never promised atomicity).
  - Still returns success on a missing path.
  - Same exception type (ZkClientException) and same wrapping shape
    for the racy NotEmpty case (ZkClientException -> ZkException ->
    KeeperException.NotEmptyException) so callers' existing retry
    loops continue to work.
  - NoNode results inside a batch are tolerated for retry idempotency.

ZKHelixAdmin.dropInstance:
  Refactor to use the now-batched deleteRecursively and drop the
  ad-hoc batched-delete helpers (~80 LoC removed). Keep the two-phase
  drop semantics:
    Phase 1: delete InstanceConfig first (instance becomes
      non-Assignable, controller stops generating state-transition
      messages).
    Phase 2: _zkClient.deleteRecursively(instancePath) - batching is
      now the ZkClient's responsibility.
  dropInstance's pre-condition is relaxed to allow the resume case
  where InstanceConfig is already gone but the subtree remains.

Tests:
  zookeeper-api/.../TestRawZkClient (4 new):
    - testDeleteRecursivelyLargeSubtree: 2,500 children -> spans 3
      batches.
    - testDeleteRecursivelyDeepSubtree: 3 groups x 100 leaves =
      depth-3 paths, verifies BFS children-first ordering across
      batch boundaries.
    - testDeleteRecursivelyOnMissingPathIsNoOp: idempotent on
      missing path.
    - testDeleteRecursivelyIdempotentAfterPartialDelete: out-of-band
      child delete doesn't fail the batch (NoNode tolerated).
  helix-core/.../TestZkHelixAdmin (4 new):
    - testDropInstanceWithLargeMessageSubtree: 2,500 messages
      end-to-end.
    - testDropInstanceResumesAfterPartialDelete: config gone,
      subtree remains -> resume succeeds.
    - testDropInstanceWithDeepSubtreeShape: depth>=2 paths under
      CURRENTSTATES + ERRORS.
    - testDropInstanceFailsFastOnNonRetryableError: non-NotEmpty
      ZkClientException from deleteRecursively must NOT trigger the
      3-retry loop; mock verifies invoked exactly once and elapsed
      < 2s. Directly guards against re-creating the 1880-stuck-thread
      incident.
  Existing testZkHelixAdmin's mock migrated from
  deleteRecursivelyAtomic boundary to the new deleteRecursively
  boundary (same retry-3-times semantics).

All 27 TestRawZkClient and 26 TestZkHelixAdmin pass.
Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! the batching itself is clean and I am aligned with the approach. Also confirmed Phase-1-first ordering is actually doing what your description claims: deleting
/CONFIGS/PARTICIPANT/{instance} drops the instance from _assignableInstanceConfigMap
(BaseControllerDataProvider.java:404-422), which every rebalancer gates on, and also kicks it out of the updateOfflineInstanceHistory iteration (BaseControllerDataProvider.java:1114-1118). So the "controller stops writing while Phase 2 is in flight" story checks out.

Two things bugging me before I'm comfortable shipping it. Both stem from the same thing: dropInstance is no longer atomic across (config, subtree), but nobody told addInstance about it.

1. Stale subtree poisons the next addInstance

Phase 1 succeeds, JVM dies, subtree is left full of MESSAGES / CURRENTSTATES / ERRORS. Operator (or auto-join) eventually calls addInstance for the same name. Today this "works": the config existence check at ZKHelixAdmin.java:218 passes (config really is gone), and then createPersistent(..., true) is called for each subdir. The catch: that overload silently eats ZkNodeExistsException (ZkClient.java:612-628). Stale children survive. Only HISTORY gets clobbered (line 260).

The painful thing can be that stale CURRENTSTATES are keyed by old session IDs, and the controller reads them across sessions, so it thinks the freshly-registered participant is still hosting whatever the dead incarnation was hosting.

2. Same-name re-add can race inside the Phase 1 -> Phase 2 window

No lock between drop and add (TODO at line 322 already says as much). With the old atomic delete this was fine; with the new two-phase flow, this interleaving is reachable:

T1: dropInstance Phase 1 done (config gone)
T2: addInstance runs, precondition passes, creates fresh config
    + 7 standard subdirs
T1: Phase 2 deleteRecursively(/INSTANCES/{instance})
    --> wipes T2's freshly created subdirs

You end up with a ghost: config present, subtree gone. Next handleNewSession loops in isInstanceSetup=false (ZKUtil.java:149-179). And this isn't theoretical -- it can fire from REST DELETE racing REST PUT, from ParticipantDeregistrationStage racing participant auto-join (ParticipantManager.java:245), and from purgeOfflineInstances racing auto-join (the purgeInstance path at line 341-344 doesn't even check LiveInstance).

Suggested fix: one extra precondition in addInstance

Both of these collapse into the same root cause, addInstance assumes (config absent) => (subtree absent), and the PR just made that no longer true. Fix it on the observer side, not the mutator side:

// after the existing config-exists check at ZKHelixAdmin.java:218
String instancePath = PropertyPathBuilder.instance(clusterName, nodeId);
if (_zkClient.exists(instancePath)) {
  throw new HelixException("Node " + nodeId + " has a residual subtree at "
      + instancePath + " without an InstanceConfig. Indicates a partial or "
      + "in-flight dropInstance. Run dropInstance to complete cleanup, then retry.");
}

This fixes both:

  • Issue 1: stale-after-crash now throws loudly. Recovery is one call to dropInstance (relaxed precondition you already added resumes Phase 2 cleanly), then addInstance succeeds.
  • Issue 2: every racy interleaving where Phase 2 hasn't finished yet leaves the subtree visible to T2's check, so T2 throws and the caller retries. T2 can only succeed once T1 is fully done. Serialization at the observer.

*/
public void deleteRecursively(String path) throws ZkClientException {
List<String> children;
if (!exists(path)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: checked the zk side code of this, The if (!exists(path)) return; here is an extra ZK round-trip for a case collectSubtreeChildrenFirst already handles via its ZkNoNodeException catch on the initial getChildren. On the no-op path you pay 2 round-trips instead of 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants