Skip to content

Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations#171

Draft
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-drop-instance
Draft

Batch ZKHelixAdmin.dropInstance to avoid jute.maxbuffer violations#171
laxman-ch wants to merge 1 commit into
devfrom
vchekka/batched-drop-instance

Conversation

@laxman-ch
Copy link
Copy Markdown
Collaborator

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Internal LinkedIn ticket: CICP-2994 (HelixACM pipeline crashes from helix-rest call timeouts).

Description

  • Here are some details about my PR, including screenshots of any UI changes:

What: Replace the single atomic deleteRecursivelyAtomic([instance, config]) call in ZKHelixAdmin.dropInstance with a two-phase batched approach.

Why: Instances that accumulated large MESSAGES queues (~13K) caused deleteRecursivelyAtomic to bundle every descendant znode into one ZooKeeper multi() packet ~3.75 MB in size. Crossing the 4 MB jute.maxbuffer limit, ZK rejected the packet as CONNECTIONLOSS. The default 24-hour ZkClient retry timeout pinned the Helix REST Jetty thread pool, taking down the embedded REST endpoint until the service was restarted (1,880 threads observed stuck across thread dumps from lor1-0006434).

How:

  • Phase 1: delete /CONFIGS/PARTICIPANT/{instance} first (single op). Once gone the instance is non-Assignable, so the controller stops generating new state-transition messages while the subtree delete is in flight.
  • Phase 2: BFS-walk the /INSTANCES/{instance} subtree children-first, then delete in multi() batches of DROP_INSTANCE_DELETE_BATCH_SIZE = 1000 ops. Each on-wire packet is ~240 KB, well under jute.maxbuffer.
  • NoNode results inside a batch are tolerated, keeping retries idempotent.
  • NotEmpty results bubble up wrapped as ZkClientException -> ZkException -> KeeperException.NotEmptyException, the same shape the legacy code threw, so the existing 3-retry loop in dropInstancePathsRecursively continues to handle the racy ParticipantHistory write case.
  • dropInstance's pre-condition is relaxed: it now succeeds when either InstanceConfig or the subtree exists, so a retry after a partial drop can complete cleanup instead of erroring on "does not exist in config".

Tests

  • The following tests are written for this issue:

In helix-core/src/test/java/org/apache/helix/manager/zk/TestZkHelixAdmin.java:

  • testZkHelixAdmin (existing, updated): mock-based 3-retry path migrated to OpResult.ErrorResult NOTEMPTY shape.

  • testDropInstance (existing): regression baseline.

  • testDropInstanceWithLargeMessageSubtree: 2,500 messages spanning multiple batches.

  • testDropInstanceResumesAfterPartialDelete: config gone, subtree remains -> resume succeeds.

  • testDropInstanceWithDeepSubtreeShape: depth>=2 paths under CURRENTSTATES + ERRORS, verifies BFS children-first ordering.

  • testDropInstanceFitsInSingleBatch: small subtree, single-batch loop iteration.

  • testDropInstanceFailsFastOnNonRetryableMultiError: non-NotEmpty OpResult error fails fast (multi() invoked exactly once, elapsed < 2s) - directly guards against re-creating the 1,880-stuck-thread incident.

  • testDropInstanceFailsFastWhenMultiThrows: thrown exception fails fast (no spurious retries).

  • The following is the result of the "mvn test" command on the appropriate module:

```
mvn -pl helix-core test -Dtest=TestZkHelixAdmin
...
[INFO] Tests run: 28, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Changes that Break Backward Compatibility (Optional)

  • ZKHelixAdmin.dropInstance is no longer atomic across (InstanceConfig, /INSTANCES/{instance}). A crash between Phase 1 and Phase 2 leaves a stale /INSTANCES/{instance} subtree; the next dropInstance call resumes cleanup. Callers that depended on "either everything is deleted or nothing is" will see partial state during the (typically sub-second) window of Phase 2.
  • dropInstance no longer throws \"... does not exist in instances for cluster ...\" as a distinct error. The unified pre-condition error message is \"... does not exist in config for cluster ...\" whenever neither InstanceConfig nor the subtree is present.

Documentation (Optional)

  • N/A

Commits

Code Quality

  • My diff has been formatted using helix-style.xml

Local Review

  • Local code review completed

The previous deleteRecursivelyAtomic([instance, config]) bundled every
descendant znode into a single ZooKeeper multi() packet. Instances that
accumulated large MESSAGES queues (~13K) produced ~3.75 MB packets,
crossing the 4 MB jute.maxbuffer limit. ZK rejected these as
CONNECTIONLOSS, and the default 24 h ZkClient retry timeout pinned the
Helix REST Jetty thread pool until the service was restarted.

Replace with a two-phase drop:
  Phase 1: delete InstanceConfig first (single-op). Makes the instance
    non-Assignable so the controller stops generating new state-
    transition messages while the subtree delete is in flight.
  Phase 2: BFS-walk /INSTANCES/{instance} children-first and delete in
    multi() batches of 1000 ops (~240 KB packets, well below
    jute.maxbuffer). NoNode results are tolerated for retry idempotency;
    NotEmpty results bubble up in the legacy exception shape so the
    existing 3-retry loop continues to handle the racy
    ParticipantHistory write case.

Trade-offs:
  - Drop is no longer atomic. A crash between phases leaves a stale
    /INSTANCES/{instance} subtree; the next dropInstance call resumes
    cleanup. dropInstance's existence check is relaxed to allow the
    resume case where InstanceConfig is already gone.
  - Subtree size no longer affects packet size; works for any instance.

Tests (helix-core/.../TestZkHelixAdmin):
  - testZkHelixAdmin: existing mock-based 3-retry path migrated to
    OpResult.ErrorResult NOTEMPTY shape.
  - testDropInstanceWithLargeMessageSubtree: 2,500 messages spanning
    multiple batches.
  - testDropInstanceResumesAfterPartialDelete: config gone, subtree
    remains -> resume succeeds.
  - testDropInstanceWithDeepSubtreeShape: depth>=2 paths under
    CURRENTSTATES + ERRORS, verifies BFS children-first ordering.
  - testDropInstanceFitsInSingleBatch: small subtree, single-batch
    loop iteration.
  - testDropInstanceFailsFastOnNonRetryableMultiError: non-NotEmpty
    OpResult error fails fast (multi() invoked exactly once,
    elapsed < 2s) - directly guards against re-creating the
    1880-stuck-thread incident.
  - testDropInstanceFailsFastWhenMultiThrows: thrown exception fails
    fast (no spurious retries).

Full TestZkHelixAdmin (28 tests) passes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant