
Add guard rails to PerInstanceAccessor for instance deletion and capacity updates #163

Open
Chr0nicl3 wants to merge 1 commit into dev from ktripath/add-instance-guard-rails

Collaborator

Chr0nicl3 commented Apr 14, 2026

Summary

  • Add min-active-replica guard rail to deleteInstance() that validates dropping an instance won't violate minActiveReplicas constraints on any hosted partition
  • Add capacity constraint guard rail to updateInstanceConfig() that validates INSTANCE_CAPACITY_MAP changes don't break WAGED rebalancer constraints
  • Both guard rails support ?force=true query parameter to allow operators to bypass validation
  • Change IllegalArgumentException handling in updateInstanceConfig() from 500 Server Error to 400 Bad Request, since these are client input validation errors

Description

Problem

Two operations in PerInstanceAccessor can cause rebalance failures with no pre-validation:

  1. deleteInstance() directly calls admin.dropInstance() with only a liveness check. If the instance hosts partitions where sibling replicas are already at minActiveReplicas, dropping it violates the constraint and causes rebalance failures.

  2. updateInstanceConfig() only validates topology settings via validateDeltaTopologySettingInInstanceConfig(). When the INSTANCE_CAPACITY_MAP is reduced (e.g., removing a required capacity key), the WAGED rebalancer fails on its next cycle because required capacity keys are missing.

Solution

Guard Rail 1: deleteInstance() — Min-Active-Replica Check

Before calling admin.dropInstance(), we call InstanceValidationUtil.siblingNodesActiveReplicaCheckWithDetails() to verify that dropping this instance won't violate min-active-replica constraints for any hosted partition.

How it works:

  • Iterates all resources in the cluster (skipping disabled/invalid/task resources)
  • For each partition hosted by the target instance, counts healthy siblings (not in DROPPED/ERROR/OFFLINE states)
  • If numHealthySiblings < minActiveReplicas for any partition, returns 400 Bad Request with the specific resource name, partition name, current active replica count, and required minimum
  • If the instance itself is already in an unhealthy state (OFFLINE, ERROR, DROPPED) in the ExternalView, the check passes — removing an already-unhealthy replica doesn't reduce the active count
  • Resources without minActiveReplicas configured (returns -1) are skipped

Edge cases handled:

  • Instance not in any ExternalView (e.g., already drained) → check passes
  • Instance in unhealthy state in ExternalView → check passes (removing it doesn't hurt)
  • Instance config not found → returns 404 Not Found
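
The per-partition decision described above can be sketched in plain Java. This is an illustration only, not the code in InstanceValidationUtil: the class name MinActiveReplicaSketch, the method checkPartition, and the use of a simple instance-to-state map in place of an ExternalView are all assumptions made for this example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for the guard rail's per-partition decision.
// It does not use any Helix classes.
final class MinActiveReplicaSketch {
  // States treated as unhealthy, as listed in the description.
  private static final Set<String> UNHEALTHY = Set.of("DROPPED", "ERROR", "OFFLINE");

  /**
   * @param stateMap instance name -> current state for one partition
   * @param instance the instance about to be dropped
   * @param minActiveReplicas configured minimum, or -1 when not configured
   * @return a failure message if the drop should be blocked, otherwise empty
   */
  static Optional<String> checkPartition(String resource, String partition,
      Map<String, String> stateMap, String instance, int minActiveReplicas) {
    // Resources without minActiveReplicas configured are skipped.
    if (minActiveReplicas < 0) {
      return Optional.empty();
    }
    // Instance not hosting the partition, or already unhealthy: the check passes.
    String ownState = stateMap.get(instance);
    if (ownState == null || UNHEALTHY.contains(ownState)) {
      return Optional.empty();
    }
    // Count siblings (every other instance) that are in a healthy state.
    long healthySiblings = stateMap.entrySet().stream()
        .filter(e -> !e.getKey().equals(instance))
        .filter(e -> !UNHEALTHY.contains(e.getValue()))
        .count();
    if (healthySiblings < minActiveReplicas) {
      return Optional.of(String.format(
          "MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource %s partition %s has %d/%d active replicas.",
          resource, partition, healthySiblings, minActiveReplicas));
    }
    return Optional.empty();
  }
}
```

In this sketch, a partition with two healthy siblings and minActiveReplicas = 3 yields a failure message, while the same partition with minActiveReplicas = 2 passes.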

Example error response when blocked:

{
  "error": "Cannot drop instance host123: MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource MyDB partition MyDB_7 has 2/3 active replicas. Use force=true query param to override."
}

Guard Rail 2: updateInstanceConfig() — Capacity Constraint Validation

New private method validateInstanceCapacityChange() validates that capacity map changes don't break WAGED rebalancer constraints.

How it works:

  1. Checks if the update payload contains INSTANCE_CAPACITY_MAP changes — if not, returns early (no-op for non-capacity updates)
  2. Compares new capacity values against current values — if no capacity key is being reduced, returns early (adding keys or increasing values is always safe)
  3. Simulates the merged config (existing config + update) using the same ZNRecord.update() merge logic that configAccessor.updateInstanceConfig() uses
  4. Calls WagedValidationUtil.validateAndGetInstanceCapacity(clusterConfig, mergedConfig) to verify all required capacity keys from ClusterConfig.getInstanceCapacityKeys() are present
  5. If any required key is missing after the merge, throws IllegalArgumentException → returns 400 Bad Request

Example failure scenario:

  • Cluster requires capacity keys [cpu, memory, disk]
  • Cluster defaults provide {cpu: 100, memory: 256} (no default for disk)
  • Instance has {disk: 500} (providing the missing disk key)
  • Operator updates instance to remove disk → merged capacity becomes {cpu: 100, memory: 256}, missing disk → blocked
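
The validation steps and the failure scenario can be traced with a small self-contained sketch. It uses plain maps rather than ClusterConfig, InstanceConfig, or ZNRecord, and it assumes the new capacity map replaces the instance's existing map wholesale, which is a simplification and may differ from what ZNRecord.update() does. The names CapacityChangeSketch and validate are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the capacity guard rail; no Helix classes involved.
final class CapacityChangeSketch {
  /**
   * @param requiredKeys capacity keys the cluster requires
   * @param clusterDefaults cluster-level default capacities
   * @param currentInstance the instance's current capacity map
   * @param newInstance the capacity map in the update payload, or null if absent
   * @throws IllegalArgumentException if a required key would be missing
   */
  static void validate(List<String> requiredKeys, Map<String, Integer> clusterDefaults,
      Map<String, Integer> currentInstance, Map<String, Integer> newInstance) {
    // Step 1: the update does not touch the capacity map.
    if (newInstance == null) {
      return;
    }
    // Step 2: a reduction is a removed key or a lowered value.
    boolean reduced = currentInstance.entrySet().stream().anyMatch(e -> {
      Integer updated = newInstance.get(e.getKey());
      return updated == null || updated < e.getValue();
    });
    if (!reduced) {
      return;
    }
    // Steps 3-4: effective capacity is the cluster defaults overlaid by the
    // instance map (assumed here to be replaced by the new map).
    Map<String, Integer> effective = new HashMap<>(clusterDefaults);
    effective.putAll(newInstance);
    // Step 5: every required key must still be present.
    for (String key : requiredKeys) {
      if (!effective.containsKey(key)) {
        throw new IllegalArgumentException("Missing required capacity key: " + key);
      }
    }
  }
}
```

With required keys [cpu, memory, disk], defaults {cpu: 100, memory: 256}, and a current instance map of {disk: 500}, passing an empty new map throws for disk, while raising disk to 600 returns without error.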

?force=true Bypass

Both guard rails accept a force query param (default: false). When force=true, all validation is skipped and the operation proceeds directly. This lets operators explicitly accept the risk when they know the operation is safe.

Additional Fix: IllegalArgumentException HTTP Status

Changed the IllegalArgumentException catch block in updateInstanceConfig() from returning 500 Server Error (serverError(ex)) to 400 Bad Request (badRequest(ex.getMessage())). These exceptions represent client input validation failures (bad topology settings, bad capacity config), not server errors.
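
The combined effect of the force flag and the new exception mapping can be summarized in a short sketch. This is not the accessor's code: UpdateHandlerSketch and handle are hypothetical names, the validation and update steps are passed in as plain Runnables, and mapping any other runtime exception to 500 is an assumption made for illustration.

```java
// Hypothetical control-flow sketch; returns an HTTP status code as an int.
final class UpdateHandlerSketch {
  static int handle(boolean force, Runnable validation, Runnable update) {
    try {
      // force=true skips all validation and proceeds directly.
      if (!force) {
        validation.run();
      }
      update.run();
      return 200;
    } catch (IllegalArgumentException ex) {
      // Client input validation failure: 400 Bad Request (previously 500).
      return 400;
    } catch (RuntimeException ex) {
      // Assumed: anything else is still reported as a server error.
      return 500;
    }
  }
}
```

A failing validation therefore produces 400 without force and 200 with force, since the validation never runs in the second case.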

Behavioral Changes Summary

| Scenario | Before | After |
| --- | --- | --- |
| DELETE /instances/{name} with partitions at minActiveReplicas | 200 OK, rebalance failure later | 400 with resource/partition/replica details |
| DELETE /instances/{name}?force=true (same scenario) | N/A | 200 OK (operator accepted risk) |
| DELETE /instances/{name} with no active partitions | 200 OK | 200 OK (check passes, same behavior) |
| POST /instances/{name}/configs removing a required capacity key | 200 OK, WAGED rebalancer crashes next cycle | 400 with missing key details |
| POST /instances/{name}/configs?force=true removing capacity key | N/A | 200 OK (operator accepted risk) |
| POST /instances/{name}/configs updating non-capacity field | 200 OK | 200 OK (no change in behavior) |
| POST /instances/{name}/configs with invalid topology | 500 Server Error | 400 Bad Request |

Tests

The following tests are added in TestPerInstanceAccessor.java:

deleteInstance guard rail tests:

  • testDeleteInstanceGuardRailPassesForUnassignedInstance — Verifies that deleting an instance with no ExternalView assignments passes the guard rail
  • testDeleteInstanceGuardRailBlocksWhenMinActiveReplicaViolated — Injects an active replica entry into ExternalView, verifies the delete is blocked with MIN_ACTIVE_REPLICA_CHECK_FAILED error details
  • testDeleteInstanceGuardRailBypassWithForce — Verifies ?force=true bypasses the guard rail

updateInstanceConfig capacity guard rail tests:

  • testUpdateInstanceConfigCapacityReductionBlocked — Sets up required capacity keys in ClusterConfig, sets instance capacity, then attempts to reduce capacity removing a required key. Verifies 400 response.
  • testUpdateInstanceConfigCapacityReductionForceBypass — Same setup but with ?force=true, verifies 200 response
  • testUpdateInstanceConfigNonCapacityChangePasses — Updates a non-capacity field, verifies no capacity validation is triggered

Changes that Break Backward Compatibility

  • deleteInstance() now returns 400 instead of 200 when dropping an instance would violate min-active-replica constraints. Callers that previously relied on unconditional 200 responses should add ?force=true if they want to preserve the old behavior.
  • updateInstanceConfig() now returns 400 instead of 200 when capacity map changes would break WAGED rebalancer constraints. Same ?force=true bypass available.
  • updateInstanceConfig() now returns 400 instead of 500 for IllegalArgumentException (topology validation failures). This is a fix, not a break — clients handling 500 for these errors should update to handle 400.

…city updates

Guard rails prevent known failure scenarios where instance operations cause
rebalance failures:

1. deleteInstance(): Run siblingNodesActiveReplicaCheckWithDetails() before
   dropping an instance to verify min-active-replica constraints won't be
   violated. Returns 400 with specific resource/partition/replica details
   when blocked.

2. updateInstanceConfig(): Validate INSTANCE_CAPACITY_MAP changes don't
   break WAGED rebalancer constraints by simulating the merged config and
   checking required capacity keys remain present.

Both guard rails support ?force=true query parameter to allow operators
to bypass validation when they are certain the operation is safe.

Also changed IllegalArgumentException handling in updateInstanceConfig()
from 500 to 400 status code since these are client input validation errors.

CICP-3788

Made-with: Cursor