Add guard rails to PerInstanceAccessor for instance deletion and capacity updates #163
Open
Chr0nicl3 wants to merge 1 commit into
Open
Add guard rails to PerInstanceAccessor for instance deletion and capacity updates #163Chr0nicl3 wants to merge 1 commit into
Chr0nicl3 wants to merge 1 commit into
Conversation
…city updates Guard rails prevent known failure scenarios where instance operations cause rebalance failures: 1. deleteInstance(): Run siblingNodesActiveReplicaCheckWithDetails() before dropping an instance to verify min-active-replica constraints won't be violated. Returns 400 with specific resource/partition/replica details when blocked. 2. updateInstanceConfig(): Validate INSTANCE_CAPACITY_MAP changes don't break WAGED rebalancer constraints by simulating the merged config and checking required capacity keys remain present. Both guard rails support ?force=true query parameter to allow operators to bypass validation when they are certain the operation is safe. Also changed IllegalArgumentException handling in updateInstanceConfig() from 500 to 400 status code since these are client input validation errors. CICP-3788 Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
deleteInstance()that validates dropping an instance won't violateminActiveReplicasconstraints on any hosted partitionupdateInstanceConfig()that validatesINSTANCE_CAPACITY_MAPchanges don't break WAGED rebalancer constraints?force=truequery parameter to allow operators to bypass validationIllegalArgumentExceptionhandling inupdateInstanceConfig()from500 Server Errorto400 Bad Requestsince these are client input validation errorsDescription
Problem
Two operations in
PerInstanceAccessorcan cause rebalance failures with no pre-validation:deleteInstance()directly callsadmin.dropInstance()with only a liveness check. If the instance hosts partitions where sibling replicas are already atminActiveReplicas, dropping it violates the constraint and causes rebalance failures.updateInstanceConfig()only validates topology settings viavalidateDeltaTopologySettingInInstanceConfig(). When theINSTANCE_CAPACITY_MAPis reduced (e.g., removing a required capacity key), the WAGED rebalancer fails on its next cycle because required capacity keys are missing.Solution
Guard Rail 1:
deleteInstance()— Min-Active-Replica CheckBefore calling
admin.dropInstance(), we callInstanceValidationUtil.siblingNodesActiveReplicaCheckWithDetails()to verify that dropping this instance won't violate min-active-replica constraints for any hosted partition.How it works:
numHealthySiblings < minActiveReplicasfor any partition, returns400 Bad Requestwith the specific resource name, partition name, current active replica count, and required minimumminActiveReplicasconfigured (returns -1) are skippedEdge cases handled:
404 Not FoundExample error response when blocked:
{ "error": "Cannot drop instance host123: MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource MyDB partition MyDB_7 has 2/3 active replicas. Use force=true query param to override." }Guard Rail 2:
updateInstanceConfig()— Capacity Constraint ValidationNew private method
validateInstanceCapacityChange()validates that capacity map changes don't break WAGED rebalancer constraints.How it works:
INSTANCE_CAPACITY_MAPchanges — if not, returns early (no-op for non-capacity updates)ZNRecord.update()merge logic thatconfigAccessor.updateInstanceConfig()usesWagedValidationUtil.validateAndGetInstanceCapacity(clusterConfig, mergedConfig)to verify all required capacity keys fromClusterConfig.getInstanceCapacityKeys()are presentIllegalArgumentException→ returns400 Bad RequestExample failure scenario:
[cpu, memory, disk]{cpu: 100, memory: 256}(no default fordisk){disk: 500}(providing the missingdiskkey)disk→ merged capacity becomes{cpu: 100, memory: 256}, missingdisk→ blocked?force=trueBypassBoth guard rails accept a
forcequery param (default:false). Whenforce=true, all validation is skipped and the operation proceeds directly. This allows operators to explicitly accept risk for known-safe operations.Additional Fix: IllegalArgumentException HTTP Status
Changed the
IllegalArgumentExceptioncatch block inupdateInstanceConfig()from returning500 Server Error(serverError(ex)) to400 Bad Request(badRequest(ex.getMessage())). These exceptions represent client input validation failures (bad topology settings, bad capacity config), not server errors.Behavioral Changes Summary
DELETE /instances/{name}with partitions at minActiveReplicasDELETE /instances/{name}?force=true(same scenario)DELETE /instances/{name}with no active partitionsPOST /instances/{name}/configsremoving a required capacity keyPOST /instances/{name}/configs?force=trueremoving capacity keyPOST /instances/{name}/configsupdating non-capacity fieldPOST /instances/{name}/configswith invalid topologyTests
The following tests are added in
TestPerInstanceAccessor.java:deleteInstance guard rail tests:
testDeleteInstanceGuardRailPassesForUnassignedInstance— Verifies that deleting an instance with no ExternalView assignments passes the guard railtestDeleteInstanceGuardRailBlocksWhenMinActiveReplicaViolated— Injects an active replica entry into ExternalView, verifies the delete is blocked withMIN_ACTIVE_REPLICA_CHECK_FAILEDerror detailstestDeleteInstanceGuardRailBypassWithForce— Verifies?force=truebypasses the guard railupdateInstanceConfig capacity guard rail tests:
testUpdateInstanceConfigCapacityReductionBlocked— Sets up required capacity keys in ClusterConfig, sets instance capacity, then attempts to reduce capacity removing a required key. Verifies 400 response.testUpdateInstanceConfigCapacityReductionForceBypass— Same setup but with?force=true, verifies 200 responsetestUpdateInstanceConfigNonCapacityChangePasses— Updates a non-capacity field, verifies no capacity validation is triggeredChanges that Break Backward Compatibility
deleteInstance()now returns400instead of200when dropping an instance would violate min-active-replica constraints. Callers that previously relied on unconditional200responses should add?force=trueif they want to preserve the old behavior.updateInstanceConfig()now returns400instead of200when capacity map changes would break WAGED rebalancer constraints. Same?force=truebypass available.updateInstanceConfig()now returns400instead of500forIllegalArgumentException(topology validation failures). This is a fix, not a break — clients handling500for these errors should update to handle400.