From ccc27ffdf0d435159dc7d39a9985f2a6d47f8dea Mon Sep 17 00:00:00 2001 From: Karan Tripathi Date: Fri, 30 Jan 2026 15:13:35 +0530 Subject: [PATCH] E2E sequence diagram with component level details --- E2EFlow.md | 581 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 581 insertions(+) create mode 100644 E2EFlow.md diff --git a/E2EFlow.md b/E2EFlow.md new file mode 100644 index 0000000000..6948508e1d --- /dev/null +++ b/E2EFlow.md @@ -0,0 +1,581 @@ +# NSS ADD_INSTANCE Operation - Complete Flow + +--- + +## Architecture Overview + +``` +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ │ │ │ │ │ │ │ │ │ +│ NSS Operator │────▶│ HelixACM │────▶│ Helix REST │────▶│ Helix │────▶│ Espresso │ +│ │ │ Service │ │ API │ │ Controller │ │ Node │ +│ │◀────│ │◀────│ │◀────│ │◀────│ │ +└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ + │ │ │ │ │ + │ │ │ │ │ + └─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘ + ZooKeeper + (Metadata & Coordination) +``` + +### Key Points: +1. **NSS** sends ClusterUpdateRequest to **HelixACM** every 5 seconds +2. **HelixACM** orchestrates through its state machine: INIT → PREPARING → READY → ACKED → COMPLETED +3. **HelixACM** uses **Helix REST API** for health checks and cluster operations +4. **Helix Controller** detects changes via **ZooKeeper** watches +5. **Espresso Node** (participant) receives state transition messages from Controller + +--- + +## ACM State Machine for ADD_INSTANCE + +``` + ┌─────────────────────────────────────────────────┐ + │ INIT │ + │ Operation assigned to instance inst1 │ + └────────────────┬────────────────────────────────┘ + │ Health checks & validations + │ (Stoppable check via Helix REST) + ▼ + ┌─────────────────────────────────────────────────┐ + │ PREPARING │ + │ Instance approved, waiting for readiness │ + │ (For ADD: Auto-approved via NoOpContext) │ + └────────────────┬────────────────────────────────┘ + │ ACM transitions to READY + │ (No evacuation needed for ADD) + ▼ + ┌─────────────────────────────────────────────────┐ + │ READY │ + │ Operation ready for NSS to execute │ + │ State: READY (no state change yet) │ + └────────────────┬────────────────────────────────┘ + │ NSS receives ack, + │ moves inst1: pending → running + ▼ + ┌─────────────────────────────────────────────────┐ + │ ACKED │ + │ NSS confirms received READY state │ + │ NSS provisions & starts inst1 │ + └────────────────┬────────────────────────────────┘ + │ NSS completes provisioning, + │ inst1 joins Helix, gets partitions + ▼ + ┌─────────────────────────────────────────────────┐ + │ COMPLETED │ + │ NSS reports: completed_instances: [inst1] │ + │ Execute post-operation hooks │ + └────────────────┬────────────────────────────────┘ + │ Post-hooks complete, + │ operation committed + ▼ + ┌─────────────────────────────────────────────────┐ + │ INIT │ + │ Operation finished, ready for next │ + └─────────────────────────────────────────────────┘ +``` + +--- + +## Complete Sequence Diagram + +```mermaid +sequenceDiagram + participant NSS as NSS Operator + participant ACM as HelixACM Service
(gRPC + Pipeline) + participant REST as Helix REST API + participant ZK as ZooKeeper + participant CTRL as Helix Controller + participant INST as Espresso inst1
(New Node) + + Note over NSS,INST: PHASE 1: NSS Initiates ADD_INSTANCE via ACM + rect rgb(255, 250, 240) + NSS->>NSS: Aggregate operations
for LiStatefulSet + NSS->>ACM: ClusterUpdateRequest
{
pending_instances: [inst1],
running_instances: [],
completed_instances: []
} + activate ACM + + Note right of ACM: ACM State: INIT
Operation assigned to inst1 + + ACM->>ACM: Immediate async response
(non-blocking) + ACM-->>NSS: ClusterUpdateResponse
{request accepted} + deactivate ACM + + ACM->>ACM: Queue in DedupEventProcessor + end + + Note over NSS,INST: PHASE 2: ACM Pipeline - Health Check (INIT → PREPARING) + rect rgb(240, 255, 240) + activate ACM + ACM->>ACM: Pipeline: Health Check Stage
Select instances for checks + + ACM->>REST: POST /clusters/{cluster}/instances/stoppable
{instances: [existing_nodes]} + activate REST + REST->>ZK: Read IdealState, ExternalView,
CurrentState + activate ZK + ZK-->>REST: Cluster state + deactivate ZK + + REST->>REST: Evaluate:
✓ Min active replicas
✓ Zone distribution
✓ Capacity available + REST-->>ACM: Stoppable check: OK ✓ + deactivate REST + + Note right of ACM: ACM State: INIT → PREPARING + + ACM->>ACM: Select preparation context:
ADD_INSTANCE → NoOpContext
(No evacuation needed) + + ACM->>ACM: NoOpContext.prepare(inst1)
Auto-approve: true + + Note right of ACM: ACM State: PREPARING → READY
(No Helix operation needed) + + ACM->>ZK: Persist ACM state:
/acm-record/{cluster}/inst1
State: READY + activate ZK + ZK-->>ACM: Persisted ✓ + deactivate ZK + deactivate ACM + end + + Note over NSS,INST: PHASE 3: NSS Receives READY, Provisions Instance + rect rgb(255, 255, 240) + Note right of NSS: NSS polls every 5 seconds + + NSS->>ACM: ClusterUpdateRequest
{
pending_instances: [inst1],
running_instances: [],
completed_instances: []
} + activate ACM + ACM->>ACM: Read state: inst1 = READY + ACM-->>NSS: ClusterUpdateResponse
{acks: [inst1]} + Note right of ACM: ACM State: READY
(NSS will transition to ACKED next cycle) + deactivate ACM + + NSS->>NSS: Move inst1:
pending → running + + NSS->>NSS: Provision new host:
- Allocate hardware
- Install OS
- Deploy Espresso service + + INST->>INST: Espresso service starts + + INST->>INST: Initialize HelixManager:
ZKHelixManager.connect() + + INST->>ZK: Create ephemeral LiveInstance:
/CLUSTER/LIVEINSTANCES/inst1
{
sessionId: 0x1abc...,
helixVersion: 1.4.4,
processId: 12345
} + activate ZK + Note right of ZK: Ephemeral node created + deactivate ZK + + INST->>ZK: Setup watchers:
/INSTANCES/inst1/MESSAGES + activate ZK + ZK-->>INST: Watcher registered ✓ + deactivate ZK + + Note right of INST: inst1 is now LIVE
but has no partitions yet + end + + Note over NSS,INST: PHASE 4: NSS Reports Running, ACM Moves to ACKED + rect rgb(250, 245, 255) + NSS->>ACM: ClusterUpdateRequest
{
pending_instances: [],
running_instances: [inst1],
completed_instances: []
} + activate ACM + ACM->>ACM: Update state:
inst1: READY → ACKED + ACM-->>NSS: ClusterUpdateResponse
{no new acks} + Note right of ACM: ACM State: ACKED
Waiting for NSS completion + deactivate ACM + end + + Note over NSS,INST: PHASE 5: Helix Controller Detects New Instance + rect rgb(245, 240, 255) + ZK->>CTRL: ZK Watch Event:
NodeChildrenChanged
/LIVEINSTANCES + activate CTRL + + CTRL->>ZK: Re-subscribe + fetch LiveInstances + activate ZK + ZK-->>CTRL: LiveInstances:
[existing-node1, existing-node2, inst1] + deactivate ZK + + CTRL->>CTRL: onLiveInstanceChange()
New instance detected: inst1 + + CTRL->>ZK: Setup watchers for inst1:
/INSTANCES/inst1/CURRENTSTATE
/INSTANCES/inst1/MESSAGES + activate ZK + ZK-->>CTRL: Watchers registered ✓ + deactivate ZK + + Note right of CTRL: Trigger Helix Pipeline + + CTRL->>CTRL: Stage 1-5: Read cluster data,
compute current state + + CTRL->>ZK: Read:
- IdealStates
- CurrentStates (all instances)
- InstanceConfigs
- Topology + activate ZK + ZK-->>CTRL: Complete cluster state + deactivate ZK + + Note right of CTRL: Stage 6: BestPossibleStateCalcStage
(WAGED Rebalancer) + + CTRL->>CTRL: WAGED Rebalancer compute:
━━━━━━━━━━━━━━━━━━━━━
Input:
• Live instances: 3
• Current: 2 replicas
- existing-node1: LEADER
- existing-node2: STANDBY
• Required: 3 replicas
• MinActiveReplicas: 2
• Topology: 3 zones
━━━━━━━━━━━━━━━━━━━━━
Decision:
Assign partition P0
to inst1 as STANDBY
(lowest utilization, zone spread) + + Note right of CTRL: New Best Possible State:
P0:
existing-node1: LEADER
existing-node2: STANDBY
inst1: STANDBY ← NEW + + Note right of CTRL: Stage 7: MessageGenerationPhase + + CTRL->>CTRL: Compare Best vs Current:
inst1 Best: STANDBY
inst1 Current: OFFLINE
→ Transition: OFFLINE → STANDBY + + CTRL->>CTRL: Create state transition message:
Target: inst1
Resource: Database_P0
Partition: P0
FROM_STATE: OFFLINE
TO_STATE: STANDBY
StateModel: LeaderStandby + + Note right of CTRL: Stages 8-12:
Selection, Throttling, Dispatch + + CTRL->>CTRL: Check constraints:
✓ MinActiveReplicas (2) maintained
✓ No throttle limits + + CTRL->>ZK: Write message:
/INSTANCES/inst1/MESSAGES/msg-abc123
{
FROM_STATE: OFFLINE,
TO_STATE: STANDBY,
PARTITION: Database_P0_0,
RESOURCE: Database_P0,
STATE_MODEL: LeaderStandby
} + activate ZK + ZK-->>CTRL: Message written ✓ + deactivate ZK + + deactivate CTRL + end + + Note over NSS,INST: PHASE 6: Participant Executes State Transition + rect rgb(255, 245, 250) + ZK->>INST: ZK Watch Event:
NodeChildrenChanged
/INSTANCES/inst1/MESSAGES + activate INST + + INST->>ZK: Re-subscribe + fetch messages + activate ZK + ZK-->>INST: Message: OFFLINE → STANDBY + deactivate ZK + + INST->>INST: HelixStateMachineEngine:
Create transition handler + + INST->>INST: Get StateModel:
LeaderStandby (Espresso impl) + + Note right of INST: Execute State Transition:
onBecomeStandbyFromOffline() + + INST->>INST: Application logic (~3-5s):
━━━━━━━━━━━━━━━━━━━━━
• Open partition files
• Start replication from LEADER
• Build indexes
• Mark partition ready for reads
━━━━━━━━━━━━━━━━━━━━━ + + INST->>ZK: Update CurrentState:
/INSTANCES/inst1/CURRENTSTATE/
{session}/Database_P0
{
STATE: STANDBY,
PREVIOUS_STATE: OFFLINE,
START_TIME: ...,
END_TIME: ...
} + activate ZK + ZK-->>INST: CurrentState updated ✓ + deactivate ZK + + INST->>ZK: Delete message:
/INSTANCES/inst1/MESSAGES/msg-abc123 + activate ZK + ZK-->>INST: Message deleted ✓ + deactivate ZK + + deactivate INST + end + + Note over NSS,INST: PHASE 7: Controller Verification & ExternalView Update + rect rgb(245, 255, 255) + ZK->>CTRL: ZK Watch Event:
NodeDataChanged
/INSTANCES/inst1/CURRENTSTATE + activate CTRL + + CTRL->>ZK: Re-subscribe + fetch CurrentState + activate ZK + ZK-->>CTRL: inst1 CurrentState: STANDBY ✓ + deactivate ZK + + CTRL->>CTRL: onStateChange()
Trigger verification pipeline + + CTRL->>CTRL: Compute aggregated state:
existing-node1: P0: LEADER
existing-node2: P0: STANDBY
inst1: P0: STANDBY ← NEW + + CTRL->>CTRL: Verify convergence:
Current == Best Possible ✓
No further messages needed + + CTRL->>ZK: Update ExternalView:
/EXTERNALVIEW/Database_P0
{
Database_P0_0: {
existing-node1: LEADER,
existing-node2: STANDBY,
inst1: STANDBY
}
} + activate ZK + Note right of ZK: ExternalView published
Routers/clients can see inst1 + deactivate ZK + + deactivate CTRL + end + + Note over NSS,INST: PHASE 8: NSS Completes Operation via ACM + rect rgb(250, 250, 255) + NSS->>NSS: Verify inst1 health:
✓ Serving traffic
✓ Partitions assigned
✓ Replication healthy + + NSS->>ACM: ClusterUpdateRequest
{
pending_instances: [],
running_instances: [],
completed_instances: [inst1]
}
CompletionReason: SUCCESS + activate ACM + + Note right of ACM: ACM State: ACKED → COMPLETED + + ACM->>ACM: Operation Complete Stage:
Process completion + + ACM->>ACM: Execute post-operation hooks
(custom validation, cleanup) + + ACM->>ACM: Based on CompletionReason:
• SUCCESS → No action needed
(inst1 already in cluster)
• FAILED → Would DISABLE inst1 + + Note right of ACM: ACM State: COMPLETED → INIT + + ACM->>ZK: Update ACM state:
/acm-record/{cluster}/inst1
State: INIT
(operation finished) + activate ZK + ZK-->>ACM: State persisted ✓ + deactivate ZK + + ACM-->>NSS: ClusterUpdateResponse
{committedOperations: [operation_id]} + deactivate ACM + + NSS->>NSS: Remove inst1 from operation list
Operation committed ✓ + end + + Note over NSS,INST: 🎉 ADD_INSTANCE COMPLETE 🎉
inst1 successfully added to cluster
Serving Database_P0 as STANDBY
Total time: ~10-15 seconds + +``` + +--- + +## Timeline (Approximate) + +| Time | Component | Action | +|------|-----------|--------| +| T+0ms | NSS | Send ClusterUpdateRequest (pending: [inst1]) | +| T+50ms | ACM | Queue request, return immediate response | +| T+100ms | ACM Pipeline | Health Check stage: stoppable check via REST | +| T+500ms | ACM | NoOpContext: auto-approve, transition to READY | +| T+1s | NSS | Poll ACM, receive ack, move inst1 to running | +| T+5s | NSS | Provision hardware, install Espresso | +| T+10s | inst1 | Espresso starts, connect to Helix | +| T+10.1s | inst1 | Create LiveInstance in ZK (ephemeral) | +| T+10.2s | Helix Controller | Detect new LiveInstance, trigger pipeline | +| T+10.5s | Helix Controller | WAGED rebalance: assign P0/STANDBY to inst1 | +| T+10.6s | Helix Controller | Write state transition message to ZK | +| T+10.7s | inst1 | Receive message, execute onBecomeStandbyFromOffline() | +| T+14s | inst1 | Complete transition (open files, replicate, index) | +| T+14.1s | inst1 | Update CurrentState to STANDBY | +| T+14.2s | Helix Controller | Verify convergence, update ExternalView | +| T+16s | NSS | Verify inst1 healthy, send completion | +| T+16.1s | ACM | Execute post-hooks, transition to INIT | +| T+16.2s | NSS | Operation committed ✓ | + +**Total Duration: ~16 seconds** + +--- + +## Key Differences: ADD vs REMOVE Operations + +| Aspect | ADD_INSTANCE | REMOVE_INSTANCE | +|--------|--------------|-----------------| +| **ACM Context** | NoOpContext or AutoAckContext | PreparingByEvacuateContext | +| **Helix Operation** | None (or ENABLE) | EVACUATE | +| **Preparation Time** | Instant (auto-approved) | Wait for evacuation to complete | +| **Helix REST Calls** | Stoppable check only | Stoppable check + setInstanceOperation + isEvacuateFinished | +| **Helix Controller** | Assigns partitions TO new node | Moves partitions OFF old node | +| **State Transitions** | New node: OFFLINE → STANDBY/LEADER | Old node: current → OFFLINE | +| **Post-Completion** | No action needed | DELETE instance from cluster | +| **Total Duration** | ~10-16 seconds | ~30-60 seconds (depends on partition size) | + +--- + +## Component Interactions Summary + +### 1. NSS Operator +- **Role**: Orchestrates ADD_INSTANCE operation +- **Actions**: + - Sends ClusterUpdateRequest every 5 seconds + - Tracks instance state: pending → running → completed + - Provisions hardware when ACM signals READY + - Verifies instance health before marking complete + +### 2. HelixACM Service +- **Role**: Operation coordination and safety validation +- **Actions**: + - Receives operation via gRPC + - Runs async pipeline (Health Check → Operation Prepare) + - Selects NoOpContext for ADD_INSTANCE (no preparation needed) + - Manages state machine: INIT → PREPARING → READY → ACKED → COMPLETED → INIT + - Calls Helix REST for stoppable checks + - Executes post-operation hooks + +### 3. Helix REST API +- **Role**: Stateless API layer for Helix operations +- **Actions**: + - Performs stoppable checks (validate cluster health) + - Reads cluster state from ZooKeeper + - Returns stoppable instance lists + - (For ADD: minimal interaction, no instance operation set) + +### 4. ZooKeeper +- **Role**: Metadata store and coordination +- **Actions**: + - Stores LiveInstance (ephemeral, created by participant) + - Stores IdealState, CurrentState, ExternalView + - Stores ACM operation state (/acm-record) + - Fires watches to notify Controller of changes + - Stores state transition messages + +### 5. Helix Controller +- **Role**: Partition assignment and rebalancing +- **Actions**: + - Detects new LiveInstance via ZK watch + - Runs 19-stage pipeline + - WAGED rebalancer computes optimal partition placement + - Generates state transition messages + - Dispatches messages to participants + - Updates ExternalView after convergence + +### 6. Espresso Node (Participant) +- **Role**: Application that serves data +- **Actions**: + - Creates ephemeral LiveInstance on startup + - Watches /MESSAGES for instructions + - Executes state transitions (OFFLINE → STANDBY) + - Updates CurrentState after transitions + - Implements application logic (file I/O, replication, indexing) + +--- + +## ACM Pipeline Stages for ADD_INSTANCE + +### Stage 1: Health Check and Operation Prepare (INIT → PREPARING) +```java +// Pseudo-code +instances = getInstancesInState(INIT) +stoppableResult = helixRestClient.instancesStoppable(cluster, existingInstances) +if (stoppableResult.canAddNewInstance()) { + // For ADD_INSTANCE: use NoOpContext + context = new NoOpContext() + context.prepare(inst1) // Auto-approves immediately + setState(inst1, PREPARING) +} +``` + +**Result**: Instance moves to PREPARING (no Helix operation needed) + +### Stage 2: Operation Selection (PREPARING → READY) +```java +// Pseudo-code +instances = getInstancesInState(PREPARING) +for (inst in instances) { + if (context.isReady(inst)) { // NoOpContext always returns true + setState(inst, READY) + } +} +``` + +**Result**: Instance moves to READY immediately + +### Stage 3: Operation Update (via NSS polling) +```java +// Pseudo-code +// NSS sends: pending_instances=[], running_instances=[inst1] +if (nssRequest.running_instances.contains(inst1)) { + setState(inst1, ACKED) +} + +// NSS sends: completed_instances=[inst1] +if (nssRequest.completed_instances.contains(inst1)) { + setState(inst1, COMPLETED) +} +``` + +**Result**: READY → ACKED → COMPLETED based on NSS updates + +### Stage 4: Operation Complete (COMPLETED → INIT) +```java +// Pseudo-code +instances = getInstancesInState(COMPLETED) +for (inst in instances) { + executePreCompletionHooks(inst) + + if (completionReason == SUCCESS) { + // For ADD_INSTANCE: no action needed + // Instance already in cluster via direct Helix join + } else if (completionReason == FAILED) { + // Optionally DISABLE instance + helixRestClient.setInstanceOperation(inst, DISABLE) + } + + executePostCompletionHooks(inst) + setState(inst, INIT) +} +``` + +**Result**: Instance moves to INIT, operation complete + +--- + +## ZooKeeper Paths + +### ACM State (at /acm-record): +``` +/acm-record/ +└── ESPRESSO_MYDB/ + └── inst1 + ├── state: READY + ├── operationType: ADD_INSTANCE + ├── lastUpdateTime: 1738234567890 + └── metadata: {...} +``` + +### Helix Cluster State: +``` +/ESPRESSO_MYDB/ +├── LIVEINSTANCES/ +│ ├── existing-node1_11932 [EPHEMERAL] +│ ├── existing-node2_11932 [EPHEMERAL] +│ └── inst1_11932 [EPHEMERAL] ← NEW +├── INSTANCES/ +│ └── inst1_11932/ +│ ├── MESSAGES/ +│ │ └── msg-abc123 (OFFLINE → STANDBY) +│ └── CURRENTSTATE/ +│ └── {sessionId}/ +│ └── Database_P0 +│ └── {STATE: STANDBY, ...} +├── IDEALSTATES/ +│ └── Database_P0 +│ └── {Database_P0_0: {...}} +├── EXTERNALVIEW/ +│ └── Database_P0 +│ └── {Database_P0_0: {existing-node1: LEADER, existing-node2: STANDBY, inst1: STANDBY}} +└── CONFIGS/ + ├── CLUSTER/ + │ └── {MinActiveReplicas: 2, ...} + ├── PARTICIPANT/ + │ ├── existing-node1_11932 + │ ├── existing-node2_11932 + │ └── inst1_11932 ← NEW + └── RESOURCE/ + └── Database_P0 +``` + +--- + +## Error Handling + +### ACM Pipeline Errors +- **Stoppable check fails**: Instance remains in INIT, retries with exponential backoff +- **Cluster in safety mode**: Operations rejected until cluster healthy +- **ZooKeeper write fails**: Pipeline retries on next cycle (idempotent) + +### NSS Errors +- **Provisioning fails**: NSS sends CompletionReason: FAILED +- **ACM unavailable**: NSS continues polling until ACM recovers +- **Timeout**: NSS marks operation failed after configurable timeout + +### Helix Controller Errors +- **Rebalance fails**: Controller logs error, retries on next pipeline cycle +- **Constraint violation**: No messages generated until constraints satisfied +- **Participant timeout**: Controller marks partition ERROR after timeout + +### Participant Errors +- **State transition fails**: Participant reports ERROR in CurrentState +- **Message processing fails**: Message remains until successfully processed +- **ZK connection lost**: Ephemeral LiveInstance disappears, Controller rebalances + +--- + +## Safety Guarantees + +### HelixACM Safety Checks +1. **Stoppable Check**: Ensures cluster can handle new instance without violating constraints +2. **Min Active Replicas**: Verified before any partition reassignment +3. **Zone Distribution**: New instance placed in optimal zone for fault tolerance +4. **Safety Mode**: Auto-activates if >25% instances offline (blocks unsafe operations) +5. **Operation Yielding**: Urgent operations (disruptions) can preempt ADD operations + +### Helix Controller Guarantees +1. **At-least MinActiveReplicas**: Never violates min active replica constraint during transitions +2. **Atomic State Transitions**: All-or-nothing message delivery per partition +3. **Best Possible State Convergence**: Continuously works toward optimal assignment +4. **Fault Tolerance**: Controller failover via super controller reassignment + +### Participant Guarantees +1. **Idempotent Transitions**: State transitions can be safely retried +2. **Consistent State**: CurrentState always reflects actual application state +3. **Message Acknowledgment**: Messages deleted only after successful transition + +--- + +**End of Document**