DeadMaster false detection and data loss on MariaDB when SQL thread is stopped

Hi Team,

### **Description:**

Orchestrator incorrectly triggers DeadMaster failover when SQL threads are stopped on replicas, even when IO threads are still running (proving the master is alive). Additionally, for MariaDB instances with semi-sync AFTER_SYNC, relay logs containing all committed transactions are never drained before promotion, turning a zero-data-loss situation into data loss. Furthermore, replicas are re-pointed to the candidate before promotion validation, with no rollback mechanism if validation fails.

Environment:
  - Orchestrator version: 4.30.0, also affects 3.2.6 and all prior versions
  - Database: MariaDB 10.6.x, 10.11.x
  - Replication: Semi-sync with AFTER_SYNC (rpl_semi_sync_master_wait_point=AFTER_SYNC)
  - Topology: 1 master, 2 replicas, GTID-based replication

### **Problem 1: False DeadMaster detection when IO thread is running**

````
File: go/inst/analysis_dao.go lines 203-208:
  
  IFNULL(
      SUM(
          replica_instance.last_checked <= replica_instance.last_seen
          AND replica_instance.slave_io_running != 0
          AND replica_instance.slave_sql_running != 0
      ),
      0
  ) AS count_valid_replicating_replicas,
  
File: go/inst/analysis_dao.go line 499:
  
  } else if a.IsMaster && !a.LastCheckValid && a.CountValidReplicas == a.CountReplicas &&
  a.CountValidReplicatingReplicas == 0 {
      a.Analysis = DeadMaster
      a.Description = "Master cannot be reached by orchestrator and none of its replicas is replicating"
````

If the SQL thread is stopped or broken but the IO thread is still running, count_valid_replicating_replicas = 0 because the AND condition fails. If Orchestrator also loses its connection to the master (e.g., max_connections exhausted with a 1-second timeout), it declares DeadMaster, even though the running IO thread proves the master is alive and accepting replica connections.

There is no count_replicas_with_io_running field anywhere. The IO thread state is never evaluated independently in the analysis struct (go/inst/analysis.go lines 125-175).
Additionally, the UnreachableMaster safe path (lines 515-523) ALSO requires CountValidReplicatingReplicas > 0:

```
} else if a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 &&
  a.CountValidReplicatingReplicas > 0 {
      a.Analysis = UnreachableMaster
```

This means there is no code path that correctly handles "IO running, SQL stopped, master unreachable from Orchestrator." It always falls into DeadMaster.
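To make the missing branch concrete, here is a minimal sketch of how the classification could treat the IO thread as independent evidence. It is not orchestrator code: `replicationAnalysis`, `classify`, and the `CountReplicasWithIORunning` field are hypothetical names introduced for illustration, and the real conditions (for example `LastCheckPartialSuccess`) are simplified away.

```go
package main

import "fmt"

// replicationAnalysis mirrors only the fields discussed above.
// CountReplicasWithIORunning is hypothetical; no such field exists today.
type replicationAnalysis struct {
	IsMaster                      bool
	LastCheckValid                bool
	CountReplicas                 uint
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
	CountReplicasWithIORunning    uint // hypothetical
}

// classify returns the analysis name as a plain string for illustration.
func classify(a replicationAnalysis) string {
	if !a.IsMaster || a.LastCheckValid {
		return "NoProblem"
	}
	switch {
	// DeadMaster only when no replica shows any sign of a live master.
	case a.CountValidReplicas == a.CountReplicas &&
		a.CountValidReplicatingReplicas == 0 &&
		a.CountReplicasWithIORunning == 0:
		return "DeadMaster"
	// A running IO thread alone is enough to downgrade to UnreachableMaster.
	case a.CountValidReplicas > 0 &&
		(a.CountValidReplicatingReplicas > 0 || a.CountReplicasWithIORunning > 0):
		return "UnreachableMaster"
	}
	return "Unknown"
}

func main() {
	// Two reachable replicas, SQL threads stopped, IO threads running.
	a := replicationAnalysis{IsMaster: true, CountReplicas: 2, CountValidReplicas: 2, CountReplicasWithIORunning: 2}
	fmt.Println(classify(a)) // UnreachableMaster

	// Same topology, but the IO threads have also stopped.
	a.CountReplicasWithIORunning = 0
	fmt.Println(classify(a)) // DeadMaster
}
```

One caveat for any real change: a replica can keep reporting its IO thread as running for a while after the master disappears (until slave_net_timeout expires), so the IO state would probably need to be combined with a freshness check.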

### **Problem 2: StopReplicationNicely explicitly skipped for MariaDB**

File: go/inst/instance_topology_dao.go line 376:

```
if stopReplicationMethod == StopReplicationNice && !replica.IsMariaDB() {
    StopReplicationNicely(&replica.Key, timeout)
}
replica, _ = StopReplication(&replica.Key)
```

StopReplicationNicely (lines 266-298) is the only function that starts the SQL thread to drain relay logs:

```
// StopReplicationNicely stops a replica such that SQL_thread and IO_thread are aligned (i.e.
// SQL_thread consumes all relay log entries)
// It will actually START the sql_thread even if the replica is completely stopped.
func StopReplicationNicely(instanceKey *InstanceKey, timeout time.Duration) (*Instance, error) {
    ...
    for _, cmd := range []string{`stop slave io_thread`, `start slave sql_thread`} {
        if _, err := ExecInstance(instanceKey, cmd); err != nil {
            return nil, log.Errorf(...)
        }
    }
    if instance.SQLDelay == 0 {
        if instance, err = WaitForSQLThreadUpToDate(instanceKey, timeout, 0); err != nil {
            return instance, err
        }
    }
    _, err = ExecInstance(instanceKey, `stop slave`)
    ...
}
```

For MariaDB, this is skipped entirely. Relay logs are never drained.

Full call chain verified:

1. RegroupReplicasGTID calls GetCandidateReplica(masterKey, true) (instance_topology.go line 2525)
2. GetCandidateReplica with forRematchPurposes=true sets stopReplicationMethod = StopReplicationNice (line 2279)
3. Passed to sortedReplicasDataCenterHint → StopReplicas
4. StopReplicas checks !replica.IsMariaDB() → skips StopReplicationNicely for MariaDB
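The drain sequence that MariaDB replicas miss can be sketched in isolation as follows. This is an illustrative model, not orchestrator's implementation: the `replicaControl` interface, `drainRelayLog`, and the in-memory `fakeReplica` are hypothetical names, and positions are plain integers rather than real binlog coordinates or GTIDs.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaControl is a hypothetical abstraction over the statements sent to a replica.
type replicaControl interface {
	Exec(stmt string) error
	// Positions returns how far the IO thread has read and the SQL thread has executed.
	Positions() (read, exec int64, err error)
}

// drainRelayLog stops the IO thread, starts the SQL thread, and polls until
// everything already received has been applied, then stops replication.
func drainRelayLog(r replicaControl, maxPolls int) error {
	for _, stmt := range []string{"stop slave io_thread", "start slave sql_thread"} {
		if err := r.Exec(stmt); err != nil {
			return fmt.Errorf("%s: %w", stmt, err)
		}
	}
	for i := 0; i < maxPolls; i++ {
		read, exec, err := r.Positions()
		if err != nil {
			return err
		}
		if exec >= read {
			return r.Exec("stop slave")
		}
	}
	return errors.New("relay log not drained within poll budget")
}

// fakeReplica simulates a SQL thread that applies 100 units per poll.
type fakeReplica struct {
	read, exec int64
	sqlRunning bool
	stmts      []string
}

func (f *fakeReplica) Exec(stmt string) error {
	f.stmts = append(f.stmts, stmt)
	switch stmt {
	case "start slave sql_thread":
		f.sqlRunning = true
	case "stop slave":
		f.sqlRunning = false
	}
	return nil
}

func (f *fakeReplica) Positions() (int64, int64, error) {
	if f.sqlRunning && f.exec < f.read {
		f.exec += 100
		if f.exec > f.read {
			f.exec = f.read
		}
	}
	return f.read, f.exec, nil
}

func main() {
	f := &fakeReplica{read: 1000, exec: 400}
	err := drainRelayLog(f, 50)
	fmt.Println(f.exec, f.stmts, err)
}
```

Note that if the SQL thread originally stopped on an error such as 1062, restarting it will hit the same event again, so a drain like this would run out of its budget rather than succeed. The sketch only covers the case where the thread is able to resume.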

### **Problem 3: startReplicationOnCandidate = false in dead master recovery**

```
File: go/logic/topology_recovery.go line 557:
  
  lostReplicas, _, cannotReplicateReplicas, promotedReplica, err =
  inst.RegroupReplicasGTID(failedInstanceKey, true, false, nil,
  &topologyRecovery.PostponedFunctionsContainer, promotedReplicaIsIdeal)
  
  Function signature (go/inst/instance_topology.go line 2509):
  
  func RegroupReplicasGTID(
      masterKey *InstanceKey,
      returnReplicaEvenOnFailureToRegroup bool,  // true
      startReplicationOnCandidate bool,           // false
      ...
  )
```
The SQL thread is never restarted on the promoted candidate. Combined with Problem 2, the relay log, which contains all committed transactions (guaranteed by AFTER_SYNC semi-sync), is never applied.

### **Problem 4: WaitForSQLThreadUpToDate never starts SQL thread**

```
File: go/inst/instance_topology_dao.go lines 307-355:
  
  func WaitForSQLThreadUpToDate(instanceKey *InstanceKey, overallTimeout time.Duration,
  staleCoordinatesTimeout time.Duration) (instance *Instance, err error) {
      var lastExecBinlogCoordinates BinlogCoordinates
      if overallTimeout == 0 {
          overallTimeout = 24 * time.Hour
      }
      if staleCoordinatesTimeout == 0 {
          staleCoordinatesTimeout = time.Duration(config.Config.ReasonableReplicationLagSeconds) *
  time.Second
      }
      generalTimer := time.NewTimer(overallTimeout)
      staleTimer := time.NewTimer(staleCoordinatesTimeout)
      for {
          instance, err := RetryInstanceFunction(func() (*Instance, error) {
              return ReadTopologyInstance(instanceKey)
          })
          if err != nil {
              return instance, log.Errore(err)
          }
          if instance.SQLThreadUpToDate() {
              return instance, nil
          }
          ...
          select {
          case <-generalTimer.C:
              return instance, log.Errorf("WaitForSQLThreadUpToDate timeout ...")
          case <-staleTimer.C:
              return instance, log.Errorf("WaitForSQLThreadUpToDate stale coordinates timeout ...")
          default:
              time.Sleep(retryInterval)
          }
      }
  }
```

This function only polls ReadTopologyInstance(). It never issues START SLAVE SQL_THREAD. If the SQL thread is stopped, coordinates never advance -> stale timeout -> returns error.
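One way to avoid waiting out the stale timeout is to check the SQL thread state inside the loop and return a distinct error as soon as it is found stopped. The following is a sketch under assumptions, not a patch: `replicaStatus`, `waitForSQLThread`, and `errSQLThreadStopped` are hypothetical names, and the status reader is injected as a function so the loop can be exercised without a database.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSQLThreadStopped lets callers tell "cannot progress" apart from "too slow".
var errSQLThreadStopped = errors.New("SQL thread is not running; relay log cannot advance")

// replicaStatus is a hypothetical, reduced view of a replica's state.
type replicaStatus struct {
	SQLRunning bool
	ReadPos    int64 // how far the IO thread has read
	ExecPos    int64 // how far the SQL thread has applied
}

// waitForSQLThread polls read() until the replica has applied everything it
// received, failing fast if the SQL thread is stopped.
func waitForSQLThread(read func() (replicaStatus, error), timeout, interval time.Duration) (replicaStatus, error) {
	deadline := time.Now().Add(timeout)
	for {
		st, err := read()
		if err != nil {
			return st, err
		}
		if st.ExecPos >= st.ReadPos {
			return st, nil
		}
		if !st.SQLRunning {
			return st, errSQLThreadStopped
		}
		if time.Now().After(deadline) {
			return st, errors.New("timeout waiting for SQL thread to catch up")
		}
		time.Sleep(interval)
	}
}

func main() {
	// A replica whose SQL thread is stopped with relay log still pending.
	stopped := func() (replicaStatus, error) {
		return replicaStatus{SQLRunning: false, ReadPos: 100, ExecPos: 40}, nil
	}
	_, err := waitForSQLThread(stopped, time.Second, time.Millisecond)
	fmt.Println(err)

	// A replica that applies 50 units on every poll.
	pos := int64(0)
	advancing := func() (replicaStatus, error) {
		pos += 50
		return replicaStatus{SQLRunning: true, ReadPos: 100, ExecPos: pos}, nil
	}
	st, err := waitForSQLThread(advancing, time.Second, time.Millisecond)
	fmt.Println(st.ExecPos, err)
}
```

A caller receiving the distinct error could then decide whether to start the SQL thread and retry, or to abort the promotion of that candidate.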

### **Problem 5: Replicas re-pointed before promotion validation (no rollback)**

File: go/logic/topology_recovery.go:

- Line 557: RegroupReplicasGTID inside recoverDeadMaster() issues CHANGE MASTER TO on other replicas via moveReplicasViaGTID
- Line 894: overrideMasterPromotion() validates SQL thread state, lag, and geo-constraints, and is called AFTER recoverDeadMaster() returns

```
recoveryAttempted, promotedReplica, lostReplicas, err := recoverDeadMaster(topologyRecovery,
candidateInstanceKey, skipProcesses)  // line ~855
...
if promotedReplica, err = overrideMasterPromotion(); err != nil {  // line 894
    AuditTopologyRecovery(topologyRecovery, err.Error())
}
// And this is the end; whether successful or not, we're done.
resolveRecovery(topologyRecovery, promotedReplica)
// Now, see whether we are successful or not. From this point there's no going back.
```

If validation at line 894 fails (e.g., WaitForSQLThreadUpToDate times out because SQL thread isn't running), replicas are already re-pointed with no rollback code. The source comment explicitly states: "From this point there's no going back."
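The ordering problem can be illustrated with a small sketch in which validation runs before any replica is touched. Everything here is hypothetical: `candidate`, `validateCandidate`, and `promote` are illustrative names, the checks are a reduced stand-in for what overrideMasterPromotion() does, and re-pointing is an injected function rather than a real CHANGE MASTER TO.

```go
package main

import (
	"errors"
	"fmt"
)

// candidate is a hypothetical, reduced view of the replica chosen for promotion.
type candidate struct {
	Name       string
	SQLRunning bool
	LagSeconds int
}

// validateCandidate stands in for the promotion checks (SQL thread state, lag).
func validateCandidate(c candidate, maxLag int) error {
	if !c.SQLRunning {
		return errors.New("SQL thread is not running")
	}
	if c.LagSeconds > maxLag {
		return fmt.Errorf("lag %ds exceeds limit %ds", c.LagSeconds, maxLag)
	}
	return nil
}

// promote validates first and only then re-points replicas, so a rejected
// candidate leaves the topology untouched. It returns the replicas moved.
func promote(c candidate, replicas []string, maxLag int, repoint func(replica, newMaster string) error) ([]string, error) {
	if err := validateCandidate(c, maxLag); err != nil {
		return nil, fmt.Errorf("candidate %s rejected, topology untouched: %w", c.Name, err)
	}
	var moved []string
	for _, r := range replicas {
		if err := repoint(r, c.Name); err != nil {
			return moved, err // caller knows exactly which replicas were moved
		}
		moved = append(moved, r)
	}
	return moved, nil
}

func main() {
	repoint := func(replica, newMaster string) error {
		fmt.Printf("re-pointing %s to %s\n", replica, newMaster)
		return nil
	}
	_, err := promote(candidate{Name: "replica1", SQLRunning: false}, []string{"replica2"}, 10, repoint)
	fmt.Println(err)
}
```

Returning the list of moved replicas on a partial failure is a deliberate choice in this sketch: it gives the caller what it would need to attempt a rollback.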

### **Impact with AFTER_SYNC semi-sync:**
  
With rpl_semi_sync_master_wait_point=AFTER_SYNC:
  
1. Master writes to binlog
2. Sends event to replica
3. Replica writes to relay log → sends ACK
4. Master receives ACK → then commits (makes visible to clients)
  
Every committed transaction on master is guaranteed to exist in at least one replica's relay log.
  
But Orchestrator turns this zero-data-loss guarantee into data loss because:
  
1. It doesn't start the SQL thread on the promoted MariaDB replica (Problem 3)
2. It doesn't drain relay logs, because MariaDB is explicitly skipped (Problem 2)
3. WaitForSQLThreadUpToDate just polls without starting anything (Problem 4)
4. Committed transactions sitting in the relay log are never applied
5. The promoted master starts serving with stale applied state
  
The data is in the relay log. Orchestrator just doesn't apply it.
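A toy model of the gap between "received" and "applied" shows where the loss comes from. It is purely illustrative: the `replica` type and its methods are hypothetical, and transactions are represented as integers.

```go
package main

import "fmt"

// replica is a toy model: relayLog holds transactions the IO thread has
// received (and ACKed under semi-sync); applied counts how many of them
// the SQL thread has executed.
type replica struct {
	relayLog []int
	applied  int
}

// pending returns the transactions that were received but never applied.
func (r *replica) pending() []int {
	return r.relayLog[r.applied:]
}

// drain models the SQL thread consuming the rest of the relay log.
func (r *replica) drain() {
	r.applied = len(r.relayLog)
}

func main() {
	r := &replica{relayLog: []int{1, 2, 3, 4, 5}, applied: 3}

	// Promoted as-is: transactions 4 and 5 are committed on the old master
	// but invisible on the new one.
	fmt.Println("missing without drain:", r.pending())

	// Promoted after draining: nothing is missing.
	r.drain()
	fmt.Println("missing after drain:", r.pending())
}
```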

To Reproduce:
1) Generate a duplicate entry error (Error 1062) on the replicas. The SQL thread stops while the IO thread continues running.
2) Stop mariadbd on the master and wait for failover.

We'd appreciate the team reviewing the issues outlined above. Looking forward to your thoughts.

