Skip to content

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724

Open
CRZbulabula wants to merge 1 commit into
masterfrom
confignode-readonly-disk-check
Open

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724
CRZbulabula wants to merge 1 commit into
masterfrom
confignode-readonly-disk-check

Conversation

@CRZbulabula
Copy link
Copy Markdown
Contributor

Description

Why

ConfigNode currently has no notion of being ReadOnly — only DataNode samples its disk and self-marks. The heartbeat from the leader to other ConfigNodes is effectively just a liveness ping, so a ConfigNode with a full or crashed disk would silently keep accepting writes from peers / clients. This PR extends ReadOnly tracking to ConfigNode and adds a new disk-failure reason (DiskCrash) on top of the existing DiskFull.

What changes

  • New status reason. NodeStatus.DISK_CRASH constant added next to DISK_FULL (both are still string identifiers stored in statusReason).
  • Shared utility. New org.apache.iotdb.commons.cluster.DiskChecker in node-commons: probes a list of directories via test-write + free-space ratio, then runs a single state-machine apply on CommonConfig with priority DiskCrash > DiskFull > Normal. Recovery to Running only fires when the prior reason was disk-related; other ReadOnly reasons (e.g. manual maintenance) are left untouched.
  • ConfigNode leader. Self-checks [systemDir, consensusDir] at the top of HeartbeatService#heartbeatLoopBody (before fanning out heartbeats).
  • ConfigNode follower. Self-checks on every received heartbeat in ConfigNodeRPCServiceProcessor#getConfigNodeHeartBeat, and reports status + statusReason back via newly-added optional fields 4 and 5 on TConfigNodeHeartbeatResp (forward-compatible — old peers simply leave them unset).
  • Leader's self entry in the cache. ConfigNodeHeartbeatCache#updateCurrentStatistics no longer short-circuits for CURRENT_NODE_ID; instead, the self entry mirrors CommonConfig, so show confignodes reflects the leader's own disk state.
  • DataNode. FolderManager now registers each instance into a weak-ref static list and exposes static boolean hasAnyAbnormalFolder(). DataNodeInternalRPCServiceImpl#sampleDiskLoad consults that aggregator and, when any folder is ABNORMAL, maps to DiskCrash (which outranks the existing free-ratio DiskFull check). State-machine application is delegated to DiskChecker.apply so DataNode and ConfigNode follow identical transition rules.

Design notes

  • DataNode keeps its current system-wide free-ratio check (SystemMetric.SYS_DISK_AVAILABLE_SPACE) for DiskFull. The new DiskCrash signal is path-scoped — it just observes already-recorded write failures rather than probing IO itself. ConfigNode runs both checks per-directory through DiskChecker (File.getUsableSpace/getTotalSpace + a tiny Files.createTempFile/write/delete probe), giving symmetric behavior on the two nodes from the cluster's perspective.
  • Thrift fields are optional to keep rolling upgrade safe: an older ConfigNode that doesn't populate them parses as Running with no reason.
  • ABNORMAL on DataNode is sticky: once a folder fails a business write it stays ABNORMAL until restart. This PR intentionally does not introduce an auto-recovery path for ReadOnly(DiskCrash) on DataNode (would require new ABNORMAL -> HEALTHY transitions inside FolderManager and is left as follow-up). ConfigNode does auto-recover, because testWrite reruns every heartbeat.

i18n

New disk health messages live in CommonMessages under both src/main/i18n/en and src/main/i18n/zh:

  • DISK_FULL_SET_READ_ONLY
  • DISK_CRASH_SET_READ_ONLY
  • DISK_CRASH_PROBE_FAILED
  • DISK_RECOVERED_SET_RUNNING

The existing inline English log in sampleDiskLoad is retained (just descriptive context); the state-change log itself is routed through CommonMessages so the Chinese build works out of the box.


This PR has:

  • been self-reviewed.
  • added Javadocs for the new public surface (`DiskChecker`, `FolderManager.hasAnyAbnormalFolder`, `ConfigNodeConfig.getCriticalDirs`).
  • added comments explaining the why where non-obvious (cache self-entry rewrite, optional Thrift fields, priority ordering).
  • added unit tests — `DiskCheckerTest` covers 17 cases including: NORMAL/DISK_FULL/DISK_CRASH detection, crash-wins-over-full, recovery clearing reason, non-disk `ReadOnly` reason left untouched, idempotency, probe-file cleanup.
  • added integration tests. (Heartbeat-driven state transitions are exercised at unit-test level; cluster-level disk failure injection felt heavier than this PR warrants — happy to add an IT in a follow-up if reviewers prefer.)
  • been tested in a test IoTDB cluster.

Known follow-ups
  • DataNode cannot auto-recover from `ReadOnly(DiskCrash)` once any folder is marked ABNORMAL by `FolderManager` — would need an ABNORMAL→HEALTHY re-probe path.
  • Consider unifying DataNode's free-ratio source with ConfigNode's per-directory `File.getUsableSpace` (currently DataNode goes through `MetricService`).

Key changed/added classes in this PR

New

  • `org.apache.iotdb.commons.cluster.DiskChecker`
  • `org.apache.iotdb.commons.cluster.DiskCheckerTest`

Modified

  • `org.apache.iotdb.commons.cluster.NodeStatus`
  • `org.apache.iotdb.commons.i18n.CommonMessages` (en + zh)
  • `org.apache.iotdb.confignode.rpc.thrift.TConfigNodeHeartbeatResp` (Thrift IDL)
  • `org.apache.iotdb.confignode.conf.ConfigNodeConfig`
  • `org.apache.iotdb.confignode.manager.load.service.HeartbeatService`
  • `org.apache.iotdb.confignode.manager.load.cache.node.ConfigNodeHeartbeatCache`
  • `org.apache.iotdb.confignode.manager.load.cache.node.NodeHeartbeatSample`
  • `org.apache.iotdb.confignode.service.thrift.ConfigNodeRPCServiceProcessor`
  • `org.apache.iotdb.db.storageengine.rescon.disk.FolderManager`
  • `org.apache.iotdb.db.protocol.thrift.impl.DataNodeInternalRPCServiceImpl`

…check

ConfigNode now reports its own NodeStatus.ReadOnly when its critical
directories (systemDir, consensusDir) are unwritable or near-full,
mirroring the existing DataNode behavior. NodeStatus reasons are
extended with a new DISK_CRASH constant alongside DISK_FULL, and the
ConfigNode heartbeat carries status/statusReason back to the leader.

- node-commons: new DiskChecker utility (probe + state-machine apply),
  with priority DiskCrash > DiskFull and recovery to Running when the
  reason was disk-related. i18n messages added in en + zh.
- thrift-confignode: TConfigNodeHeartbeatResp gains optional status
  and statusReason fields (forward-compatible).
- confignode: leader self-checks before fanning out heartbeats;
  follower self-checks on receive and reports back; cache reads from
  CommonConfig for the leader's self entry, otherwise from the sample.
- datanode: FolderManager exposes a static hasAnyAbnormalFolder()
  aggregator; sampleDiskLoad treats any ABNORMAL folder as DiskCrash
  (which wins over DiskFull) and reuses DiskChecker.apply.
@sonarqubecloud
Copy link
Copy Markdown

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 56.75676% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.43%. Comparing base (34cc346) to head (a6a8291).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...db/db/storageengine/rescon/disk/FolderManager.java 11.11% 16 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java 0.00% 9 Missing ⚠️
...ager/load/cache/node/ConfigNodeHeartbeatCache.java 62.50% 6 Missing ⚠️
...e/manager/load/cache/node/NodeHeartbeatSample.java 0.00% 5 Missing ⚠️
.../service/thrift/ConfigNodeRPCServiceProcessor.java 0.00% 5 Missing ⚠️
...fignode/manager/load/service/HeartbeatService.java 0.00% 3 Missing ⚠️
.../org/apache/iotdb/commons/cluster/DiskChecker.java 94.44% 3 Missing ⚠️
...apache/iotdb/confignode/conf/ConfigNodeConfig.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17724      +/-   ##
============================================
+ Coverage     40.41%   40.43%   +0.01%     
  Complexity     2574     2574              
============================================
  Files          5179     5180       +1     
  Lines        349767   349987     +220     
  Branches      44714    44749      +35     
============================================
+ Hits         141373   141518     +145     
- Misses       208394   208469      +75     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant