Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash) by CRZbulabula · Pull Request #17724 · apache/iotdb

CRZbulabula · 2026-05-20T03:28:50Z

Description

Why

ConfigNode currently has no notion of being ReadOnly — only DataNode samples its disk and self-marks. The heartbeat from the leader to other ConfigNodes is effectively just a liveness ping, so a ConfigNode with a full or crashed disk would silently keep accepting writes from peers / clients. This PR extends ReadOnly tracking to ConfigNode and adds a new disk-failure reason (DiskCrash) on top of the existing DiskFull.

What changes

New status reason. NodeStatus.DISK_CRASH constant added next to DISK_FULL (both are still string identifiers stored in statusReason).
Shared utility. New org.apache.iotdb.commons.cluster.DiskChecker in node-commons: probes a list of directories via test-write + free-space ratio, then runs a single state-machine apply on CommonConfig with priority DiskCrash > DiskFull > Normal. Recovery to Running only fires when the prior reason was disk-related; other ReadOnly reasons (e.g. manual maintenance) are left untouched.
ConfigNode leader. Self-checks [systemDir, consensusDir] at the top of HeartbeatService#heartbeatLoopBody (before fanning out heartbeats).
ConfigNode follower. Self-checks on every received heartbeat in ConfigNodeRPCServiceProcessor#getConfigNodeHeartBeat, and reports status + statusReason back via newly-added optional fields 4 and 5 on TConfigNodeHeartbeatResp (forward-compatible — old peers simply leave them unset).
Leader's self entry in the cache. ConfigNodeHeartbeatCache#updateCurrentStatistics no longer short-circuits for CURRENT_NODE_ID; instead, the self entry mirrors CommonConfig, so show confignodes reflects the leader's own disk state.
DataNode. FolderManager now registers each instance into a weak-ref static list and exposes static boolean hasAnyAbnormalFolder(). DataNodeInternalRPCServiceImpl#sampleDiskLoad consults that aggregator and, when any folder is ABNORMAL, maps to DiskCrash (which outranks the existing free-ratio DiskFull check). State-machine application is delegated to DiskChecker.apply so DataNode and ConfigNode follow identical transition rules.

Design notes

DataNode keeps its current system-wide free-ratio check (SystemMetric.SYS_DISK_AVAILABLE_SPACE) for DiskFull. The new DiskCrash signal is path-scoped — it just observes already-recorded write failures rather than probing IO itself. ConfigNode runs both checks per-directory through DiskChecker (File.getUsableSpace/getTotalSpace + a tiny Files.createTempFile/write/delete probe), giving symmetric behavior on the two nodes from the cluster's perspective.
Thrift fields are optional to keep rolling upgrade safe: an older ConfigNode that doesn't populate them parses as Running with no reason.
ABNORMAL on DataNode is sticky: once a folder fails a business write it stays ABNORMAL until restart. This PR intentionally does not introduce an auto-recovery path for ReadOnly(DiskCrash) on DataNode (would require new ABNORMAL -> HEALTHY transitions inside FolderManager and is left as follow-up). ConfigNode does auto-recover, because testWrite reruns every heartbeat.

i18n

New disk health messages live in CommonMessages under both src/main/i18n/en and src/main/i18n/zh:

DISK_FULL_SET_READ_ONLY
DISK_CRASH_SET_READ_ONLY
DISK_CRASH_PROBE_FAILED
DISK_RECOVERED_SET_RUNNING

The existing inline English log in sampleDiskLoad is retained (just descriptive context); the state-change log itself is routed through CommonMessages so the Chinese build works out of the box.

This PR has:

been self-reviewed.
added Javadocs for the new public surface (`DiskChecker`, `FolderManager.hasAnyAbnormalFolder`, `ConfigNodeConfig.getCriticalDirs`).
added comments explaining the why where non-obvious (cache self-entry rewrite, optional Thrift fields, priority ordering).
added unit tests — `DiskCheckerTest` covers 17 cases including: NORMAL/DISK_FULL/DISK_CRASH detection, crash-wins-over-full, recovery clearing reason, non-disk `ReadOnly` reason left untouched, idempotency, probe-file cleanup.
added integration tests. (Heartbeat-driven state transitions are exercised at unit-test level; cluster-level disk failure injection felt heavier than this PR warrants — happy to add an IT in a follow-up if reviewers prefer.)
been tested in a test IoTDB cluster.

Known follow-ups

DataNode cannot auto-recover from `ReadOnly(DiskCrash)` once any folder is marked ABNORMAL by `FolderManager` — would need an ABNORMAL→HEALTHY re-probe path.
Consider unifying DataNode's free-ratio source with ConfigNode's per-directory `File.getUsableSpace` (currently DataNode goes through `MetricService`).

Key changed/added classes in this PR

New

`org.apache.iotdb.commons.cluster.DiskChecker`
`org.apache.iotdb.commons.cluster.DiskCheckerTest`

Modified

`org.apache.iotdb.commons.cluster.NodeStatus`
`org.apache.iotdb.commons.i18n.CommonMessages` (en + zh)
`org.apache.iotdb.confignode.rpc.thrift.TConfigNodeHeartbeatResp` (Thrift IDL)
`org.apache.iotdb.confignode.conf.ConfigNodeConfig`
`org.apache.iotdb.confignode.manager.load.service.HeartbeatService`
`org.apache.iotdb.confignode.manager.load.cache.node.ConfigNodeHeartbeatCache`
`org.apache.iotdb.confignode.manager.load.cache.node.NodeHeartbeatSample`
`org.apache.iotdb.confignode.service.thrift.ConfigNodeRPCServiceProcessor`
`org.apache.iotdb.db.storageengine.rescon.disk.FolderManager`
`org.apache.iotdb.db.protocol.thrift.impl.DataNodeInternalRPCServiceImpl`

…check ConfigNode now reports its own NodeStatus.ReadOnly when its critical directories (systemDir, consensusDir) are unwritable or near-full, mirroring the existing DataNode behavior. NodeStatus reasons are extended with a new DISK_CRASH constant alongside DISK_FULL, and the ConfigNode heartbeat carries status/statusReason back to the leader. - node-commons: new DiskChecker utility (probe + state-machine apply), with priority DiskCrash > DiskFull and recovery to Running when the reason was disk-related. i18n messages added in en + zh. - thrift-confignode: TConfigNodeHeartbeatResp gains optional status and statusReason fields (forward-compatible). - confignode: leader self-checks before fanning out heartbeats; follower self-checks on receive and reports back; cache reads from CommonConfig for the leader's self entry, otherwise from the sample. - datanode: FolderManager exposes a static hasAnyAbnormalFolder() aggregator; sampleDiskLoad treats any ABNORMAL folder as DiskCrash (which wins over DiskFull) and reuses DiskChecker.apply.

sonarqubecloud · 2026-05-20T03:36:22Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

codecov · 2026-05-20T04:14:53Z

Codecov Report

❌ Patch coverage is 56.75676% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.43%. Comparing base (34cc346) to head (a6a8291).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...db/db/storageengine/rescon/disk/FolderManager.java	11.11%	16 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java	0.00%	9 Missing ⚠️
...ager/load/cache/node/ConfigNodeHeartbeatCache.java	62.50%	6 Missing ⚠️
...e/manager/load/cache/node/NodeHeartbeatSample.java	0.00%	5 Missing ⚠️
.../service/thrift/ConfigNodeRPCServiceProcessor.java	0.00%	5 Missing ⚠️
...fignode/manager/load/service/HeartbeatService.java	0.00%	3 Missing ⚠️
.../org/apache/iotdb/commons/cluster/DiskChecker.java	94.44%	3 Missing ⚠️
...apache/iotdb/confignode/conf/ConfigNodeConfig.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17724      +/-   ##
============================================
+ Coverage     40.41%   40.43%   +0.01%     
  Complexity     2574     2574              
============================================
  Files          5179     5180       +1     
  Lines        349767   349987     +220     
  Branches      44714    44749      +35     
============================================
+ Hits         141373   141518     +145     
- Misses       208394   208469      +75

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724
CRZbulabula wants to merge 1 commit into
masterfrom
confignode-readonly-disk-check

CRZbulabula commented May 20, 2026

Uh oh!

sonarqubecloud Bot commented May 20, 2026

Uh oh!

codecov Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

CRZbulabula commented May 20, 2026

Description

Why

What changes

Design notes

i18n

Known follow-ups

Key changed/added classes in this PR

Uh oh!

sonarqubecloud Bot commented May 20, 2026

Quality Gate passed

Uh oh!

codecov Bot commented May 20, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant