Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash) #17724
Open
CRZbulabula wants to merge 1 commit into
…check

ConfigNode now reports its own NodeStatus.ReadOnly when its critical directories (systemDir, consensusDir) are unwritable or near-full, mirroring the existing DataNode behavior. NodeStatus reasons are extended with a new DISK_CRASH constant alongside DISK_FULL, and the ConfigNode heartbeat carries status/statusReason back to the leader.

- node-commons: new DiskChecker utility (probe + state-machine apply), with priority DiskCrash > DiskFull and recovery to Running when the reason was disk-related. i18n messages added in en + zh.
- thrift-confignode: TConfigNodeHeartbeatResp gains optional status and statusReason fields (forward-compatible).
- confignode: leader self-checks before fanning out heartbeats; follower self-checks on receive and reports back; cache reads from CommonConfig for the leader's self entry, otherwise from the sample.
- datanode: FolderManager exposes a static hasAnyAbnormalFolder() aggregator; sampleDiskLoad treats any ABNORMAL folder as DiskCrash (which wins over DiskFull) and reuses DiskChecker.apply.
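The transition rule stated in the commit message (DiskCrash outranks DiskFull, and recovery to Running happens only when the current reason is disk-related) can be sketched as follows. This is a minimal illustration, not the PR's `DiskChecker`: `NodeState`, `DiskHealth`, `DiskStateMachineSketch`, and the reason strings are hypothetical stand-ins, and the choice to overwrite an earlier reason on a disk failure is an assumption the source does not spell out.

```java
/** Hypothetical, simplified stand-in for the status fields the PR keeps in CommonConfig. */
final class NodeState {
    enum Status { RUNNING, READ_ONLY }

    // Assumed string values; the real constants are defined in NodeStatus.
    static final String DISK_FULL = "DiskFull";
    static final String DISK_CRASH = "DiskCrash";

    Status status = Status.RUNNING;
    String reason = null;
}

/** Outcome of one round of disk checks (hypothetical name). */
enum DiskHealth { NORMAL, DISK_FULL, DISK_CRASH }

final class DiskStateMachineSketch {

    /** Collapse per-directory findings into one result: DiskCrash > DiskFull > Normal. */
    static DiskHealth worst(boolean anyCrashed, boolean anyFull) {
        if (anyCrashed) {
            return DiskHealth.DISK_CRASH;
        }
        return anyFull ? DiskHealth.DISK_FULL : DiskHealth.NORMAL;
    }

    static boolean isDiskReason(String reason) {
        return NodeState.DISK_FULL.equals(reason) || NodeState.DISK_CRASH.equals(reason);
    }

    /** Apply one check result to the node state. */
    static void apply(NodeState state, DiskHealth health) {
        switch (health) {
            case DISK_CRASH:
                // Assumption: a disk failure overwrites whatever reason was set before.
                state.status = NodeState.Status.READ_ONLY;
                state.reason = NodeState.DISK_CRASH;
                break;
            case DISK_FULL:
                state.status = NodeState.Status.READ_ONLY;
                state.reason = NodeState.DISK_FULL;
                break;
            default:
                // Recover only if this check was what put the node into ReadOnly;
                // a ReadOnly set for any other reason is left untouched.
                if (state.status == NodeState.Status.READ_ONLY && isDiskReason(state.reason)) {
                    state.status = NodeState.Status.RUNNING;
                    state.reason = null;
                }
                break;
        }
    }
}
```

Keeping the recovery branch conditional on the reason is what prevents a healthy disk check from silently undoing a ReadOnly state that an operator set for other purposes.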
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #17724 +/- ##
============================================
+ Coverage 40.41% 40.43% +0.01%
Complexity 2574 2574
============================================
Files 5179 5180 +1
Lines 349767 349987 +220
Branches 44714 44749 +35
============================================
+ Hits 141373 141518 +145
- Misses 208394 208469 +75



Description
Why
ConfigNode currently has no notion of being `ReadOnly`: only DataNode samples its disk and self-marks. The heartbeat from the leader to other ConfigNodes is effectively just a liveness ping, so a ConfigNode with a full or crashed disk would silently keep accepting writes from peers / clients. This PR extends `ReadOnly` tracking to ConfigNode and adds a new disk-failure reason (`DiskCrash`) on top of the existing `DiskFull`.
What changes
- `NodeStatus.DISK_CRASH` constant added next to `DISK_FULL` (both are still string identifiers stored in `statusReason`).
- `org.apache.iotdb.commons.cluster.DiskChecker` in `node-commons`: probes a list of directories via test-write + free-space ratio, then runs a single state-machine apply on `CommonConfig` with priority DiskCrash > DiskFull > Normal. Recovery to `Running` only fires when the prior reason was disk-related; other `ReadOnly` reasons (e.g. manual maintenance) are left untouched.
- The leader self-checks `[systemDir, consensusDir]` at the top of `HeartbeatService#heartbeatLoopBody` (before fanning out heartbeats).
- A follower self-checks on receipt in `ConfigNodeRPCServiceProcessor#getConfigNodeHeartBeat`, and reports `status` + `statusReason` back via newly-added optional fields 4 and 5 on `TConfigNodeHeartbeatResp` (forward-compatible: old peers simply leave them unset).
- `ConfigNodeHeartbeatCache#updateCurrentStatistics` no longer short-circuits for `CURRENT_NODE_ID`; instead, the self entry mirrors `CommonConfig`, so `show confignodes` reflects the leader's own disk state.
- `FolderManager` now registers each instance into a weak-ref static list and exposes `static boolean hasAnyAbnormalFolder()`.
- `DataNodeInternalRPCServiceImpl#sampleDiskLoad` consults that aggregator and, when any folder is `ABNORMAL`, maps to `DiskCrash` (which outranks the existing free-ratio `DiskFull` check). State-machine application is delegated to `DiskChecker.apply` so DataNode and ConfigNode follow identical transition rules.
Design notes
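As a rough illustration of the per-directory check that `DiskChecker` is described as running on ConfigNode (a free-space ratio plus a tiny test write), the probe might look something like the sketch here. It is not the PR's code: the class name `DiskProbeSketch`, the `Result` enum, the `minFreeRatio` parameter, and the probe file prefix are all hypothetical. Only `File.getUsableSpace`, `File.getTotalSpace`, and `Files.createTempFile` come from the description.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of a directory health probe; names are illustrative only. */
final class DiskProbeSketch {

    enum Result { NORMAL, DISK_FULL, DISK_CRASH }

    /** Create, write, and delete a tiny temp file; any failure counts as "not writable". */
    static boolean testWrite(File dir) {
        try {
            Path tmp = Files.createTempFile(dir.toPath(), "disk-probe", ".tmp");
            try {
                Files.write(tmp, new byte[] {1});
            } finally {
                Files.deleteIfExists(tmp);
            }
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    /** True when the usable fraction of the partition drops below minFreeRatio. */
    static boolean nearlyFull(File dir, double minFreeRatio) {
        long total = dir.getTotalSpace();
        if (total <= 0) {
            // Size unknown (getTotalSpace returns 0 when the path names no partition);
            // leave the verdict to the write test.
            return false;
        }
        return (double) dir.getUsableSpace() / total < minFreeRatio;
    }

    /** Probe every directory; a failed write outranks a low free-space ratio. */
    static Result probe(List<File> dirs, double minFreeRatio) {
        boolean anyFull = false;
        for (File dir : dirs) {
            if (!testWrite(dir)) {
                return Result.DISK_CRASH;
            }
            anyFull |= nearlyFull(dir, minFreeRatio);
        }
        return anyFull ? Result.DISK_FULL : Result.NORMAL;
    }
}
```

A caller would pass the directories to watch, for example `DiskProbeSketch.probe(Arrays.asList(systemDir, consensusDir), 0.05)`, where the 5% threshold is an arbitrary placeholder. Because the write test reruns on every call, a directory that becomes writable again is reported as healthy on the next round, which matches the auto-recovery behavior the description attributes to ConfigNode.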
- DataNode keeps its existing disk-space metric (`SystemMetric.SYS_DISK_AVAILABLE_SPACE`) for `DiskFull`. The new `DiskCrash` signal is path-scoped: it just observes already-recorded write failures rather than probing IO itself. ConfigNode runs both checks per-directory through `DiskChecker` (`File.getUsableSpace`/`getTotalSpace` + a tiny `Files.createTempFile`/write/delete probe), giving symmetric behavior on the two nodes from the cluster's perspective.
- The new Thrift fields are `optional` to keep rolling upgrade safe: an older ConfigNode that doesn't populate them parses as `Running` with no reason.
- There is no auto-recovery from `ReadOnly(DiskCrash)` on DataNode (it would require new `ABNORMAL -> HEALTHY` transitions inside `FolderManager` and is left as follow-up). ConfigNode does auto-recover, because `testWrite` reruns every heartbeat.
i18n
New disk health messages live in `CommonMessages` under both `src/main/i18n/en` and `src/main/i18n/zh`:
- `DISK_FULL_SET_READ_ONLY`
- `DISK_CRASH_SET_READ_ONLY`
- `DISK_CRASH_PROBE_FAILED`
- `DISK_RECOVERED_SET_RUNNING`
The existing inline English log in `sampleDiskLoad` is retained (just descriptive context); the state-change log itself is routed through `CommonMessages` so the Chinese build works out of the box.
This PR has:
Known follow-ups
Key changed/added classes in this PR
New
Modified