Skip to content

fix(storage): constants + gossip tuning + Delete error surface (#495, #498, #500, #501, #502)#18

Open
jackthepunished wants to merge 4 commits into
mainfrom
fix/storage-cleanup-batch
Open

fix(storage): constants + gossip tuning + Delete error surface (#495, #498, #500, #501, #502)#18
jackthepunished wants to merge 4 commits into
mainfrom
fix/storage-cleanup-batch

Conversation

@jackthepunished

@jackthepunished jackthepunished commented May 13, 2026

Copy link
Copy Markdown
Owner

Summary

Targets upstream: poyrazK#495, poyrazK#498, poyrazK#500, poyrazK#501, poyrazK#502

Test plan

  • go build ./internal/storage/...
  • go test ./internal/storage/node/ (TestLocalStoreInvalidAbsolutePath is pre-existing failure on origin/main, unrelated)

Summary by CodeRabbit

  • Bug Fixes

    • Improved consistency across storage components to prevent system drift.
    • Enhanced robustness of data deletion with comprehensive error reporting.
  • Improvements

    • Made failure detection and health monitoring timeouts configurable for better operational tuning.

Review Change Stack

…lete errors (poyrazK#495, poyrazK#498, poyrazK#500, poyrazK#501, poyrazK#502)

* internal/storage/storageconst: new package that owns the previously
  duplicated MaxObjectSize and ChunkSize values; coordinator and node
  packages now alias from there so they cannot drift. (poyrazK#500, poyrazK#502)

* gossip.go: replace hardcoded 2s / 5s timeouts with named constants
  (defaultGossipRPCTimeout, defaultFailureDetectionTimeout,
  defaultFailCheckInterval) and add Set* overrides so embedders and
  tests can tune for higher-latency networks without recompiling. The
  alive→suspect threshold no longer assumes LAN. (poyrazK#500, poyrazK#501)

* gossip.go.sendGossip: tighten the double-checked locking note —
  the read-lock-release / write-lock-reacquire pattern already covers
  the documented race because the second hold re-checks both peers and
  members maps, but the new comments make that explicit and link it
  to the closePeerLocked invariant. (poyrazK#495)

* node/store.go.Delete: propagate os.Remove errors instead of dropping
  the meta-file removal error on the floor. Both failures are returned
  together so the operator can distinguish data-vs-meta partial
  failures. (poyrazK#498)
Copilot AI review requested due to automatic review settings May 13, 2026 23:58
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 54 minutes and 17 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8e0b042-5313-4815-8245-a017443a0332

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4667e and 65d05f3.

📒 Files selected for processing (1)
  • internal/storage/node/gossip.go
📝 Walkthrough

Walkthrough

This PR consolidates hardcoded storage size limits and chunk sizes into a shared storageconst package, then updates coordinator and node components to use these constants. It also refactors LocalStore.Delete to handle metadata and data file removal independently, and adds configurable gossip protocol timeouts via setter methods.

Changes

Storage Constants and Gossip Tuning

Layer / File(s) Summary
Shared storage constants foundation
internal/storage/storageconst/constants.go
New storageconst package exports MaxObjectSize (5 GB) and ChunkSize (1 MB) as the single source of truth for storage limits across components.
Coordinator service alignment
internal/storage/coordinator/service.go
Coordinator const block re-exports storage constants from storageconst instead of hardcoding chunkSize and maxObjectSize.
Node components storage constant alignment
internal/storage/node/rpc.go, internal/storage/node/store.go
defaultRetrieveChunkSize in rpc.go and maxObjectSize in store.go now reference shared storageconst constants, keeping size limits synchronized across node and coordinator.
Store Delete error handling refinement
internal/storage/node/store.go
LocalStore.Delete independently removes .meta sidecars and data files, combining both errors when both operations fail while avoiding stale metadata confusion.
Gossip protocol timeout configurability
internal/storage/node/gossip.go
New GossipProtocol fields and setter methods (SetFailureDetectionTimeout, SetGossipRPCTimeout, SetFailCheckInterval) enable overriding gossip timeouts before Start. Methods Start, detectFailures, and sendGossip use configured durations instead of hardcoded values.

🎯 2 (Simple) | ⏱️ ~12 minutes

A storage home, now shared and tidy,
Constants unified to keep things nidy,
Gossip protocols bow to the knob,
Deletions dance, both data and job.
🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: constants consolidation, gossip tuning configuration, and Delete error handling improvements, referencing the specific upstream issues addressed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/storage-cleanup-batch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jackthepunished and others added 2 commits May 14, 2026 03:21
Use %w for the primary error and %s for the secondary so errorlint's
no-non-wrapping-format-verb rule is happy without losing the meta error
context.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/node/gossip.go`:
- Around line 18-22: The comment wrongly states the gossip tuning constants are
"exported" while they are implemented as unexported (lowercase) identifiers;
update the comment in internal/storage/node/gossip.go to accurately describe
them as package-private (unexported) lowercase constants or alternatively export
the identifiers if intended—i.e., either change the wording to say "These
identifiers are package-private (unexported) so tests, embedders, and ops
tooling must override them within the package or via explicit export" or rename
the constants to exported names (capitalize) so the original text matches the
code; ensure the comment references the same "gossip protocol tuning constants"
block.
- Around line 75-97: The three setters
(GossipProtocol.SetFailureDetectionTimeout, SetGossipRPCTimeout,
SetFailCheckInterval) must guard against non-positive durations to avoid
panics/degenerate timers; add an early check like "if d <= 0 { return }" at the
start of each method before acquiring g.mu and assigning
g.failureDetectionTimeout, g.gossipRPCTimeout or g.failCheckInterval so invalid
inputs are ignored and existing routines using time.NewTicker or timeout logic
won't receive zero/negative durations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ddeee1a-2227-4f71-8c5a-f08501c36453

📥 Commits

Reviewing files that changed from the base of the PR and between 09730b4 and 8c4667e.

📒 Files selected for processing (5)
  • internal/storage/coordinator/service.go
  • internal/storage/node/gossip.go
  • internal/storage/node/rpc.go
  • internal/storage/node/store.go
  • internal/storage/storageconst/constants.go

Comment on lines +18 to +22
// Gossip protocol tuning constants. These are deliberately exported so that
// tests, embedders, and ops tooling can override them in lock-step without
// editing source. The defaults are conservative for low-latency LAN clusters
// and may need to be raised for WAN/high-latency deployments to avoid false
// positives.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Comment says constants are “exported”, but they are not.

These identifiers are lowercase (package-private), so the wording is inaccurate and can mislead embedders.

Suggested wording tweak
-// Gossip protocol tuning constants. These are deliberately exported so that
+// Gossip protocol tuning defaults. These are package-level defaults so that
 // tests, embedders, and ops tooling can override them in lock-step without
 // editing source. The defaults are conservative for low-latency LAN clusters
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/node/gossip.go` around lines 18 - 22, The comment wrongly
states the gossip tuning constants are "exported" while they are implemented as
unexported (lowercase) identifiers; update the comment in
internal/storage/node/gossip.go to accurately describe them as package-private
(unexported) lowercase constants or alternatively export the identifiers if
intended—i.e., either change the wording to say "These identifiers are
package-private (unexported) so tests, embedders, and ops tooling must override
them within the package or via explicit export" or rename the constants to
exported names (capitalize) so the original text matches the code; ensure the
comment references the same "gossip protocol tuning constants" block.

Comment on lines +75 to 97
// SetFailureDetectionTimeout overrides the alive→suspect promotion threshold.
// Must be called before Start.
func (g *GossipProtocol) SetFailureDetectionTimeout(d time.Duration) {
g.mu.Lock()
defer g.mu.Unlock()
g.failureDetectionTimeout = d
}

// SetGossipRPCTimeout overrides the per-RPC Gossip timeout.
// Must be called before Start.
func (g *GossipProtocol) SetGossipRPCTimeout(d time.Duration) {
g.mu.Lock()
defer g.mu.Unlock()
g.gossipRPCTimeout = d
}

// SetFailCheckInterval overrides how often detectFailures runs.
// Must be called before Start.
func (g *GossipProtocol) SetFailCheckInterval(d time.Duration) {
g.mu.Lock()
defer g.mu.Unlock()
g.failCheckInterval = d
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate non-positive duration overrides in setter APIs.

Set* currently accepts <= 0 durations. That can panic time.NewTicker at Line 141 and cause immediate/degenerate timeouts at Line 163 and Line 320.

Suggested guard pattern
 func (g *GossipProtocol) SetFailureDetectionTimeout(d time.Duration) {
 	g.mu.Lock()
 	defer g.mu.Unlock()
+	if d <= 0 {
+		g.logger.Warn("ignoring non-positive failure detection timeout", "value", d, "using_default", defaultFailureDetectionTimeout)
+		d = defaultFailureDetectionTimeout
+	}
 	g.failureDetectionTimeout = d
 }

 func (g *GossipProtocol) SetGossipRPCTimeout(d time.Duration) {
 	g.mu.Lock()
 	defer g.mu.Unlock()
+	if d <= 0 {
+		g.logger.Warn("ignoring non-positive gossip RPC timeout", "value", d, "using_default", defaultGossipRPCTimeout)
+		d = defaultGossipRPCTimeout
+	}
 	g.gossipRPCTimeout = d
 }

 func (g *GossipProtocol) SetFailCheckInterval(d time.Duration) {
 	g.mu.Lock()
 	defer g.mu.Unlock()
+	if d <= 0 {
+		g.logger.Warn("ignoring non-positive fail-check interval", "value", d, "using_default", defaultFailCheckInterval)
+		d = defaultFailCheckInterval
+	}
 	g.failCheckInterval = d
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/node/gossip.go` around lines 75 - 97, The three setters
(GossipProtocol.SetFailureDetectionTimeout, SetGossipRPCTimeout,
SetFailCheckInterval) must guard against non-positive durations to avoid
panics/degenerate timers; add an early check like "if d <= 0 { return }" at the
start of each method before acquiring g.mu and assigning
g.failureDetectionTimeout, g.gossipRPCTimeout or g.failCheckInterval so invalid
inputs are ignored and existing routines using time.NewTicker or timeout logic
won't receive zero/negative durations.

Setters no-op after Start() is called to prevent confusing race conditions
where a goroutine reads a new value while the old value is already live in
the tickers. A started flag is set under mutex protection in Start().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants