fix(persistence): refuse multi-shard AOF at startup + gate BGREWRITEAOF (P0-FIX-01a/b) by pilotspacex-byte · Pull Request #129 · pilotspace/moon

pilotspacex-byte · 2026-05-26T15:23:06Z

Summary

P0-FIX-01a (command-level, defence-in-depth): BGREWRITEAOF returns a clear ERR under --shards >= 2 + --disk-offload enable + --appendonly yes, pointing operators at docs/runbooks/multi-shard-aof-rewrite.md.
P0-FIX-01b (startup, load-bearing): Moon refuses to start under --shards >= 2 + --appendonly yes unless --unsafe-multishard-aof is passed as an explicit escape hatch (for cache-only deployments). Exits with code 2 + actionable error.
docs(readme,changelog) (2nd commit) sharpens v0.1.12 launch posture, adds the Valkey 9.1.0 column + Moon vs Redis vs Valkey section, splits "what's in main" into v0.1.12 GA vs v0.2.0-alpha additions, and lands the missing v0.1.9 / v0.1.10 / v0.1.12 CHANGELOG entries.
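
The two gates key on slightly different conditions, which a pair of pure predicates makes easier to see. The following is a minimal sketch, not the code in this PR: `GateInputs` and both function names are hypothetical, and it assumes (from the wording of the summary) that `--unsafe-multishard-aof` lifts only the startup gate while the BGREWRITEAOF gate stays in force.

```rust
/// Hypothetical, simplified view of the flags the two gates look at.
/// Field names are illustrative and may not match `ServerConfig`.
#[derive(Debug, Clone, Copy)]
pub struct GateInputs {
    pub shards: usize,
    pub appendonly: bool,
    pub disk_offload: bool,
    pub unsafe_multishard_aof: bool,
}

/// P0-FIX-01b (startup): refuse multi-shard + AOF unless the operator
/// passed the explicit escape hatch.
pub fn startup_refused(c: &GateInputs) -> bool {
    c.shards >= 2 && c.appendonly && !c.unsafe_multishard_aof
}

/// P0-FIX-01a (command level): refuse BGREWRITEAOF under multi-shard +
/// disk-offload + AOF. Assumption: the escape hatch does not lift this gate.
pub fn rewrite_refused(c: &GateInputs) -> bool {
    c.shards >= 2 && c.disk_offload && c.appendonly
}
```

Under these assumptions, a deployment that starts with the escape hatch still gets the ERR from BGREWRITEAOF, which is what makes the command-level gate defence-in-depth rather than redundant.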

Why

Empirical re-verification on HEAD 6e49050 (2026-05-26) found the durability bug is in the multi-shard AOF path itself, not the rewrite path that the 33-day-old memory blamed:

Configuration	Recovered
`--shards 1 --appendonly yes --appendfsync always` (control)	5000 / 5000 ✅
`--shards 1 --disk-offload enable --appendonly yes` (control)	12714 / 12714 ✅
`--shards 2 --disk-offload enable --appendonly yes` (BGREWRITEAOF + SIGKILL)	7892 / 12662 ❌ −38 %
`--shards 2 --disk-offload enable --appendonly yes` (plain SIGKILL, no rewrite)	7888 / 12655 ❌ −38 %
`--shards 2 --disk-offload enable --appendonly yes --appendfsync always`	2474 / 5000 ❌ −50 %
`--shards 2 --disk-offload disable --appendonly yes --appendfsync always`	2453 / 5000 ❌ −50 %

--appendfsync always does not save you. --disk-offload disable does not save you. Only --shards 1 recovers reliably. Multi-shard masters are explicitly cache-only in v1.0 until the v2.0 multi-shard AOF replay lands (P0-INVEST-01, 1-2 wk, tracked in tmp/SHIP-PLAN-v1.0-rc1-single-node.md).
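
The loss percentages can be recomputed from the recovered/expected counts in the table. This is a small sketch for checking the arithmetic; `loss_percent` and `ROWS` are illustrative names, and the second number in each row is assumed to be the count of keys expected after recovery.

```rust
/// Percentage of expected keys that were missing after recovery.
pub fn loss_percent(recovered: u64, expected: u64) -> f64 {
    assert!(expected > 0 && recovered <= expected);
    100.0 * (expected - recovered) as f64 / expected as f64
}

/// (configuration, recovered, expected) rows copied from the table above.
pub const ROWS: [(&str, u64, u64); 6] = [
    ("shards 1, appendfsync always", 5000, 5000),
    ("shards 1, disk-offload enable", 12714, 12714),
    ("shards 2, disk-offload enable, BGREWRITEAOF + SIGKILL", 7892, 12662),
    ("shards 2, disk-offload enable, plain SIGKILL", 7888, 12655),
    ("shards 2, disk-offload enable, appendfsync always", 2474, 5000),
    ("shards 2, disk-offload disable, appendfsync always", 2453, 5000),
];
```

Evaluating `loss_percent` over `ROWS` gives 0 % for both single-shard controls, about 37.7 % for the two everysec-style multi-shard rows, and about 50.5 % and 50.9 % for the two `--appendfsync always` rows, in line with the rounded figures in the table.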

Test plan

Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config covers gate-on + gate-off paths and asserts the gate does not flip AOF_REWRITE_IN_PROGRESS. Green on cargo test --release --lib persistence::tests.
Live OrbStack boundary tests:
- PASS --shards 1 + AOF starts cleanly (no false positives)
- PASS --shards 2 + AOF + --unsafe-multishard-aof starts (escape hatch)
- PASS --shards 2 + --appendonly no starts (cache-only)
- REFUSED --shards 2 + AOF without escape hatch (exit code 2 + documented stderr)
Re-ran the original repro post-fix: BGREWRITEAOF returns the documented ERR line at the wire.
CI green (rust 1.94, both feature sets, clippy clean, audit-unsafe, audit-unwrap).
Encode the 5-config discriminator as #[ignore] crash-matrix tests under tests/ (tracked as CRASH-01-LITE in the ship plan; not in this PR).
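
The property the unit test asserts (a refusal must not flip `AOF_REWRITE_IN_PROGRESS`) follows from checking the gate before claiming the rewrite slot. The sketch below illustrates that ordering with stand-ins; it reuses the static names mentioned in this PR, but the function body, return type, and message strings are assumptions and not the code in src/command/persistence.rs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins reusing the names from the PR description; the real statics
// may be declared differently.
pub static MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool = AtomicBool::new(false);
pub static AOF_REWRITE_IN_PROGRESS: AtomicBool = AtomicBool::new(false);

/// Simplified stand-in for `bgrewriteaof_start_sharded`. The gate is read
/// first, so a refusal returns before the in-progress flag is touched.
pub fn bgrewriteaof_start_sharded_sketch() -> Result<&'static str, &'static str> {
    if MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Acquire) {
        // Illustrative text only; the real ERR line is documented in the runbook.
        return Err("ERR BGREWRITEAOF refused for this configuration");
    }
    // Claim the rewrite slot atomically; fail if a rewrite is already running.
    AOF_REWRITE_IN_PROGRESS
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .map(|_| "rewrite started")
        .map_err(|_| "ERR rewrite already in progress")
}
```

Because both statics are process-wide, a test written against this shape would need to restore them afterwards (or serialize with other tests touching the same flags), which is presumably why the real test is described as concurrency-safe.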

Operator impact

Existing --shards >= 2 + --appendonly yes deployments will fail to start after upgrade. The error message is actionable: pick --shards 1, --appendonly no, or --unsafe-multishard-aof. Runbook walks each option.
Single-shard deployments are unaffected.
--appendonly no (any shard count) is unaffected.

Out of scope (next PRs)

P0-INVEST-01: root-cause the multi-shard AOF durability bug. Gates lift when this lands.
CRASH-01-LITE: scripted crash-matrix tests in tests/.
v0.1.13 tag once the gates soak for ~1 week.

Summary by CodeRabbit

Bug Fixes
- Added safety gate preventing data loss in multi-shard with AOF persistence configurations.
New Features
- Introduced --unsafe-multishard-aof flag to override durability safety checks when needed.
Documentation
- Updated release notes, README with production readiness matrix, and operator runbooks.

…OF (P0-FIX-01a/b)

Empirical re-verification on HEAD 6e49050 (2026-05-26) found that `--shards >= 2 + --appendonly yes` silently loses ~50 % of writes on SIGKILL, independent of `--appendfsync` and `--disk-offload`. The original 33-day-old bug memory had narrowed the loss to BGREWRITEAOF + disk-offload; the discriminator matrix below shows the bug is in the multi-shard AOF durability path itself.

| Configuration | Recovered |
|---|---|
| --shards 1 --appendonly yes --appendfsync always | 5000 / 5000 |
| --shards 1 --disk-offload enable --appendonly yes | 12714 / 12714 |
| --shards 2 --disk-offload enable --appendonly yes (BGREWRITEAOF + SIGKILL) | 7892 / 12662 |
| --shards 2 --disk-offload enable --appendonly yes (plain SIGKILL, no rewrite) | 7888 / 12655 |
| --shards 2 --disk-offload enable --appendonly yes --appendfsync always | 2474 / 5000 |
| --shards 2 --disk-offload disable --appendonly yes --appendfsync always | 2453 / 5000 |

Two complementary gates ship in this commit; both lift in v2.0 when multi-shard AOF replay walks every shard's segment manifest on recovery (see docs/runbooks/multi-shard-aof-rewrite.md):

P0-FIX-01a (defence-in-depth, command-level)
bgrewriteaof_start_sharded refuses with a clear ERR when the multi-shard + disk-offload + AOF combo is active. Gated by MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool, set once in main.rs. Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config covers gate-on + gate-off paths and asserts the gate does not flip AOF_REWRITE_IN_PROGRESS.

P0-FIX-01b (load-bearing, startup)
main.rs aborts with exit code 2 if `--shards >= 2 + --appendonly yes` without `--unsafe-multishard-aof`. The new flag is the explicit escape hatch for cache-only deployments where the loss window is acceptable.

Boundary tests verified live on OrbStack:
PASS --shards 1 + AOF starts cleanly (no false positives)
PASS --shards 2 + AOF + --unsafe-multishard-aof starts
PASS --shards 2 + --appendonly no starts (cache-only)
REFUSED --shards 2 + AOF without escape hatch

Files
src/command/persistence.rs + gate + unit test
src/main.rs + startup refusal + BGREWRITEAOF gate set
src/config.rs + --unsafe-multishard-aof flag
docs/runbooks/multi-shard-aof-rewrite.md + operator runbook

Reproducer scripts live in tmp/ (gitignored): p0-repro.sh, p0-no-rewrite.sh, p0-always.sh, p0-multishard-no-offload.sh, p0-shards1-exact.sh. Encoding them as #[ignore] crash-matrix tests is tracked as CRASH-01-LITE in the ship plan.

Multi-shard masters with AOF are now explicitly cache-only in v1.0. Root-cause investigation P0-INVEST-01 (1-2 wk) is the prerequisite to lifting the startup gate in v2.0.

author: Tin Dang

…lpha-leak qualifiers

README
* Bumps version badge v0.1.10 → v0.1.12 and replaces the "experimental" status with "single-node production-grade" plus a "cluster v0.2 alpha" badge, mirroring the new ship plan posture.
* Replaces the blanket experimental warning with a "production-grade architecture, pre-1.0 maturity" framing that points at the new Production readiness section for the honest GA matrix.
* Reconciles platform support: macOS is a supported development platform per the PRODUCTION-CONTRACT Tier table; production deployments target Linux.
* Adds a Valkey 9.1.0 column to the peak-throughput tables (honest "not yet benched" placeholders) and a new Moon vs Redis vs Valkey section: a three-way comparison table plus "when to choose" guidance, all traced to docs/comparison-valkey.md.
* Rewrites the trailing roadmap into a Production readiness section with what's GA today, what's not, operator gotchas, and a roadmap table.

Alpha-leak qualifiers added so v0.1.12 framing does not implicitly promise v0.2.0-alpha features:
* Quick-start HEXPIRE / HTTL lines annotated "(v0.2.0-alpha; build from main)".
* Hash-field TTL benchmark section retitled "v0.2.0-alpha preview" with a callout that the latest tag (v0.1.12) does not include it.
* "What's already in main" list split into v0.1.12 (latest tag, single-node production-grade) and v0.2.0-alpha additions (hash-field TTL, PITR, CDC, multi-node cluster soak).
* Comparison-table row for hash-field TTL qualified as "v0.2-alpha".

CHANGELOG
* Adds v0.1.12 entry covering Phase 189 (DashTable pre-sizing + --initial-keyspace-hint, PERF-07/09), Phase 190 (moon_memory_bytes Prometheus gauge with 7 subsystem kinds, MEMORY DOCTOR schema, resident_bytes trait), Phase 191 (jemalloc narenas:8 cap, --memory-arenas-cap, mimalloc-alt feature, OPERATOR-GUIDE Memory Accounting), Phase 177 dispatch observability, text-index default feature, SDK validate.{py,rs}, Python SDK graph parser fix, CI hygiene.
* Adds v0.1.10 entry (single-shard PSYNC2 wired end-to-end).
* Adds v0.1.9 Lunaris Retriever Gap Closure entry.
* Consolidates three orphan Unreleased blocks under v0.1.3.
* Sharpens v0.2.0-alpha entry with TL;DR headline capabilities (hash-field TTL stack, PITR, CDC, multi-node cluster soak).
* Fixes version ordering so v0.1.12 sits above v0.1.11.

No code changes; this is purely documentation framing aligned to the v1.0-rc1 single-node ship plan in tmp/SHIP-PLAN-v1.0-rc1-single-node.md.

author: Tin Dang

qodo-code-review · 2026-05-26T15:23:12Z

Qodo reviews are paused for this user.

coderabbitai · 2026-05-26T15:23:23Z

📝 Walkthrough

This PR implements safety gates for a known multi-shard AOF durability bug: it refuses startup under unsafe configurations unless explicitly overridden, gates the BGREWRITEAOF command at runtime, and documents the issue and recovery procedures for operators. Documentation and marketing content are updated to reflect the v0.1.12 release and production readiness status.

Changes

Multi-Shard AOF Safety Gates

Layer / File(s)	Summary
Configuration flag and startup enforcement `src/config.rs`, `src/main.rs`	Adds `--unsafe-multishard-aof` CLI flag (default false) to `ServerConfig`; main startup logic refuses to start with multi-shard + appendonly + non-override, exits code 2 with data-loss warning.
Runtime BGREWRITEAOF command gating `src/command/persistence.rs`, `src/main.rs`	Introduces `MULTI_SHARD_AOF_REWRITE_UNSAFE` AtomicBool for process-wide gating; `bgrewriteaof_start_sharded` returns error frame before dispatch when unsafe multi-shard + disk-offload + appendonly conditions detected; main.rs sets gate and emits operator warnings at startup.
Gate validation tests `src/command/persistence.rs`	Concurrency-safe test verifies unsafe gate produces expected error, does not set `AOF_REWRITE_IN_PROGRESS`, and gate disabling removes unsafe error.
Release notes and operational runbook `CHANGELOG.md`, `docs/runbooks/multi-shard-aof-rewrite.md`	CHANGELOG reorganized with v0.1.12 release and v0.2.0-alpha unreleased features; new runbook documents refusal gates, data-loss measurements, recovery, avoidance, v2.0 removal timeline, and telemetry signals.
Product documentation and marketing `README.md`	Badges updated to v0.1.12; introductory section rewritten for production-grade positioning; Benchmarks and Hash-field TTL expanded with Moon vs Redis vs Valkey comparisons; Production readiness matrix added with recommended configs, operator gotchas, shipped/in-progress features.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A gate stands guard where shards align,
With AOF that dares to write—
One flag to rule the dangerous line,
And tests that burn both day and night.
Now operators know the path to flight! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: refusing multi-shard AOF at startup and gating BGREWRITEAOF, matching the two core P0 fixes (P0-FIX-01a/b) implemented across config, command, and main modules.
Description check	✅ Passed	The description comprehensively covers the summary, rationale, test plan, and operator impact, though it lacks explicit confirmation that the standard checklist items (cargo fmt, clippy, cargo test, consistency tests) have passed.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)

229-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Quick-start production flags now conflict with startup safety gate.

This command should fail under the new startup refusal (--shards >= 2 + --appendonly yes without override), so the README is currently instructing an invalid config.

Suggested README correction

 # Or with production flags
 ./target/release/moon \
   --port 6379 \
-  --shards 8 \
-  --appendonly yes --appendfsync everysec \
+  --shards 1 \
+  --appendonly yes --appendfsync everysec \
   --maxmemory 8g --maxmemory-policy allkeys-lfu
+
+# Multi-shard cache-only alternative
+# ./target/release/moon --shards 8 --appendonly no ...
+
+# Unsafe override (not recommended; known durability risk)
+# ./target/release/moon --shards 8 --appendonly yes --unsafe-multishard-aof ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 229 - 233, The README's quick-start example uses
conflicting flags (--shards 8 together with --appendonly yes) which will trigger
the new startup safety gate and refuse to start; update the example command
under the block that contains the flags (--port, --shards, --appendonly,
--appendfsync, --maxmemory, --maxmemory-policy) to a valid configuration (e.g.,
set --shards 1 or remove/disable --appendonly) or explicitly show the required
override flag and text that allows bypassing the safety gate (add a clear
placeholder like --<startup-override> if an override exists) so the documented
command actually starts successfully.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/runbooks/multi-shard-aof-rewrite.md`:
- Around line 10-16: Three fenced code blocks in
docs/runbooks/multi-shard-aof-rewrite.md are missing language identifiers
(markdownlint MD040). Edit the three blocks shown (the startup refusal block
starting "REFUSING TO START: --shards 2 + --appendonly yes...", the BGREWRITEAOF
interaction block containing "BGREWRITEAOF" and "(error) ERR BGREWRITEAOF...",
and the final explanatory block starting "BGREWRITEAOF gated for this
config...") and add the language tags: use ```text for the two plain-text blocks
and ```redis for the BGREWRITEAOF example so markdownlint MD040 is satisfied.

In `@src/main.rs`:
- Around line 273-289: The --check-config path currently returns before the
multishard-AOF safety gate runs, so add the same refusal logic used at startup
into the check_config branch: detect the condition (num_shards >= 2 &&
config.appendonly == "yes" && !config.unsafe_multishard_aof) inside the
check_config handling and print the identical error message and exit non‑zero
(or return an error) so preflight fails the same way real startup would; use the
same symbols/strings (num_shards, config.appendonly,
config.unsafe_multishard_aof) and the same message text used near the startup
gate to keep behavior consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c11a2da9-b702-43f0-91ac-59786ae9a841

📥 Commits

Reviewing files that changed from the base of the PR and between 6e49050 and 7b61898.

📒 Files selected for processing (6)

CHANGELOG.md
README.md
docs/runbooks/multi-shard-aof-rewrite.md
src/command/persistence.rs
src/config.rs
src/main.rs

coderabbitai · 2026-05-26T15:27:54Z

+```
+REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
+bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1,
+or pass --appendonly no for cache-only deployments, or pass
+--unsafe-multishard-aof to acknowledge the risk and start anyway. See
+docs/runbooks/multi-shard-aof-rewrite.md.
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fenced code languages to satisfy markdownlint MD040.

These three fenced blocks are missing language identifiers and will keep markdownlint warnings active.

Suggested doc-only fix

-```
+```text
 REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
 ...
-```
+```
-```
+```redis
 > BGREWRITEAOF
 (error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable
 ...
-```
+```
-```
+```text
 BGREWRITEAOF gated for this config (known data-loss path; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or --disk-offload disable to re-enable rewrite.
-```
+```

Also applies to: 20-26, 88-92
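
For a quick local check of the same rule without running markdownlint, the scan can be approximated in a few lines. This is a rough sketch, not markdownlint's implementation: `unlabeled_fences` is a hypothetical helper, and it only recognises backtick fences, ignoring tilde fences, indentation limits, and fence-length matching that a full CommonMark parser would handle.

```rust
/// Returns the 1-based line numbers of opening code fences that carry no
/// language tag. A fence is taken to be any line whose first non-blank
/// characters are three or more backticks.
pub fn unlabeled_fences(markdown: &str) -> Vec<usize> {
    let mut inside = false;
    let mut hits = Vec::new();
    for (idx, line) in markdown.lines().enumerate() {
        let trimmed = line.trim_start();
        let ticks = trimmed.chars().take_while(|&c| c == '`').count();
        if ticks < 3 {
            continue; // not a fence line
        }
        if inside {
            inside = false; // closing fence
        } else {
            inside = true; // opening fence: inspect the info string
            // Backticks are single-byte, so slicing by count is safe here.
            if trimmed[ticks..].trim().is_empty() {
                hits.push(idx + 1);
            }
        }
    }
    hits
}
```

Run against the runbook text, a helper like this would be expected to flag the opening lines of the three blocks named in the comment, assuming they are the only untagged fences in the file.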

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

coderabbitai · 2026-05-26T15:27:54Z

+    // P0-FIX-01b: refuse to start under the known durability bug
+    // (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL,
+    //  verified 2026-05-26 on HEAD `6e49050`; reproducer in
+    //  `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`).  The bug is
+    // independent of `--appendfsync` and `--disk-offload` settings.  An
+    // operator can override via `--unsafe-multishard-aof` if the
+    // deployment is cache-only and the loss window is acceptable.
+    if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+        eprintln!(
+            "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \
+             bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \
+             --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \
+             acknowledge the risk and start anyway. See \
+             docs/runbooks/multi-shard-aof-rewrite.md."
+        );
+        std::process::exit(2);
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror this refusal in --check-config validation.

Line 143 returns from --check-config before Line 280 runs, so preflight can pass a config that real startup immediately refuses. Please enforce the same gate in the check_config branch.

Suggested patch

@@
 if config.check_config {
+    if config.shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+        return Err(anyhow::anyhow!(
+            "--shards {} + --appendonly yes is refused unless --unsafe-multishard-aof is set (or use --shards 1 / --appendonly no)",
+            config.shards
+        ));
+    }
     // Validate shard count is reasonable
     if config.shards == 0 {
         return Err(anyhow::anyhow!("--shards must be >= 1"));
     }
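
One way to keep preflight and real startup from drifting apart is to route both through a single validation function instead of duplicating the condition. The sketch below shows that shape under stated assumptions: `ConfigSketch`, `validate_multishard_aof`, and `preflight_or_start` are hypothetical names, the struct is a cut-down stand-in for `ServerConfig`, and the message is shortened for illustration.

```rust
/// Hypothetical subset of the server configuration; the real `ServerConfig`
/// in src/config.rs has more fields and may use different types.
pub struct ConfigSketch {
    pub shards: usize,
    pub appendonly: String,
    pub unsafe_multishard_aof: bool,
    pub check_config: bool,
}

/// Single source of truth for the multi-shard AOF refusal.
pub fn validate_multishard_aof(c: &ConfigSketch) -> Result<(), String> {
    if c.shards >= 2 && c.appendonly == "yes" && !c.unsafe_multishard_aof {
        return Err(format!(
            "REFUSING TO START: --shards {} + --appendonly yes requires \
             --unsafe-multishard-aof (or use --shards 1 / --appendonly no)",
            c.shards
        ));
    }
    Ok(())
}

/// Returns the exit code the caller would use: 2 on refusal, 0 otherwise.
/// The gate runs before the `check_config` early return, so preflight and
/// startup cannot disagree.
pub fn preflight_or_start(c: &ConfigSketch) -> i32 {
    if let Err(msg) = validate_multishard_aof(c) {
        eprintln!("{msg}");
        return 2;
    }
    if c.check_config {
        return 0; // configuration is valid; do not start the server
    }
    // Real startup would continue here.
    0
}
```

Compared with the suggested patch, this variant also makes `--check-config` fail with the same exit code as startup rather than a generic error, which may or may not be the behaviour the maintainers want.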

TinDang97 added 2 commits May 26, 2026 22:19

pilotspacex-byte wants to merge 2 commits into main from fix/p0-multishard-aof-gate
