fix(persistence): refuse multi-shard AOF at startup + gate BGREWRITEAOF (P0-FIX-01a/b)#129

Open
pilotspacex-byte wants to merge 2 commits into
mainfrom
fix/p0-multishard-aof-gate

Conversation


@pilotspacex-byte pilotspacex-byte commented May 26, 2026

Summary

  • P0-FIX-01a (command-level, defence-in-depth): BGREWRITEAOF returns a clear ERR under --shards >= 2 + --disk-offload enable + --appendonly yes, pointing operators at docs/runbooks/multi-shard-aof-rewrite.md.
  • P0-FIX-01b (startup, load-bearing): Moon refuses to start under --shards >= 2 + --appendonly yes unless --unsafe-multishard-aof is passed as an explicit escape hatch (for cache-only deployments). Exits with code 2 + actionable error.
  • docs(readme,changelog) (2nd commit) sharpens v0.1.12 launch posture, adds the Valkey 9.1.0 column + Moon vs Redis vs Valkey section, splits "what's in main" into v0.1.12 GA vs v0.2.0-alpha additions, and lands the missing v0.1.9 / v0.1.10 / v0.1.12 CHANGELOG entries.

Why

Empirical re-verification on HEAD 6e49050 (2026-05-26) found the durability bug is in the multi-shard AOF path itself, not the rewrite path that the 33-day-old memory blamed:

| Configuration | Recovered |
|---|---|
| --shards 1 --appendonly yes --appendfsync always (control) | 5000 / 5000 ✅ |
| --shards 1 --disk-offload enable --appendonly yes (control) | 12714 / 12714 ✅ |
| --shards 2 --disk-offload enable --appendonly yes (BGREWRITEAOF + SIGKILL) | 7892 / 12662 ❌ −38 % |
| --shards 2 --disk-offload enable --appendonly yes (plain SIGKILL, no rewrite) | 7888 / 12655 ❌ −38 % |
| --shards 2 --disk-offload enable --appendonly yes --appendfsync always | 2474 / 5000 ❌ −50 % |
| --shards 2 --disk-offload disable --appendonly yes --appendfsync always | 2453 / 5000 ❌ −50 % |

--appendfsync always does not save you. --disk-offload disable does not save you. Only --shards 1 recovers reliably. Multi-shard masters are explicitly cache-only in v1.0 until the v2.0 multi-shard AOF replay lands (P0-INVEST-01, 1-2 wk, tracked in tmp/SHIP-PLAN-v1.0-rc1-single-node.md).
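
For readers who want to re-derive the loss percentages from the recovered / written counts, here is a minimal sketch. The helper name `loss_percent` and the row labels are illustrative only and are not part of the Moon codebase; the counts are copied from the table in this description.

```rust
/// Percentage of written keys that were missing after recovery.
/// Hypothetical helper for illustration; not part of Moon.
fn loss_percent(recovered: u32, written: u32) -> f64 {
    assert!(written > 0 && recovered <= written);
    100.0 * f64::from(written - recovered) / f64::from(written)
}

fn main() {
    // (label, recovered, written), counts taken from the table above.
    let rows = [
        ("shards 1, appendfsync always (control)", 5000, 5000),
        ("shards 1, disk-offload enable (control)", 12714, 12714),
        ("shards 2, BGREWRITEAOF + SIGKILL", 7892, 12662),
        ("shards 2, plain SIGKILL", 7888, 12655),
        ("shards 2, offload enable, appendfsync always", 2474, 5000),
        ("shards 2, offload disable, appendfsync always", 2453, 5000),
    ];
    for (label, recovered, written) in rows {
        println!("{label:<46} {:>5.1} % lost", loss_percent(recovered, written));
    }
}
```

The two single-shard controls lose nothing, while every `--shards 2` row loses a large fraction regardless of the `--appendfsync` or `--disk-offload` setting, which is the basis for gating on shard count alone.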

Test plan

  • Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config covers gate-on + gate-off paths and asserts the gate does not flip AOF_REWRITE_IN_PROGRESS. Green on cargo test --release --lib persistence::tests.
  • Live OrbStack boundary tests:
    • PASS --shards 1 + AOF starts cleanly (no false positives)
    • PASS --shards 2 + AOF + --unsafe-multishard-aof starts (escape hatch)
    • PASS --shards 2 + --appendonly no starts (cache-only)
    • REFUSED --shards 2 + AOF without escape hatch (exit code 2 + documented stderr)
  • Re-ran the original repro post-fix: BGREWRITEAOF returns the documented ERR line at the wire.
  • CI green (rust 1.94, both feature sets, clippy clean, audit-unsafe, audit-unwrap).
  • Encode the 5-config discriminator as #[ignore] crash-matrix tests under tests/ (tracked as CRASH-01-LITE in the ship plan; not in this PR).
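
The four live boundary cases above can also be restated as a table-driven check against a pure predicate. This is only a sketch: `startup_refused` is a hypothetical name, and `appendonly` is modelled as a `bool` for brevity, whereas the real gate in `src/main.rs` compares the config string against `"yes"`.

```rust
/// Hypothetical restatement of the startup gate as a pure function.
/// `appendonly` is simplified to a bool; the real config stores a string.
fn startup_refused(shards: usize, appendonly: bool, unsafe_multishard_aof: bool) -> bool {
    shards >= 2 && appendonly && !unsafe_multishard_aof
}

fn main() {
    // (shards, appendonly, override flag, expected refusal)
    let cases = [
        (1, true, false, false),  // PASS: single shard + AOF
        (2, true, true, false),   // PASS: escape hatch
        (2, false, false, false), // PASS: cache-only
        (2, true, false, true),   // REFUSED: multi-shard + AOF, no override
    ];
    for (shards, appendonly, override_flag, expected) in cases {
        assert_eq!(
            startup_refused(shards, appendonly, override_flag),
            expected,
            "shards={shards} appendonly={appendonly} override={override_flag}"
        );
    }
    println!("all boundary cases match");
}
```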

Operator impact

  • Existing --shards >= 2 + --appendonly yes deployments will fail to start after upgrade. The error message is actionable: pick --shards 1, --appendonly no, or --unsafe-multishard-aof. Runbook walks each option.
  • Single-shard deployments are unaffected.
  • --appendonly no (any shard count) is unaffected.

Out of scope (next PRs)

  • P0-INVEST-01: root-cause the multi-shard AOF durability bug. Gates lift when this lands.
  • CRASH-01-LITE: scripted crash-matrix tests in tests/.
  • v0.1.13 tag once the gates soak for ~1 week.

Summary by CodeRabbit

  • Bug Fixes

    • Added safety gate preventing data loss in multi-shard with AOF persistence configurations.
  • New Features

    • Introduced --unsafe-multishard-aof flag to override durability safety checks when needed.
  • Documentation

    • Updated release notes, README with production readiness matrix, and operator runbooks.


TinDang97 added 2 commits May 26, 2026 22:19
…OF (P0-FIX-01a/b)

Empirical re-verification on HEAD 6e49050 (2026-05-26) found that
`--shards >= 2 + --appendonly yes` silently loses ~50 % of writes on
SIGKILL, independent of `--appendfsync` and `--disk-offload`. The
original 33-day-old bug memory had narrowed the loss to
BGREWRITEAOF + disk-offload; the discriminator matrix below shows the
bug is in the multi-shard AOF durability path itself.

| Configuration                                                                  | Recovered      |
|--------------------------------------------------------------------------------|----------------|
| --shards 1 --appendonly yes --appendfsync always                               | 5000 / 5000    |
| --shards 1 --disk-offload enable --appendonly yes                              | 12714 / 12714  |
| --shards 2 --disk-offload enable --appendonly yes (BGREWRITEAOF + SIGKILL)     | 7892 / 12662   |
| --shards 2 --disk-offload enable --appendonly yes (plain SIGKILL, no rewrite)  | 7888 / 12655   |
| --shards 2 --disk-offload enable --appendonly yes --appendfsync always         | 2474 / 5000    |
| --shards 2 --disk-offload disable --appendonly yes --appendfsync always        | 2453 / 5000    |

Two complementary gates ship in this commit; both lift in v2.0 when
multi-shard AOF replay walks every shard's segment manifest on
recovery (see docs/runbooks/multi-shard-aof-rewrite.md):

P0-FIX-01a (defence-in-depth, command-level)
  bgrewriteaof_start_sharded refuses with a clear ERR when the
  multi-shard + disk-offload + AOF combo is active. Gated by
  MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool, set once in main.rs.
  Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config
  covers gate-on + gate-off paths and asserts the gate does not
  flip AOF_REWRITE_IN_PROGRESS.
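
As an illustration of the gate ordering described above, the following sketch checks the unsafe flag before the in-progress flag is claimed, so a refused call leaves `AOF_REWRITE_IN_PROGRESS` untouched. The two static names come from this commit message; the function signature, the return type, and the reply strings are assumptions made for the example and do not reflect the actual code in `src/command/persistence.rs`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Names taken from the commit message; everything else here is assumed.
static MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool = AtomicBool::new(false);
static AOF_REWRITE_IN_PROGRESS: AtomicBool = AtomicBool::new(false);

/// Hypothetical shape of the gated entry point. Reply strings are placeholders.
fn bgrewriteaof_start_sharded() -> Result<&'static str, &'static str> {
    // Gate first: refuse before any rewrite state is mutated.
    if MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Acquire) {
        return Err("ERR BGREWRITEAOF refused for this config (placeholder text)");
    }
    // Only now try to claim the in-progress flag.
    if AOF_REWRITE_IN_PROGRESS
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return Err("ERR rewrite already in progress (placeholder text)");
    }
    Ok("rewrite started (placeholder text)")
}

fn main() {
    // Gate on: the call is refused and the in-progress flag stays clear.
    MULTI_SHARD_AOF_REWRITE_UNSAFE.store(true, Ordering::Release);
    assert!(bgrewriteaof_start_sharded().is_err());
    assert!(!AOF_REWRITE_IN_PROGRESS.load(Ordering::Acquire));

    // Gate off: the call proceeds and claims the flag.
    MULTI_SHARD_AOF_REWRITE_UNSAFE.store(false, Ordering::Release);
    assert!(bgrewriteaof_start_sharded().is_ok());
    assert!(AOF_REWRITE_IN_PROGRESS.load(Ordering::Acquire));
}
```

Checking the gate before the `compare_exchange` is what makes the "does not flip AOF_REWRITE_IN_PROGRESS" assertion hold: a refused command never leaves the process looking as if a rewrite were running.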

P0-FIX-01b (load-bearing, startup)
  main.rs aborts with exit code 2 if `--shards >= 2 + --appendonly
  yes` without `--unsafe-multishard-aof`. The new flag is the
  explicit escape hatch for cache-only deployments where the loss
  window is acceptable. Boundary tests verified live on OrbStack:
    PASS  --shards 1 + AOF starts cleanly (no false positives)
    PASS  --shards 2 + AOF + --unsafe-multishard-aof starts
    PASS  --shards 2 + --appendonly no starts (cache-only)
    REFUSED  --shards 2 + AOF without escape hatch

Files
  src/command/persistence.rs  + gate + unit test
  src/main.rs                 + startup refusal + BGREWRITEAOF gate set
  src/config.rs               + --unsafe-multishard-aof flag
  docs/runbooks/multi-shard-aof-rewrite.md  + operator runbook

Reproducer scripts live in tmp/ (gitignored): p0-repro.sh,
p0-no-rewrite.sh, p0-always.sh, p0-multishard-no-offload.sh,
p0-shards1-exact.sh. Encoding them as #[ignore] crash-matrix tests
is tracked as CRASH-01-LITE in the ship plan.

Multi-shard masters with AOF are now explicitly cache-only in v1.0.
Root-cause investigation P0-INVEST-01 (1-2 wk) is the prerequisite
to lifting the startup gate in v2.0.

author: Tin Dang
…lpha-leak qualifiers

README
  * Bumps version badge v0.1.10 → v0.1.12 and replaces the
    "experimental" status with "single-node production-grade" plus a
    "cluster v0.2 alpha" badge, mirroring the new ship plan posture.
  * Replaces the blanket experimental warning with a "production-grade
    architecture, pre-1.0 maturity" framing that points at the new
    Production readiness section for the honest GA matrix.
  * Reconciles platform support — macOS is a supported development
    platform per the PRODUCTION-CONTRACT Tier table; production
    deployments target Linux.
  * Adds a Valkey 9.1.0 column to the peak-throughput tables (honest
    "not yet benched" placeholders) and a new Moon vs Redis vs Valkey
    section: a three-way comparison table plus "when to choose"
    guidance, all traced to docs/comparison-valkey.md.
  * Rewrites the trailing roadmap into a Production readiness section
    with what's GA today, what's not, operator gotchas, and a roadmap
    table.

  Alpha-leak qualifiers added so v0.1.12 framing does not implicitly
  promise v0.2.0-alpha features:

  * Quick-start HEXPIRE / HTTL lines annotated "(v0.2.0-alpha; build
    from main)".
  * Hash-field TTL benchmark section retitled "v0.2.0-alpha preview"
    with a callout that the latest tag (v0.1.12) does not include it.
  * "What's already in main" list split into v0.1.12 (latest tag,
    single-node production-grade) and v0.2.0-alpha additions
    (hash-field TTL, PITR, CDC, multi-node cluster soak).
  * Comparison-table row for hash-field TTL qualified as
    "v0.2-alpha".

CHANGELOG
  * Adds v0.1.12 entry covering Phase 189 (DashTable pre-sizing +
    --initial-keyspace-hint, PERF-07/09), Phase 190 (moon_memory_bytes
    Prometheus gauge with 7 subsystem kinds, MEMORY DOCTOR schema,
    resident_bytes trait), Phase 191 (jemalloc narenas:8 cap,
    --memory-arenas-cap, mimalloc-alt feature, OPERATOR-GUIDE Memory
    Accounting), Phase 177 dispatch observability, text-index default
    feature, SDK validate.{py,rs}, Python SDK graph parser fix, CI
    hygiene.
  * Adds v0.1.10 entry (single-shard PSYNC2 wired end-to-end).
  * Adds v0.1.9 Lunaris Retriever Gap Closure entry.
  * Consolidates three orphan Unreleased blocks under v0.1.3.
  * Sharpens v0.2.0-alpha entry with TL;DR headline capabilities
    (hash-field TTL stack, PITR, CDC, multi-node cluster soak).
  * Fixes version ordering so v0.1.12 sits above v0.1.11.

No code changes; this is purely documentation framing aligned to the
v1.0-rc1 single-node ship plan in tmp/SHIP-PLAN-v1.0-rc1-single-node.md.

author: Tin Dang
@coderabbitai

coderabbitai Bot commented May 26, 2026

Walkthrough

This PR implements a safety gate for a known multi-shard AOF rewrite durability bug: it refuses startup under unsafe configurations unless explicitly overridden, gates the BGREWRITEAOF command at runtime, and documents the issue and recovery procedures for operators. Documentation and marketing content are updated to reflect the v0.1.12 release and production readiness status.

Changes

Multi-Shard AOF Safety Gates

| Layer / File(s) | Summary |
|---|---|
| Configuration flag and startup enforcement (src/config.rs, src/main.rs) | Adds --unsafe-multishard-aof CLI flag (default false) to ServerConfig; main startup logic refuses to start with multi-shard + appendonly + non-override, exits code 2 with data-loss warning. |
| Runtime BGREWRITEAOF command gating (src/command/persistence.rs, src/main.rs) | Introduces MULTI_SHARD_AOF_REWRITE_UNSAFE AtomicBool for process-wide gating; bgrewriteaof_start_sharded returns error frame before dispatch when unsafe multi-shard + disk-offload + appendonly conditions detected; main.rs sets gate and emits operator warnings at startup. |
| Gate validation tests (src/command/persistence.rs) | Concurrency-safe test verifies unsafe gate produces expected error, does not set AOF_REWRITE_IN_PROGRESS, and gate disabling removes unsafe error. |
| Release notes and operational runbook (CHANGELOG.md, docs/runbooks/multi-shard-aof-rewrite.md) | CHANGELOG reorganized with v0.1.12 release and v0.2.0-alpha unreleased features; new runbook documents refusal gates, data-loss measurements, recovery, avoidance, v2.0 removal timeline, and telemetry signals. |
| Product documentation and marketing (README.md) | Badges updated to v0.1.12; introductory section rewritten for production-grade positioning; Benchmarks and Hash-field TTL expanded with Moon vs Redis vs Valkey comparisons; Production readiness matrix added with recommended configs, operator gotchas, shipped/in-progress features. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A gate stands guard where shards align,
With AOF that dares to write—
One flag to rule the dangerous line,
And tests that burn both day and night.
Now operators know the path to flight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main changes: refusing multi-shard AOF at startup and gating BGREWRITEAOF, matching the two core P0 fixes (P0-FIX-01a/b) implemented across config, command, and main modules. |
| Description check | ✅ Passed | The description comprehensively covers the summary, rationale, test plan, and operator impact, though it lacks explicit confirmation that the standard checklist items (cargo fmt, clippy, cargo test, consistency tests) have passed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

229-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Quick-start production flags now conflict with startup safety gate.

This command should fail under the new startup refusal (--shards >= 2 + --appendonly yes without override), so the README is currently instructing an invalid config.

Suggested README correction
 # Or with production flags
 ./target/release/moon \
   --port 6379 \
-  --shards 8 \
-  --appendonly yes --appendfsync everysec \
+  --shards 1 \
+  --appendonly yes --appendfsync everysec \
   --maxmemory 8g --maxmemory-policy allkeys-lfu
+
+# Multi-shard cache-only alternative
+# ./target/release/moon --shards 8 --appendonly no ...
+
+# Unsafe override (not recommended; known durability risk)
+# ./target/release/moon --shards 8 --appendonly yes --unsafe-multishard-aof ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 229 - 233, The README's quick-start example uses
conflicting flags (--shards 8 together with --appendonly yes) which will trigger
the new startup safety gate and refuse to start; update the example command
under the block that contains the flags (--port, --shards, --appendonly,
--appendfsync, --maxmemory, --maxmemory-policy) to a valid configuration (e.g.,
set --shards 1 or remove/disable --appendonly) or explicitly show the required
override flag and text that allows bypassing the safety gate (add a clear
placeholder like --<startup-override> if an override exists) so the documented
command actually starts successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/runbooks/multi-shard-aof-rewrite.md`:
- Around line 10-16: Three fenced code blocks in
docs/runbooks/multi-shard-aof-rewrite.md are missing language identifiers
(markdownlint MD040). Edit the three blocks shown (the startup refusal block
starting "REFUSING TO START: --shards 2 + --appendonly yes...", the BGREWRITEAOF
interaction block containing "BGREWRITEAOF" and "(error) ERR BGREWRITEAOF...",
and the final explanatory block starting "BGREWRITEAOF gated for this
config...") and add the language tags: use ```text for the two plain-text blocks
and ```redis for the BGREWRITEAOF example so markdownlint MD040 is satisfied.

In `@src/main.rs`:
- Around line 273-289: The --check-config path currently returns before the
multishard-AOF safety gate runs, so add the same refusal logic used at startup
into the check_config branch: detect the condition (num_shards >= 2 &&
config.appendonly == "yes" && !config.unsafe_multishard_aof) inside the
check_config handling and print the identical error message and exit non‑zero
(or return an error) so preflight fails the same way real startup would; use the
same symbols/strings (num_shards, config.appendonly,
config.unsafe_multishard_aof) and the same message text used near the startup
gate to keep behavior consistent.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c11a2da9-b702-43f0-91ac-59786ae9a841

📥 Commits

Reviewing files that changed from the base of the PR and between 6e49050 and 7b61898.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • README.md
  • docs/runbooks/multi-shard-aof-rewrite.md
  • src/command/persistence.rs
  • src/config.rs
  • src/main.rs

Comment on lines +10 to +16
```
REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1,
or pass --appendonly no for cache-only deployments, or pass
--unsafe-multishard-aof to acknowledge the risk and start anyway. See
docs/runbooks/multi-shard-aof-rewrite.md.
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fenced code languages to satisfy markdownlint MD040.

These three fenced blocks are missing language identifiers and will keep markdownlint warnings active.

Suggested doc-only fix
-```
+```text
 REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
 ...
-```
+```

-```
+```redis
 > BGREWRITEAOF
 (error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable
 ...
-```
+```

-```
+```text
 BGREWRITEAOF gated for this config (known data-loss path; see
 docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or
 --disk-offload disable to re-enable rewrite.
-```
+```

Also applies to: 20-26, 88-92

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment thread src/main.rs
Comment on lines +273 to +289
// P0-FIX-01b: refuse to start under the known durability bug
// (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL,
// verified 2026-05-26 on HEAD `6e49050`; reproducer in
// `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`). The bug is
// independent of `--appendfsync` and `--disk-offload` settings. An
// operator can override via `--unsafe-multishard-aof` if the
// deployment is cache-only and the loss window is acceptable.
if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
    eprintln!(
        "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \
         bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \
         --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \
         acknowledge the risk and start anyway. See \
         docs/runbooks/multi-shard-aof-rewrite.md."
    );
    std::process::exit(2);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror this refusal in --check-config validation.

Line 143 returns from --check-config before Line 280 runs, so preflight can pass a config that real startup immediately refuses. Please enforce the same gate in the check_config branch.

Suggested patch
@@
     if config.check_config {
+        if config.shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+            return Err(anyhow::anyhow!(
+                "--shards {} + --appendonly yes is refused unless --unsafe-multishard-aof is set (or use --shards 1 / --appendonly no)",
+                config.shards
+            ));
+        }
         // Validate shard count is reasonable
         if config.shards == 0 {
             return Err(anyhow::anyhow!("--shards must be >= 1"));
         }
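
One way to keep preflight and real startup from drifting apart is to route both through a single validation function. The sketch below assumes a simplified `Config` struct and hypothetical `validate` / `run` helpers; it is not the layout of `src/main.rs`, and returning exit code 2 from the `--check-config` path is an assumption (the patch above returns an `anyhow` error instead).

```rust
/// Simplified stand-in for the server config; field names follow the review comment.
struct Config {
    shards: usize,
    appendonly: String,
    unsafe_multishard_aof: bool,
    check_config: bool,
}

/// Hypothetical single source of truth for config validation.
fn validate(config: &Config) -> Result<(), String> {
    if config.shards == 0 {
        return Err("--shards must be >= 1".to_string());
    }
    if config.shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
        return Err(format!(
            "--shards {} + --appendonly yes is refused unless --unsafe-multishard-aof is set \
             (or use --shards 1 / --appendonly no)",
            config.shards
        ));
    }
    Ok(())
}

/// Returns the process exit code. Both preflight and startup validate first.
fn run(config: &Config) -> i32 {
    if let Err(msg) = validate(config) {
        eprintln!("{msg}");
        return 2; // assumed: same exit code for preflight and startup refusal
    }
    if config.check_config {
        return 0; // preflight only: config is valid, do not start the server
    }
    // Real startup would continue here.
    0
}

fn main() {
    let bad = Config {
        shards: 2,
        appendonly: "yes".to_string(),
        unsafe_multishard_aof: false,
        check_config: true,
    };
    // Preflight now fails the same way real startup would.
    assert_eq!(run(&bad), 2);
    assert_eq!(run(&Config { check_config: false, ..bad }), 2);
}
```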

2 participants