Skip to content

fix(governance-store): propagate cascade_seq store read error#3015

Merged
chefsale merged 1 commit into
masterfrom
claude/cascade-seq-store-error-cmux8h
Jun 29, 2026
Merged

fix(governance-store): propagate cascade_seq store read error#3015
chefsale merged 1 commit into
masterfrom
claude/cascade-seq-store-error-cmux8h

Conversation

@chefsale

Copy link
Copy Markdown
Member

Description

This PR fixes a critical error handling bug in the cascade upgrade logic that could cause cohort divergence across the cluster.

The Problem:
The code was using .ok() to swallow read errors when fetching the namespace governance head, treating transient store errors the same as genuinely-absent heads. This is unsafe because:

  • If a node hits a transient store error, it would pin cascade_seq to None
  • Peer nodes that successfully read the head would pin cascade_seq to Some(seq)
  • This divergence in the migration-status rollup comparison would cause the cohort to split

The Fix:
Changed the error handling to propagate read errors with ? while still allowing genuinely-absent heads to result in None. This ensures:

  • Transient errors fail fast and are retried, keeping the cluster in sync
  • Only truly missing heads result in None pins
  • The cohort comparison remains consistent across all nodes

The refactoring also improves code clarity by separating the namespace resolution from the head lookup, making the error handling intent explicit.

Test plan

Existing tests cover the cascade upgrade logic. The change is defensive—it prevents a failure mode rather than adding new functionality. CI should pass with no additional test changes needed.

Wire contract (SDK gate)

N/A — no HTTP wire DTOs or routes changed.

Documentation update

N/A — internal implementation detail.

https://claude.ai/code/session_011sjZkyNWU9gHaSjnVTfnaV

CascadeUpgrade::apply computed the migration's governance-position pin
(`cascade_seq`) with `.ok()`/`.flatten()`, swallowing both the namespace
resolve error and the gov-head read error. A transient store read failure
would pin `cascade_seq = None` on the affected node while peers that read
successfully pinned `Some(seq)` for the same op, diverging the cohort
comparison the migration-status rollup performs.

Propagate the read error via `?` instead. A genuinely-absent gov-head
still yields `None` (no pin); only an actual store error now fails the
apply, keeping the pinned state deterministic across nodes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011sjZkyNWU9gHaSjnVTfnaV
@chefsale chefsale changed the title Fix cascade_seq error handling to prevent cohort divergence fix(governance-store): propagate cascade_seq store read error Jun 29, 2026

@meroreviewer meroreviewer Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 MeroReviewer

Reviewed by 1 agents | Quality score: 85% | Review time: 137.3s


✅ No Issues Found

All agents reviewed the code and found no issues. LGTM! 🎉


🤖 Generated by MeroReviewer | Review ID: review-f7991528

@meroreviewer

meroreviewer Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Documentation Review

The following documentation may need updates based on the changes in this PR:

  • 🟡 architecture/auto-follow.html: Files matching crates/governance-store/** were changed but architecture/auto-follow.html was not updated (per source_to_docs_mapping).
  • 🟡 architecture/local-governance.html: Files matching crates/governance-store/** were changed but architecture/local-governance.html was not updated (per source_to_docs_mapping).

@chefsale chefsale merged commit f3aaaaa into master Jun 29, 2026
111 of 112 checks passed
@chefsale chefsale deleted the claude/cascade-seq-store-error-cmux8h branch June 29, 2026 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants