Skip to content

fix: keep lastLog cache in sync after append truncation failures#4

Draft
anyasabo wants to merge 1 commit into
mainfrom
fix/append-truncation-store-failure-lastlog
Draft

fix: keep lastLog cache in sync after append truncation failures#4
anyasabo wants to merge 1 commit into
mainfrom
fix/append-truncation-store-failure-lastlog

Conversation

@anyasabo

@anyasabo anyasabo commented May 6, 2026

Copy link
Copy Markdown
Owner

Problem

appendEntries can leave cached lastLog metadata stale when this sequence occurs:

  1. follower truncates conflicting suffix,
  2. writing replacement entries fails (StoreLogs error),
  3. function returns without recomputing lastLog from durable store.

This creates cache/store divergence (r.getLastLog() vs LogStore).

How this can happen in concrete terms

A realistic failure chain in production:

  • follower receives conflicting append from leader,
  • DeleteRange succeeds (durable suffix removed),
  • replacement append fails due to transient I/O error, ENOSPC, or backend log-store failure,
  • node reports stale lastLog index/term from memory cache.

Existing mitigations and gaps

Mitigations that reduce probability:

  • healthy disks and alerting on storage errors,
  • robust log-store implementations.

Gap prior to this PR: even when storage returned an explicit append failure, cache was not reconciled.

Impact

Consistency of local bookkeeping degrades under storage fault:

  • incorrect retry/backoff behavior potential,
  • prolonged replication instability after fault,
  • harder post-fault recovery due to stale in-memory metadata.

How we would notice in production

  • Error log: failed to append to logs shortly after conflict truncation.
  • Replication not converging despite apparent progress.
  • Potential mismatch between on-disk last index and last_log_* stats/debug output.

Provenance / Preconditions

  • Longstanding latent bug; code had explicit TODO noting stale lastLog risk in this path.
  • TODO and path originate in early implementation (fdb8145f, 2016), with logging additions later.
  • No special cluster configuration required; trigger requires truncation + StoreLogs failure.

What this PR changes

  • Recomputes cached lastLog from durable store on StoreLogs failure.
  • Adds regression test that reproduces truncation+append-failure path with injected log-store error.

Reviewer reproduction (live in-process Raft node)

Reproduce pre-fix behavior

  1. Checkout parent commit:
    • git checkout 0d1b34c^
  2. Bring this PR's regression test into that tree:
    • git checkout fix/append-truncation-store-failure-lastlog -- raft_test.go
  3. Run:
    • go test -run "TestRaft_AppendEntriesStoreLogsFailureRefreshesLastLogAfterTruncate" -count=1 .
  4. Expected pre-fix: test fails with stale cached lastLog (cache index > durable store index).

Verify fixed behavior

  1. Checkout this branch (fix/append-truncation-store-failure-lastlog).
  2. Run:
    • go test -run "TestRaft_AppendEntriesStoreLogsFailureRefreshesLastLogAfterTruncate" -count=1 .
  3. Expected: pass; cached metadata matches durable store after failure.

Test plan

  • go test -run "TestRaft_AppendEntriesStoreLogsFailureRefreshesLastLogAfterTruncate" -count=1 .
  • go test -run "TestRaft_AppendEntry$|TestRaft_AppendEntriesStoreLogsFailureRefreshesLastLogAfterTruncate" -count=1 .

appendEntries can truncate a conflicting suffix, fail to store replacement entries, and leave cached lastLog metadata stale versus durable storage. Recompute lastLog from the log store on StoreLogs failure and add a regression test that reproduces the stale-cache path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant