Skip to content

Fix compaction abort rescheduling before queue pick#14800

Open
xingbowang wants to merge 2 commits into
facebook:mainfrom
xingbowang:2026_05_28_compaction_abort_bug
Open

Fix compaction abort rescheduling before queue pick#14800
xingbowang wants to merge 2 commits into
facebook:mainfrom
xingbowang:2026_05_28_compaction_abort_bug

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

@xingbowang xingbowang commented May 28, 2026

Summary

  • AbortAllCompactions can race with an already-scheduled automatic compaction before the background worker calls PickCompactionFromQueue(). MaybeScheduleFlushOrCompaction() has already consumed one unscheduled_compactions_ credit when it scheduled the worker, but the CF is still present in compaction_queue_ with queued_for_compaction=true. If the worker returns kCompactionAborted before popping the CF, ResumeAllCompactions() can see unscheduled_compactions_ == 0 and fail to schedule the still-queued work, leaving compaction permanently stalled until DB restart.

  • Restore the unscheduled compaction credit in the non-prepicked abort path, matching the existing BG-work-stopped handling for the same scheduled-before-pick state. Prepicked/manual compactions are not adjusted because they do not represent an unpopped automatic compaction_queue_ entry whose scheduling credit was consumed.

  • Add AbortScheduledAutomaticCompactionBeforePick to deterministically reproduce the lost-credit race with a sync point at BackgroundCallCompaction:0. The test verifies that ResumeAllCompactions() schedules the still-queued automatic compaction by checking COMPACT_WRITE_BYTES advances and L0 file count drops.

Test Plan

  • Without the implementation fix, DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick fails because COMPACT_WRITE_BYTES remains unchanged after ResumeAllCompactions().

AbortAllCompactions can race with an already-scheduled automatic compaction before the background worker calls PickCompactionFromQueue(). MaybeScheduleFlushOrCompaction() has already consumed one unscheduled_compactions_ credit when it scheduled the worker, but the CF is still present in compaction_queue_ with queued_for_compaction=true. If the worker returns kCompactionAborted before popping the CF, ResumeAllCompactions() can see unscheduled_compactions_ == 0 and fail to schedule the still-queued work, leaving compaction permanently stalled until DB restart.

Restore the unscheduled compaction credit in the non-prepicked abort path, matching the existing BG-work-stopped handling for the same scheduled-before-pick state. Prepicked/manual compactions are not adjusted because they do not represent an unpopped automatic compaction_queue_ entry whose scheduling credit was consumed.

Add AbortScheduledAutomaticCompactionBeforePick to deterministically reproduce the lost-credit race with a sync point at BackgroundCallCompaction:0. The test verifies that ResumeAllCompactions() schedules the still-queued automatic compaction by checking COMPACT_WRITE_BYTES advances and L0 file count drops.

Add a bug-fix release note under unreleased_history/bug_fixes for the next release.

Test Plan: without the implementation fix, AbortScheduledAutomaticCompactionBeforePick fails because COMPACT_WRITE_BYTES remains 0 after ResumeAllCompactions(). Verified final tree with make format-auto, make -j128 db_compaction_abort_test, timeout 60s ./db_compaction_abort_test --gtest_filter=DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick, timeout 60s ./db_compaction_abort_test, COERCE_CONTEXT_SWITCH=1 make -j128 db_compaction_abort_test && timeout 60s ./db_compaction_abort_test --gtest_filter=DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick --gtest_repeat=5, and make check-sources. After adding the release note, reran make check-sources.
@meta-cla meta-cla Bot added the CLA Signed label May 28, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

✅ clang-tidy: No findings on changed lines

Completed in 108.4s.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 81e4eb2


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 81e4eb2


Summary

This PR correctly fixes a real race condition where AbortAllCompactions() can permanently stall compaction by losing an unscheduled_compactions_ credit when abort fires before PickCompactionFromQueue(). The fix is minimal, correctly scoped, thread-safe, and matches the existing BG-work-stopped pattern. No high-severity issues found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Inconsistent !is_prepicked guard in BG-work-stopped path -- db_impl_compaction_flush.cc:4080
  • Issue: The existing BG-work-stopped path (line 4080) increments unscheduled_compactions_++ unconditionally (no !is_prepicked guard), while the new abort path correctly guards with !is_prepicked. For prepicked/manual compactions, no unscheduled_compactions_ credit was consumed by MaybeScheduleFlushOrCompaction(), so the unconditional increment in the BG-work-stopped path could create a spurious credit.
  • Root cause: The BG-work-stopped path predates the abort fix and was written without considering the prepicked distinction. Its comment says "Since we didn't pop a cfd from the compaction queue" -- but prepicked compactions never had a queue entry to begin with.
  • Suggested fix: Add the same !is_prepicked guard to the BG-work-stopped path for consistency:
    } else {
        status = error_handler_.GetBGError();
        if (!is_prepicked) {
            unscheduled_compactions_++;
        }
    }
    This is a pre-existing issue, not introduced by this PR. Could be addressed as a follow-up.
M2. Test coverage gaps for edge cases -- db_compaction_abort_test.cc
  • Issue: The new test covers the primary scenario (single CF, single background thread, automatic compaction) but does not cover:
    • Multiple CFs in compaction_queue_ when abort fires (multiple credits lost and restored)
    • max_background_compactions > 1 (multiple workers aborting simultaneously)
    • Verifying that manual/prepicked compactions do NOT incorrectly increment unscheduled_compactions_ during abort
  • Root cause: Test scope limited to the minimal reproduction case.
  • Suggested fix: Consider adding a multi-CF or multi-thread variant in a follow-up. The current test is sufficient to validate the core fix.

🟢 LOW / NIT

L1. Consider centralizing credit restoration -- db_impl_compaction_flush.cc
  • Issue: Credit restoration (unscheduled_compactions_++) now appears in four separate locations (lines 4080, 4070, 4161, 4174), each with slightly different guard conditions. This fragmentation makes the credit system harder to reason about.
  • Suggested fix: Consider extracting a helper method in a follow-up:
    void RestoreUnscheduledCompactionCredit(bool is_prepicked) {
        if (!is_prepicked) { unscheduled_compactions_++; }
    }

Cross-Component Analysis

Context Affected? Verified?
WritePreparedTxnDB No Same scheduling path
FIFO/Universal compaction Yes, works correctly Style-agnostic scheduling
Multiple CFs Yes, works correctly Per-worker credit accounting
Multiple BG threads Yes, works correctly Each worker restores own credit
Nested abort/resume Yes, works correctly Counter semantics preserved
Manual compaction No (guarded by !is_prepicked) Verified via caller audit
Forwarded-to-bottom compaction No (guarded by !is_prepicked) Verified: original pick consumes credit, forwarded worker is prepicked
ReadOnly DB No No compaction scheduling

Key invariant verified: MaybeScheduleFlushOrCompaction() at line 3995 (in BackgroundCallCompaction cleanup) cannot consume the restored credit because compaction_aborted_ is still > 0 at that point. AbortAllCompactions() waits for bg_compaction_scheduled_ == 0 (line 3188), which happens AFTER line 3986's decrement, so ResumeAllCompactions() cannot be called until after the cleanup completes.

Positive Observations

  • The fix precisely mirrors the existing BG-work-stopped pattern (line 4080), demonstrating good codebase awareness.
  • The !is_prepicked guard is more correct than the existing BG-work-stopped path, showing attention to the credit accounting model.
  • The test uses SyncPointAbortHelper consistently with other tests in the file.
  • The comment on the new code clearly explains the invariant being maintained.
  • The sync point "BackgroundCallCompaction:0" precisely targets the window between worker scheduling and queue pick, making the test deterministic.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@xingbowang xingbowang force-pushed the 2026_05_28_compaction_abort_bug branch from 81e4eb2 to 578e1d9 Compare May 28, 2026 18:15
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 28, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106684155.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants