Skip to content

Reduce manifest rotation for foreground metadata ops#14797

Open
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:optimize_foreground_manifest_ops
Open

Reduce manifest rotation for foreground metadata ops#14797
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:optimize_foreground_manifest_ops

Conversation

@pdillinger
Copy link
Copy Markdown
Contributor

Summary:
Async WAL precreation in #14738 / D105020559 was motivated by slow file creation time on remote storage. MANIFEST does not need the same precreation treatment as WAL because most MANIFEST writes come from background flush and compaction work, but user-facing metadata operations can still pay MANIFEST rotation file creation latency inline. File ingestion performance is a particular concern to some Meta users.

Relax the effective MANIFEST rotation limit by 25% for MANIFEST write batches containing any foreground VersionEdit, while keeping background-only flush/compaction batches on the configured or auto-tuned limit. This covers column family manipulation, external file ingestion and import, and DeleteFilesInRange(s). SetOptions remains expected to avoid MANIFEST writes; the test keeps a regression guard for that behavior.

The relaxation is intentionally bounded. It reduces the chance that foreground metadata operations create a new MANIFEST inline, while still allowing foreground operations to rotate once the current MANIFEST is beyond the relaxed threshold. Heavier blocking operations like manual Flush or CompactRange already trigger additional file creation and do not get this treatment here, though that could be reconsidered later.

This should reduce a potential latency hazard of manifest file size auto-tuning: more frequent MANIFEST rotations. With this change, rotation latency is shifted toward background-only MANIFEST batches when possible.

Test Plan:
Expanded DBEtc3Test.AutoTuneManifestSize to cover the foreground threshold behavior and the original auto-tuning behavior in separate phases:

  • verifies foreground-only CreateColumnFamily writes get only bounded 25% headroom by asserting the first four large-CF additions do not rotate and the fifth does;
  • verifies auto-tuned background thresholds still prevent excessive rotation;
  • verifies foreground operations stay below the relaxed threshold for CreateColumnFamily, IngestExternalFile, CreateColumnFamilyWithImport, and DeleteFilesInRanges;
  • verifies SetOptions still does not write to MANIFEST;
  • verifies a following background flush still rotates at the normal threshold;
  • preserves the persisted compacted manifest size close/reopen coverage.

Summary:
Async WAL precreation in facebook#14738 / D105020559 was motivated by slow file creation time on remote storage. MANIFEST does not need the same precreation treatment as WAL because most MANIFEST writes come from background flush and compaction work, but user-facing metadata operations can still pay MANIFEST rotation file creation latency inline. File ingestion performance is a particular concern to some Meta users.

Relax the effective MANIFEST rotation limit by 25% for MANIFEST write batches containing any foreground VersionEdit, while keeping background-only flush/compaction batches on the configured or auto-tuned limit. This covers column family manipulation, external file ingestion and import, and DeleteFilesInRange(s). SetOptions remains expected to avoid MANIFEST writes; the test keeps a regression guard for that behavior.

The relaxation is intentionally bounded. It reduces the chance that foreground metadata operations create a new MANIFEST inline, while still allowing foreground operations to rotate once the current MANIFEST is beyond the relaxed threshold. Heavier blocking operations like manual Flush or CompactRange already trigger additional file creation and do not get this treatment here, though that could be reconsidered later.

This should reduce a potential latency hazard of manifest file size auto-tuning: more frequent MANIFEST rotations. With this change, rotation latency is shifted toward background-only MANIFEST batches when possible.

Test Plan:
Expanded DBEtc3Test.AutoTuneManifestSize to cover the foreground threshold behavior and the original auto-tuning behavior in separate phases:
- verifies foreground-only CreateColumnFamily writes get only bounded 25% headroom by asserting the first four large-CF additions do not rotate and the fifth does;
- verifies auto-tuned background thresholds still prevent excessive rotation;
- verifies foreground operations stay below the relaxed threshold for CreateColumnFamily, IngestExternalFile, CreateColumnFamilyWithImport, and DeleteFilesInRanges;
- verifies SetOptions still does not write to MANIFEST;
- verifies a following background flush still rotates at the normal threshold;
- preserves the persisted compacted manifest size close/reopen coverage.
@pdillinger pdillinger requested a review from xingbowang May 27, 2026 23:30
@meta-cla meta-cla Bot added the CLA Signed label May 27, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 27, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D106578771.

@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 279.4s.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit a94c122


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit a94c122


Summary

Clean, well-scoped change that reduces MANIFEST rotation latency for user-facing metadata operations by relaxing the rotation threshold by 25% for foreground edits. The implementation is correct, backwards compatible, and the is_foreground_operation_ flag is properly kept in-memory only (not serialized). The test expansion is thorough with 5 distinct phases covering bounded relaxation, auto-tuning, foreground operations, and persistence across reopen.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

No high-severity findings.

🟡 MEDIUM

M1. Missing test coverage for mixed foreground + background batch — db/db_etc3_test.cc
  • Issue: The batching logic in ProcessManifestWrites (version_set.cc:6038-6054) allows non-CF-manipulation foreground edits (e.g., IngestExternalFile) to be grouped with background edits (e.g., Flush) in the same batch. When this happens, the entire batch gets the relaxed threshold. No test covers this scenario.
  • Root cause: The test only exercises foreground-only and background-only batches.
  • Suggested fix: Add a test phase that triggers a background Flush concurrently with an IngestExternalFile to verify the mixed-batch behavior. This is difficult to make deterministic without sync points, so it may be acceptable to leave as-is if the behavior is considered obviously correct from the code.
M2. DropColumnFamily not individually tested as foreground in Phase 4 — db/db_etc3_test.cc
  • Issue: Phase 4 tests CreateColumnFamily, IngestExternalFile, CreateColumnFamilyWithImport, and DeleteFilesInRanges as foreground operations but does not test DropColumnFamily individually. While DropColumnFamily is covered via IsColumnFamilyManipulation(), explicit test coverage would strengthen confidence.
  • Root cause: Phase 4 was designed to test the newly marked operations; CF manipulation was already implicitly foreground.
  • Suggested fix: Add a DropCfFn(false) call in Phase 4 with a ASSERT_EQ check, or note in a comment that CF drop is covered by IsColumnFamilyManipulation().

🟢 LOW / NIT

L1. Removed stray blank line is unrelated cleanup — db/db_impl/db_impl.cc:6981
  • Issue: The diff removes an empty line in IngestExternalFiles between versions_->LogAndApply( arguments. This is a minor unrelated formatting fix mixed into the PR.
  • Suggested fix: None needed, but noting for reviewers.
L2. Consider UNLIKELY() hint for foreground branch — db/version_set.cc:6125
  • Issue: Most MANIFEST writes are from background flush/compaction, so has_foreground_operation is usually false. A UNLIKELY() hint could marginally improve branch prediction.
  • Root cause: ProcessManifestWrites is I/O-dominated, so the impact is negligible in practice.
  • Suggested fix: Optional: if (UNLIKELY(has_foreground_operation) && ...).
L3. The 25% relaxation value is undocumented as potentially changeable — include/rocksdb/options.h:1009
  • Issue: The public documentation hardcodes "25%" as the relaxation amount. If this changes in a future release, the documentation must be updated. Consider noting it as an implementation detail.
  • Suggested fix: Optional: Add "currently 25%" or "approximately 25%" to signal it may change.
L4. Overflow guard threshold is more conservative than necessary — db/version_set.cc:6125
  • Issue: The guard enforced_limit < (uint64_t{1} << 60) prevents overflow, but the actual overflow boundary is much higher (~UINT64_MAX * 4/5). The (1<<60) cutoff is ~12x more conservative. This is fine but means foreground relaxation silently stops applying for manifests > 1 exabyte.
  • Suggested fix: None needed; the conservatism is acceptable.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB YES (same LogAndApply) YES safe
ReadOnly DB NO (no writes) N/A safe
SecondaryInstance NO (tails MANIFEST) N/A safe
CompactionService Compaction edits on primary YES (background, not marked) safe
User-defined timestamps YES YES (rotation independent of key format) safe
Atomic groups (multi-CF ingest) YES YES (25% applied once per batch, not per edit) safe
FIFO/Universal compaction YES YES (compaction not marked foreground) safe

Assumption stress-test results:

  • "Bounded relaxation" claim: CONFIRMED. The 25% applies to the threshold once per ProcessManifestWrites call, not per edit.
  • "Auto-tuning feedback loops" concern: REFUTED. TuneMaxManifestFileSize() is only called from UpdatedMutableDbOptions(), not per-write. Self-correcting.
  • "Batch CF operations missing foreground" concern: REFUTED. IsForegroundOperation() includes IsColumnFamilyManipulation(), covering all CF add/drop automatically.
  • Integer overflow: Protected by enforced_limit < (uint64_t{1} << 60) guard. Safe.
  • tuned_max = 0 case: enforced_limit = 0, rotation always happens. Correct.

Positive Observations

  1. Clean separation: The is_foreground_operation_ flag follows the exact same pattern as is_no_manifest_write_dummy_ -- in-memory only, not serialized, with a setter and const getter.
  2. IsForegroundOperation() design: Clever use of || IsColumnFamilyManipulation() to avoid needing to mark CF add/drop at every call site.
  3. Test quality: The 5-phase test structure is well-organized. The include_background_manifest_write parameter elegantly allows testing foreground-only vs. mixed behavior.
  4. Bounded by design: The 25% is intentionally small enough that manifest growth is bounded, while providing meaningful latency improvement for remote storage.
  5. No format changes: No MANIFEST format impact, fully backwards compatible.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants