Optimize MultiScan dispatch for sorted blocks#14783

Open
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:codex/multiscan-sorted-dispatch

Conversation

@xingbowang
Contributor

@xingbowang xingbowang commented May 25, 2026

Summary

  • Propagate validated, sorted MultiScan range state from DBIter::Prepare() through MultiScanArgs.
  • Mark block-based table IO jobs as already sorted when the public MultiScan ranges have been validated.
  • Keep IODispatcher::SubmitJob() as the normalization boundary for unsorted callers, while allowing sorted callers to skip the defensive block-handle sort.
  • Update private dispatcher coalescing helpers to consume sorted block indices and add debug assertions for that precondition.
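
The dispatch rule in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RocksDB implementation: `BlockHandle`, `Job`, and `SortedIndexForJob` are hypothetical stand-ins, and only the flag name `block_handles_are_sorted` is taken from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a block handle: offset and size within one file.
struct BlockHandle {
  uint64_t offset;
  uint64_t size;
};

// Hypothetical stand-in for an IO job submitted to the dispatcher.
struct Job {
  std::vector<BlockHandle> block_handles;
  bool block_handles_are_sorted = false;  // conservative default: sort
};

// Builds the index vector that downstream helpers consume. This is the only
// place that normalizes order: unsorted callers pay for one sort, while
// callers that vouch for sorted input skip it.
std::vector<size_t> SortedIndexForJob(const Job& job) {
  std::vector<size_t> idx(job.block_handles.size());
  std::iota(idx.begin(), idx.end(), size_t{0});
  if (!job.block_handles_are_sorted) {
    std::sort(idx.begin(), idx.end(), [&job](size_t a, size_t b) {
      return job.block_handles[a].offset < job.block_handles[b].offset;
    });
  }
  return idx;
}
```

Because the flag defaults to `false`, a caller that never sets it keeps the defensive sort; the skip only applies when the caller explicitly vouches for the order.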

Testing

CI, new unit test

@meta-cla meta-cla Bot added the CLA Signed label May 25, 2026
@github-actions

github-actions Bot commented May 25, 2026

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 399.3s.

Summary by check

| Check | Count |
| --- | --- |
| cppcoreguidelines-special-member-functions | 1 |
| Total | 1 |

Details

db/db_iterator_test.cc (1 warning(s))
db/db_iterator_test.cc:4421:10: warning: class 'SyncPointGuard' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
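
One common way to address this kind of warning is to declare all five special member functions, deleting copy and move for a scope guard. The sketch below assumes a guard that toggles some global test state; `FakeSyncPoint` is a hypothetical stand-in, and the body of the real `SyncPointGuard` in the test is not shown in this conversation.

```cpp
#include <cassert>

// Hypothetical stand-in for global sync-point state toggled by a test.
struct FakeSyncPoint {
  bool enabled = false;
};

// RAII guard: enables processing on construction, disables on destruction.
// Copy and move operations are deleted explicitly, so all five special
// member functions are declared and the guard has exactly one owner.
class SyncPointGuard {
 public:
  explicit SyncPointGuard(FakeSyncPoint* sp) : sp_(sp) { sp_->enabled = true; }
  ~SyncPointGuard() { sp_->enabled = false; }

  SyncPointGuard(const SyncPointGuard&) = delete;
  SyncPointGuard& operator=(const SyncPointGuard&) = delete;
  SyncPointGuard(SyncPointGuard&&) = delete;
  SyncPointGuard& operator=(SyncPointGuard&&) = delete;

 private:
  FakeSyncPoint* sp_;
};
```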

@meta-codesync

meta-codesync Bot commented May 25, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106301516.

@github-actions

github-actions Bot commented May 25, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit ae391b5


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented May 25, 2026

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit ae391b5


Summary

Sound optimization that eliminates redundant O(n log n) sorts in the MultiScan IO dispatch path by propagating a "pre-sorted" flag from the validation layer. The core invariant (sorted scan ranges produce offset-sorted block handles in block-based tables) is correct. One significant gap undermines effectiveness in production.

High-severity findings (1):

  • [options.h:2026] CopyConfigFrom() does not copy scan_ranges_are_sorted, causing the optimization flag to be lost in the production LevelIterator path (version_set.cc:1217).

Findings

🔴 HIGH

H1. CopyConfigFrom() missing scan_ranges_are_sorted -- include/rocksdb/options.h:2026
  • Issue: CopyConfigFrom() copies io_coalesce_threshold, max_prefetch_size, use_async_io, and io_dispatcher, but does NOT copy the new scan_ranges_are_sorted field.
  • Root cause: The PR updated the copy constructor, move constructor, copy assignment operator, and move assignment operator to include the new field, but missed CopyConfigFrom().
  • Impact: In the production path, LevelIterator::Prepare() at db/version_set.cc:1217 calls CopyConfigFrom() to propagate config to per-file MultiScanArgs. The sorted flag is lost, defaulting to false, so IODispatcher::SubmitJob() always performs the defensive sort. The optimization is effectively dead in the main production code path. This is not a correctness bug (false = sort = safe), but the optimization never fires where it matters most.
  • Suggested fix: Add scan_ranges_are_sorted = other.scan_ranges_are_sorted; to CopyConfigFrom().
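
The suggested fix can be illustrated with a trimmed-down sketch. `ScanArgsSketch` is a hypothetical stand-in for `MultiScanArgs`; the field names follow the finding above, while the types and defaults are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down stand-in for MultiScanArgs. Field names follow
// the review text; types and defaults are assumptions.
struct ScanArgsSketch {
  uint64_t io_coalesce_threshold = 0;
  uint64_t max_prefetch_size = 0;
  bool use_async_io = false;
  bool scan_ranges_are_sorted = false;

  // Copies configuration (not the ranges themselves) from another instance.
  void CopyConfigFrom(const ScanArgsSketch& other) {
    io_coalesce_threshold = other.io_coalesce_threshold;
    max_prefetch_size = other.max_prefetch_size;
    use_async_io = other.use_async_io;
    // The line the review asks for: without it, every per-file copy falls
    // back to the default (false) and the dispatcher sorts again.
    scan_ranges_are_sorted = other.scan_ranges_are_sorted;
  }
};
```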

🟡 MEDIUM

M1. Public field allows callers to bypass validation — include/rocksdb/options.h:2054
  • Issue: scan_ranges_are_sorted is a public field on MultiScanArgs. Any caller can set it to true without going through DBIter::Prepare() validation, potentially causing IODispatcher to skip sorting on unsorted block handles.
  • Impact: In release builds, the removed sorts in PrepareIORequests() and PreCoalesceBlocks() are replaced only by debug assertions. An incorrectly set flag would produce silently wrong IO coalescing.
  • Suggested fix: Consider making scan_ranges_are_sorted private with a getter, or document clearly that it is an internal optimization hint that should not be set by users. The existing block_handles_are_sorted on JobOptions has the same concern but is on a less user-facing struct.
M2. No integration test for the full optimization path — tests
  • Issue: The new tests verify individual pieces (DBIter sets flag, IODispatcher skips sort), but there is no integration test that exercises the complete path: DBIter::Prepare() → LevelIterator → BlockBasedTableIterator::Prepare() → IODispatcher::SubmitJob() with the sorted optimization active. Such a test would have caught the CopyConfigFrom() gap.
  • Suggested fix: Add an integration test (perhaps in db_test.cc or a MultiScan integration test) that verifies the sort sync point is NOT hit when using the full MultiScan path with sorted ranges.

🟢 LOW / NIT

L1. Test only verifies sync point hit count, not read correctness — util/io_dispatcher_test.cc:837
  • Issue: SortedBlockHandlesSkipDispatcherSorts verifies that the sort sync point is/isn't hit, but doesn't verify that the data read back is actually correct when sorting is skipped. A test that reads blocks and verifies content would provide stronger assurance.
  • Suggested fix: The test already reads blocks via ReadIndex, which is good. Consider adding content verification (check block data matches expected values).
L2. block_handles_are_sorted on JobOptions in public header — include/rocksdb/io_dispatcher.h:157
  • Issue: JobOptions is in a public header. The block_handles_are_sorted field is an internal optimization detail that users shouldn't need to interact with.
  • Suggested fix: Acceptable as-is since IOJob construction is already an advanced/internal use case.

Cross-Component Analysis

| Context | Affected? | Analysis |
| --- | --- | --- |
| Standard block-based table | YES | Key order = offset order. Invariant holds. |
| Partitioned indexes | YES | Each partition still maintains key=offset order. Safe. |
| WritePreparedTxnDB | NO | Same iterator path, no effect on block ordering. |
| ReadOnly DB | NO | Same read path. |
| User-defined timestamps | NO | Timestamps don't affect block physical ordering. |
| LevelIterator (multi-file) | YES | H1 applies — flag lost via CopyConfigFrom(). |
| Direct BBTI::Prepare (tests) | YES | Works correctly when flag is propagated directly. |

Sorted-order invariant verification:

  • DBIter::ValidateScanOptions() rejects overlapping ranges (start[i] must be >= end[i-1])
  • CollectBlockHandles() iterates the index forward via Seek() + Next(), producing block handles in offset order
  • Overlap dedup (check_overlap flag) only applies to the first block of each new scan range, correctly handling the case where adjacent non-overlapping ranges share a boundary block
  • The invariant holds: validated sorted non-overlapping scan ranges → offset-sorted block handles

Assumption stress-test:

  • Claim: "scan ranges sorted by key → block handles sorted by offset"
  • Precondition: block-based table stores data blocks in key order (true by design)
  • Precondition: non-overlapping ranges processed in order produce monotonically advancing index positions (true: Seek starts at/after prior range's end)
  • No counterexample found for standard block-based tables
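
The invariant argued above can be exercised with a toy model. This is a simplified sketch, not the actual `CollectBlockHandles()` logic: `IndexEntry`, `Range`, and `CollectOffsets` are hypothetical, the index is a plain vector ordered by key, and the boundary dedup is applied on every block rather than only on the first block of a range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical index entry: the last key stored in a data block and the
// block's file offset. Entries are ordered by key, hence also by offset.
struct IndexEntry {
  std::string last_key;
  uint64_t offset;
};

// Half-open key range [start, limit).
using Range = std::pair<std::string, std::string>;

// Collects the offsets of blocks that may hold keys in each range. Ranges
// must be sorted and non-overlapping; the result is then sorted by offset.
std::vector<uint64_t> CollectOffsets(const std::vector<IndexEntry>& index,
                                     const std::vector<Range>& ranges) {
  std::vector<uint64_t> out;
  for (const Range& r : ranges) {
    // Seek: first block whose last key is >= the range start.
    auto it = std::lower_bound(
        index.begin(), index.end(), r.first,
        [](const IndexEntry& e, const std::string& k) { return e.last_key < k; });
    for (; it != index.end(); ++it) {
      // Skip a boundary block already emitted by the previous range.
      if (out.empty() || out.back() != it->offset) {
        out.push_back(it->offset);
      }
      // Next: stop once this block reaches or passes the range limit.
      if (it->last_key >= r.second) break;
    }
  }
  return out;
}
```

In this model, two adjacent ranges that end and start inside the same block emit that block once, and the output stays ordered by offset because each seek starts at or after the position where the previous range stopped.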

Positive Observations

  • The optimization is fail-safe: block_handles_are_sorted defaults to false, so all existing code paths and new code paths that don't explicitly set it continue to sort defensively.
  • The block_indices_to_read.reserve() addition is a good micro-optimization.
  • Iterating cache lookups in offset order (the new behavior even for the unsorted path) is actually better for cache locality than the previous arbitrary-order iteration.
  • The BlockIndicesAreSortedByOffset debug assertion function is a clean way to validate the invariant without release overhead.
  • The TEST_SYNC_POINT placement enables precise testing of the sort-skip behavior.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

MultiScan collects data block handles in scan order after DBIter validates that the public scan ranges are sorted and non-overlapping. Propagate that validation through MultiScanArgs and IOJob so IODispatcher can skip its public-boundary sort when the submitted block handle vector is already ordered by file offset.

SubmitJob remains the only dispatcher boundary that normalizes block order. When callers do not set block_handles_are_sorted, SubmitJob sorts the ReadSet index once and cache-miss collection walks that sorted index. Private helpers such as PreCoalesceBlocks and PrepareIORequests now consume already-sorted block index vectors instead of performing their own defensive sorts; debug assertions validate that precondition for future private-call changes.

Keep the scan-ranges-sorted bit internal to MultiScanArgs: DBIter is the only setter after validation, while internal readers use a getter. CopyConfigFrom copies the bit so LevelIterator per-file scan arguments preserve the optimization in the production path.

Tests:

  • Built db_iter_test, db_iterator_test, and io_dispatcher_test.
  • Ran timeout 60 ./db_iterator_test --gtest_filter=DBMultiScanIteratorTest/DBMultiScanIteratorTest.SortedRangesSkipIODispatcherSort/*
  • Ran timeout 60 ./db_iter_test --gtest_filter=DBIteratorTest.PrepareMarksValidatedScanRangesSorted:DBIteratorTest.PrepareDoesNotForwardInvalidScanRanges
  • Ran timeout 60 ./io_dispatcher_test --gtest_filter=IODispatcherTest.SortedBlockHandlesSkipDispatcherSorts:IODispatcherTest.VerifyReadRequestDetails:IODispatcherTest.VerifyCoalescing:IODispatcherTest.MemoryAccountingWithPartialGroupConsumption:IODispatcherTest.CoalescedGroupsSplitByMemoryBudget:IODispatcherTest.MemoryReleasedAfterReadIndexThenReleaseBlock
  • Ran git diff --check.
@xingbowang xingbowang closed this May 27, 2026
@xingbowang xingbowang force-pushed the codex/multiscan-sorted-dispatch branch from ae391b5 to 38e616d Compare May 27, 2026 12:48
@xingbowang xingbowang reopened this May 27, 2026
@meta-codesync

meta-codesync Bot commented May 27, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106301516.

@xingbowang xingbowang marked this pull request as draft May 27, 2026 21:50
@github-actions

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 38e616d


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.


@github-actions

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 38e616d


Summary

This PR eliminates three redundant O(N log N) sorts in the MultiScan IO dispatch pipeline by propagating a "sorted" flag from DBIter::Prepare() through to IODispatcher. The core invariant -- sorted, validated scan ranges imply sorted block handles within a single SST file -- has been verified as sound. The implementation is clean with appropriate encapsulation (private setter, friend class). One medium-severity concern about debug-only assertion safety in internal helpers, and a few low-severity suggestions.

High-severity findings (0):

No high-severity findings.

Findings

🟡 MEDIUM

M1. Debug-only assertions replacing sorts in internal helpers -- util/io_dispatcher_imp.cc:842,903
  • Issue: PrepareIORequests() and PreCoalesceBlocks() previously sorted their input defensively. The PR removes those sorts and replaces them with assert(BlockIndicesAreSortedByOffset(...)), which only fires in debug builds. In release builds, if an internal refactor introduces an unsorted caller, the functions will silently produce incorrect IO coalescing (potentially reading wrong data or missing blocks).
  • Root cause: The sorted invariant is now established at the SubmitJob boundary and propagated downward. The internal helpers trust this invariant without runtime verification.
  • Why this is MEDIUM, not HIGH: Currently, all callers of PrepareIORequests and PreCoalesceBlocks originate from SubmitJob, which always ensures sorted order (either by sorting or trusting the flag). The fallback path from ExecuteAsyncIO and DrainPendingPrefetchGroups both preserve sorted order from the already-sorted coalesced_block_indices. The risk is future maintenance, not current correctness.
  • Suggested fix: Consider keeping a lightweight O(N) check in release builds (checking only that each offset >= previous) in these two functions, at least until the optimization matures. Alternatively, document the sorted precondition prominently in the function signatures.
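
A release-mode check of the kind suggested here could look like the sketch below. The names `IndicesSortedByOffset` and `EnsureSortedByOffset` are hypothetical (the PR's own helper is `BlockIndicesAreSortedByOffset`, whose body is not shown in this conversation), offsets are modeled as a plain vector, and every index is assumed to be in range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the offsets selected by `indices` never decrease.
// One linear pass with no allocation.
// Precondition (assumed): every index is < offsets.size().
bool IndicesSortedByOffset(const std::vector<uint64_t>& offsets,
                           const std::vector<size_t>& indices) {
  for (size_t i = 1; i < indices.size(); ++i) {
    if (offsets[indices[i]] < offsets[indices[i - 1]]) return false;
  }
  return true;
}

// Hypothetical release-mode guard: verify in O(N), and fall back to sorting
// only when the precondition is violated.
void EnsureSortedByOffset(const std::vector<uint64_t>& offsets,
                          std::vector<size_t>* indices) {
  if (IndicesSortedByOffset(offsets, *indices)) return;
  std::sort(indices->begin(), indices->end(), [&offsets](size_t a, size_t b) {
    return offsets[a] < offsets[b];
  });
}
```

The trade-off is one extra linear pass per call in exchange for never coalescing over unsorted input; whether that cost is acceptable on this path is a judgment for the PR author.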

🟢 LOW / NIT

L1. block_handles_are_sorted in public header -- include/rocksdb/io_dispatcher.h:157
  • Issue: JobOptions::block_handles_are_sorted is added to a public header. External users implementing custom IODispatcher usage could set this flag to true for unsorted handles, bypassing sorting and causing incorrect IO behavior.
  • Suggested fix: Add a comment warning that setting this incorrectly causes undefined behavior. Or consider making it a non-public detail if JobOptions can be made internal.
L2. scan_ranges_are_sorted_ naming could be more precise -- include/rocksdb/options.h:2074
  • Issue: The field name scan_ranges_are_sorted_ describes the scan ranges, but the downstream effect is that block handles are sorted. A reader might wonder whether "sorted" means "sorted by start key" (which is the user-facing contract) vs "validated to guarantee sorted block handles" (which is the actual semantic).
  • Suggested fix: The current name is acceptable given the comment explains the semantics. No change required, just a nit.
L3. reserve() added but not in pre-existing pattern -- util/io_dispatcher_imp.cc:728
  • Issue: block_indices_to_read.reserve(job->block_handles.size()) is added. This is a good practice (avoids reallocations) but was not present before. It's a positive change but slightly inconsistent with the original code style.
  • Suggested fix: Keep as-is. This is a minor improvement.

Cross-Component Analysis

| Context | Does code execute? | Assumptions hold? | Action needed? |
| --- | --- | --- | --- |
| WritePreparedTxnDB | YES (same iterator path) | YES | None |
| ReadOnly DB | YES | YES | None |
| User-defined timestamps | YES | YES (CollectBlockHandles handles timestamps) | None |
| Direct BBT::Prepare (tests) | YES | Safe - default false | None |
| Partitioned index | YES | YES (partitions maintain offset order) | None |
| Async IO fallback | YES | YES (fallback indices from sorted coalesced groups) | None |
| DrainPendingPrefetchGroups | YES | YES (sequential group processing preserves order) | None |

Assumption stress-test:

  • Claim: "Sorted scan ranges imply sorted block handles" -- Verified: CollectBlockHandles seeks forward through the index iterator for each sorted range. Within a single SST file, the index iterator returns blocks in offset order. Boundary block deduplication (skip if back() == current) preserves sort order. Confirmed sound.
  • Claim: "block_indices_to_read is built in sorted order after SubmitJob change" -- Verified: The loop now iterates rs->sorted_block_indices_ (which are raw indices sorted by offset). Cache misses push the raw index i in sorted order.
  • Claim: "Fallback indices maintain sorted order" -- Verified: ExecuteAsyncIO iterates coalesced groups in order, pushing block indices from each group sequentially.

Positive Observations

  • Good encapsulation: MarkScanRangesAsSorted() is private with friend class DBIter, preventing incorrect external usage.
  • Conservative default: block_handles_are_sorted defaults to false, ensuring no change for existing callers.
  • Complete copy/move semantics: scan_ranges_are_sorted_ is correctly propagated in copy constructor, move constructor, copy assignment, move assignment, and CopyConfigFrom().
  • Original MultiScanArgs unmodified: The test PrepareMarksValidatedScanRangesSorted verifies the user's MultiScanArgs object is not mutated -- only the internal copy gets the flag.
  • Good test coverage: The new tests cover (1) flag propagation through Prepare, (2) rejection of invalid ranges, (3) end-to-end verification that the sort is skipped via sync point, and (4) a unit-level IODispatcher test with both sorted and unsorted paths.
  • Debug assertion helper: BlockIndicesAreSortedByOffset is well-implemented with bounds checking and is cleanly guarded by #ifndef NDEBUG.
  • Performance improvement: Eliminates up to 3 O(N log N) sorts and 2 vector copies per SST file per MultiScan dispatch.
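
The encapsulation pattern noted above (private setter, friend class, public getter) can be sketched as follows. `ScanArgs` and `DBIterSketch` are hypothetical stand-ins, the getter name is an assumption, and returning an unmarked copy for invalid ranges is a simplification; only `MarkScanRangesAsSorted()` and `scan_ranges_are_sorted_` are names from the review.

```cpp
#include <cassert>

class DBIterSketch;  // hypothetical stand-in for DBIter

// Hypothetical, trimmed-down stand-in for MultiScanArgs.
class ScanArgs {
 public:
  // Read-only view for internal consumers (getter name is an assumption).
  bool scan_ranges_are_sorted() const { return scan_ranges_are_sorted_; }

 private:
  friend class DBIterSketch;
  // Only the validating iterator may flip the bit.
  void MarkScanRangesAsSorted() { scan_ranges_are_sorted_ = true; }

  bool scan_ranges_are_sorted_ = false;
};

class DBIterSketch {
 public:
  // Returns an internal copy that is marked sorted when validation passed.
  // The caller's object is never mutated.
  static ScanArgs Prepare(const ScanArgs& user_args, bool ranges_valid) {
    ScanArgs internal = user_args;
    if (ranges_valid) internal.MarkScanRangesAsSorted();
    return internal;
  }
};
```

Code outside `DBIterSketch` cannot call `MarkScanRangesAsSorted()`, so an external caller has no way to claim sorted ranges without going through validation.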


@meta-codesync

meta-codesync Bot commented May 27, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106301516.

@xingbowang xingbowang marked this pull request as ready for review May 28, 2026 00:02
Comment thread db/db_iterator_test.cc
ASSERT_OK(db_->CompactRange({}, nullptr, nullptr));
ASSERT_EQ(2, NumTableFilesAtLevel(49));

struct SyncPointGuard {
Contributor

nit: is this necessary? Can't we just add it to the end of the test, assuming the test always passes?

Comment thread include/rocksdb/options.h
// and non-overlapping at the public Iterator::Prepare API boundary.
// Internal iterators use this to preserve sorted block handle order and avoid
// repeated dispatcher-side sorting.
bool scan_ranges_are_sorted_ = false;
Contributor

Is there a scenario when this is actually false? Seems like DBIter always sets it to true


3 participants