feat(btree): refactor GlobalIndexScan & add inte test for btree global index#262
lszskye wants to merge 2 commits into alibaba:main
    }
    // TODO(lisizhuo.lsz): add executor in UnionGlobalIndexReader
    readers.push_back(std::make_shared<UnionGlobalIndexReader>(std::move(union_readers),
                                                               /*executor=*/nullptr));
Add an executor for UnionGlobalIndexReader. This may require adding a parameter to GlobalIndexScan::Create().
Pull request overview
This PR refactors the global-index scanning flow to build GlobalIndexReader instances directly (including local-to-global row id rewriting and unioning across ranges) and updates downstream scan code to use RowRangeIndex for row-range filtering. It also adds new integration test coverage and test datasets for btree global index behavior (including partitioned tables / multi-meta cases).
Changes:
- Introduces `OffsetGlobalIndexReader` and `UnionGlobalIndexReader`, and refactors `GlobalIndexScanImpl` to build per-field readers by index type and row-range shards.
- Replaces various `std::vector<Range>` row-range plumbing with `RowRangeIndex`, and updates file-store filtering accordingly.
- Adds/updates integration tests and test datasets for btree global index (parquet/orc).
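To make the row-range filtering idea concrete, here is a minimal sketch of an interval index that answers "does this file's row-id span overlap any selected range?". It is an illustration under assumptions, not the PR's `RowRangeIndex`: the names `RowRange`, `SketchRowRangeIndex`, and `Intersects` are hypothetical, ranges are assumed to be inclusive on both ends, and range ends are assumed to stay below the `int64_t` maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is NOT the paimon RowRangeIndex class.
struct RowRange {
    int64_t from;  // inclusive
    int64_t to;    // inclusive
};

class SketchRowRangeIndex {
 public:
    // Sorts ranges by start and merges overlapping or adjacent ones,
    // so the stored ranges are disjoint and ordered.
    explicit SketchRowRangeIndex(std::vector<RowRange> ranges) {
        std::sort(ranges.begin(), ranges.end(),
                  [](const RowRange& a, const RowRange& b) { return a.from < b.from; });
        for (const RowRange& r : ranges) {
            if (!merged_.empty() && r.from <= merged_.back().to + 1) {
                merged_.back().to = std::max(merged_.back().to, r.to);
            } else {
                merged_.push_back(r);
            }
        }
    }

    // Returns true if any selected range overlaps [from, to].
    bool Intersects(int64_t from, int64_t to) const {
        // Binary search for the first merged range whose end is >= from.
        auto it = std::lower_bound(merged_.begin(), merged_.end(), from,
                                   [](const RowRange& r, int64_t v) { return r.to < v; });
        return it != merged_.end() && it->from <= to;
    }

 private:
    std::vector<RowRange> merged_;
};
```

With a structure like this, a scan could skip any manifest entry or data file whose row-id span does not intersect the selected ranges, using one binary search per file instead of a linear pass over a `std::vector<Range>`.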
Reviewed changes
Copilot reviewed 60 out of 147 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `include/paimon/global_index/global_index_scan.h` | Replaces range-scan APIs with CreateReaders(...) APIs and adds RowRangeIndex support. |
| `src/paimon/core/global_index/global_index_scan_impl.{h,cpp}` | New implementation that builds reader sets using offset + union wrappers and supports evaluator-based predicate scanning. |
| `src/paimon/common/global_index/{offset,union}_global_index_reader.{h,cpp}` | Adds reader wrappers for global row-id rewriting and unioning results across shards/readers. |
| `src/paimon/core/table/source/data_evolution_batch_scan.{h,cpp}` | Switches from vector-search-aware evaluation to predicate-only + RowRangeIndex filtering and indexed split wrapping. |
| `src/paimon/core/operation/file_store_scan.{h,cpp}` | Moves manifest row-range pruning into a shared RowRangeIndex-based filter. |
| `include/paimon/scan_context.h` + `src/paimon/core/operation/scan_context.cpp` | Removes vector search from scan context/filter APIs. |
| `test/inte/global_index_test.cpp` | Updates tests for new reader/evaluator APIs and adds btree global index integration tests. |
| `test/test_data/**/append_with_btree_with_partition.db/**` | Adds new snapshot/schema/test-data fixtures for btree global index integration tests. |
| `src/paimon/common/global_index/CMakeLists.txt` + `src/paimon/CMakeLists.txt` | Registers new reader wrapper sources for builds/tests. |
Comments suppressed due to low confidence (1)
include/paimon/scan_context.h:156
- This PR removes the public ScanContextBuilder::SetVectorSearch() API and ScanFilter::GetVectorSearch(), which is a breaking change for vector-search consumers. The PR description focuses on the btree/global index scan refactor but doesn't mention removing vector search support or a replacement API; please either keep backward compatibility or provide an alternative, and update the public API docs / PR description accordingly.
    std::shared_ptr<Executor> executor_;
    std::shared_ptr<FileSystem> specific_file_system_;
    std::map<std::string, std::string> options_;
};

/// Filter configuration for table scan operations
class PAIMON_EXPORT ScanFilter {
 public:
    ScanFilter(const std::shared_ptr<Predicate>& predicate,
               const std::vector<std::map<std::string, std::string>>& partition_filters,
               const std::optional<int32_t>& bucket_filter)
        : predicates_(predicate),
          bucket_filter_(bucket_filter),
          partition_filters_(partition_filters) {}

    std::shared_ptr<Predicate> GetPredicate() const {
        return predicates_;
    }
    std::optional<int32_t> GetBucketFilter() const {
        return bucket_filter_;
    }
    const std::vector<std::map<std::string, std::string>>& GetPartitionFilters() const {
        return partition_filters_;
    }

 private:
    std::shared_ptr<Predicate> predicates_;
    std::optional<int32_t> bucket_filter_;
    std::vector<std::map<std::string, std::string>> partition_filters_;
};

/// `ScanContextBuilder` used to build a `ScanContext`, has input validation.
class PAIMON_EXPORT ScanContextBuilder {
 public:
    /// Constructs a `ScanContextBuilder` with required parameters.
    /// @param path The root path of the table.
    explicit ScanContextBuilder(const std::string& path);
    ~ScanContextBuilder();

    /// If limit is not set, it defaults to unlimited.
    ScanContextBuilder& SetLimit(int32_t limit);
    /// Set a bucket filter to scan only specific bucket.
    ScanContextBuilder& SetBucketFilter(int32_t bucket_filter);
    /// partition_filters in vector is supposed to be OR, filter in map is supposed to be AND, e.g.,
    /// partition_filters is {{k1=1,k2=10}, {k1=2,k2=20}} => OR(AND(k1=1,k2=10), AND(k1=2,k2=20))
    ScanContextBuilder& SetPartitionFilter(
        const std::vector<std::map<std::string, std::string>>& partition_filters);
    /// Set a predicate for filtering data.
    ScanContextBuilder& SetPredicate(const std::shared_ptr<Predicate>& predicate);
    /// Sets the result of a global index search (e.g., row ids (may with scores) from a distributed
    /// index lookup). This is used to push down index-filtered row ids into the scan for efficient
    /// data retrieval.
    ScanContextBuilder& SetGlobalIndexResult(
        const std::shared_ptr<GlobalIndexResult>& global_index_result);
    /// The options added or set in `ScanContextBuilder` have high priority and will be merged with
    /// the options in table schema.
    ScanContextBuilder& AddOption(const std::string& key, const std::string& value);
    /// Set a configuration options map to set some option entries which are not defined in the
    /// table schema or whose values you want to overwrite.
    /// @note The options map will clear the options added by `AddOption()` before.
Purpose
Linked issue: #38
Introduce:
- `OffsetGlobalIndexReader`: wraps any `GlobalIndexReader` and rewrites the row ids in its results by adding a fixed offset. This is how local row ids become global row ids.
- `UnionGlobalIndexReader`: wraps a list of `GlobalIndexReader`s and unions their results via `GlobalIndexResult::Or`. It can run sub-readers in parallel through an `Executor`, or sequentially when no executor is given.
Add integration tests for the btree global index.
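The two wrappers described in this section can be pictured with a small self-contained sketch. Everything in it is a hypothetical stand-in rather than the code in this PR: `SketchReader`, `FixedReader`, `OffsetReader`, `UnionReader`, and `UnionOfTwoShards` are illustrative names, a plain `std::set<int64_t>` replaces `GlobalIndexResult`, set insertion replaces `GlobalIndexResult::Or`, and a `parallel` flag backed by `std::async` stands in for passing an `Executor`.

```cpp
#include <cstdint>
#include <future>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the paimon-cpp interfaces.
using RowIds = std::set<int64_t>;

struct SketchReader {
    virtual ~SketchReader() = default;
    virtual RowIds Read() const = 0;
};

// Returns a fixed set of local row ids, as a per-shard index reader might.
struct FixedReader : SketchReader {
    explicit FixedReader(RowIds ids) : ids_(std::move(ids)) {}
    RowIds Read() const override { return ids_; }
    RowIds ids_;
};

// Adds a fixed offset to every row id, turning local ids into global ids.
struct OffsetReader : SketchReader {
    OffsetReader(std::shared_ptr<SketchReader> inner, int64_t offset)
        : inner_(std::move(inner)), offset_(offset) {}
    RowIds Read() const override {
        RowIds out;
        for (int64_t id : inner_->Read()) out.insert(id + offset_);
        return out;
    }
    std::shared_ptr<SketchReader> inner_;
    int64_t offset_;
};

// Unions sub-reader results. `parallel` stands in for "an executor was given".
struct UnionReader : SketchReader {
    UnionReader(std::vector<std::shared_ptr<SketchReader>> readers, bool parallel)
        : readers_(std::move(readers)), parallel_(parallel) {}
    RowIds Read() const override {
        RowIds out;
        if (!parallel_) {
            // Sequential path: read each sub-reader in turn and merge.
            for (const auto& r : readers_) {
                RowIds part = r->Read();
                out.insert(part.begin(), part.end());
            }
            return out;
        }
        // Parallel path: launch one task per sub-reader, then merge the results.
        std::vector<std::future<RowIds>> futures;
        for (const auto& r : readers_) {
            futures.push_back(std::async(std::launch::async, [r] { return r->Read(); }));
        }
        for (auto& f : futures) {
            RowIds part = f.get();
            out.insert(part.begin(), part.end());
        }
        return out;
    }
    std::vector<std::shared_ptr<SketchReader>> readers_;
    bool parallel_;
};

// Two shards with local ids {0, 2} and {1}, placed at global offsets 0 and 100.
RowIds UnionOfTwoShards(bool parallel) {
    auto a = std::make_shared<OffsetReader>(std::make_shared<FixedReader>(RowIds{0, 2}), 0);
    auto b = std::make_shared<OffsetReader>(std::make_shared<FixedReader>(RowIds{1}), 100);
    std::vector<std::shared_ptr<SketchReader>> readers{a, b};
    return UnionReader(std::move(readers), parallel).Read();
}
```

In this sketch the second shard's local row id 1 becomes global row id 101, so the union is {0, 2, 101} on either path. Composing the offset wrapper inside the union wrapper keeps each shard's reader unaware of its global position.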
Tests
GlobalIndexTest