feat(btree): refactor GlobalIndexScan & add inte test for btree global index#262

Open
lszskye wants to merge 2 commits into alibaba:main from lszskye:global_index_eval

@lszskye lszskye commented May 6, 2026

Purpose

Linked issue: #38

Introduce:

  • OffsetGlobalIndexReader: wraps any GlobalIndexReader and rewrites
    the row ids in its results by adding a fixed offset. This is how local
    row ids become global row ids.
  • UnionGlobalIndexReader: wraps a list of GlobalIndexReaders and
    unions their results via GlobalIndexResult::Or. It can run sub-readers
    in parallel through an Executor, or sequentially when no executor is
    given.
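
The two wrappers described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RowIds`, `Reader`, `FixedReader`, `OffsetReader`, `UnionReader`, and `ExampleLookup` are hypothetical stand-ins, a plain `std::set` replaces `GlobalIndexResult`, and the union runs sequentially (the no-executor case).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for GlobalIndexResult: just a sorted set of row ids.
using RowIds = std::set<int64_t>;

// Hypothetical reader interface; the real GlobalIndexReader is richer.
class Reader {
 public:
    virtual ~Reader() = default;
    virtual RowIds Lookup(int key) const = 0;
};

// Toy reader that returns the same local row ids for every key.
class FixedReader : public Reader {
 public:
    explicit FixedReader(RowIds ids) : ids_(std::move(ids)) {}
    RowIds Lookup(int /*key*/) const override { return ids_; }

 private:
    RowIds ids_;
};

// Adds a fixed offset to each row id, turning local ids into global ids.
class OffsetReader : public Reader {
 public:
    OffsetReader(std::shared_ptr<Reader> inner, int64_t offset)
        : inner_(std::move(inner)), offset_(offset) {
        assert(inner_ != nullptr);
    }
    RowIds Lookup(int key) const override {
        RowIds out;
        for (int64_t id : inner_->Lookup(key)) out.insert(id + offset_);
        return out;
    }

 private:
    std::shared_ptr<Reader> inner_;
    int64_t offset_;
};

// Unions (ORs) the results of all sub-readers, one after another.
class UnionReader : public Reader {
 public:
    explicit UnionReader(std::vector<std::shared_ptr<Reader>> readers)
        : readers_(std::move(readers)) {}
    RowIds Lookup(int key) const override {
        RowIds out;
        for (const auto& reader : readers_) {
            RowIds part = reader->Lookup(key);
            out.insert(part.begin(), part.end());
        }
        return out;
    }

 private:
    std::vector<std::shared_ptr<Reader>> readers_;
};

// Two shards; the second shard's rows start at global row id 100.
inline RowIds ExampleLookup() {
    std::vector<std::shared_ptr<Reader>> shards;
    shards.push_back(std::make_shared<OffsetReader>(
        std::make_shared<FixedReader>(RowIds{1, 5}), 0));
    shards.push_back(std::make_shared<OffsetReader>(
        std::make_shared<FixedReader>(RowIds{1, 7}), 100));
    UnionReader all(std::move(shards));
    return all.Lookup(42);
}
```

In this sketch, local ids {1, 5} and {1, 7} become global ids {1, 5} and {101, 107} before being merged, so the same local id in different shards no longer collides.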

Add integration tests for the btree global index.

Tests

GlobalIndexTest

Comment thread src/paimon/core/global_index/global_index_evaluator.h
}
// TODO(lisizhuo.lsz): add executor in UnionGlobalIndexReader
readers.push_back(std::make_shared<UnionGlobalIndexReader>(std::move(union_readers),
/*executor=*/nullptr));

Please add an executor for UnionGlobalIndexReader instead of passing nullptr. One option is a new parameter on GlobalIndexScan::Create().
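
One way the suggested plumbing might look, as a sketch only: `ToyExecutor`, `InlineExecutor`, `ToyUnionReader`, and `CreateUnionReader` are hypothetical names, not the project's `Executor` or `GlobalIndexScan::Create()` signatures. The inline executor runs tasks on the calling thread so the example stays dependency-free; a real executor would dispatch to a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using RowIds = std::set<int64_t>;           // stand-in for GlobalIndexResult
using SubReader = std::function<RowIds()>;  // stand-in for one sub-reader call

// Hypothetical executor interface; not the project's Executor.
class ToyExecutor {
 public:
    virtual ~ToyExecutor() = default;
    virtual void Submit(std::function<void()> task) = 0;
    virtual void WaitAll() = 0;
};

// Runs each task immediately on the calling thread. A real executor would
// hand tasks to a thread pool and block in WaitAll() until they finish.
class InlineExecutor : public ToyExecutor {
 public:
    void Submit(std::function<void()> task) override {
        task();
        ++submitted_;
    }
    void WaitAll() override {}
    int submitted() const { return submitted_; }

 private:
    int submitted_ = 0;
};

class ToyUnionReader {
 public:
    ToyUnionReader(std::vector<SubReader> readers, std::shared_ptr<ToyExecutor> executor)
        : readers_(std::move(readers)), executor_(std::move(executor)) {
        for (const auto& reader : readers_) assert(reader);  // no empty callables
    }

    RowIds Read() const {
        // One slot per sub-reader, so tasks never write to shared state.
        std::vector<RowIds> parts(readers_.size());
        for (std::size_t i = 0; i < readers_.size(); ++i) {
            auto task = [this, &parts, i] { parts[i] = readers_[i](); };
            if (executor_) {
                executor_->Submit(task);
            } else {
                task();  // sequential fallback, as with /*executor=*/nullptr
            }
        }
        if (executor_) executor_->WaitAll();
        RowIds out;
        for (const RowIds& part : parts) out.insert(part.begin(), part.end());
        return out;
    }

 private:
    std::vector<SubReader> readers_;
    std::shared_ptr<ToyExecutor> executor_;
};

// Hypothetical factory showing the executor as an optional parameter.
inline ToyUnionReader CreateUnionReader(std::vector<SubReader> readers,
                                        std::shared_ptr<ToyExecutor> executor = nullptr) {
    return ToyUnionReader(std::move(readers), std::move(executor));
}
```

Defaulting the parameter to nullptr keeps existing call sites compiling and preserves the current sequential behavior until a caller opts in.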

Copilot AI left a comment

Pull request overview

This PR refactors the global-index scanning flow to build GlobalIndexReader instances directly (including local-to-global row id rewriting and unioning across ranges) and updates downstream scan code to use RowRangeIndex for row-range filtering. It also adds new integration test coverage and test datasets for btree global index behavior (including partitioned tables / multi-meta cases).

Changes:

  • Introduces OffsetGlobalIndexReader and UnionGlobalIndexReader, and refactors GlobalIndexScanImpl to build per-field readers by index type and row-range shards.
  • Replaces various std::vector<Range> row-range plumbing with RowRangeIndex, and updates file-store filtering accordingly.
  • Adds/updates integration tests and test datasets for btree global index (parquet/orc).
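
To illustrate why a dedicated index type can be preferable to passing a raw `std::vector<Range>` around, here is a small sketch of a range index that normalizes its input once and answers overlap queries by binary search. The `Range` and `ToyRowRangeIndex` types below are assumptions made for illustration; the actual `RowRangeIndex` interface in the PR may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive row range [from, to].
struct Range {
    int64_t from;
    int64_t to;
};

// Toy index: sorts and merges ranges once, then answers overlap queries
// in O(log n) instead of scanning the whole vector each time.
class ToyRowRangeIndex {
 public:
    explicit ToyRowRangeIndex(std::vector<Range> ranges) {
        std::sort(ranges.begin(), ranges.end(),
                  [](const Range& a, const Range& b) { return a.from < b.from; });
        for (const Range& r : ranges) {
            assert(r.from <= r.to);
            if (!ranges_.empty() && r.from <= ranges_.back().to) {
                // Overlaps the previous range: extend it.
                ranges_.back().to = std::max(ranges_.back().to, r.to);
            } else {
                ranges_.push_back(r);
            }
        }
    }

    // True if [from, to] overlaps any stored range.
    bool Intersects(int64_t from, int64_t to) const {
        // First stored range whose end is not before `from`.
        auto it = std::lower_bound(ranges_.begin(), ranges_.end(), from,
                                   [](const Range& r, int64_t v) { return r.to < v; });
        return it != ranges_.end() && it->from <= to;
    }

    std::size_t size() const { return ranges_.size(); }

 private:
    std::vector<Range> ranges_;  // sorted, pairwise disjoint
};
```

A file or manifest entry covering rows [first, last] could then be pruned whenever `Intersects(first, last)` returns false.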

Reviewed changes

Copilot reviewed 60 out of 147 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| include/paimon/global_index/global_index_scan.h | Replaces range-scan APIs with CreateReaders(...) APIs and adds RowRangeIndex support. |
| src/paimon/core/global_index/global_index_scan_impl.{h,cpp} | New implementation that builds reader sets using offset + union wrappers and supports evaluator-based predicate scanning. |
| src/paimon/common/global_index/{offset,union}_global_index_reader.{h,cpp} | Adds reader wrappers for global row-id rewriting and unioning results across shards/readers. |
| src/paimon/core/table/source/data_evolution_batch_scan.{h,cpp} | Switches from vector-search-aware evaluation to predicate-only + RowRangeIndex filtering and indexed split wrapping. |
| src/paimon/core/operation/file_store_scan.{h,cpp} | Moves manifest row-range pruning into a shared RowRangeIndex-based filter. |
| include/paimon/scan_context.h + src/paimon/core/operation/scan_context.cpp | Removes vector search from scan context/filter APIs. |
| test/inte/global_index_test.cpp | Updates tests for new reader/evaluator APIs and adds btree global index integration tests. |
| test/test_data/**/append_with_btree_with_partition.db/** | Adds new snapshot/schema/test-data fixtures for btree global index integration tests. |
| src/paimon/common/global_index/CMakeLists.txt + src/paimon/CMakeLists.txt | Registers new reader wrapper sources for builds/tests. |

Comments suppressed due to low confidence (1)

include/paimon/scan_context.h:156

  • This PR removes the public ScanContextBuilder::SetVectorSearch() API and ScanFilter::GetVectorSearch(), which is a breaking change for vector-search consumers. The PR description focuses on the btree/global index scan refactor but doesn't mention removing vector search support or a replacement API. Please either keep backward compatibility or provide an alternative, and update the public API docs and PR description accordingly.
    std::shared_ptr<Executor> executor_;
    std::shared_ptr<FileSystem> specific_file_system_;
    std::map<std::string, std::string> options_;
};

/// Filter configuration for table scan operations
class PAIMON_EXPORT ScanFilter {
 public:
    ScanFilter(const std::shared_ptr<Predicate>& predicate,
               const std::vector<std::map<std::string, std::string>>& partition_filters,
               const std::optional<int32_t>& bucket_filter)
        : predicates_(predicate),
          bucket_filter_(bucket_filter),
          partition_filters_(partition_filters) {}

    std::shared_ptr<Predicate> GetPredicate() const {
        return predicates_;
    }
    std::optional<int32_t> GetBucketFilter() const {
        return bucket_filter_;
    }
    const std::vector<std::map<std::string, std::string>>& GetPartitionFilters() const {
        return partition_filters_;
    }

 private:
    std::shared_ptr<Predicate> predicates_;
    std::optional<int32_t> bucket_filter_;
    std::vector<std::map<std::string, std::string>> partition_filters_;
};

/// `ScanContextBuilder` used to build a `ScanContext`, has input validation.
class PAIMON_EXPORT ScanContextBuilder {
 public:
    /// Constructs a `ScanContextBuilder` with required parameters.
    /// @param path The root path of the table.
    explicit ScanContextBuilder(const std::string& path);
    ~ScanContextBuilder();
    /// If limit is not set, it defaults to unlimited.
    ScanContextBuilder& SetLimit(int32_t limit);
    /// Set a bucket filter to scan only specific bucket.
    ScanContextBuilder& SetBucketFilter(int32_t bucket_filter);
    /// partition_filters in vector is supposed to be OR, filter in map is supposed to be AND, e.g.,
    /// partition_filters is {{k1=1,k2=10}, {k1=2,k2=20}} => OR(AND(k1=1,k2=10), AND(k1=2,k2=20))
    ScanContextBuilder& SetPartitionFilter(
        const std::vector<std::map<std::string, std::string>>& partition_filters);
    /// Set a predicate for filtering data.
    ScanContextBuilder& SetPredicate(const std::shared_ptr<Predicate>& predicate);

    /// Sets the result of a global index search (e.g., row ids (may with scores) from a distributed
    /// index lookup). This is used to push down index-filtered row ids into the scan for efficient
    /// data retrieval.
    ScanContextBuilder& SetGlobalIndexResult(
        const std::shared_ptr<GlobalIndexResult>& global_index_result);

    /// The options added or set in `ScanContextBuilder` have high priority and will be merged with
    /// the options in table schema.
    ScanContextBuilder& AddOption(const std::string& key, const std::string& value);
    /// Set a configuration options map to set some option entries which are not defined in the
    /// table schema or whose values you want to overwrite.
    /// @note The options map will clear the options added by `AddOption()` before.
