feat: full text search sql extension #501
Open
ivscheianu wants to merge 9 commits into
# Conflicts:
#   integration-tests/test_lance_spark.py
Integrates upstream changes (MemWAL sharding, BlobSourceContext rule, LanceBucketFunction, TPC-H benchmark, CHAR/VARCHAR DDL, VectorUDT writes, filter pushdown V2 migration, and more) while preserving all FTS SQL extension work (LanceFtsPredicateRule, LanceMatch/Phrase/MultiMatch functions, AbstractLanceFtsPredicateRule, FTS read options, FTS metadata and tests).

Conflict resolutions:
- LanceSparkSessionExtensions.scala (all variants): inject both LanceFtsPredicateRule (FTS branch) and LanceBlobSourceContextRule (upstream) as optimizer rules.
- BaseLanceNamespaceSparkCatalog.java: expose all five functions: LanceFragmentIdWithDefault, LanceMatch, LancePhrase, LanceMultiMatch (FTS branch), and LanceBucketFunction (upstream).
- LanceScan.java: import both FullTextQueryUtils (FTS branch) and SparkLanceShardingUtils (upstream).
- LanceScanTest.java: keep FTS metadata tests (HEAD) and add upstream's testOutputPartitioningWithBucketInfo.
Upstream changed LanceScanBuilder's 6th constructor argument from Map<K,V> to ShardingSpec. The FTS metadata tests were passing Collections.emptyMap(), which no longer compiles.
SQL functions for querying Lance FTS indexes
Adds `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions that push full-text search predicates down to the Lance inverted index from Spark SQL. Previously, users could build FTS indexes but had no SQL interface to query them.

Starting point for #234
Usage
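`lance_match` options

| Option | Default | Notes |
| --- | --- | --- |
| `fuzziness` | | `0` = exact match; omitted = engine-determined by token length |
| `operator` | `OR` | `AND` or `OR`, controls whether all terms must match |
| `boost` | `1.0` | |
| `prefix_length` | `0` | |
| `max_expansions` | `50` | |

Unknown option keys are rejected at planning time.

`lance_phrase` parameters

| Parameter | Default | Notes |
| --- | --- | --- |
| `column` | | FTS index must be built with `with_position=true` |
| `query` | | |
| `slop` | `0` | |

`lance_multi_match` options

| Option | Default | Notes |
| --- | --- | --- |
| `operator` | `OR` | `AND` or `OR`, controls whether all terms must match across columns |

Requires at least 2 column arguments. Unknown option keys are rejected at planning time.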
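As a rough illustration of the "unknown option keys are rejected at planning time" behaviour, the following is a minimal Python sketch, not the connector's code. The names `LANCE_MATCH_DEFAULTS` and `resolve_match_options` are hypothetical, options are modelled as an already-parsed dict (the SQL syntax for passing them is not shown here), and `ValueError` stands in for the planning-time exception. Only the option names and defaults come from the table above.

```python
# Hypothetical sketch: these names are illustrative, not the connector's API.
# Option names and defaults are taken from the lance_match options table.
LANCE_MATCH_DEFAULTS = {
    "fuzziness": None,       # omitted = engine-determined by token length
    "operator": "OR",
    "boost": 1.0,
    "prefix_length": 0,
    "max_expansions": 50,
}


def resolve_match_options(user_options):
    """Merge user-supplied options over the defaults, rejecting unknown keys."""
    unknown = sorted(set(user_options) - set(LANCE_MATCH_DEFAULTS))
    if unknown:
        # Fail before any scan starts, mirroring planning-time rejection.
        raise ValueError(f"Unknown lance_match option(s): {', '.join(unknown)}")

    resolved = dict(LANCE_MATCH_DEFAULTS)
    resolved.update(user_options)

    if resolved["operator"] not in ("AND", "OR"):
        raise ValueError(f"operator must be AND or OR, got {resolved['operator']!r}")
    return resolved
```

Failing on unknown keys up front means a misspelled option (for example `fuziness`) surfaces as an error instead of silently falling back to the default.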
What's included
- […] `SHOW FUNCTIONS`)
- […] `DataSourceV2Relation.copy()` handled via thin subclasses)
- `LIMIT` queries scan all fragments (same as ANN)
- `EXPLAIN EXTENDED` metadata
- Query nearest structural equality in `LanceSparkReadOptions` (was using reference equality)
- `docs/src/operations/dql/fts.md`, cross-references from `create-index.md` and `config.md`

Limitations and caveats
**No global relevance ordering.** Lance computes BM25 scores with globally consistent IDF statistics (corpus stats are aggregated across all committed FTS index segments before scoring), so scores from different fragments are comparable. However, Spark collects fragment results in task-completion order, not BM25 rank order. `SELECT * FROM t WHERE lance_match(...) LIMIT N` returns N matching rows, but which N rows you get depends on Spark task scheduling rather than relevance. There is no `lance_score()` function to surface scores; that is a planned future extension.

**One FTS predicate per WHERE clause.** Lance `ScanOptions` carries a single `FullTextQuery` object, so there is no API for composing multiple independent FTS queries in one scan. The optimizer rule detects multiple `lance_match`/`lance_phrase` calls within a single `Filter` node and rejects them at planning time with an `IllegalArgumentException`. Separate FTS predicates in independent subqueries (different `Filter` nodes) are allowed. For multi-column search, use `lance_multi_match`.

**FTS predicates cannot appear inside OR.** `WHERE lance_match(a, 'x') OR other_col > 5` raises `IllegalArgumentException`. The FTS predicate is extracted from the filter and pushed to the scanner as a single query. If it were inside an OR, extracting it would silently convert OR semantics to AND (returning only rows matching both predicates, instead of either). The OR detection traverses the full condition tree at any nesting depth, not just top-level operands. `OR(FTS, FTS)` gets a dedicated error message directing users to `lance_multi_match`.

**FTS predicates must be top-level AND conjuncts.** Wrapping an FTS call in `NOT`, `CASE`, or any expression other than `AND` is rejected at planning time. The predicate is consumed by the optimizer rule before execution and cannot be evaluated as a row-level expression, so negation or conditional wrapping would silently produce wrong results.

**FTS + vector search cannot be combined.** The optimizer rule checks whether a nearest-neighbor query is already present in the relation options and raises `IllegalArgumentException` if so. Lance `ScanOptions` supports either a full-text query or a vector query, but not both simultaneously.

**No planning-time index validation.** The optimizer rule validates argument types and values but does not check whether the target column actually has an FTS index. If it doesn't, the error surfaces from the Lance engine at scan time as a task-level failure rather than a SQL-level `IllegalArgumentException`. Planning-time index metadata checks are a UX improvement for follow-on work.

**`lance_phrase` requires positional index.** The FTS index must be built with `with_position = true`. Without it, phrase queries fail at scan time with a Lance engine error. There is no planning-time guard because the connector does not inspect index metadata.

**No per-column boost weights for `lance_multi_match`.** The underlying `FullTextQuery.multiMatch` Java API supports per-column boosts, but the variadic SQL signature (`lance_multi_match(query, col1, col2, ...)`) provides no syntax for attaching weights to individual columns. The optimizer rule passes `null` for boosts, applying uniform weighting. This is a deliberate scope reduction because adding boost syntax requires a design decision on the signature (a trailing options string creates ambiguity with column names).

**Full BM25 scoring, no WAND early stopping.** Every row matching the FTS predicate is scored and returned. The Rust-side `wand_factor` parameter (which trades recall for throughput via WAND early stopping) is intentionally not exposed because SQL WHERE-filter semantics require returning all matching rows, making `wand_factor < 1.0` incorrect in this context. When a future TOP-K relevance extension is designed, `wand_factor` becomes the natural performance knob.

**Per-fragment FTS index segment opens.** Each Spark task scans one fragment and independently opens all committed FTS index segments to build a global `MemBM25Scorer`. Total segment-open cost per query scales as `N_fragments × M_index_segments`. For a 500-fragment dataset with 20 index segments, that's 10,000 segment opens per query. The primary mitigation is running `OPTIMIZE TABLE` to compact both data fragments (reducing `N_fragments`) and FTS index segments (reducing `M_index_segments`). A future optimization could broadcast pre-computed corpus statistics from the driver to avoid redundant segment opens on executors.

**Query text must be a string literal.** The query argument is pushed to the FTS index at planning time and injected into the scan options before execution begins. Runtime-valued expressions (column references, computed expressions) cannot be pushed down and are rejected with `IllegalArgumentException`.

**Column names are resolved as string constants.** The column argument must be a string literal or a column reference that resolves at planning time. The FTS index is keyed by exact column name, so a mismatch between the SQL column name and the Lance schema column name produces a scan-time error.
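The predicate-placement rules above (one FTS predicate per `Filter`, no FTS inside `OR`, top-level `AND` conjuncts only) can be sketched with a toy expression tree. This is a minimal Python model, not the Scala optimizer rule from this PR: the classes `Fts`, `Other`, `And`, `Or`, `Not` and the function `extract_fts` are hypothetical, and `ValueError` stands in for `IllegalArgumentException`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fts:
    """Stands in for a lance_match / lance_phrase call (hypothetical)."""
    column: str
    query: str


@dataclass(frozen=True)
class Other:
    """Any ordinary row-level predicate, kept as opaque SQL text."""
    sql: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    child: object


def contains_fts(expr):
    """True if an FTS call appears anywhere in the expression tree."""
    if isinstance(expr, Fts):
        return True
    if isinstance(expr, (And, Or)):
        return contains_fts(expr.left) or contains_fts(expr.right)
    if isinstance(expr, Not):
        return contains_fts(expr.child)
    return False


def conjuncts(expr):
    """Flatten nested ANDs into a list of top-level conjuncts."""
    if isinstance(expr, And):
        return conjuncts(expr.left) + conjuncts(expr.right)
    return [expr]


def extract_fts(condition):
    """Return (fts_predicate_or_None, remaining_conjuncts), or raise."""
    fts, rest = [], []
    for conjunct in conjuncts(condition):
        if isinstance(conjunct, Fts):
            fts.append(conjunct)
        elif contains_fts(conjunct):
            # FTS buried under OR / NOT at any depth: pulling it out would
            # change the filter's meaning, so reject instead.
            raise ValueError("FTS predicate must be a top-level AND conjunct")
        else:
            rest.append(conjunct)
    if len(fts) > 1:
        raise ValueError("only one FTS predicate is allowed per filter")
    return (fts[0] if fts else None), rest
```

In this model, `And(Fts("a", "x"), Other("other_col > 5"))` splits cleanly into the pushed-down query and one residual filter, while `Or(Fts("a", "x"), Other("other_col > 5"))` raises, because extracting the FTS call would leave the two predicates applied as an AND.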