feat: full text search sql extension by ivscheianu · Pull Request #501 · lance-format/lance-spark

ivscheianu · 2026-05-04T11:10:36Z

SQL functions for querying Lance FTS indexes

Adds lance_match, lance_phrase, and lance_multi_match SQL functions that push full-text search predicates down to the Lance inverted index from Spark SQL. Previously, users could build FTS indexes but had no SQL interface to query them.

Starting point for #234

Usage

```sql
-- Keyword search
SELECT * FROM t WHERE lance_match(body, 'hello world');

-- With options
SELECT * FROM t WHERE lance_match(body, 'hello', 'fuzziness=1,operator=AND');

-- Phrase search (requires index built with with_position=true)
SELECT * FROM t WHERE lance_phrase(body, 'hello world');
SELECT * FROM t WHERE lance_phrase(body, 'hello world', 1);  -- slop

-- Multi-column search (OR semantics by default)
SELECT * FROM t WHERE lance_multi_match('hello', title, body);
SELECT * FROM t WHERE lance_multi_match('hello', 'operator=AND', title, body);
```

`lance_match` options

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `fuzziness` | int | auto | Edit distance. `0` = exact match; omitted = engine-determined by token length |
| `operator` | string | `OR` | `AND` or `OR`, controls whether all terms must match |
| `boost` | float | `1.0` | Scoring multiplier |
| `prefix_length` | int | `0` | Leading chars that must match exactly for fuzzy |
| `max_expansions` | int | `50` | Max fuzzy term expansions (≥ 1) |

Unknown option keys are rejected at planning time.
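
The option-string handling described here can be sketched in a few lines of Python. This is only an illustration of the `key=value,key=value` format and the reject-unknown-keys behavior, not the connector's actual parser: the function name `parse_match_options`, the error messages, and the strict upper-case check on `operator` are assumptions.

```python
# Hypothetical parser for a lance_match option string such as
# "fuzziness=1,operator=AND". Names and error text are illustrative only.

# Defaults taken from the options table; None stands for "auto" fuzziness.
DEFAULTS = {
    "fuzziness": None,
    "operator": "OR",
    "boost": 1.0,
    "prefix_length": 0,
    "max_expansions": 50,
}


def parse_match_options(text: str) -> dict:
    """Parse a comma-separated key=value string into an options dict."""
    options = dict(DEFAULTS)
    if not text.strip():
        return options
    for pair in text.split(","):
        key, sep, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"malformed option: {pair!r}")
        if key not in DEFAULTS:
            # Mirrors "unknown option keys are rejected at planning time".
            raise ValueError(f"unknown option key: {key!r}")
        if key == "operator":
            if value not in ("AND", "OR"):  # case handling is an assumption
                raise ValueError(f"operator must be AND or OR, got {value!r}")
            options[key] = value
        elif key == "boost":
            options[key] = float(value)
        else:
            number = int(value)
            if key == "max_expansions" and number < 1:
                raise ValueError("max_expansions must be >= 1")
            options[key] = number
    return options


print(parse_match_options("fuzziness=1,operator=AND"))
```

Failing fast on a misspelled key such as `fuzzines=1` is what keeps a typo from silently falling back to the default.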

`lance_phrase` parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `column` | string | (required) | Column with an FTS index (built with `with_position=true`) |
| `query` | string | (required) | Phrase to search for |
| `slop` | int | `0` | Max number of intervening unmatched positions permitted (≥ 0) |
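
To make the `slop` parameter concrete, here is a minimal Python sketch of in-order phrase matching over token positions. It assumes whitespace tokenization and reads `slop` literally as the number of extra positions allowed between the first and last phrase term; the Lance engine's exact matching rules may differ, and `phrase_matches` is a hypothetical name.

```python
from typing import List


def phrase_matches(tokens: List[str], phrase: List[str], slop: int = 0) -> bool:
    """True if the phrase terms occur in order with at most `slop`
    intervening positions between the first and last term."""
    if slop < 0:
        raise ValueError("slop must be >= 0")
    if not phrase:
        return False
    # Largest allowed distance between the first and last matched position.
    span_limit = len(phrase) - 1 + slop

    def search(term_index: int, prev_pos: int, start_pos: int) -> bool:
        if term_index == len(phrase):
            return True
        for pos in range(prev_pos + 1, len(tokens)):
            if pos - start_pos > span_limit:
                break  # any later position would exceed the slop budget
            if tokens[pos] == phrase[term_index] and search(term_index + 1, pos, start_pos):
                return True
        return False

    return any(
        tok == phrase[0] and search(1, start, start)
        for start, tok in enumerate(tokens)
    )


doc = "hello big world".split()
print(phrase_matches(doc, ["hello", "world"]))          # False
print(phrase_matches(doc, ["hello", "world"], slop=1))  # True
```

Matching like this needs each term's position in the document, which is why the index has to be built with `with_position=true`.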

`lance_multi_match` options

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `operator` | string | `OR` | `AND` or `OR`, controls whether all terms must match across columns |

Requires at least 2 column arguments. Unknown option keys are rejected at planning time.

What's included

- Three V2 sentinel functions registered in the Lance catalog (`SHOW FUNCTIONS`)
- Catalyst optimizer rule that intercepts FTS calls, validates arguments at planning time, and injects the query into the scan options without requiring ANTLR grammar changes
- Support for Spark 3.4, 3.5, 4.0, and 4.1 (version-specific `DataSourceV2Relation.copy()` handled via thin subclasses)
- Fragment pruning guard so `LIMIT` queries scan all fragments (same as ANN)
- FTS query visible in `EXPLAIN EXTENDED` metadata
- Fix for `Query nearest` structural equality in `LanceSparkReadOptions` (was using reference equality)
- JUnit integration tests and Python integration tests
- Documentation: `docs/src/operations/dql/fts.md`, cross-references from `create-index.md` and `config.md`
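
The difference between reference and structural equality mentioned in the list above can be illustrated with a small Python analogy. The actual fix lives in the Java class `LanceSparkReadOptions`; the classes `RefQuery` and `ValueQuery` below are hypothetical stand-ins and do not mirror the real `Query` type.

```python
from dataclasses import dataclass


class RefQuery:
    """No __eq__ defined, so == falls back to object identity."""

    def __init__(self, column: str, k: int):
        self.column = column
        self.k = k


@dataclass(frozen=True)
class ValueQuery:
    """dataclass generates __eq__ and __hash__ from the field values."""

    column: str
    k: int


a, b = RefQuery("embedding", 10), RefQuery("embedding", 10)
c, d = ValueQuery("embedding", 10), ValueQuery("embedding", 10)

print(a == b)  # False: two distinct objects, compared by identity
print(c == d)  # True: same field values, compared structurally
```

With identity-based comparison, two option objects describing the same query compare as different, which is the kind of mismatch a structural `equals` avoids.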

Limitations and caveats

- No global relevance ordering. Lance computes BM25 scores with globally consistent IDF statistics (corpus stats are aggregated across all committed FTS index segments before scoring), so scores from different fragments are comparable. However, Spark collects fragment results in task-completion order, not BM25 rank order. `SELECT * FROM t WHERE lance_match(...) LIMIT N` returns N matching rows, but which N rows you get depends on Spark task scheduling rather than relevance. There is no `lance_score()` function to surface scores; that is a planned future extension.
- One FTS predicate per WHERE clause. Lance `ScanOptions` carries a single `FullTextQuery` object, so there is no API for composing multiple independent FTS queries in one scan. The optimizer rule detects multiple `lance_match`/`lance_phrase` calls within a single Filter node and rejects them at planning time with an `IllegalArgumentException`. Separate FTS predicates in independent subqueries (different Filter nodes) are allowed. For multi-column search, use `lance_multi_match`.
- FTS predicates cannot appear inside OR. `WHERE lance_match(a, 'x') OR other_col > 5` raises `IllegalArgumentException`. The FTS predicate is extracted from the filter and pushed to the scanner as a single query. If it were inside an OR, extracting it would silently convert OR semantics to AND (returning only rows matching both predicates, instead of either). The OR detection traverses the full condition tree at any nesting depth, not just top-level operands. `OR(FTS, FTS)` gets a dedicated error message directing users to `lance_multi_match`.
- FTS predicates must be top-level AND conjuncts. Wrapping an FTS call in NOT, CASE, or any expression other than AND is rejected at planning time. The predicate is consumed by the optimizer rule before execution and cannot be evaluated as a row-level expression, so negation or conditional wrapping would silently produce wrong results.
- FTS + vector search cannot be combined. The optimizer rule checks whether a nearest-neighbor query is already present in the relation options and raises `IllegalArgumentException` if so. Lance `ScanOptions` supports either a full-text query or a vector query, but not both simultaneously.
- No planning-time index validation. The optimizer rule validates argument types and values but does not check whether the target column actually has an FTS index. If it doesn't, the error surfaces from the Lance engine at scan time as a task-level failure rather than a SQL-level `IllegalArgumentException`. Planning-time index metadata checks are a UX improvement for follow-on work.
- `lance_phrase` requires a positional index. The FTS index must be built with `with_position=true`. Without it, phrase queries fail at scan time with a Lance engine error. There is no planning-time guard because the connector does not inspect index metadata.
- No per-column boost weights for `lance_multi_match`. The underlying `FullTextQuery.multiMatch` Java API supports per-column boosts, but the variadic SQL signature (`lance_multi_match(query, col1, col2, ...)`) provides no syntax for attaching weights to individual columns. The optimizer rule passes null for boosts, applying uniform weighting. This is a deliberate scope reduction because adding boost syntax requires a design decision on the signature (a trailing options string creates ambiguity with column names).
- Full BM25 scoring, no WAND early stopping. Every row matching the FTS predicate is scored and returned. The Rust-side `wand_factor` parameter (which trades recall for throughput via WAND early stopping) is intentionally not exposed because SQL WHERE-filter semantics require returning all matching rows, making `wand_factor < 1.0` incorrect in this context. When a future TOP-K relevance extension is designed, `wand_factor` becomes the natural performance knob.
- Per-fragment FTS index segment opens. Each Spark task scans one fragment and independently opens all committed FTS index segments to build a global `MemBM25Scorer`. Total segment-open cost per query scales as N_fragments × M_index_segments. For a 500-fragment dataset with 20 index segments, that's 10,000 segment opens per query. The primary mitigation is running `OPTIMIZE TABLE` to compact both data fragments (reducing N_fragments) and FTS index segments (reducing M_index_segments). A future optimization could broadcast pre-computed corpus statistics from the driver to avoid redundant segment opens on executors.
- Query text must be a string literal. The query argument is pushed to the FTS index at planning time and injected into the scan options before execution begins. Runtime-valued expressions (column references, computed expressions) cannot be pushed down and are rejected with `IllegalArgumentException`.
- Column names are resolved as string constants. The column argument must be a string literal or a column reference that resolves at planning time. The FTS index is keyed by exact column name, so a mismatch between the SQL column name and the Lance schema column name produces a scan-time error.
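
The placement rules above (one FTS predicate per filter, top-level AND conjuncts only, no OR or NOT wrapping at any depth) can be sketched as a small tree walk. This is a toy Python model, not the Catalyst rule itself: the node classes `Fts`, `Pred`, `And`, `Or`, `Not` and the function `extract_fts` are hypothetical, and `ValueError` stands in for `IllegalArgumentException`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Fts:  # stands in for a lance_match / lance_phrase call
    column: str
    query: str


@dataclass(frozen=True)
class Pred:  # any ordinary row-level predicate
    text: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Not:
    child: "Expr"


Expr = Union[Fts, Pred, And, Or, Not]


def contains_fts(expr: Expr) -> bool:
    """Search the whole subtree, at any nesting depth, for an FTS call."""
    if isinstance(expr, Fts):
        return True
    if isinstance(expr, (And, Or)):
        return contains_fts(expr.left) or contains_fts(expr.right)
    if isinstance(expr, Not):
        return contains_fts(expr.child)
    return False


def split_conjuncts(expr: Expr) -> List[Expr]:
    """Flatten nested ANDs into a list of top-level conjuncts."""
    if isinstance(expr, And):
        return split_conjuncts(expr.left) + split_conjuncts(expr.right)
    return [expr]


def extract_fts(condition: Expr) -> Tuple[Optional[Fts], List[Expr]]:
    """Return (fts_predicate_or_None, remaining_conjuncts) or raise."""
    found, remaining = [], []
    for conjunct in split_conjuncts(condition):
        if isinstance(conjunct, Fts):
            found.append(conjunct)
        elif contains_fts(conjunct):
            # FTS buried under OR / NOT: extracting it would change semantics.
            raise ValueError("FTS predicate must be a top-level AND conjunct")
        else:
            remaining.append(conjunct)
    if len(found) > 1:
        raise ValueError("only one FTS predicate is allowed per WHERE clause")
    return (found[0] if found else None), remaining


query, rest = extract_fts(And(Fts("body", "hello"), Pred("year > 2020")))
print(query, rest)
```

In this model the extracted `Fts` node would go into the scan options and the remaining conjuncts would stay in the filter, which is why a predicate hidden under `Or` or `Not` has to be rejected rather than lifted out.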

Integrates upstream changes (MemWAL sharding, BlobSourceContext rule, LanceBucketFunction, TPC-H benchmark, CHAR/VARCHAR DDL, VectorUDT writes, filter pushdown V2 migration, and more) while preserving all FTS SQL extension work (LanceFtsPredicateRule, LanceMatch/Phrase/MultiMatch functions, AbstractLanceFtsPredicateRule, FTS read options, FTS metadata and tests).

Conflict resolutions:
- LanceSparkSessionExtensions.scala (all variants): inject both LanceFtsPredicateRule (FTS branch) and LanceBlobSourceContextRule (upstream) as optimizer rules.
- BaseLanceNamespaceSparkCatalog.java: expose all five functions: LanceFragmentIdWithDefault, LanceMatch, LancePhrase, LanceMultiMatch (FTS branch), and LanceBucketFunction (upstream).
- LanceScan.java: import both FullTextQueryUtils (FTS branch) and SparkLanceShardingUtils (upstream).
- LanceScanTest.java: keep FTS metadata tests (HEAD) and add upstream's testOutputPartitioningWithBucketInfo.

github-actions bot added the enhancement (New feature or request) label on May 4, 2026

feat: full text search sql extension

88e9569

ivscheianu force-pushed the fts-sql-extension branch from 1647528 to 88e9569 on May 4, 2026 11:15

chore: applied spotless, changed thrown exception

9ed3184

cccs-jory mentioned this pull request May 5, 2026

Support Lance FTS Queries lance-format/lance-trino#116

Open

ivscheianu added 7 commits May 10, 2026 08:33

Merge upstream/main into fts-sql-extension

984549c

Merge remote-tracking branch 'upstream/main' into fts-sql-extension

a54a763

# Conflicts: # integration-tests/test_lance_spark.py

refactor: improve variable naming and extract shared option-parsing h…

e22587d

…elpers

fix: compile error

01b1432

fix: pass null ShardingSpec in LanceScanBuilder 6-arg calls in tests

9229377

Upstream changed LanceScanBuilder's 6th constructor argument from Map<K,V> to ShardingSpec. The FTS metadata tests were passing Collections.emptyMap() which no longer compiles.

chore: apply spotless formatting to LanceScanTest.java

d83d412

ivscheianu wants to merge 9 commits into lance-format:main from ivscheianu:fts-sql-extension
