feat: full text search sql extension #501

Open
ivscheianu wants to merge 9 commits into lance-format:main from ivscheianu:fts-sql-extension

ivscheianu commented May 4, 2026

SQL functions for querying Lance FTS indexes

Adds lance_match, lance_phrase, and lance_multi_match SQL functions that push full-text search predicates down to the Lance inverted index from Spark SQL. Previously, users could build FTS indexes but had no SQL interface to query them.

Starting point for #234

Usage

-- Keyword search
SELECT * FROM t WHERE lance_match(body, 'hello world');

-- With options
SELECT * FROM t WHERE lance_match(body, 'hello', 'fuzziness=1,operator=AND');

-- Phrase search (requires index built with with_position=true)
SELECT * FROM t WHERE lance_phrase(body, 'hello world');
SELECT * FROM t WHERE lance_phrase(body, 'hello world', 1);  -- slop

-- Multi-column search (OR semantics by default)
SELECT * FROM t WHERE lance_multi_match('hello', title, body);
SELECT * FROM t WHERE lance_multi_match('hello', 'operator=AND', title, body);

lance_match options

| Key | Type | Default | Description |
|---|---|---|---|
| fuzziness | int | auto | Edit distance. 0 = exact match; omitted = engine-determined by token length |
| operator | string | OR | AND or OR, controls whether all terms must match |
| boost | float | 1.0 | Scoring multiplier |
| prefix_length | int | 0 | Leading chars that must match exactly for fuzzy |
| max_expansions | int | 50 | Max fuzzy term expansions (≥ 1) |

Unknown option keys are rejected at planning time.
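
As an illustration of the option-string format and the planning-time rejection described above, here is a minimal Python sketch. It is not the connector's code: the function name `parse_match_options` is hypothetical, and details such as skipping empty segments and requiring upper-case `AND`/`OR` are assumptions. The keys, types, defaults, and the `max_expansions ≥ 1` bound come from the table.

```python
from typing import Any, Dict

# Keys and types taken from the lance_match options table.
_CONVERTERS = {
    "fuzziness": int,
    "operator": str,
    "boost": float,
    "prefix_length": int,
    "max_expansions": int,
}

# Defaults from the table; None stands in for "auto" fuzziness.
_DEFAULTS = {
    "fuzziness": None,
    "operator": "OR",
    "boost": 1.0,
    "prefix_length": 0,
    "max_expansions": 50,
}


def parse_match_options(raw: str) -> Dict[str, Any]:
    """Parse 'key=value,key=value' into a dict, rejecting unknown keys."""
    options = dict(_DEFAULTS)
    # Assumption: empty segments (e.g. a trailing comma) are skipped.
    for pair in filter(None, (p.strip() for p in raw.split(","))):
        key, sep, value = (part.strip() for part in pair.partition("="))
        if not sep or not key or not value:
            raise ValueError(f"malformed option: {pair!r}")
        if key not in _CONVERTERS:
            raise ValueError(f"unknown option key: {key!r}")
        try:
            options[key] = _CONVERTERS[key](value)
        except ValueError:
            raise ValueError(f"invalid value for {key}: {value!r}") from None
    # Assumption: operator values are matched case-sensitively.
    if options["operator"] not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if options["max_expansions"] < 1:
        raise ValueError("max_expansions must be >= 1")
    return options
```

With this sketch, `parse_match_options("fuzziness=1,operator=AND")` yields the two supplied values plus the defaults for the remaining keys, while a misspelled key such as `fuziness=1` raises `ValueError`.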

lance_phrase parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| column | string | (required) | Column with an FTS index (built with with_position=true) |
| query | string | (required) | Phrase to search for |
| slop | int | 0 | Max number of intervening unmatched positions permitted (≥ 0) |
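
To make the `slop` definition concrete, the following Python sketch models it on a plain list of tokens. This is a simplified reading of the table, not the Lance engine's algorithm: it assumes phrase terms must appear in order and counts the tokens sitting between consecutive matched terms. The function name `phrase_matches` is hypothetical, and the real engine may treat reordering or tokenization differently.

```python
from typing import List, Optional


def phrase_matches(tokens: List[str], phrase: List[str], slop: int = 0) -> bool:
    """Return True if `phrase` occurs in order with at most `slop` gap tokens."""
    if slop < 0:
        raise ValueError("slop must be >= 0")
    if not phrase:
        raise ValueError("phrase must contain at least one term")

    # Positions at which each phrase term occurs in the document.
    positions = [[i for i, t in enumerate(tokens) if t == term] for term in phrase]

    def search(term_idx: int, prev_pos: Optional[int], gaps: int) -> bool:
        if term_idx == len(phrase):
            return True
        for pos in positions[term_idx]:
            if prev_pos is None:
                extra = 0  # the first term can start anywhere
            elif pos > prev_pos:
                extra = pos - prev_pos - 1  # tokens between the two terms
            else:
                continue  # simplification: terms must stay in order
            if gaps + extra <= slop and search(term_idx + 1, pos, gaps + extra):
                return True
        return False

    return search(0, None, 0)
```

Under this model, the phrase `hello world` does not match the text `hello big world` with the default `slop` of 0, but does match with `slop` set to 1.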

lance_multi_match options

| Key | Type | Default | Description |
|---|---|---|---|
| operator | string | OR | AND or OR, controls whether all terms must match across columns |

Requires at least 2 column arguments. Unknown option keys are rejected at planning time.

What's included

  • Three V2 sentinel functions registered in the Lance catalog (SHOW FUNCTIONS)
  • Catalyst optimizer rule that intercepts FTS calls, validates arguments at planning time, and injects the query into the scan options without requiring ANTLR grammar changes
  • Support for Spark 3.4, 3.5, 4.0, and 4.1 (version-specific DataSourceV2Relation.copy() handled via thin subclasses)
  • Fragment pruning guard so LIMIT queries scan all fragments (same as ANN)
  • FTS query visible in EXPLAIN EXTENDED metadata
  • Fix for structural equality of the nearest-neighbor Query in LanceSparkReadOptions (it was previously compared by reference)
  • JUnit integration tests and Python integration tests
  • Documentation: docs/src/operations/dql/fts.md, cross-references from create-index.md and config.md
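
For readers unfamiliar with the distinction behind the equality fix listed above, here is a generic Python illustration of reference versus structural equality. The class names are hypothetical stand-ins and do not mirror the connector's Java classes.

```python
from dataclasses import dataclass
from typing import Tuple


class QueryByReference:
    """No __eq__ defined, so == falls back to object identity."""

    def __init__(self, column: str, vector: Tuple[float, ...], k: int) -> None:
        self.column = column
        self.vector = vector
        self.k = k


@dataclass(frozen=True)
class QueryByValue:
    """dataclass generates __eq__ and __hash__ from the field values."""

    column: str
    vector: Tuple[float, ...]
    k: int


a = QueryByReference("emb", (0.1, 0.2), 10)
b = QueryByReference("emb", (0.1, 0.2), 10)
same_by_reference = a == b  # False: two distinct objects

c = QueryByValue("emb", (0.1, 0.2), 10)
d = QueryByValue("emb", (0.1, 0.2), 10)
same_by_value = c == d  # True: all fields are equal
```

Two option objects that carry the same query compare as different under reference equality, which is the kind of mismatch a structural `equals` avoids.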

Limitations and caveats

  • No global relevance ordering. Lance computes BM25 scores with globally consistent IDF statistics (corpus stats are aggregated across all committed FTS index segments before scoring), so scores from different fragments are comparable. However, Spark collects fragment results in task-completion order, not BM25 rank order. SELECT * FROM t WHERE lance_match(...) LIMIT N returns N matching rows, but which N rows you get depends on Spark task scheduling rather than relevance. There is no lance_score() function to surface scores; that is a planned future extension.

  • One FTS predicate per WHERE clause. Lance ScanOptions carries a single FullTextQuery object, so there is no API for composing multiple independent FTS queries in one scan. The optimizer rule detects multiple lance_match/lance_phrase calls within a single Filter node and rejects them at planning time with an IllegalArgumentException. Separate FTS predicates in independent subqueries (different Filter nodes) are allowed. For multi-column search, use lance_multi_match.

  • FTS predicates cannot appear inside OR. WHERE lance_match(a, 'x') OR other_col > 5 raises IllegalArgumentException. The FTS predicate is extracted from the filter and pushed to the scanner as a single query. If it were inside an OR, extracting it would silently convert OR semantics to AND (returning only rows matching both predicates, instead of either). The OR detection traverses the full condition tree at any nesting depth, not just top-level operands. OR(FTS, FTS) gets a dedicated error message directing users to lance_multi_match.

  • FTS predicates must be top-level AND conjuncts. Wrapping an FTS call in NOT, CASE, or any expression other than AND is rejected at planning time. The predicate is consumed by the optimizer rule before execution and cannot be evaluated as a row-level expression, so negation or conditional wrapping would silently produce wrong results.

  • FTS + vector search cannot be combined. The optimizer rule checks whether a nearest-neighbor query is already present in the relation options and raises IllegalArgumentException if so. Lance ScanOptions supports either a full-text query or a vector query, but not both simultaneously.

  • No planning-time index validation. The optimizer rule validates argument types and values but does not check whether the target column actually has an FTS index. If it doesn't, the error surfaces from the Lance engine at scan time as a task-level failure rather than a SQL-level IllegalArgumentException. Planning-time index metadata checks are a UX improvement for follow-on work.

  • lance_phrase requires positional index. The FTS index must be built with with_position = true. Without it, phrase queries fail at scan time with a Lance engine error. There is no planning-time guard because the connector does not inspect index metadata.

  • No per-column boost weights for lance_multi_match. The underlying FullTextQuery.multiMatch Java API supports per-column boosts, but the variadic SQL signature (lance_multi_match(query, col1, col2, ...)) provides no syntax for attaching weights to individual columns. The optimizer rule passes null for boosts, applying uniform weighting. This is a deliberate scope reduction because adding boost syntax requires a design decision on the signature (a trailing options string creates ambiguity with column names).

  • Full BM25 scoring, no WAND early stopping. Every row matching the FTS predicate is scored and returned. The Rust-side wand_factor parameter (which trades recall for throughput via WAND early stopping) is intentionally not exposed because SQL WHERE-filter semantics require returning all matching rows, making wand_factor < 1.0 incorrect in this context. When a future TOP-K relevance extension is designed, wand_factor becomes the natural performance knob.

  • Per-fragment FTS index segment opens. Each Spark task scans one fragment and independently opens all committed FTS index segments to build a global MemBM25Scorer. Total segment-open cost per query scales as N_fragments × M_index_segments. For a 500-fragment dataset with 20 index segments, that's 10,000 segment opens per query. The primary mitigation is running OPTIMIZE TABLE to compact both data fragments (reducing N_fragments) and FTS index segments (reducing M_index_segments). A future optimization could broadcast pre-computed corpus statistics from the driver to avoid redundant segment opens on executors.

  • Query text must be a string literal. The query argument is pushed to the FTS index at planning time and injected into the scan options before execution begins. Runtime-valued expressions (column references, computed expressions) cannot be pushed down and are rejected with IllegalArgumentException.

  • Column names are resolved as string constants. The column argument must be a string literal or a column reference that resolves at planning time. The FTS index is keyed by exact column name, so a mismatch between the SQL column name and the Lance schema column name produces a scan-time error.
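
The placement rules above (one FTS predicate per filter, top-level AND conjuncts only, nothing inside OR or NOT) can be sketched over a toy expression tree. This Python model is an assumption-laden illustration rather than the Catalyst rule itself: the node classes, `extract_fts`, and the error messages are all hypothetical. `ValueError` stands in for `IllegalArgumentException`, and every FTS function is treated alike.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Fts:
    function: str  # e.g. "lance_match"
    column: str
    query: str


@dataclass(frozen=True)
class Other:
    sql: str  # any non-FTS predicate, kept opaque


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Not:
    child: "Expr"


Expr = Union[Fts, Other, And, Or, Not]


def children(e: Expr) -> List[Expr]:
    if isinstance(e, (And, Or)):
        return [e.left, e.right]
    if isinstance(e, Not):
        return [e.child]
    return []


def contains_fts(e: Expr) -> bool:
    return isinstance(e, Fts) or any(contains_fts(c) for c in children(e))


def find_or_with_fts(e: Expr) -> Optional[Or]:
    """Search the whole tree, at any depth, for an OR that contains FTS."""
    if isinstance(e, Or) and contains_fts(e):
        return e
    for c in children(e):
        found = find_or_with_fts(c)
        if found is not None:
            return found
    return None


def conjuncts(e: Expr) -> List[Expr]:
    """Flatten only the top-level AND chain."""
    if isinstance(e, And):
        return conjuncts(e.left) + conjuncts(e.right)
    return [e]


def extract_fts(condition: Expr) -> Tuple[Optional[Fts], List[Expr]]:
    """Return (FTS predicate to push down, residual conjuncts) or raise."""
    bad_or = find_or_with_fts(condition)
    if bad_or is not None:
        if contains_fts(bad_or.left) and contains_fts(bad_or.right):
            raise ValueError("FTS OR FTS is not supported; use lance_multi_match")
        raise ValueError("FTS predicates cannot appear inside OR")
    fts_calls: List[Fts] = []
    residual: List[Expr] = []
    for c in conjuncts(condition):
        if isinstance(c, Fts):
            fts_calls.append(c)
        elif contains_fts(c):
            raise ValueError("FTS predicates must be top-level AND conjuncts")
        else:
            residual.append(c)
    if len(fts_calls) > 1:
        raise ValueError("only one FTS predicate is allowed per WHERE clause")
    return (fts_calls[0] if fts_calls else None), residual
```

For `lance_match(body, 'hello') AND year > 2020`, this returns the FTS call for pushdown and leaves `year > 2020` as an ordinary filter. Rejecting the OR and NOT cases up front is what prevents the silent OR-to-AND conversion described above.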

github-actions bot added the enhancement label (New feature or request) on May 4, 2026
ivscheianu force-pushed the fts-sql-extension branch from 1647528 to 88e9569 on May 4, 2026 11:15
# Conflicts:
#	integration-tests/test_lance_spark.py
Integrates upstream changes (MemWAL sharding, BlobSourceContext rule,
LanceBucketFunction, TPC-H benchmark, CHAR/VARCHAR DDL, VectorUDT
writes, filter pushdown V2 migration, and more) while preserving all
FTS SQL extension work (LanceFtsPredicateRule, LanceMatch/Phrase/
MultiMatch functions, AbstractLanceFtsPredicateRule, FTS read options,
FTS metadata and tests).

Conflict resolutions:
- LanceSparkSessionExtensions.scala (all variants): inject both
  LanceFtsPredicateRule (FTS branch) and LanceBlobSourceContextRule
  (upstream) as optimizer rules.
- BaseLanceNamespaceSparkCatalog.java: expose all five functions —
  LanceFragmentIdWithDefault, LanceMatch, LancePhrase, LanceMultiMatch
  (FTS branch), and LanceBucketFunction (upstream).
- LanceScan.java: import both FullTextQueryUtils (FTS branch) and
  SparkLanceShardingUtils (upstream).
- LanceScanTest.java: keep FTS metadata tests (HEAD) and add upstream's
  testOutputPartitioningWithBucketInfo.
Upstream changed LanceScanBuilder's 6th constructor argument from
Map<K,V> to ShardingSpec. The FTS metadata tests were passing
Collections.emptyMap() which no longer compiles.