Implement true LanceDB hybrid retrieval #2040

Open
jioffe502 wants to merge 4 commits into NVIDIA:main from jioffe502:codex/lancedb-true-hybrid-search


@jioffe502
Collaborator

Summary

  • Implements real LanceDB hybrid retrieval by passing aligned raw query_texts alongside precomputed vectors.
  • Keeps query_texts execution-only: it is stripped from persistent VDB constructor kwargs and forwarded only for hybrid=True retrieval calls.
  • Replaces the LanceDB hybrid=True retrieval NotImplementedError with LanceDB 0.30.2 hybrid query construction: table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).
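The per-query call chain quoted in the last bullet can be sketched roughly as follows. The chain itself (`search(query_type="hybrid", ...)`, `.vector()`, `.text()`) is taken from this PR description; the helper name `hybrid_search_all`, the `"vector"` column default, the `.limit()` / `.to_list()` tail, and the `RecordingTable` stand-in are assumptions made so the example runs without a LanceDB install. It is not the code in `lancedb.py`.

```python
from typing import Any, Sequence


def hybrid_search_all(
    table: Any,
    vectors: Sequence[Sequence[float]],
    query_texts: Sequence[str],
    *,
    vector_column_name: str = "vector",  # assumed column name
    top_k: int = 10,
) -> list:
    """Issue one hybrid query per (vector, text) pair; inputs are assumed aligned."""
    results = []
    for vector, text in zip(vectors, query_texts):
        query = (
            table.search(
                query_type="hybrid",
                vector_column_name=vector_column_name,
                fts_columns="text",
            )
            .vector(list(vector))  # precomputed dense embedding
            .text(text)  # raw query string for the full-text side
            .limit(top_k)
        )
        results.append(query.to_list())
    return results


class _RecordingQuery:
    """Hypothetical stand-in for a hybrid query builder; it only records calls."""

    def __init__(self, search_kwargs: dict) -> None:
        self.calls = [("search", search_kwargs)]

    def vector(self, v):
        self.calls.append(("vector", v))
        return self

    def text(self, t):
        self.calls.append(("text", t))
        return self

    def limit(self, k):
        self.calls.append(("limit", k))
        return self

    def to_list(self):
        return []  # no real search is executed


class RecordingTable:
    """Hypothetical stand-in for a LanceDB table."""

    def __init__(self) -> None:
        self.queries = []

    def search(self, **kwargs):
        query = _RecordingQuery(kwargs)
        self.queries.append(query)
        return query
```

Passing a real table object in place of `RecordingTable()` would exercise the same chain against an actual index, provided the table has both a vector index and an FTS index on `text`.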

Behavioral Notes

  • No CLI surface changes.
  • Existing overwrite/append semantics are unchanged.
  • Dense retrieval stays VDB-agnostic; query_texts is not forwarded for dense retrieval.
  • Hybrid LanceDB retrieval now requires query_texts and validates that query/vector counts match.
  • where / _filter, top_k, refine_factor, n_probe / nprobes, result_fields, and search_kwargs behavior are preserved.
  • A conflicting non-hybrid search_kwargs["query_type"] now raises a clear ValueError.
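The validation rules in the notes above (required `query_texts`, matching counts, and a rejected non-hybrid `query_type`) could look something like this minimal sketch. The function name `prepare_hybrid_inputs` and the first two error messages are illustrative assumptions; only the behavior is drawn from the notes.

```python
from typing import Iterable, Optional, Sequence, Union


def prepare_hybrid_inputs(
    vectors: Iterable[Sequence[float]],
    query_texts: Union[str, Sequence[str], None],
    search_kwargs: Optional[dict] = None,
):
    """Validate hybrid inputs and return (vectors, texts, search_kwargs)."""
    kwargs = dict(search_kwargs or {})  # copy so the caller's dict is untouched

    if query_texts is None:
        raise ValueError("LanceDB hybrid retrieval requires query_texts.")

    requested = kwargs.get("query_type")
    if requested is not None and requested != "hybrid":
        raise ValueError(
            f"search_kwargs['query_type']={requested!r} conflicts with hybrid=True."
        )
    kwargs["query_type"] = "hybrid"

    # Materialise both sides so their lengths can be compared.
    vectors_list = list(vectors)
    texts = [query_texts] if isinstance(query_texts, str) else list(query_texts)
    if len(texts) != len(vectors_list):
        raise ValueError(
            "LanceDB hybrid retrieval requires query_texts length to match vectors length; "
            f"got query_texts={len(texts)} vectors={len(vectors_list)}."
        )
    return vectors_list, texts, kwargs
```

Running all checks before any table is opened keeps a malformed call from touching storage at all.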

Validation

  • cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py
    • 32 passed
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py
    • 21 passed, 1 warning
  • git diff --check
    • clean

E2E Findings

JP20 LanceDB Hybrid

  • Page image extraction enabled: yes, default path; no --no-extract-page-as-image.
  • Pages processed: 1,940
  • Graph rows: 3,192
  • Persisted/uploadable rows: 3,185
  • Recall:
    • recall@1: 0.6609
    • recall@3: 0.8522
    • recall@5: 0.9304
    • recall@10: 0.9565
  • LanceDB indexes confirmed: vector IvfHnswSq plus FTS text_idx.

BO767 LanceDB Hybrid

  • Pages processed: 54,730
  • Graph rows: 80,436
  • Persisted LanceDB rows: 76,299
  • BEIR queries: 1,005
  • Total time: 1484.75s / 0:24:44.753
  • Throughput: 36.86 PPS
  • Recall:
    • recall@1: 0.5811
    • recall@3: 0.7950
    • recall@5: 0.8488
    • recall@10: 0.8985
  • NDCG:
    • ndcg@1: 0.5811
    • ndcg@3: 0.7076
    • ndcg@5: 0.7297
    • ndcg@10: 0.7460
  • LanceDB table confirmed with 76,299 rows and indexes:
    • Index(IvfHnswSq, columns=["vector"], name="vector_idx")
    • Index(FTS, columns=["text"], name="text_idx")
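For readers interpreting the figures above, here is a minimal sketch of binary-relevance recall@k and nDCG@k as they are commonly defined. The evaluation harness is not shown in this PR, so whether it computes the metrics exactly this way is an assumption; the function names are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Under these definitions, a query with exactly one relevant document scores the same recall@1 and ndcg@1. That would be consistent with the identical values reported for BO767, though the PR does not state the relevance structure of the dataset.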

Dense vs Hybrid Observability

  • BEIR metric names are unchanged (recall@k, ndcg@k) and do not themselves indicate dense vs hybrid.
  • The run summary stdout includes VDB kwargs: {"hybrid": true, ...}.
  • The runtime summary JSON currently records vdb_op: "lancedb" and metrics but does not include vdb_kwargs or an explicit retrieval mode.
  • Recommended follow-up: persist vdb_kwargs or retrieval_mode: hybrid|dense into run.runtime.summary.json for easier auditability in future runs.
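The recommended follow-up might be sketched as below. The helper `write_runtime_summary` and the exact summary layout are hypothetical; only the `vdb_op`, `vdb_kwargs`, and `retrieval_mode` field names come from the notes above.

```python
import json
from pathlib import Path
from typing import Any, Mapping


def write_runtime_summary(
    path: str,
    vdb_op: str,
    metrics: Mapping[str, float],
    vdb_kwargs: Mapping[str, Any],
) -> dict:
    """Write a run summary that records the retrieval mode explicitly."""
    # query_texts is execution-only, so keep it out of the persisted kwargs.
    persisted_kwargs = {k: v for k, v in vdb_kwargs.items() if k != "query_texts"}
    summary = {
        "vdb_op": vdb_op,
        "retrieval_mode": "hybrid" if vdb_kwargs.get("hybrid") else "dense",
        "vdb_kwargs": persisted_kwargs,
        "metrics": dict(metrics),
    }
    Path(path).write_text(json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8")
    return summary
```

Recording both the derived `retrieval_mode` and the raw kwargs lets an auditor filter runs by mode without re-deriving it from flags.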

@jioffe502 jioffe502 requested review from a team as code owners May 14, 2026 21:21
@jioffe502 jioffe502 requested a review from edknv May 14, 2026 21:21
@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR replaces the NotImplementedError stub for LanceDB hybrid retrieval with a working implementation using the LanceDB 0.30.2 table.search(query_type="hybrid").vector().text() API. query_texts is treated as execution-only: stripped from persistent VDB constructor kwargs and injected only during hybrid retrieval calls.

  • lancedb.py: validates query_texts presence and alignment, rejects conflicting search_kwargs["query_type"], materialises vectors/texts only for the hybrid path, and chains .vector().text() on the LanceDB hybrid query builder.
  • operators.py: pops query_texts from the constructor kwargs so stale strings are never persisted, then re-injects the runtime value into retrieval kwargs when hybrid=True.
  • Tests: five new integration tests in test_lancedb_retrieval_where.py and two new unit tests in test_nv_ingest_vdb_operator.py covering the happy path, missing texts, length mismatch, where filtering, and conflicting query_type.
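The strip-then-re-inject flow described for `operators.py` can be illustrated with a small sketch. `RetrieveOperatorSketch` and `EchoVdb` are hypothetical names, not the classes in the PR, and forwarding `query_texts` whenever `hybrid` is truthy is an assumption about the exact condition.

```python
from typing import Any, Optional, Sequence


class RetrieveOperatorSketch:
    """Holds persistent VDB kwargs; query_texts is supplied per call only."""

    def __init__(self, vdb: Any, **vdb_kwargs: Any) -> None:
        # Drop any query_texts passed at construction so stale strings are never stored.
        vdb_kwargs.pop("query_texts", None)
        self._vdb = vdb
        self._vdb_kwargs = vdb_kwargs

    def process(
        self,
        vectors: Sequence[Sequence[float]],
        query_texts: Optional[Sequence[str]] = None,
    ) -> Any:
        call_kwargs = dict(self._vdb_kwargs)
        if call_kwargs.get("hybrid"):
            # Hybrid mode: forward the runtime texts alongside the vectors.
            call_kwargs["query_texts"] = query_texts
        return self._vdb.retrieval(vectors, **call_kwargs)


class EchoVdb:
    """Hypothetical VDB stand-in that returns the kwargs it received."""

    def retrieval(self, vectors: Any, **kwargs: Any) -> dict:
        return kwargs
```

With `hybrid=False` the same operator forwards no `query_texts` at all, which keeps the dense path independent of the backing store.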

Confidence Score: 5/5

Safe to merge — the hybrid retrieval path is well-guarded with input validation, the dense path is unchanged, and E2E recall figures confirm the implementation works end-to-end.

The logic is straightforward: query_texts flows through exactly one place per call, validation catches null and misaligned inputs before any query is issued, and the LanceDB API usage matches the documented 0.30.2 call pattern. The two minor suggestions have no impact on correctness in production use.

No files require special attention; lancedb.py has a cosmetic ordering improvement opportunity but nothing that affects correctness.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Implements LanceDB hybrid query by wiring query_texts through table.search().vector().text(); replaces the previous NotImplementedError; adds validation for missing/misaligned texts and conflicting query_type. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Strips query_texts from persistent constructor kwargs and re-injects it at call time when hybrid=True, correctly preventing stale texts from being stored. |
| nemo_retriever/src/nemo_retriever/retriever.py | Adds an alignment comment documenting the ordering dependency between embedded rows and query_texts; no logic changes. |
| nemo_retriever/tests/test_lancedb_retrieval_where.py | Adds five hybrid-path tests covering the happy path, missing texts, misaligned lengths, where filtering, and conflicting query_type; extends _tiny_table to optionally build an FTS index. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Adds two operator tests verifying query_texts is forwarded for hybrid mode and withheld for dense mode. |

Sequence Diagram

sequenceDiagram
    participant R as Retriever
    participant Op as RetrieveVdbOperator
    participant L as LanceDB.retrieval()
    participant DB as LanceDB table

    R->>Op: "process(vectors, query_texts=[...], hybrid=True)"
    Note over Op: filter_retrieval_kwargs strips query_texts
    Op->>Op: "re-inject query_texts if hybrid=True"
    Op->>L: "retrieval(vectors, hybrid=True, query_texts=[...])"
    L->>L: validate query_texts not None
    L->>L: "set search_kwargs query_type=hybrid"
    L->>L: materialise vectors_for_search + query_texts_list
    L->>L: validate length alignment
    L->>DB: connect().open_table()
    loop per (vector, query_text)
        L->>DB: "table.search(query_type=hybrid).vector(v).text(t)"
        DB-->>L: top-k hybrid results
    end
    L-->>Op: list[list[dict]]
    Op-->>R: normalised results
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:597-609
The alignment check runs after `lancedb.connect().open_table()`, so when `query_texts` and `vectors` differ in length the table has already been opened by the time the call fails. Moving the validation ahead of the table open avoids the unnecessary I/O and gives a cleaner error path.

```suggestion
        if hybrid:
            vectors_for_search = list(vectors)
            query_texts_list = [query_texts] if isinstance(query_texts, str) else list(query_texts)
            if len(query_texts_list) != len(vectors_for_search):
                raise ValueError(
                    "LanceDB hybrid retrieval requires query_texts length to match vectors length; "
                    f"got query_texts={len(query_texts_list)} vectors={len(vectors_for_search)}."
                )
        else:
            vectors_for_search = vectors
            query_texts_list = []

        table = lancedb.connect(uri=table_path).open_table(table_name)
```

### Issue 2 of 2
nemo_retriever/tests/test_lancedb_retrieval_where.py:108-123
The string-shorthand path for `query_texts` (passing a bare `str` instead of a list) is present in the implementation (`[query_texts] if isinstance(query_texts, str) else list(query_texts)`) but not covered by any test. If the `isinstance` guard were accidentally removed, `list("alpha")` would produce `["a", "l", "p", "h", "a"]`, causing a length-mismatch error rather than the intended single-query behaviour. A test like `op.retrieval([[1.0, 0.0]], ..., hybrid=True, query_texts="alpha")` would pin the contract.
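A self-contained sketch of the contract such a test would pin is shown below. `normalize_query_texts` is a hypothetical helper that isolates the quoted `isinstance` expression; an actual test would call `op.retrieval(...)` against a table fixture instead.

```python
def normalize_query_texts(query_texts):
    """Wrap a bare string so it counts as one query, not one query per character."""
    return [query_texts] if isinstance(query_texts, str) else list(query_texts)


def test_string_shorthand_is_single_query():
    # A bare string is treated as exactly one query.
    assert normalize_query_texts("alpha") == ["alpha"]
    # Other sequences are materialised into a list unchanged.
    assert normalize_query_texts(("alpha", "beta")) == ["alpha", "beta"]
    # Without the isinstance guard, list() splits the string into characters.
    assert list("alpha") == ["a", "l", "p", "h", "a"]
```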

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/lancedb-t..."

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
@jioffe502
Collaborator Author

Greptile follow-up addressed in d96dcf47: added retrieval type hints, restored dense lazy iteration, and clarified the missing query_texts error.
