Implement true LanceDB hybrid retrieval #2040
Greptile Summary
This PR replaces the
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Implements LanceDB hybrid query by wiring query_texts through table.search().vector().text(); replaces the previous NotImplementedError; adds validation for missing/misaligned texts and conflicting query_type. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Strips query_texts from persistent constructor kwargs and re-injects it at call time when hybrid=True, correctly preventing stale texts from being stored. |
| nemo_retriever/src/nemo_retriever/retriever.py | Adds an alignment comment documenting the ordering dependency between embedded rows and query_texts; no logic changes. |
| nemo_retriever/tests/test_lancedb_retrieval_where.py | Adds five hybrid-path tests covering the happy path, missing texts, misaligned lengths, where filtering, and conflicting query_type; extends _tiny_table to optionally build an FTS index. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Adds two operator tests verifying query_texts is forwarded for hybrid mode and withheld for dense mode. |
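The operator behaviour described in the table (strip `query_texts` from persistent kwargs, re-inject it only for hybrid calls) can be sketched roughly as follows. This is a minimal illustration, not the code in `operators.py`: `EXECUTION_ONLY_KEYS`, `split_kwargs`, and `build_call_kwargs` are hypothetical names introduced here.

```python
from typing import Any

# Hypothetical set of call-time-only keys; the real operator may track more.
EXECUTION_ONLY_KEYS = {"query_texts"}


def split_kwargs(kwargs: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate persistent constructor kwargs from execution-only ones."""
    persistent = {k: v for k, v in kwargs.items() if k not in EXECUTION_ONLY_KEYS}
    execution = {k: v for k, v in kwargs.items() if k in EXECUTION_ONLY_KEYS}
    return persistent, execution


def build_call_kwargs(persistent: dict[str, Any], execution: dict[str, Any]) -> dict[str, Any]:
    """Re-inject query_texts at call time, but only for hybrid retrieval."""
    call_kwargs = dict(persistent)
    if persistent.get("hybrid") and "query_texts" in execution:
        call_kwargs["query_texts"] = execution["query_texts"]
    return call_kwargs


persistent, execution = split_kwargs({"hybrid": True, "top_k": 5, "query_texts": ["alpha"]})
hybrid_call = build_call_kwargs(persistent, execution)
dense_call = build_call_kwargs({"hybrid": False, "top_k": 5}, execution)
```

Keeping the texts out of the persistent dict means a later dense call, or a hybrid call with different queries, cannot accidentally reuse texts from an earlier one.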
Sequence Diagram
sequenceDiagram
participant R as Retriever
participant Op as RetrieveVdbOperator
participant L as LanceDB.retrieval()
participant DB as LanceDB table
R->>Op: "process(vectors, query_texts=[...], hybrid=True)"
Note over Op: filter_retrieval_kwargs strips query_texts
Op->>Op: "re-inject query_texts if hybrid=True"
Op->>L: "retrieval(vectors, hybrid=True, query_texts=[...])"
L->>L: validate query_texts not None
L->>L: "set search_kwargs query_type=hybrid"
L->>L: materialise vectors_for_search + query_texts_list
L->>L: validate length alignment
L->>DB: connect().open_table()
loop per (vector, query_text)
L->>DB: "table.search(query_type=hybrid).vector(v).text(t)"
DB-->>L: top-k hybrid results
end
L-->>Op: list[list[dict]]
Op-->>R: normalised results
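The per-query loop in the diagram can be sketched with stand-in objects. `StubTable` and `StubQuery` below are hypothetical placeholders that only mimic the chained `search(...).vector(...).text(...)` shape named in this PR; they are not the LanceDB classes, and `hybrid_retrieval` is an illustrative name rather than the method in `lancedb.py`.

```python
class StubQuery:
    """Hypothetical query builder that records what was chained onto it."""

    def __init__(self, query_type):
        self.query_type = query_type
        self.vec = None
        self.txt = None
        self.k = None

    def vector(self, v):
        self.vec = v
        return self

    def text(self, t):
        self.txt = t
        return self

    def limit(self, k):
        self.k = k
        return self

    def to_list(self):
        # A real table would return up to k ranked rows; the stub echoes its inputs.
        return [{"query_type": self.query_type, "vector": self.vec, "text": self.txt, "k": self.k}]


class StubTable:
    """Hypothetical table exposing only search()."""

    def search(self, query_type="vector"):
        return StubQuery(query_type)


def hybrid_retrieval(table, vectors, query_texts, top_k=10):
    """Issue one hybrid query per (vector, query_text) pair, in order."""
    results = []
    for vec, txt in zip(vectors, query_texts):
        query = table.search(query_type="hybrid").vector(vec).text(txt).limit(top_k)
        results.append(query.to_list())
    return results


hits = hybrid_retrieval(StubTable(), [[1.0, 0.0], [0.0, 1.0]], ["alpha", "beta"], top_k=3)
```

Because each vector is paired with the text at the same index, the caller has to supply both lists in the same order, which is why the length check in the diagram precedes the loop.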
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:597-609
The alignment check fires after `lancedb.connect().open_table()`. If `query_texts` and `vectors` lengths differ, the table connection is already open and the call fails. Moving the validation before the table open avoids the unnecessary I/O and gives a cleaner error path.
```suggestion
if hybrid:
    vectors_for_search = list(vectors)
    query_texts_list = [query_texts] if isinstance(query_texts, str) else list(query_texts)
    if len(query_texts_list) != len(vectors_for_search):
        raise ValueError(
            "LanceDB hybrid retrieval requires query_texts length to match vectors length; "
            f"got query_texts={len(query_texts_list)} vectors={len(vectors_for_search)}."
        )
else:
    vectors_for_search = vectors
    query_texts_list = []
table = lancedb.connect(uri=table_path).open_table(table_name)
```
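To see why the ordering matters, here is a small self-contained sketch in which the table open is replaced by a hypothetical `fake_open_table` callable that records whether it ran. The `retrieve` function is an illustrative stand-in, not the actual method in `lancedb.py`.

```python
def retrieve(vectors, query_texts, open_table, hybrid=True):
    """Validate inputs first, then open the table (illustrative only)."""
    if hybrid:
        if query_texts is None:
            raise ValueError("hybrid retrieval requires query_texts")
        vectors_for_search = list(vectors)
        texts = [query_texts] if isinstance(query_texts, str) else list(query_texts)
        if len(texts) != len(vectors_for_search):
            raise ValueError(
                f"got query_texts={len(texts)} vectors={len(vectors_for_search)}"
            )
    else:
        vectors_for_search = vectors
        texts = []
    table = open_table()  # I/O happens only after validation succeeds
    return table, vectors_for_search, texts


opened = []


def fake_open_table():
    # Hypothetical stand-in for lancedb.connect(...).open_table(...).
    opened.append(True)
    return "table"


error = None
try:
    retrieve([[1.0, 0.0]], ["alpha", "beta"], fake_open_table)
except ValueError as exc:
    error = str(exc)
```

With the check placed first, the mismatched call fails without `fake_open_table` ever being invoked.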
### Issue 2 of 2
nemo_retriever/tests/test_lancedb_retrieval_where.py:108-123
The string-shorthand path for `query_texts` (passing a bare `str` instead of a list) is present in the implementation (`[query_texts] if isinstance(query_texts, str) else list(query_texts)`) but not covered by any test. If the `isinstance` guard were accidentally removed, `list("alpha")` would produce `["a", "l", "p", "h", "a"]`, causing a length-mismatch error rather than the intended single-query behaviour. A test like `op.retrieval([[1.0, 0.0]], ..., hybrid=True, query_texts="alpha")` would pin the contract.
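A helper-level version of such a test might look like the sketch below. `normalize_query_texts` is a hypothetical function that isolates the expression quoted above; a real test would go through `op.retrieval(...)` as the issue suggests.

```python
def normalize_query_texts(query_texts):
    """Wrap a bare string in a list; copy any other iterable into a list."""
    return [query_texts] if isinstance(query_texts, str) else list(query_texts)


def test_string_shorthand_is_single_query():
    # A bare string must be treated as one query, not split into characters.
    assert normalize_query_texts("alpha") == ["alpha"]


def test_list_input_is_preserved():
    assert normalize_query_texts(["alpha", "beta"]) == ["alpha", "beta"]


def test_unguarded_list_splits_characters():
    # This is the failure mode the isinstance guard prevents.
    assert list("alpha") == ["a", "l", "p", "h", "a"]


test_string_shorthand_is_single_query()
test_list_input_is_preserved()
test_unguarded_list_splits_characters()
```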
Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/lancedb-t..."
Greptile follow-up addressed in |
Summary
- Passes `query_texts` alongside precomputed vectors.
- Keeps `query_texts` execution-only: it is stripped from persistent VDB constructor kwargs and forwarded only for `hybrid=True` retrieval calls.
- Replaces the `hybrid=True` retrieval `NotImplementedError` with LanceDB 0.30.2 hybrid query construction: `table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text)`.

Behavioral Notes
- `query_texts` is not forwarded for dense retrieval.
- Hybrid retrieval requires `query_texts` and validates that query/vector counts match.
- Existing `where`/`_filter`, `top_k`, `refine_factor`, `n_probe`/`nprobes`, `result_fields`, and `search_kwargs` behavior are preserved.
- A conflicting `search_kwargs["query_type"]` now raises a clear `ValueError`.

Validation
- `cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever`
- `/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py`: 32 passed
- `/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py`: 21 passed, 1 warning
- `git diff --check`

E2E Findings
JP20 LanceDB Hybrid
- `--no-extract-page-as-image`.
- `1,940`
- `3,192`
- `3,185`
- `recall@1: 0.6609`
- `recall@3: 0.8522`
- `recall@5: 0.9304`
- `recall@10: 0.9565`
- `IvfHnswSq` plus FTS `text_idx`.

BO767 LanceDB Hybrid
- `54,730`
- `80,436`
- `76,299`
- `1,005`
- 1484.75s/0:24:44.75336.86 PPS
- `recall@1: 0.5811`
- `recall@3: 0.7950`
- `recall@5: 0.8488`
- `recall@10: 0.8985`
- `ndcg@1: 0.5811`
- `ndcg@3: 0.7076`
- `ndcg@5: 0.7297`
- `ndcg@10: 0.7460`
- `76,299` rows and indexes:
  - `Index(IvfHnswSq, columns=["vector"], name="vector_idx")`
  - `Index(FTS, columns=["text"], name="text_idx")`

Dense vs Hybrid Observability
- The metrics (`recall@k`, `ndcg@k`) do not themselves indicate dense vs hybrid.
- `VDB kwargs: {"hybrid": true, ...}`.
- The runtime summary includes `vdb_op: "lancedb"` and metrics but does not include `vdb_kwargs` or an explicit retrieval mode.
- Consider writing `vdb_kwargs` or `retrieval_mode: hybrid|dense` into `run.runtime.summary.json` for easier auditability in future runs.
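
The auditability suggestion above could be implemented along these lines. This is only a sketch under assumptions: the summary layout shown here is hypothetical (only `vdb_op`, `vdb_kwargs`, and `retrieval_mode` are named in this PR), and `annotate_summary` is an illustrative helper, not an existing function.

```python
import json


def annotate_summary(summary: dict, vdb_kwargs: dict) -> dict:
    """Return a copy of the run summary with the retrieval mode made explicit."""
    annotated = dict(summary)
    annotated["vdb_kwargs"] = dict(vdb_kwargs)
    annotated["retrieval_mode"] = "hybrid" if vdb_kwargs.get("hybrid") else "dense"
    return annotated


# Hypothetical summary shape; the real run.runtime.summary.json may differ.
summary = {"vdb_op": "lancedb", "metrics": {"recall@1": 0.5811}}
annotated = annotate_summary(summary, {"hybrid": True})
print(json.dumps(annotated, indent=2))
```

Deriving `retrieval_mode` from the same kwargs that drive the query keeps the recorded mode from drifting away from what was actually executed.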