Update embeddings for table and column by id (with verified)#2049
Greptile Summary
This PR eliminates the Neo4j round-trip in the tabular embedding pipeline:
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py | Major refactor: operator now builds embeddings from in-memory (tables_df, columns_df) instead of querying Neo4j directly. Table embedding text format silently changed (added db_name prefix, comma separator for columns) while docstring claims format is preserved; empty column list produces trailing ", columns: " artifact. |
| nemo_retriever/src/nemo_retriever/graph/tabular_schema_extract_operator.py | Now returns (tables_df, columns_df) tuple with post-ingest UUIDs instead of an empty DataFrame; adds early-return guard for None connector; straightforward and correct. |
| nemo_retriever/src/nemo_retriever/tabular_data/ingestion/extract_data.py | store_relational_db_in_neo4j now returns the schemas dict instead of None/[]; return value threaded through correctly. |
| nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py | populate_tabular_data now returns all_schemas dict (was returning []); removes the dead intermediate all_schemas={} assignment. Previously flagged ID backfill issue for re-ingested tables remains unaddressed. |
| nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py | Adds non-abstract put() with NotImplementedError stub and detailed docstring explaining why it's not abstract; design is correct and matches the PutVdbOperator guard pattern. |
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Adds id column to LanceDB schema and implements put() with strict update-only semantics and pre-flight existence check. records parameter missing type annotation; schema migration concern for existing tables previously flagged. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Adds PutVdbOperator extending IngestVdbOperator; construction-time guard correctly detects backends that inherit the VDB.put stub; sidecar path correctly wired. |
| nemo_retriever/src/nemo_retriever/vdb/__init__.py | Exports PutVdbOperator in both the import and `__all__`; clean change. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Adds thorough PutVdbOperator tests: guard rejection, happy-path delegation with key/table_name forwarding, and sidecar metadata merging; _StubPutVDB correctly implements all abstract methods. |
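The construction-time guard described in the `adt_vdb.py` and `operators.py` rows can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class names `BackendWithPut`, `BackendWithoutPut`, and `PutOperatorSketch` are hypothetical, and raising `TypeError` from the guard is an assumption, since the summary does not say which exception the real operator raises.

```python
from __future__ import annotations


class VDB:
    """Hypothetical stand-in for the base class in adt_vdb.py."""

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        # Non-abstract stub: backends may opt in by overriding it.
        raise NotImplementedError(f"{type(self).__name__} does not implement put()")


class BackendWithPut(VDB):
    """Hypothetical backend that overrides put()."""

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        return {"updated": len(records)}


class BackendWithoutPut(VDB):
    """Hypothetical backend that inherits the stub unchanged."""


class PutOperatorSketch:
    """Hypothetical operator that rejects backends lacking a real put()."""

    def __init__(self, vdb: VDB) -> None:
        # If the backend did not override put(), the attribute looked up on
        # its class is still the very same function object as VDB.put.
        if type(vdb).put is VDB.put:
            # Assumption: the exception type is illustrative only.
            raise TypeError(f"{type(vdb).__name__} inherits the VDB.put stub")
        self.vdb = vdb
```

Checking at construction time means a misconfigured pipeline fails before any rows are embedded, rather than at the first `put` call.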
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[TabularSchemaExtractOp.process] --> B[extract_tabular_db_data]
B --> C[store_relational_db_in_neo4j]
C --> D[populate_tabular_data returns schemas dict]
D --> E[concat tables_df and columns_df]
E --> F[TabularFetchEmbeddingsOp.process receives tuple]
F --> G[_build_rows builds text from DataFrames]
G --> H[BatchEmbedActor embeds rows]
H --> I{PutVdbOperator.process}
I --> J[to_client_vdb_records plus sidecar merge]
J --> K[LanceDB.put with stable key]
K --> L{row exists in table?}
L -->|yes| M[merge_insert update in place]
L -->|no| N[raise KeyError]
M --> O[return counts dict]
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:52-56
**Table embedding text format changed, contradicting the docstring**
The class docstring says "The text templates match the previous Neo4j-derived format" but two concrete differences exist versus `query_neo4j_tables_for_embedding` in the deleted `embeddings.py`: (1) the old table text started with `schema_name:` and had **no** `db_name:` prefix, while the new `_create_table_text` opens with `db_name: {database_name}`; (2) the old Cypher used `apoc.text.join(columns, ' ')` (space separator) while `_create_table_text` uses `','.join(column_pieces)` (comma, no space). Any LanceDB rows written by the old pipeline carry vectors that encode the old text; `PutVdbOperator` will overwrite them with vectors from the new text, causing a silent semantic shift. The docstring should be corrected to describe the new format rather than claiming parity with the deleted `embeddings.py` format.
### Issue 2 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:172-173
When `columns` is empty, `','.join(column_pieces)` produces an empty string, resulting in `", columns: "` at the end of the text — a trailing field with no value. The old Cypher used a hard `MATCH (t)-[...]->(c)` which silently excluded tables with no columns. Guard the append conditionally so tables without columns get clean text.
```suggestion
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text
```
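For context, a self-contained sketch of a table-text builder with this guard might look like the following. Only the `db_name:` prefix, the `columns:` field, and the comma join come from the review; the function name `create_table_text`, its parameters, and the `schema_name`/`table_name` field layout are assumptions for illustration.

```python
from __future__ import annotations


def create_table_text(
    database_name: str,
    schema_name: str,
    table_name: str,
    column_pieces: list[str],
) -> str:
    """Build embedding text for one table (hypothetical field layout)."""
    # Assumption: the fields after db_name are illustrative, not the PR's exact template.
    text = f"db_name: {database_name}, schema_name: {schema_name}, table_name: {table_name}"
    # Append the columns field only when there is something to list,
    # so tables without columns do not end in a dangling ", columns: ".
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text
```

With this guard, a table without columns yields text that ends at the table name instead of carrying an empty trailing field.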
### Issue 3 of 3
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:540-545
The `records` parameter of `LanceDB.put` is missing a type annotation while the base-class stub declares it as `records: list`. Public methods on public classes must carry complete type annotations per the `type-hints-public-api` rule, and `VDB.put`'s own signature sets the expected type.
```suggestion
    def put(
        self,
        records: list,
        table_name: str | None = None,
        key: str = "id",
    ) -> dict[str, int]:
```
Reviews (11): Last reviewed commit: "resolve comment"
- TabularFetchEmbeddingsOp: replace TypeError on non-tuple input with a call to fetch_tabular_embedding_dataframe so the pipeline can still produce embedding rows from Neo4j when upstream did not pass (tables_df, columns_df).
- Add nemo_retriever.tabular_data.ingestion.embeddings module that queries Neo4j for Table/Column docs and returns an embedding-ready DataFrame matching the unstructured pipeline format.
- lancedb: drop stale comment in _create_lancedb_results.
…into feature/update-emmdedding-for-table-and-column-by-id
Rename `VDB.upsert` / `UpsertVdbOperator` to `VDB.put` / `PutVdbOperator` and tighten the contract to an in-place replace: missing keys and rows not already present in the target table now raise `KeyError` instead of being inserted, and `put` no longer creates tables on the fly. Updates the LanceDB implementation and the operator tests accordingly.
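The tightened contract can be illustrated with a small in-memory sketch. It models the table as a plain dict rather than calling LanceDB, so it shows the semantics only; the function name `put_in_place` and the `"updated"` count key are assumptions, not the PR's actual return shape.

```python
from __future__ import annotations


def put_in_place(table: dict[str, dict], records: list[dict], key: str = "id") -> dict[str, int]:
    """Replace existing rows by key; never insert (in-memory illustration)."""
    # Pre-flight check 1: every record must carry the key field.
    missing_key = [r for r in records if key not in r]
    if missing_key:
        raise KeyError(f"{len(missing_key)} record(s) lack the key field {key!r}")

    # Pre-flight check 2: every key must already exist in the table.
    absent = [r[key] for r in records if r[key] not in table]
    if absent:
        raise KeyError(f"rows not present in table: {absent}")

    # Only after both checks pass are rows replaced, so a failed call
    # leaves the table untouched.
    for r in records:
        table[r[key]] = dict(r)
    # Assumption: the count key name is illustrative.
    return {"updated": len(records)}
```

Running both checks before any write is what distinguishes this from an upsert: a batch containing one unknown id fails as a whole instead of silently adding a row.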
Add PutVdbOperator to the public API alongside IngestVdbOperator and RetrieveVdbOperator so callers using the standard package import path (`from nemo_retriever.vdb import PutVdbOperator`) can access it.
Collapse the multi-line NotImplementedError raise into a single line to match what the pre-commit hook produces, unblocking CI.