Update embeddings for table and column by id (with verified) by DinaLaptii · Pull Request #2049 · NVIDIA/NeMo-Retriever

DinaLaptii · 2026-05-18T09:50:36Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

greptile-apps · 2026-05-18T09:55:10Z

Greptile Summary

This PR eliminates the Neo4j round-trip in the tabular embedding pipeline: TabularSchemaExtractOp now returns (tables_df, columns_df) carrying post-ingest UUIDs directly, and TabularFetchEmbeddingsOp builds embedding rows from those DataFrames in-memory. It also introduces VDB.put / LanceDB.put and the new PutVdbOperator for stable-key in-place updates of existing VDB rows.

Embedding pipeline refactor: TabularSchemaExtractOp returns (tables_df, columns_df) instead of an empty DataFrame; populate_tabular_data now returns the schemas dict; TabularFetchEmbeddingsOp builds text from the in-memory pair without querying Neo4j; the deleted embeddings.py is no longer used.
VDB.put / LanceDB.put: Non-abstract stub on the base class with a NotImplementedError fallback; LanceDB implements strict update-only semantics (pre-flight existence check via Lance filter, merge_insert().when_matched_update_all()); PutVdbOperator wraps this with the same sidecar-metadata wiring as IngestVdbOperator.
Tests: Three new test cases cover PutVdbOperator construction-time guard, happy-path delegation, and sidecar merge path.

Confidence Score: 5/5

Safe to merge; the core VDB put mechanism and operator wiring are correct, and the three new tests cover the guard, happy-path, and sidecar paths.

The structural changes (returning schemas from populate_tabular_data, threading UUIDs through the operator chain, non-abstract VDB.put, LanceDB.put with pre-flight existence check) are all mechanically sound. The only substantive finding is that the table embedding text format silently changed from the deleted embeddings.py (db_name prefix added, column separator changed from space to comma) while the docstring claims parity, which is a documentation inaccuracy rather than a runtime defect. No data-loss, auth, or crash paths were identified.

tabular_fetch_embeddings_operator.py — inaccurate docstring about format parity with the old Neo4j-query path, and the unconditional column suffix when columns is empty.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py	Major refactor: operator now builds embeddings from in-memory (tables_df, columns_df) instead of querying Neo4j directly. Table embedding text format silently changed (added db_name prefix, comma separator for columns) while docstring claims format is preserved; empty column list produces trailing ", columns: " artifact.
nemo_retriever/src/nemo_retriever/graph/tabular_schema_extract_operator.py	Now returns (tables_df, columns_df) tuple with post-ingest UUIDs instead of an empty DataFrame; adds early-return guard for None connector; straightforward and correct.
nemo_retriever/src/nemo_retriever/tabular_data/ingestion/extract_data.py	store_relational_db_in_neo4j now returns the schemas dict instead of None/[]; return value threaded through correctly.
nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py	populate_tabular_data now returns all_schemas dict (was returning []); removes the dead intermediate all_schemas={} assignment. Previously flagged ID backfill issue for re-ingested tables remains unaddressed.
nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py	Adds non-abstract put() with NotImplementedError stub and detailed docstring explaining why it's not abstract; design is correct and matches the PutVdbOperator guard pattern.
nemo_retriever/src/nemo_retriever/vdb/lancedb.py	Adds id column to LanceDB schema and implements put() with strict update-only semantics and pre-flight existence check. records parameter missing type annotation; schema migration concern for existing tables previously flagged.
nemo_retriever/src/nemo_retriever/vdb/operators.py	Adds PutVdbOperator extending IngestVdbOperator; construction-time guard correctly detects backends that inherit the VDB.put stub; sidecar path correctly wired.
nemo_retriever/src/nemo_retriever/vdb/__init__.py	Exports PutVdbOperator in both the import list and __all__; clean change.
nemo_retriever/tests/test_nv_ingest_vdb_operator.py	Adds thorough PutVdbOperator tests: guard rejection, happy-path delegation with key/table_name forwarding, and sidecar metadata merging; _StubPutVDB correctly implements all abstract methods.
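
The rows for adt_vdb.py and operators.py describe a non-abstract `put()` stub plus a construction-time guard that rejects backends which merely inherit that stub. A minimal sketch of how such a guard could work is shown below. The class names (`SketchVDB`, `AppendOnlyVDB`, `UpdatableVDB`, `SketchPutOperator`), the abstract `run` method, the `TypeError`, and the `"updated"` count key are assumptions made for illustration. Only the `put` signature is taken from the review; this is not the PR's actual code.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SketchVDB(ABC):
    """Hypothetical stand-in for the VDB base class."""

    @abstractmethod
    def run(self, records: list) -> None:
        """Hypothetical abstract method standing in for the real ingest API."""

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        # Deliberately not abstract: backends without in-place update still instantiate.
        raise NotImplementedError(f"{type(self).__name__} does not implement put()")


class AppendOnlyVDB(SketchVDB):
    """Backend that inherits the stub and therefore cannot update in place."""

    def run(self, records: list) -> None:
        pass


class UpdatableVDB(SketchVDB):
    """Backend that overrides put()."""

    def run(self, records: list) -> None:
        pass

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        return {"updated": len(records)}  # assumed shape of the counts dict


def supports_put(vdb: SketchVDB) -> bool:
    # The override check compares the function found on the concrete class
    # with the stub defined on the base class.
    return type(vdb).put is not SketchVDB.put


class SketchPutOperator:
    def __init__(self, vdb: SketchVDB) -> None:
        # Fail at construction time rather than on the first batch.
        if not supports_put(vdb):
            raise TypeError(f"{type(vdb).__name__} inherits the put() stub")
        self.vdb = vdb
```

Checking at construction time means a misconfigured pipeline fails before any embedding work is done, which appears to be the intent of the guard described in the table.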

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TabularSchemaExtractOp.process] --> B[extract_tabular_db_data]
    B --> C[store_relational_db_in_neo4j]
    C --> D[populate_tabular_data returns schemas dict]
    D --> E[concat tables_df and columns_df]
    E --> F[TabularFetchEmbeddingsOp.process receives tuple]
    F --> G[_build_rows builds text from DataFrames]
    G --> H[BatchEmbedActor embeds rows]
    H --> I{PutVdbOperator.process}
    I --> J[to_client_vdb_records plus sidecar merge]
    J --> K[LanceDB.put with stable key]
    K --> L{row exists in table?}
    L -->|yes| M[merge_insert update in place]
    L -->|no| N[raise KeyError]
    M --> O[return counts dict]
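
The last steps of the flowchart (existence check, in-place update, `KeyError` otherwise) can be sketched with a plain dictionary standing in for the table. This is an in-memory illustration of update-only semantics, not the LanceDB implementation: the function name `put_update_only`, the dict-backed table, and the `"updated"` key in the returned counts are assumptions.

```python
from __future__ import annotations


def put_update_only(table: dict[str, dict], records: list[dict], key: str = "id") -> dict[str, int]:
    """Replace existing rows by key; never insert new ones."""
    # Pre-flight 1: every record must carry the key column.
    if any(key not in record for record in records):
        raise KeyError(f"record is missing key column {key!r}")

    # Pre-flight 2: every key must already exist in the table.
    absent = [record[key] for record in records if record[key] not in table]
    if absent:
        raise KeyError(f"rows not present in table: {absent}")

    # Only after both checks pass are rows replaced, so a failed call writes nothing.
    for record in records:
        table[record[key]] = dict(record)
    return {"updated": len(records)}  # assumed shape of the counts dict


table = {"t1": {"id": "t1", "vector": [0.0, 0.0]}}
counts = put_update_only(table, [{"id": "t1", "vector": [0.5, 0.5]}])
```

Running both checks before any write keeps the operation all-or-nothing for a batch, which matches the "pre-flight existence check" wording in the summary.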

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:52-56
**Table embedding text format changed, contradicting the docstring**

The class docstring says "The text templates match the previous Neo4j-derived format" but two concrete differences exist versus `query_neo4j_tables_for_embedding` in the deleted `embeddings.py`: (1) the old table text started with `schema_name:` and had **no** `db_name:` prefix, while the new `_create_table_text` opens with `db_name: {database_name}`; (2) the old Cypher used `apoc.text.join(columns, ' ')` (space separator) while `_create_table_text` uses `','.join(column_pieces)` (comma, no space). Any LanceDB rows written by the old pipeline carry vectors that encode the old text; `PutVdbOperator` will overwrite them with vectors from the new text, causing a silent semantic shift. The docstring should be corrected to describe the new format rather than claiming parity with the deleted `embeddings.py` format.

### Issue 2 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:172-173
When `columns` is empty, `','.join(column_pieces)` produces an empty string, resulting in `", columns: "` at the end of the text — a trailing field with no value. The old Cypher used a hard `MATCH (t)-[...]->(c)` which silently excluded tables with no columns. Guard the append conditionally so tables without columns get clean text.

```suggestion
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text
```

### Issue 3 of 3
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:540-545
The `records` parameter of `LanceDB.put` is missing a type annotation while the base-class stub declares it as `records: list`. Public methods on public classes must carry complete type annotations per the `type-hints-public-api` rule, and `VDB.put`'s own signature sets the expected type.

```suggestion
    def put(
        self,
        records: list,
        table_name: str | None = None,
        key: str = "id",
    ) -> dict[str, int]:
```
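
To make Issues 1 and 2 concrete, here is a small sketch of a table-text builder with the conditional column suffix applied. The function name `create_table_text_sketch` and the `schema_name` / `table_name` fields in the template are assumptions. The review only confirms that the new text opens with `db_name: {database_name}`, joins columns with `','`, and appends `", columns: "`.

```python
from __future__ import annotations


def create_table_text_sketch(
    database_name: str,
    schema_name: str,
    table_name: str,
    column_pieces: list[str],
) -> str:
    # Assumed field layout; only the db_name prefix is confirmed by the review.
    text = f"db_name: {database_name}, schema_name: {schema_name}, table_name: {table_name}"
    # Append the columns field only when there is something to list,
    # so tables without columns do not end in a dangling ", columns: ".
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text


with_columns = create_table_text_sketch("sales", "public", "orders", ["id", "total"])
without_columns = create_table_text_sketch("sales", "public", "empty_table", [])
```

Because the embedded text differs from the old space-separated format, vectors written by the earlier pipeline and vectors produced from this text are not directly comparable, which is the semantic shift Issue 1 warns about.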

Reviews (11): Last reviewed commit: "resolve comment"

- TabularFetchEmbeddingsOp: replace TypeError on non-tuple input with a call to fetch_tabular_embedding_dataframe so the pipeline can still produce embedding rows from Neo4j when upstream did not pass (tables_df, columns_df).
- Add nemo_retriever.tabular_data.ingestion.embeddings module that queries Neo4j for Table/Column docs and returns an embedding-ready DataFrame matching the unstructured pipeline format.
- lancedb: drop stale comment in _create_lancedb_results.

Rename `VDB.upsert` / `UpsertVdbOperator` to `VDB.put` / `PutVdbOperator` and tighten the contract to an in-place replace: missing keys and rows not already present in the target table now raise `KeyError` instead of being inserted, and `put` no longer creates tables on the fly. Updates the LanceDB implementation and the operator tests accordingly.

Add PutVdbOperator to the public API alongside IngestVdbOperator and RetrieveVdbOperator so callers using the standard package import path (`from nemo_retriever.vdb import PutVdbOperator`) can access it.

Collapse the multi-line NotImplementedError raise into a single line to match what the pre-commit hook produces, unblocking CI.

jioffe502

Per Liav approval

Update embeddings for table and column by id

fb49b1f

DinaLaptii requested review from a team as code owners May 18, 2026 09:50

DinaLaptii requested a review from edknv May 18, 2026 09:50

DinaLaptii marked this pull request as draft May 18, 2026 09:51

greptile-apps Bot reviewed May 18, 2026

Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/vdb/operators.py

yuvalshkolar reviewed May 18, 2026

Comment thread nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py Outdated

resolve comments

004fecd

DinaLaptii marked this pull request as ready for review May 19, 2026 08:08

greptile-apps Bot reviewed May 19, 2026

Comment thread nemo_retriever/tests/test_nv_ingest_vdb_operator.py Outdated

fix test

200e2ad

greptile-apps Bot reviewed May 19, 2026

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py

DinaLaptii added 2 commits May 19, 2026 15:37

fix

2b3ca75

fix

b68d7bd

greptile-apps Bot reviewed May 19, 2026

Comment thread nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py

DinaLaptii added 2 commits May 20, 2026 16:00

Merge branch 'main' into feature/update-emmdedding-for-table-and-colu…

197e1e8

…mn-by-id

Merge branch 'main' of https://github.com/ftatiana-nv/NeMo-Retriever …

d5b5534

…into feature/update-emmdedding-for-table-and-column-by-id

tomer-levin-nv reviewed May 21, 2026

DinaLaptii added 3 commits May 21, 2026 13:35

fix(vdb): export PutVdbOperator from package init

c4dc2dc

style(vdb): apply pre-commit formatting to PutVdbOperator

1ebe6e6

yuvalshkolar reviewed May 25, 2026

Comment thread nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py

resolve comment

87aed0b

liavnave approved these changes May 27, 2026

jioffe502 approved these changes May 29, 2026

liavnave merged commit 5e403b8 into NVIDIA:main May 31, 2026
9 checks passed

Update embeddings for table and column by id (with verified) #2049
liavnave merged 12 commits into NVIDIA:main from ftatiana-nv:feature/update-emmdedding-for-table-and-column-by-id
