Update embeddings for table and column by id (with verified) #2049

Merged
liavnave merged 12 commits into NVIDIA:main from ftatiana-nv:feature/update-emmdedding-for-table-and-column-by-id on May 31, 2026

Conversation

@DinaLaptii
Contributor

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@DinaLaptii DinaLaptii requested review from a team as code owners May 18, 2026 09:50
@DinaLaptii DinaLaptii requested a review from edknv May 18, 2026 09:50
@DinaLaptii DinaLaptii marked this pull request as draft May 18, 2026 09:51
@greptile-apps
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR eliminates the Neo4j round-trip in the tabular embedding pipeline: TabularSchemaExtractOp now returns (tables_df, columns_df) carrying post-ingest UUIDs directly, and TabularFetchEmbeddingsOp builds embedding rows from those DataFrames in-memory. It also introduces VDB.put / LanceDB.put and the new PutVdbOperator for stable-key in-place updates of existing VDB rows.

  • Embedding pipeline refactor: TabularSchemaExtractOp returns (tables_df, columns_df) instead of an empty DataFrame; populate_tabular_data now returns the schemas dict; TabularFetchEmbeddingsOp builds text from the in-memory pair without querying Neo4j; the deleted embeddings.py is no longer used.
  • VDB.put / LanceDB.put: Non-abstract stub on the base class with a NotImplementedError fallback; LanceDB implements strict update-only semantics (pre-flight existence check via Lance filter, merge_insert().when_matched_update_all()); PutVdbOperator wraps this with the same sidecar-metadata wiring as IngestVdbOperator.
  • Tests: Three new test cases cover PutVdbOperator construction-time guard, happy-path delegation, and sidecar merge path.
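
The stub-plus-guard pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: `AppendOnlyVDB`, `UpdatableVDB`, and `PutOperator` are hypothetical stand-ins, the `{"updated": n}` return shape is an assumption, and the exception type raised by the real guard is not stated in this summary.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class VDB(ABC):
    """Hypothetical stand-in for the base class; only the relevant parts."""

    @abstractmethod
    def run(self, records: list) -> None:
        """Existing abstract entry point (name assumed for illustration)."""

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        # Deliberately non-abstract: backends written before put() existed
        # can still be instantiated, and fail only if put() is actually called.
        raise NotImplementedError(f"{type(self).__name__} does not implement put()")


class AppendOnlyVDB(VDB):
    """Backend that inherits the stub."""

    def run(self, records: list) -> None:
        pass


class UpdatableVDB(VDB):
    """Backend that overrides put(); the return shape is an assumption."""

    def run(self, records: list) -> None:
        pass

    def put(self, records: list, table_name: str | None = None, key: str = "id") -> dict[str, int]:
        return {"updated": len(records)}


class PutOperator:
    """Hypothetical operator with a construction-time guard."""

    def __init__(self, vdb: VDB) -> None:
        # If the backend's class still resolves put() to the base-class
        # function, it never overrode the stub, so reject it up front.
        if type(vdb).put is VDB.put:
            raise NotImplementedError(f"{type(vdb).__name__} inherits the VDB.put stub")
        self.vdb = vdb

    def process(self, records: list) -> dict[str, int]:
        return self.vdb.put(records)
```

Failing at construction rather than at the first `process` call means a misconfigured pipeline is rejected before any embedding work is done.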

Confidence Score: 5/5

Safe to merge; the core VDB put mechanism and operator wiring are correct, and the three new tests cover the guard, happy-path, and sidecar paths.

The structural changes (returning schemas from populate_tabular_data, threading UUIDs through the operator chain, non-abstract VDB.put, LanceDB.put with pre-flight existence check) are all mechanically sound. The only substantive finding is that the table embedding text format silently changed relative to the deleted embeddings.py (a db_name prefix was added, and the column separator changed from a space to a comma) while the docstring claims parity; this is a documentation inaccuracy rather than a runtime defect. No data-loss, auth, or crash paths were identified.

tabular_fetch_embeddings_operator.py — inaccurate docstring about format parity with the old Neo4j-query path, and the unconditional column suffix when columns is empty.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py | Major refactor: operator now builds embeddings from in-memory (tables_df, columns_df) instead of querying Neo4j directly. Table embedding text format silently changed (added db_name prefix, comma separator for columns) while docstring claims format is preserved; empty column list produces trailing ", columns: " artifact. |
| nemo_retriever/src/nemo_retriever/graph/tabular_schema_extract_operator.py | Now returns (tables_df, columns_df) tuple with post-ingest UUIDs instead of an empty DataFrame; adds early-return guard for None connector; straightforward and correct. |
| nemo_retriever/src/nemo_retriever/tabular_data/ingestion/extract_data.py | store_relational_db_in_neo4j now returns the schemas dict instead of None/[]; return value threaded through correctly. |
| nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py | populate_tabular_data now returns all_schemas dict (was returning []); removes the dead intermediate all_schemas={} assignment. Previously flagged ID backfill issue for re-ingested tables remains unaddressed. |
| nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py | Adds non-abstract put() with NotImplementedError stub and detailed docstring explaining why it's not abstract; design is correct and matches the PutVdbOperator guard pattern. |
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Adds id column to LanceDB schema and implements put() with strict update-only semantics and pre-flight existence check. records parameter missing type annotation; schema migration concern for existing tables previously flagged. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Adds PutVdbOperator extending IngestVdbOperator; construction-time guard correctly detects backends that inherit the VDB.put stub; sidecar path correctly wired. |
| nemo_retriever/src/nemo_retriever/vdb/`__init__.py` | Exports PutVdbOperator in both import and `__all__`; clean change. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Adds thorough PutVdbOperator tests: guard rejection, happy-path delegation with key/table_name forwarding, and sidecar metadata merging; _StubPutVDB correctly implements all abstract methods. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TabularSchemaExtractOp.process] --> B[extract_tabular_db_data]
    B --> C[store_relational_db_in_neo4j]
    C --> D[populate_tabular_data returns schemas dict]
    D --> E[concat tables_df and columns_df]
    E --> F[TabularFetchEmbeddingsOp.process receives tuple]
    F --> G[_build_rows builds text from DataFrames]
    G --> H[BatchEmbedActor embeds rows]
    H --> I{PutVdbOperator.process}
    I --> J[to_client_vdb_records plus sidecar merge]
    J --> K[LanceDB.put with stable key]
    K --> L{row exists in table?}
    L -->|yes| M[merge_insert update in place]
    L -->|no| N[raise KeyError]
    M --> O[return counts dict]
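
The final branch of the flowchart (update in place if the row exists, otherwise raise `KeyError`) can be illustrated without LanceDB. The sketch below uses a plain dict as the "table"; `put_update_only` and the `{"updated": n}` counts shape are hypothetical and only mirror the update-only contract described in this review, not the actual `LanceDB.put` implementation.

```python
from __future__ import annotations


def put_update_only(table: dict[str, dict], records: list[dict], key: str = "id") -> dict[str, int]:
    """Replace existing rows in place; never insert new ones.

    `table` maps key values to rows and stands in for a real VDB table.
    """
    # Every record must carry the stable key.
    missing_key = [r for r in records if key not in r]
    if missing_key:
        raise KeyError(f"{len(missing_key)} record(s) lack the key column {key!r}")

    # Pre-flight existence check: runs before any write, so a failure
    # leaves the table untouched instead of partially updated.
    absent = [r[key] for r in records if r[key] not in table]
    if absent:
        raise KeyError(f"rows not present in table: {absent}")

    # All keys exist: overwrite each matched row with the new record.
    for r in records:
        table[r[key]] = dict(r)
    return {"updated": len(records)}


table = {"t1": {"id": "t1", "text": "old", "vector": [0.0, 0.0]}}
result = put_update_only(table, [{"id": "t1", "text": "new", "vector": [0.1, 0.2]}])
```

Checking existence for the whole batch before writing anything is what makes the operation all-or-nothing in this sketch.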
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:52-56
**Table embedding text format changed, contradicting the docstring**

The class docstring says "The text templates match the previous Neo4j-derived format" but two concrete differences exist versus `query_neo4j_tables_for_embedding` in the deleted `embeddings.py`: (1) the old table text started with `schema_name:` and had **no** `db_name:` prefix, while the new `_create_table_text` opens with `db_name: {database_name}`; (2) the old Cypher used `apoc.text.join(columns, ' ')` (space separator) while `_create_table_text` uses `','.join(column_pieces)` (comma, no space). Any LanceDB rows written by the old pipeline carry vectors that encode the old text; `PutVdbOperator` will overwrite them with vectors from the new text, causing a silent semantic shift. The docstring should be corrected to describe the new format rather than claiming parity with the deleted `embeddings.py` format.

### Issue 2 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:172-173
When `columns` is empty, `','.join(column_pieces)` produces an empty string, resulting in `", columns: "` at the end of the text — a trailing field with no value. The old Cypher used a hard `MATCH (t)-[...]->(c)` which silently excluded tables with no columns. Guard the append conditionally so tables without columns get clean text.

```suggestion
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text
```
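
To see the effect of the guard in isolation, here is a self-contained sketch. `create_table_text` is a hypothetical stand-in for `_create_table_text`; only the `db_name:` opening, the `", columns: "` suffix, and the comma join come from the review above, and the remaining field layout is assumed for illustration.

```python
def create_table_text(database_name: str, schema_name: str, table_name: str, column_pieces: list[str]) -> str:
    # Field layout after db_name is an assumption made for this example.
    text = f"db_name: {database_name}, schema_name: {schema_name}, table_name: {table_name}"
    # Append the columns field only when there is something to list,
    # so a table without columns does not end in a dangling ", columns: ".
    if column_pieces:
        text += f", columns: {','.join(column_pieces)}"
    return text


with_columns = create_table_text("sales", "public", "orders", ["id", "total"])
without_columns = create_table_text("sales", "public", "orders", [])
```

With the guard, `without_columns` ends at the table name rather than with an empty trailing field.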

### Issue 3 of 3
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:540-545
The `records` parameter of `LanceDB.put` is missing a type annotation while the base-class stub declares it as `records: list`. Public methods on public classes must carry complete type annotations per the `type-hints-public-api` rule, and `VDB.put`'s own signature sets the expected type.

```suggestion
    def put(
        self,
        records: list,
        table_name: str | None = None,
        key: str = "id",
    ) -> dict[str, int]:
```

Reviews (11): Last reviewed commit: "resolve comment"

Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/operators.py
- TabularFetchEmbeddingsOp: replace TypeError on non-tuple input with a
  call to fetch_tabular_embedding_dataframe so the pipeline can still
  produce embedding rows from Neo4j when upstream did not pass
  (tables_df, columns_df).
- Add nemo_retriever.tabular_data.ingestion.embeddings module that
  queries Neo4j for Table/Column docs and returns an embedding-ready
  DataFrame matching the unstructured pipeline format.
- lancedb: drop stale comment in _create_lancedb_results.
Comment thread nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py Outdated
@DinaLaptii DinaLaptii marked this pull request as ready for review May 19, 2026 08:08
Comment thread nemo_retriever/tests/test_nv_ingest_vdb_operator.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py
Comment thread nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py
Comment thread nemo_retriever/src/nemo_retriever/vdb/operators.py Outdated
Rename `VDB.upsert` / `UpsertVdbOperator` to `VDB.put` / `PutVdbOperator`
and tighten the contract to an in-place replace: missing keys and rows
not already present in the target table now raise `KeyError` instead of
being inserted, and `put` no longer creates tables on the fly. Updates
the LanceDB implementation and the operator tests accordingly.

Add PutVdbOperator to the public API alongside IngestVdbOperator and
RetrieveVdbOperator so callers using the standard package import path
(`from nemo_retriever.vdb import PutVdbOperator`) can access it.

Collapse the multi-line NotImplementedError raise into a single line to
match what the pre-commit hook produces, unblocking CI.
Collaborator

@jioffe502 jioffe502 left a comment

Per Liav approval

@liavnave liavnave merged commit 5e403b8 into NVIDIA:main May 31, 2026
9 checks passed

5 participants