[bug fix] Fail loudly on local embedding failures #2114
Greptile Summary
This PR fixes a contract hole in the local embedding path where backend exceptions were silently converted into empty-embedding rows, making
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/text_embed/runtime.py | Adds _is_local_embed helper and uses it in the except Exception block to re-raise for local backend failures while preserving remote error-payload rows; minimal, well-scoped change with no side effects on the success path. |
| nemo_retriever/tests/test_text_embed_runtime.py | New test file with correct SPDX header; covers local failure (re-raise), local zero-dim success (non-fatal), and remote failure (error-payload preservation). Autouse fixture drains the error reporter around each test for isolation. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[embed_text_main_text_embed] --> B{endpoint or model?}
B -- neither --> C[raise ValueError]
B -- provided --> D[_embed_group]
D -- success --> E[compute dim + has_embedding]
E --> F[return out_df]
D -- raises Exception --> G[empty_cache + log + report_error]
G --> H{_is_local_embed?}
H -- Yes --> J[re-raise original exception]
H -- No --> K[build error-payload rows]
K --> L[return error out_df]
Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/local-emb..."
jperez999 left a comment
Make sure to fix those Greptile comments, or resolve them if they don't need to be fixed.
Review update: narrowed this PR to the proven SDK contract fix only. Current branch behavior:
This removes the earlier post-success validation scope and resolves the Greptile concerns around private helper imports, validation error wording, GPU cleanup on validation failure, and modality-specific validation test coverage.
Summary
Fixes the SDK local embedding contract so local backend exceptions cannot be returned as successful-looking ingested rows.
Problem
Before this change, local SDK embedding could fail during model/vLLM initialization or inference, but the runtime would catch the exception and still return rows like:
text_embeddings_1b_v2 = {"embedding": [], "error": "..."}
text_embeddings_1b_v2_dim = 0
text_embeddings_1b_v2_has_embedding = False
That made .ingest() appear to succeed even though the local embedding backend had failed. Downstream retrieval/vector-store workflows could then receive rows that looked ingested but were not queryable.
Change
For local embedding only:
Remote/NIM behavior is preserved because remote paths may intentionally use row-level error payloads.
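Going by the description and flowchart above, the split between local and remote failure handling can be sketched roughly as follows. This is a minimal illustration and not the code in runtime.py: `embed_rows`, `is_local_embed`, the `embedder` callable, and the rule that "local" means no endpoint was configured are all assumptions made for the example, and rows are plain dicts rather than a DataFrame.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

# Hypothetical embedder type: takes texts, returns one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def is_local_embed(endpoint: Optional[str]) -> bool:
    # Assumption for this sketch: no configured endpoint means a local backend.
    return not (endpoint or "").strip()


def embed_rows(
    texts: Sequence[str],
    embedder: Embedder,
    endpoint: Optional[str] = None,
    column: str = "text_embeddings_1b_v2",
) -> List[Dict[str, Any]]:
    try:
        vectors = embedder(texts)
    except Exception as exc:
        if is_local_embed(endpoint):
            # Local backend failure: re-raise so the caller cannot mistake
            # the run for a successful ingest.
            raise
        # Remote failure: keep the row-level error payload behavior.
        error = f"{type(exc).__name__}: {exc}"
        return [
            {
                column: {"embedding": [], "error": error},
                f"{column}_dim": 0,
                f"{column}_has_embedding": False,
            }
            for _ in texts
        ]
    # Success path: zero-dimension vectors stay non-fatal and are only flagged.
    return [
        {
            column: {"embedding": list(vec), "error": None},
            f"{column}_dim": len(vec),
            f"{column}_has_embedding": len(vec) > 0,
        }
        for vec in vectors
    ]
```

With this shape, a local embedder that raises surfaces the original exception, while the same failure behind a remote endpoint still yields one error row per input.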
This PR intentionally does not add post-success per-row validation. If a local embed call completes and returns zero-dimension rows, those rows remain non-fatal and continue to carry *_dim = 0 / *_has_embedding = False. That avoids aborting long runs where only some records fail or have no vectors.
Expected Contract
After this change:
*_has_embedding = False without aborting the full batch
Notes
This PR does not claim to fix H100/vLLM/DeepGEMM setup. Pristine upstream/main validation showed the current H100 repro was caused by missing Python 3.12 development headers during vLLM/Triton JIT compilation. Once matching Python headers and sane CUDA env were provided, unmodified upstream/main produced a 2048-d embedding.
This PR fixes the SDK contract hole exposed by local backend exceptions.
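On the caller's side, this contract means a local backend failure arrives as an exception, so the only remaining check is for individual rows that completed without a vector. A minimal sketch of that check, assuming rows are available as plain dicts keyed by the column names shown in the Problem section (`count_embedded` is a hypothetical helper, not part of the SDK):

```python
from typing import Any, Dict, Iterable, Tuple


def count_embedded(rows: Iterable[Dict[str, Any]], column: str) -> Tuple[int, int]:
    """Return (rows with a vector, rows without one) for an embedding column."""
    flag = f"{column}_has_embedding"
    with_vec = 0
    without_vec = 0
    for row in rows:
        # A missing or False flag both count as "no usable embedding".
        if row.get(flag):
            with_vec += 1
        else:
            without_vec += 1
    return with_vec, without_vec


# Illustrative rows only; the values are made up for the example.
sample_rows = [
    {"text_embeddings_1b_v2_dim": 2048, "text_embeddings_1b_v2_has_embedding": True},
    {"text_embeddings_1b_v2_dim": 0, "text_embeddings_1b_v2_has_embedding": False},
]
print(count_embedded(sample_rows, "text_embeddings_1b_v2"))  # (1, 1)
```

A non-zero second count indicates rows that were ingested but will not be queryable, which is a softer signal than the hard failure a local backend exception now produces.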
Validation
uv run --project nemo_retriever --extra dev pytest nemo_retriever/tests/test_text_embed_runtime.py
uv run --project nemo_retriever --extra dev pytest nemo_retriever/tests/test_store_pipeline_stages.py nemo_retriever/tests/test_actor_operators.py::TestBatchEmbedActor nemo_retriever/tests/test_operator_flags_and_cpu_actors.py::TestBatchEmbedCPUActor
git diff --check HEAD~1..HEAD
Checklist