docs: sync 26.05 docs/docs with main by kheiss-uwzoo · Pull Request #2179 · NVIDIA/NeMo-Retriever

kheiss-uwzoo · 2026-05-30T13:45:23Z

Summary

Audit result: docs/docs/ on main and 26.05 differ in 13 extraction pages plus docs/mkdocs.yml nav/redirects. main is authoritative — it has the GA 26.05 release notes, updated support matrix (CUDA 13.0 / driver 580, Nemotron Parse extra), caption-scope FAQ, and open_clip troubleshooting.
This PR updates the 26.05 branch so docs/docs/ matches main exactly (git diff upstream/main -- docs/docs/ is empty on this branch).

Notable content restored on 26.05

- releasenotes.md: Full GA 26.05 highlights (upgrade notes, pipeline, CLI, service, models, multimodal, RAG, VDB, evaluation, packaging, Helm, documentation) instead of RC1 install boilerplate
- prerequisites-support-matrix.md: Current CUDA/driver requirements and Nemotron Parse dependency note
- faq.md / troubleshoot.md: Caption scope FAQ and open_clip install guidance
- custom-metadata.md: Restructured filtering doc from main
- notebooks/index.md: Restored main nav path (with matching mkdocs.yml redirect)

Test plan

- `git diff upstream/main -- docs/docs/` is empty on this branch
- MkDocs build on 26.05 succeeds with updated nav
- Release notes page shows GA content, not RC1 install-only text
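
The first test-plan item can be scripted. The following is a minimal sketch, assuming the remote is named `upstream` as in the command above; the helper name `docs_in_sync` and its injectable `run` parameter are illustrative and not part of the repository. It relies on `git diff --quiet` exiting with 0 when there are no differences and 1 when there are.

```python
import subprocess


def docs_in_sync(ref="upstream/main", path="docs/docs/", run=subprocess.run):
    """Return True when `git diff --quiet <ref> -- <path>` reports no differences.

    `run` is injectable so the check can be exercised without a repository.
    """
    result = run(["git", "diff", "--quiet", ref, "--", path])
    # Exit code 0 means identical, 1 means differences; anything else is a git error.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git diff failed with exit code {result.returncode}")
    return result.returncode == 0
```

Called with no arguments from the repository root, this runs the same comparison as the test-plan command and reports whether it came back empty.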

greptile-apps · 2026-05-30T13:57:21Z

Greptile Summary

This PR syncs docs/docs/ on the 26.05 branch with main, replacing RC1 install boilerplate with GA release notes and carrying over content updates including updated CUDA/driver requirements (12.2/535 → 13.0/580), OCR NIM clarifications, Nemotron Parse dependency docs, a new chart-captioning FAQ, and open_clip troubleshooting. Most of the 14 files are clean, but two files — audio-video.md and custom-metadata.md — have defects introduced during the sync that will break the published documentation.

- `audio-video.md`: The removal of an `!!! important` admonition left a critical GPU-pinning note 4-space-indented outside a list (renders as a code block), and the code fence restructuring inserted a stray `)` that produces a Python `SyntaxError` in the copyable example; two near-duplicate `segment_audio` paragraphs also appeared.
- `custom-metadata.md`: The new "On this page" TOC references 6 section anchors that don't exist in the document body; the `## How metadata is stored` heading was renamed from "Related Content" without updating its content; and variable definitions (`hostname`, `lancedb_uri`, `table_name`) were removed but are still referenced in the ingestor code example, causing a `NameError`.

Confidence Score: 3/5

Not safe to merge as-is: two files have doc defects that will ship broken code examples and broken navigation to 26.05 users.

The majority of files in this sync are clean and accurate, but audio-video.md ships a Python SyntaxError in a copyable code block and hides a critical GPU-pinning deployment note as a code block. custom-metadata.md ships an ingestor snippet that throws NameError on first run and an On this page TOC with six dead anchor links. These are visible, immediately reproducible defects in the published documentation that will affect users following the 26.05 setup guides.

docs/docs/extraction/audio-video.md and docs/docs/extraction/custom-metadata.md both need fixes before merge; all other files look correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/docs/extraction/audio-video.md | Removal of `!!! important` admonition leaves GPU-pinning note as a code block; stray `)` produces `SyntaxError` in code sample; two near-duplicate `segment_audio` paragraphs with conflicting API names. |
| docs/docs/extraction/custom-metadata.md | New TOC references 6 non-existent section anchors; "How metadata is stored" heading renamed without updating content; variable definitions removed but still referenced in code example. |
| docs/docs/extraction/releasenotes.md | RC1 install boilerplate fully replaced with GA 26.05 release notes. |
| docs/docs/extraction/prerequisites-support-matrix.md | CUDA/driver requirements updated; OCR NIM corrected; Nemotron Parse extra documented; caption-scope note added. |
| docs/mkdocs.yml | Nav and redirect updated for notebooks/index.md; `exclude_docs` pattern fixed. |

Prompt To Fix All With AI

Fix the following 6 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 6
docs/docs/extraction/audio-video.md:66-68
**GPU pinning note silently rendered as a code block**

After the `!!! important` admonition was removed, the paragraph at line 68 (`Pin the Parakeet workload…`) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an **indented code block**, so this critical deployment warning will render as `<pre><code>` text rather than readable prose — readers following the setup steps will miss the GPU pinning requirement entirely.

### Issue 2 of 6
docs/docs/extraction/audio-video.md:88-91
**Stray `)` produces a `SyntaxError` in the code sample**

The closing `)` at line 90 is placed outside the code fence, making it part of the rendered code content. The code block therefore ends with two consecutive `)` characters — one closing `extract_audio(...)` and an extra one below `ingestor = (...)`. Anyone copying this snippet will get a `SyntaxError` immediately.

```suggestion
        )
    )
```

### Issue 3 of 6
docs/docs/extraction/audio-video.md:93-97
**Duplicate near-identical `segment_audio` paragraphs with conflicting API names**

Line 93 (unindented) says to use `extract_audio_params={"segment_audio": True}` with `.extract(...)`, while line 95 (indented continuation of step 3) says to use `asr_params=ASRParams(segment_audio=True)` with `.extract_audio(...)`. These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor.

### Issue 4 of 6
docs/docs/extraction/custom-metadata.md:40-42
**Undefined variables make the code example un-runnable**

The diff removes the `hostname`, `table_name`, and `lancedb_uri` variable definitions that previously preceded the `ingestor = (...)` block, but the `create_ingestor(...)` call still references all three. Copying this snippet results in a `NameError` on `hostname`. The variable definitions need to be restored.

```suggestion
hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"

ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
```

### Issue 5 of 6
docs/docs/extraction/custom-metadata.md:5-14
**"On this page" TOC contains 6 broken anchor links**

The table of contents added in this PR references `#filter-results-at-query-time`, `#writing-where-predicates`, `#server-side-vs-client-side-filters`, `#inspect-hit-metadata`, `#limitations`, and `#related-content`. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (`## Best Practices`, `## Use Custom Metadata to Filter Results During Retrieval`, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page.

### Issue 6 of 6
docs/docs/extraction/custom-metadata.md:125-128
**Section heading "How metadata is stored" contains only cross-reference bullets**

The heading at line 125 was renamed from `## Related Content` to `## How metadata is stored`, but its body was not updated — it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the `metadata` column, how `content_metadata` fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation.

Reviews (1): Last reviewed commit: "docs: sync 26.05 docs/docs with main"

greptile-apps · 2026-05-30T13:57:25Z

+After deploy, call the pipeline from Python:

    Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster).


GPU pinning note silently rendered as a code block

After the !!! important admonition was removed, the paragraph at line 68 (Pin the Parakeet workload…) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an indented code block, so this critical deployment warning will render as <pre><code> text rather than readable prose — readers following the setup steps will miss the GPU pinning requirement entirely.
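
This class of defect can be caught before publishing with a small heuristic scan. The sketch below is an approximation of the indented-code-block rule, not a Markdown parser: it flags lines indented by four spaces (or a tab) that follow a blank line while no list item or admonition is open. The function name `find_indented_code_lines` and the sample strings are illustrative.

```python
import re

# A list item ("- ", "1. ") or an admonition ("!!! note") opens a container
# whose continuation lines are legitimately indented.
CONTAINER = re.compile(r"^\s{0,3}(?:[-*+]\s+|\d+[.)]\s+|!!!|\?\?\?)")
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def find_indented_code_lines(markdown_text):
    """Return 1-based numbers of lines likely to render as an indented code block."""
    flagged = []
    in_fence = False
    in_container = False
    prev_blank = True
    prev_flagged = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            prev_blank = False
            prev_flagged = False
            continue
        if in_fence:
            continue
        if not line.strip():
            prev_blank = True
            continue
        indented = line.startswith("    ") or line.startswith("\t")
        if CONTAINER.match(line):
            in_container = True
            is_code = False
        elif not indented:
            in_container = False
            is_code = False
        else:
            # Indented text with no open container, starting after a blank
            # line or continuing an already flagged block.
            is_code = not in_container and (prev_blank or prev_flagged)
        if is_code:
            flagged.append(number)
        prev_flagged = is_code
        prev_blank = False
    return flagged


inside_list = "1. Deploy the NIM.\n\n    Pin the workload to a GPU.\n"
outside_list = "After deploy, call the pipeline:\n\n    Pin the workload to a GPU.\n"
print(find_indented_code_lines(inside_list))
print(find_indented_code_lines(outside_list))
```

The second sample mirrors the shape of the affected passage: an unindented lead-in sentence, a blank line, then the indented note.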

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/audio-video.md Line: 66-68 Comment: **GPU pinning note silently rendered as a code block** After the `!!! important` admonition was removed, the paragraph at line 68 (`Pin the Parakeet workload…`) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an **indented code block**, so this critical deployment warning will render as `<pre><code>` text rather than readable prose — readers following the setup steps will miss the GPU pinning requirement entirely. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-30T13:57:26Z

        )
    )
-    ```
+)
+```


Stray ) produces a SyntaxError in the code sample

The closing ) at line 90 is placed outside the code fence, making it part of the rendered code content. The code block therefore ends with two consecutive ) characters — one closing extract_audio(...) and an extra one below ingestor = (...). Anyone copying this snippet will get a SyntaxError immediately.
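
A quick way to confirm this kind of failure is to parse the snippet with Python's `ast` module, which raises `SyntaxError` without executing anything. The sketch below uses placeholder call names (`build_pipeline`, `step`) rather than the real ingestor API; only the bracket structure matters here.

```python
import ast


def syntax_error_in(snippet):
    """Return the SyntaxError raised when parsing `snippet`, or None if it parses."""
    try:
        ast.parse(snippet)
    except SyntaxError as exc:
        return exc
    return None


# Placeholder names; the shape mirrors a parenthesized call chain.
BALANCED = """
pipeline = (
    build_pipeline()
    .step(
        option=True
    )
)
"""

# The same snippet with one extra closing parenthesis appended.
UNBALANCED = BALANCED + ")\n"

print(syntax_error_in(BALANCED))
print(type(syntax_error_in(UNBALANCED)).__name__)
```

Because `ast.parse` only checks syntax, the placeholder names do not need to exist for the check to work.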

Suggested change

-        )
-    )
-    ```
-)
-```
+        )
+    )

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/audio-video.md Line: 88-91 Comment: **Stray `)` produces a `SyntaxError` in the code sample** The closing `)` at line 90 is placed outside the code fence, making it part of the rendered code content. The code block therefore ends with two consecutive `)` characters — one closing `extract_audio(...)` and an extra one below `ingestor = (...)`. Anyone copying this snippet will get a `SyntaxError` immediately. ```suggestion ) ) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-30T13:57:27Z

+To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

    To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

-
-    !!! tip
-
-        For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+    For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).


Duplicate near-identical segment_audio paragraphs with conflicting API names

Line 93 (unindented) says to use extract_audio_params={"segment_audio": True} with .extract(...), while line 95 (indented continuation of step 3) says to use asr_params=ASRParams(segment_audio=True) with .extract_audio(...). These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor.

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/audio-video.md Line: 93-97 Comment: **Duplicate near-identical `segment_audio` paragraphs with conflicting API names** Line 93 (unindented) says to use `extract_audio_params={"segment_audio": True}` with `.extract(...)`, while line 95 (indented continuation of step 3) says to use `asr_params=ASRParams(segment_audio=True)` with `.extract_audio(...)`. These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-30T13:57:28Z

 ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])


Undefined variables make the code example un-runnable

The diff removes the hostname, table_name, and lancedb_uri variable definitions that previously preceded the ingestor = (...) block, but the create_ingestor(...) call still references all three. Copying this snippet results in a NameError on hostname. The variable definitions need to be restored.
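
Missing definitions like these can be spotted statically. The following is a rough sketch using the `ast` module: it lists names a snippet reads but never binds, ignoring scoping and statement order. The helper name `undefined_names` and its `known` parameter are illustrative, and the excerpt is closed with a `)` here only so that it parses on its own.

```python
import ast
import builtins


def undefined_names(source, known=()):
    """Return sorted names that `source` reads but never binds (heuristic)."""
    bound = set(known) | set(dir(builtins))
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
    return sorted(used - bound)


# Excerpt of the affected example, truncated and closed for parsing.
SNIPPET = '''
ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
    .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
)
'''

# `create_ingestor` is assumed to be imported elsewhere on the page.
print(undefined_names(SNIPPET, known={"create_ingestor"}))
print(undefined_names('hostname = "localhost"\n' + SNIPPET, known={"create_ingestor"}))
```

The check also sees names used inside f-strings, which is where `hostname` appears in this example.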

Suggested change

-ingestor = (
-    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
-        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
+hostname = "localhost"
+table_name = "nemo_retriever_collection"
+lancedb_uri = "./lancedb_data"
+
+ingestor = (
+    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
+        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/custom-metadata.md Line: 40-42 Comment: **Undefined variables make the code example un-runnable** The diff removes the `hostname`, `table_name`, and `lancedb_uri` variable definitions that previously preceded the `ingestor = (...)` block, but the `create_ingestor(...)` call still references all three. Copying this snippet results in a `NameError` on `hostname`. The variable definitions need to be restored. ```suggestion hostname = "localhost" table_name = "nemo_retriever_collection" lancedb_uri = "./lancedb_data" ingestor = ( create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-30T13:57:29Z

+## On this page { #on-this-page }

+- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
+- [How metadata is stored](#how-metadata-is-stored)
+- [Filter results at query time](#filter-results-at-query-time)
+- [Writing `where` predicates](#writing-where-predicates)
+- [Server-side vs client-side filters](#server-side-vs-client-side-filters)
+- [Inspect hit metadata](#inspect-hit-metadata)
+- [Limitations](#limitations)
+- [Related content](#related-content)


"On this page" TOC contains 6 broken anchor links

The table of contents added in this PR references #filter-results-at-query-time, #writing-where-predicates, #server-side-vs-client-side-filters, #inspect-hit-metadata, #limitations, and #related-content. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (## Best Practices, ## Use Custom Metadata to Filter Results During Retrieval, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page.
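
Dead in-page links of this kind can be detected by comparing `#anchor` targets against the ids the headings would produce. The sketch below is an approximation: it honors explicit `{ #id }` attributes and otherwise uses a simplified slug rule (lowercase, punctuation dropped, whitespace hyphenated) that may differ from the slugger MkDocs is configured with. It also does not skip headings inside code fences. The sample page is a hypothetical, trimmed stand-in for the real file.

```python
import re

LINK = re.compile(r"\]\(#([^)\s]+)\)")
HEADING = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)
EXPLICIT_ID = re.compile(r"\{\s*#([\w-]+)\s*\}\s*$")


def slugify(title):
    """Approximate a heading slug: lowercase, drop punctuation, hyphenate spaces."""
    cleaned = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", cleaned.strip())


def heading_ids(markdown_text):
    """Collect the anchor id of every ATX heading in the text."""
    ids = set()
    for match in HEADING.finditer(markdown_text):
        title = match.group(1)
        explicit = EXPLICIT_ID.search(title)
        ids.add(explicit.group(1) if explicit else slugify(title))
    return ids


def broken_anchors(markdown_text):
    """Return in-page link targets that no heading provides, in document order."""
    ids = heading_ids(markdown_text)
    return [anchor for anchor in LINK.findall(markdown_text) if anchor not in ids]


# Hypothetical, shortened page used only to exercise the check.
PAGE = """## On this page { #on-this-page }

- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
- [How metadata is stored](#how-metadata-is-stored)
- [Limitations](#limitations)

## Attach metadata at ingestion

## How metadata is stored { #how-metadata-is-stored }

## Best Practices
"""

print(broken_anchors(PAGE))
```

In the sample, only the `limitations` link has no matching heading, so it is the one reported.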

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/custom-metadata.md Line: 5-14 Comment: **"On this page" TOC contains 6 broken anchor links** The table of contents added in this PR references `#filter-results-at-query-time`, `#writing-where-predicates`, `#server-side-vs-client-side-filters`, `#inspect-hit-metadata`, `#limitations`, and `#related-content`. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (`## Best Practices`, `## Use Custom Metadata to Filter Results During Retrieval`, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-30T13:57:30Z

+## How metadata is stored { #how-metadata-is-stored }

 - [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide
 - [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata


Section heading "How metadata is stored" contains only cross-reference bullets

The heading at line 125 was renamed from ## Related Content to ## How metadata is stored, but its body was not updated — it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the metadata column, how content_metadata fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation.

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/docs/extraction/custom-metadata.md Line: 125-128 Comment: **Section heading "How metadata is stored" contains only cross-reference bullets** The heading at line 125 was renamed from `## Related Content` to `## How metadata is stored`, but its body was not updated — it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the `metadata` column, how `content_metadata` fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation. How can I resolve this? If you propose a fix, please make it concise.

docs: sync 26.05 docs/docs with main

612b49d

kheiss-uwzoo requested review from a team as code owners May 30, 2026 13:45

kheiss-uwzoo requested review from jioffe502 and removed request for a team May 30, 2026 13:45

kheiss-uwzoo added the doc Improvements or additions to documentation label May 30, 2026

kheiss-uwzoo requested review from randerzander and sosahi May 30, 2026 13:56

greptile-apps Bot reviewed May 30, 2026

kheiss-uwzoo wants to merge 1 commit into NVIDIA:26.05 from kheiss-uwzoo:docs/sync-26.05-docs-with-main


