docs: sync main extraction docs with 26.05 #2178
Greptile Summary
This PR syncs 14 documentation files under
| Filename | Overview |
|---|---|
| docs/docs/extraction/custom-metadata.md | Page restructured and significantly shortened; the new vdb_upload code example is missing the three metadata sidecar parameters, so it does not actually attach metadata. |
| docs/mkdocs.yml | Nav updated for notebooks.md rename, stale redirect removed; exclude_docs guard weakened from /index.md to index.md with the explanatory comment deleted. |
| docs/docs/extraction/releasenotes.md | 26.05 section condensed to a release-line summary; 26.03 section added inline; the 26.03 versioned docs URL format changed from 26.3.0 to 26.03, inconsistent with all other version links. |
| docs/docs/extraction/prerequisites-support-matrix.md | CUDA/Driver requirements updated to 12.2/535; OCR NIM updated to nemotron-ocr-v2; B200 limitation note and Nemotron Parse install prerequisites removed in line with 26.05. |
| docs/docs/extraction/notebooks.md | Renamed from notebooks/index.md; relative link to overview.md corrected; Related Topics section added. |
| docs/docs/extraction/audio-video.md | GPU pinning note converted to a proper !!! important admonition block; code example formatting corrected. |
| docs/docs/extraction/troubleshoot.md | Removed the open_clip / nemotron-parse troubleshooting entry, consistent with removing the [nemotron-parse] extra from prerequisites. |
| docs/docs/extraction/multimodal-extraction.md | Removed the Omni caption-scope callout for chart regions and simplified the OCR engine description to default to nemotron-ocr-v2. |
| docs/docs/extraction/faq.md | Removed the FAQ entry about PDF chart captioning with Omni, consistent with the multimodal-extraction.md cleanup. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["notebooks/index.md (old)"] -->|renamed to| B["notebooks.md"]
B --> C["mkdocs.yml nav updated"]
A -->|reverse redirect removed| D["redirect deleted"]
C --> E["exclude_docs: /index.md changed to index.md"]
F["prerequisites-support-matrix.md"] -->|CUDA 12.2 / Driver 535| G["Updated requirements"]
F -->|nemotron-ocr-v1 to v2| H["OCR NIM updated"]
F -->|Nemotron Parse removed| I["Affects troubleshoot.md, faq.md, multimodal-extraction.md"]
J["releasenotes.md"] -->|26.05 condensed| K["26.03 section added inline"]
K -->|URL 26.3.0 changed to 26.03| L["Possible broken link"]
M["custom-metadata.md"] -->|Rewritten| N["vdb_upload example"]
N -->|Missing meta_dataframe, meta_source_field, meta_fields| O["Metadata not attached"]
Comments Outside Diff (1)
- docs/docs/extraction/custom-metadata.md, line 62-92
  Metadata parameters missing from the `vdb_upload` call
  The example creates `meta_df` and writes it to `meta_file.csv`, but the `.vdb_upload(...)` call on line 83–88 does not pass any of the three required metadata parameters (`meta_dataframe`, `meta_source_field`, `meta_fields`). As written, a user who copies this example will ingest documents without any custom metadata attached, the opposite of what the page advertises. The prose on line 92 says "merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`" but provides no code showing how to do so.
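The merge step that the comment says is missing can be pictured with a small standard-library sketch. It is not the library's implementation: the record shape (a dict with `source` and `content_metadata` keys) and the helper names `load_sidecar` and `attach_metadata` are hypothetical, and only the ideas of a source field and a list of metadata fields mirror the `meta_source_field` and `meta_fields` parameters named in the review.

```python
import csv
import io


def load_sidecar(csv_text, source_field, fields):
    """Index sidecar rows by their source value, keeping only the requested fields."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {row[source_field]: {f: row[f] for f in fields} for row in rows}


def attach_metadata(records, sidecar):
    """Return copies of the records with sidecar values merged into content_metadata."""
    merged = []
    for record in records:
        # Hypothetical record shape: the "source" key identifies the document.
        extra = sidecar.get(record["source"], {})
        content_metadata = {**record.get("content_metadata", {}), **extra}
        merged.append({**record, "content_metadata": content_metadata})
    return merged


csv_text = "source,meta_a\nreport.pdf,alpha\n"
records = [{"source": "report.pdf", "content_metadata": {"type": "text"}}]
sidecar = load_sidecar(csv_text, source_field="source", fields=["meta_a"])
print(attach_metadata(records, sidecar)[0]["content_metadata"])
# {'type': 'text', 'meta_a': 'alpha'}
```

Records whose source has no sidecar row pass through unchanged, which is the silent failure mode the review describes when the metadata parameters are omitted.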
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
docs/docs/extraction/custom-metadata.md:62-92
**Metadata parameters missing from the `vdb_upload` call**
The example creates `meta_df` and writes it to `meta_file.csv`, but the `.vdb_upload(...)` call on line 83–88 does not pass any of the three required metadata parameters (`meta_dataframe`, `meta_source_field`, `meta_fields`). As written, a user who copies this example will ingest documents **without** any custom metadata attached — the opposite of what the page advertises. The prose on line 92 says "merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`" but provides no code showing how to do so.
### Issue 2 of 3
docs/mkdocs.yml:209-212
The `exclude_docs` pattern was changed from `/index.md` to `index.md`, and the comment explaining the distinction was removed. In gitignore syntax, `index.md` (no leading slash) matches every `index.md` at any directory depth, while `/index.md` restricts the match to the docs root only. With `notebooks/index.md` now renamed there is currently only one `index.md` in the tree so the behavior is equivalent — but silently dropping the guard means any future `index.md` added in a subdirectory will be excluded without warning.
```suggestion
# MkDocs 1.6+: exclude suite landing and legacy duplicate pages (still in repo for parity).
# extraction/chunking.md — removed from nav; content is under concepts.md (redirect_maps keeps old URLs).
# Use /index.md (docs root only); bare index.md would exclude every index.md in any subdirectory.
exclude_docs: |
/index.md
```
### Issue 3 of 3
docs/docs/extraction/releasenotes.md:57
The 26.03 versioned docs URL was changed to `26.03` format, but every other link in this table uses the `YY.M.P` semver-style path (e.g., `25.9.0`, `26.1.2`, `26.1.1`). The old value `26.3.0` was consistent with that pattern; `26.03` is not and may resolve to a 404 on the NVIDIA docs CDN.
```suggestion
| [26.03](https://docs.nvidia.com/nemo/retriever/26.3.0/extraction/releasenotes/)
```
Reviews (1): Last reviewed commit: "Update faq.md"
| # MkDocs 1.6+: exclude suite landing and legacy duplicate pages (still in repo for parity). | ||
| # extraction/chunking.md — removed from nav; content is under concepts.md (redirect_maps keeps old URLs). | ||
| # Use /index.md (docs root only); bare index.md would exclude every index.md (e.g. extraction/notebooks/index.md). | ||
| exclude_docs: | | ||
| /index.md | ||
| index.md |
The `exclude_docs` pattern was changed from `/index.md` to `index.md`, and the comment explaining the distinction was removed. In gitignore syntax, `index.md` (no leading slash) matches every `index.md` at any directory depth, while `/index.md` restricts the match to the docs root only. With `notebooks/index.md` now renamed there is currently only one `index.md` in the tree so the behavior is equivalent, but silently dropping the guard means any future `index.md` added in a subdirectory will be excluded without warning.
Suggested change:
| # MkDocs 1.6+: exclude suite landing and legacy duplicate pages (still in repo for parity). | |
| # extraction/chunking.md — removed from nav; content is under concepts.md (redirect_maps keeps old URLs). | |
| # Use /index.md (docs root only); bare index.md would exclude every index.md in any subdirectory. | |
| exclude_docs: | | |
|     /index.md |
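The anchoring rule described in this comment can be illustrated with a simplified model. This is a sketch, not MkDocs's actual matcher: `is_excluded` is a hypothetical helper that covers only the leading-slash and bare-file-name cases, and its wildcard handling (via `fnmatchcase`) is looser than real gitignore rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(pattern: str, doc_path: str) -> bool:
    """Simplified model of a single gitignore-style exclude pattern."""
    path = PurePosixPath(doc_path)
    if pattern.startswith("/"):
        # Anchored: the path relative to the docs root must match as a whole.
        return fnmatchcase(str(path), pattern[1:])
    if "/" not in pattern:
        # Bare name: compared against the file name at any directory depth.
        return fnmatchcase(path.name, pattern)
    # A pattern with an inner slash is also relative to the docs root.
    return fnmatchcase(str(path), pattern)


docs = ["index.md", "extraction/overview.md", "extraction/notebooks/index.md"]
print([d for d in docs if is_excluded("/index.md", d)])
# ['index.md']
print([d for d in docs if is_excluded("index.md", d)])
# ['index.md', 'extraction/notebooks/index.md']
```

With the anchored pattern only the root landing page is dropped; the bare pattern also drops any nested `index.md`, which is the regression risk the comment raises.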
| ## Release Notes for Previous Versions | ||
| | [26.03](https://docs.nvidia.com/nemo/retriever/26.3.0/extraction/releasenotes/) | ||
| | [26.03](https://docs.nvidia.com/nemo/retriever/26.03/extraction/releasenotes/) |
The 26.03 versioned docs URL was changed to `26.03` format, but every other link in this table uses the `YY.M.P` semver-style path (e.g., `25.9.0`, `26.1.2`, `26.1.1`). The old value `26.3.0` was consistent with that pattern; `26.03` is not and may resolve to a 404 on the NVIDIA docs CDN.
Suggested change:
| | [26.03](https://docs.nvidia.com/nemo/retriever/26.3.0/extraction/releasenotes/) |
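A quick way to catch this kind of inconsistency is a pattern check over the table's links. The sketch below assumes the `YY.M.P` convention exactly as the comment describes it (two-digit year, month without a leading zero, patch number); the regular expression and the helper name `has_semver_style_version` are illustrative and not part of any docs tooling.

```python
import re

# Assumed convention from the review comment: /retriever/YY.M.P/ with no zero-padded month.
VERSION_SEGMENT = re.compile(r"/retriever/(\d{2}\.[1-9]\d?\.\d+)/")


def has_semver_style_version(url: str) -> bool:
    """Return True when the URL carries a YY.M.P version path segment."""
    return VERSION_SEGMENT.search(url) is not None


old_url = "https://docs.nvidia.com/nemo/retriever/26.3.0/extraction/releasenotes/"
new_url = "https://docs.nvidia.com/nemo/retriever/26.03/extraction/releasenotes/"
print(has_semver_style_version(old_url))  # True
print(has_semver_style_version(new_url))  # False
```

The check only tests the shape of the path; whether a given URL actually resolves still has to be confirmed against the published docs.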
Closing: sync direction was incorrect. main/docs/docs already has the authoritative GA content (especially releasenotes.md). Copying from 26.05 regressed main. Follow-up PR will sync 26.05 to match main instead.
Summary
- `docs/` differed between `main` and `26.05`.
- `main` to match the `26.05` documentation content exactly (verified with `git diff upstream/26.05 -- docs/` showing no remaining differences).
- `notebooks/index.md` → `notebooks.md`), and `mkdocs.yml` nav updates.
Files changed
- docs/docs/extraction/audio-video.md
- docs/docs/extraction/concepts.md
- docs/docs/extraction/custom-metadata.md
- docs/docs/extraction/deployment-options.md
- docs/docs/extraction/faq.md
- docs/docs/extraction/getting-started-about.md
- docs/docs/extraction/integrations-langchain-llamaindex-haystack.md
- docs/docs/extraction/multimodal-extraction.md
- docs/docs/extraction/notebooks.md (renamed from notebooks/index.md)
- docs/docs/extraction/overview.md
- docs/docs/extraction/prerequisites-support-matrix.md
- docs/docs/extraction/releasenotes.md
- docs/docs/extraction/troubleshoot.md
- docs/mkdocs.yml
Test plan
- `git diff upstream/26.05 -- docs/` is empty on this branch
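The test-plan check could be scripted roughly as follows. This is a sketch under assumptions: the helper name `docs_in_sync` and its injectable `run` parameter are hypothetical, and it relies on `git diff --quiet`, which exits with 0 when there are no differences and 1 when there are.

```python
import subprocess


def docs_in_sync(ref: str, path: str = "docs/", run=subprocess.run) -> bool:
    """Return True when `git diff --quiet <ref> -- <path>` reports no differences."""
    result = run(["git", "diff", "--quiet", ref, "--", path])
    if result.returncode not in (0, 1):
        # Any other exit code means git itself failed (for example, an unknown ref).
        raise RuntimeError(f"git diff failed with exit code {result.returncode}")
    return result.returncode == 0
```

Inside a checkout of the repository, `docs_in_sync("upstream/26.05")` would correspond to the check listed in the test plan; the `run` parameter exists only so the function can be exercised without a real repository.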