Skip to content

cherry-picked skills from main#2177

Open
sosahi wants to merge 2 commits into
26.05from
sohail/pull-skill-in-release
Open

cherry-picked skills from main#2177
sosahi wants to merge 2 commits into
26.05from
sohail/pull-skill-in-release

Conversation

@sosahi
Copy link
Copy Markdown
Collaborator

@sosahi sosahi commented May 29, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@sosahi sosahi requested review from a team as code owners May 29, 2026 23:59
@sosahi sosahi requested review from jioffe502 and removed request for a team May 29, 2026 23:59
@sosahi
Copy link
Copy Markdown
Collaborator Author

sosahi commented May 29, 2026

/nvskills-ci

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 30, 2026

Greptile Summary

This PR cherry-picks the nemo-retriever skill from main into the 26.05 branch, replacing the old single-file SKILL.md with a fully structured skill package including install, setup, query, and troubleshooting references, two new utility scripts (filename_fast_path.py, grep_corpus.py), evaluation tasks, and a benchmark report.

  • Skill restructure: the monolithic .claude/skills/nemo-retriever/SKILL.md is replaced by a modular layout under skills/nemo-retriever/ with dedicated references/ docs and a symlink from .claude/skills — all workflow references consistently adopt nv-ingest as the LanceDB table name.
  • New utility scripts: filename_fast_path.py implements a query-turn PDF fast-path via pdfium; grep_corpus.py adds regex search over the already-built LanceDB index; several previously flagged issues (SPDX headers, default table name in grep_corpus.py, file-handle lifecycle, test coverage) on these scripts carry over from the source branch and remain open.
  • Documentation inconsistency: references/cli/ingest.md's Key Flags table still documents --table-name default as nemo-retriever while every other part of the skill uses nv-ingest, which will mislead any agent reading the flags reference.

Confidence Score: 4/5

Safe to merge with one documentation fix: the Key Flags table in ingest.md still documents the wrong default table name, contradicting the rest of the skill.

The structural changes and new reference docs are clean and internally consistent except for one stale line in ingest.md's Key Flags table where --table-name still reads nemo-retriever while the narrative in the same file and every other skill reference uses nv-ingest. Any agent that reads the flag description to decide what table a plain retriever ingest writes into will use the wrong name for all subsequent queries and corpus searches.

skills/nemo-retriever/references/cli/ingest.md — the --table-name default in the Key Flags table needs updating to nv-ingest.

Important Files Changed

Filename Overview
skills/nemo-retriever/references/cli/ingest.md Renamed and updated CLI reference; the Key Flags --table-name default (nemo-retriever) was not updated to match the new nv-ingest default stated in the narrative and all other skill docs.
skills/nemo-retriever/scripts/grep_corpus.py New corpus-grep utility; previously flagged issues (SPDX header, wrong default table name nemo-retriever, full to_pandas() memory load) remain open.
skills/nemo-retriever/scripts/filename_fast_path.py New query-turn fast-path script; previously flagged issues (SPDX header, unclosed file handle, no unit tests) remain open.
skills/nemo-retriever/SKILL.md New top-level skill entrypoint; workflow table, hard limits, and reference pointers look consistent and well-structured.
skills/nemo-retriever/references/query.md New query workflow reference; canonical pipeline, grep_corpus, chart/image caution, and non-semantic operations all consistently reference nv-ingest.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Agent receives query]) --> B{nv-ingest.lance exists?}
    B -- No --> C[Read references/setup.md]
    C --> D{TOTAL_PAGES le 800?}
    D -- Yes --> E[retriever ingest ./pdfs/]
    D -- No --> F["retriever pipeline run --quiet"]
    E & F --> G([STOP - index built])
    B -- Yes --> H[Read references/query.md]
    H --> I["retriever query --top-k 10 --rerank"]
    I --> J{Hits sufficient?}
    J -- Yes --> K([Synthesize final_answer - STOP])
    J -- chart/image hit --> L["retriever pdf stage page-elements"]
    J -- exact text needed --> M["grep_corpus.py regex"]
    L & M --> K
    I -- Empty --> N[Read references/troubleshooting.md]
    N --> O[Pick best PDF from ./pdfs/]
    O --> P["retriever pdf stage page-elements single PDF"]
    P --> K
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/references/cli/ingest.md:95
**`--table-name` default contradicts the rest of the skill docs**

The Key Flags table still documents the default as `nemo-retriever`, but this same file's canonical invocation header was updated to `lancedb/nv-ingest.lance`, and every other reference in the skill (`SKILL.md`, `references/query.md` page-filter and aggregate one-liners, and `setup.md`) consistently uses `nv-ingest`. An agent reading the Key Flags section to understand what table `retriever ingest` writes into will use the wrong table name in any subsequent `retriever query`, `grep_corpus.py`, or direct LanceDB call — causing all corpus searches to silently return nothing.

Reviews (2): Last reviewed commit: "fixed syntax" | Re-trigger Greptile

@@ -0,0 +1,161 @@
"""Query-turn filename fast path for the nemo-retriever skill.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing SPDX license header

Both new Python files (filename_fast_path.py and grep_corpus.py) are missing the required SPDX license header. Per the repository rule, every Python file added in a PR must begin with:

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

This applies to both scripts/filename_fast_path.py and scripts/grep_corpus.py.

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: skills/nemo-retriever/scripts/filename_fast_path.py
Line: 1

Comment:
**Missing SPDX license header**

Both new Python files (`filename_fast_path.py` and `grep_corpus.py`) are missing the required SPDX license header. Per the repository rule, every Python file added in a PR must begin with:

```
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```

This applies to both `scripts/filename_fast_path.py` and `scripts/grep_corpus.py`.

**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

ap.add_argument("--max-hits", type=int, default=50)
ap.add_argument("--snippet-pad", type=int, default=60)
ap.add_argument("--lancedb-uri", default="./lancedb")
ap.add_argument("--table-name", default="nemo-retriever")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Change the default table name to match the one retriever ingest actually writes into.

Suggested change
ap.add_argument("--table-name", default="nemo-retriever")
ap.add_argument("--table-name", default="nv-ingest")
Prompt To Fix With AI
This is a comment left during a code review.
Path: skills/nemo-retriever/scripts/grep_corpus.py
Line: 37

Comment:
Change the default table name to match the one `retriever ingest` actually writes into.

```suggestion
    ap.add_argument("--table-name", default="nv-ingest")
```

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +94 to +95
def page_records(sidecar: str) -> list[dict]:
data = json.load(open(sidecar))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The file opened by json.load(open(sidecar)) is never explicitly closed. Use a context manager instead.

Suggested change
def page_records(sidecar: str) -> list[dict]:
data = json.load(open(sidecar))
def page_records(sidecar: str) -> list[dict]:
with open(sidecar) as fh:
data = json.load(fh)
Prompt To Fix With AI
This is a comment left during a code review.
Path: skills/nemo-retriever/scripts/filename_fast_path.py
Line: 94-95

Comment:
The file opened by `json.load(open(sidecar))` is never explicitly closed. Use a context manager instead.

```suggestion
def page_records(sidecar: str) -> list[dict]:
    with open(sidecar) as fh:
        data = json.load(fh)
```

How can I resolve this? If you propose a fix, please make it concise.

@sosahi
Copy link
Copy Markdown
Collaborator Author

sosahi commented May 30, 2026

/nvskills-ci

1 similar comment
@sosahi
Copy link
Copy Markdown
Collaborator Author

sosahi commented May 30, 2026

/nvskills-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant