(skill) update filename match fast path by edknv · Pull Request #2162 · NVIDIA/NeMo-Retriever

edknv · 2026-05-28T23:50:38Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

…est into edwardk/skill-single-source

edknv · 2026-05-28T23:52:13Z

/nvskills-ci

greptile-apps · 2026-05-28T23:54:09Z

Greptile Summary

This PR tightens the filename fast path in the nemo-retriever skill so that PDF basename matching now requires the .pdf extension (stem-only matching is removed to avoid false positives on common English words), and propagates that constraint into documentation. It also ships several robustness fixes that address previous review feedback.

Matching logic (filename_fast_path.py): find_matches now requires the full name.pdf literal in the query; the duplicate sidecar candidate path ({stem}.pdf.pdf_extraction.json → {stem}.pdf_extraction.json) is corrected; extract_pages redirects subprocess stdout to DEVNULL and wraps each file in a try/except CalledProcessError so a single bad PDF no longer aborts the whole run; page_records gains a json.JSONDecodeError handler; and os.listdir is wrapped to emit a clean error on missing ./pdfs/.
Docs / metadata: references/query.md, SKILL.md, and references/pitfalls.md are updated to match the new fast-path trigger condition and remove the output.json write step; BENCHMARK.md and skill-card.md reflect the updated NVSkills-Eval PASS result.

Confidence Score: 5/5

Safe to merge; all previously flagged issues are addressed and the logic changes are consistent with the updated documentation.

The matching change is narrow and well-motivated, the error-handling additions are defensive improvements with no regression risk, and the documentation is kept in sync. The only open gap is the absence of unit tests for the changed matching logic.

skills/nemo-retriever/scripts/filename_fast_path.py — the find_matches behaviour change has no automated coverage.

Important Files Changed

Filename	Overview
skills/nemo-retriever/scripts/filename_fast_path.py	Tightens filename matching to require the `.pdf` extension, fixes the duplicate sidecar candidate path, adds per-file error handling for extraction failures, and redirects subprocess stdout to DEVNULL to fix stdout protocol pollution — all previously-reviewed issues are addressed.
skills/nemo-retriever/references/query.md	Updated fast-path trigger condition to document that the `.pdf` extension is now required, aligning documentation with the code change in filename_fast_path.py.
skills/nemo-retriever/SKILL.md	Removes the instruction to write output.json and updates query-turn guidance to synthesize answers in-place; cosmetic improvements to hard-limit descriptions.
skills/nemo-retriever/references/pitfalls.md	Trimmed failure-mode list by moving cold-start and low-relevance entries to the CLI reference docs, reducing duplication that Tier 2 had flagged.
skills/nemo-retriever/BENCHMARK.md	Reflects updated NVSkills-Eval results: overall verdict flipped to PASS, Tier 2 duplicate finding resolved (0 findings), finding count updated to 21.
skills/nemo-retriever/skill-card.md	Updated skill metadata: version bumped, output format description extended, evaluation summary updated to match new BENCHMARK.md, reference list reordered.
skills/nemo-retriever/skill.oms.sig	Signature bundle refreshed to cover the updated file set; no code change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([filename_fast_path.py called with query]) --> B{list ./pdfs/}
    B -- FileNotFoundError/PermissionError --> C[print ERROR to stderr\nexit 1]
    B -- success --> D[find_matches: scan basenames\nrequire name.pdf in query_lower]
    D -- no matches --> E[print NO_MATCH\nexit 0]
    D -- matches found --> F[extract_pages: run retriever\nstdout=DEVNULL per file]
    F -- CalledProcessError --> G[WARN to stderr, skip file\ncontinue loop]
    F -- success --> H[write sidecar JSON\nto /tmp/pdf_text/]
    G & H --> I[rank_pages: load sidecars\nscore by token freq]
    I -- json.JSONDecodeError --> J[ERROR to stderr\nreturn empty list]
    I -- no sidecar found --> K[skip match]
    J & K & I --> L{any scored pages?}
    L -- no --> M[print NO_TEXT\nexit 0]
    L -- yes --> N[print JSON ranking\n+ top page text\nexit 0]

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/scripts/filename_fast_path.py:51-63
**Missing unit tests for changed matching logic**

The `find_matches` function has a material behaviour change — stems are no longer matched without the `.pdf` extension — but there are no unit tests covering the happy path, the extension-required path, or the `MIN_STEM_LEN` guard. The `test-coverage-new-code` rule requires tests for changed business logic. A query like `"Tell me about report_q3"` would have matched before and silently returns `NO_MATCH` now, with no automated safety net to catch regressions or future accidental reversions.

_{Reviews (6): Last reviewed commit: "lint" | Re-trigger Greptile}

edknv · 2026-05-29T00:01:22Z

/nvskills-ci

edknv · 2026-05-29T00:56:00Z

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

copy-pr-bot · 2026-05-29T01:19:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

edknv and others added 9 commits May 27, 2026 13:54

(skill-eval) add filename match fast path

6f2aa10

add lancedb search script

ab1489e

drop the text field from summary echo

700d52d

Merge branch 'main' into edwardk/skill-single-source

391f77b

move regex search section

4082e04

Merge branch 'main' into edwardk/skill-single-source

31caf8e

greptile fixes

29c6a58

Merge branch 'edwardk/skill-single-source' of github.com:edknv/nv-ing…

db4c702

…est into edwardk/skill-single-source

Wrapping the per-file call in a try/except

4c9deb9

edknv requested review from a team as code owners May 28, 2026 23:50

edknv requested a review from jdye64 May 28, 2026 23:50

edknv mentioned this pull request May 28, 2026

(skill-eval) add filename match fast path #2140

Closed

4 tasks

edknv requested review from jperez999 and sosahi May 28, 2026 23:51

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py

Comment thread skills/nemo-retriever/scripts/grep_corpus.py

edknv added 2 commits May 28, 2026 16:58

catch json errors

6349418

send subprocess stdout to devnull

5df2190

Merge branch 'main' into edwardk/skill-single-source

f67a60d

edknv changed the title ~~(skill-eval) add filename match fast path~~ (skill) update filename match fast path May 29, 2026

remove json reference

b7fdbb4

Attach NVSkills validation signatures

63ffc2f

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

lint

069c79a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(skill) update filename match fast path#2162

(skill) update filename match fast path#2162
edknv wants to merge 15 commits into
mainfrom
edwardk/skill-single-source

edknv commented May 28, 2026

Uh oh!

edknv commented May 28, 2026

Uh oh!

greptile-apps Bot commented May 28, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edknv commented May 29, 2026

Uh oh!

edknv commented May 29, 2026

Uh oh!

copy-pr-bot Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

edknv commented May 28, 2026

Description

Checklist

Uh oh!

edknv commented May 28, 2026

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edknv commented May 29, 2026

Uh oh!

edknv commented May 29, 2026

Uh oh!

copy-pr-bot Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 28, 2026 •

edited

Loading