(skill-eval) add filename match fast path by edknv · Pull Request #2140 · NVIDIA/NeMo-Retriever

edknv · 2026-05-27T20:55:36Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-28T21:41:05Z

Greptile Summary

This PR adds a filename-based fast path to the nemo-retriever skill. When the user's query literally contains a PDF basename (including .pdf extension, stem ≥6 chars), the skill bypasses semantic search, runs pdfium extraction directly on the named file, ranks pages by query-token overlap, and returns the top-10 ranking plus the top page's raw text — all in two tool calls instead of the full retrieval pipeline.

filename_fast_path.py: Implements match detection, per-file pdfium extraction via the retriever CLI subprocess, token-frequency page ranking, and a three-outcome stdout protocol (NO_MATCH / NO_TEXT / JSON+text). Previous review concerns (sidecar naming, file-handle leak, FileNotFoundError, per-file failure isolation, word-boundary false positives) have been addressed in a follow-up commit.
grep_corpus.py: New helper that scans the already-built LanceDB table for regex matches without re-reading PDFs, outputting pdf:page:type:snippet lines with a configurable hit cap.
query.md: Updated skill reference documentation to describe both new workflows, their stdout protocols, and mutual-exclusivity rules relative to retriever query.

Confidence Score: 4/5

Safe to merge with one fix: the pdfium subprocess in extract_pages has no timeout, so a malformed PDF can stall the fast path indefinitely.

The fast-path logic and output protocol are well-structured, and prior review concerns have been addressed. The one remaining concern is that subprocess.run in extract_pages carries no timeout, meaning a hung pdfium process blocks the entire skill call with no recovery path — the exception handler only catches a non-zero exit, never a process that doesn't exit at all.

skills/nemo-retriever/scripts/filename_fast_path.py — specifically the extract_pages function's subprocess.run call.

Important Files Changed

Filename	Overview
skills/nemo-retriever/scripts/filename_fast_path.py	New fast-path script: matches PDF basenames in queries, runs pdfium extraction, ranks pages by token frequency. Missing subprocess timeout means malformed PDFs can stall the fast path indefinitely.
skills/nemo-retriever/scripts/grep_corpus.py	New corpus grep script: scans LanceDB table for regex matches. Loads entire table into memory via tbl.to_pandas() — acceptable for small corpora but may exhaust RAM on large ones.
skills/nemo-retriever/references/query.md	Updated skill reference documentation: adds filename fast-path and grep-corpus workflows with clear invocation instructions, stdout protocol, and mutual-exclusivity guidance vs semantic search.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User Query] --> B{filename_fast_path.py:\nfilename in query?}
    B -- NO_MATCH --> C[Standard Path:\nretriever query]
    B -- Match found --> D[extract_pages:\nsubprocess retriever pdfium]
    D -- CalledProcessError --> E[Log WARN, continue\nto next matched PDF]
    D -- No timeout guard --> F[⚠️ Hangs indefinitely\nif pdfium stalls]
    D -- Success --> G[rank_pages:\ntoken-frequency scoring]
    G -- No scored pages --> H[NO_TEXT → fall through\nto standard path]
    G -- Scored pages --> I[Emit JSON ranking +\nTOP_PAGE_TEXT]
    I --> J[LLM writes output.json\nSTOP — no retriever query]
    C --> K[tee /tmp/hits.json\nparse summary]
    K --> J

    L[grep_corpus.py] --> M[lancedb.connect\ntbl.to_pandas — full table in RAM]
    M --> N[Regex scan rows]
    N --> O[Print pdf:page:type:snippet\nor NO_MATCH]

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/scripts/filename_fast_path.py:64-86
**Subprocess hangs forever on malformed PDFs**

`subprocess.run` is called without a `timeout` parameter. If pdfium enters an infinite loop on a corrupted or adversarially crafted PDF (e.g., deeply-nested form fields, circular object references, or broken cross-reference tables), the subprocess never exits and the fast-path call blocks indefinitely — no `CalledProcessError` is ever raised. From the AI agent's perspective the tool call simply never returns, eventually hitting whatever global session timeout the harness enforces rather than recovering gracefully. Add a `timeout=` value (e.g., `timeout=120`) and catch `subprocess.TimeoutExpired` alongside `CalledProcessError`.

_{Reviews (4): Last reviewed commit: "Wrapping the per-file call in a try/exce..." | Re-trigger Greptile}

sosahi · 2026-05-28T22:01:25Z

/nvskills-ci

sosahi · 2026-05-28T22:26:26Z

/nvskills-ci

sosahi · 2026-05-28T22:43:19Z

/nvskills-ci

…est into edwardk/skill-single-source

edknv · 2026-05-28T23:51:15Z

superseded by #2162

edknv added 5 commits May 27, 2026 13:54

(skill-eval) add filename match fast path

6f2aa10

add lancedb search script

ab1489e

drop the text field from summary echo

700d52d

Merge branch 'main' into edwardk/skill-single-source

391f77b

move regex search section

4082e04

edknv marked this pull request as ready for review May 28, 2026 21:37

edknv requested review from a team as code owners May 28, 2026 21:37

edknv requested a review from jperez999 May 28, 2026 21:37

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Merge branch 'main' into edwardk/skill-single-source

31caf8e

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated

edknv added 2 commits May 28, 2026 16:05

greptile fixes

29c6a58

Merge branch 'edwardk/skill-single-source' of github.com:edknv/nv-ing…

db4c702

…est into edwardk/skill-single-source

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated

Wrapping the per-file call in a try/except

4c9deb9

edknv closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(skill-eval) add filename match fast path#2140

(skill-eval) add filename match fast path#2140
edknv wants to merge 9 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-single-source

edknv commented May 27, 2026

Uh oh!

greptile-apps Bot commented May 28, 2026 •

edited

Loading

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sosahi commented May 28, 2026

Uh oh!

Uh oh!

sosahi commented May 28, 2026

Uh oh!

sosahi commented May 28, 2026

Uh oh!

Uh oh!

edknv commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

edknv commented May 27, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sosahi commented May 28, 2026

Uh oh!

Uh oh!

sosahi commented May 28, 2026

Uh oh!

sosahi commented May 28, 2026

Uh oh!

Uh oh!

edknv commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 28, 2026 •

edited

Loading