Skip to content

(skill-eval) add filename match fast path#2140

Closed
edknv wants to merge 9 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-single-source
Closed

(skill-eval) add filename match fast path#2140
edknv wants to merge 9 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-single-source

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented May 27, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv marked this pull request as ready for review May 28, 2026 21:37
@edknv edknv requested review from a team as code owners May 28, 2026 21:37
@edknv edknv requested a review from jperez999 May 28, 2026 21:37
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds a filename-based fast path to the nemo-retriever skill. When the user's query literally contains a PDF basename (including .pdf extension, stem ≥6 chars), the skill bypasses semantic search, runs pdfium extraction directly on the named file, ranks pages by query-token overlap, and returns the top-10 ranking plus the top page's raw text — all in two tool calls instead of the full retrieval pipeline.

  • filename_fast_path.py: Implements match detection, per-file pdfium extraction via the retriever CLI subprocess, token-frequency page ranking, and a three-outcome stdout protocol (NO_MATCH / NO_TEXT / JSON+text). Previous review concerns (sidecar naming, file-handle leak, FileNotFoundError, per-file failure isolation, word-boundary false positives) have been addressed in a follow-up commit.
  • grep_corpus.py: New helper that scans the already-built LanceDB table for regex matches without re-reading PDFs, outputting pdf:page:type:snippet lines with a configurable hit cap.
  • query.md: Updated skill reference documentation to describe both new workflows, their stdout protocols, and mutual-exclusivity rules relative to retriever query.

Confidence Score: 4/5

Safe to merge with one fix: the pdfium subprocess in extract_pages has no timeout, so a malformed PDF can stall the fast path indefinitely.

The fast-path logic and output protocol are well-structured, and prior review concerns have been addressed. The one remaining concern is that subprocess.run in extract_pages carries no timeout, meaning a hung pdfium process blocks the entire skill call with no recovery path — the exception handler only catches a non-zero exit, never a process that doesn't exit at all.

skills/nemo-retriever/scripts/filename_fast_path.py — specifically the extract_pages function's subprocess.run call.

Important Files Changed

Filename Overview
skills/nemo-retriever/scripts/filename_fast_path.py New fast-path script: matches PDF basenames in queries, runs pdfium extraction, ranks pages by token frequency. Missing subprocess timeout means malformed PDFs can stall the fast path indefinitely.
skills/nemo-retriever/scripts/grep_corpus.py New corpus grep script: scans LanceDB table for regex matches. Loads entire table into memory via tbl.to_pandas() — acceptable for small corpora but may exhaust RAM on large ones.
skills/nemo-retriever/references/query.md Updated skill reference documentation: adds filename fast-path and grep-corpus workflows with clear invocation instructions, stdout protocol, and mutual-exclusivity guidance vs semantic search.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User Query] --> B{filename_fast_path.py:\nfilename in query?}
    B -- NO_MATCH --> C[Standard Path:\nretriever query]
    B -- Match found --> D[extract_pages:\nsubprocess retriever pdfium]
    D -- CalledProcessError --> E[Log WARN, continue\nto next matched PDF]
    D -- No timeout guard --> F[⚠️ Hangs indefinitely\nif pdfium stalls]
    D -- Success --> G[rank_pages:\ntoken-frequency scoring]
    G -- No scored pages --> H[NO_TEXT → fall through\nto standard path]
    G -- Scored pages --> I[Emit JSON ranking +\nTOP_PAGE_TEXT]
    I --> J[LLM writes output.json\nSTOP — no retriever query]
    C --> K[tee /tmp/hits.json\nparse summary]
    K --> J

    L[grep_corpus.py] --> M[lancedb.connect\ntbl.to_pandas — full table in RAM]
    M --> N[Regex scan rows]
    N --> O[Print pdf:page:type:snippet\nor NO_MATCH]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/scripts/filename_fast_path.py:64-86
**Subprocess hangs forever on malformed PDFs**

`subprocess.run` is called without a `timeout` parameter. If pdfium enters an infinite loop on a corrupted or adversarially crafted PDF (e.g., deeply-nested form fields, circular object references, or broken cross-reference tables), the subprocess never exits and the fast-path call blocks indefinitely — no `CalledProcessError` is ever raised. From the AI agent's perspective the tool call simply never returns, eventually hitting whatever global session timeout the harness enforces rather than recovering gracefully. Add a `timeout=` value (e.g., `timeout=120`) and catch `subprocess.TimeoutExpired` alongside `CalledProcessError`.

Reviews (4): Last reviewed commit: "Wrapping the per-file call in a try/exce..." | Re-trigger Greptile

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated
Comment thread skills/nemo-retriever/scripts/grep_corpus.py
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
@sosahi
Copy link
Copy Markdown
Collaborator

sosahi commented May 28, 2026

/nvskills-ci

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated
@sosahi
Copy link
Copy Markdown
Collaborator

sosahi commented May 28, 2026

/nvskills-ci

1 similar comment
@sosahi
Copy link
Copy Markdown
Collaborator

sosahi commented May 28, 2026

/nvskills-ci

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py Outdated
@edknv
Copy link
Copy Markdown
Collaborator Author

edknv commented May 28, 2026

superseded by #2162

@edknv edknv closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants