(skill-eval) add filename match fast path#2140
Greptile Summary
This PR adds a filename-based fast path to the nemo-retriever skill. When the user's query literally contains a PDF basename (including

| Filename | Overview |
|---|---|
| skills/nemo-retriever/scripts/filename_fast_path.py | New fast-path script: matches PDF basenames in queries, runs pdfium extraction, ranks pages by token frequency. Missing subprocess timeout means malformed PDFs can stall the fast path indefinitely. |
| skills/nemo-retriever/scripts/grep_corpus.py | New corpus grep script: scans LanceDB table for regex matches. Loads entire table into memory via tbl.to_pandas() — acceptable for small corpora but may exhaust RAM on large ones. |
| skills/nemo-retriever/references/query.md | Updated skill reference documentation: adds filename fast-path and grep-corpus workflows with clear invocation instructions, stdout protocol, and mutual-exclusivity guidance vs semantic search. |
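
The memory concern noted for `grep_corpus.py` can be illustrated with a row-at-a-time scan. This is a minimal sketch, not the script's actual code: it assumes the table can be read as a lazy iterator of row dicts, and the field names `pdf`, `page`, `type`, and `text` as well as the helper `grep_rows` are hypothetical. Only the hits are kept in memory, and the output follows the `pdf:page:type:snippet` / `NO_MATCH` protocol described in this summary.

```python
import re


def grep_rows(rows, pattern, width=80):
    """Scan rows lazily and return 'pdf:page:type:snippet' lines, or ['NO_MATCH'].

    `rows` is any iterable of dicts (hypothetical schema: pdf, page, type, text).
    Rows are consumed one at a time, so the full table is never materialized.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for row in rows:
        match = regex.search(row["text"])
        if match is None:
            continue
        # Center the snippet roughly on the match and flatten newlines.
        start = max(0, match.start() - width // 2)
        snippet = row["text"][start:start + width].replace("\n", " ")
        hits.append(f'{row["pdf"]}:{row["page"]}:{row["type"]}:{snippet}')
    return hits or ["NO_MATCH"]
```

Whether the real table exposes a streaming reader is not stated in the review; if it does, passing that reader in place of `rows` would keep peak memory proportional to the number of matches rather than the corpus size.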
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User Query] --> B{filename_fast_path.py:\nfilename in query?}
B -- NO_MATCH --> C[Standard Path:\nretriever query]
B -- Match found --> D[extract_pages:\nsubprocess retriever pdfium]
D -- CalledProcessError --> E[Log WARN, continue\nto next matched PDF]
D -- No timeout guard --> F[⚠️ Hangs indefinitely\nif pdfium stalls]
D -- Success --> G[rank_pages:\ntoken-frequency scoring]
G -- No scored pages --> H[NO_TEXT → fall through\nto standard path]
G -- Scored pages --> I[Emit JSON ranking +\nTOP_PAGE_TEXT]
I --> J[LLM writes output.json\nSTOP — no retriever query]
C --> K[tee /tmp/hits.json\nparse summary]
K --> J
L[grep_corpus.py] --> M[lancedb.connect\ntbl.to_pandas — full table in RAM]
M --> N[Regex scan rows]
N --> O[Print pdf:page:type:snippet\nor NO_MATCH]
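
The fast-path branch of the flowchart (filename match, then token-frequency ranking) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `filename_fast_path.py`: the helpers `tokenize` and `match_pdfs` are hypothetical, matching is assumed to be a case-insensitive substring test on the basename with its extension, and the score is assumed to be the summed count of query tokens on each page. An empty ranking corresponds to the `NO_TEXT` fall-through.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_pdfs(query, pdf_paths):
    """Return the paths whose basename appears literally in the query."""
    lowered = query.lower()
    return [p for p in pdf_paths if Path(p).name.lower() in lowered]


def rank_pages(query, pages):
    """Rank pages by how often query tokens occur on them.

    `pages` maps page number to extracted text. Pages with a zero score
    are dropped; ties are broken by ascending page number.
    """
    query_tokens = set(tokenize(query))
    scored = []
    for page, text in pages.items():
        counts = Counter(tokenize(text))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((page, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

One caveat of this simple scoring is that tokens from the filename itself (for example `pdf`) also count toward a page's score; a real implementation might filter them out.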
Prompt To Fix All With AI
Fix the following code review issue, proposing a concise fix.
---
### Issue 1 of 1
skills/nemo-retriever/scripts/filename_fast_path.py:64-86
**Subprocess hangs forever on malformed PDFs**
`subprocess.run` is called without a `timeout` parameter. If pdfium enters an infinite loop on a corrupted or adversarially crafted PDF (e.g., deeply nested form fields, circular object references, or broken cross-reference tables), the subprocess never exits and the fast-path call blocks indefinitely; no `CalledProcessError` is ever raised. From the AI agent's perspective, the tool call never returns and eventually hits whatever global session timeout the harness enforces instead of recovering gracefully. Add a `timeout=` value (e.g., `timeout=120`) and catch `subprocess.TimeoutExpired` alongside `CalledProcessError`.
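
A minimal sketch of the suggested guard, assuming the extraction step is a plain `subprocess.run` call. The wrapper name `run_with_timeout` and the choice to return `None` on failure are hypothetical; the actual script may log and continue differently.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=120):
    """Run `cmd` and return its stdout, or None if it fails or times out."""
    try:
        result = subprocess.run(
            cmd,
            check=True,            # raise CalledProcessError on non-zero exit
            capture_output=True,
            text=True,
            timeout=timeout_s,     # child is killed and TimeoutExpired raised on expiry
        )
    except subprocess.TimeoutExpired:
        print(f"WARN: timed out after {timeout_s}s: {cmd}", file=sys.stderr)
        return None
    except subprocess.CalledProcessError as exc:
        print(f"WARN: exit code {exc.returncode}: {cmd}", file=sys.stderr)
        return None
    return result.stdout
```

Treating a timeout the same way as a non-zero exit lets the caller move on to the next matched PDF or fall through to the standard path, rather than waiting for the harness's session timeout.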
Reviews (4): Last reviewed commit: "Wrapping the per-file call in a try/exce..."
/nvskills-ci
…est into edwardk/skill-single-source
superseded by #2162