Add page-precise URL exclusion to Tavily search tools#1496
Conversation
Lets a blocklist target a single page/paper (e.g. one arxiv id) without banning the whole domain. url_substring patterns load from the same exclude config as domains and apply to both search results and page access. Signed-off-by: mnovikov <mnovikov@nvidia.com>
📝 WalkthroughWalkthroughTavily exclude-config handling now supports ChangesURL substring exclusion
Sequence Diagram(s)sequenceDiagram
participant ConfigFile
participant DirectTavilyBase
participant DirectTavilyGymTool
participant DirectTavilyBrowserTool
ConfigFile->>DirectTavilyBase: load exclude config JSON
DirectTavilyBase->>DirectTavilyBase: parse domain and url_substring blocks
DirectTavilyBase->>DirectTavilyBase: store _exclude_domains and _exclude_url_substrings
DirectTavilyGymTool->>DirectTavilyBase: _is_url_blocked(url)
DirectTavilyBase-->>DirectTavilyGymTool: blocked or allowed
DirectTavilyGymTool->>DirectTavilyGymTool: filter search results / reject page lookup / reject scroll
DirectTavilyBrowserTool->>DirectTavilyBase: _is_url_blocked(url)
DirectTavilyBase-->>DirectTavilyBrowserTool: blocked or allowed
DirectTavilyBrowserTool->>DirectTavilyBrowserTool: skip search-memory entry / stop page extraction
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nemo_skills/mcp/servers/tavily_search_tool.py (1)
202-219: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider a single parse/load pass for domains and substrings.
_parse_exclude_url_substrings/_load_exclude_url_substrings_from_fileduplicate the domain variants, andconfigurereads the sameexclude_domains_configfile twice (once for each type). A single loader that walksnotices/propertiesonce and returns both lists keeps the code DRY and avoids the redundant parse.As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible."♻️ Sketch
def _parse_exclude_config(exclude_config: dict[str, Any]) -> tuple[list[str], list[str]]: domains, substrings = [], [] for notice in exclude_config["notices"]: for prop in notice["properties"]: if prop["type"] == "domain": domains.append(prop["value"]) elif prop["type"] == "url_substring": substrings.append(prop["value"]) return domains, substringsThen
configureopens the file once and unpacks both lists.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/mcp/servers/tavily_search_tool.py` around lines 202 - 219, The exclude config loading is duplicated between the domain and URL-substring paths, and configure currently parses the same file twice. Refactor the existing _parse_exclude_url_substrings and the corresponding domain parser into a single shared parser/loader that walks exclude_config["notices"] and their properties once, collecting both "domain" and "url_substring" values. Then update configure and the existing _load_exclude_*_from_file helpers to reuse that shared logic and return both lists from one file read.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@nemo_skills/mcp/servers/tavily_search_tool.py`:
- Around line 202-219: The exclude config loading is duplicated between the
domain and URL-substring paths, and configure currently parses the same file
twice. Refactor the existing _parse_exclude_url_substrings and the corresponding
domain parser into a single shared parser/loader that walks
exclude_config["notices"] and their properties once, collecting both "domain"
and "url_substring" values. Then update configure and the existing
_load_exclude_*_from_file helpers to reuse that shared logic and return both
lists from one file read.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76c30bf1-0fec-4d2c-aeeb-c04bde1ce226
📒 Files selected for processing (2)
nemo_skills/mcp/servers/tavily_search_tool.pytests/test_tavily_search_tool.py
Ships exclude_domains_hle_opus.json (9 whole-domain + 31 page-precise url_substring bans) next to the Tavily tools as a ready-made exclude_domains_config, cited to the Claude Opus 4.8 System Card and referenced from the docstring. Whitelisted in .gitignore since the repo blanket-ignores *.json. Signed-off-by: mnovikov <mnovikov@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/mcp/servers/exclude_domains_hle_opus.json`:
- Line 42: The Medium blocklist entry in exclude_domains_hle_opus.json has a
slug typo, so the intended URL will not be matched by the url_substring filter.
Update the value in the relevant {"type": "url_substring"} entry to use the
correct Medium slug, and verify the exclusion still targets the intended page
based on the existing substring-matching behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3c9be89a-6b96-4ca6-9dfd-d425e94e1fe3
📒 Files selected for processing (3)
.gitignorenemo_skills/mcp/servers/exclude_domains_hle_opus.jsonnemo_skills/mcp/servers/tavily_search_tool.py
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/mcp/servers/tavily_search_tool.py
| {"type": "url_substring", "value": "github.com/hanjanghoon/DEER"}, | ||
| {"type": "url_substring", "value": "github.com/repos/hanjanghoon/DEER"}, | ||
| {"type": "url_substring", "value": "mveteanu/HLE_PDF"}, | ||
| {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"}, |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix the Medium slug typo so the intended page is actually excluded.
Line 42 has 1bf9f0fobafc84, but indexed source excerpts list this blocklist URL as medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84; because substring matching only lowercases and removes /, the current typo will not match the intended URL. (studylib.net)
Proposed fix
- {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"},
+ {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84"},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"}, | |
| {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84"}, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/mcp/servers/exclude_domains_hle_opus.json` at line 42, The Medium
blocklist entry in exclude_domains_hle_opus.json has a slug typo, so the
intended URL will not be matched by the url_substring filter. Update the value
in the relevant {"type": "url_substring"} entry to use the correct Medium slug,
and verify the exclusion still targets the intended page based on the existing
substring-matching behavior.
|
lgtm |
Lets a blocklist target a single page/paper (e.g. one arxiv id) without banning the whole domain. url_substring patterns load from the same exclude config as domains and apply to both search results and page access.
Summary by CodeRabbit