Add page-precise URL exclusion to Tavily search tools by jubick1337 · Pull Request #1496 · NVIDIA-NeMo/Skills

jubick1337 · 2026-06-25T16:42:57Z

Lets a blocklist target a single page/paper (e.g. one arxiv id) without banning the whole domain. url_substring patterns load from the same exclude config as domains and apply to both search results and page access.

Summary by CodeRabbit

Bug Fixes
- Improved URL blocking by supporting both excluded domains and configurable URL substrings.
- Made URL matching more consistent by normalizing case and slash formatting.
- Applied the same blocking behavior across search results and page extraction/opening actions, with reduced risk of over-matching similar-looking sites.
Tests
- Added new test coverage for URL substring normalization, matching precision, and blocking enforcement.
Documentation / Chores
- Updated matching documentation and added the HLE exclude/blocklist config (plus updated ignore rules).
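
To make the matching behavior in the summary concrete, here is a minimal sketch of domain matching combined with case- and slash-normalized substring matching. The helper names (`normalize_url`, `domain_matches`, `is_url_blocked`) and the exact normalization rules (drop the scheme, lowercase, strip trailing slashes) are assumptions for illustration, not the PR's actual code, and the URLs are made up.

```python
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Lowercase, drop the scheme, and strip trailing slashes (assumed rules)."""
    url = url.strip().lower()
    if "://" in url:
        url = url.split("://", 1)[1]
    return url.rstrip("/")


def domain_matches(host: str, domain: str) -> bool:
    """Match the domain itself or a subdomain, but not look-alike hosts."""
    return host == domain or host.endswith("." + domain)


def is_url_blocked(url: str, domains: list[str], substrings: list[str]) -> bool:
    """Hypothetical combined check: whole-domain bans first, then page-level bans."""
    normalized = normalize_url(url)
    # Prefixing "//" lets urlparse treat the leading component as the host.
    host = urlparse("//" + normalized).hostname or ""
    if any(domain_matches(host, d.lower()) for d in domains):
        return True
    return any(normalize_url(s) in normalized for s in substrings)


# One page is banned while the rest of the (made-up) site stays reachable.
page_blocked = is_url_blocked(
    "https://example.org/paper/123/", [], ["example.org/paper/123"]
)
sibling_blocked = is_url_blocked(
    "https://example.org/paper/456", [], ["example.org/paper/123"]
)
```

Comparing the host against the domain with an exact or dot-suffix match is one way to get the "reduced risk of over-matching similar-looking sites" mentioned above: `notexample.com` does not match a ban on `example.com`.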

Lets a blocklist target a single page/paper (e.g. one arxiv id) without banning the whole domain. url_substring patterns load from the same exclude config as domains and apply to both search results and page access. Signed-off-by: mnovikov <mnovikov@nvidia.com>

coderabbitai · 2026-06-25T16:52:04Z

📝 Walkthrough

Tavily exclude-config handling now supports url_substring entries, normalizes URLs and patterns, and blocks matching URLs across Gym search, page lookup, scrolling, browser search memory, and page extraction. Tests cover normalization, config loading, and blocked URL filtering.

Changes

URL substring exclusion

Layer / File(s)	Summary
Parse substring rules `nemo_skills/mcp/servers/tavily_search_tool.py`, `tests/test_tavily_search_tool.py`, `nemo_skills/mcp/servers/exclude_domains_hle_opus.json`, `.gitignore`	`url_substring` parsing, normalization, config content, and matcher tests are added to the Tavily exclude-config path.
Load exclusion state `nemo_skills/mcp/servers/tavily_search_tool.py`, `tests/test_tavily_search_tool.py`	`_DirectTavilyBase` stores substring exclusions, loads them from config, and tests cover combined domain and substring blocking.
Gym filtering `nemo_skills/mcp/servers/tavily_search_tool.py`, `tests/test_tavily_search_tool.py`	`DirectTavilyGymTool` filters blocked search results and rejects blocked page and scroll requests.
Browser filtering `nemo_skills/mcp/servers/tavily_search_tool.py`, `tests/test_tavily_search_tool.py`	`DirectTavilyBrowserTool` skips blocked search-memory entries and refuses blocked page extraction, with tests for browser paths.

Sequence Diagram(s)

sequenceDiagram
  participant ConfigFile
  participant DirectTavilyBase
  participant DirectTavilyGymTool
  participant DirectTavilyBrowserTool

  ConfigFile->>DirectTavilyBase: load exclude config JSON
  DirectTavilyBase->>DirectTavilyBase: parse domain and url_substring blocks
  DirectTavilyBase->>DirectTavilyBase: store _exclude_domains and _exclude_url_substrings

  DirectTavilyGymTool->>DirectTavilyBase: _is_url_blocked(url)
  DirectTavilyBase-->>DirectTavilyGymTool: blocked or allowed
  DirectTavilyGymTool->>DirectTavilyGymTool: filter search results / reject page lookup / reject scroll

  DirectTavilyBrowserTool->>DirectTavilyBase: _is_url_blocked(url)
  DirectTavilyBase-->>DirectTavilyBrowserTool: blocked or allowed
  DirectTavilyBrowserTool->>DirectTavilyBrowserTool: skip search-memory entry / stop page extraction
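
The flow in the diagram can be sketched roughly as follows. Only `_is_url_blocked`, `_exclude_domains`, and `_exclude_url_substrings` are names taken from the walkthrough; the stand-in classes, the `filter_results` and `open_page` methods, the result shape (dicts with a `"url"` key), and the error payload are hypothetical.

```python
class ExclusionBase:
    """Stand-in for the shared base class; holds both exclusion lists."""

    def __init__(self, exclude_domains, exclude_url_substrings):
        self._exclude_domains = [d.lower() for d in exclude_domains]
        self._exclude_url_substrings = [
            s.lower().rstrip("/") for s in exclude_url_substrings
        ]

    def _is_url_blocked(self, url: str) -> bool:
        # Assumed normalization: lowercase, drop scheme, strip trailing slashes.
        normalized = url.lower().split("://", 1)[-1].rstrip("/")
        host = normalized.split("/", 1)[0]
        if any(host == d or host.endswith("." + d) for d in self._exclude_domains):
            return True
        return any(s in normalized for s in self._exclude_url_substrings)


class GymLikeTool(ExclusionBase):
    """Hypothetical tool that enforces the same check on both paths."""

    def filter_results(self, results: list[dict]) -> list[dict]:
        # Search path: drop blocked hits before they reach the caller.
        return [r for r in results if not self._is_url_blocked(r["url"])]

    def open_page(self, url: str) -> dict:
        # Page-access path: refuse blocked URLs instead of fetching them.
        if self._is_url_blocked(url):
            return {"error": "URL is excluded by the blocklist"}
        return {"url": url, "content": "..."}  # placeholder; a real tool would fetch


tool = GymLikeTool(["blocked.example"], ["example.org/paper/123"])
kept = tool.filter_results(
    [
        {"url": "https://blocked.example/a"},
        {"url": "https://example.org/paper/123"},
        {"url": "https://example.org/paper/456"},
    ]
)
```

Routing every path through one check means a page that is filtered out of search results also cannot be opened directly.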

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding page-precise URL exclusion support to Tavily search tools.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

nemo_skills/mcp/servers/tavily_search_tool.py (1)
202-219: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider a single parse/load pass for domains and substrings.

_parse_exclude_url_substrings/_load_exclude_url_substrings_from_file duplicate the domain variants, and configure reads the same exclude_domains_config file twice (once for each type). A single loader that walks notices/properties once and returns both lists keeps the code DRY and avoids the redundant parse.
♻️ Sketch
from typing import Any

def _parse_exclude_config(exclude_config: dict[str, Any]) -> tuple[list[str], list[str]]:
    # Walk notices/properties once and collect both rule types.
    domains, substrings = [], []
    for notice in exclude_config["notices"]:
        for prop in notice["properties"]:
            if prop["type"] == "domain":
                domains.append(prop["value"])
            elif prop["type"] == "url_substring":
                substrings.append(prop["value"])
    return domains, substrings
Then configure opens the file once and unpacks both lists.
As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible."
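
For context, a sketch of the config shape this parser appears to expect, plus a loader that reads the file once. The `notices`/`properties`/`type`/`value` keys come from the sketch in this comment; the `load_exclude_config` name, the tolerant `.get` lookups, and the sample values are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def parse_exclude_config(exclude_config: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Collect domain and url_substring values in a single pass."""
    domains, substrings = [], []
    for notice in exclude_config.get("notices", []):
        for prop in notice.get("properties", []):
            if prop.get("type") == "domain":
                domains.append(prop["value"])
            elif prop.get("type") == "url_substring":
                substrings.append(prop["value"])
    return domains, substrings


def load_exclude_config(path: str) -> tuple[list[str], list[str]]:
    """Hypothetical loader: one file read, both lists returned."""
    with open(path, encoding="utf-8") as f:
        return parse_exclude_config(json.load(f))


# Made-up sample mirroring the assumed JSON layout.
sample = {
    "notices": [
        {
            "properties": [
                {"type": "domain", "value": "blocked.example"},
                {"type": "url_substring", "value": "example.org/paper/123"},
            ]
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "exclude.json"
    config_path.write_text(json.dumps(sample), encoding="utf-8")
    domains, substrings = load_exclude_config(str(config_path))
```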
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/mcp/servers/tavily_search_tool.py` around lines 202 - 219, The
exclude config loading is duplicated between the domain and URL-substring paths,
and configure currently parses the same file twice. Refactor the existing
_parse_exclude_url_substrings and the corresponding domain parser into a single
shared parser/loader that walks exclude_config["notices"] and their properties
once, collecting both "domain" and "url_substring" values. Then update configure
and the existing _load_exclude_*_from_file helpers to reuse that shared logic
and return both lists from one file read.
Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 76c30bf1-0fec-4d2c-aeeb-c04bde1ce226

📥 Commits

Reviewing files that changed from the base of the PR and between da85a88 and 3a4b7b8.

📒 Files selected for processing (2)

nemo_skills/mcp/servers/tavily_search_tool.py
tests/test_tavily_search_tool.py

Ships exclude_domains_hle_opus.json (9 whole-domain + 31 page-precise url_substring bans) next to the Tavily tools as a ready-made exclude_domains_config, cited to the Claude Opus 4.8 System Card and referenced from the docstring. Whitelisted in .gitignore since the repo blanket-ignores *.json. Signed-off-by: mnovikov <mnovikov@nvidia.com>

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c9be89a-6b96-4ca6-9dfd-d425e94e1fe3

📥 Commits

Reviewing files that changed from the base of the PR and between 3a4b7b8 and a18a1a4.

📒 Files selected for processing (3)

.gitignore
nemo_skills/mcp/servers/exclude_domains_hle_opus.json
nemo_skills/mcp/servers/tavily_search_tool.py

✅ Files skipped from review due to trivial changes (1)

.gitignore

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/mcp/servers/tavily_search_tool.py

coderabbitai · 2026-06-26T23:10:00Z

+        {"type": "url_substring", "value": "github.com/hanjanghoon/DEER"},
+        {"type": "url_substring", "value": "github.com/repos/hanjanghoon/DEER"},
+        {"type": "url_substring", "value": "mveteanu/HLE_PDF"},
+        {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"},


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix the Medium slug typo so the intended page is actually excluded.

Line 42 has the slug suffix 1bf9f0fobafc84, but indexed source excerpts list this blocklist URL as medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84. Because substring matching only lowercases and normalizes slashes, it cannot absorb a typo inside the slug, so the current pattern will never match the intended URL. (studylib.net)

Proposed fix

- {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"},
+ {"type": "url_substring", "value": "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84"},
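
A quick check illustrates why the typo matters. The `normalize` helper is an assumed approximation of the tool's normalization (lowercase, drop scheme, strip trailing slashes), and the `https://` prefix on the target is added for illustration; the two pattern strings are the ones from the diff.

```python
def normalize(value: str) -> str:
    # Assumed normalization; the real tool may differ in detail.
    return value.lower().split("://", 1)[-1].rstrip("/")


target = "https://medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84"
typo_pattern = "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0fobafc84"
fixed_pattern = "medium.com/@82deutschmark/o3-quiet-breakthrough-1bf9f0bafc84"

# Plain substring containment: the extra "fo" in the slug breaks the match.
typo_matches = normalize(typo_pattern) in normalize(target)
fixed_matches = normalize(fixed_pattern) in normalize(target)
```

Under these assumptions `typo_matches` is `False` and `fixed_matches` is `True`, so the page would stay reachable until the slug is corrected.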

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/mcp/servers/exclude_domains_hle_opus.json` at line 42, The Medium blocklist entry in exclude_domains_hle_opus.json has a slug typo, so the intended URL will not be matched by the url_substring filter. Update the value in the relevant {"type": "url_substring"} entry to use the correct Medium slug, and verify the exclusion still targets the intended page based on the existing substring-matching behavior.

jiacheng-xu · 2026-06-27T03:56:46Z

lgtm
please fix the lint format checks.

jubick1337 marked this pull request as ready for review June 25, 2026 16:47

jubick1337 requested a review from jiacheng-xu June 25, 2026 16:48

coderabbitai Bot reviewed Jun 25, 2026

jubick1337 added 2 commits June 25, 2026 22:24

Merge branch 'main' into mnovikov/tavily_precise_exclude

6741d63

coderabbitai Bot reviewed Jun 26, 2026

jubick1337 wants to merge 3 commits into main from mnovikov/tavily_precise_exclude
