feat: deduplication layer for scrape_batch + field filters #140
Open
SuarezPM wants to merge 1 commit into
…arch_engine_batch

SECURITY FIXES:
- Add prototype pollution protection in filterFields()
- Block __proto__, constructor, prototype properties
- Sanitize error messages to prevent information disclosure
- No hardcoded API keys in any file

FUNCTIONALITY:
- Add deduplication layer to scrape_batch tool
- Add field filtering to search_engine_batch tool
- Remove duplicate content blocks across URLs
- Include metrics option for dedup stats

TEST FILES:
- test_context_cache.js: 9 tests
- test_dedup_edge_cases.js: 8 tests
- test_filter_fields.js: 20 tests

Total: 37 tests passing
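The prototype pollution protection named in the commit message can be illustrated with a minimal sketch. This is not the PR's code: the body of `filterFields()` is not shown on this page, so the signature, the `BLOCKED_KEYS` set, and the sample data below are assumptions made for illustration.

```javascript
// Hypothetical sketch of a field filter that refuses dangerous keys.
// The PR's real filterFields() may be structured differently.
const BLOCKED_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

function filterFields(item, fields) {
  // A null-prototype object has no inherited keys to leak or overwrite.
  const out = Object.create(null);
  for (const key of fields) {
    // Skip non-string keys and the three blocked property names.
    if (typeof key !== 'string' || BLOCKED_KEYS.has(key)) continue;
    // Copy only own properties, never anything inherited from a prototype.
    if (Object.prototype.hasOwnProperty.call(item, key)) {
      out[key] = item[key];
    }
  }
  return out;
}

// Sample data shaped like an entry of result.organic (values are made up).
const organic = [
  { link: 'https://example.com', title: 'Example', description: 'Demo', cursor: 1 },
];

const filtered = organic.map((r) => filterFields(r, ['link', 'title', '__proto__']));
console.log(filtered[0].link, filtered[0].title);
```

Building the result on `Object.create(null)` and copying only own properties means a requested key such as `__proto__` is dropped instead of being written onto the output object.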
Summary
Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.
Problem
When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.
Solution
scrape_batch new parameters
- deduplicate (default: true)
- include_metrics (default: false): returns a {results: [...], metrics: {...}} response
- fields
- format (default: markdown): markdown (default) or raw
search_engine_batch
- fields: filters the result.organic array to requested keys (link, title, description, relevance_score, cursor)
Hash Algorithm
This captures shared headers/footers while correctly distinguishing pages with same structure but different body content.
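The PR's actual hash algorithm is not reproduced on this page. As a rough sketch of the general idea, the example below hashes each content block and drops blocks whose hash has already been seen on an earlier page. The block boundary (blank lines), the use of SHA-256, and the names `hashBlock`, `dedupePages`, `totalBlocks`, and `removedBlocks` are all assumptions made for illustration.

```javascript
const { createHash } = require('node:crypto');

// Assumption: a "block" is a run of text separated by blank lines.
function hashBlock(block) {
  // Collapse whitespace so trivially reformatted copies hash identically.
  const normalized = block.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(normalized).digest('hex');
}

// Hypothetical helper: keeps the first occurrence of each block across pages.
function dedupePages(pages) {
  const seen = new Set();
  let totalBlocks = 0;
  let removedBlocks = 0;

  const results = pages.map(({ url, content }) => {
    const kept = [];
    for (const block of content.split(/\n{2,}/)) {
      if (block.trim() === '') continue;
      totalBlocks += 1;
      const digest = hashBlock(block);
      if (seen.has(digest)) {
        removedBlocks += 1; // already returned for an earlier URL
        continue;
      }
      seen.add(digest);
      kept.push(block.trim());
    }
    return { url, content: kept.join('\n\n') };
  });

  return { results, metrics: { totalBlocks, removedBlocks } };
}

// Two pages with the same nav and footer but different bodies.
const pages = [
  { url: 'https://example.com/a', content: 'Site Nav\n\nArticle A body\n\nFooter' },
  { url: 'https://example.com/b', content: 'Site Nav\n\nArticle B body\n\nFooter' },
];

const { results, metrics } = dedupePages(pages);
console.log(results[1].content); // Article B body
console.log(metrics); // { totalBlocks: 6, removedBlocks: 2 }
```

Because the hash is computed over block content, not page structure, the second page loses only its repeated nav and footer and keeps its distinct body.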
Test Suite
Backward Compatibility
Default behavior (no params or include_metrics: false) returns a flat array, so there are no breaking changes for existing consumers.
Prior Art
Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).