
feat: deduplication layer for scrape_batch + field filters #140

Open
SuarezPM wants to merge 1 commit into brightdata:main from SuarezPM:feat/dedup-layer


SuarezPM commented May 24, 2026

Summary

Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.

Solution

scrape_batch new parameters

| Parameter | Default | Description |
| --- | --- | --- |
| deduplicate | true | Remove duplicate content blocks via SHA-256 fingerprinting |
| include_metrics | false | Opt-in for {results: [...], metrics: {...}} response |
| fields | undefined | Filter response to specific top-level fields |
| format | markdown | Output format: markdown (default) or raw |
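
A minimal sketch of how cross-URL block deduplication could work, for illustration only. It is not the PR's implementation: the function names `hashBlock` and `dedupeBatch`, the `url` and `content` fields on each page, the blank-line block splitting, and the metric names are all assumptions. For brevity it hashes each block in full with SHA-256 rather than using the sampled fingerprint described under Hash Algorithm.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: full-content SHA-256 of one block, as a hex string.
function hashBlock(block) {
  return createHash("sha256").update(block, "utf8").digest("hex");
}

// Hypothetical helper: drops any block already seen earlier in the batch.
// Assumption: blocks are separated by one or more blank lines.
function dedupeBatch(pages) {
  const seen = new Set();
  let totalBlocks = 0;
  let removedBlocks = 0;

  const results = pages.map((page) => {
    const blocks = String(page.content ?? "")
      .split(/\n{2,}/)
      .filter((block) => block.trim() !== "");

    const kept = [];
    for (const block of blocks) {
      totalBlocks += 1;
      const key = hashBlock(block.trim());
      if (seen.has(key)) {
        removedBlocks += 1; // duplicate of a block from this or an earlier page
        continue;
      }
      seen.add(key);
      kept.push(block);
    }
    return { ...page, content: kept.join("\n\n") };
  });

  // Metric names are illustrative, not taken from the PR.
  return {
    results,
    metrics: { total_blocks: totalBlocks, removed_blocks: removedBlocks },
  };
}
```

In this sketch the first page to contain a shared block keeps it, and later pages lose it, so page order determines where shared navigation or footer text survives.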

search_engine_batch

| Parameter | Description |
| --- | --- |
| fields | Filter result.organic array to requested keys (link, title, description, relevance_score, cursor) |
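
The field filter can be sketched roughly as follows. `filterFields` is named in the commit message, but its signature and behavior here are assumptions, as is the shape of the sample result object. The sketch copies only requested own properties and skips `__proto__`, `constructor`, and `prototype`, in line with the prototype pollution protection the commit message describes.

```javascript
// Keys that are never copied, whatever the caller requests.
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Hypothetical signature: returns a new object holding only the requested
// own properties of `source`. Non-object input is returned unchanged.
function filterFields(source, fields) {
  if (source === null || typeof source !== "object") return source;
  if (!Array.isArray(fields)) return source;

  const out = {};
  for (const key of fields) {
    if (typeof key !== "string" || BLOCKED_KEYS.has(key)) continue;
    // Own properties only, so inherited values are never exposed.
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      out[key] = source[key];
    }
  }
  return out;
}

// Usage: keep only link and title on each organic entry (sample data).
const result = {
  organic: [{ link: "https://example.com", title: "Example", description: "..." }],
};
const filteredOrganic = result.organic.map((entry) =>
  filterFields(entry, ["link", "title"])
);
```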

Hash Algorithm

| Content length | Hash computation |
| --- | --- |
| ≤ 2048 chars | Full content SHA-256 |
| > 2048 chars | sha256(prefix[2048] + middle[256] + suffix[256]) |

Sampling the prefix, middle, and suffix captures shared headers and footers while still distinguishing pages that have the same structure but different body content.
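
The table above could translate into something like the following sketch, using Node's built-in `crypto` module. The function name `fingerprint` is hypothetical, and the position of the middle sample (centered on the midpoint of the string) is an assumption, since the PR does not state where the 256-character middle window starts.

```javascript
const { createHash } = require("node:crypto");

// Window sizes taken from the Hash Algorithm table.
const PREFIX_LEN = 2048;
const SAMPLE_LEN = 256;

// Hypothetical helper: hex SHA-256 fingerprint of a content block.
function fingerprint(content) {
  const text = String(content ?? "");

  // Short content: hash everything.
  if (text.length <= PREFIX_LEN) {
    return createHash("sha256").update(text, "utf8").digest("hex");
  }

  // Assumption: the middle sample is centered on the string's midpoint.
  const midStart = Math.max(0, Math.floor(text.length / 2) - SAMPLE_LEN / 2);
  const sample =
    text.slice(0, PREFIX_LEN) +
    text.slice(midStart, midStart + SAMPLE_LEN) +
    text.slice(-SAMPLE_LEN);

  return createHash("sha256").update(sample, "utf8").digest("hex");
}
```

Under this sketch, two long blocks that differ only outside the three sampled windows would receive the same fingerprint, which is the trade-off for hashing at most 2560 characters per block.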

Test Suite

| File | Tests | Coverage |
| --- | --- | --- |
| test_context_cache.js | 9 | Core dedup logic, hash correctness |
| test_dedup_edge_cases.js | 8 | Edge cases: empty, boundary, null handling |
| test_filter_fields.js | 20 | Field filtering edge cases |
| TOTAL | 37 | All passing |

Backward Compatibility

With the default behavior (no parameters, or include_metrics: false), scrape_batch returns a flat array, so the response shape is unchanged for existing consumers.
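
One way to picture the two response shapes is the sketch below. The helper name `buildResponse` and its arguments are hypothetical; only the `include_metrics` flag and the `{results, metrics}` envelope come from the description above.

```javascript
// Hypothetical helper: chooses the response shape from include_metrics.
function buildResponse(results, metrics, options = {}) {
  if (options.include_metrics === true) {
    // Opt-in envelope carrying the dedup statistics.
    return { results, metrics };
  }
  // Default: the same flat array that existing consumers already expect.
  return results;
}
```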

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

…arch_engine_batch

SECURITY FIXES:
- Add prototype pollution protection in filterFields()
- Block __proto__, constructor, prototype properties
- Sanitize error messages to prevent information disclosure
- No hardcoded API keys in any file

FUNCTIONALITY:
- Add deduplication layer to scrape_batch tool
- Add field filtering to search_engine_batch tool
- Remove duplicate content blocks across URLs
- Include metrics option for dedup stats

TEST FILES:
- test_context_cache.js: 9 tests
- test_dedup_edge_cases.js: 8 tests
- test_filter_fields.js: 20 tests

Total: 37 tests passing