feat: deduplication layer for scrape_batch + field filters by SuarezPM · Pull Request #140 · brightdata/brightdata-mcp

SuarezPM · 2026-05-24T20:35:57Z

Summary

Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.

Solution

scrape_batch new parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `deduplicate` | `true` | Remove duplicate content blocks via SHA-256 fingerprinting |
| `include_metrics` | `false` | Opt-in for `{results: [...], metrics: {...}}` response |
| `fields` | `undefined` | Filter response to specific top-level fields |
| `format` | `markdown` | Output format: `markdown` or `raw` |
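
To illustrate what `deduplicate` and `include_metrics` imply, here is a minimal sketch of cross-page block deduplication. It is not the PR's implementation: the function name `deduplicateBatch`, the definition of a "content block" as text separated by blank lines, the plain SHA-256 block hash, and the metric names `total_blocks` and `removed_blocks` are all assumptions made for illustration.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: hash one block with plain SHA-256.
const hashBlock = (block) =>
  createHash('sha256').update(block, 'utf8').digest('hex');

// Assumption: a "content block" is a run of text separated by blank lines.
// The first occurrence of a block is kept; later copies on any page are dropped.
function deduplicateBatch(pages) {
  const seen = new Set();
  let totalBlocks = 0;
  let removedBlocks = 0;

  const results = pages.map(({ url, content }) => {
    const kept = [];
    for (const block of String(content ?? '').split(/\n\s*\n/)) {
      const trimmed = block.trim();
      if (trimmed === '') continue;
      totalBlocks += 1;
      const key = hashBlock(trimmed);
      if (seen.has(key)) {
        removedBlocks += 1;
        continue;
      }
      seen.add(key);
      kept.push(trimmed);
    }
    return { url, content: kept.join('\n\n') };
  });

  // Metric names are hypothetical; the PR does not list them.
  return {
    results,
    metrics: { total_blocks: totalBlocks, removed_blocks: removedBlocks },
  };
}
```

In this sketch, two pages that share a navigation block and a footer keep those blocks only on the first page, and the second page retains just its unique body.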

search_engine_batch

| Parameter | Description |
| --- | --- |
| `fields` | Filter `result.organic` array to requested keys (`link`, `title`, `description`, `relevance_score`, `cursor`) |
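
The field filter can be sketched as follows. The commit message names `filterFields()` and says it blocks `__proto__`, `constructor`, and `prototype`; everything else here (the signature, the `filterOrganic` wrapper, and the choice to return the input unchanged when no `fields` are given) is an assumption, not the PR's code.

```javascript
// Keys the commit message says are blocked to avoid prototype pollution.
const BLOCKED_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

// Hypothetical signature: copy only the requested own properties of `item`.
function filterFields(item, fields) {
  if (item === null || typeof item !== 'object' || !Array.isArray(fields)) {
    return item;
  }
  const out = {};
  for (const key of fields) {
    if (typeof key !== 'string' || BLOCKED_KEYS.has(key)) continue;
    // Own-property check so inherited members are never copied.
    if (Object.prototype.hasOwnProperty.call(item, key)) {
      out[key] = item[key];
    }
  }
  return out;
}

// Hypothetical wrapper: apply the filter to each entry of `result.organic`.
function filterOrganic(result, fields) {
  if (!Array.isArray(fields) || !Array.isArray(result?.organic)) return result;
  return {
    ...result,
    organic: result.organic.map((entry) => filterFields(entry, fields)),
  };
}
```

Calling `filterOrganic(result, ['link', 'title'])` would leave each organic entry with only those two keys, and a blocked key in `fields` is silently skipped.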

Hash Algorithm

| Content length | Hash computation |
| --- | --- |
| ≤ 2048 chars | Full content SHA-256 |
| > 2048 chars | `sha256(prefix[2048] + middle[256] + suffix[256])` |

Sampling the prefix, middle, and suffix captures shared headers and footers while still distinguishing pages that have the same structure but different body content.
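
A minimal sketch of the fingerprint described above, using Node's built-in `crypto` module. The function name `fingerprint`, the choice to center the 256-character middle sample on the midpoint of the text, and the treatment of `null` as an empty string are assumptions; the PR text only gives the sample sizes.

```javascript
const { createHash } = require('node:crypto');

const FULL_HASH_LIMIT = 2048; // at or below this length, hash everything
const PREFIX_LEN = 2048;
const SAMPLE_LEN = 256; // size of the middle and suffix samples

const sha256Hex = (text) =>
  createHash('sha256').update(text, 'utf8').digest('hex');

function fingerprint(content) {
  const text = String(content ?? '');
  if (text.length <= FULL_HASH_LIMIT) {
    return sha256Hex(text);
  }
  const prefix = text.slice(0, PREFIX_LEN);
  // Assumption: the middle sample is centered on the midpoint of the text.
  const midStart = Math.floor(text.length / 2) - SAMPLE_LEN / 2;
  const middle = text.slice(midStart, midStart + SAMPLE_LEN);
  const suffix = text.slice(-SAMPLE_LEN);
  return sha256Hex(prefix + middle + suffix);
}
```

Under this sampling, two long inputs that differ only outside the three sampled ranges would produce the same fingerprint, which is the trade-off for not hashing the full content.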

Test Suite

| File | Tests | Coverage |
| --- | --- | --- |
| test_context_cache.js | 9 | Core dedup logic, hash correctness |
| test_dedup_edge_cases.js | 8 | Edge cases: empty, boundary, null handling |
| test_filter_fields.js | 20 | Field filtering edge cases |
| TOTAL | 37 | All passing |

Backward Compatibility

By default (no new parameters, or `include_metrics: false`), the response is still a flat array, so there are no breaking changes for existing consumers.
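
The two response shapes can be sketched like this. The helper name `buildBatchResponse` and its signature are hypothetical; only the flat-array default and the `{results, metrics}` opt-in shape come from the description above.

```javascript
// Hypothetical helper: choose the response shape from the caller's options.
function buildBatchResponse(results, metrics, options = {}) {
  // Only an explicit `include_metrics: true` switches to the wrapped shape.
  return options.include_metrics === true ? { results, metrics } : results;
}
```

A consumer that passes no options keeps receiving an array, while one that opts in reads `response.results` and `response.metrics`.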

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

…arch_engine_batch

SECURITY FIXES:
- Add prototype pollution protection in filterFields()
- Block __proto__, constructor, prototype properties
- Sanitize error messages to prevent information disclosure
- No hardcoded API keys in any file

FUNCTIONALITY:
- Add deduplication layer to scrape_batch tool
- Add field filtering to search_engine_batch tool
- Remove duplicate content blocks across URLs
- Include metrics option for dedup stats

TEST FILES:
- test_context_cache.js: 9 tests
- test_dedup_edge_cases.js: 8 tests
- test_filter_fields.js: 20 tests

Total: 37 tests passing

SuarezPM wants to merge 1 commit into brightdata:main from SuarezPM:feat/dedup-layer
