2 changes: 2 additions & 0 deletions relay-filter/src/web_crawlers.rs
@@ -46,6 +46,7 @@ static WEB_CRAWLERS: LazyLock<Regex> = LazyLock::new(|| {
PerplexityBot| # Perplexity - see https://docs.perplexity.ai/guides/bots
Applebot| # Apple - see https://support.apple.com/en-us/119829
DuckDuckBot # DuckDuckGo - see https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot
Lightpanda # Lightpanda - see https://lightpanda.io/

Missing regex alternation pipe

High Severity

The WEB_CRAWLERS pattern uses (?x), so DuckDuckBot and the new Lightpanda token are concatenated into one literal DuckDuckBotLightpanda because there is no | between them. Real DuckDuckBot and Lightpanda user agents no longer match their intended alternatives.
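
The concatenation described above can be illustrated with a minimal sketch. The helper `collapse_verbose` is hypothetical and is only a simplified model of verbose mode (it drops `#` comments and whitespace, and ignores escapes and character classes); it is not the `regex` crate's parser. The pattern excerpts are shortened versions of the lines in the diff.

```rust
/// Simplified model of verbose (`x`) mode: drop everything from `#` to the
/// end of each line, then drop all whitespace. Escaped `#` or whitespace and
/// character classes are deliberately not handled.
fn collapse_verbose(pattern: &str) -> String {
    pattern
        .lines()
        // Keep only the part of the line before a `#` comment.
        .map(|line| line.split('#').next().unwrap_or(""))
        .flat_map(|code| code.chars())
        .filter(|c| !c.is_whitespace())
        .collect()
}

fn main() {
    // Excerpt as it appears in the diff: no `|` after DuckDuckBot.
    let broken = "
        Applebot|     # Apple
        DuckDuckBot   # DuckDuckGo
        Lightpanda    # Lightpanda
    ";
    // Same excerpt with the separator restored.
    let fixed = "
        Applebot|     # Apple
        DuckDuckBot|  # DuckDuckGo
        Lightpanda    # Lightpanda
    ";

    println!("{}", collapse_verbose(broken)); // Applebot|DuckDuckBotLightpanda
    println!("{}", collapse_verbose(fixed)); // Applebot|DuckDuckBot|Lightpanda
}
```

In this model the broken excerpt ends in the single alternative `DuckDuckBotLightpanda`, while the fixed excerpt keeps `DuckDuckBot` and `Lightpanda` as separate alternatives.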

Reviewed by Cursor Bugbot for commit 2e32395.

Comment on lines 48 to +49

Bug: The missing pipe | separator after DuckDuckBot causes the regex to incorrectly look for DuckDuckBotLightpanda instead of DuckDuckBot or Lightpanda, breaking filtering for both.
Severity: MEDIUM

Suggested Fix

Add a pipe separator (|) to the end of the DuckDuckBot line in the regex pattern. This will correctly separate the two patterns, ensuring the regex engine treats them as alternatives (DuckDuckBot OR Lightpanda).

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: relay-filter/src/web_crawlers.rs#L48-L49

Potential issue: Due to a missing pipe (`|`) separator on the `DuckDuckBot` line, the
verbose regex (`(?ix)`) will concatenate `DuckDuckBot` and the newly added `Lightpanda`.
This changes the pattern to match the literal string `DuckDuckBotLightpanda` instead of
matching either `DuckDuckBot` or `Lightpanda`. As a result, user agents for both
DuckDuckGo's and Lightpanda's crawlers will no longer be filtered out, allowing their
events to be processed when they should be dropped. This breaks the intended
functionality of the web crawler filter for these two bots.
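
The effect on the two user agents from the test list can be sketched as follows. `matches_any` is a hypothetical stand-in, not the relay filter itself: it assumes the alternation contains only literals, models verbose mode by stripping `#` comments and whitespace, and models case-insensitivity with a lowercase substring search. The pattern excerpts are shortened.

```rust
/// Simplified stand-in for a verbose, case-insensitive alternation of
/// literals: strip `#` comments and whitespace, split on `|`, then search
/// the user agent for each alternative, ignoring case.
fn matches_any(pattern: &str, user_agent: &str) -> bool {
    let collapsed: String = pattern
        .lines()
        .map(|line| line.split('#').next().unwrap_or(""))
        .flat_map(|code| code.chars())
        .filter(|c| !c.is_whitespace())
        .collect();
    let haystack = user_agent.to_lowercase();
    collapsed
        .split('|')
        .filter(|alt| !alt.is_empty())
        .any(|alt| haystack.contains(&alt.to_lowercase()))
}

fn main() {
    // Shortened excerpts: without and with the `|` after DuckDuckBot.
    let broken = "Applebot|\nDuckDuckBot # DuckDuckGo\nLightpanda # Lightpanda";
    let fixed = "Applebot|\nDuckDuckBot| # DuckDuckGo\nLightpanda # Lightpanda";

    let agents = [
        "DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)",
        "Lightpanda/1.0",
    ];
    for ua in agents {
        println!(
            "{ua}: broken={} fixed={}",
            matches_any(broken, ua),
            matches_any(fixed, ua)
        );
    }
}
```

Under these assumptions both user agents are missed by the broken excerpt and matched by the fixed one.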

"
)
.expect("Invalid web crawlers filter Regex")
@@ -164,6 +165,7 @@ mod tests {
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
"Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko)Version/Safari_version [Mobile/Mobile_version] Safari/WebKit_version (Applebot/Applebot_version; +http://www.apple.com/go/applebot)",
"DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)",
"Lightpanda/1.0",
];

for banned_user_agent in &user_agents {