An implementation of the Model Context Protocol (MCP) that gives AI Agents like Claude the ability to scrape and interpret live web data with high precision.
- FastMCP Framework: Built on FastMCP, a Python framework for writing MCP servers.
- Headless Automation: Uses Playwright to handle JavaScript-heavy sites.
- Context Optimization: Automatically cleans HTML to save LLM tokens.
- Smart Caching: Caches scraped content for 10 minutes to avoid redundant requests.
- Configurable: Adjustable character limits and metadata options.
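The 10-minute caching behaviour can be pictured with a small time-to-live cache. This is a minimal sketch, not the project's actual code: the `TTLCache` class, its method names, and the injectable `clock` argument are hypothetical, and only the 600-second lifetime comes from the feature list above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        """Return cached content for `url`, or None if missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, content = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[url]  # drop the stale entry
            return None
        return content

    def set(self, url: str, content: str) -> None:
        """Store `content` for `url`, stamped with the current time."""
        self._entries[url] = (self._clock(), content)
```

A scraper built this way would call `get(url)` first and only launch the browser when it returns `None`, then call `set(url, content)` with the fresh result.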
Before you begin, make sure you have:
- Python 3.8+ installed (Download here)
- Claude Desktop App installed (Download here)
- Basic familiarity with the terminal/command line
git clone https://github.com/RootedDreamsBlog/ScrapeFlow-MCP
cd ScrapeFlow-MCP
Or download and extract the ZIP file, then navigate to the folder in your terminal.
pip install -r requirements.txt
playwright install chromium
This downloads the Chromium browser that Playwright uses for scraping.
You need the full path to your server.py file.
On Mac/Linux:
cd /path/to/ScrapeFlow-MCP
echo "$(pwd)/server.py"
On Windows (PowerShell):
cd C:\path\to\ScrapeFlow-MCP
Write-Host "$(Get-Location)\server.py"
Copy the output! Example:
- Mac: /Users/yourname/Documents/ScrapeFlow-MCP/server.py
- Windows: C:\Users\yourname\Documents\ScrapeFlow-MCP\server.py
On Mac:
open ~/Library/Application\ Support/Claude/
On Windows (Command Prompt):
explorer %APPDATA%\Claude
On Windows (PowerShell):
explorer $env:APPDATA\Claude
This opens the Claude configuration folder. Look for claude_desktop_config.json.
If the file doesn't exist, create it. Then add this configuration:
Mac Example:
{
"mcpServers": {
"scrapeflow": {
"command": "python3",
"args": ["/Users/yourname/Documents/ScrapeFlow-MCP/server.py"]
}
}
}
Windows Example:
{
"mcpServers": {
"scrapeflow": {
"command": "python",
"args": ["C:/Users/yourname/Documents/ScrapeFlow-MCP/server.py"]
}
}
}
Important Notes:
- Replace the path with YOUR actual path from Step 1
- On Windows, use forward slashes (/) or double backslashes (\\)
- If you use a virtual environment, use the full path to that Python executable
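If you would rather not edit the JSON by hand, the same entry can be merged in with a short script. This is a hedged sketch: the helper names `add_mcp_server` and `update_config_file` are hypothetical, and it assumes the config file uses the `mcpServers` layout shown in the examples above. It converts backslashes to forward slashes, which sidesteps the escaping issue on Windows.

```python
import json
from pathlib import Path
from typing import Any, Dict


def add_mcp_server(config: Dict[str, Any], name: str, command: str,
                   script_path: str) -> Dict[str, Any]:
    """Return a copy of `config` with one entry added under "mcpServers"."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    # Forward slashes work on Windows and need no escaping in JSON.
    servers[name] = {"command": command,
                     "args": [script_path.replace("\\", "/")]}
    updated["mcpServers"] = servers
    return updated


def update_config_file(config_file: Path, name: str, command: str,
                       script_path: str) -> None:
    """Merge the server entry into `config_file`, creating the file if needed."""
    existing: Dict[str, Any] = {}
    if config_file.exists():
        raw = config_file.read_text(encoding="utf-8")
        if raw.strip():  # tolerate an empty file
            existing = json.loads(raw)
    merged = add_mcp_server(existing, name, command, script_path)
    config_file.write_text(json.dumps(merged, indent=2), encoding="utf-8")
```

Existing servers in the file are preserved; only the entry with the given name is added or overwritten. Pass the full path to your virtual environment's Python as `command` if you use one.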
Completely quit Claude Desktop (don't just close the window), then reopen it.
On Mac: Cmd + Q to quit
On Windows: Right-click the taskbar icon and select "Quit"
Once Claude Desktop is restarted, open a new chat and try these test prompts:
Can you scrape https://example.com and tell me what it says?
Use the search_and_summarize tool to get content from https://news.ycombinator.com
Scrape https://lite.cnn.com and summarize the top headlines
Get the content from https://www.bbc.com/news with max_chars set to 3000
Scrape https://github.com/trending and tell me what's trending
Some websites have strong anti-bot protection and may time out:
- Forbes, Medium (paywalls)
- Amazon, eBay (heavy JavaScript)
- Sites with CAPTCHAs
For these sites, the tool will return an error message with suggestions.
If you want to test without Claude Desktop, use the MCP Inspector:
The Inspector runs on Node.js. If you don't have it, download it from nodejs.org, then run:
cd /path/to/ScrapeFlow-MCP
npx @modelcontextprotocol/inspector python server.py
- The terminal will show a URL like http://localhost:6274
- Click it or paste it into your browser
- Click "Connect" → Go to "Tools" tab
- You'll see search_and_summarize - try entering a URL there
When you ask Claude to scrape a website:
- Claude recognizes it should use the search_and_summarize tool
- Your MCP server receives the URL
- Playwright launches a headless browser and visits the page
- The HTML is cleaned (removing scripts, ads, navigation)
- Text content is extracted and cached
- Claude receives the cleaned content and can answer your questions
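The cleaning and extraction steps above can be sketched roughly as follows. The project credits Beautiful Soup for HTML parsing; this example deliberately swaps in Python's standard-library `html.parser` so it runs without extra dependencies. The names `TextExtractor`, `clean_html`, and the `SKIPPED_TAGS` set are hypothetical and are not taken from `scraper.py`.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of tags whose contents are noise for the model.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring everything inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html: str) -> str:
    """Return the readable text of `html` with scripts and page chrome removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In the flow described above, `html` would be the page source returned by the headless browser, and the cleaned text is what gets cached and handed back to Claude.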
Solution:
- Check the config file path is correct
- Make sure you completely quit and restarted Claude Desktop
- Check Claude Desktop logs:
  - Mac: ~/Library/Logs/Claude/mcp*.log
  - Windows: %APPDATA%\Claude\logs\mcp*.log
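To read the most recent log without hunting through the folder, something like the following may help. It is a sketch under assumptions: the helper name `latest_log_lines` is hypothetical, and it only relies on the `mcp*.log` naming pattern and log folders listed above.

```python
import glob
import os
from typing import List


def latest_log_lines(log_dir: str, pattern: str = "mcp*.log",
                     count: int = 20) -> List[str]:
    """Return the last `count` lines of the most recently modified matching log."""
    matches = glob.glob(os.path.join(os.path.expanduser(log_dir), pattern))
    if not matches:
        return []  # no log files found in this folder
    newest = max(matches, key=os.path.getmtime)
    with open(newest, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle.readlines()[-count:]]


if __name__ == "__main__":
    # Mac location from the list above; on Windows, pass
    # os.path.expandvars(r"%APPDATA%\Claude\logs") instead.
    for line in latest_log_lines("~/Library/Logs/Claude"):
        print(line)
```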
Solution:
Try changing "command": "python" to "command": "python3" in your config file.
Solution: This is normal for sites with heavy JavaScript or anti-bot protection. Try:
- Simpler news sites (BBC, Reuters)
- Technical sites (GitHub, Stack Overflow)
- Documentation sites
Solution: Test manually in terminal:
cd /path/to/ScrapeFlow-MCP
python server.py
If you see errors, check that all dependencies are installed.
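One quick way to check for missing dependencies is to ask Python whether each package can be found. This is a minimal sketch: the function name `missing_modules` is hypothetical, and the module names are assumptions based on the libraries credited in this README (Playwright and Beautiful Soup), not a reading of requirements.txt. FastMCP is left out because its import name depends on which package the project installs.

```python
import importlib.util
from typing import Iterable, List

# Assumed top-level import names; the real requirements.txt may differ.
ASSUMED_MODULES = ("playwright", "bs4")


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the names from `module_names` that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: pip install -r requirements.txt")
    else:
        print("All checked modules are importable.")
```

Run it with the same Python executable you configured for Claude Desktop, since a different interpreter may have a different set of installed packages.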
The search_and_summarize tool accepts these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | The webpage URL to scrape |
| max_chars | integer | 5000 | Maximum characters to return |
| include_metadata | boolean | true | Include URL and character count info |
Example usage in Claude:
Scrape https://example.com with max_chars set to 10000
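To make the effect of `max_chars` and `include_metadata` concrete, here is a rough sketch of how a tool could apply them to already-cleaned text. The function `format_result` and the exact header layout are hypothetical; only the parameter names and defaults come from the table above.

```python
def format_result(url: str, text: str, max_chars: int = 5000,
                  include_metadata: bool = True) -> str:
    """Truncate `text` to `max_chars` and optionally prepend URL and size info."""
    truncated = text[:max_chars]
    if not include_metadata:
        return truncated
    # Hypothetical header layout: source URL plus returned/total character counts.
    header = f"URL: {url}\nCharacters: {len(truncated)} of {len(text)}\n\n"
    return header + truncated


if __name__ == "__main__":
    print(format_result("https://example.com", "Example Domain " * 10, max_chars=40))
```

A smaller `max_chars` keeps responses cheap in tokens; raising it, as in the prompt above, trades tokens for more of the page.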
ScrapeFlow-MCP/
├── server.py # FastMCP server configuration
├── scraper.py # WebScout scraping class
├── requirements.txt # Python dependencies
└── README.md # This file
Found a bug or want to improve the scraper? Contributions are welcome!
- Fork this repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/improvement)
- Open a Pull Request
MIT License - feel free to use this project for learning and development!
- Built with FastMCP
- Web scraping powered by Playwright
- HTML parsing by Beautiful Soup
If you run into problems:
- Check the Troubleshooting section above
- Review the Claude Desktop logs
- Open an issue on GitHub with:
- Your operating system
- Error messages from logs
- Steps you've already tried
Happy scraping!
This project is part of a deep-dive tutorial on my blog: https://www.rooteddreams.net/web-scraping-mcp-guide/