No setup. No install. Just paste a URL and scrape.
Open the notebook in Google Colab, run all cells, and capture the full rendered HTML from any website, including JavaScript-heavy SPAs.
Tags: web-scraping selenium chromium spa-scraper colab-notebook python html-extraction
Quick Navigation
| Section | Description |
|---|---|
| 📖 Overview | What is Z-Web-Scraper? |
| 📂 Project Structure | Repository layout |
| 🧩 Architecture | Pipeline flow diagram |
| ⚙️ Pipeline Components | Tools and engines used |
| 🚀 Quick Start | Get running in 3 steps |
| 🎛️ Scrape Parameters | All configurable options |
| 📐 Timeout Guide | When to use which settings |
| 🧠 Scraper Details | Technical specs of each component |
| 🔋 Resource Requirements | RAM, disk specs |
| 🐍 Python Modules | Modular source code reference |
| 🧪 Tips & Tricks | Get the best results |
| ❓ FAQ | Common questions answered |
| 🐛 Troubleshooting | Fix common issues |
| 🙏 Acknowledgements | Credits and references |
| 🤝 Contributing | How to contribute |
| 📜 License | MIT license details |
Z-Web-Scraper is a full-page web scraper for Google Colab that uses Selenium with headless Chromium to render any webpage — including JavaScript-heavy single-page applications — and capture the complete HTML after all dynamic content has loaded.
Note
Why Selenium + Chromium? Unlike requests or urllib, Selenium runs a real browser. This means JavaScript frameworks (React, Vue, Angular, Next.js, Nuxt) execute fully before the HTML is captured. You get what the user sees, not what the server initially sends.
| Feature | Description |
|---|---|
| 🕷️ Full Render | Captures HTML after complete JS execution |
| 📜 Auto-Scroll | Scrolls to trigger lazy-loaded content |
| ⚡ Custom JS | Execute your own JavaScript after load |
| 🔗 Link Extraction | All links with text and resolved URLs |
| 🖼️ Image Extraction | All images with alt text and dimensions |
| 🏷️ Meta Tags | Open Graph, Twitter Cards, description, keywords |
| 📄 HTML Preview | Syntax-highlighted preview in notebook |
| 💾 Save & Download | HTML + metadata JSON, ZIP export |
| Component | File | Purpose |
|---|---|---|
| Notebook | `notebook/Z-Web-Scraper.ipynb` | 3-cell Colab notebook (main entry point) |
| Config | `src/config.py` | Constants, Chrome options, UI theme tokens |
| Scraper | `src/scraper.py` | Selenium engine with auto-scroll, metadata extraction |
| UI | `src/ui.py` | Theme-safe Colab components (dark/light mode) |
| Guide | `GUIDE.md` | Beginner-friendly user guide |
Z-Web-Scraper/
├── CHANGELOG.md # Version history (newest first)
├── CONTRIBUTING.md # How to contribute
├── GUIDE.md # Beginner-friendly user guide
├── LICENSE # MIT
├── README.md # This file
├── SECURITY.md # Vulnerability reporting policy
├── .gitignore # Python, Jupyter, output files, OS artifacts
├── requirements.txt # Python dependencies
│
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   └── feature_request.md
│   └── PULL_REQUEST_TEMPLATE.md
│
├── notebook/
│   └── Z-Web-Scraper.ipynb    # Main Colab notebook (3 cells)
│
└── src/
    ├── __init__.py            # Package marker + shared exports
    ├── config.py              # Constants and defaults
    ├── scraper.py             # Core scraping engine
    └── ui.py                  # Colab UI components
flowchart TD
A["📋 Paste URL"] --> B["🔧 Launch Chromium"]
B --> C["🌐 Load Page"]
C --> D{"JS Settled?"}
D -->|No| C
D -->|Yes| E{"Auto-Scroll?"}
E -->|Yes| F["📜 Scroll to Bottom"]
E -->|No| G["📸 Capture HTML"]
F --> G
G --> H{"Custom JS?"}
H -->|Yes| I["⚡ Execute JS"]
H -->|No| J["🔍 Parse Metadata"]
I --> J
J --> K["📊 Display Stats"]
K --> L["💾 Save & Download"]
style A fill:#0d1117,stroke:#58a6ff,color:#e6edf3,stroke-width:2px
style L fill:#0d1117,stroke:#3fb950,color:#e6edf3,stroke-width:2px
style B fill:#0d1117,stroke:#a371f7,color:#e6edf3,stroke-width:2px
style G fill:#0d1117,stroke:#f97316,color:#e6edf3,stroke-width:2px
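The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `src/scraper.py`: the function names, the fixed `time.sleep` standing in for the "JS Settled?" check, and the `js_result` key are assumptions. It relies only on `driver.get`, `driver.execute_script`, `driver.page_source`, and `driver.current_url`, which Selenium WebDriver provides, so any object with those members can be passed in.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return last_height


def run_pipeline(driver, url, wait_after_load=3, auto_scroll=True, custom_js=""):
    """Load, settle, optionally scroll, capture HTML, then run optional custom JS."""
    driver.get(url)
    time.sleep(wait_after_load)  # assumption: a fixed wait approximates "JS settled"
    if auto_scroll:
        scroll_to_bottom(driver)
    html = driver.page_source  # captured before custom JS, as in the flowchart
    js_result = driver.execute_script(custom_js) if custom_js else None
    return {"html": html, "final_url": driver.current_url, "js_result": js_result}
```

The scroll loop stops as soon as two consecutive height readings match, and `max_rounds` keeps infinite-scroll feeds from running forever.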
| Component | Technology | Purpose |
|---|---|---|
| Browser Engine | Chromium (headless) | Renders JavaScript, executes SPAs |
| Automation | Selenium WebDriver | Controls Chromium programmatically |
| HTML Parser | BeautifulSoup + lxml | Extracts links, images, meta tags |
| HTTP Fallback | requests | Fetches response headers |
| Step | Cell | What Happens | Duration |
|---|---|---|---|
| 🔧 | 1. Setup | Install Chromium & Selenium | ~60s (first) / ~10s (cached) |
| 🕷️ | 2. Scrape | Paste URL → render → capture HTML | ~5–60s per page |
| 💾 | 3. Export | Zip and download results | ~5s |
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
| `url` | String | none | Any URL | Website to scrape |
| `timeout` | Integer | `30` | `10–120` | Max seconds to wait for page load |
| `wait_after_load` | Integer | `3` | `0–30` | Extra seconds for JS to settle |
| `auto_scroll` | Bool | `True` | `True/False` | Scroll to trigger lazy content |
| `custom_js` | String | `""` | Any JS | JavaScript to execute after load |
| `save_html_file` | Bool | `True` | `True/False` | Save HTML to file |
| `show_preview` | Bool | `True` | `True/False` | Display syntax-highlighted preview |
| `show_links` | Bool | `True` | `True/False` | Extract and display all links |
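When driving the scraper from your own code rather than the notebook form, it can help to validate these parameters in one place. The sketch below is a hypothetical helper, not part of the repository: `ScrapeParams` and `clamp` are invented names, and the URL scheme check is an assumption. The defaults and ranges come from the table above.

```python
from dataclasses import dataclass


def clamp(value, low, high):
    """Force value into the inclusive range [low, high]."""
    return max(low, min(high, value))


@dataclass
class ScrapeParams:
    url: str
    timeout: int = 30            # allowed range: 10-120 seconds
    wait_after_load: int = 3     # allowed range: 0-30 seconds
    auto_scroll: bool = True
    custom_js: str = ""
    save_html_file: bool = True
    show_preview: bool = True
    show_links: bool = True

    def __post_init__(self):
        # Assumption: only absolute http(s) URLs are accepted.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must start with http:// or https://, got {self.url!r}")
        self.timeout = clamp(int(self.timeout), 10, 120)
        self.wait_after_load = clamp(int(self.wait_after_load), 0, 30)
```

Out-of-range numbers are clamped rather than rejected, so `ScrapeParams("https://example.com", timeout=500)` ends up with a timeout of 120.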
| Site Type | Timeout | Wait After | Auto-Scroll | Speed |
|---|---|---|---|---|
| Static HTML | 15s | 1s | Optional | ⚡⚡⚡ |
| React / Vue SPA | 30s | 3–5s | Yes | ⚡⚡ |
| Next.js / Nuxt | 60s | 5–8s | Yes | ⚡ |
| Heavy Media | 90–120s | 5–10s | Yes | 🐢 |
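The guide above can also be encoded as a small lookup. This is an illustrative sketch: the preset keys, the `settings_for` helper, and the `patient` flag are hypothetical, and treating "Optional" auto-scroll as off for static pages is an assumption. Ranges from the table are stored as `(low, high)` pairs.

```python
# Ranges from the timeout guide, stored as (low, high) in seconds.
PRESETS = {
    "static":      {"timeout": (15, 15),  "wait_after_load": (1, 1),  "auto_scroll": False},
    "spa":         {"timeout": (30, 30),  "wait_after_load": (3, 5),  "auto_scroll": True},
    "nextjs_nuxt": {"timeout": (60, 60),  "wait_after_load": (5, 8),  "auto_scroll": True},
    "heavy_media": {"timeout": (90, 120), "wait_after_load": (5, 10), "auto_scroll": True},
}


def settings_for(site_type, patient=False):
    """Return concrete settings; patient=True picks the upper end of each range."""
    preset = PRESETS[site_type]
    pick = 1 if patient else 0
    return {
        "timeout": preset["timeout"][pick],
        "wait_after_load": preset["wait_after_load"][pick],
        "auto_scroll": preset["auto_scroll"],
    }
```

Starting at the low end and switching to `patient=True` only after a timeout or an empty capture keeps typical runs fast.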
| Property | Value |
|---|---|
| Engine | Chromium (headless, --headless=new) |
| Driver | Selenium WebDriver 4.x |
| Window | 1920×1080 |
| User Agent | Chrome 125 (Windows) |
| Timeout | Configurable 10–120 seconds |
| Source | selenium.dev |
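For reference, here is one way the engine settings above could translate into Chromium command-line flags. It is a sketch under assumptions: the real values live in `CHROME_OPTIONS` in `src/config.py`, which may differ, and `--no-sandbox` and `--disable-dev-shm-usage` are added here only because they are commonly needed in container-like environments. The function names are hypothetical.

```python
def build_chrome_args(width=1920, height=1080, user_agent=None):
    """Return Chromium flags matching the table: new headless mode, fixed window size."""
    args = [
        "--headless=new",
        f"--window-size={width},{height}",
        "--no-sandbox",             # assumption: often required when running as root
        "--disable-dev-shm-usage",  # assumption: works around a small /dev/shm
    ]
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def make_options(args):
    """Turn a list of flags into a Selenium Options object."""
    # Imported lazily so build_chrome_args works without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in args:
        options.add_argument(arg)
    return options
```

Passing the result of `make_options(build_chrome_args())` as `options=` to `selenium.webdriver.Chrome` would start the browser with these flags, provided Chromium and a matching driver are installed.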
| Property | Value |
|---|---|
| Purpose | HTML parsing and data extraction |
| Extracts | Links, images, meta tags, text |
| Parser | lxml (fast C-based) |
| Source | crummy.com/BeautifulSoup |
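To show what this extraction step does, the sketch below pulls links, images, and meta tags out of an HTML string and resolves relative URLs against the page URL. Note the swap: it uses the standard library's `html.parser` instead of BeautifulSoup + lxml so that it runs with no dependencies. `PageParser` and `parse_page` are hypothetical names, not the project's API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect links, images, and meta tags while the HTML is fed through."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # dicts with "text" and "url"
        self.images = []  # dicts with "src" and "alt"
        self.meta = {}    # e.g. "og:title", "twitter:card", "description"
        self._open_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._open_link = {"text": "", "url": urljoin(self.base_url, attrs["href"])}
        elif tag == "img" and attrs.get("src"):
            self.images.append(
                {"src": urljoin(self.base_url, attrs["src"]), "alt": attrs.get("alt") or ""}
            )
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_data(self, data):
        if self._open_link is not None:
            self._open_link["text"] += data  # accumulate text inside the open <a>

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self._open_link["text"] = self._open_link["text"].strip()
            self.links.append(self._open_link)
            self._open_link = None


def parse_page(html, base_url):
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "images": parser.images, "meta": parser.meta}
```

Resolving against the final URL (after redirects) rather than the URL that was typed in matters for relative links, which is presumably why `extract_links` takes `result["final_url"]`.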
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| Runtime | Colab free | Colab free | No GPU needed |
| System RAM | 2 GB | 4 GB+ | Chromium memory |
| Disk Space | 500 MB | 1 GB+ | Chromium + outputs |
| Python | 3.10+ | Colab default | Required for Selenium |
# src/config.py: constants and defaults
from src.config import DEFAULT_TIMEOUT, CHROME_OPTIONS, BG_CARD
print(f"Default timeout: {DEFAULT_TIMEOUT}s")

# src/scraper.py: scrape a page, save the HTML, extract links
from src.scraper import scrape_url, save_html, extract_links
result = scrape_url("https://example.com", timeout=30, scroll=True)
save_html(result["html"], "output.html")
links = extract_links(result["html"], result["final_url"])

# src/ui.py: Colab status and stats components
from src.ui import show_header, show_ok, show_stats
show_header("🕷️", "Scraping", "Loading page...")
show_ok("Page captured!")
show_stats([("📄", "Chars", "125,000"), ("🔗", "Links", "42")])
Does it work on Next.js / React / Vue sites?
Yes! Selenium runs a real Chromium browser, so all JavaScript frameworks execute fully before the HTML is captured.
Can I scrape pages behind login?
Not directly. You could inject cookies via custom_js, but there's no built-in auth flow.
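As a rough illustration of the cookie approach, the helper below turns a dictionary of cookies into a JavaScript snippet that could be pasted into `custom_js`. The function name `cookies_to_js` is hypothetical. Two caveats apply: cookies marked `HttpOnly` cannot be set from JavaScript, and because the snippet runs after the page has loaded, the server only sees the cookies on a later request. Whether the notebook reloads the page afterwards is not something this sketch assumes.

```python
import json
from urllib.parse import quote


def cookies_to_js(cookies, path="/"):
    """Build JavaScript that sets each cookie through document.cookie."""
    statements = []
    for name, value in cookies.items():
        # Percent-encode the value so spaces and semicolons cannot break the cookie.
        cookie = f"{name}={quote(str(value), safe='')}; path={path}"
        # json.dumps yields a safely quoted JavaScript string literal.
        statements.append(f"document.cookie = {json.dumps(cookie)};")
    return "\n".join(statements)
```

For example, `cookies_to_js({"session": "abc 123"})` returns `document.cookie = "session=abc%20123; path=/";`.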
Why not just use `requests`?
requests only gets the initial server response — no JavaScript execution. SPAs return an empty shell that gets filled by JS. Selenium renders the full page.
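The "empty shell" is easy to see by counting the visible text in the raw server response. The following sketch uses only the standard library; the helper names and the 50-character threshold are arbitrary assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count text characters outside script, style, noscript, and template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def visible_text_length(html):
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars


def looks_like_spa_shell(html, min_chars=50):
    """Heuristic: almost no visible text suggests the content is filled in by JS."""
    return visible_text_length(html) < min_chars
```

A response such as `<div id="root"></div>` plus a script bundle scores near zero here, while the same page captured after rendering in a browser contains the actual text.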
Is GPU required?
No. Z-Web-Scraper runs entirely on CPU. Chromium doesn't need GPU acceleration for scraping.
Can I scrape multiple URLs?
Yes! Run Step 2 multiple times with different URLs. All files accumulate in the output directory and can be exported together in Step 3.
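If you script several scrapes yourself, each page needs its own output file so that results accumulate instead of overwriting one another. The notebook's actual naming scheme is not documented here, so the sketch below is an assumption: `filename_for` is a hypothetical helper that derives a safe file name from the URL and adds a numeric suffix on collisions.

```python
import re
from urllib.parse import urlparse


def filename_for(url, taken):
    """Return a unique .html file name for url and record it in the taken set."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace anything that is not filename-safe with an underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", raw) or "page"
    name, n = f"{stem}.html", 1
    while name in taken:
        n += 1
        name = f"{stem}_{n}.html"
    taken.add(name)
    return name
```

Keeping one `taken` set per session means that scraping the same URL twice yields `..._2.html` rather than replacing the earlier capture.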
How big can the HTML be?
There's no hard limit. Very large pages (10+ MB HTML) may slow down the preview. The full HTML is always saved to file regardless of size.
| Problem | Cause | Solution |
|---|---|---|
| `Chromium not found` | Runtime restarted | Re-run Cell 1 |
| `TimeoutException` | Page too slow | Increase `timeout` to 60–120s |
| Empty HTML | JS not settled | Increase `wait_after_load` to 5–10s |
| Missing lazy content | Scroll not triggered | Enable `auto_scroll` |
| `WebDriverException` | Driver crash | Restart runtime, re-run Cell 1 |
| `ConnectionRefused` | Site blocks bots | Try a different URL or add a delay |
Contributions are welcome!
This project is licensed under the MIT License.
Free to use, modify, and distribute — see the LICENSE file for details.