GitHub - festverse/Z-Web-Scraper: 🕸️ Extract fully rendered HTML from any URL. A powerful web scraper built to bypass empty payloads and capture dynamic content from JavaScript-heavy SPAs.

No local setup. No local install. Just paste a URL and scrape.

Open the notebook in Google Colab, run all cells, and capture the fully rendered HTML of a website, including JavaScript-heavy SPAs. The first cell installs Chromium and Selenium inside the Colab runtime.

Tags: web-scraping selenium chromium spa-scraper colab-notebook python html-extraction

📑 Table of Contents

Quick Navigation

Section	Description
📖 Overview	What is Z-Web-Scraper?
📂 Project Structure	Repository layout
🧩 Architecture	Pipeline flow diagram
⚙️ Pipeline Components	Tools and engines used
🚀 Quick Start	Get running in 3 steps
🎛️ Scrape Parameters	All configurable options
📐 Timeout Guide	When to use which settings
🧠 Scraper Details	Technical specs of each component
🔋 Resource Requirements	RAM, disk specs
🐍 Python Modules	Modular source code reference
🧪 Tips & Tricks	Get the best results
❓ FAQ	Common questions answered
🐛 Troubleshooting	Fix common issues
🙏 Acknowledgements	Credits and references
🤝 Contributing	How to contribute
📜 License	MIT license details

📖 Overview

Z-Web-Scraper is a full-page web scraper for Google Colab that uses Selenium with headless Chromium to render any webpage — including JavaScript-heavy single-page applications — and capture the complete HTML after all dynamic content has loaded.

Note

Why Selenium + Chromium? Unlike requests or urllib, Selenium runs a real browser. This means JavaScript frameworks (React, Vue, Angular, Next.js, Nuxt) execute fully before the HTML is captured. You get what the user sees, not what the server initially sends.

✨ Key Features

Feature	Description
🕷️ Full Render	Captures HTML after complete JS execution
📜 Auto-Scroll	Scrolls to trigger lazy-loaded content
⚡ Custom JS	Execute your own JavaScript after load
🔗 Link Extraction	All links with text and resolved URLs
🖼️ Image Extraction	All images with alt text and dimensions
🏷️ Meta Tags	Open Graph, Twitter Cards, description, keywords
📄 HTML Preview	Syntax-highlighted preview in notebook
💾 Save & Download	HTML + metadata JSON, ZIP export
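
The "Save & Download" row can be pictured with a small standard-library sketch. It is not the notebook's code: `build_metadata` and `save_and_zip` are hypothetical helpers, and the metadata fields shown (URL, title, character count, SHA-256 hash) are an assumed subset of what the real JSON may contain.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def build_metadata(url: str, html: str, title: str = "") -> dict:
    """Collect basic stats plus a content hash that can be used for dedup."""
    return {
        "url": url,
        "title": title,
        "chars": len(html),
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    }


def save_and_zip(out_dir: Path, name: str, html: str, meta: dict) -> Path:
    """Write <name>.html and <name>.json, then bundle out_dir into a ZIP."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{name}.html").write_text(html, encoding="utf-8")
    (out_dir / f"{name}.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    # The archive is written next to out_dir so it never includes itself.
    zip_path = out_dir.parent / f"{out_dir.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out_dir.iterdir()):
            zf.write(path, arcname=path.name)
    return zip_path
```

Because the hash depends only on the HTML, two URLs that render to identical markup produce the same `sha256`, which is what makes it usable for deduplication.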

📦 What's Included

Component	File	Purpose
Notebook	`notebook/Z-Web-Scraper.ipynb`	3-cell Colab notebook — main entry point
Config	`src/config.py`	Constants, Chrome options, UI theme tokens
Scraper	`src/scraper.py`	Selenium engine with auto-scroll, metadata extraction
UI	`src/ui.py`	Theme-safe Colab components (dark/light mode)
Guide	`GUIDE.md`	Beginner-friendly user guide

📂 Project Structure

Z-Web-Scraper/
├── CHANGELOG.md                # Version history (newest first)
├── CONTRIBUTING.md             # How to contribute
├── GUIDE.md                    # Beginner-friendly user guide
├── LICENSE                     # MIT
├── README.md                   # This file
├── SECURITY.md                 # Vulnerability reporting policy
├── .gitignore                  # Python, Jupyter, output files, OS artifacts
├── requirements.txt            # Python dependencies
│
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   └── feature_request.md
│   └── PULL_REQUEST_TEMPLATE.md
│
├── notebook/
│   └── Z-Web-Scraper.ipynb          # Main Colab notebook (3 cells)
│
└── src/
    ├── __init__.py             # Package marker + shared exports
    ├── config.py               # Constants and defaults
    ├── scraper.py              # Core scraping engine
    └── ui.py                   # Colab UI components

🧩 Architecture

flowchart TD
    A["📋 Paste URL"] --> B["🔧 Launch Chromium"]
    B --> C["🌐 Load Page"]
    C --> D{"JS Settled?"}
    D -->|No| C
    D -->|Yes| E{"Auto-Scroll?"}
    E -->|Yes| F["📜 Scroll to Bottom"]
    E -->|No| G["📸 Capture HTML"]
    F --> G
    G --> H{"Custom JS?"}
    H -->|Yes| I["⚡ Execute JS"]
    H -->|No| J["🔍 Parse Metadata"]
    I --> J
    J --> K["📊 Display Stats"]
    K --> L["💾 Save & Download"]

    style A fill:#0d1117,stroke:#58a6ff,color:#e6edf3,stroke-width:2px
    style L fill:#0d1117,stroke:#3fb950,color:#e6edf3,stroke-width:2px
    style B fill:#0d1117,stroke:#a371f7,color:#e6edf3,stroke-width:2px
    style G fill:#0d1117,stroke:#f97316,color:#e6edf3,stroke-width:2px
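
The flow above can be read as the following control-flow sketch. The `Page` interface and `run_pipeline` are hypothetical names made up for illustration; the real engine lives in `src/scraper.py` and may be organised differently. The sketch follows the diagram's order literally, so the HTML is captured before `custom_js` runs, and it treats the "JS Settled? No" branch as "keep polling until a deadline".

```python
import time
from typing import Callable, Protocol


class Page(Protocol):
    """Hypothetical minimal browser interface used only by this sketch."""

    def load(self, url: str) -> None: ...
    def is_settled(self) -> bool: ...
    def scroll_to_bottom(self) -> None: ...
    def html(self) -> str: ...
    def run_js(self, script: str) -> None: ...


def run_pipeline(
    page: Page,
    url: str,
    timeout: float = 30.0,
    auto_scroll: bool = True,
    custom_js: str = "",
    poll: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Load, wait for JS, optionally scroll, capture, optionally run custom JS."""
    page.load(url)
    deadline = time.monotonic() + timeout
    while not page.is_settled():          # "JS Settled?" decision
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page did not settle within {timeout}s")
        sleep(poll)
    if auto_scroll:                       # "Auto-Scroll?" decision
        page.scroll_to_bottom()
    html = page.html()                    # "Capture HTML"
    if custom_js:                         # "Custom JS?" decision
        page.run_js(custom_js)
    return {"url": url, "html": html, "chars": len(html)}
```

Passing `sleep` as a parameter keeps the loop testable without real delays.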

⚙️ Pipeline Components

Component	Technology	Purpose
Browser Engine	Chromium (headless)	Renders JavaScript, executes SPAs
Automation	Selenium WebDriver	Controls Chromium programmatically
HTML Parser	BeautifulSoup + lxml	Extracts links, images, meta tags
HTTP Fallback	requests	Fetches response headers

🚀 Quick Start

Step	Cell	What Happens	Duration
🔧	1. Setup	Install Chromium & Selenium	~60s (first) / ~10s (cached)
🕷️	2. Scrape	Paste URL → render → capture HTML	~5–60s per page
💾	3. Export	Zip and download results	~5s

🎛️ Scrape Parameters

Parameter	Type	Default	Options	Description
`url`	String	—	Any URL	Website to scrape
`timeout`	Integer	`30`	`10`–`120`	Max seconds to wait for page load
`wait_after_load`	Integer	`3`	`0`–`30`	Extra seconds for JS to settle
`auto_scroll`	Bool	`True`	`True`/`False`	Scroll to trigger lazy content
`custom_js`	String	`""`	Any JS	JavaScript to execute after load
`save_html_file`	Bool	`True`	`True`/`False`	Save HTML to file
`show_preview`	Bool	`True`	`True`/`False`	Display syntax-highlighted preview
`show_links`	Bool	`True`	`True`/`False`	Extract and display all links

📐 Timeout Guide

Site Type	Timeout	Wait After	Auto-Scroll	Speed
Static HTML	15s	1s	Optional	⚡⚡⚡
React / Vue SPA	30s	3–5s	Yes	⚡⚡
Next.js / Nuxt	60s	5–8s	Yes	⚡
Heavy Media	90–120s	5–10s	Yes	🐢
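
If you prefer presets over remembering the table, a lookup like the one below is one option. The key names and `settings_for` are hypothetical; each preset takes the low end of the range in its row, and "Optional" auto-scroll for static pages is treated as off.

```python
# Starting values taken from the low end of each range in the table above.
PRESETS = {
    "static": {"timeout": 15, "wait_after_load": 1, "auto_scroll": False},
    "spa": {"timeout": 30, "wait_after_load": 3, "auto_scroll": True},
    "ssr": {"timeout": 60, "wait_after_load": 5, "auto_scroll": True},
    "heavy_media": {"timeout": 90, "wait_after_load": 5, "auto_scroll": True},
}


def settings_for(site_type: str) -> dict:
    """Return a copy of the preset; unknown types fall back to the SPA row."""
    return dict(PRESETS.get(site_type, PRESETS["spa"]))
```

The SPA row matches the notebook defaults (`timeout=30`, `wait_after_load=3`, `auto_scroll=True`), which is why it serves as the fallback.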

🧠 Scraper Details

Selenium + Chromium

Property	Value
Engine	Chromium (headless, `--headless=new`)
Driver	Selenium WebDriver 4.x
Window	1920×1080
User Agent	Chrome 125 (Windows)
Timeout	Configurable 10–120 seconds
Source	selenium.dev
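
A minimal sketch of how a headless Chromium session with these properties might be created in Selenium 4. This is not the contents of `src/config.py`: `CHROME_ARGS`, `make_driver`, and `render` are hypothetical names, `--no-sandbox` and `--disable-dev-shm-usage` are assumed because containerised runtimes such as Colab commonly need them, and the user-agent flag is left out because the exact string is not given here.

```python
import time

# Flags from the table above, plus two assumed container-friendly flags.
CHROME_ARGS = [
    "--headless=new",
    "--window-size=1920,1080",
    "--no-sandbox",             # assumption: often required when running as root
    "--disable-dev-shm-usage",  # assumption: avoids a small /dev/shm in containers
]


def make_driver(timeout: int = 30):
    """Create a headless Chrome driver with a page-load timeout in seconds."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in CHROME_ARGS:
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(timeout)
    return driver


def render(url: str, timeout: int = 30, wait_after_load: int = 3) -> str:
    """Load url, wait a fixed extra delay for JS, and return the rendered HTML."""
    driver = make_driver(timeout)
    try:
        driver.get(url)
        time.sleep(wait_after_load)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling `render("https://example.com")` requires Chromium and a matching driver to be installed, which is what Cell 1 of the notebook takes care of.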

BeautifulSoup + lxml

Property	Value
Purpose	HTML parsing and data extraction
Extracts	Links, images, meta tags, text
Parser	lxml (fast C-based)
Source	crummy.com/software/BeautifulSoup
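
To show what "links with text and resolved URLs" means, here is a dependency-free sketch. It swaps in the standard library's `html.parser` instead of BeautifulSoup + lxml, so it is not the project's `extract_links`; `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> targets, resolved against a base URL, with their text."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[dict] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links such as "/docs" become absolute URLs.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"url": self._href, "text": text})
            self._href = None


def collect_links(html: str, base_url: str) -> list[dict]:
    """Return every link in html as {"url": ..., "text": ...}."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Pass the final URL after redirects as `base_url`, so relative links resolve against the page that was actually rendered.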

🔋 Resource Requirements

Resource	Minimum	Recommended	Notes
Runtime	Colab free	Colab free	No GPU needed
System RAM	2 GB	4 GB+	Chromium memory
Disk Space	500 MB	1 GB+	Chromium + outputs
Python	3.10+	Colab default	Required for Selenium

🐍 Python Modules

`src/config.py`

from src.config import DEFAULT_TIMEOUT, CHROME_OPTIONS, BG_CARD
print(f"Default timeout: {DEFAULT_TIMEOUT}s")

`src/scraper.py`

from src.scraper import scrape_url, save_html, extract_links
result = scrape_url("https://example.com", timeout=30, scroll=True)
save_html(result["html"], "output.html")
links = extract_links(result["html"], result["final_url"])

`src/ui.py`

from src.ui import show_header, show_ok, show_stats
show_header("🕷️", "Scraping", "Loading page...")
show_ok("Page captured!")
show_stats([("📄", "Chars", "125,000"), ("🔗", "Links", "42")])

🧪 Tips & Tricks

🌐 Input Quality

Use full URLs: include `https://` for best results
Increase timeout for slow servers (60–120s)
Custom JS can click expand buttons or trigger modals

⚡ Performance

Disable auto-scroll if you only need above-the-fold content
Lower `wait_after_load` for static sites (1s is enough)
Batch scraping: run Step 2 multiple times, export once

🎯 SPA-Specific

Next.js / Nuxt: increase wait to 5–8s for SSG/SSR
Infinite scroll: auto-scroll handles it automatically
Client-side routing: paste the final URL, not the shell

📤 Output

HTML file: complete rendered DOM, ready for parsing
Metadata JSON: title, tags, stats, hash for dedup
ZIP export: download everything at once
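
For readers curious how auto-scroll can cope with lazy-loaded or infinite-scroll pages, a common pattern is to scroll until the page height stops growing. The sketch below illustrates that pattern and is not necessarily what `src/scraper.py` does; `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names, and the round cap is there because a truly endless feed would otherwise never finish.

```python
import time
from typing import Callable


def scroll_to_bottom(
    driver,
    pause: float = 1.0,
    max_rounds: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the page height stops growing; return the rounds used.

    driver is any object with Selenium's execute_script(script) method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return rounds
```

A longer `pause` helps on slow connections, at the cost of a slower scrape.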

❓ FAQ

Does it work on Next.js / React / Vue sites?

Yes! Selenium runs a real Chromium browser, so all JavaScript frameworks execute fully before the HTML is captured.

Can I scrape pages behind login?

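Not directly; there is no built-in auth flow. You could set cookies via `custom_js`, but JavaScript can only write cookies that are not flagged `HttpOnly`, so `HttpOnly` session cookies cannot be injected this way.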
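
If you want to try the cookie route, a string for `custom_js` could be generated like this. `cookie_js` is a hypothetical helper, not part of the project. Cookies set this way only affect later requests, and whether the notebook reloads the page after running `custom_js` is not stated here, so treat this as a starting point for experiments.

```python
import json
from urllib.parse import quote


def cookie_js(cookies: dict[str, str], path: str = "/") -> str:
    """Build JavaScript that sets each cookie through document.cookie.

    Only cookies that are not HttpOnly can be written from page JavaScript.
    """
    statements = []
    for name, value in cookies.items():
        # Percent-encode the value so ';' or spaces cannot break the cookie.
        pair = f"{name}={quote(value, safe='')}; path={path}"
        # json.dumps yields a correctly escaped JavaScript string literal.
        statements.append(f"document.cookie = {json.dumps(pair)};")
    return "\n".join(statements)
```

The returned text can be pasted into the `custom_js` field as-is.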

Why not just use `requests`?

`requests` only fetches the initial server response and does not execute JavaScript. SPAs typically return a near-empty shell that JavaScript fills in afterwards; Selenium renders the full page before the HTML is captured.
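
One way to see the difference is to count the visible text in a response. The sketch below is a heuristic, not part of the project: `looks_like_empty_shell` and the 50-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count characters of text outside non-visible elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self.chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def looks_like_empty_shell(html: str, min_chars: int = 50) -> bool:
    """Return True when the HTML carries almost no visible text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars < min_chars
```

A raw SPA response such as `<div id="root"></div>` plus a script bundle is flagged as a shell, while the same page after rendering usually is not.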

Is GPU required?

No. Z-Web-Scraper runs entirely on CPU. Chromium doesn't need GPU acceleration for scraping.

Can I scrape multiple URLs?

Yes! Run Step 2 multiple times with different URLs. All files accumulate in the output directory and can be exported together in Step 3.

How big can the HTML be?

There's no hard limit. Very large pages (10+ MB HTML) may slow down the preview. The full HTML is always saved to file regardless of size.

🐛 Troubleshooting

Problem	Cause	Solution
`Chromium not found`	Runtime restarted	Re-run Cell 1
`TimeoutException`	Page too slow	Increase `timeout` to 60–120s
Empty HTML	JS not settled	Increase `wait_after_load` to 5–10s
Missing lazy content	Scroll not triggered	Enable `auto_scroll`
`WebDriverException`	Driver crash	Restart runtime, re-run Cell 1
`ConnectionRefused`	Site blocks bots	Try different URL or add delay
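
The "increase `timeout`" advice can also be automated. Below is a small sketch with a hypothetical `scrape_with_backoff` wrapper; `scrape` stands for any callable that accepts `(url, timeout=...)`. Selenium's `TimeoutException` is not a subclass of Python's built-in `TimeoutError`, so pass it through `retry_on` when wrapping real Selenium code.

```python
from typing import Callable


def scrape_with_backoff(
    scrape: Callable[..., str],
    url: str,
    timeout: int = 30,
    max_timeout: int = 120,
    retry_on: tuple = (TimeoutError,),
) -> str:
    """Retry scrape(url, timeout=...) with a doubled timeout after each failure."""
    while True:
        try:
            return scrape(url, timeout=timeout)
        except retry_on:
            if timeout >= max_timeout:
                raise  # already at the ceiling, so give up
            timeout = min(timeout * 2, max_timeout)
```

Starting from the default of 30 seconds, the attempts run with 30, 60, and 120 seconds before the error is re-raised.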

🙏 Acknowledgements

🛠️ Tools

Selenium — Browser automation
BeautifulSoup — HTML parsing
Chromium — Browser engine
Google Colab — Free cloud runtime

📚 Libraries

lxml — Fast XML/HTML parser
requests — HTTP library
PyCapsule Render — README header

🤝 Contributing

Contributions are welcome!

📜 License

This project is licensed under the MIT License.

Free to use, modify, and distribute — see the LICENSE file for details.

