A powerful asynchronous website crawler and link checker that helps you identify broken links and orphaned pages and analyze your website's link structure.
Created by Farhan Ansari
- 🔄 Asynchronous crawling for faster performance
- 🌐 Cross-platform support (Windows, macOS, Linux)
- 🎨 Beautiful terminal output with color coding
- 📊 Link analysis and reporting
- 🔍 Smart caching system for efficient crawling
- 🛡️ Rate limiting and robots.txt compliance
- 📝 CSV reports for broken and all links
- 🔒 SSL/TLS support
- 🎯 Configurable crawl depth and page limits
- Python 3.8 or higher
- pip (Python package installer)
- Clone the repository:

```bash
git clone https://github.com/fxrhan/LinkGuardian.git
cd LinkGuardian
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Basic crawl:

```bash
python linkcheck.py --url https://example.com
```

Crawl with custom options:

```bash
python linkcheck.py --url https://example.com --workers 20 --rate 0.5 --max-pages 200 --max-depth 4 --timeout 15 --check-external --ignore-robots
```

| Argument | Default | Description |
|---|---|---|
| `--url` | (required) | Base URL to crawl (must start with `http://` or `https://`) |
| `--workers` | 10 | Number of concurrent workers |
| `--rate` | 0.5 | Seconds to wait between requests per worker |
| `--max-pages` | 100 | Maximum number of pages to crawl |
| `--max-depth` | 3 | Maximum crawl depth |
| `--timeout` | 30 | Request timeout in seconds |
| `--cache-dir` | | Custom directory for cache files |
| `--output-dir` | | Custom directory for CSV reports |
| `--ignore-robots` | | Ignore robots.txt rules |
| `--no-verify-ssl` | | Disable SSL certificate verification (for self-signed certs) |
| `--check-external` | | Verify external links via HEAD requests (not crawled further) |
| `--version` | | Print version and exit |

Because each worker waits `--rate` seconds between its own requests, total load on the target site peaks at roughly `workers / rate` requests per second (about 40 req/s in the example above).
The tool creates a `.linkguardian` directory in your home folder with the following structure:

```
~/.linkguardian/
├── cache/    # Cache files for each domain
├── logs/     # Log files
└── output/   # Crawl results
    └── {domain}_{timestamp}/
        ├── broken_links.csv
        └── all_links.csv
```
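When scripting against these outputs, a small helper can locate the most recent run for a domain. This is a convenience sketch, not part of the tool: the `latest_run` name is hypothetical, and it assumes the timestamp suffix sorts lexicographically in chronological order.

```python
from pathlib import Path
from typing import Optional

def latest_run(domain: str) -> Optional[Path]:
    """Return the newest {domain}_{timestamp} output directory, if any."""
    output_root = Path.home() / ".linkguardian" / "output"
    # Assumes a zero-padded timestamp suffix (e.g. 20240101_120000),
    # so lexicographic order matches chronological order.
    runs = sorted(output_root.glob(f"{domain}_*"))
    return runs[-1] if runs else None

run = latest_run("example.com")
if run is not None:
    print(run / "broken_links.csv")
```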
The tool implements a smart caching system (sketched after this list) that:
- Stores visited pages and checked links
- Handles JSON serialization of complex data types
- Automatically manages cache files per domain
- Preserves crawl progress between sessions
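A minimal sketch of what such a cache can look like, assuming one JSON file per domain (the file layout and key names are illustrative, not the tool's actual schema). Note the set-to-list conversion, since JSON has no native set type:

```python
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".linkguardian" / "cache"

def save_cache(domain: str, visited: set, checked: dict) -> None:
    """Persist crawl state; sets become sorted lists for JSON."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    state = {"visited": sorted(visited), "checked": checked}
    (CACHE_DIR / f"{domain}.json").write_text(json.dumps(state))

def load_cache(domain: str):
    """Restore crawl state from a previous session, if present."""
    path = CACHE_DIR / f"{domain}.json"
    if not path.exists():
        return set(), {}
    state = json.loads(path.read_text())
    return set(state["visited"]), state["checked"]
```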
The tool includes comprehensive error handling (illustrated after this list) for:
- Network connectivity issues
- SSL/TLS certificate problems
- Timeout errors
- HTTP errors
- JSON serialization errors
- Platform-specific path issues
- Keyboard interrupts
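As an illustration of per-request handling, assuming an `aiohttp`-based fetcher (the library choice and the `safe_fetch` name are assumptions, not the tool's actual code):

```python
import asyncio

import aiohttp

async def safe_fetch(session: aiohttp.ClientSession, url: str):
    """Fetch a URL and return (status, error_category)."""
    try:
        async with session.get(url) as resp:
            return resp.status, None
    except asyncio.TimeoutError:
        return None, "timeout"
    except aiohttp.ClientSSLError:          # certificate problems
        return None, "ssl"
    except aiohttp.ClientConnectionError:   # DNS failures, refused connections
        return None, "connection"
    except aiohttp.ClientError:             # any other client-side error
        return None, "http"
```

KeyboardInterrupt is deliberately left uncaught in a helper like this, so a top-level handler can flush the cache before exiting and preserve crawl progress.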
The `broken_links.csv` report lists every broken link found, with the following columns (see the reading example after the list):
- Broken Link URL
- Source Page URL
- Status Code
- Error Category
- Timestamp
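The report is plain CSV, so it can be post-processed with the standard library. The header spellings below are taken from the column list above and may differ slightly from the actual file:

```python
import csv

with open("broken_links.csv", newline="") as f:
    for row in csv.DictReader(f):
        # One line per broken link: "status: URL (found on source page)"
        print(f'{row["Status Code"]}: {row["Broken Link URL"]} '
              f'(found on {row["Source Page URL"]})')
```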
The `all_links.csv` report lists every discovered link, with the following columns (see the example after the list):
- Link URL
- Source Page URL
- Status Code
- Link Type (Internal/External)
- Depth
- Is Orphaned
- Timestamp
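The `Link Type` and `Is Orphaned` columns make it easy to pull out orphaned internal pages; the value encodings (`Internal`, `True`) are assumptions inferred from the column names:

```python
import csv

with open("all_links.csv", newline="") as f:
    orphans = [
        row["Link URL"]
        for row in csv.DictReader(f)
        if row["Link Type"] == "Internal" and row["Is Orphaned"] == "True"
    ]
print(f"Found {len(orphans)} orphaned internal pages")
```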
The tool categorizes errors into the following types (one possible mapping is sketched after the list):
- Connection errors
- Timeout errors
- SSL/TLS errors
- HTTP errors
- Parsing errors
- Validation errors
- Unknown errors
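One straightforward way to implement such a taxonomy is an ordered mapping from exception type to category, with an unknown fallback. The concrete exception classes below are illustrative stand-ins, not the tool's actual list:

```python
import asyncio
import ssl

# Ordered most-specific first; the first matching type wins.
ERROR_CATEGORIES = [
    (asyncio.TimeoutError, "timeout"),
    (ssl.SSLError, "ssl"),
    (ConnectionError, "connection"),
    (ValueError, "validation"),
]

def classify_error(exc: BaseException) -> str:
    """Map an exception to one of the report's error categories."""
    for exc_type, category in ERROR_CATEGORIES:
        if isinstance(exc, exc_type):
            return category
    return "unknown"
```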
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you encounter any issues or have questions, please open an issue on the GitHub repository.