A simple Node.js web crawler that visits pages on one website, finds links, and counts how many times each page is discovered.
- Starts from one URL, then follows links found on the same hostname only.
- Skips pages on other domains.
- Skips resources that are not HTML pages.
- Keeps a tally of each normalized page URL it finds.
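The scoping rules above can be sketched with two small predicates. This is a minimal illustration, not the project's code: the helper names `isSameHost` and `isHtmlContentType` are hypothetical, and detecting non-HTML resources through the `Content-Type` response header is an assumption about how such a check is usually done.

```javascript
// Hypothetical helper: true when both URLs share exactly the same hostname.
function isSameHost(baseURL, candidateURL) {
  return new URL(baseURL).hostname === new URL(candidateURL).hostname;
}

// Hypothetical helper: true when a Content-Type header value describes an HTML page.
function isHtmlContentType(contentType) {
  if (typeof contentType !== "string") return false;
  // The header may carry parameters, e.g. "text/html; charset=utf-8".
  return contentType.split(";")[0].trim().toLowerCase() === "text/html";
}

console.log(isSameHost("https://example.com", "https://example.com/about")); // true
console.log(isSameHost("https://example.com", "https://blog.example.com/")); // false
console.log(isHtmlContentType("text/html; charset=utf-8")); // true
console.log(isHtmlContentType("image/png")); // false
```

Comparing hostnames exactly means a subdomain such as `blog.example.com` counts as a different site in this sketch.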
This is not a full search engine crawler — it is a basic tool for exploring a single website and collecting the set of discovered URLs.
- `main.js` reads a website URL from the command line.
- `crawl.js` visits the starting page.
- It parses the page HTML with `jsdom` and extracts all `<a href="...">` links.
- It converts relative links into absolute URLs using the base site URL.
- It recursively visits each same-site link and counts each page.
- It avoids revisiting the same normalized URL repeatedly.
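The traversal described in these steps can be sketched offline. The example below is an assumption-laden illustration rather than the code in `crawl.js`: the network fetch and the `jsdom` parsing are replaced by a lookup in an in-memory object (`fakeSite`), the traversal is synchronous for brevity, and the names `pageKey`, `linksOn`, and `crawlSketch` are hypothetical.

```javascript
// A tiny in-memory "website": each normalized page maps to the href values on it.
// It stands in for fetching a page and parsing its HTML.
const fakeSite = {
  "example.com": ["/about", "/blog/", "https://other.org/"],
  "example.com/about": ["/", "/blog"],
  "example.com/blog": ["/about"],
};

// Hypothetical helper: hostname plus path, without protocol or trailing slash.
function pageKey(url) {
  const u = new URL(url);
  const path = u.pathname.endsWith("/") ? u.pathname.slice(0, -1) : u.pathname;
  return u.hostname + path;
}

// Hypothetical stand-in for "fetch the page and extract its links".
function linksOn(url) {
  return fakeSite[pageKey(url)] ?? [];
}

function crawlSketch(baseURL, currentURL, pages) {
  // Skip anything on a different hostname.
  if (new URL(currentURL).hostname !== new URL(baseURL).hostname) return pages;

  const key = pageKey(currentURL);
  if (key in pages) {
    pages[key] += 1; // seen before: count it, but do not visit it again
    return pages;
  }
  pages[key] = 1;

  for (const href of linksOn(currentURL)) {
    // Relative hrefs are resolved against the base URL; absolute ones pass through.
    const absolute = new URL(href, baseURL).href;
    crawlSketch(baseURL, absolute, pages);
  }
  return pages;
}

const pages = crawlSketch("https://example.com", "https://example.com", {});
console.log(pages);
```

In this sketch the count goes up every time a page is discovered, while its links are followed only on the first visit. A real crawler would await a network request where `linksOn` is called.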
Install dependencies and run the crawler:
npm install
node main.js https://example.com
Expected behavior:
- The crawler begins at `https://example.com`.
- It visits only pages on `example.com`.
- It prints each discovered page URL and how many times it was counted.
- It saves the crawl results into `pages.json`.
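The reporting and saving steps might look roughly like the sketch below. The exact print format and the layout of `pages.json` are not specified here, so both are assumptions: the tally is taken to be a flat object mapping each normalized URL to its count, and `formatReport` is a hypothetical helper. The file is written to a temporary directory so that running the sketch does not overwrite a real `pages.json`.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Example tally in the assumed shape: normalized URL -> count.
const tally = { "example.com/about": 1, "example.com": 2 };

// Hypothetical report helper: one line per page, highest count first.
function formatReport(pages) {
  return Object.entries(pages)
    .sort((a, b) => b[1] - a[1])
    .map(([url, count]) => `${url}: ${count}`);
}

for (const line of formatReport(tally)) console.log(line);

// Save the tally as pretty-printed JSON (temp directory used for this sketch only).
const outFile = path.join(os.tmpdir(), "pages.json");
fs.writeFileSync(outFile, JSON.stringify(tally, null, 2));
```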
After running the crawler, use the Python script ReadResult_csv.py to read pages.json and write the data into pages.csv.
Example steps:
node main.js https://example.com
python ReadResult_csv.py
This preserves the crawl output in CSV format for later analysis.
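The contents of `ReadResult_csv.py` are not reproduced here. To illustrate the kind of JSON-to-CSV conversion involved, the sketch below does the equivalent step in JavaScript, matching the rest of the project. The column names `url` and `count`, the flat object shape of the input, and the helper name `pagesToCsv` are all assumptions.

```javascript
// Hypothetical helper: turn a { url: count } object into CSV text.
function pagesToCsv(pages) {
  // Quote a field only when it contains a comma, a quote, or a line break.
  const quote = (value) => {
    const text = String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const rows = Object.entries(pages).map(
    ([url, count]) => `${quote(url)},${quote(count)}`
  );
  return ["url,count", ...rows].join("\n") + "\n";
}

const csv = pagesToCsv({ "example.com": 2, "example.com/a,b": 1 });
console.log(csv);
```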
Files:
- `main.js`: starts the crawl and outputs results
- `crawl.js`: crawler logic, HTML link extraction, and URL normalization
- `crawl.test.js`: tests for URL normalization and link extraction
Functions:
- `crawlpage(baseURL, currentURL, pages)`: recursively crawls pages and updates the `pages` counts
- `gitURLfromHTML(htmlBody, baseURL)`: extracts absolute and relative links from HTML
- `NormalizeURL(URLstring)`: normalizes URLs by removing the protocol, the trailing slash, and case differences in the hostname
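The normalization rule can be illustrated with a short sketch. It is a hypothetical re-implementation (`normalizeSketch`), not the project's `NormalizeURL`, and it assumes that query strings and fragments are dropped, which the description does not state either way.

```javascript
// Hypothetical re-implementation of the described rule:
// drop the protocol, drop a trailing slash, ignore hostname case.
function normalizeSketch(urlString) {
  const url = new URL(urlString); // the URL parser lowercases the hostname
  const hostAndPath = `${url.hostname}${url.pathname}`;
  return hostAndPath.endsWith("/") ? hostAndPath.slice(0, -1) : hostAndPath;
}

const variants = [
  "https://Example.com/path/",
  "http://example.com/path",
  "https://EXAMPLE.com/path",
];
console.log(variants.map(normalizeSketch));
```

All three variants map to `example.com/path`, which is what lets the crawler treat them as one page when counting.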
Use this project to test basic crawling logic, learn how to parse links in HTML, or inspect how many unique pages a website exposes on one domain.