skullllllly/http-project-webcrawl

HTTP Project Web Crawler

A simple Node.js web crawler that visits pages on one website, finds links, and counts how many times each page is discovered.

What this project does

  • Starts from one URL, then follows links found on the same hostname only.
  • Skips pages on other domains.
  • Skips resources that are not HTML pages.
  • Keeps a tally of each normalized page URL it finds.

This is not a full search-engine crawler. It is a basic tool for exploring a single website and collecting the set of URLs discovered on it.

How it works

  1. main.js reads a website URL from the command line.
  2. crawl.js visits the starting page.
  3. It parses the page HTML with jsdom and extracts all <a href="..."> links.
  4. It converts relative links into absolute URLs using the base site URL.
  5. It recursively visits each same-site link and counts each page.
  6. It avoids revisiting the same normalized URL repeatedly.
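
The control flow in steps 2 to 6 can be sketched roughly as follows. This is a minimal illustration, not the code in crawl.js: crawlSketch, normalize, and getHrefs are hypothetical names, and getHrefs is a synchronous stand-in for "fetch the page and extract its href values", so the sketch runs without network access or jsdom.

```javascript
// Hypothetical helper: reduce a URL to "host/path" so duplicates collapse.
function normalize(urlString) {
  const { hostname, pathname } = new URL(urlString);
  const hostPath = hostname + pathname;
  return hostPath.endsWith("/") ? hostPath.slice(0, -1) : hostPath;
}

// getHrefs(url) is an assumed stand-in that returns the raw href values of a page.
function crawlSketch(baseURL, currentURL, pages, getHrefs) {
  // Skip anything that is not on the starting hostname.
  if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
    return pages;
  }

  // Already seen: bump the count, but do not visit the page again.
  const key = normalize(currentURL);
  if (pages[key] > 0) {
    pages[key] += 1;
    return pages;
  }
  pages[key] = 1;

  for (const href of getHrefs(currentURL)) {
    let nextURL;
    try {
      // Relative hrefs are resolved against the base site URL.
      nextURL = new URL(href, baseURL).href;
    } catch {
      continue; // unparsable href: ignore it
    }
    crawlSketch(baseURL, nextURL, pages, getHrefs);
  }
  return pages;
}

// Usage with a tiny in-memory "site" instead of real HTTP requests.
const site = {
  "https://example.com/": ["/about", "https://other.org/x", "/about/"],
  "https://example.com/about": ["/", "mailto:hi@example.com"],
};
const counts = crawlSketch(
  "https://example.com/",
  "https://example.com/",
  {},
  (url) => site[url] || []
);
console.log(counts);
```

Counting before the early return is what lets a page be visited once while still recording every time a link to it is discovered.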

Install

npm install

Usage

node main.js https://example.com

Expected behavior:

  • The crawler begins at https://example.com
  • It visits only pages on example.com
  • It prints each discovered page URL and how many times it was counted
  • It saves the crawl results into pages.json
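
As a rough illustration of the last two points, the sketch below formats a report and serializes the results. It assumes the results are a plain object mapping each normalized URL to its count; the actual layout of pages.json is not documented here, and formatReport and serializePages are hypothetical names.

```javascript
// Assumed shape: { "example.com/about": 3, ... } (normalized URL -> count).
function formatReport(pages) {
  return Object.entries(pages)
    .sort(([, a], [, b]) => b - a) // most frequently discovered first
    .map(([url, count]) => `${count}\t${url}`);
}

// Produce the JSON text that could be written to a results file.
function serializePages(pages) {
  return JSON.stringify(pages, null, 2);
}

const examplePages = { "example.com": 4, "example.com/about": 7 };
for (const line of formatReport(examplePages)) {
  console.log(line);
}
console.log(serializePages(examplePages));
```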

Python result reader

After running the crawler, use the Python script ReadResult_csv.py to read pages.json and write the data into pages.csv.

Example steps:

node main.js https://example.com
python ReadResult_csv.py

This allows you to preserve the crawl output in CSV format for later analysis.

Files

  • main.js — starts the crawl and outputs results
  • crawl.js — crawler logic, HTML link extraction, and URL normalization
  • crawl.test.js — tests for URL normalization and link extraction

Key functions

  • crawlpage(baseURL, currentURL, pages) — recursively crawls pages and updates pages counts
  • gitURLfromHTML(htmlBody, baseURL) — extracts absolute and relative links from HTML
  • NormalizeURL(URLstring) — normalizes URLs by removing protocol, trailing slash, and case differences in hostname
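
The normalization and link-resolution behavior described above can be approximated with Node's built-in URL class. The sketch below is an illustration under assumptions, not the project's implementation: normalizeSketch and resolveHrefs are hypothetical names, and dropping the query string and fragment is a simplification chosen here.

```javascript
// Illustrative stand-in for NormalizeURL: drop the protocol and any trailing
// slash. The URL parser already lowercases the hostname; the path keeps its case.
function normalizeSketch(urlString) {
  const { hostname, pathname } = new URL(urlString);
  const hostPath = hostname + pathname;
  return hostPath.endsWith("/") ? hostPath.slice(0, -1) : hostPath;
}

// Illustrative stand-in for the resolution step: turn raw href values into
// absolute URLs against the base, skipping any that fail to parse.
function resolveHrefs(hrefs, baseURL) {
  const urls = [];
  for (const href of hrefs) {
    try {
      urls.push(new URL(href, baseURL).href);
    } catch {
      // Invalid href: leave it out of the result.
    }
  }
  return urls;
}

console.log(normalizeSketch("https://Example.com/Path/"));
console.log(resolveHrefs(["/a", "https://example.com/b"], "https://example.com"));
```

With this approach, http://example.com/Path and https://EXAMPLE.com/Path/ map to the same key, so they are counted as one page.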

Why this is useful

Use this project to test basic crawling logic, learn how to parse links in HTML, or inspect how many unique pages a website exposes on one domain.
