HTTP Project Web Crawler

A simple Node.js web crawler that visits pages on one website, finds links, and counts how many times each page is discovered.

What this project does

Starts from one URL, then follows links found on the same hostname only.
Skips pages on other domains.
Skips resources that are not HTML pages.
Keeps a tally of each normalized page URL it finds.
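
The same-hostname rule and the tally can be illustrated with a minimal sketch. This is not the project's code: isSameHost and tally are hypothetical helper names, and the keys are plain URL strings rather than normalized ones.

```javascript
// Hypothetical helper: true when both URLs share the same hostname.
function isSameHost(baseURL, candidateURL) {
  return new URL(baseURL).hostname === new URL(candidateURL).hostname;
}

// Hypothetical helper: increment the count stored for a page key.
function tally(pages, key) {
  pages[key] = (pages[key] ?? 0) + 1;
  return pages;
}

const pages = {};
const found = [
  "https://example.com/about",
  "https://example.com/about",
  "https://other.example.org/",
];

for (const url of found) {
  // Links on other hostnames are skipped entirely.
  if (isSameHost("https://example.com", url)) {
    tally(pages, url);
  }
}

console.log(pages);
```

Note that a strict hostname comparison treats subdomains such as blog.example.com as a different site.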

This is not a full search-engine crawler. It is a basic tool for exploring a single website and collecting the set of discovered URLs.

How it works

main.js reads a website URL from the command line.
crawl.js visits the starting page.
It parses the page HTML with jsdom and extracts all <a href="..."> links.
It converts relative links into absolute URLs using the base site URL.
It recursively visits each same-site link and counts each page.
It fetches each normalized URL only once; finding the same URL again only increases its count.
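
The steps above can be sketched without any network access. In this illustration, fakeSite is an in-memory stand-in for real pages, and toAbsolute and crawlSketch are hypothetical names; the actual crawler fetches pages over HTTP, extracts links with jsdom, and keys its counts by normalized URL, all of which are left out here.

```javascript
// Assumption for illustration: page URL -> hrefs found on that page.
const fakeSite = {
  "https://example.com/": ["/about", "https://example.com/contact", "https://elsewhere.test/"],
  "https://example.com/about": ["/", "/contact"],
  "https://example.com/contact": ["/about"],
};

// Resolve a relative or absolute href against the base site URL.
function toAbsolute(href, baseURL) {
  return new URL(href, baseURL).href;
}

// Recursive crawl sketch: counts every discovery, visits each page once.
function crawlSketch(baseURL, currentURL, pages) {
  // Skip pages on other hostnames.
  if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
    return pages;
  }
  // Already visited: count the discovery but do not visit again.
  if (pages[currentURL] !== undefined) {
    pages[currentURL] += 1;
    return pages;
  }
  pages[currentURL] = 1;
  // Follow every link found on this page.
  for (const href of fakeSite[currentURL] ?? []) {
    crawlSketch(baseURL, toAbsolute(href, baseURL), pages);
  }
  return pages;
}

const result = crawlSketch("https://example.com", "https://example.com/", {});
console.log(result);
```

The early return for already-seen URLs is what keeps the recursion from looping forever on pages that link to each other.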

Install

npm install

Usage

node main.js https://example.com

Expected behavior:

The crawler begins at https://example.com
It visits only pages on example.com
It prints each discovered page URL and how many times it was counted
It saves the crawl results into pages.json
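
As a rough sketch of the reporting step, the snippet below assumes the crawl result is a plain object that maps each normalized URL to its count. The exact print format and the layout of pages.json are not specified here, so formatReport and samplePages are hypothetical.

```javascript
// Assumed shape of the crawl result: normalized URL -> count.
const samplePages = {
  "example.com": 5,
  "example.com/about": 3,
  "example.com/contact": 1,
};

// Hypothetical helper: one line per page, most frequently found first.
function formatReport(pages) {
  return Object.entries(pages)
    .sort((a, b) => b[1] - a[1])
    .map(([url, count]) => `${count}\t${url}`);
}

for (const line of formatReport(samplePages)) {
  console.log(line);
}

// Serialized form that could be written to pages.json,
// for example with fs.writeFileSync("pages.json", json).
const json = JSON.stringify(samplePages, null, 2);
```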

Python result reader

After running the crawler, use the Python script ReadResult_csv.py to read pages.json and write the data into pages.csv.

Example steps:

node main.js https://example.com
python ReadResult_csv.py

This allows you to preserve the crawl output in CSV format for later analysis.

Files

main.js: starts the crawl and outputs results
crawl.js: crawler logic, HTML link extraction, and URL normalization
crawl.test.js: tests for URL normalization and link extraction
ReadResult_csv.py: reads pages.json and writes the data into pages.csv

Key functions

crawlpage(baseURL, currentURL, pages): recursively crawls same-site pages and updates the per-URL counts in pages
gitURLfromHTML(htmlBody, baseURL): extracts the links in an HTML body and resolves relative ones against baseURL
NormalizeURL(URLstring): normalizes a URL by removing the protocol and any trailing slash and by lowercasing the hostname
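
The normalization described above might look roughly like the sketch below. It uses the built-in URL class and is an assumption about the approach, not the code in crawl.js; normalizeURLSketch is a hypothetical name, and details such as query strings and fragments are simply dropped.

```javascript
// Hypothetical sketch: reduce a URL to "host/path" so that protocol,
// trailing-slash, and hostname-case variants share one key.
function normalizeURLSketch(urlString) {
  const url = new URL(urlString);
  // The URL parser already lowercases the host for http(s) URLs.
  let path = url.pathname;
  if (path.endsWith("/")) {
    path = path.slice(0, -1);
  }
  // url.host keeps a non-default port, if one is present.
  return `${url.host}${path}`;
}

console.log(normalizeURLSketch("https://EXAMPLE.com/Path/"));
console.log(normalizeURLSketch("http://example.com/path/") ===
  normalizeURLSketch("https://example.com/path"));
```

Only the hostname is lowercased; the path keeps its case, because paths can be case-sensitive on the server.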

Why this is useful

Use this project to test basic crawling logic, learn how to parse links in HTML, or inspect how many unique pages a website exposes on one domain.
