This project implements a web crawler that extracts URLs from web pages, follows the links it finds, and stores data on each visited page in a Neo4j graph database. It uses the Jsoup library for HTML parsing and Neo4j for graph storage.
Purpose: A new folder was created to organize the web crawling functionality, containing the core classes for crawling and benchmarking.
This class is the main component of the crawler and is responsible for initiating the crawling process, handling URL extraction, and storing crawled data in the Neo4j database.
This class was added to benchmark the performance of the web crawler, comparing different crawling strategies (e.g., different queue sizes, depth settings) to evaluate the efficiency of the crawler.
This test class mocks interactions with Neo4j to verify that URLs are processed correctly and that the expected methods are called.
The WebCrawler class is the core of the project. It manages the crawling process, starting from a given URL and recursively crawling up to a specified depth. It uses a priority queue to order pending URLs by a priority score, which is calculated from the URL's content.
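The interplay of the priority queue and the depth limit can be illustrated with a minimal sketch. It is not the project's WebCrawler: the class `CrawlFrontierSketch`, the `Entry` type, and the `links` map (which stands in for fetching and parsing a page) are all hypothetical, and the crawl is written as a loop over the queue rather than as recursion. A lower score is assumed to mean "crawl sooner".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.ToIntFunction;

// Illustrative sketch only: names are hypothetical, not the project's API.
class CrawlFrontierSketch {

    // A pending URL with the depth at which it was found and its priority score.
    static final class Entry {
        final String url;
        final int depth;
        final int priority;

        Entry(String url, int depth, int priority) {
            this.url = url;
            this.depth = depth;
            this.priority = priority;
        }
    }

    // Visits pages in priority order (lowest score first) and never expands
    // a page found at maxDepth. 'links' maps a URL to its outgoing links and
    // replaces real fetching and parsing for the purposes of this sketch.
    static List<String> crawl(String startUrl, int maxDepth,
                              Map<String, List<String>> links,
                              ToIntFunction<String> priorityOf) {
        Comparator<Entry> byPriority = Comparator.comparingInt(e -> e.priority);
        PriorityQueue<Entry> frontier = new PriorityQueue<>(byPriority);
        Set<String> visited = new HashSet<>();
        List<String> visitOrder = new ArrayList<>();

        frontier.add(new Entry(startUrl, 0, priorityOf.applyAsInt(startUrl)));
        while (!frontier.isEmpty()) {
            Entry current = frontier.poll();
            if (!visited.add(current.url)) {
                continue; // already processed via another path
            }
            visitOrder.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String next : links.getOrDefault(current.url, List.of())) {
                if (!visited.contains(next)) {
                    frontier.add(new Entry(next, current.depth + 1,
                            priorityOf.applyAsInt(next)));
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "start", List.of("a", "b", "c"),
                "a", List.of("d"));
        Map<String, Integer> scores =
                Map.of("start", 5, "a", 10, "b", 1, "c", 5, "d", 2);
        // prints [start, b, c, a]
        System.out.println(crawl("start", 1, links, url -> scores.get(url)));
    }
}
```

Entries with equal scores come out of a `PriorityQueue` in no guaranteed order, so a real crawler that needs a stable order would add a tie-breaker such as an insertion counter.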
4. Links are extracted using the Jsoup library and filtered to exclude URLs that are invalid or have already been visited.
5. URL data is saved in a Neo4j graph database, including URL, depth, and in-degree (number of incoming links).
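The filtering in step 4 might look roughly like the sketch below. It uses only the Java standard library and assumes the raw `href` values have already been pulled out of the page (in the crawler that part is Jsoup's job). The class `LinkFilterSketch` and its definition of "invalid" (malformed, or not http/https) are assumptions made for illustration; the project may apply different rules.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names and the notion of "invalid" are assumptions.
class LinkFilterSketch {

    // Resolves each href against baseUrl and keeps absolute http(s) links
    // that parse cleanly and are not in 'visited'. Duplicates are dropped.
    static List<String> filterLinks(String baseUrl, List<String> hrefs,
                                    Set<String> visited) {
        URI base = URI.create(baseUrl);
        Set<String> accepted = new LinkedHashSet<>(); // keeps first-seen order
        for (String href : hrefs) {
            if (href == null || href.isBlank()) {
                continue;
            }
            URI resolved;
            try {
                resolved = base.resolve(new URI(href.trim()));
            } catch (URISyntaxException e) {
                continue; // malformed link
            }
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http")
                         || scheme.equalsIgnoreCase("https"))) {
                continue; // e.g. mailto: links
            }
            // Drop the fragment so page.html#top and page.html count as one page.
            String url = resolved.toString();
            int hash = url.indexOf('#');
            if (hash >= 0) {
                url = url.substring(0, hash);
            }
            if (!visited.contains(url)) {
                accepted.add(url);
            }
        }
        return new ArrayList<>(accepted);
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("page.html", "/grad", "mailto:x@example.edu",
                "https://example.edu/seen", "page.html#top");
        Set<String> visited = Set.of("https://example.edu/seen");
        System.out.println(
                filterLinks("https://example.edu/a/index.html", hrefs, visited));
    }
}
```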
When a URL is processed and links are found, the method saveLinkToGraph() is invoked to create a relationship between the "from" URL and the "to" URL.
Each time a relationship is added, the in-degree for the target page (the "to" URL) is incremented by 1 in the database.
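The in-degree bookkeeping can be mimicked in memory to make the rule concrete. The sketch below is not the project's Neo4j code: it replaces the database with two maps, and the class `InDegreeSketch` and the helper `inDegreeOf` are hypothetical. It also assumes that a repeated link between the same two pages does not count as a newly added relationship, which the description above leaves open.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative in-memory stand-in for the graph; not the project's Neo4j code.
class InDegreeSketch {

    private final Map<String, Set<String>> outgoing = new HashMap<>();
    private final Map<String, Integer> inDegree = new HashMap<>();

    // Records the relationship fromUrl -> toUrl. The target's in-degree is
    // incremented only when the relationship is new (an assumption).
    boolean saveLinkToGraph(String fromUrl, String toUrl) {
        boolean added = outgoing
                .computeIfAbsent(fromUrl, key -> new HashSet<>())
                .add(toUrl);
        if (added) {
            inDegree.merge(toUrl, 1, Integer::sum);
        }
        return added;
    }

    // Number of distinct pages recorded as linking to 'url'.
    int inDegreeOf(String url) {
        return inDegree.getOrDefault(url, 0);
    }

    public static void main(String[] args) {
        InDegreeSketch graph = new InDegreeSketch();
        graph.saveLinkToGraph("https://example.edu/a", "https://example.edu/target");
        graph.saveLinkToGraph("https://example.edu/b", "https://example.edu/target");
        graph.saveLinkToGraph("https://example.edu/a", "https://example.edu/target");
        // prints 2: the repeated link from /a is not counted twice
        System.out.println(graph.inDegreeOf("https://example.edu/target"));
    }
}
```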
To prioritize certain URLs over others while crawling, the WebCrawler class uses a heuristic.
Priority Calculation: URLs are assigned a priority score based on keywords present in the URL or its content.
Priority 1 (Highest Priority): URLs that contain .edu and the keyword “graduate”.
Priority 2 (Mid Priority): URLs related to admissions for Boston universities.
Priority 10 (Low Priority): URLs that are less important (e.g., help or policy pages).
Default Priority 5: For all other URLs.
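The four tiers above can be expressed as a small scoring function. The sketch below is one possible reading, not the project's implementation: the class `UrlPrioritySketch` is hypothetical, it inspects only the URL string (not page content), it treats "admissions for Boston universities" as a URL containing both "admissions" and "boston", and it checks the tiers in the order listed. All of these are assumptions.

```java
import java.util.Locale;

// Illustrative sketch only: keyword matching rules here are assumptions.
class UrlPrioritySketch {

    static final int HIGHEST = 1;
    static final int MID = 2;
    static final int DEFAULT = 5;
    static final int LOW = 10;

    // Returns a priority score for the URL; a lower score means crawl sooner.
    static int priorityOf(String url) {
        String u = url.toLowerCase(Locale.ROOT);
        if (u.contains(".edu") && u.contains("graduate")) {
            return HIGHEST;
        }
        if (u.contains("admissions") && u.contains("boston")) {
            return MID; // assumed test for "Boston universities"
        }
        if (u.contains("help") || u.contains("policy")) {
            return LOW;
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(priorityOf("https://www.example.edu/graduate/programs")); // prints 1
        System.out.println(priorityOf("https://www.example.com/help"));              // prints 10
        System.out.println(priorityOf("https://www.example.com/news"));              // prints 5
    }
}
```

Plain substring matching is deliberately crude: for example, a URL containing "undergraduate" on an .edu host would also score 1. A stricter version could match on path segments instead.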