The project is designed to crawl the web starting from three seed URLs:
The crawler handles multiple requests in parallel using asynchronous operations and a thread-safe priority queue. It stores crawled URLs and the links between them in a Neo4j graph database, so the structure and connections of the crawled pages can be indexed and analyzed.
- Asynchronous Operations: Utilizes Java's CompletableFuture to perform multiple web requests in parallel, ensuring efficient resource usage and faster crawling.
- Priority-Based URL Processing: Manages URLs using a PriorityBlockingQueue, processing more relevant or important URLs first based on a custom heuristic.
- Thread-Safe Operations: Utilizes concurrent data structures like PriorityBlockingQueue and ConcurrentHashMap for safe operations in a multi-threaded environment.
- Robust Error Handling: Logs any errors encountered during the crawling process using log4j, ensuring the application does not terminate unexpectedly.
- Neo4j Integration: Stores URLs and their relationships in a Neo4j graph database for efficient querying and analysis of the web structure.
- Scalability and Extensibility: Designed to handle a growing number of URLs and web pages, with a modular architecture that allows for easy feature extensions and heuristic updates.
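The features listed above can be illustrated with a minimal sketch. It is not the project's actual code: the class name `CrawlSketch`, the path-depth heuristic in `score`, the batch size of 4, and the in-memory `fakeWeb` map (used instead of real HTTP fetches) are all assumptions made for illustration. It shows a `PriorityBlockingQueue` ordering URLs by a score, a concurrent set tracking visited URLs, and `CompletableFuture` running fetches in parallel.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;

class CrawlSketch {
    /** A URL with its priority; a lower score is polled first. */
    static final class ScoredUrl implements Comparable<ScoredUrl> {
        final String url;
        final int score;

        ScoredUrl(String url, int score) {
            this.url = url;
            this.score = score;
        }

        @Override
        public int compareTo(ScoredUrl other) {
            return Integer.compare(score, other.score);
        }
    }

    /** Hypothetical heuristic: count path segments, so shallow pages rank first. */
    static int score(String url) {
        String path = URI.create(url).getPath();
        int depth = 0;
        if (path != null) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) depth++;
            }
        }
        return depth;
    }

    /** Crawls an in-memory stand-in for the web; returns URLs in dequeue order. */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> fakeWeb, int maxPages) {
        PriorityBlockingQueue<ScoredUrl> frontier = new PriorityBlockingQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe "seen" set
        List<String> order = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (String seed : seeds) {
                if (visited.add(seed)) frontier.add(new ScoredUrl(seed, score(seed)));
            }
            while (!frontier.isEmpty() && order.size() < maxPages) {
                // Take up to four of the best-scored URLs and fetch them in parallel.
                List<CompletableFuture<List<String>>> inFlight = new ArrayList<>();
                for (int i = 0; i < 4 && order.size() < maxPages; i++) {
                    ScoredUrl next = frontier.poll();
                    if (next == null) break;
                    order.add(next.url);
                    inFlight.add(CompletableFuture
                            .supplyAsync(() -> fakeWeb.getOrDefault(next.url, List.of()), pool)
                            .exceptionally(ex -> List.of())); // a failed fetch yields no links
                }
                // Wait for the batch, then enqueue links that have not been seen yet.
                for (CompletableFuture<List<String>> fetch : inFlight) {
                    for (String link : fetch.join()) {
                        if (visited.add(link)) frontier.add(new ScoredUrl(link, score(link)));
                    }
                }
            }
        } finally {
            pool.shutdown();
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fakeWeb = Map.of(
                "https://example.com", List.of("https://example.com/a/b", "https://example.com/a"),
                "https://example.com/a", List.of("https://example.com/a/b"));
        System.out.println(crawl(List.of("https://example.com"), fakeWeb, 10));
    }
}
```

In a real crawler the `fakeWeb` lookup would be replaced by an HTTP fetch plus link extraction, and the `exceptionally` branch is where a failed request could be logged instead of silently dropped.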
The web crawler integrates with Neo4j to store the URLs and their relationships as a graph. Each URL is a node, and each link between pages is represented as a relationship between those nodes. This integration allows us to easily visualize and analyze the interconnections between crawled pages. Using Neo4j, we can efficiently perform graph-based queries to explore the relationships between pages and gain insights into the structure of the web.
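As a rough sketch of the graph model described above, the following class holds two Cypher statements and a helper that builds their parameters. The `Page` label, the `LINKS_TO` relationship type, the `url` property, and the class name `LinkGraphSketch` are assumptions for illustration; the project's actual schema may use different names. The statements are plain strings that could be executed through any Neo4j client.

```java
import java.util.Map;

class LinkGraphSketch {
    // Hypothetical schema: one Page node per URL, one LINKS_TO relationship per link.
    // MERGE creates the node or relationship only if it does not already exist.
    static final String MERGE_LINK =
            "MERGE (a:Page {url: $from}) "
          + "MERGE (b:Page {url: $to}) "
          + "MERGE (a)-[:LINKS_TO]->(b)";

    // Lists the pages that a given page links to.
    static final String OUTGOING_LINKS =
            "MATCH (:Page {url: $url})-[:LINKS_TO]->(target:Page) "
          + "RETURN target.url AS url";

    /** Builds the parameter map for MERGE_LINK. */
    static Map<String, Object> linkParams(String from, String to) {
        return Map.of("from", from, "to", to);
    }

    public static void main(String[] args) {
        System.out.println(MERGE_LINK);
        System.out.println(linkParams("https://example.com", "https://example.com/about"));
    }
}
```

Passing URLs as parameters rather than concatenating them into the query text avoids quoting problems with URLs that contain special characters.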
To interact with the Web Crawler API, you can use the Swagger UI, which provides a user-friendly interface to explore and test all available endpoints. Swagger generates API documentation and allows you to send HTTP requests directly from the browser.
You can access the Swagger UI by navigating to the following link:
http://localhost:8080/swagger-ui/index.html#/
- API Documentation: It automatically generates documentation for the Web Crawler API, listing all available endpoints, their descriptions, and the HTTP methods they support (GET, POST, etc.).
- Interactive Interface: Swagger allows you to interact with the API by providing input data for each endpoint and seeing the response directly within the UI.
- OpenAPI Specification: The API follows the OpenAPI specification, providing structured and standardized API documentation that is both human-readable and machine-readable.
- The Swagger UI dynamically reads the OpenAPI specification, which is a description of the API in a JSON or YAML format. This specification defines the structure of the API, the parameters for each endpoint, and the expected responses.
- You can use Swagger to:
  - Test the /crawl/start endpoint by sending a POST request to initiate the crawling process.
  - View other API endpoints (if implemented) to manage and monitor the web crawling process.
Clone this repository to your local machine:
git clone https://github.com/yourusername/web-crawler.git
cd web-crawler
Install Neo4j and start the database. Ensure Neo4j is running on localhost:7687 with the username neo4j and the password password.
Build the project using Maven:
./mvnw clean install
Run the application with Spring Boot:
./mvnw spring-boot:run
To start the crawling process, send a POST request to the following endpoint using Postman or any other HTTP client:
POST http://localhost:8080/crawl/start
You can configure the seed URLs and other settings in the application.properties file located at src/main/resources/application.properties.
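The POST request to /crawl/start can also be sent from Java's built-in HTTP client (Java 11 or later). This is a minimal sketch: the class name `StartCrawlSketch` is hypothetical, and it assumes the endpoint needs no request body or authentication, which the project does not document here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class StartCrawlSketch {
    /** Builds a body-less POST to <baseUrl>/crawl/start (assumes no payload is required). */
    static HttpRequest startRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/crawl/start"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Requires the application to be running locally on port 8080.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                startRequest("http://localhost:8080"),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```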
This project is licensed under the Northeastern University License. See the LICENSE file for details.
Contributions are welcome! Feel free to open an issue or submit a pull request for any improvements or bug fixes.
For any questions or inquiries, please contact: