A concurrent web crawler service that searches for keywords in web pages and their linked pages.
- Concurrent web crawling using multiple threads
- Case-insensitive keyword search
- Supports both relative and absolute URLs
- RESTful API endpoints
- Real-time status tracking
- Handles special characters in search terms
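
As a rough illustration of how case-insensitive matching with special characters could work, here is a minimal sketch. The `KeywordMatcher` class and its `matches` method are hypothetical names, not taken from this project. It assumes the keyword should be matched literally, so characters like `.` or `+` are not treated as regex operators.

```java
import java.util.regex.Pattern;

// Hypothetical helper: class and method names are illustrative assumptions.
final class KeywordMatcher {
    private final Pattern pattern;

    KeywordMatcher(String keyword) {
        // LITERAL makes every character of the keyword match itself, so '.', '+',
        // '(' and similar characters are not interpreted as regex syntax.
        // CASE_INSENSITIVE and UNICODE_CASE still apply when combined with LITERAL.
        this.pattern = Pattern.compile(
                keyword,
                Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns true if the page text contains the keyword, ignoring case.
    boolean matches(String pageText) {
        return pageText != null && pattern.matcher(pageText).find();
    }
}
```

Compiling the pattern once per search and reusing it for every fetched page avoids re-parsing the keyword on each page. `Pattern` instances are immutable and safe to share across crawler threads.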
- POST /crawl
  - Body: {"keyword": "your-search-term"}
  - Constraints:
    - Keyword must be 4-32 characters long
  - Returns: Search object with ID and initial status
- GET /crawl/:id
  - Returns: Search object with current status and found URLs

    {
      "id": "unique-id",
      "urls": ["array-of-found-urls"],
      "status": "active|done"
    }
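
The keyword constraint on POST /crawl could be checked along the following lines. This is only a sketch: `KeywordValidator` is a hypothetical name. The inclusive bounds and the use of `String.length()` (UTF-16 code units) are assumptions, since the description above does not say how length is counted or how invalid requests are answered.

```java
// Hypothetical validator: names and counting rules are illustrative assumptions.
final class KeywordValidator {
    static final int MIN_LENGTH = 4;   // lower bound from the API description
    static final int MAX_LENGTH = 32;  // upper bound from the API description

    private KeywordValidator() {
    }

    // Returns true when the keyword is present and its length is within
    // [MIN_LENGTH, MAX_LENGTH], both bounds assumed to be inclusive.
    static boolean isValid(String keyword) {
        if (keyword == null) {
            return false;
        }
        int length = keyword.length();
        return length >= MIN_LENGTH && length <= MAX_LENGTH;
    }
}
```

A route handler would typically call such a check before creating the Search object and reject the request otherwise.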
- Make sure you have Java and Maven installed
- Build the project:

      mvn clean package

- Run the application:

  Manually:

      export BASE_URL=[base-url]
      java -jar target/backend-test-1.0-SNAPSHOT.jar

  Or, without building first:

      ./run [base-url]

You can also run the application using Docker:

    docker build . -t blur/backend
    docker run -e BASE_URL=[base-url] -p 4567:4567 blur/backend

- Built with Java and Spark Framework
- Uses concurrent data structures for thread safety
- Implements smart URL normalization
- Filters out non-HTML resources (images, videos, etc.)
- Proper error handling and logging
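
To make the notes on URL normalization, non-HTML filtering, and thread-safe structures more concrete, here is a minimal sketch of one way they could fit together. It is not the project's actual implementation. `LinkNormalizer`, its method names, and the extension list are all assumptions made for illustration, and the code assumes Java 9 or later for `Set.of`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: names and the extension list are illustrative assumptions.
final class LinkNormalizer {
    // Assumed list of extensions that are not worth fetching as HTML.
    private static final Set<String> NON_HTML_EXTENSIONS = Set.of(
            ".png", ".jpg", ".jpeg", ".gif", ".svg", ".mp4", ".mp3", ".pdf", ".css", ".js", ".zip");

    // Concurrent set, so several crawler threads can record visits without extra locking.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Resolves href (relative or absolute) against the page URL, removes "." and ".."
    // segments, and drops the fragment. Returns empty for non-HTTP(S) or malformed links.
    static Optional<String> normalize(String pageUrl, String href) {
        try {
            URI resolved = new URI(pageUrl).resolve(href.trim()).normalize();
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return Optional.empty();
            }
            String text = resolved.toString();
            int hash = text.indexOf('#');
            return Optional.of(hash >= 0 ? text.substring(0, hash) : text);
        } catch (URISyntaxException | IllegalArgumentException e) {
            return Optional.empty(); // malformed base URL or href
        }
    }

    // Returns false when the last path segment ends in a known non-HTML extension.
    // Expects an already normalized, syntactically valid URL.
    static boolean isLikelyHtml(String normalizedUrl) {
        String path = URI.create(normalizedUrl).getPath();
        if (path == null) {
            return true;
        }
        String lower = path.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        if (dot <= lower.lastIndexOf('/')) {
            return true; // no extension in the last path segment
        }
        return !NON_HTML_EXTENSIONS.contains(lower.substring(dot));
    }

    // Atomically records the URL; returns true only for the first caller that adds it.
    boolean firstVisit(String normalizedUrl) {
        return visited.add(normalizedUrl);
    }
}
```

Normalizing before the `firstVisit` check means that `../about.html#team` and `/about.html` count as the same page, which keeps multiple threads from crawling one page twice.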