A production-ready solution that crawls books.toscrape.com, stores data in MongoDB, detects daily changes, and exposes a secured FastAPI service for querying books and change logs. It includes a dashboard to trigger or resume crawls, run scheduled jobs on demand, and view live logs.
- Install Docker & Docker Compose
Copy and edit:
cp .env.example .env
docker compose up -d --build
- API/Dashboard → http://localhost:8000
- Swagger UI → http://localhost:8000/docs
- Dashboard → http://localhost:8000/dashboard (Basic Auth)
- Mongo-Express (optional) → http://localhost:8081
Run a crawl first (Dashboard → Start Crawl) to populate data.
- Python: 3.13 (tested)
- MongoDB: 6.x on x86_64 (official mongo:6.0 image)
- OS (tested): Linux Mint (x86_64), macOS on Apple Silicon (ARM64)
- Dependencies: pinned in requirements.txt
The stack is fully containerized with Docker Compose. It includes four services:
- mongo: MongoDB database with authentication enabled (root credentials from .env).
  - Persists data in the mongo_data volume.
  - Health-checked with db.adminCommand('ping').
- mongo-express: Optional web UI for MongoDB.
  - Runs on http://localhost:8081.
  - Requires Basic Auth (ME_CONFIG_BASICAUTH_USERNAME / ME_CONFIG_BASICAUTH_PASSWORD).
  - Connects to MongoDB using the root credentials.
- app: Main FastAPI application + Scrapy crawler.
  - Runs on http://localhost:8000.
  - Mounts local source code and reports directory.
  - Command: uvicorn app.api.main:app.
- scheduler: APScheduler daily job runner.
  - Runs python -m scheduler.schedule_daily.
  - Same image as app, but executes the scheduler instead of the API server.
Volumes:
- mongo_data: stores MongoDB data files persistently.
- ./reports: mounted inside the app container to persist daily reports.
- ./jobdata: mounted for Scrapy's JOBDIR state (resume crawls).
Stop containers:
docker compose down
Remove everything (containers + volumes):
docker compose down -v
- Start Crawl (Fresh): full crawl (for daily runs).
- Start Crawl (Resume if possible): resume interrupted crawl (Scrapy JOBDIR).
- Stop Crawl: terminate current crawl.
- View Logs: live crawler output.
# Fresh crawl
docker compose exec app bash -lc "python scheduler/run_crawl.py"
# Resume crawl
docker compose exec app bash -lc "QTS_SCRAPY_RESUME=true python scheduler/run_crawl.py"
Fresh vs Resume (important):
- Fresh = revisits all pages → required for accurate change detection.
- Resume = only for interrupted runs. Never use for scheduled daily jobs.
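The difference between the two modes can be sketched as follows. This is an illustration only, not the contents of scheduler/run_crawl.py: the function name build_crawl_command, the spider name books, and the path jobdata/books are assumptions (the path is inferred from the troubleshooting note). JOBDIR itself is the standard Scrapy setting for persisting crawl state.

```python
import os


def build_crawl_command(env=None, spider="books", jobdir="jobdata/books"):
    """Return the argv for a Scrapy crawl, adding JOBDIR only when resuming."""
    env = os.environ if env is None else env
    resume = env.get("QTS_SCRAPY_RESUME", "").strip().lower() == "true"
    cmd = ["scrapy", "crawl", spider]  # spider name is hypothetical
    if resume:
        # With JOBDIR set, Scrapy persists its request queue and seen-request
        # filter, so an interrupted crawl continues where it stopped.
        cmd += ["-s", f"JOBDIR={jobdir}"]
    return cmd


fresh = build_crawl_command(env={})
resumed = build_crawl_command(env={"QTS_SCRAPY_RESUME": "true"})
print(fresh)
print(resumed)
```

Because a resumed crawl skips requests it has already seen, it does not revisit every page, which is why it is unsuitable for daily change detection.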
- Implemented in scheduler/schedule_daily.py with APScheduler.
- Runs daily at 09:00 (based on QTS_TIMEZONE).
- Workflow: fresh crawl → compute change summary → save reports → send email.
- Dashboard → Run Scheduled Job Now = same flow, on demand.
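APScheduler handles the timing in the project. The standard-library sketch below only illustrates what "daily at 09:00 in QTS_TIMEZONE" means; the function name next_daily_run is hypothetical and this is not how scheduler/schedule_daily.py is necessarily written.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=9):
    """Return the next occurrence of hour:00 in the timezone of `now`."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        candidate += timedelta(days=1)
    return candidate


before = next_daily_run(datetime(2025, 9, 28, 8, 30, tzinfo=timezone.utc))
after = next_daily_run(datetime(2025, 9, 28, 9, 5, tzinfo=timezone.utc))
print(before.isoformat(), after.isoformat())
```

To honour QTS_TIMEZONE, the caller would pass a datetime built with zoneinfo.ZoneInfo for that zone name instead of UTC.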
Manual run:
docker compose exec app bash -lc "python scheduler/schedule_daily.py"
- Crawls all categories & paginated listings with robust selectors, retries, HTTP cache, and polite throttling.
- Normalizes and stores each book in MongoDB (books), including:
  - Numeric price fields
  - Gzipped HTML snapshot (raw_html_gz)
- Idempotent upserts by URL.
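The normalization steps listed above might look roughly like this. It is a minimal sketch using only the standard library; the helper names parse_price and compress_html are hypothetical and the project's own pipeline may differ.

```python
import gzip
import re


def parse_price(text):
    """Extract the numeric part of a scraped price such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric price in {text!r}")
    return float(match.group())


def compress_html(html):
    """Gzip an HTML page so it can be stored as a binary field."""
    return gzip.compress(html.encode("utf-8"))


page = "<html><body>A Light in the Attic</body></html>"
price = parse_price("£51.77")
blob = compress_html(page)
restored = gzip.decompress(blob).decode("utf-8")
print(price, len(blob), restored == page)
```

Keeping both the scraped string and the parsed number lets the API sort and filter on the number while the original text stays available for auditing.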
- Per-page content_hash (stable fingerprint).
- Detailed entry in changes for new and update events.
- Field-level diffs (fields_changed), price_delta, and a significant flag.
- Daily JSON/CSV reports + email alerts.
- APScheduler daily run (timezone from .env).
- Dashboard button: “Run Scheduled Job Now”.
- GET /books: filter, sort, paginate books.
- GET /books/{id}: full book details.
- GET /changes: filter by type, significance, URL, time windows.
- API-key auth and per-key, per-path rate limiting.
- Interactive Swagger UI with API key security scheme.
- Start fresh crawl.
- Start resume-if-possible crawl (Scrapy JOBDIR).
- Stop crawl, view live logs.
- Quick links: Swagger, Docs, Mongo-Express.
- App + MongoDB (+ optional Mongo-Express).
- Persistent volumes for data and reports.
- Pytest suite for endpoints, rate limiting, and reports.
- Coverage reporting.
- Every request requires: X-API-Key: <QTS_API_KEY>
- Rate limit: 100 req/hour per (API key, path)
- Exceeding → 429 Too Many Requests
- Replace placeholders before running: HOST="http://example.com" and API_KEY="your-api-key"
HOST="http://example.com"
API_KEY="your-api-key"
for i in $(seq 1 105); do
code=$(curl -s -o /dev/null -w "%{http_code}" \
"$HOST/books?page=1&page_size=1" \
-H "X-API-Key: $API_KEY")
printf "%03d -> %s\n" "$i" "$code"
done
- Expect 200s first, then 429 once the limit is hit.
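The behaviour the loop exercises can be modelled with a small in-memory limiter. This is a sketch under an assumption: it uses a fixed one-hour window, whereas the document does not say whether the real limiter is fixed-window or sliding. The class name FixedWindowRateLimiter is hypothetical.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds per (key, path)."""

    def __init__(self, limit=100, window=3600, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._buckets = {}  # (api_key, path) -> (window_start, count)

    def allow(self, api_key, path):
        now = self.clock()
        start, count = self._buckets.get((api_key, path), (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window expired
        if count >= self.limit:
            self._buckets[(api_key, path)] = (start, count)
            return False  # the caller would answer 429 Too Many Requests
        self._buckets[(api_key, path)] = (start, count + 1)
        return True


limiter = FixedWindowRateLimiter()
codes = [200 if limiter.allow("demo-key", "/books") else 429 for _ in range(105)]
print(codes.count(200), codes.count(429))
```

Keying the counter on the (API key, path) pair means that exhausting the budget for /books leaves the budget for /changes untouched.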
- GET /books: query by category, rating, price range, search term.
- GET /books/{id}: book details.
- GET /changes: filter by kind, significance, time window.
- GET /reports/list: list available daily reports.
- GET /reports/today: fetch today's report (json|csv).
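A request to any of these endpoints can be built from Python with the standard library alone. In this sketch the helper name build_request is hypothetical, and page and page_size are the only query parameters used because they are the only ones this document shows; the names of the other filters should be taken from the Swagger UI.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(host, path, api_key, **params):
    """Build a GET request carrying the X-API-Key header."""
    query = urlencode(params)
    url = f"{host.rstrip('/')}{path}" + (f"?{query}" if query else "")
    return Request(url, headers={"X-API-Key": api_key})


req = build_request("http://localhost:8000", "/books", "your-api-key",
                    page=1, page_size=1)
print(req.full_url)
```

Once the stack is running, passing req to urllib.request.urlopen sends the request.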
The application uses two primary collections: books and changes.
Stores the latest snapshot of each crawled book.
Sample document:
{
"_id": { "$oid": "6512bd43d9caa6e02c990b0a" },
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"name": "A Light in the Attic",
"description": "A collection of humorous poems and drawings.",
"category": "Poetry",
"image_url": "https://books.toscrape.com/media/cache/fe/9a/fe9a...jpg",
"rating": 4,
"availability": "In stock (22 available)",
"price_incl_tax": "£51.77",
"price_excl_tax": "£48.77",
"price_incl_tax_num": 51.77,
"price_excl_tax_num": 48.77,
"tax": "£3.00",
"num_reviews": 0,
"crawled_at": { "$date": "2025-09-27T14:23:16.725Z" },
"source": "books.toscrape.com",
"content_hash": "sha256:6d6b8b5f9c8f...b7",
"raw_html_gz": { "$binary": "H4sIAAAAAAAA/4xYXW...", "$type": "00" }
}
Field reference:
- url (string, unique): canonical product URL
- name (string)
- description (string)
- category (string)
- image_url (string)
- rating (int, 0–5)
- availability (string)
- price_incl_tax, price_excl_tax, tax (string as scraped, e.g. "£51.77")
- price_incl_tax_num, price_excl_tax_num (number, for sorting/filtering)
- num_reviews (int)
- crawled_at (datetime, UTC)
- source (string, e.g. "books.toscrape.com")
- content_hash (string, sha256 fingerprint of salient content)
- raw_html_gz (binary, gzipped HTML snapshot; optional)
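One way to compute a content_hash like the one described is to hash a canonical serialization of the fields that matter. This is a sketch, not the project's implementation: which fields count as "salient" is an assumption, as is the JSON-based canonicalization.

```python
import hashlib
import json

# Assumed set of salient fields; the real list may differ.
SALIENT_FIELDS = ("name", "category", "rating", "availability",
                  "price_incl_tax", "price_excl_tax", "tax", "num_reviews")


def content_hash(book):
    """Return a sha256 fingerprint that ignores volatile fields like crawled_at."""
    salient = {field: book.get(field) for field in SALIENT_FIELDS}
    # sort_keys and fixed separators make the serialization deterministic.
    payload = json.dumps(salient, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


book = {"name": "A Light in the Attic", "price_incl_tax": "£51.77",
        "availability": "In stock (22 available)", "crawled_at": "2025-09-27"}
same = dict(book, crawled_at="2025-09-28")
cheaper = dict(book, price_incl_tax="£49.99")
print(content_hash(book) == content_hash(same))
print(content_hash(book) == content_hash(cheaper))
```

Excluding crawled_at from the hash is what makes the fingerprint stable across crawls when nothing meaningful has changed.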
Stores change history between crawls. Each entry records either a new book or an update.
Sample documents:
{
"_id": { "$oid": "6512bd43d9caa6e02c990b0b" },
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"changed_at": { "$date": "2025-09-28T09:05:11.112Z" },
"change_kind": "new",
"significant": true,
"fields_changed": {
"price_incl_tax": { "prev": null, "new": "£51.77" },
"availability": { "prev": null, "new": "In stock (22 available)" }
},
"price_delta": 51.77,
"prev_hash": null,
"new_hash": "sha256:6d6b8b5f9c8f...b7"
}
{
"_id": { "$oid": "6512bd43d9caa6e02c990b0c" },
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"changed_at": { "$date": "2025-10-01T10:14:03.004Z" },
"change_kind": "update",
"significant": true,
"fields_changed": {
"price_incl_tax": { "prev": "£51.77", "new": "£49.99" },
"availability": { "prev": "In stock (22 available)", "new": "In stock (5 available)" }
},
"price_delta": -1.78,
"prev_hash": "sha256:6d6b8b5f9c8f...b7",
"new_hash": "sha256:9f3e2a1c0d4e...21"
}
Field reference:
- url (string): FK to books.url
- changed_at (datetime, UTC)
- change_kind (enum: "new" | "update")
- significant (bool)
- fields_changed (object: { field: { prev, new } })
- price_delta (number; 0 for non-price updates)
- prev_hash, new_hash (strings; content fingerprints)
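A change record of this shape could be derived from two snapshots roughly as follows. The sketch rests on assumptions: the function name diff_books, the list of tracked fields, and the rule that price or availability changes are the "significant" ones are all hypothetical, since the document does not define them.

```python
# Assumption: price and availability changes are the ones flagged significant.
SIGNIFICANT_FIELDS = {"price_incl_tax", "price_excl_tax", "availability"}
TRACKED_FIELDS = ("name", "category", "rating", "availability",
                  "price_incl_tax", "price_excl_tax", "num_reviews")


def diff_books(prev, new):
    """Build the change-record fields for a previous and a new snapshot."""
    prev = prev or {}
    fields_changed = {
        field: {"prev": prev.get(field), "new": new.get(field)}
        for field in TRACKED_FIELDS
        if prev.get(field) != new.get(field)
    }
    old_price = prev.get("price_incl_tax_num", 0.0)
    new_price = new.get("price_incl_tax_num", 0.0)
    return {
        "change_kind": "update" if prev else "new",
        "fields_changed": fields_changed,
        # Rounded to pence so float noise does not leak into the record.
        "price_delta": round(new_price - old_price, 2),
        "significant": bool(SIGNIFICANT_FIELDS & fields_changed.keys()),
    }


old = {"price_incl_tax": "£51.77", "price_incl_tax_num": 51.77,
       "availability": "In stock (22 available)"}
new = {"price_incl_tax": "£49.99", "price_incl_tax_num": 49.99,
       "availability": "In stock (5 available)"}
change = diff_books(old, new)
print(change["change_kind"], change["price_delta"], change["significant"])
```

Passing None as the previous snapshot yields a "new" record whose price_delta equals the full price, matching the first sample document.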
- Daily reports in ./reports/: changes_YYYY-MM-DD.json and changes_YYYY-MM-DD.csv
- Email alerts (optional) → when new items or significant changes are detected.
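Writing the two report files could be sketched like this. Only the file naming follows the document; the function name write_daily_report and the CSV column selection are assumptions, and the real report layout may differ.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path

# Assumed CSV columns; the real report layout may differ.
CSV_COLUMNS = ["url", "change_kind", "significant", "price_delta"]


def write_daily_report(changes, reports_dir, day):
    """Write changes_YYYY-MM-DD.json and .csv into reports_dir."""
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)
    stem = f"changes_{day.isoformat()}"
    json_path = reports_dir / f"{stem}.json"
    csv_path = reports_dir / f"{stem}.csv"
    json_path.write_text(json.dumps(changes, indent=2, ensure_ascii=False),
                         encoding="utf-8")
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not CSV columns.
        writer = csv.DictWriter(handle, fieldnames=CSV_COLUMNS,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(changes)
    return json_path, csv_path


changes = [{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "change_kind": "update", "significant": True, "price_delta": -1.78}]
with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = write_daily_report(changes, tmp, date(2025, 10, 1))
    names = (json_path.name, csv_path.name)
    csv_lines = csv_path.read_text(encoding="utf-8").splitlines()
print(names)
```

The example writes into a temporary directory; in the container the target would be the mounted ./reports directory.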
Run tests + coverage:
docker compose exec app bash -lc "coverage run -m pytest -q && coverage report -m"
Generate HTML report:
docker compose exec app bash -lc "coverage html && ls -l htmlcov/index.html"
- API key: always send X-API-Key.
- Dashboard auth: QTS_ADMIN_USER / QTS_ADMIN_PASS.
- Scheduler: timezone controlled by QTS_TIMEZONE.
- Resume: only for interrupted runs (not for daily jobs).
- Mongo-Express: optional UI.
- 401/403 despite Authorized in Swagger → Ensure the X-API-Key value matches QTS_API_KEY in .env and that the container has been restarted.
- Resume not working → Check that QTS_SCRAPY_RESUME=true is set and that ./jobdata/books/ exists.