Downloads all resources from a TextAPI 2.0 instance and saves them locally, rewriting URLs so the output can be served from a static web server. Used to create mock files for TIDO e2e testing.
# 1. Clone the repository
git clone <repo-url> text-api-crawler
cd text-api-crawler
# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure the target API
# Open main.py and set the config variables at the top:
# - server_base_url (the API server, e.g. https://example.com/api)
# - output_base_url (where the output will be served from)
# - entrypoint_url (where to start crawling)
# - example_mode (True = first/last 3 items only)
# 5. Run the crawler
python3 main.py

# Default config (variables at top of main.py)
python3 main.py
# Override entrypoint and enable example mode
python3 main.py https://server/api/collections/ -e
# Download only specific manifests
python3 main.py -fm man-1,man-2,man-3
# Skip images, CSS, XML, and font files
python3 main.py -x img,css,xml,font
# Download only specific collections and items
python3 main.py -fc col-a,col-b -fi item-x,item-y
# Full CLI control
python3 main.py https://server/api/collections/ \
-s https://server/api \
-o http://localhost:8181 \
-d output \
-m mocks \
-e \
-a \
-r objects \
-fc col-1,col-2 \
-fm man-1,man-2 \
-fi item-1,item-2

Edit variables at the top of main.py, or override via CLI:
| Variable / Flag | Default | Description |
|---|---|---|
| `entrypoint_url` / `[entrypoint]` | `https://.../collections/transcriptions` | URL to start crawling from |
| `-s` / `--server-base-url` | `https://.../api` | API server base URL |
| `-o` / `--output-base-url` | `http://localhost:8181` | Base URL written into saved files |
| `-d` / `--output-dir` | `output` | Local directory for downloaded files |
| `-m` / `--mocks-dir` | `mocks` | Directory with placeholder files (e.g. `text-api.png`) |
| `-e` / `--example-mode` | `True` | Only download first 3 + last 3 items from arrays |
| `-a` / `--no-annotations` | `False` | Disable following `annotationCollection` links |
| `-r` / `--follow-references` | `strings` | `strings` = always download referenced URLs; `objects` = process embedded objects inline |
| `collection_filters` / `-fc` | `[]` | Comma-separated list of collection IDs to download |
| `manifest_filters` / `-fm` | `[]` | Comma-separated list of manifest IDs to download |
| `item_filters` / `-fi` | `[]` | Comma-separated list of item IDs to download |
| `exclude_assets` / `-x` | `[]` | Asset types to skip: `font`, `img`, `css`, `xml` (comma-separated) |
| `advanced_replace` | `[]` | `[search, replace]`: an additional search/replace pair applied to URLs in downloaded content |

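
To illustrate what example mode does to an array, here is a minimal sketch. It assumes that "first 3 + last 3" means keeping three entries from each end and that arrays with six or fewer entries are left untouched. The function name `trim_for_example_mode` is hypothetical and is not taken from `main.py`.

```python
def trim_for_example_mode(items, keep=3):
    """Keep the first and last `keep` entries of a list (hypothetical helper)."""
    # Assumption: arrays with at most 2 * keep entries are returned unchanged,
    # so that no entry ends up in the result twice.
    if len(items) <= 2 * keep:
        return list(items)
    return list(items[:keep]) + list(items[-keep:])


pages = [f"page-{n:03d}" for n in range(1, 11)]
print(trim_for_example_mode(pages))
```

With ten pages, this keeps `page-001` to `page-003` and `page-008` to `page-010`.
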
When a filter list is non-empty, only resources whose URL or slug matches an entry in the list are crawled. Matching checks both the full URL and the last path segment (slug). For example, -fm man-1 matches both https://server/api/manifests/man-1/ and https://server/api/manifests/man-1.
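
The matching rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crawler's actual code: the helper name `matches_filter` is hypothetical, and the slug is assumed to be the last non-empty path segment, so a trailing slash does not matter.

```python
from urllib.parse import urlparse


def matches_filter(url, filters):
    """Return True if `url` passes the filter list (hypothetical helper)."""
    # An empty filter list means no filtering: every resource is crawled.
    if not filters:
        return True
    # Assumption: the slug is the last non-empty path segment of the URL.
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # A resource matches if either its full URL or its slug is listed.
    return url in filters or slug in filters


print(matches_filter("https://server/api/manifests/man-1/", ["man-1"]))
print(matches_filter("https://server/api/manifests/man-2", ["man-1"]))
```

The first call matches on the slug `man-1`; the second does not match because neither the URL nor the slug `man-2` appears in the list.
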
- Starts from the `entrypoint_url` and fetches the JSON
- Detects resource type via the `textapiType` field (`TextApiCollection`, `TextApiManifest`, `TextApiItem`) or URL path
- Saves each resource as `<last-path-segment>.json` mirroring the URL path
- Follows references (`collections`, `manifests`, `items`) recursively, applying any active filters
- Downloads linked files (HTML content, CSS assets); images are replaced with `mocks/text-api.png`
- Rewrites all occurrences of `server_base_url` → `output_base_url` in saved files
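
Two of the steps above, type detection and URL rewriting, can be sketched as below. The helper names and the `PATH_HINTS` mapping are hypothetical; in particular, the assumption that the URL path fallback looks for the segments `collections`, `manifests`, and `items` is inferred from the example URLs in this README rather than confirmed from `main.py`.

```python
from urllib.parse import urlparse

# Assumed mapping from a URL path segment to a TextAPI type name.
PATH_HINTS = {
    "collections": "TextApiCollection",
    "manifests": "TextApiManifest",
    "items": "TextApiItem",
}


def detect_type(resource, url):
    """Prefer the textapiType field, fall back to the URL path (hypothetical)."""
    declared = resource.get("textapiType")
    if declared in PATH_HINTS.values():
        return declared
    # Fallback: look for a known segment anywhere in the URL path.
    for segment in urlparse(url).path.split("/"):
        if segment in PATH_HINTS:
            return PATH_HINTS[segment]
    return None


def rewrite_base_url(text, server_base_url, output_base_url):
    """Replace every occurrence of the server base URL in saved content."""
    return text.replace(server_base_url, output_base_url)


print(detect_type({}, "https://server/api/items/page-001"))
print(rewrite_base_url(
    '{"id": "https://server/api/items/page-001"}',
    "https://server/api",
    "http://localhost:8181",
))
```

A plain string replacement is used here because the description speaks of rewriting "all occurrences"; whether the crawler also adjusts file extensions inside rewritten URLs is not covered by this sketch.
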
The URL path structure is preserved exactly. The last path segment becomes the filename with .json extension:
| URL | Saved to |
|---|---|
| `.../api/collections/transcriptions` | `output/collections/transcriptions.json` |
| `.../api/manifests/man-1` | `output/manifests/man-1.json` |
| `.../api/items/page-001` | `output/items/page-001.json` |
| `.../api/api/files/html/page.html` | `output/api/files/html/page.html` |

JSON API resources get .json appended to their last path segment. Content files with extensions (.html, .css, etc.) keep their original filename.
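
The mapping from URL to saved file can be sketched like this. It is a minimal illustration that assumes the server base path (for example `/api`) is stripped once from the front of the URL path and that "has an extension" is decided by the last path segment containing a suffix. The function name `local_path` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path(url, server_base_url, output_dir="output"):
    """Map a resource URL to a file path under output_dir (hypothetical helper)."""
    base_path = urlparse(server_base_url).path.rstrip("/")
    path = urlparse(url).path.rstrip("/")
    # Drop the server base path once, so only the remainder is mirrored locally.
    if base_path and path.startswith(base_path + "/"):
        path = path[len(base_path):]
    relative = PurePosixPath(path.lstrip("/"))
    # Files that already carry an extension keep their name; others get .json.
    if not relative.suffix:
        relative = relative.with_name(relative.name + ".json")
    return str(PurePosixPath(output_dir) / relative)


base = "https://server/api"
print(local_path("https://server/api/manifests/man-1", base))
print(local_path("https://server/api/api/files/html/page.html", base))
```

One limitation of this sketch: a resource ID that itself contains a dot (for example `man-1.2`) would be treated as already having an extension, so a real implementation may need a stricter check.
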
output/
  collections/
    transcriptions.json
  manifests/
    manuscript-a.json
  items/
    page-001.json
  api/files/html/
    page-001.html

- Python 3.8+
- The `requests` package (`pip install requests`)