paulpestov/text-api-crawler

TextAPI Crawler

Downloads all resources from a TextAPI 2.0 instance and saves them locally, rewriting URLs so the output can be served from a static web server. Used to create mock files for TIDO e2e testing.

Setup

# 1. Clone the repository
git clone <repo-url> text-api-crawler
cd text-api-crawler

# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure the target API
#    Open main.py and set the config variables at the top:
#    - server_base_url  (the API server, e.g. https://example.com/api)
#    - output_base_url  (where the output will be served from)
#    - entrypoint_url   (where to start crawling)
#    - example_mode     (True = first/last 3 items only)

# 5. Run the crawler
python3 main.py

Usage

# Default config (variables at top of main.py)
python3 main.py

# Override entrypoint and enable example mode
python3 main.py https://server/api/collections/ -e

# Download only specific manifests
python3 main.py -fm man-1,man-2,man-3

# Skip images, CSS, XML, and font files
python3 main.py -x img,css,xml,font

# Download only specific collections and items
python3 main.py -fc col-a,col-b -fi item-x,item-y

# Full CLI control
python3 main.py https://server/api/collections/ \
  -s https://server/api \
  -o http://localhost:8181 \
  -d output \
  -m mocks \
  -e \
  -a \
  -r objects \
  -fc col-1,col-2 \
  -fm man-1,man-2 \
  -fi item-1,item-2
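
The comma-separated flags above could be parsed along these lines. This is a minimal sketch using the standard argparse module, not the actual code in main.py: the helper name split_csv, the dest names, and the defaults shown are assumptions made for illustration, and only a subset of the flags is covered.

```python
import argparse


def split_csv(value):
    """Turn 'a,b,c' into ['a', 'b', 'c'], dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser():
    # Hypothetical parser: the real defaults live at the top of main.py.
    parser = argparse.ArgumentParser(description="TextAPI crawler (CLI sketch)")
    parser.add_argument("entrypoint", nargs="?", default=None)
    parser.add_argument("-s", "--server-base-url")
    parser.add_argument("-o", "--output-base-url")
    parser.add_argument("-d", "--output-dir", default="output")
    parser.add_argument("-e", "--example-mode", action="store_true")
    # Multi-letter short flags need an explicit dest.
    parser.add_argument("-fc", dest="collection_filters", type=split_csv, default=[])
    parser.add_argument("-fm", dest="manifest_filters", type=split_csv, default=[])
    parser.add_argument("-fi", dest="item_filters", type=split_csv, default=[])
    parser.add_argument("-x", dest="exclude_assets", type=split_csv, default=[])
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["https://server/api/collections/", "-e", "-fm", "man-1,man-2"]
)
print(args.entrypoint, args.example_mode, args.manifest_filters)
```

Converting each list with type=split_csv means the rest of the program only ever sees Python lists, whether a value came from the CLI or from the config variables.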

Configuration

Edit variables at the top of main.py, or override via CLI:

| Variable / Flag | Default | Description |
| --- | --- | --- |
| entrypoint_url / [entrypoint] | https://.../collections/transcriptions | URL to start crawling from |
| -s / --server-base-url | https://.../api | API server base URL |
| -o / --output-base-url | http://localhost:8181 | Base URL written into saved files |
| -d / --output-dir | output | Local directory for downloaded files |
| -m / --mocks-dir | mocks | Directory with placeholder files (e.g. text-api.png) |
| -e / --example-mode | True | Only download first 3 + last 3 items from arrays |
| -a / --no-annotations | False | Disable following annotationCollection links |
| -r / --follow-references | strings | strings = always download referenced URLs; objects = process embedded objects inline |
| collection_filters / -fc | [] | Comma-separated list of collection IDs to download |
| manifest_filters / -fm | [] | Comma-separated list of manifest IDs to download |
| item_filters / -fi | [] | Comma-separated list of item IDs to download |
| exclude_assets / -x | [] | Asset types to skip: font, img, css, xml (comma-separated) |
| advanced_replace | [] | [search, replace]: an additional search/replace pair applied to URLs in downloaded content |
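
Example mode keeps only the first 3 and last 3 entries of an array. A minimal sketch of that trimming is shown here. The function name trim_for_example_mode is hypothetical, and returning short arrays unchanged (so no entry is duplicated) is an assumption, not confirmed behavior of the crawler.

```python
def trim_for_example_mode(items, keep=3):
    """Return the first `keep` and last `keep` entries of `items`."""
    # Assumption: arrays with at most 2 * keep entries are returned whole,
    # so that no entry appears twice in the result.
    if len(items) <= 2 * keep:
        return list(items)
    return list(items[:keep]) + list(items[-keep:])


pages = [f"page-{n:03d}" for n in range(1, 11)]
print(trim_for_example_mode(pages))
```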

When a filter list is non-empty, only resources whose URL or slug matches an entry in the list are crawled. Matching checks both the full URL and the last path segment (slug). For example, -fm man-1 matches both https://server/api/manifests/man-1/ and https://server/api/manifests/man-1.
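
The matching rule can be sketched as follows. The helper names slug_of and passes_filter are hypothetical, and ignoring a trailing slash when comparing full URLs is an assumption. The sketch only illustrates the "full URL or last path segment" check described above.

```python
from urllib.parse import urlparse


def slug_of(url):
    """Return the last path segment of a URL, ignoring a trailing slash."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def passes_filter(url, filters):
    """True if `filters` is empty or `url` matches an entry by full URL or slug."""
    if not filters:
        return True  # an empty filter list does not restrict the crawl
    normalized = url.rstrip("/")
    for entry in filters:
        if normalized == entry.rstrip("/") or slug_of(url) == entry:
            return True
    return False


print(passes_filter("https://server/api/manifests/man-1/", ["man-1"]))
print(passes_filter("https://server/api/manifests/man-2", ["man-1"]))
```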

How It Works

  1. Starts from the entrypoint_url and fetches the JSON
  2. Detects resource type via the textapiType field (TextApiCollection, TextApiManifest, TextApiItem) or URL path
  3. Saves each resource as <last-path-segment>.json mirroring the URL path
  4. Follows references (collections, manifests, items) recursively, applying any active filters
  5. Downloads linked files (HTML content, CSS assets) — images are replaced with mocks/text-api.png
  6. Rewrites all occurrences of server_base_url to output_base_url in saved files
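
Steps 2 and 6 above can be sketched as two small functions. This is an illustration rather than the code in main.py: the function names and the path-segment fallback table (collections, manifests, items) are assumptions, and the rewrite is shown as a plain string replacement on the serialized JSON.

```python
import json

# Assumed fallback mapping from URL path segment to resource type.
TYPE_BY_PATH = {
    "collections": "TextApiCollection",
    "manifests": "TextApiManifest",
    "items": "TextApiItem",
}


def detect_type(resource, url):
    """Prefer the textapiType field; fall back to the URL path."""
    declared = resource.get("textapiType")
    if declared:
        return declared
    for segment, type_name in TYPE_BY_PATH.items():
        if f"/{segment}/" in url:
            return type_name
    return None


def rewrite_urls(text, server_base_url, output_base_url):
    """Replace every occurrence of the server base URL in saved content."""
    return text.replace(server_base_url, output_base_url)


manifest = {"id": "https://server/api/manifests/man-1"}
saved = rewrite_urls(json.dumps(manifest), "https://server/api", "http://localhost:8181")
print(detect_type(manifest, manifest["id"]), saved)
```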

The URL path below server_base_url is preserved exactly. The last path segment becomes the filename, with a .json extension:

| URL | Saved to |
| --- | --- |
| .../api/collections/transcriptions | output/collections/transcriptions.json |
| .../api/manifests/man-1 | output/manifests/man-1.json |
| .../api/items/page-001 | output/items/page-001.json |
| .../api/api/files/html/page.html | output/api/files/html/page.html |

JSON API resources get .json appended to their last path segment. Content files with extensions (.html, .css, etc.) keep their original filename.
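
The mapping from URL to local path might look like the sketch below. The function name local_path_for is hypothetical, and stripping the path of server_base_url once from the front is an assumption inferred from the table above. A last segment is treated as a content file whenever it has a file extension, so an ID containing a dot would be misclassified by this simple check.

```python
import posixpath
from urllib.parse import urlparse


def local_path_for(url, server_base_url, output_dir="output"):
    """Map a resource URL to the file path it would be saved under."""
    base_path = urlparse(server_base_url).path.rstrip("/")
    path = urlparse(url).path.rstrip("/")
    # Strip the server base path once, keeping everything below it.
    if base_path and path.startswith(base_path + "/"):
        path = path[len(base_path):]
    relative = path.lstrip("/")
    last_segment = relative.rsplit("/", 1)[-1]
    # No extension means a JSON API resource, so append .json.
    if not posixpath.splitext(last_segment)[1]:
        relative += ".json"
    return f"{output_dir}/{relative}"


print(local_path_for("https://server/api/manifests/man-1", "https://server/api"))
print(local_path_for("https://server/api/api/files/html/page.html", "https://server/api"))
```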

Output Structure

output/
  collections/
    transcriptions.json
  manifests/
    manuscript-a.json
  items/
    page-001.json
  api/files/html/
    page-001.html

Requirements

  • Python 3.8+
  • requests (pip install requests)
