TextAPI Crawler

Downloads all resources from a TextAPI 2.0 instance and saves them locally, rewriting URLs so the output can be served from a static web server. It is used to create mock files for TIDO end-to-end (e2e) testing.

Setup

# 1. Clone the repository
git clone <repo-url> text-api-crawler
cd text-api-crawler

# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure the target API
#    Open main.py and set the config variables at the top:
#    - server_base_url  (the API server, e.g. https://example.com/api)
#    - output_base_url  (where the output will be served from)
#    - entrypoint_url   (where to start crawling)
#    - example_mode     (True = first/last 3 items only)

# 5. Run the crawler
python3 main.py

Usage

# Default config (variables at top of main.py)
python3 main.py

# Override entrypoint and enable example mode
python3 main.py https://server/api/collections/ -e

# Download only specific manifests
python3 main.py -fm man-1,man-2,man-3

# Skip images, CSS, XML, and font files
python3 main.py -x img,css,xml,font

# Download only specific collections and items
python3 main.py -fc col-a,col-b -fi item-x,item-y

# Full CLI control
python3 main.py https://server/api/collections/ \
  -s https://server/api \
  -o http://localhost:8181 \
  -d output \
  -m mocks \
  -e \
  -a \
  -r objects \
  -fc col-1,col-2 \
  -fm man-1,man-2 \
  -fi item-1,item-2

Configuration

Edit variables at the top of main.py, or override via CLI:

Variable / Flag	Default	Description
`entrypoint_url` / `[entrypoint]`	`https://.../collections/transcriptions`	URL to start crawling from
`-s` / `--server-base-url`	`https://.../api`	API server base URL
`-o` / `--output-base-url`	`http://localhost:8181`	Base URL written into saved files
`-d` / `--output-dir`	`output`	Local directory for downloaded files
`-m` / `--mocks-dir`	`mocks`	Directory with placeholder files (e.g. `text-api.png`)
`-e` / `--example-mode`	`True`	Only download first 3 + last 3 items from arrays
`-a` / `--no-annotations`	`False`	Disable following `annotationCollection` links
`-r` / `--follow-references`	`strings`	`strings` = always download referenced URLs; `objects` = process embedded objects inline
`collection_filters` / `-fc`	`[]`	Comma-separated list of collection IDs to download
`manifest_filters` / `-fm`	`[]`	Comma-separated list of manifest IDs to download
`item_filters` / `-fi`	`[]`	Comma-separated list of item IDs to download
`exclude_assets` / `-x`	`[]`	Asset types to skip: `font`, `img`, `css`, `xml` (comma-separated)
`advanced_replace`	`[]`	`[search, replace]` — an additional search/replace pair applied to URLs in downloaded content
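
The example-mode rule (first 3 + last 3 items) can be illustrated with a small sketch. The function name `sample_ends` is hypothetical and not taken from main.py, and returning short arrays unchanged is an assumption made here to avoid duplicate entries; the crawler itself may handle that case differently.

```python
from typing import List, TypeVar

T = TypeVar("T")


def sample_ends(items: List[T], keep: int = 3) -> List[T]:
    """Return the first `keep` and the last `keep` entries of `items`.

    Assumption: lists with at most 2 * keep entries are returned whole,
    so no entry appears twice in the result.
    """
    if len(items) <= 2 * keep:
        return list(items)
    return items[:keep] + items[-keep:]


# Ten hypothetical item IDs, page-001 through page-010.
pages = [f"page-{n:03d}" for n in range(1, 11)]
print(sample_ends(pages))
```

With ten items, only page-001 to page-003 and page-008 to page-010 would be kept.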

When a filter list is non-empty, only resources that match one of its entries are crawled. Each entry is compared against both the full URL and the last path segment (the slug), with or without a trailing slash. For example, -fm man-1 matches both https://server/api/manifests/man-1/ and https://server/api/manifests/man-1.
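
The matching rule can be sketched as follows. This is a minimal illustration of the behaviour described above, not the code from main.py; the helper names `slug_of` and `passes_filter` are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlparse


def slug_of(url: str) -> str:
    """Return the last non-empty path segment of `url`."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def passes_filter(url: str, filters: Iterable[str]) -> bool:
    """True if `filters` is empty, or if the URL or its slug is listed."""
    # Trailing slashes are ignored on both sides of the comparison.
    wanted = {entry.rstrip("/") for entry in filters}
    if not wanted:
        return True
    return url.rstrip("/") in wanted or slug_of(url) in wanted


print(passes_filter("https://server/api/manifests/man-1/", ["man-1"]))
print(passes_filter("https://server/api/manifests/man-2", ["man-1"]))
```

An empty filter list lets every resource through, which corresponds to the default value `[]` in the table.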

How It Works

1. Starts from the entrypoint_url and fetches the JSON.
2. Detects the resource type via the textapiType field (TextApiCollection, TextApiManifest, TextApiItem) or the URL path.
3. Saves each resource as <last-path-segment>.json, mirroring the URL path.
4. Follows references (collections, manifests, items) recursively, applying any active filters.
5. Downloads linked files (HTML content, CSS assets); images are replaced with mocks/text-api.png.
6. Rewrites all occurrences of server_base_url → output_base_url in saved files.
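
Two of these steps, type detection and URL rewriting, can be sketched in a few lines. This is an illustration under assumptions, not the logic in main.py: the function names are hypothetical, the order of checks (field first, then path) is assumed, and the path hints (`collections`, `manifests`, `items`) are inferred from the example URLs in this README.

```python
from typing import Optional
from urllib.parse import urlparse

# Values of the textapiType field, mapped to a short internal label.
KNOWN_TYPES = {
    "TextApiCollection": "collection",
    "TextApiManifest": "manifest",
    "TextApiItem": "item",
}
# Assumed path segments, taken from the example URLs in this README.
PATH_HINTS = {"collections": "collection", "manifests": "manifest", "items": "item"}


def detect_type(resource: dict, url: str) -> Optional[str]:
    """Prefer the textapiType field; fall back to the URL path."""
    declared = resource.get("textapiType")
    if isinstance(declared, str) and declared in KNOWN_TYPES:
        return KNOWN_TYPES[declared]
    for segment in urlparse(url).path.split("/"):
        if segment in PATH_HINTS:
            return PATH_HINTS[segment]
    return None


def rewrite_base(text: str, server_base_url: str, output_base_url: str) -> str:
    """Replace every occurrence of the server base URL in `text`."""
    return text.replace(server_base_url, output_base_url)


body = '{"id": "https://server/api/items/page-001"}'
print(detect_type({}, "https://server/api/items/page-001"))
print(rewrite_base(body, "https://server/api", "http://localhost:8181"))
```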

The URL path structure below server_base_url is preserved exactly. For JSON resources, the last path segment becomes the filename, with a .json extension appended:

URL	Saved to
`.../api/collections/transcriptions`	`output/collections/transcriptions.json`
`.../api/manifests/man-1`	`output/manifests/man-1.json`
`.../api/items/page-001`	`output/items/page-001.json`
`.../api/api/files/html/page.html`	`output/api/files/html/page.html`

JSON API resources get .json appended to their last path segment. Content files with extensions (.html, .css, etc.) keep their original filename.
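
The mapping in the table above can be reproduced with a short sketch. The function name `local_path` is hypothetical, and deciding "has an extension" by looking for a dot in the last segment is an assumption; a slug that itself contains a dot (for example `v1.2`) would be treated as a content file here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path(url: str, server_base_url: str, output_dir: str = "output") -> str:
    """Map a resource URL to a path under `output_dir`."""
    base_path = urlparse(server_base_url).path.rstrip("/")
    path = urlparse(url).path.rstrip("/")
    # Drop the server base path (e.g. "/api") from the front, once.
    if base_path and path.startswith(base_path + "/"):
        path = path[len(base_path):]
    relative = PurePosixPath(path.lstrip("/"))
    if not relative.name:
        raise ValueError(f"URL has no path segment to use as a filename: {url}")
    # Segments without an extension are JSON resources and get ".json".
    if not relative.suffix:
        relative = relative.with_name(relative.name + ".json")
    return str(PurePosixPath(output_dir) / relative)


base = "https://server/api"
print(local_path("https://server/api/manifests/man-1", base))
print(local_path("https://server/api/api/files/html/page.html", base))
```

Only the first occurrence of the base path is stripped, which is why a URL under `.../api/api/files/` ends up in `output/api/files/`.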

Output Structure

output/
  collections/
    transcriptions.json
  manifests/
    manuscript-a.json
  items/
    page-001.json
  api/files/html/
    page-001.html

Requirements

Python 3.8+
pip install requests
