Epic: Improve the Testing Deployment Process #12950

@RayBB

Problem Statement

The Open Library testing deployment page is painfully slow to load. When pushing multiple PRs, removing some, or updating others, the friction is high and the iteration speed sucks. We want to make the whole process fast enough that deploying to testing stops being a bottleneck.

Since this is not user-facing, API stability is a non-concern. We can move fast and choose whatever shape makes the most internal sense.


Current System

The entire testing deployment system lives in a single file: openlibrary/plugins/openlibrary/status.py (557 lines), with its template at openlibrary/templates/status.html (253 lines). It is served entirely via legacy Web.py (Infogami). No FastAPI endpoints or Lit components exist for this system yet.

Routes

Method  Path                 Class               Purpose
GET     /status              status              Main page: renders full HTML with testing state, drift, build results, flags
POST    /status/add          status_add          Queue PR(s) by number or URL
POST    /status/remove       status_remove       Remove PR(s) from queue
POST    /status/enable       status_enable       Stage PR(s) for enabling (applied on next deploy)
POST    /status/disable      status_disable      Stage PR(s) for disabling (applied on next deploy)
POST    /status/pull-latest  status_pull_latest  Stage fetching latest SHA for PR(s) (applied on next deploy)
POST    /status/deploy       status_deploy       Apply all pending changes + trigger Jenkins rebuild
POST    /status/refresh      status_refresh      Evict GitHub drift cache, forcing refresh on next load
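
Since /status/add accepts PRs "by number or URL", the input has to be normalized to PR numbers at some point. The following is a minimal sketch of how that could look, not the code in status.py: the helper name parse_pr_refs, the accepted "#123" form, and the URL pattern are all assumptions made for illustration.

```python
import re

# Assumed URL shape: https://github.com/<owner>/<repo>/pull/<number>[/...]
_PR_URL_RE = re.compile(r"github\.com/[^/\s]+/[^/\s]+/pull/(\d+)")


def parse_pr_refs(raw: str) -> list[int]:
    """Extract unique PR numbers, in input order, from numbers, '#123' refs, or URLs."""
    seen: set[int] = set()
    result: list[int] = []
    # Accept whitespace- or comma-separated input so several PRs can be queued at once.
    for token in re.split(r"[\s,]+", raw.strip()):
        if not token:
            continue
        url_match = _PR_URL_RE.search(token)
        if url_match:
            number = int(url_match.group(1))
        elif token.lstrip("#").isdecimal():
            number = int(token.lstrip("#"))
        else:
            raise ValueError(f"Not a PR number or URL: {token!r}")
        if number not in seen:  # drop duplicates but keep first-seen order
            seen.add(number)
            result.append(number)
    return result
```

Rejecting unknown tokens outright (rather than skipping them) is a design choice for the sketch: a typo in a batch of PRs is reported instead of silently ignored.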

Data Models

All defined in status.py as dataclasses:

TestingState
  last_deploy_at: str          # ISO timestamp
  prs: list[TestingPR]

TestingPR
  pr: int                      # PR number
  commit: str                  # pinned commit SHA
  active: bool
  title: str
  added_at: str                # ISO timestamp
  added_by: str                # OL username
  pull_latest_sha: str         # pending SHA update
  pending_active: bool | None  # pending enable/disable
  author: str                  # GitHub login
  author_avatar: str
  assignee: str
  assignee_avatar: str

DevMergedStatus                # CI build result (from _dev-merged_status.txt)
  git_status: str
  pr_statuses: list[PRStatus]
  footer: str

PRStatus
  pull_line: str
  status: str
  body: str

Persistence

Store      Path / Key                          Contents
JSON file  ./_testing-prs.json                 Full TestingState (PRs + last_deploy_at)
Text file  ./_dev-merged_status.txt            CI build result, parsed into DevMergedStatus
Memcache   status.github_pr_drift (5 min TTL)  Per-PR drift info (head_sha, drift count, merged status)
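
The drift cache behaves like a get-or-fetch lookup with a 5-minute expiry, plus an explicit eviction (what /status/refresh triggers). The sketch below illustrates that behavior with an in-memory dictionary standing in for memcache; the TTLCache class and its method names are hypothetical and do not reflect the actual memcache client used by Open Library.

```python
import time
from typing import Callable

DRIFT_TTL_SECONDS = 5 * 60  # the 5 min TTL listed for status.github_pr_drift


class TTLCache:
    """In-memory stand-in for memcache; illustrative only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: dict[str, tuple[float, object]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], object], ttl: float) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no GitHub call needed
        value = fetch()  # cache miss: this is the slow path
        self._entries[key] = (now + ttl, value)
        return value

    def evict(self, key: str) -> None:
        """Drop the entry so the next load refetches (conceptually, /status/refresh)."""
        self._entries.pop(key, None)
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.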

Auth

_is_maintainer() checks the current user against /usergroup/maintainers or /usergroup/admin. All mutating POST endpoints require this check. The GET /status page hides the testing section from non-maintainers.

External Dependencies

  • GitHub API (api.github.com/repos/internetarchive/openlibrary): fetches PR info (title, author, assignee, head SHA) and drift info (commits behind). Uses config.github_api_token for auth. On error, the affected fields fall back to default values.
  • Jenkins (jenkins.openlibrary.org/job/testing-deploy/buildWithParameters) — triggered on deploy with the full active PR list as GH_REPO_AND_BRANCH parameter. Uses config.jenkins_token for auth.

Current Performance Issues

The page loads synchronously on every GET:

  1. Reads _testing-prs.json from disk
  2. Calls _get_drift_info() which hits memcache (fast) or GitHub API (slow — one request per PR)
  3. Reads _dev-merged_status.txt from disk
  4. Renders full HTML server-side with inline <script> and <style>

The main bottleneck is the GitHub API calls on a cache miss: each PR needs at least one API call (GET /pulls/{num}), and drifted PRs need a second (GET /compare/{base}...{head}). These calls are currently made synchronously and sequentially, one blocking HTTP request at a time. Moving to async httpx in FastAPI would let the server fetch drift data for all PRs concurrently, so the total wait would be roughly that of the slowest PR (at most two dependent requests) rather than growing linearly with the number of PRs.
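
To make the difference concrete, here is a small sketch comparing sequential and concurrent fetching with asyncio. It does not call GitHub: fetch_drift is a hypothetical stand-in that sleeps to simulate network latency, and the PR numbers and payload are made up. A real implementation would replace the sleep with async httpx requests.

```python
import asyncio
import time


async def fetch_drift(pr: int, latency: float = 0.05) -> dict:
    """Stand-in for the GitHub request(s) for one PR; sleeps instead of using the network."""
    await asyncio.sleep(latency)
    return {"pr": pr, "drift": 0}  # placeholder payload, not real GitHub data


async def fetch_sequential(prs: list[int]) -> list[dict]:
    # One request at a time: total time grows with len(prs).
    return [await fetch_drift(pr) for pr in prs]


async def fetch_concurrent(prs: list[int]) -> list[dict]:
    # gather() runs all coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fetch_drift(pr) for pr in prs))


async def main() -> None:
    prs = [101, 102, 103, 104, 105]  # made-up PR numbers
    for fn in (fetch_sequential, fetch_concurrent):
        start = time.perf_counter()
        results = await fn(prs)
        print(f"{fn.__name__}: {len(results)} PRs in {time.perf_counter() - start:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

With simulated latency, the sequential version takes about one latency per PR while the concurrent version takes about one latency in total. Real GitHub calls may also be subject to rate limits, which this sketch does not model.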


Proposed Phases

Phase 1 — Read-only FastAPI Endpoints

Expose structured JSON APIs for the current testing state. Build alongside the existing Web.py handlers — no need to migrate the whole file at once.

Design notes:

  • Expose structured JSON endpoints that cover what the page currently shows: the PR list with metadata and drift, the build result, server info, feature flags
  • Reuse the existing data models (TestingState, TestingPR, etc.) — they already have to_dict() methods
  • Share the same persistence layer (_load_testing_state(), memcache) — FastAPI and Web.py run in the same container and can both read _testing-prs.json
  • Auth: Use the same _is_maintainer() check for now (it calls get_current_user() which works in both contexts)
  • Cache drift info in the response with Cache-Control headers (respect the existing 5-minute drift TTL)
  • Use async httpx to fetch GitHub data concurrently (the current Web.py code calls _github_get sequentially per PR)

Tech: FastAPI route in openlibrary/fastapi/ (new file, e.g. testing.py)
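
One way to "respect the existing 5-minute drift TTL" in the response is to set max-age to whatever time remains on the server-side cache entry, so clients never hold drift data longer than the server would. The sketch below assumes the server knows when the drift entry was cached; the function name cache_control_for and the choice of the "private" and "no-cache" directives are illustrative assumptions, not decisions made in this issue.

```python
from datetime import datetime, timedelta, timezone

DRIFT_TTL = timedelta(minutes=5)  # the existing drift TTL


def cache_control_for(cached_at: datetime, now: datetime, ttl: timedelta = DRIFT_TTL) -> str:
    """Return a Cache-Control value that expires when the server-side drift entry does."""
    remaining = int((cached_at + ttl - now).total_seconds())
    if remaining <= 0:
        # The server-side entry has already expired; ask clients to revalidate.
        return "no-cache"
    # "private": the data is maintainer-only and should not sit in shared caches.
    return f"private, max-age={remaining}"


if __name__ == "__main__":
    cached = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(cache_control_for(cached, cached + timedelta(minutes=2)))
```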

Success criteria:

  • A client can fetch everything the current testing page displays in a few lightweight JSON requests
  • FastAPI endpoints are registered and return correct data
  • Existing Web.py page continues to work unchanged

Phase 2 — Lit Front End (Read-only)

Rebuild the testing page UI using Lit components that consume the Phase 1 APIs.

Key considerations from the current template (status.html):

  • The page has 9 distinct sections: deploy banner, add-PR form, PR table with drift indicators, action buttons, deploy button, build results, system info, feature flags, refresh button
  • The PR table is the most complex piece — per-row state (merged/inactive/new/pending-enable/pending-disable), drift levels, and checkbox selection
  • Currently all interaction is via synchronous form POSTs with full page reloads
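
The per-row state could be computed once, either in the API response or in the component, rather than being scattered through template conditionals. The sketch below is written in Python for consistency with the backend and is heavily assumption-laden: the precedence order, the meaning of "new" (queued after the last deploy), the "active" default label, and the source of the merged flag (taken here to come from the drift info) are all guesses for illustration, not behavior confirmed by status.html.

```python
from datetime import datetime
from typing import Optional


def row_state(
    *,
    merged: bool,
    active: bool,
    pending_active: Optional[bool],
    added_at: str,
    last_deploy_at: str,
) -> str:
    """Pick one display state for a PR row. The precedence order is an assumption."""
    if merged:
        return "merged"
    if pending_active is True and not active:
        return "pending-enable"
    if pending_active is False and active:
        return "pending-disable"
    # Assumed meaning of "new": queued after the most recent deploy.
    # Both timestamps must be ISO strings in the same (naive or aware) form.
    if datetime.fromisoformat(added_at) > datetime.fromisoformat(last_deploy_at):
        return "new"
    return "active" if active else "inactive"
```

Returning a single label keeps the rendering logic a simple lookup, at the cost of hiding combinations (for example, a new PR that is also pending-disable); whether that trade-off fits the real UI is an open question.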

Goals:

  • Much faster initial load (fetch JSON, render client-side)
  • No full-page reloads on actions
  • Cleaner, maintainable components
  • Same visual layout, slightly friendlier UX

Out of scope: Writes still hit the old Web.py POST handlers; this phase is view-only.

Phase 3 — Write APIs

Add FastAPI endpoints for mutating operations: queue PRs, remove PRs, stage enable/disable, pull latest SHA, trigger deploy, evict drift cache.

Design notes:

  • Batch operations should be first-class — accept multiple PRs in add/remove/update operations
  • Only one deploy at a time (Jenkins queue handles serialization)
  • Auth: Require maintainer group for all write endpoints
  • File-based persistence: both Web.py and FastAPI read/write _testing-prs.json — use a file lock or atomic write pattern (write + rename)
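
The "write + rename" pattern mentioned above can be sketched as follows. This is a generic illustration, not the existing _save_testing_state() implementation; the function name atomic_write_json is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over the target."""
    # The temp file must live on the same filesystem as the target for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see either the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

An atomic rename prevents readers from seeing a half-written file, but it does not prevent lost updates when two processes read, modify, and write at the same time. If Web.py and FastAPI can both write during the migration, a file lock around the whole read-modify-write cycle would still be needed.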

Phase 4 — Lit UI + Write APIs

Wire the Lit front end to the Phase 3 write APIs. At the end of this phase, the old testing page is fully replaced.

UX wins: Instant feedback, no full-page reloads, batch operations feel snappy.


Implementation Notes

Decision                  Rationale
FastAPI over Web.py       Modern async support, auto-generated docs, easier to maintain
Lit over Vue              We're broadly moving away from Vue; Lit is lighter and keeps us framework-agnostic
API shape                 Internal-only, so optimize for speed and clarity over backwards compatibility
Share persistence layer   FastAPI and Web.py run in the same container; reuse _load_testing_state() / _save_testing_state() directly
Dev-merged status file    CI writes _dev-merged_status.txt; FastAPI APIs should read it the same way Web.py does
Memcache for drift        FastAPI can share the same memcache pool; reuse _DRIFT_CACHE_KEY to avoid duplicate GitHub API calls
GitHub API token          Shared via config.github_api_token; already works for both Web.py and FastAPI contexts

Rollout Strategy

  1. Phase 1 + 2 ship at /status/v2 (read-only, Lit front-end) without touching the legacy page
  2. Phase 3 + 4 add writes to /status/v2; once parity is proven, retire /status and move /status/v2 to /status
  3. During migration, the existing /status page remains the source of truth for writes

Success Criteria

  • Testing page loads in < 1s (vs. current multi-second wait)
  • Queuing / removing / updating PRs feels instantaneous (no full reload)
  • Read-only FastAPI endpoints return the same data as the current page
  • Old testing page can be safely deleted

Settled Questions

  • [x] Do we want to expose any of these APIs for automation (e.g., CLI or bot-driven deploys), or keep them internal and undocumented? — Internal and documented by FastAPI's auto-generated OpenAPI schema. A CLI tool would be nice to have one day but is out of scope for now.
  • [x] Should batch operations (queue multiple PRs at once) be a first-class API concept? Yes. Accept multiple PRs in add/remove/update operations. However, there will only ever be one deploy active at a time (Jenkins serializes the queue).

Labels

  • Affects: Admin/Maintenance (issues relating to support scripts, bots, cron jobs and admin web pages)
  • Type: Feature Request (issue describes a feature or enhancement we'd like to implement)