permitio · dshoen619 · Jun 23, 2026 · Jun 23, 2026 · Jun 23, 2026 · Jun 23, 2026
diff --git a/.gitignore b/.gitignore
@@ -137,3 +137,6 @@ dmypy.json
 *.iml
 
 .DS_Store
+
+# Private Claude Code working artifacts (plans/specs) — never commit
+.claude/
diff --git a/app-tests/git-leak/README.md b/app-tests/git-leak/README.md
@@ -0,0 +1,82 @@
+# OPAL git-leak / resilience test bed
+
+Reproduces (as failing tests) the four issues fixed by PR2–PR5: memory leak,
+offline-repo hang, slow serial boot, broadcaster no-reconnect.
+
+Every assertion is driven through `GET /internal/git-fetcher-cache-stats`, which
+**this PR (PR1) adds** — it does not exist on `master`. So the suite runs against
+*this branch*: the leak/offline tests fail here *until PR2/PR3 land*, then go
+green. Run against true `master` they would all error at setup on the missing
+endpoint, not "fail for the targeted bug."
+
+## Stack
+- `opal_server` (single worker, scopes on, Postgres broadcaster, built from `docker/Dockerfile`)
+- `redis`, `postgres`, `gitea` (+ one-shot `gitea-admin` and `seed` sidecars)
+- `blackhole` (alpine/socat: accepts TCP then never answers — the offline repo)
+
+Only `opal_server` (`:7002`) and `gitea` (`:13000` on the host) are published;
+Postgres and `blackhole` are internal to the compose network.
+
+## Helpers (`helpers.py`)
+- `OpalServerClient` — drive opal over HTTP (`stats`, `put_scope`, `delete_scope`,
+  `refresh_all`, `get_scope_policy`, `list_scope_ids`, `delete_all_scopes`).
+- `GiteaAdmin` — host-side Gitea admin client (`list_repos`, `repo_exists`,
+  `create_repo`, `delete_repo`); also exposed as the `gitea_admin` pytest fixture.
+- `make_repo_unreachable(name)` — git URL on the `blackhole` sidecar (completes
+  the TCP handshake, never answers) so the clone hangs for the offline-repo test.
+- `bounce_postgres(down_seconds)` — stop Postgres, then `up -d --wait` it back to
+  simulate a broadcaster outage and await readiness before the recovery poll.
+
+## Run
+```bash
+cd app-tests/git-leak
+python -m pytest -v --boot-scopes=50              # full set
+python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates
+```
+Useful flags: `--boot-scopes=N` (any N), `--keep-stack` (skip teardown),
+env `BOOT_TARGET_SECONDS=120` (tighten the boot gate).
+
+## Expected behavior
+
+Gate-coverage matrix (what each flagship test actually does):
+
+| Test | Role | Behaviour here |
+|---|---|---|
+| `test_churn_releases_caches` | **gate (PR2)** | FAILS without the PR2 leak fix — delete leaves the caches populated; flips green when PR2 lands |
+| `test_offline_repo_does_not_block_healthy_scopes` | **gate (PR3)** | FAILS without the PR3 fetch timeout — 40 hung clones starve the executor so a healthy scope never serves; flips green when PR3 lands |
+| `test_boot_loads_all_scopes` | **baseline → gate (PR4)** | PASSES with the loose default target; set `BOOT_TARGET_SECONDS` low (plan: 120 @ 50) on PR4 to gate the parallel-boot fix |
+| `test_repeat_sync_rss_stays_bounded` | **RSS guard** | PASSES; an RSS-budget guard against per-sync allocation leaks (the cache *count* can't grow for any impl, so there is no count assertion — see below) |
+| `test_server_recovers_after_postgres_bounce` | **guard (PER-15065)** | PASSES on this branch (which has #915); guards the in-place broadcaster reconnect |
+
+Notes on the two guards:
+- `test_repeat_sync_rss_stays_bounded` — clone paths are keyed by the repo URL,
+  so re-syncing identical scopes reuses cache entries and the cache *counts*
+  can't grow for any implementation; the load-bearing assertion is therefore on
+  RSS only (a `len(repos)` check would be tautological and is intentionally
+  omitted), guarding against a regression that leaks per-sync allocations.
+- `test_server_recovers_after_postgres_bounce` — runs **2 workers** so the
+  Postgres backbone is actually exercised (cross-worker fan-out needs >=2
+  workers; a single worker fans out in-process and never touches the backbone).
+  Across a transient bounce it asserts the gunicorn **worker PIDs are unchanged**
+  — proving #915's reconnecting broadcaster recovered the reader *in place*
+  rather than gunicorn respawning a graceful-shutdown worker (the pre-fix
+  behaviour) — and that a scope PUT after the bounce becomes servable, proving
+  the broadcast/sync path recovered (not just HTTP).
+
+## Requires
+Docker + docker compose v2, plus host Python with `pytest pytest-timeout requests GitPython`.
+
+## Notes
+- Auth is disabled in the stack: `OPAL_AUTH_PUBLIC_KEY` is left unset so the JWT
+  verifier is disabled and the harness can call scope routes without minting JWTs.
+  Local test bed only; never a production setting. (The `/internal` endpoint is
+  registered with the same `JWTAuthenticator` dependency as the other routes, so
+  it is protected when JWT verification is enabled and open only here.)
+- The server runs a **single** uvicorn worker. The `GitPolicyFetcher` caches read
+  by `/internal/git-fetcher-cache-stats` are per-process, so a multi-worker stack
+  would make a round-robin read miss the worker that fetched and let a `== 0`
+  drain assertion pass falsely. One worker makes every cache read deterministic;
+  the leak/boot/offline bugs all reproduce single-worker.
+- First-sync of a fresh scope takes the clone path, which fills only `repo_locks`;
+  `repos` / `repos_last_fetched` are filled by the discover/fetch path on a second
+  sync, so the load helpers issue a `refresh_all()` before asserting on `repos`.
diff --git a/app-tests/git-leak/conftest.py b/app-tests/git-leak/conftest.py
@@ -0,0 +1,133 @@
+import os
+import shutil
+
+import pytest
+from helpers import (
+    HEALTHY_PROBE_REPO,
+    GiteaAdmin,
+    OpalServerClient,
+    compose,
+    list_seeded_repos,
+    worker_pids,
+)
+
+
+def pytest_addoption(parser):
+    parser.addoption(
+        "--boot-scopes",
+        action="store",
+        default="50",
+        help="number of repos to seed/boot (default 50)",
+    )
+    parser.addoption(
+        "--keep-stack",
+        action="store_true",
+        default=False,
+        help="do not tear the compose stack down after the run",
+    )
+
+
+@pytest.fixture(scope="session")
+def repo_count(request) -> int:
+    return int(request.config.getoption("--boot-scopes"))
+
+
+@pytest.fixture(scope="session")
+def stack(request, repo_count):
+    # Defense-in-depth: this docker-compose suite is already excluded from the
+    # repo's default `pytest` run via `testpaths = packages` in pytest.ini, so
+    # the unit-test CI matrix never collects it. If it is ever collected in an
+    # environment without docker, skip cleanly instead of erroring.
+    if shutil.which("docker") is None:
+        pytest.skip("docker (compose) is required for the git-leak test bed")
+    os.environ["REPO_COUNT"] = str(repo_count)
+    # build + start infra; seed runs to completion then exits
+    compose("up", "-d", "--build")
+    # block until seeding sidecar has finished creating repos. compose() raises
+    # (with output) if the seed container exited non-zero, so a hard seed
+    # failure surfaces here rather than as a confusing later test failure.
+    compose("wait", "seed")
+    # Verify the seed actually produced all N repos before any test runs: a
+    # partial seed would otherwise look like a server bug when the load gate
+    # can't reach N. Fail loudly with the gap.
+    # include the reserved probe repo the resilience test relies on, so a
+    # partial seed of it is caught here too rather than as a later test failure
+    expected = set(list_seeded_repos(repo_count)) | {HEALTHY_PROBE_REPO}
+    present = set(GiteaAdmin().list_repos())
+    missing = expected - present
+    assert not missing, (
+        f"seed incomplete: {len(missing)}/{repo_count} repos missing "
+        f"(e.g. {sorted(missing)[:5]})"
+    )
+    client = OpalServerClient()
+    client.wait_healthy()
+    yield client
+    if not request.config.getoption("--keep-stack"):
+        compose("down", "-v")
+
+
+@pytest.fixture()
+def opal(stack) -> OpalServerClient:
+    # The compose stack is session-scoped (one server for the whole run), but
+    # scopes must not leak between tests: clone paths are keyed by repo URL, so
+    # a scope left behind by one test shares a cache entry with any later test
+    # using the same seeded repo and would pollute its drain assertions.
+    #
+    # Delete every scope the *server* currently knows (not just this client's
+    # tracked set) at setup, so a scope orphaned by a prior failed test can't
+    # contaminate this one; then again on teardown.
+    stack.delete_all_scopes()
+    # Guard the single-worker invariant the cache gates depend on: if a prior
+    # opal_multiworker teardown failed to restore 1 worker, the per-process cache
+    # reads would be nondeterministic here (a `== 0` drain could false-pass).
+    # Fail loudly and ordering-independently rather than silently mis-measure.
+    assert (
+        len(worker_pids()) == 1
+    ), f"expected a single-worker stack, found workers {sorted(worker_pids())}"
+    yield stack
+    stack.delete_all_scopes()
+
+
+@pytest.fixture()
+def opal_multiworker(stack) -> OpalServerClient:
+    """opal_server reconfigured to 2 gunicorn workers, for the broadcaster
+    test.
+
+    The session stack is single-worker (the right call for the per-
+    process cache drain assertions), but the Postgres broadcaster's
+    cross-worker fan-out — the reason it is in this compose file at all
+    — is only exercised with >=2 workers (references/debug-pubsub.md
+    §3-4). This force-recreates opal_server with 2 workers for one test,
+    then restores the single-worker stack on teardown so the cache tests
+    keep their determinism. Each side starts from a clean slate: the
+    recreate wipes the container's on-disk clones, and clearing scopes
+    stops a leftover scope (whose clone is URL-keyed) from being re-
+    cloned on boot.
+    """
+    os.environ["OPAL_TEST_WORKERS"] = "2"
+    try:
+        # --no-deps: don't bounce redis/postgres/gitea; --force-recreate: apply
+        # the new worker count. No --wait (opal_server has no compose
+        # healthcheck) — wait_healthy() polls the HTTP surface instead.
+        compose("up", "-d", "--no-deps", "--force-recreate", "opal_server")
+        stack.wait_healthy()
+        stack.delete_all_scopes()
+        yield stack
+    finally:
+        os.environ["OPAL_TEST_WORKERS"] = "1"
+        compose("up", "-d", "--no-deps", "--force-recreate", "opal_server")
+        stack.wait_healthy()
+        stack.delete_all_scopes()
+        # Verify the restore actually reduced the stack back to one worker. If it
+        # did not (a botched recreate), fail loudly here rather than leave a
+        # 2-worker stack that would silently break later single-worker cache
+        # gates' determinism.
+        assert (
+            len(worker_pids()) == 1
+        ), f"opal_multiworker teardown left workers {sorted(worker_pids())}, expected 1"
+
+
+@pytest.fixture(scope="session")
+def gitea_admin(stack) -> GiteaAdmin:
+    """Host-side Gitea admin client (depends on `stack` so Gitea is up)."""
+    return GiteaAdmin()
diff --git a/app-tests/git-leak/docker-compose.yml b/app-tests/git-leak/docker-compose.yml
@@ -0,0 +1,141 @@
+name: opal-git-leak-test
+
+services:
+  redis:
+    image: redis:7-alpine
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 2s
+      timeout: 3s
+      retries: 30
+
+  postgres:
+    image: postgres:16-alpine
+    environment:
+      POSTGRES_USER: opal
+      POSTGRES_PASSWORD: opal
+      POSTGRES_DB: opal
+    # not published to the host: only opal_server reaches it over the compose
+    # network, and bounce_postgres() uses `docker compose stop/start`. Publishing
+    # 5432 would collide with any Postgres already running on the host.
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U opal"]
+      interval: 2s
+      timeout: 3s
+      retries: 30
+
+  gitea:
+    image: gitea/gitea:1.21
+    environment:
+      GITEA__security__INSTALL_LOCK: "true"
+      GITEA__server__ROOT_URL: "http://gitea:3000/"
+      GITEA__database__DB_TYPE: "sqlite3"
+    # published on 13000 (not 3000) for the host-side GiteaAdmin helper; the
+    # uncommon port avoids the usual :3000 clash. opal_server and the seed
+    # sidecar still reach it over the compose network via http://gitea:3000.
+    ports:
+      - "13000:3000"
+    volumes:
+      - gitea-data:/data
+    healthcheck:
+      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/v1/version || exit 1"]
+      interval: 3s
+      timeout: 5s
+      retries: 40
+
+  gitea-admin:
+    # creates the admin user once gitea is healthy
+    image: gitea/gitea:1.21
+    depends_on:
+      gitea:
+        condition: service_healthy
+    user: git
+    entrypoint: ["/bin/sh", "-c"]
+    # Tolerate the idempotent "already exists" case, and RETRY on "database is
+    # locked": this CLI mutates the same SQLite file the live gitea server holds
+    # open, so it can lose a lock race and fail transiently. Any other failure
+    # aborts so `seed` (which depends on this completing) doesn't run against a
+    # Gitea with no admin user and fail with a confusing 401.
+    # `|` (literal) block, not `>` (folded): the `gitea admin user create` call is
+    # kept on ONE line so YAML can't fold a newline into the middle of its args
+    # (that would run `--email ...` as its own command -> exit 127).
+    command:
+      - |
+        for attempt in 1 2 3 4 5 6; do
+          out=$$(gitea admin user create --username opaladmin --password opaladmin --email admin@example.com --admin --must-change-password=false --config /data/gitea/conf/app.ini 2>&1)
+          rc=$$?
+          echo "$$out"
+          if [ $$rc -eq 0 ] || echo "$$out" | grep -qi "already exist"; then exit 0; fi
+          if echo "$$out" | grep -qi "database is locked"; then echo "gitea db locked; retry $$attempt"; sleep 2; continue; fi
+          exit $$rc
+        done
+        echo "gitea admin create failed after retries"
+        exit 1
+    volumes:
+      - gitea-data:/data
+    restart: "no"
+
+  blackhole:
+    # Accepts the TCP handshake then never answers — a clone connects and
+    # blocks reading the git smart-HTTP response, holding the fetch executor.
+    # Deterministic, unlike a TEST-NET-1 address which many networks reject
+    # fast with ICMP-unreachable (so the clone would fail fast, not hang).
+    image: alpine/socat:1.8.0.3
+    command: ["TCP-LISTEN:80,fork,reuseaddr", "SYSTEM:sleep 3600"]
+
+  seed:
+    build: ./seed
+    depends_on:
+      gitea:
+        condition: service_healthy
+      gitea-admin:
+        condition: service_completed_successfully
+    environment:
+      GITEA_URL: "http://gitea:3000"
+      GITEA_ADMIN_USER: "opaladmin"
+      GITEA_ADMIN_PASSWORD: "opaladmin"
+      REPO_COUNT: "${REPO_COUNT:-50}"
+    restart: "no"
+
+  opal_server:
+    build:
+      context: ../..
+      dockerfile: docker/Dockerfile
+      target: server
+    environment:
+      # Default single worker: the GitPolicyFetcher caches read by
+      # /internal/git-fetcher-cache-stats are per-process, so with >1 worker a
+      # round-robin read can miss the worker that fetched and make a `== 0`
+      # drain assertion pass falsely. One worker makes every cache read
+      # deterministic. The leak/boot/offline bugs all reproduce single-worker.
+      # The postgres-bounce test (test_resilience.py) overrides this to 2 via
+      # OPAL_TEST_WORKERS for its own container, because cross-worker fan-out
+      # over the Postgres backbone only happens with >=2 workers
+      # (references/debug-pubsub.md §3-4) — a single worker can't tell #915's
+      # in-place broadcaster reconnect from a plain worker respawn.
+      UVICORN_NUM_WORKERS: "${OPAL_TEST_WORKERS:-1}"
+      OPAL_SCOPES: "1"
+      OPAL_REDIS_URL: "redis://redis:6379"
+      OPAL_BROADCAST_URI: "postgres://opal:opal@postgres:5432/opal"
+      # Make the broadcaster reconnect fast + deterministic for the postgres-
+      # bounce test (defaults are 30s max backoff / 2s settle). Harmless to the
+      # single-worker tests, which don't exercise cross-worker fan-out.
+      OPAL_BROADCAST_RECONNECT_BACKOFF_MAX_SECONDS: "2"
+      OPAL_BROADCAST_RESYNC_SETTLE_SECONDS: "2"
+      OPAL_BASE_DIR: "/opal"
+      OPAL_POLICY_REFRESH_INTERVAL: "0"
+      OPAL_DEBUG_INTERNAL_STATS: "1"
+      # OPAL_AUTH_PUBLIC_KEY is intentionally left unset: with no public key the
+      # JWT verifier is disabled, so the harness can call scope routes without
+      # minting JWTs. Local test bed only; never a production setting.
+      OPAL_LOG_FORMAT_INCLUDE_PID: "true"
+    ports:
+      - "7002:7002"
+    depends_on:
+      redis:
+        condition: service_healthy
+      postgres:
+        condition: service_healthy
+
+volumes:
+  gitea-data: