Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
1f9c816
feat(opal-server): gated /internal git-fetcher cache stats endpoint
dshoen619 Jun 23, 2026
c176ce9
test(git-leak): add OPAL git leak/resilience test bed
dshoen619 Jun 23, 2026
bd49676
test(git-leak): add GiteaAdmin and make_repo_unreachable helpers
dshoen619 Jun 23, 2026
afeb969
test(git-leak): correct postgres-bounce framing (passes on master)
dshoen619 Jun 23, 2026
92353f6
style(git-leak): apply black/isort/docformatter (pre-commit)
dshoen619 Jun 23, 2026
db54ae6
test: scope root pytest collection to packages/ (exclude git-leak bed)
dshoen619 Jun 23, 2026
6f9a089
test(git-leak): address Copilot review feedback
dshoen619 Jun 23, 2026
1b23ac0
test(git-leak): isolate scopes per test and fix false repeat-sync gate
dshoen619 Jun 23, 2026
6046a10
test(git-leak): make the regression gates trustworthy (address PR rev…
dshoen619 Jun 24, 2026
f810db8
style(git-leak): apply black/isort/docformatter (pre-commit)
dshoen619 Jun 24, 2026
82ff33a
test(git-leak): tighten stat polling and pin test-bed images (PR review)
dshoen619 Jun 24, 2026
d719c34
Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…
dshoen619 Jun 28, 2026
75ad43a
test(git-leak): isolate offline-hang healthy probe to a never-cloned …
dshoen619 Jun 28, 2026
15f3cfe
test(git-leak): harden harness teardown and tighten assertions (PR re…
dshoen619 Jun 28, 2026
a502f2e
Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…
dshoen619 Jun 30, 2026
8e24cb0
test(git-leak): make PR4/PR5 tests genuine gates (address PR review)
dshoen619 Jul 1, 2026
6f002fd
test(git-leak): address PR review round 3 (fixture robustness + cleanup)
dshoen619 Jul 1, 2026
5aae85c
test(git-leak): harden gates per multi-agent review (H1 + M1-M6)
dshoen619 Jul 1, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -137,3 +137,6 @@ dmypy.json
*.iml

.DS_Store

# Private Claude Code working artifacts (plans/specs) — never commit
.claude/
82 changes: 82 additions & 0 deletions app-tests/git-leak/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# OPAL git-leak / resilience test bed

Reproduces (as failing tests) the four issues fixed by PR2–PR5: memory leak,
offline-repo hang, slow serial boot, broadcaster no-reconnect.

Every assertion is driven through `GET /internal/git-fetcher-cache-stats`, which
**this PR (PR1) adds** — it does not exist on `master`. So the suite runs against
*this branch*: the leak/offline tests fail here *until PR2/PR3 land*, then go
green. Run against true `master` they would all error at setup on the missing
endpoint, not "fail for the targeted bug."

## Stack
- `opal_server` (single worker, scopes on, Postgres broadcaster, built from `docker/Dockerfile`)
- `redis`, `postgres`, `gitea` (+ one-shot `gitea-admin` and `seed` sidecars)
- `blackhole` (alpine/socat: accepts TCP then never answers — the offline repo)

Only `opal_server` (`:7002`) and `gitea` (`:13000` on the host) are published;
Postgres and `blackhole` are internal to the compose network.

## Helpers (`helpers.py`)
- `OpalServerClient` — drive opal over HTTP (`stats`, `put_scope`, `delete_scope`,
`refresh_all`, `get_scope_policy`, `list_scope_ids`, `delete_all_scopes`).
- `GiteaAdmin` — host-side Gitea admin client (`list_repos`, `repo_exists`,
`create_repo`, `delete_repo`); also exposed as the `gitea_admin` pytest fixture.
- `make_repo_unreachable(name)` — git URL on the `blackhole` sidecar (completes
the TCP handshake, never answers) so the clone hangs for the offline-repo test.
- `bounce_postgres(down_seconds)` — stop Postgres, then `up -d --wait` it back to
simulate a broadcaster outage and await readiness before the recovery poll.

## Run
```bash
cd app-tests/git-leak
python -m pytest -v --boot-scopes=50 # full set
python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates
```
Useful flags: `--boot-scopes=N` (any N), `--keep-stack` (skip teardown),
env `BOOT_TARGET_SECONDS=120` (tighten the boot gate).

## Expected behavior

Gate-coverage matrix (what each flagship test actually does):

| Test | Role | Behaviour here |
|---|---|---|
| `test_churn_releases_caches` | **gate (PR2)** | FAILS without the PR2 leak fix — delete leaves the caches populated; flips green when PR2 lands |
| `test_offline_repo_does_not_block_healthy_scopes` | **gate (PR3)** | FAILS without the PR3 fetch timeout — 40 hung clones starve the executor so a healthy scope never serves; flips green when PR3 lands |
| `test_boot_loads_all_scopes` | **baseline → gate (PR4)** | PASSES with the loose default target; set `BOOT_TARGET_SECONDS` low (plan: 120 @ 50) on PR4 to gate the parallel-boot fix |
| `test_repeat_sync_rss_stays_bounded` | **RSS guard** | PASSES; an RSS-budget guard against per-sync allocation leaks (the cache *count* can't grow for any impl, so there is no count assertion — see below) |
| `test_server_recovers_after_postgres_bounce` | **guard (PER-15065)** | PASSES on this branch (which has #915); guards the in-place broadcaster reconnect |

Notes on the two guards:
- `test_repeat_sync_rss_stays_bounded` — clone paths are keyed by the repo URL,
so re-syncing identical scopes reuses cache entries and the cache *counts*
can't grow for any implementation; the load-bearing assertion is therefore on
RSS only (a `len(repos)` check would be tautological and is intentionally
omitted), guarding against a regression that leaks per-sync allocations.
- `test_server_recovers_after_postgres_bounce` — runs **2 workers** so the
Postgres backbone is actually exercised (cross-worker fan-out needs >=2
workers; a single worker fans out in-process and never touches the backbone).
Across a transient bounce it asserts the gunicorn **worker PIDs are unchanged**
— proving #915's reconnecting broadcaster recovered the reader *in place*
rather than gunicorn respawning a graceful-shutdown worker (the pre-fix
behaviour) — and that a scope PUT after the bounce becomes servable, proving
the broadcast/sync path recovered (not just HTTP).

## Requires
Docker + docker compose v2, plus host Python with `pytest pytest-timeout requests GitPython`.

## Notes
- Auth is disabled in the stack: `OPAL_AUTH_PUBLIC_KEY` is left unset so the JWT
verifier is disabled and the harness can call scope routes without minting JWTs.
Local test bed only; never a production setting. (The `/internal` endpoint is
registered with the same `JWTAuthenticator` dependency as the other routes, so
it is protected when JWT verification is enabled and open only here.)
- The server runs a **single** uvicorn worker. The `GitPolicyFetcher` caches read
by `/internal/git-fetcher-cache-stats` are per-process, so a multi-worker stack
would make a round-robin read miss the worker that fetched and let a `== 0`
drain assertion pass falsely. One worker makes every cache read deterministic;
the leak/boot/offline bugs all reproduce single-worker.
- First-sync of a fresh scope takes the clone path, which fills only `repo_locks`;
`repos` / `repos_last_fetched` are filled by the discover/fetch path on a second
sync, so the load helpers issue a `refresh_all()` before asserting on `repos`.
133 changes: 133 additions & 0 deletions app-tests/git-leak/conftest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
import os
import shutil

import pytest
from helpers import (
HEALTHY_PROBE_REPO,
GiteaAdmin,
OpalServerClient,
compose,
list_seeded_repos,
worker_pids,
)


def pytest_addoption(parser):
parser.addoption(
"--boot-scopes",
action="store",
default="50",
help="number of repos to seed/boot (default 50)",
)
parser.addoption(
"--keep-stack",
action="store_true",
default=False,
help="do not tear the compose stack down after the run",
)


@pytest.fixture(scope="session")
def repo_count(request) -> int:
return int(request.config.getoption("--boot-scopes"))


@pytest.fixture(scope="session")
def stack(request, repo_count):
# Defense-in-depth: this docker-compose suite is already excluded from the
# repo's default `pytest` run via `testpaths = packages` in pytest.ini, so
# the unit-test CI matrix never collects it. If it is ever collected in an
# environment without docker, skip cleanly instead of erroring.
if shutil.which("docker") is None:
pytest.skip("docker (compose) is required for the git-leak test bed")
os.environ["REPO_COUNT"] = str(repo_count)
# build + start infra; seed runs to completion then exits
compose("up", "-d", "--build")
# block until seeding sidecar has finished creating repos. compose() raises
# (with output) if the seed container exited non-zero, so a hard seed
# failure surfaces here rather than as a confusing later test failure.
compose("wait", "seed")
Comment thread
dshoen619 marked this conversation as resolved.
Comment thread
dshoen619 marked this conversation as resolved.
# Verify the seed actually produced all N repos before any test runs: a
# partial seed would otherwise look like a server bug when the load gate
# can't reach N. Fail loudly with the gap.
# include the reserved probe repo the resilience test relies on, so a
# partial seed of it is caught here too rather than as a later test failure
expected = set(list_seeded_repos(repo_count)) | {HEALTHY_PROBE_REPO}
present = set(GiteaAdmin().list_repos())
missing = expected - present
assert not missing, (
f"seed incomplete: {len(missing)}/{repo_count} repos missing "
f"(e.g. {sorted(missing)[:5]})"
)
client = OpalServerClient()
client.wait_healthy()
yield client
if not request.config.getoption("--keep-stack"):
compose("down", "-v")


@pytest.fixture()
def opal(stack) -> OpalServerClient:
# The compose stack is session-scoped (one server for the whole run), but
# scopes must not leak between tests: clone paths are keyed by repo URL, so
# a scope left behind by one test shares a cache entry with any later test
# using the same seeded repo and would pollute its drain assertions.
#
# Delete every scope the *server* currently knows (not just this client's
# tracked set) at setup, so a scope orphaned by a prior failed test can't
# contaminate this one; then again on teardown.
stack.delete_all_scopes()
# Guard the single-worker invariant the cache gates depend on: if a prior
# opal_multiworker teardown failed to restore 1 worker, the per-process cache
# reads would be nondeterministic here (a `== 0` drain could false-pass).
# Fail loudly and ordering-independently rather than silently mis-measure.
assert (
len(worker_pids()) == 1
), f"expected a single-worker stack, found workers {sorted(worker_pids())}"
yield stack
stack.delete_all_scopes()


@pytest.fixture()
def opal_multiworker(stack) -> OpalServerClient:
"""opal_server reconfigured to 2 gunicorn workers, for the broadcaster
test.

The session stack is single-worker (the right call for the per-
process cache drain assertions), but the Postgres broadcaster's
cross-worker fan-out — the reason it is in this compose file at all
— is only exercised with >=2 workers (references/debug-pubsub.md
§3-4). This force-recreates opal_server with 2 workers for one test,
then restores the single-worker stack on teardown so the cache tests
keep their determinism. Each side starts from a clean slate: the
recreate wipes the container's on-disk clones, and clearing scopes
stops a leftover scope (whose clone is URL-keyed) from being re-
cloned on boot.
"""
os.environ["OPAL_TEST_WORKERS"] = "2"
try:
# --no-deps: don't bounce redis/postgres/gitea; --force-recreate: apply
# the new worker count. No --wait (opal_server has no compose
# healthcheck) — wait_healthy() polls the HTTP surface instead.
compose("up", "-d", "--no-deps", "--force-recreate", "opal_server")
stack.wait_healthy()
stack.delete_all_scopes()
yield stack
finally:
os.environ["OPAL_TEST_WORKERS"] = "1"
compose("up", "-d", "--no-deps", "--force-recreate", "opal_server")
stack.wait_healthy()
stack.delete_all_scopes()
# Verify the restore actually reduced the stack back to one worker. If it
# did not (a botched recreate), fail loudly here rather than leave a
# 2-worker stack that would silently break later single-worker cache
# gates' determinism.
assert (
len(worker_pids()) == 1
), f"opal_multiworker teardown left workers {sorted(worker_pids())}, expected 1"


@pytest.fixture(scope="session")
def gitea_admin(stack) -> GiteaAdmin:
"""Host-side Gitea admin client (depends on `stack` so Gitea is up)."""
return GiteaAdmin()
141 changes: 141 additions & 0 deletions app-tests/git-leak/docker-compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
name: opal-git-leak-test

services:
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 2s
timeout: 3s
retries: 30

postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: opal
POSTGRES_PASSWORD: opal
POSTGRES_DB: opal
# not published to the host: only opal_server reaches it over the compose
# network, and bounce_postgres() uses `docker compose stop/start`. Publishing
# 5432 would collide with any Postgres already running on the host.
healthcheck:
test: ["CMD-SHELL", "pg_isready -U opal"]
interval: 2s
timeout: 3s
retries: 30

gitea:
image: gitea/gitea:1.21
environment:
GITEA__security__INSTALL_LOCK: "true"
GITEA__server__ROOT_URL: "http://gitea:3000/"
GITEA__database__DB_TYPE: "sqlite3"
# published on 13000 (not 3000) for the host-side GiteaAdmin helper; the
# uncommon port avoids the usual :3000 clash. opal_server and the seed
# sidecar still reach it over the compose network via http://gitea:3000.
ports:
- "13000:3000"
volumes:
- gitea-data:/data
healthcheck:
test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/v1/version || exit 1"]
interval: 3s
timeout: 5s
retries: 40

gitea-admin:
# creates the admin user once gitea is healthy
image: gitea/gitea:1.21
depends_on:
gitea:
condition: service_healthy
user: git
entrypoint: ["/bin/sh", "-c"]
# Tolerate the idempotent "already exists" case, and RETRY on "database is
# locked": this CLI mutates the same SQLite file the live gitea server holds
# open, so it can lose a lock race and fail transiently. Any other failure
# aborts so `seed` (which depends on this completing) doesn't run against a
# Gitea with no admin user and fail with a confusing 401.
# `|` (literal) block, not `>` (folded): the `gitea admin user create` call is
# kept on ONE line so YAML can't fold a newline into the middle of its args
# (that would run `--email ...` as its own command -> exit 127).
command:
- |
for attempt in 1 2 3 4 5 6; do
out=$$(gitea admin user create --username opaladmin --password opaladmin --email admin@example.com --admin --must-change-password=false --config /data/gitea/conf/app.ini 2>&1)
rc=$$?
echo "$$out"
if [ $$rc -eq 0 ] || echo "$$out" | grep -qi "already exist"; then exit 0; fi
if echo "$$out" | grep -qi "database is locked"; then echo "gitea db locked; retry $$attempt"; sleep 2; continue; fi
exit $$rc
done
echo "gitea admin create failed after retries"
exit 1
volumes:
- gitea-data:/data
restart: "no"

blackhole:
# Accepts the TCP handshake then never answers — a clone connects and
# blocks reading the git smart-HTTP response, holding the fetch executor.
# Deterministic, unlike a TEST-NET-1 address which many networks reject
# fast with ICMP-unreachable (so the clone would fail fast, not hang).
image: alpine/socat:1.8.0.3
command: ["TCP-LISTEN:80,fork,reuseaddr", "SYSTEM:sleep 3600"]

seed:
build: ./seed
depends_on:
gitea:
condition: service_healthy
gitea-admin:
condition: service_completed_successfully
environment:
GITEA_URL: "http://gitea:3000"
GITEA_ADMIN_USER: "opaladmin"
GITEA_ADMIN_PASSWORD: "opaladmin"
REPO_COUNT: "${REPO_COUNT:-50}"
restart: "no"

opal_server:
build:
context: ../..
dockerfile: docker/Dockerfile
target: server
environment:
# Default single worker: the GitPolicyFetcher caches read by
# /internal/git-fetcher-cache-stats are per-process, so with >1 worker a
# round-robin read can miss the worker that fetched and make a `== 0`
# drain assertion pass falsely. One worker makes every cache read
# deterministic. The leak/boot/offline bugs all reproduce single-worker.
# The postgres-bounce test (test_resilience.py) overrides this to 2 via
# OPAL_TEST_WORKERS for its own container, because cross-worker fan-out
# over the Postgres backbone only happens with >=2 workers
# (references/debug-pubsub.md §3-4) — a single worker can't tell #915's
# in-place broadcaster reconnect from a plain worker respawn.
UVICORN_NUM_WORKERS: "${OPAL_TEST_WORKERS:-1}"
OPAL_SCOPES: "1"
OPAL_REDIS_URL: "redis://redis:6379"
OPAL_BROADCAST_URI: "postgres://opal:opal@postgres:5432/opal"
# Make the broadcaster reconnect fast + deterministic for the postgres-
# bounce test (defaults are 30s max backoff / 2s settle). Harmless to the
# single-worker tests, which don't exercise cross-worker fan-out.
OPAL_BROADCAST_RECONNECT_BACKOFF_MAX_SECONDS: "2"
OPAL_BROADCAST_RESYNC_SETTLE_SECONDS: "2"
OPAL_BASE_DIR: "/opal"
OPAL_POLICY_REFRESH_INTERVAL: "0"
OPAL_DEBUG_INTERNAL_STATS: "1"
# OPAL_AUTH_PUBLIC_KEY is intentionally left unset: with no public key the
# JWT verifier is disabled, so the harness can call scope routes without
# minting JWTs. Local test bed only; never a production setting.
OPAL_LOG_FORMAT_INCLUDE_PID: "true"
ports:
- "7002:7002"
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy

volumes:
gitea-data:
Loading
Loading