test(opal-server): git leak/resilience test environment (PR1) #922
dshoen619 wants to merge 18 commits
Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates.

- debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled.
- config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False.
- server.py: register the route, gated by the flag, beside /healthcheck.

No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes.

- seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
- docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster.
- helpers.py / conftest.py: HTTP + infra helpers and stack fixtures.
- test_leak.py / test_resilience.py / test_boot.py: the flagship tests.
- README.md: how to run and expected fail-on-master behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for opal-docs canceled.
Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code:

- make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal.
- GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar.

Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes.

Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds an off-by-default OPAL server internal diagnostics endpoint for git-fetcher cache/RSS stats, and introduces a docker-compose-based “git leak/resilience” integration testbed intended to reproduce/regress several production issues (leak, offline repo hang, boot slowness, broadcaster disconnect recovery).
Changes:
- Add `/internal/git-fetcher-cache-stats` endpoint gated by `OPAL_DEBUG_INTERNAL_STATS` (default off), plus unit tests.
- Add `app-tests/git-leak/` docker-compose stack (opal-server + redis/postgres/gitea) and pytest harness (boot/leak/resilience tests).
- Add `.claude/` to `.gitignore`.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/opal-server/opal_server/debug_stats.py | New helper to read git-fetcher cache sizes + RSS and register a gated internal route. |
| packages/opal-server/opal_server/config.py | Adds DEBUG_INTERNAL_STATS config flag (default False). |
| packages/opal-server/opal_server/server.py | Mounts the internal stats route when enabled. |
| packages/opal-server/opal_server/tests/debug_stats_test.py | Unit tests for stats dict sizing + flag default. |
| packages/opal-server/opal_server/tests/debug_stats_endpoint_test.py | Unit tests for endpoint presence/absence when gated. |
| app-tests/git-leak/docker-compose.yml | Compose stack for OPAL + dependencies + seeding sidecars. |
| app-tests/git-leak/helpers.py | Host-side HTTP/compose helpers for the test harness. |
| app-tests/git-leak/conftest.py | Pytest fixtures to boot/teardown the stack and provide clients. |
| app-tests/git-leak/test_leak.py | Leak regression tests (churn + repeat sync). |
| app-tests/git-leak/test_resilience.py | Offline-repo hang test + Postgres bounce recovery guard. |
| app-tests/git-leak/test_boot.py | Boot timing/baseline test. |
| app-tests/git-leak/seed/seed_gitea.py | Idempotent Gitea seeding script for N repos. |
| app-tests/git-leak/seed/Dockerfile | Container image for the seeding sidecar. |
| app-tests/git-leak/README.md | Documentation for running the new testbed locally. |
| .gitignore | Ignores .claude/ working artifacts. |
The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker -- max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker).
- test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed.
- refresh_all(): correct the misleading comment -- a 404 is a no-op, there is no client-side fallback.
- conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
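The max-merge sampling described in this commit can be sketched as follows. This is a minimal illustration, not the harness code: `merged_stats` and the injected `read_once` callable are hypothetical names, and the three keys are taken from the cache names used in this PR. Sampling only raises the chance of observing the populated worker; it does not guarantee it.

```python
import itertools
from typing import Callable, Dict

# Cache names as they appear in this PR's stats payload.
CACHE_KEYS = ("repo_locks", "repos", "repos_last_fetched")


def merged_stats(read_once: Callable[[], Dict[str, int]], samples: int = 4) -> Dict[str, int]:
    """Call read_once() `samples` times and keep the per-key maximum.

    read_once is a hypothetical stand-in for one GET of the /internal
    stats endpoint; each call may land on a different worker process.
    """
    merged = {key: 0 for key in CACHE_KEYS}
    for _ in range(samples):
        snapshot = read_once()
        for key in CACHE_KEYS:
            merged[key] = max(merged[key], int(snapshot.get(key, 0)))
    return merged


# Simulate two workers answering round-robin: one empty, one populated.
_workers = itertools.cycle(
    [
        {"repo_locks": 0, "repos": 0, "repos_last_fetched": 0},
        {"repo_locks": 5, "repos": 5, "repos_last_fetched": 5},
    ]
)
print(merged_stats(lambda: next(_workers)))
```

Because the merge keeps the peak, it suits "at least N loaded" assertions better than "exactly zero" drain assertions.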
Two review findings on the regression gates:

- Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master -- where delete never purges -- doesn't fail the passing test).
- False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zeevmoney left a comment
Requesting changes — the regression gates do not function as gates yet
Scope check first, because it's the reassuring part: the production surface is safe. The only always-on change is a benign import plus a default-off OPAL_DEBUG_INTERNAL_STATS config key; with the flag off (the default) there is zero behavior change, and when on the handler can't crash (the class dicts always exist, len() is GIL-atomic against the event-loop mutations, and _read_rss_kb swallows errors). testpaths = packages drops no existing tests. Git hygiene is clean (nothing private leaked into the diff, no rebase needed). This PR will not crash production.
The problem is the actual deliverable — the five flagship tests meant to gate PR2–PR5.
What's blocking (must fix before this can be trusted as the gate)
- The load/drain gates assert on `GitPolicyFetcher.repos`, which the first-sync clone path never populates (see the CRITICAL inline comment on `test_leak.py:26`). On a fresh scope, `fetch_and_notify_on_changes` takes the `_clone()` branch, which only sets `repo_locks`; `repos` is filled solely by `_get_repo()` on a second sync or the bundle path, and periodic re-sync is disabled in the stack (`OPAL_POLICY_REFRESH_INTERVAL: "0"`). As a result `test_churn_releases_caches`, `test_repeat_sync_does_not_grow`, and `test_offline_repo_does_not_block_healthy_scopes` hang at their `_wait_until(repos >= n)` load gate for the full timeout and then fail at the load stage: they never reach the leak/offline logic they exist to test, and they would not flip green after PR2/PR3 land. Only `test_boot_loads_all_scopes` works, and only incidentally (`compose restart` → `preload_scopes` re-discovers the on-disk clones). This is the same root cause the PR description flags as "Known caveat #2"; it needs to be fixed, not just noted.
- Even once load is fixed, the cache assertions are not reliable on a 2-worker stack. `stats()`'s max-merge cannot prevent a `== 0` drain assertion from falsely passing when the samples miss the leader worker (HIGH on `helpers.py:50`). The cache tests should run single-worker, or the endpoint should aggregate across workers.
- Failure modes are invisible. `compose()` swallows stdout/stderr on any compose failure (HIGH on `helpers.py:218`), and partial seeding is never detected before the suite runs (HIGH on `conftest.py:40`), so a broken stack or a half-seeded Gitea looks like a test failure for the wrong reason.
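The first blocking item is easier to see with a toy model. The class below is purely illustrative and is not the real `GitPolicyFetcher`: it only mimics the behavior described in the review, where a first sync takes the clone branch and fills `repo_locks`, and only a later sync fills `repos` and `repos_last_fetched`.

```python
class ToyFetcherCaches:
    """Toy stand-in for the three process-global caches (hypothetical)."""

    def __init__(self) -> None:
        self.repo_locks = {}
        self.repos = {}
        self.repos_last_fetched = {}
        self._cloned = set()  # models clones that already exist on disk

    def sync(self, source_id: str) -> str:
        # Every sync takes (and caches) the per-repo lock.
        self.repo_locks.setdefault(source_id, object())
        if source_id not in self._cloned:
            # First sync: clone branch, nothing is added to `repos`.
            self._cloned.add(source_id)
            return "clone"
        # Later sync: discover/fetch branch, which fills the other caches.
        self.repos[source_id] = object()
        self.repos_last_fetched[source_id] = 0.0
        return "fetch"


caches = ToyFetcherCaches()
caches.sync("repo-a")
# A load gate on len(repos) >= 1 would wait forever at this point.
print(len(caches.repo_locks), len(caches.repos))
caches.sync("repo-a")
print(len(caches.repo_locks), len(caches.repos))
```

With periodic re-sync disabled, nothing triggers that second `sync`, which is why a gate keyed on `repos` times out unless the harness forces a refresh.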
Secondary correctness (should fix)
- `test_repeat_sync_does_not_grow` is tautological: `len(repos)` can't grow on repeat URL-keyed sync (`test_leak.py:64`).
- The offline test's TEST-NET-1 address likely fails fast instead of hanging, so it may not exercise starvation (`helpers.py:205`).
- The `opal` fixture clears `_created_scopes` at setup, orphaning scopes from a failed prior test and contaminating the next (`conftest.py:55`).
- The postgres-bounce test asserts only HTTP liveness, not broadcaster recovery, and doesn't `--wait` for Postgres (`test_resilience.py:51`).
- README/docstrings say "fails on master," but the suite requires this PR's `/internal` endpoint (`README.md:30`).
- Dead/misleading 404 branch in `refresh_all()` (`helpers.py:128`).
Minor
- Enabling the flag exposes an unauthenticated `/internal` route (non-sensitive payload, off by default) (`debug_stats.py:39`).
- `|| true` masks genuine `gitea-admin` failures (`docker-compose.yml:58`).
- Boot-timing clock starts after `wait_healthy`, undercounting boot-sync time (`test_boot.py:24`).
Bottom line
All 11 plan tasks are implemented and the production change is safe, but the suite needs to (1) re-key the load/drain assertions to a cache the sync path actually populates (or force a policy fetch), (2) make the cache reads worker-deterministic, (3) surface compose/seed failures, and then (4) be validated end-to-end against a live stack to confirm each test fails/passes for the right reason. Until then it can't be relied on as the regression gate for PR2–PR5.
…iew)

Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe.

Blocking:
- Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class.
- Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion.
- compose() surfaces captured stdout/stderr on failure.
- Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count.

Secondary:
- test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop.
- blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start).
- Per-test clean slate deletes all server scopes (fixes orphan-scope leak).
- Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`.
- Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded.

opal-server:
- /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review addressed —
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips)
- document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def
- pin alpine/socat and the seed image's pip deps for reproducibility

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-environment-one-big-pr
…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view)

Follow-up to PR review of the git-leak/resilience test bed:

- hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result).
- delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands.
- use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll).
- resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered".
- resilience: collapse `assert opal.stats()` + redundant re-read into one read.
- debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
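The `hard_reset` change above follows a common try/finally shape. The sketch below assumes the four steps are injected as plain callables (hypothetical parameter names); the real helper drives docker compose and Redis directly, which is not reproduced here.

```python
from typing import Callable


def hard_reset(
    stop_server: Callable[[], None],
    flush_redis: Callable[[], None],
    start_server: Callable[[], None],
    wait_healthy: Callable[[], None],
) -> None:
    """Stop the server, flush Redis, and always bring the server back.

    The restart and health wait live in `finally`, so a failed flush
    still leaves a running server for later session-scoped tests. The
    flush error itself still propagates to the caller.
    """
    stop_server()
    try:
        flush_redis()
    finally:
        start_server()
        wait_healthy()
```

One design consequence: if both the flush and the restart fail, the caller sees the restart error, with the flush error attached as its context.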
…t-environment-one-big-pr
zeevmoney left a comment
Approving — the test bed is sound and this PR has already been through several review rounds (Copilot / @Zivxx / @zeevmoney) with the substantive issues raised and resolved. A fresh python-pro pass (planned against the opal-development references) surfaced only non-blocking items; none are CRITICAL/HIGH, so they don't gate landing the test bed.
Issues to fix (tracked under PER-15155) — inline comments above:
Worth fixing before/with merge
- MEDIUM `pytest.ini:10`: `testpaths = packages` makes the README's documented `cd app-tests/git-leak && pytest --boot-scopes=50` error with `unrecognized arguments: --boot-scopes`. Self-root the suite + use a targeted ignore.
- MEDIUM `test_resilience.py:123`: the bounce test's `repo_locks > baseline` signal is order-dependent across files (shared `policy-repo-0000` + non-purging caches on master); assert `GET /scopes/post-bounce/policy == 200` instead.
Follow-ups (LOW)
- `helpers.py:274`: add a `timeout=` to `compose()` so a wedged setup fails fast instead of hitting the CI job limit.
- `helpers.py:148`: `delete_all_scopes` burns ~40s/test waiting on a drain that can't happen on master; short-circuit it.
- `seed_gitea.py:159`: remove the unused `/seed-output/token` artifact + volume.
- `seed_gitea.py:108`: make `push_url` credential injection scheme-agnostic (`urllib.parse`).
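For the last follow-up, scheme-agnostic credential injection with `urllib.parse` could look roughly like this. `with_credentials` is a hypothetical name, not the seed script's function, and the example URL and credentials are made up.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url: str, user: str, password: str) -> str:
    """Return `url` with user:password@ injected, for any URL scheme."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # Bare IPv6 literals must be re-wrapped in brackets.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Percent-encode so characters such as '@' or ':' cannot break the netloc.
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return urlunsplit(
        (parts.scheme, f"{userinfo}@{host}", parts.path, parts.query, parts.fragment)
    )


print(with_credentials("http://gitea:3000/org/repo.git", "admin", "p@ss"))
print(with_credentials("https://gitea.example/org/repo.git", "admin", "secret"))
```

Unlike a string replace on `"http://"`, this keeps working if the base URL switches to `https` or already carries a port, and it replaces any userinfo already present instead of duplicating it.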
Doc nit (not posted inline): the bounce-test docstring (test_resilience.py:85-93) still describes pre-PER-15065 "graceful shutdown + gunicorn respawn over the recovered broadcaster"; reconnect is now in-place — worth a 3-line update.
Recommend merging PR1 first so PR2 (#923) / PR3 (#924) can build on the test bed + /internal endpoint, then addressing the two MEDIUMs.
Zeev's gate-coverage review found only 2 of 5 flagship tests (churn #1, offline #4) were genuine fail-now / pass-after gates. Lift the other three:

- #5 broadcaster: run 2 workers (OPAL_TEST_WORKERS) so the Postgres backbone is actually fanned out cross-worker (references/debug-pubsub.md §3-4), and assert the gunicorn worker PIDs are unchanged across a transient bounce -- the in-place-reconnect signal that distinguishes #915 (PER-15065) from a plain worker respawn. Prove recovery via a servable post-bounce scope, not /internal cache counts (per-process, non-deterministic on 2 workers).
- #3 boot: key completion on "all scopes served" (GET /scopes/{id}/policy == 200) instead of repo_locks (set at fetch start, so it undercounts the final clone); document the PR4 tight-BOOT_TARGET_SECONDS carry-forward.
- #2 repeat-sync: rename to test_repeat_sync_rss_stays_bounded and drop the tautological len(repos) assertion; RSS is the sole (load-bearing) gate.

Adds worker_pids() (/proc-based, matched host-side so the scan can't count its own sh -c wrapper) and the opal_multiworker fixture (recreate to 2 workers, restore to 1 on teardown).

Validated live (--boot-scopes=50): #2/#3 pass, #5 passes (worker PIDs held across the bounce, post-bounce scope served), #1/#4 fail for the right reason (PR2/PR3 not landed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gate coverage lifted —
| Test | Result | Reason |
|---|---|---|
| #3 boot | ✅ PASS | served 50/50 in 7.4s (all-served probe) |
| #2 rss | ✅ PASS | RSS guard |
| #1 churn | ❌ FAIL (right reason) | caches stay at {repo_locks:50, repos:50, repos_last_fetched:50} — PR2 not landed |
| #4 offline | ❌ FAIL (right reason) | healthy probe ReadTimeout while 40 clones hang — PR3 not landed |
| #5 broadcaster | ✅ PASS | 2 workers, worker PIDs unchanged across bounce, post-bounce scope served |
Net: gates PR2 + PR3 today, PR5 is now guarded (in-place reconnect), and PR4 is ready to gate the moment its branch sets a tight BOOT_TARGET_SECONDS.
Zeev's latest inline batch:

- pytest.ini: self-root the suite (app-tests/git-leak/pytest.ini) so `cd app-tests/git-leak && pytest --boot-scopes=N` is deterministic across pytest versions/cwd. (The documented command already works -- testpaths only applies when pytest runs from the rootdir -- but this makes it explicit.)
- compose(): add a subprocess timeout (default 1200s) so a wedged up/wait/build fails fast instead of hanging session-scoped fixture setup to the CI job limit (pytest-timeout does not cover fixture setup).
- delete_all_scopes(): cut the drain wait 20s -> 3s; on master the caches can't purge (the leak this gates), so the old wait burned ~40s of dead time per test across setup+teardown.
- seed_gitea.py: inject push creds scheme-agnostically (urllib.parse) instead of string-replacing "http://"; drop the unused /seed-output token artifact and the seed-output volume (host uses basic auth, never the token).

The order-dependent bounce signal (test_resilience.py) was already fixed in 8e24cb0 (asserts GET /scopes/post-bounce/policy == 200, not a delta on a shared process-global counter).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
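The `compose()` behavior described in this commit (a subprocess timeout, plus captured output surfaced on failure) can be approximated with the standard library alone. `run_checked` is a hypothetical name for illustration; the 1200s default mirrors the value quoted above, and the commands in the usage lines are arbitrary.

```python
import subprocess
import sys
from typing import Sequence


def run_checked(cmd: Sequence[str], timeout: float = 1200.0) -> str:
    """Run `cmd`, returning stdout; raise with captured output on failure."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        # A wedged command fails fast instead of hanging fixture setup.
        raise RuntimeError(f"{' '.join(cmd)} timed out after {timeout}s") from exc
    if proc.returncode != 0:
        # Surface both streams so a broken stack is diagnosable from the error.
        raise RuntimeError(
            f"{' '.join(cmd)} exited {proc.returncode}\n"
            f"stdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"
        )
    return proc.stdout


# Usage with the current interpreter standing in for `docker compose`.
print(run_checked([sys.executable, "-c", "print('stack up')"]).strip())
```

Raising a single exception type for both the timeout and the non-zero exit keeps fixture code simple; callers that need to tell them apart can inspect `__cause__`.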
Addresses the HIGH and all MEDIUM findings from the opal-development / python-pro / backend-architect microreview:

- H1 (#5 vacuous pass): the broadcaster gate now positively verifies a disconnect+reconnect actually happened, not just that no respawn occurred. New helpers.broadcaster_connect_count() counts the reconnecting broadcaster's "listener connected to channel" log line; the test asserts it increased across the bounce (paired with worker PIDs unchanged = in-place, not respawn).
- M1 (#4): move the 40 executor-saturating PUTs inside the try/finally so a PUT failure still runs hard_reset() instead of leaking hung clone threads into the session stack.
- M2 (#1/#2): _wait_until now treats a transient requests error from opal.stats() as "not yet" and retries, instead of ERRORing the test.
- M3 (#3): measure a deterministic pure-cold boot (--force-recreate wipes the ephemeral FS -> preload cold-clones all N from Redis) instead of a nondeterministic warm/cold mix, so PR4's tight BOOT_TARGET_SECONDS can gate.
- M4: verify the single-worker invariant -- opal_multiworker teardown asserts the stack is back to 1 worker, and the opal fixture asserts single-worker at setup, so a botched restore fails loudly instead of silently breaking cache gates.
- M5 (#4): correct the reserved-probe comment (serving shares the fetch executor, so a shared repo would be starved on serve too; the probe additionally exercises the clone).
- M6: gitea-admin retries on "database is locked" (CLI mutating live SQLite); rewritten as a `|` literal block so the create call stays on one line.

Validated live (--boot-scopes=20): #2/#3 pass, #5 passes (reconnect count increased across the bounce + PIDs unchanged + scope served), #1/#4 fail for the right reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
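The H1 signal pairs two observations: the worker PIDs are unchanged, and the broadcaster logged a fresh connect. A minimal sketch of that pairing is below. The marker string is the log line quoted in the commit; the function signatures, and passing the log as plain text rather than reading container logs, are assumptions for illustration.

```python
from typing import Iterable

# Log line quoted in the commit message above.
CONNECT_MARKER = "listener connected to channel"


def broadcaster_connect_count(log_text: str, marker: str = CONNECT_MARKER) -> int:
    """Count log lines that report a broadcaster listener connect."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def reconnected_in_place(
    pids_before: Iterable[int],
    pids_after: Iterable[int],
    connects_before: int,
    connects_after: int,
) -> bool:
    """True only if a reconnect happened AND no worker was respawned."""
    same_workers = set(pids_before) == set(pids_after)
    return same_workers and connects_after > connects_before


before = "INFO listener connected to channel\n"
after = before + "WARN connection lost\nINFO listener connected to channel\n"
print(
    reconnected_in_place(
        [11, 12],
        [12, 11],
        broadcaster_connect_count(before),
        broadcaster_connect_count(after),
    )
)
```

Requiring the count to increase is what rules out the vacuous pass: unchanged PIDs alone are also what you would see if the bounce never disturbed the connection.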
PR1 — Git Leak/Resilience Test Environment
Builds the test bed and the single diagnostics hook needed to exercise the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect path. The leak/offline tests fail on `master` and become the regression gates for the follow-up fix PRs; the boot and bounce tests are tunable/recovery guards (see below).

This is intentionally one large PR: it's the foundation every later PR's gate depends on. The actual fixes land in the PRs this one blocks:

… `test_churn_releases_caches`); #2 is an RSS guard, not a leak gate

Linear: PER-15155
opal-server: an off-by-default stats endpoint

- `opal_server/debug_stats.py`: read-only `git_fetcher_cache_stats()` (sizes of the three process-global `GitPolicyFetcher` caches + process RSS) and a `register_internal_stats_route()` registrar.
- `opal_server/config.py`: new `OPAL_DEBUG_INTERNAL_STATS` flag, default `False`.
- `opal_server/server.py`: mounts `GET /internal/git-fetcher-cache-stats` beside `/healthcheck`, gated by the flag.

No production behavior change when the flag is off (the default).
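To make the payload shape concrete, here is a rough, standard-library-only sketch of a read-only stats helper. It is not the code in `debug_stats.py`: the function names are hypothetical, the caches are passed in as plain mappings, and reading `VmRSS` from `/proc/self/status` is an assumption about how RSS might be sampled on Linux. The fall-back-to-0 behavior matches what the review thread describes.

```python
from typing import Dict, Mapping


def read_rss_kb(status_path: str = "/proc/self/status") -> int:
    """Return resident set size in kB, or 0 if it cannot be read."""
    try:
        with open(status_path, encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    # Line looks like: "VmRSS:     12345 kB"
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass  # no /proc (non-Linux) or unexpected format
    return 0


def cache_stats(
    repo_locks: Mapping, repos: Mapping, repos_last_fetched: Mapping
) -> Dict[str, int]:
    """Read-only snapshot: sizes of the three caches plus process RSS."""
    return {
        "repo_locks": len(repo_locks),
        "repos": len(repos),
        "repos_last_fetched": len(repos_last_fetched),
        "rss_kb": read_rss_kb(),
    }


print(cache_stats({"a": object()}, {}, {}))
```

Because the helper only calls `len()` and reads a file, it never mutates fetcher state, which is the property that makes it safe to expose behind a debug flag.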
`app-tests/git-leak/`: a self-contained stack + pytest harness

- `docker-compose.yml`: opal-server (single worker, `OPAL_SCOPES=1`, Postgres broadcaster) built from the repo's own `docker/Dockerfile` `server` target, plus Redis, Postgres, Gitea, and one-shot `gitea-admin` + `seed` sidecars.
- `seed/`: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
- `helpers.py`: `OpalServerClient`, `GiteaAdmin` (host-side admin client), `make_repo_unreachable`, `bounce_postgres`, `compose`.
- `conftest.py`: session-scoped `stack` + `opal` + `gitea_admin` fixtures; `--boot-scopes=N`, `--keep-stack`.
- `test_leak.py` (#1 churn, #2 repeat-sync), `test_resilience.py` (#4 offline repo, #5 postgres bounce), `test_boot.py` (#3 boot timing, tunable `BOOT_TARGET_SECONDS`).
- `README.md`.

The five flagship tests, and their baseline behavior on master
- #1 `test_churn_releases_caches`
- #2 `test_repeat_sync_rss_stays_bounded`
- #3 `test_boot_loads_all_scopes` (… `BOOT_TARGET_SECONDS` low)
- #4 `test_offline_repo_does_not_block_healthy_scopes`
- #5 `test_server_recovers_after_postgres_bounce`

Test #5: verified (2-worker, PID guard). Runs 2 workers (`OPAL_TEST_WORKERS`) so the Postgres backbone is actually fanned out cross-worker (`debug-pubsub.md` §3-4); across a transient bounce it asserts the gunicorn worker PIDs are unchanged (the in-place-reconnect signal that distinguishes #915 (PER-15065) from a plain gunicorn respawn) plus a servable post-bounce scope (`GET /scopes/{id}/policy == 200`, not `/internal` counts, which are per-process on 2 workers). Re-validated live on this branch: passes (~54s). Updated per the gate-coverage review; see the latest reply for the full fail/pass-for-the-right-reason matrix.

How to run
Requires Docker + compose v2 and host Python with `pytest pytest-timeout requests GitPython`.

Verification done
- `server` image; `import opal_server.server` clean; flag defaults `False` with a description.
- `docker compose config -q` valid; all 5 flagship tests collect.
- `/internal/git-fetcher-cache-stats` live, `PUT /scopes` returns 201 with auth disabled.
- `GiteaAdmin` exercised live (list/exists/create/delete over the published port); `test_server_recovers_after_postgres_bounce` run to a green pass.

Notes for reviewers
- `OPAL_AUTH_PUBLIC_KEY` left unset → JWT verifier disabled so the harness can drive scope routes without minting JWTs (`require_peer_type` becomes a no-op). Test bed only.
- `auth: {auth_type: "none"}` (required pydantic-v1 discriminated union).
- `opal_server:7002` and `gitea:13000` (uncommon, to avoid the `:3000` clash; used by `GiteaAdmin`); Postgres is internal-only.

Cache-read determinism (how the gates stay trustworthy)
Two properties of `master` would make naive cache reads non-deterministic. Both are handled in this harness; they are not open issues:

- `GitPolicyFetcher` caches are per-process, so with >1 worker a round-robin `/internal` read can miss the worker that fetched and report `0`. Resolved: the stack runs a single uvicorn worker (`UVICORN_NUM_WORKERS=1`), so every cache read is deterministic.
- `repos` populates on the second sync, not the first. The first-sync clone path fills only `repo_locks`; `repos`/`repos_last_fetched` are filled by the discover/fetch path on a subsequent sync. Resolved: the load helpers issue `refresh_all()` and wait before asserting on `repos`, so the gates assert against a cache the sync path has actually filled.

These were called out as caveats in the original design doc; the implementation above closes both, so the leak/drain assertions are deterministic single-worker.
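The two-stage load gate described above can be sketched with a small polling helper. Everything here is illustrative: `wait_until` and `load_scopes` are hypothetical names, `stats` and `refresh_all` are injected callables standing in for the harness client, and treating `OSError` as the transient error class is an assumption (the harness itself reportedly retries on `requests` errors).

```python
import time
from typing import Callable, Dict, Tuple, Type


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    transient: Tuple[Type[BaseException], ...] = (OSError,),
) -> bool:
    """Poll predicate until it is true or the deadline passes.

    A transient error from the predicate counts as "not yet", not a failure.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if predicate():
                return True
        except transient:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def load_scopes(
    stats: Callable[[], Dict[str, int]],
    refresh_all: Callable[[], None],
    expected: int,
    timeout: float = 30.0,
    interval: float = 0.5,
) -> None:
    """Gate on repo_locks, force a second sync, then gate on repos."""
    if not wait_until(lambda: stats()["repo_locks"] >= expected, timeout, interval):
        raise AssertionError("first sync never populated repo_locks")
    refresh_all()  # second sync is what fills repos / repos_last_fetched
    if not wait_until(lambda: stats()["repos"] >= expected, timeout, interval):
        raise AssertionError("second sync never populated repos")
```

Returning a boolean from `wait_until` and asserting on it in the caller is what prevents a test from passing vacuously when the load never completed.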
🤖 Generated with Claude Code