test(opal-server): git leak/resilience test environment (PR1) by dshoen619 · Pull Request #922 · permitio/opal

dshoen619 · 2026-06-23T09:26:22Z

PR1 — Git Leak/Resilience Test Environment

Builds the test bed and the single diagnostics hook needed to exercise the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect path. The leak/offline tests fail on master and become the regression gates for the follow-up fix PRs; the boot and bounce tests are tunable/recovery guards (see below).

This is intentionally one large PR — it's the foundation every later PR's gate depends on. The actual fixes land in the PRs this one blocks:

PER-15156 (PR2): memory-leak fix → gated by test #1 (test_churn_releases_caches); #2 is an RSS guard, not a leak gate
PER-15157 (PR3): offline-repo resilience → gated by test #4
PER-15065: broadcaster in-process reconnect → relates to test #5
PR4 (boot parallelism) → gated by test #3

Linear: PER-15155

opal-server: an off-by-default stats endpoint

opal_server/debug_stats.py — read-only git_fetcher_cache_stats() (sizes of the three process-global GitPolicyFetcher caches + process RSS) and a register_internal_stats_route() registrar.
opal_server/config.py — new OPAL_DEBUG_INTERNAL_STATS flag, default False.
opal_server/server.py — mounts GET /internal/git-fetcher-cache-stats beside /healthcheck, gated by the flag.
Unit tests for the helper, the flag default, and the gating.

No production behavior change when the flag is off (the default).
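A minimal sketch of what such a read-only helper and its gate could look like. The `GitPolicyFetcher` class here is a stand-in defined only for illustration, the `/proc` parsing is an assumption about how RSS is read, and `stats_route_enabled` is a hypothetical name; only the three cache names, the `rss_kb` key, and the `OPAL_DEBUG_INTERNAL_STATS` flag come from this PR.

```python
import os
from typing import Dict, Mapping


class GitPolicyFetcher:
    """Stand-in for the real class; only the three cache names come from the PR."""

    repo_locks: Dict[str, object] = {}
    repos: Dict[str, object] = {}
    repos_last_fetched: Dict[str, object] = {}


def _read_rss_kb() -> int:
    """Return resident set size in kB from /proc, or 0 where /proc is absent."""
    try:
        with open("/proc/self/status") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return 0


def git_fetcher_cache_stats() -> Dict[str, int]:
    """Read-only snapshot: sizes of the three caches plus process RSS."""
    return {
        "repo_locks": len(GitPolicyFetcher.repo_locks),
        "repos": len(GitPolicyFetcher.repos),
        "repos_last_fetched": len(GitPolicyFetcher.repos_last_fetched),
        "rss_kb": _read_rss_kb(),
    }


def stats_route_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical gate: mount the route only when the flag is explicitly truthy."""
    value = env.get("OPAL_DEBUG_INTERNAL_STATS", "false")
    return value.strip().lower() in {"1", "true", "yes"}
```

Because the helper only calls `len()` on existing dicts and swallows RSS read errors, a handler built on it has no obvious way to raise, which matches the "read-only" intent described above.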

`app-tests/git-leak/`: a self-contained stack + pytest harness

docker-compose.yml — opal-server (single worker, OPAL_SCOPES=1, Postgres broadcaster) built from the repo's own docker/Dockerfile server target, plus Redis, Postgres, Gitea, and one-shot gitea-admin + seed sidecars.
seed/ — idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
helpers.py — OpalServerClient, GiteaAdmin (host-side admin client), make_repo_unreachable, bounce_postgres, compose.
conftest.py — session-scoped stack + opal + gitea_admin fixtures; --boot-scopes=N, --keep-stack.
test_leak.py (#1 churn, #2 repeat-sync), test_resilience.py (#4 offline repo, #5 postgres bounce), test_boot.py (#3 boot timing, tunable BOOT_TARGET_SECONDS).
README.md.

The five flagship tests, and their baseline behavior on master

| # | Test | On master | Gate |
|---|------|-----------|------|
| 1 | `test_churn_releases_caches` | FAILS (caches never drain) | PR2 |
| 2 | `test_repeat_sync_rss_stays_bounded` | PASSES (RSS guard; the URL-keyed cache count can't grow for any impl) | RSS guard |
| 3 | `test_boot_loads_all_scopes` | passes; FAILS with `BOOT_TARGET_SECONDS` low | PR4 |
| 4 | `test_offline_repo_does_not_block_healthy_scopes` | FAILS (no fetch timeout) | PR3 |
| 5 | `test_server_recovers_after_postgres_bounce` | PASSES (2-worker; worker-PID-unchanged across bounce = in-place reconnect) | PER-15065 |

Test #5 — verified (2-worker, PID guard). Runs 2 workers (OPAL_TEST_WORKERS) so the Postgres backbone is actually fanned out cross-worker (debug-pubsub.md §3-4); across a transient bounce it asserts the gunicorn worker PIDs are unchanged — the in-place-reconnect signal that distinguishes #915 (PER-15065) from a plain gunicorn respawn — plus a servable post-bounce scope (GET /scopes/{id}/policy == 200, not /internal counts, which are per-process on 2 workers). Re-validated live on this branch: passes (~54s). Updated per the gate-coverage review — see the latest reply for the full fail/pass-for-the-right-reason matrix.

How to run

cd app-tests/git-leak
python -m pytest -v --boot-scopes=50              # full set
python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates

Requires Docker + compose v2 and host Python with pytest pytest-timeout requests GitPython.

Verification done

All 4 opal-server unit tests pass inside the repo's Docker server image; import opal_server.server clean; flag defaults False with a description.
docker compose config -q valid; all 5 flagship tests collect.
Full-stack smoke: stack boots, Gitea seeds repos, opal healthy, /internal/git-fetcher-cache-stats live, PUT /scopes returns 201 with auth disabled.
GiteaAdmin exercised live (list/exists/create/delete over the published port); test_server_recovers_after_postgres_bounce run to a green pass.

Notes for reviewers

Auth disabled deliberately: OPAL_AUTH_PUBLIC_KEY left unset → JWT verifier disabled so the harness can drive scope routes without minting JWTs (require_peer_type becomes a no-op). Test bed only.
Scope create body sets auth: {auth_type: "none"} (required pydantic-v1 discriminated union).
Host ports: opal_server:7002 and gitea:13000 (uncommon, to avoid the :3000 clash; used by GiteaAdmin); Postgres is internal-only.
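For the scope-create note above, a hedged sketch of a request body builder. Only the `auth: {auth_type: "none"}` fragment is taken from this PR; the other field names (`scope_id`, `policy.source_type`, `policy.url`, `policy.branch`, `data.entries`) and the `scope_body` helper itself are assumptions about the scope schema and should be checked against the server's models.

```python
from typing import Any, Dict


def scope_body(scope_id: str, repo_url: str, branch: str = "main") -> Dict[str, Any]:
    """Build a PUT /scopes payload for an unauthenticated git policy source.

    Field names other than `auth` are illustrative assumptions.
    """
    return {
        "scope_id": scope_id,
        "policy": {
            "source_type": "git",
            "url": repo_url,
            "branch": branch,
            # Required even for public repos: the auth field is a
            # discriminated union, so the discriminator must be present.
            "auth": {"auth_type": "none"},
        },
        "data": {"entries": []},
    }
```

Omitting `auth` entirely would be the natural mistake here; spelling out the `"none"` variant is what the note above is warning about.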

Cache-read determinism (how the gates stay trustworthy)

Two properties of master would make naive cache reads non-deterministic. Both are handled in this harness — they are not open issues:

Per-worker caches + round-robin reads. The GitPolicyFetcher caches are per-process, so with >1 worker a round-robin /internal read can miss the worker that fetched and report 0. Resolved: the stack runs a single uvicorn worker (UVICORN_NUM_WORKERS=1), so every cache read is deterministic.
repos populates on the second sync, not the first. The first-sync clone path fills only repo_locks; repos / repos_last_fetched are filled by the discover/fetch path on a subsequent sync. Resolved: the load helpers issue refresh_all() and wait before asserting on repos, so the gates assert against a cache the sync path has actually filled.

These were called out as caveats in the original design doc; the implementation above closes both, so the leak/drain assertions are deterministic single-worker.
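The two-stage load gate described above can be sketched as follows. This is an illustration under assumptions, not the harness code: `wait_until` and `load_scopes` are simplified stand-ins for the PR's `_wait_until` and `_load_scopes`, and `client` is any object exposing `stats()` and `refresh_all()`.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll predicate until it is true or the deadline passes; return the outcome."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def load_scopes(client, n: int, timeout: float = 60.0) -> None:
    """Gate on the first-sync signal, force the second sync, then gate on repos."""
    # Stage 1: the first-sync clone path fills only repo_locks.
    loaded = wait_until(lambda: client.stats()["repo_locks"] >= n, timeout)
    assert loaded, "first sync never completed"
    # Stage 2: a second sync is what fills repos / repos_last_fetched.
    client.refresh_all()
    filled = wait_until(lambda: client.stats()["repos"] >= n, timeout)
    assert filled, "second sync never filled repos"
```

Asserting on the return value of each wait is the part that keeps later drain assertions from passing vacuously when the load never completed.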

🤖 Generated with Claude Code

Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates.

- debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled.
- config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False.
- server.py: register the route, gated by the flag, beside /healthcheck.

No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes.

- seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
- docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster.
- helpers.py / conftest.py: HTTP + infra helpers and stack fixtures.
- test_leak.py / test_resilience.py / test_boot.py: the flagship tests.
- README.md: how to run and expected fail-on-master behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

linear-code · 2026-06-23T09:26:25Z

PER-15155


Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code: - make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal. - GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar. Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes. Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds an off-by-default OPAL server internal diagnostics endpoint for git-fetcher cache/RSS stats, and introduces a docker-compose-based “git leak/resilience” integration testbed intended to reproduce/regress several production issues (leak, offline repo hang, boot slowness, broadcaster disconnect recovery).

Changes:

Add /internal/git-fetcher-cache-stats endpoint gated by OPAL_DEBUG_INTERNAL_STATS (default off), plus unit tests.
Add app-tests/git-leak/ docker-compose stack (opal-server + redis/postgres/gitea) and pytest harness (boot/leak/resilience tests).
Add .claude/ to .gitignore.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 5 comments.


| File | Description |
|------|-------------|
| packages/opal-server/opal_server/debug_stats.py | New helper to read git-fetcher cache sizes + RSS and register a gated internal route. |
| packages/opal-server/opal_server/config.py | Adds `DEBUG_INTERNAL_STATS` config flag (default `False`). |
| packages/opal-server/opal_server/server.py | Mounts the internal stats route when enabled. |
| packages/opal-server/opal_server/tests/debug_stats_test.py | Unit tests for stats dict sizing + flag default. |
| packages/opal-server/opal_server/tests/debug_stats_endpoint_test.py | Unit tests for endpoint presence/absence when gated. |
| app-tests/git-leak/docker-compose.yml | Compose stack for OPAL + dependencies + seeding sidecars. |
| app-tests/git-leak/helpers.py | Host-side HTTP/compose helpers for the test harness. |
| app-tests/git-leak/conftest.py | Pytest fixtures to boot/teardown the stack and provide clients. |
| app-tests/git-leak/test_leak.py | Leak regression tests (churn + repeat sync). |
| app-tests/git-leak/test_resilience.py | Offline-repo hang test + Postgres bounce recovery guard. |
| app-tests/git-leak/test_boot.py | Boot timing/baseline test. |
| app-tests/git-leak/seed/seed_gitea.py | Idempotent Gitea seeding script for N repos. |
| app-tests/git-leak/seed/Dockerfile | Container image for the seeding sidecar. |
| app-tests/git-leak/README.md | Documentation for running the new testbed locally. |
| .gitignore | Ignores `.claude/` working artifacts. |


The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker; max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker).
- test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed.
- refresh_all(): correct the misleading comment: a 404 is a no-op, there is no client-side fallback.
- conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
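The per-key max-merge in the stats() item can be sketched as below. `max_merge` is a hypothetical name for illustration; the real helper samples the endpoint itself, whereas this sketch takes already-collected samples so it can be run without a server.

```python
from typing import Dict, Iterable


def max_merge(samples: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Merge several stats reads by keeping the largest value seen per key."""
    merged: Dict[str, int] = {}
    for sample in samples:
        for key, value in sample.items():
            merged[key] = max(merged.get(key, value), value)
    return merged
```

The merge only helps when at least one sample reaches the populated worker. If every sample lands on an empty worker the result is still 0, so a `== 0` assertion on a multi-worker stack remains unsafe regardless of how many samples are merged.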

Two review findings on the regression gates:

- Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master, where delete never purges, doesn't fail the passing test).
- False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zeevmoney

Requesting changes — the regression gates do not function as gates yet

Scope check first, because it's the reassuring part: the production surface is safe. The only always-on change is a benign import plus a default-off OPAL_DEBUG_INTERNAL_STATS config key; with the flag off (the default) there is zero behavior change, and when on the handler can't crash (the class dicts always exist, len() is GIL-atomic against the event-loop mutations, and _read_rss_kb swallows errors). testpaths = packages drops no existing tests. Git hygiene is clean (nothing private leaked into the diff, no rebase needed). This PR will not crash production.

The problem is the actual deliverable — the five flagship tests meant to gate PR2–PR5.

What's blocking (must fix before this can be trusted as the gate)

The load/drain gates assert on GitPolicyFetcher.repos, which the first-sync clone path never populates (see the CRITICAL inline comment on test_leak.py:26). On a fresh scope, fetch_and_notify_on_changes takes the _clone() branch, which only sets repo_locks — repos is filled solely by _get_repo() on a second sync or the bundle path, and periodic re-sync is disabled in the stack (OPAL_POLICY_REFRESH_INTERVAL: "0"). As a result test_churn_releases_caches, test_repeat_sync_does_not_grow, and test_offline_repo_does_not_block_healthy_scopes hang at their _wait_until(repos >= n) load gate for the full timeout and then fail at the load stage — they never reach the leak/offline logic they exist to test, and they would not flip green after PR2/PR3 land. Only test_boot_loads_all_scopes works, and only incidentally (compose restart → preload_scopes re-discovers the on-disk clones). This is the same root cause the PR description flags as "Known caveat #2" — it needs to be fixed, not just noted.
Even once load is fixed, the cache assertions are not reliable on a 2-worker stack. stats()'s max-merge cannot prevent a == 0 drain assertion from falsely passing when the samples miss the leader worker (HIGH on helpers.py:50). The cache tests should run single-worker, or the endpoint should aggregate across workers.
Failure modes are invisible. compose() swallows stdout/stderr on any compose failure (HIGH on helpers.py:218), and partial seeding is never detected before the suite runs (HIGH on conftest.py:40) — so a broken stack or a half-seeded Gitea looks like a test failure for the wrong reason.

Secondary correctness (should fix)

test_repeat_sync_does_not_grow is tautological — len(repos) can't grow on repeat URL-keyed sync (test_leak.py:64).
The offline test's TEST-NET-1 address likely fails fast instead of hanging, so it may not exercise starvation (helpers.py:205).
The opal fixture clears _created_scopes at setup, orphaning scopes from a failed prior test and contaminating the next (conftest.py:55).
The postgres-bounce test asserts only HTTP liveness, not broadcaster recovery, and doesn't --wait for Postgres (test_resilience.py:51).
README/docstrings say "fails on master," but the suite requires this PR's /internal endpoint (README.md:30).
Dead/misleading 404 branch in refresh_all() (helpers.py:128).

Minor

Enabling the flag exposes an unauthenticated /internal route (non-sensitive payload, off by default) (debug_stats.py:39).
|| true masks genuine gitea-admin failures (docker-compose.yml:58).
Boot-timing clock starts after wait_healthy, undercounting boot-sync time (test_boot.py:24).

Bottom line

All 11 plan tasks are implemented and the production change is safe, but the suite needs to (1) re-key the load/drain assertions to a cache the sync path actually populates (or force a policy fetch), (2) make the cache reads worker-deterministic, (3) surface compose/seed failures, and then (4) be validated end-to-end against a live stack to confirm each test fails/passes for the right reason. Until then it can't be relied on as the regression gate for PR2–PR5.

Copilot

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

…iew) Addresses the CHANGES_REQUESTED review on PR #922.

Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe.

Blocking:
- Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class.
- Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion.
- compose() surfaces captured stdout/stderr on failure.
- Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count.

Secondary:
- test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop.
- blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start).
- Per-test clean slate deletes all server scopes (fixes orphan-scope leak).
- Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`.
- Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded.

opal-server:
- /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dshoen619 · 2026-06-24T13:08:11Z

Review addressed — `6046a10`

Thanks for the thorough pass. All blocking items, both HIGHs from @Zivxx / @zeevmoney, and the secondary/minor items are addressed in 6046a10. Per-comment replies are inline; summary below.

Root cause behind most findings (and why the gates didn't gate): a fresh scope's first sync takes the _clone() branch, which only fills repo_locks; repos/repos_last_fetched are filled on a second sync. The load gates on repos therefore hung, and the 2-worker per-process caches made == 0 drains unsafe.

Blocking

Load gate (CRITICAL + @Zivxx HIGH) — _load_scopes gates on repo_locks (first-sync signal), then refresh_all() forces the second sync so repos/repos_last_fetched are populated before any drain/purge assertion. Churn now asserts all three caches drain to 0 (so a broken repos_last_fetched purge fails instead of passing vacuously).
Worker determinism — stack is now UVICORN_NUM_WORKERS=1; removes the false-== 0-drain class outright (no more reliance on max-merge to paper over it).
Failure visibility — compose() re-raises with captured stdout/stderr; seed completeness is asserted in conftest; seed_gitea.py isolates per-repo failures and exits non-zero with a count.
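As an illustration of the failure-visibility item, here is a sketch of a subprocess wrapper that re-raises with the captured output attached. `run_checked` is a hypothetical name and the error format is an assumption; the harness's own compose() may differ in detail.

```python
import subprocess
import sys
from typing import Sequence


def run_checked(args: Sequence[str]) -> str:
    """Run a command; on non-zero exit raise with stdout and stderr included."""
    proc = subprocess.run(list(args), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{' '.join(args)} exited {proc.returncode}\n"
            f"--- stdout ---\n{proc.stdout}\n"
            f"--- stderr ---\n{proc.stderr}"
        )
    return proc.stdout


if __name__ == "__main__":
    # Harmless demo that does not need Docker: run the current interpreter.
    print(run_checked([sys.executable, "-c", "print('ok')"]).strip())
```

In the harness the argument list would be a `docker compose ...` invocation; the point is that a broken stack then shows the compose output in the test report instead of an unrelated assertion failure.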

Secondary

test_repeat_sync now asserts an rss_kb bound (count can't grow for a URL-keyed set).
Offline test uses a blackhole socat sidecar (deterministic hang, verified) and saturates the fetch executor with 40 hung clones; recovers the session stack via hard_reset() (stop → redis FLUSHALL → start).
Per-test clean slate deletes all server scopes (fixes the orphan-scope leak).
Postgres-bounce proves broadcast recovery (PUT post-bounce → assert sync), uses up -d --wait.
Dead 404 branch removed; boot clock starts at restart; gitea-admin || true → "already exists"-only guard; README/docstrings reworded to "fails on this branch without PR2/PR3", noting the suite needs this PR's /internal endpoint.

Minor

/internal route now carries the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); added a unit test asserting a rejecting dependency → 401.

Validation done: 5/5 opal-server unit tests pass in the Docker server image (incl. the new auth test); docker compose config valid; blackhole hang and the gitea-admin guard verified directly.

Still outstanding (flagging honestly): I have not yet re-run the full live integration suite end-to-end to confirm each flagship test fails/passes for the right reason against a live stack — that's the "validate end-to-end" item from your bottom line. I can kick that off next; it's a ~30 min run and several tests are expected-fail gates on this branch until PR2/PR3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips)
- document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def
- pin alpine/socat and the seed image's pip deps for reproducibility

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t-environment-one-big-pr

…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…view) Follow-up to PR review of the git-leak/resilience test bed:

- hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result).
- delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands.
- use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll).
- resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered".
- resilience: collapse `assert opal.stats()` + redundant re-read into one read.
- debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
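The hard_reset change can be sketched as a small try/finally. The four callables are placeholders for the harness's stop, flush, start, and health-wait steps, passed in here only so the control flow can be exercised without a running stack.

```python
from typing import Callable


def hard_reset(
    stop: Callable[[], None],
    flush_redis: Callable[[], None],
    start: Callable[[], None],
    wait_healthy: Callable[[], None],
) -> None:
    """Stop the server, flush Redis, and always bring the server back."""
    stop()
    try:
        flush_redis()
    finally:
        # Runs even if the flush raised, so the session-scoped stack is
        # never left stopped for the tests that follow.
        start()
        wait_healthy()
```

The flush error still propagates after the restart, so the failure stays visible while the stack remains usable.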

…t-environment-one-big-pr

approved

zeevmoney · 2026-06-30T14:43:49Z

@/private/tmp/claude-501/-Users-zeevmanilovich-NewDev-opal/deddbdb3-9e66-4bdc-86ca-0a2e991c021b/scratchpad/corrected.md

zeevmoney

Approving — the test bed is sound and this PR has already been through several review rounds (Copilot / @Zivxx / @zeevmoney) with the substantive issues raised and resolved. A fresh python-pro pass (planned against the opal-development references) surfaced only non-blocking items; none are CRITICAL/HIGH, so they don't gate landing the test bed.

Issues to fix (tracked under PER-15155) — inline comments above:

Worth fixing before/with merge

MEDIUM pytest.ini:10 — testpaths = packages makes the README's documented cd app-tests/git-leak && pytest --boot-scopes=50 error with unrecognized arguments: --boot-scopes. Self-root the suite + use a targeted ignore.
MEDIUM test_resilience.py:123 — the bounce test's repo_locks > baseline signal is order-dependent across files (shared policy-repo-0000 + non-purging caches on master); assert GET /scopes/post-bounce/policy == 200 instead.

Follow-ups (LOW)

helpers.py:274 — add a timeout= to compose() so a wedged setup fails fast instead of hitting the CI job limit.
helpers.py:148 — delete_all_scopes burns ~40s/test waiting on a drain that can't happen on master; short-circuit it.
seed_gitea.py:159 — remove the unused /seed-output/token artifact + volume.
seed_gitea.py:108 — make push_url credential injection scheme-agnostic (urllib.parse).

Doc nit (not posted inline): the bounce-test docstring (test_resilience.py:85-93) still describes pre-PER-15065 "graceful shutdown + gunicorn respawn over the recovered broadcaster"; reconnect is now in-place — worth a 3-line update.

Recommend merging PR1 first so PR2 (#923) / PR3 (#924) can build on the test bed + /internal endpoint, then addressing the two MEDIUMs.

Zeev's gate-coverage review found only 2 of 5 flagship tests (churn #1, offline #4) were genuine fail-now / pass-after gates. Lift the other three:

- #5 broadcaster: run 2 workers (OPAL_TEST_WORKERS) so the Postgres backbone is actually fanned out cross-worker (references/debug-pubsub.md §3-4), and assert the gunicorn worker PIDs are unchanged across a transient bounce -- the in-place-reconnect signal that distinguishes #915 (PER-15065) from a plain worker respawn. Prove recovery via a servable post-bounce scope, not /internal cache counts (per-process, non-deterministic on 2 workers).
- #3 boot: key completion on "all scopes served" (GET /scopes/{id}/policy == 200) instead of repo_locks (set at fetch start, so it undercounts the final clone); document the PR4 tight-BOOT_TARGET_SECONDS carry-forward.
- #2 repeat-sync: rename to test_repeat_sync_rss_stays_bounded and drop the tautological len(repos) assertion; RSS is the sole (load-bearing) gate.

Adds worker_pids() (/proc-based, matched host-side so the scan can't count its own sh -c wrapper) and the opal_multiworker fixture (recreate to 2 workers, restore to 1 on teardown).

Validated live (--boot-scopes=50): #2/#3 pass, #5 passes (worker PIDs held across the bounce, post-bounce scope served), #1/#4 fail for the right reason (PR2/PR3 not landed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dshoen619 · 2026-07-01T09:35:07Z

Gate coverage lifted — `8e24cb0b`

Thanks, the gate-coverage matrix was exactly the right lens. You're right that as committed only #1/#4 actually gated. Addressed all three suggestions; left #1 churn and #4 offline (the load-bearing gates) untouched.

#5 broadcaster — now a real PER-15065 guard. A new opal_multiworker fixture recreates opal_server with 2 workers (UVICORN_NUM_WORKERS: ${OPAL_TEST_WORKERS:-1}) so the Postgres backbone is actually fanned out cross-worker (debug-pubsub.md §3-4), then restores single-worker on teardown so the cache gates keep their determinism. The test now asserts the gunicorn worker PIDs are unchanged across the bounce — the in-place-reconnect signal that tells #915 apart from a plain respawn. With the retry-forever default a transient blip never gives up, so the worker keeps its PID; a revert of #915 would escalate to a graceful shutdown → gunicorn respawn → new PID. Recovery is proven via a servable post-bounce scope (GET /scopes/{id}/policy == 200), not /internal counts (per-process, non-deterministic on 2 workers). worker_pids() reads /proc and matches host-side so the scan can't count its own sh -c wrapper (whose command line literally contains "gunicorn").

#3 boot: measures the right thing. Completion now keys on all scopes served (GET /scopes/{id}/policy == 200) instead of repo_locks, which is set at fetch start and so undercounted the final clone. Default BOOT_TARGET_SECONDS stays loose (baseline recorder); the docstring and README carry forward that PR4 must run it with a tight target (120s @ 50).

#2 repeat-sync — relabeled. Renamed test_repeat_sync_does_not_grow → test_repeat_sync_rss_stays_bounded; dropped the tautological len(repos) assertion (with a comment so it isn't re-added); RSS is the sole gate.

Live validation (--boot-scopes=50, this branch):

| Test | Result | Reason |
|------|--------|--------|
| #3 boot | ✅ PASS | served 50/50 in 7.4s (all-served probe) |
| #2 rss | ✅ PASS | RSS guard |
| #1 churn | ❌ FAIL (right reason) | caches stay at `{repo_locks:50, repos:50, repos_last_fetched:50}`; PR2 not landed |
| #4 offline | ❌ FAIL (right reason) | healthy probe `ReadTimeout` while 40 clones hang; PR3 not landed |
| #5 broadcaster | ✅ PASS | 2 workers, worker PIDs unchanged across bounce, post-bounce scope served |

Net: gates PR2 + PR3 today, PR5 is now guarded (in-place reconnect), and PR4 is ready to gate the moment its branch sets a tight BOOT_TARGET_SECONDS.

Zeev's latest inline batch:

- pytest.ini: self-root the suite (app-tests/git-leak/pytest.ini) so `cd app-tests/git-leak && pytest --boot-scopes=N` is deterministic across pytest versions/cwd. (The documented command already works -- testpaths only applies when pytest runs from the rootdir -- but this makes it explicit.)
- compose(): add a subprocess timeout (default 1200s) so a wedged up/wait/build fails fast instead of hanging session-scoped fixture setup to the CI job limit (pytest-timeout does not cover fixture setup).
- delete_all_scopes(): cut the drain wait 20s -> 3s; on master the caches can't purge (the leak this gates), so the old wait burned ~40s of dead time per test across setup+teardown.
- seed_gitea.py: inject push creds scheme-agnostically (urllib.parse) instead of string-replacing "http://"; drop the unused /seed-output token artifact and the seed-output volume (host uses basic auth, never the token).

The order-dependent bounce signal (test_resilience.py) was already fixed in 8e24cb0 (asserts GET /scopes/post-bounce/policy == 200, not a delta on a shared process-global counter).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
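The scheme-agnostic credential injection mentioned for seed_gitea.py can be sketched with `urllib.parse`. `with_credentials` is a hypothetical helper name, and the sketch assumes a plain hostname (IPv6 literals would need their brackets restored).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url: str, user: str, password: str) -> str:
    """Return url with user:password injected, whatever the scheme is."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Percent-encode both halves so characters like '@' or ':' in the
    # password cannot be misread as URL delimiters.
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Compared with replacing the literal prefix "http://", this keeps working if the repo URL is https, and it replaces any credentials already embedded in the URL rather than stacking a second pair in front of them.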

Addresses the HIGH and all MEDIUM findings from the opal-development / python-pro / backend-architect microreview:

- H1 (#5 vacuous pass): the broadcaster gate now positively verifies a disconnect+reconnect actually happened, not just that no respawn occurred. New helpers.broadcaster_connect_count() counts the reconnecting broadcaster's "listener connected to channel" log line; the test asserts it increased across the bounce (paired with worker PIDs unchanged = in-place, not respawn).
- M1 (#4): move the 40 executor-saturating PUTs inside the try/finally so a PUT failure still runs hard_reset() instead of leaking hung clone threads into the session stack.
- M2 (#1/#2): _wait_until now treats a transient requests error from opal.stats() as "not yet" and retries, instead of ERRORing the test.
- M3 (#3): measure a deterministic pure-cold boot (--force-recreate wipes the ephemeral FS -> preload cold-clones all N from Redis) instead of a nondeterministic warm/cold mix, so PR4's tight BOOT_TARGET_SECONDS can gate.
- M4: verify the single-worker invariant -- opal_multiworker teardown asserts the stack is back to 1 worker, and the opal fixture asserts single-worker at setup, so a botched restore fails loudly instead of silently breaking cache gates.
- M5 (#4): correct the reserved-probe comment (serving shares the fetch executor, so a shared repo would be starved on serve too; the probe additionally exercises the clone).
- M6: gitea-admin retries on "database is locked" (CLI mutating live SQLite); rewritten as a `|` literal block so the create call stays on one line.

Validated live (--boot-scopes=20): #2/#3 pass, #5 passes (reconnect count increased across the bounce + PIDs unchanged + scope served), #1/#4 fail for the right reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
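The M2 behavior (a transient error counts as "not yet" rather than a test ERROR) can be sketched as below. This is an illustration under stated assumptions, not the harness's actual `_wait_until`: the name `wait_until`, its parameters, and the default exception tuple are hypothetical. The real helper reportedly catches requests errors, whereas this sketch uses built-in exception types so it runs without third-party packages.

```python
import time
from typing import Callable, Tuple, Type


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.1,
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Exceptions listed in `transient` are treated as "not yet" and retried;
    any other exception propagates to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if predicate():
                return True
        except transient:
            pass  # endpoint briefly unreachable: retry instead of failing
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_stats_check() -> bool:
        # Simulates a stats call that fails twice before succeeding.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("stats endpoint not reachable yet")
        return True

    print(wait_until(flaky_stats_check, timeout=2.0, interval=0.01))
```

Keeping the retryable exception set explicit means a genuine bug in the predicate, such as a `KeyError` on a missing stats field, still surfaces immediately instead of being masked as a timeout.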

dshoen619 and others added 2 commits June 23, 2026 12:24

dshoen619 and others added 3 commits June 23, 2026 12:37

dshoen619 marked this pull request as ready for review June 23, 2026 10:05

dshoen619 requested a review from Copilot June 23, 2026 10:05

dshoen619 self-assigned this Jun 23, 2026

Copilot started reviewing on behalf of dshoen619 June 23, 2026 10:05

Copilot AI reviewed Jun 23, 2026

Comment thread app-tests/git-leak/conftest.py

Comment thread app-tests/git-leak/helpers.py Outdated

Comment thread app-tests/git-leak/helpers.py Outdated

Comment thread app-tests/git-leak/test_leak.py Outdated

Comment thread app-tests/git-leak/test_leak.py Outdated

dshoen619 and others added 3 commits June 23, 2026 13:13

dshoen619 requested review from Zivxx and zeevmoney June 23, 2026 10:56

dshoen619 marked this pull request as draft June 23, 2026 11:37

dshoen619 marked this pull request as ready for review June 23, 2026 11:49

zeevmoney previously requested changes Jun 23, 2026

zeevmoney requested a review from Copilot June 23, 2026 19:08

Copilot started reviewing on behalf of zeevmoney June 23, 2026 19:09

Copilot AI reviewed Jun 23, 2026

Comment thread app-tests/git-leak/helpers.py

Comment thread packages/opal-server/opal_server/debug_stats.py Outdated

Comment thread packages/opal-server/opal_server/server.py

Zivxx reviewed Jun 24, 2026

Comment thread app-tests/git-leak/test_leak.py Outdated

dshoen619 and others added 2 commits June 24, 2026 16:15

style(git-leak): apply black/isort/docformatter (pre-commit)

f810db8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dshoen619 requested review from Zivxx and zeevmoney June 24, 2026 17:06

Zivxx reviewed Jun 28, 2026

Comment thread app-tests/git-leak/test_resilience.py Outdated

dshoen619 and others added 3 commits June 28, 2026 18:20

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-test-environment-one-big-pr

d719c34

Zivxx approved these changes Jun 29, 2026

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-test-environment-one-big-pr

a502f2e

dshoen619 removed the request for review from zeevmoney June 30, 2026 08:30

dshoen619 requested a review from zeevmoney June 30, 2026 09:11

This was referenced Jun 30, 2026

fix(opal-server): git resilience — never stuck on an offline repo (PR3) #924

Open

fix(opal-server): memory leak — purge GitPolicyFetcher caches on scope delete + webhook task cleanup (PR2) #923

Open

zeevmoney changed the title ~~PR1: Git leak/resilience test environment~~ test(opal-server): git leak/resilience test environment (PR1) Jun 30, 2026

zeevmoney approved these changes Jun 30, 2026

Zivxx mentioned this pull request Jul 1, 2026

feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4) #932

Open

dshoen619 and others added 2 commits July 1, 2026 13:02

linear-code Bot commented Jun 23, 2026

netlify Bot commented Jun 23, 2026 • edited

✅ Deploy Preview for opal-docs canceled.

Copilot AI left a comment

Pull request overview

Reviewed changes

zeevmoney left a comment

Requesting changes — the regression gates do not function as gates yet

What's blocking (must fix before this can be trusted as the gate)

Secondary correctness (should fix)

Minor

Bottom line

Copilot AI left a comment

Pull request overview

dshoen619 commented Jun 24, 2026

Review addressed — 6046a10

zeevmoney commented Jun 30, 2026 • edited

zeevmoney left a comment

dshoen619 commented Jul 1, 2026

Gate coverage lifted — 8e24cb0b
