Skip to content

fix(opal-server): git resilience — never stuck on an offline repo (PR3)#924

Open
dshoen619 wants to merge 10 commits into
masterfrom
david/per-15157-pr3-git-resilience-never-stuck-on-an-offline-repo
Open

fix(opal-server): git resilience — never stuck on an offline repo (PR3)#924
dshoen619 wants to merge 10 commits into
masterfrom
david/per-15157-pr3-git-resilience-never-stuck-on-an-offline-repo

Conversation

@dshoen619

Copy link
Copy Markdown

PR3 — Git Resilience: never stuck on an offline repo

Closes PER-15157.

Problem

Scope git clone/fetch use pygit2 with no timeout, dispatched through run_sync(...) onto the shared default executor with no time limit on the await:

  • clone_repository(...) in GitPolicyFetcher._clone
  • repo.remotes[self._remote].fetch(...) in fetch_and_notify_on_changes

POLICY_REPO_CLONE_TIMEOUT is wired only to the legacy non-scopes path, so it does not apply in scopes mode. The consequences:

  • Boot (gunicorn_confpreload_scopes()asyncio.run(sync_scopes(...))) blocks indefinitely on a single unreachable repo.
  • Normal operation: a hung op holds the per-source lock and a shared-pool thread. Enough hung ops starve bundle serving and other server work that shares the default executor.

Fix

A hard per-operation timeout plus an isolated, bounded thread pool so a hung repo fails fast, is skipped, and never starves anything else.

  • New module-level dedicated ThreadPoolExecutor (opal-git threads, lazily built) for scope git work, isolated from the default executor.
  • New helper run_in_git_executor(func, *args, timeout=..., **kwargs) that runs the blocking call on that pool and wraps the await in asyncio.wait_for. timeout <= 0 means no limit.
  • Clone + fetch now go through the helper with SCOPES_GIT_FETCH_TIMEOUT.
  • _clone error handling broadened to (pygit2.GitError, asyncio.TimeoutError) — a hung clone is logged and the scope skipped instead of crashing the caller. Per-scope isolation in ScopesService.sync_scope (try/except Exception) already handles a timed-out fetch, making boot best-effort once operations can no longer hang forever.
  • Dropped the now-unused run_sync import from git_fetcher.py.

New config keys (opal-server, server-only)

Env var Type Default Purpose
OPAL_SCOPES_GIT_FETCH_TIMEOUT float (s) 120.0 Hard timeout for a single scope git clone/fetch. On timeout the op is abandoned and the scope marked failed, so one unreachable repo can never block boot or other scopes. 0 = no timeout.
OPAL_SCOPES_GIT_MAX_WORKERS int 10 Size of the dedicated thread pool for scope git operations; isolates git work from the default executor.

⚠️ Caveat — the timeout is soft, not a hard kill

asyncio.wait_for cancels the await — unblocking the event loop and the awaiting coroutine — but the underlying pygit2 call keeps running on its pool thread until the OS network timeout. The dedicated bounded pool (OPAL_SCOPES_GIT_MAX_WORKERS) isolates those lingering threads so they cannot affect bundle serving or other scopes. Hard-kill via subprocess is explicitly out of scope (spec §6).

Tests

  • tests/git_executor_test.py — config defaults; helper returns value; helper times out; timeout=0 means no limit.
  • tests/fetch_timeout_test.py — a hanging op surfaces TimeoutError and wait_for unblocks promptly (< 2s with a 0.2s timeout).

Local run (Python 3.12 venv, editable installs of all three packages):

  • New tests: 5 passed
  • Full opal-server suite: 14 passed
  • Full opal-common suite: 64 passed (warnings are pre-existing deprecations)

Not runnable in this PR's CI yet

Task 4 of the plan includes the PR1 regression gate app-tests/git-leak/test_resilience.py::test_offline_repo_does_not_block_healthy_scopes and a live /healthcheck smoke test. Both depend on PR1 (test environment) being merged — the stated prerequisite — and a running stack, so they should be exercised once PR1 lands. This PR ships after PR2 and is independent of it.

Review checklist (opal-development)

  • Config keys: server-only consumption (correct component); bare env names (no OPAL_ double-prefix); mandatory description= present (covered by test_config.py); exactly one declaration per env name (no collision).
  • PDP impact: server-only keys; no opal_client/opal_common symbol touched; no PDP override of these names.
  • Pub/sub: no topic, broadcaster, or fan-out change.
  • No new dependency — stdlib concurrent.futures + asyncio only.
  • Interlock with PR4: PR4's parallel sync_scopes relies on this PR's guarantee that each fetch terminates, so its bounded gather can never wait on an infinite member.

🤖 Generated with Claude Code

dshoen619 and others added 3 commits June 23, 2026 14:19
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire scope clone/fetch through run_in_git_executor with
SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch
asyncio.TimeoutError so a hung clone is logged and the scope skipped
instead of crashing the caller. Drop the now-unused run_sync import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 23, 2026

Copy link
Copy Markdown

PER-15157

@netlify

netlify Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit d31a0b6
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a45066e09817a0008666ab2

@dshoen619 dshoen619 marked this pull request as draft June 23, 2026 11:35
@dshoen619 dshoen619 requested a review from Copilot June 24, 2026 16:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves opal-server resilience when syncing scope policy repos by ensuring git clone/fetch operations can’t block indefinitely or starve the server’s shared executor.

Changes:

  • Add a dedicated, bounded ThreadPoolExecutor and run_in_git_executor(...) helper to run blocking pygit2 operations with an asyncio.wait_for timeout.
  • Apply the helper + new timeout config to scope repo clone and fetch paths.
  • Add server config keys for timeout and executor sizing, plus focused unit tests for timeout behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/opal-server/opal_server/git_fetcher.py Introduces dedicated git executor + timeout helper; routes scope clone/fetch through it.
packages/opal-server/opal_server/config.py Adds SCOPES_GIT_FETCH_TIMEOUT and SCOPES_GIT_MAX_WORKERS configuration.
packages/opal-server/opal_server/tests/git_executor_test.py Tests config defaults and run_in_git_executor basic behavior.
packages/opal-server/opal_server/tests/fetch_timeout_test.py Tests that a hanging git op times out quickly (doesn’t block).
.claude/plans/docs/05-config-reference.md Internal config reference entry for the new env vars and caveat.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +65 to +66
loop = asyncio.get_event_loop()
fut = loop.run_in_executor(_get_git_executor(), partial(func, *args, **kwargs))

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — switched to asyncio.get_running_loop() in 5dd53a2.

Comment on lines 225 to 229
GitPolicyFetcher.repos_last_fetched[
self._source_id
] = datetime.datetime.now()
await run_sync(
await run_in_git_executor(
repo.remotes[self._remote].fetch,

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5dd53a2repos_last_fetched is now updated only after the fetch completes successfully, so a timeout/error leaves it stale and won't suppress a later force_fetch via _was_fetched_after.

Comment thread packages/opal-server/opal_server/tests/fetch_timeout_test.py Outdated
dshoen619 and others added 2 commits June 24, 2026 20:09
On Python < 3.11 asyncio.TimeoutError is a distinct class from the
builtin TimeoutError, so run_in_git_executor's wait_for timeout was not
caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10).
Normalize to the builtin TimeoutError so the documented contract holds
on every supported Python, and update the _clone catch site to match.

Also apply black/isort/docformatter formatting to satisfy pre-commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run_in_git_executor: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside an async function
- fetch_and_notify_on_changes: set repos_last_fetched only after a
  successful fetch so a timeout/error does not wrongly suppress a later
  force_fetch via _was_fetched_after
- fetch_timeout_test: measure elapsed time with time.monotonic()

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 marked this pull request as ready for review June 24, 2026 17:15
@dshoen619 dshoen619 self-assigned this Jun 24, 2026
The fetch path let TimeoutError propagate to sync_scope's catch-all,
which logged a full traceback at ERROR level for the expected
unreachable-repo case — inconsistent with the clone path's quiet
logger.error. Catch TimeoutError at the fetch site and log without a
traceback, then skip (repos_last_fetched stays stale so the next cycle
retries). Also shorten the hanging-thread sleeps in the timeout tests so
the lingering pool thread doesn't delay process teardown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 requested review from Zivxx and zeevmoney June 24, 2026 17:23
@zeevmoney

Copy link
Copy Markdown
Contributor

Review notes — overlap + gate location (planned with the opal-development skill: references/add-config-key.md, references/debug-pubsub.md)

Faithful to the PR3 plan, with two improvements over it: normalizing asyncio.TimeoutError → builtin TimeoutError (so callers catch the builtin on 3.9/3.10 where they're distinct), and moving repos_last_fetched to after a successful fetch — a timed-out fetch no longer falsely marks the source "fresh," and it's lock-safe because _should_fetch runs inside the per-source repo_lock. The two config keys follow references/add-config-key.md (server-only OpalServerConfig, bare names, mandatory descriptions, no double-prefix).

Three things to resolve:

  1. Direct overlap with Fix git clone/fetch hanging indefinitely on unreachable repos #875 (PER-13817). Same root cause — pygit2 clone_repository / remotes.fetch going through run_sync with no timeout — and the same edited blocks in git_fetcher.py (_clone and fetch_and_notify_on_changes). They cannot both merge; whichever lands second will conflict. This PR is the stronger of the two:

    Recommend closing Fix git clone/fetch hanging indefinitely on unreachable repos #875 in favor of this PR.

  2. Its regression gate lives in PR1 (test(opal-server): git leak/resilience test environment (PR1) #922). test_offline_repo_does_not_block_healthy_scopes is the fail-now/pass-after gate for this fix and only exists on test(opal-server): git leak/resilience test environment (PR1) #922. So this can't be validated end-to-end until test(opal-server): git leak/resilience test environment (PR1) #922 merges (or is merged into this branch). Suggested order: test(opal-server): git leak/resilience test environment (PR1) #922 → this.

  3. This PR is currently check-blocked by branch protection (required checks / review), not by a merge conflict — needs CI green + an approval.

Minor:

  • The dedicated pool reads SCOPES_GIT_MAX_WORKERS once, lazily, and caches the executor for the process lifetime — it isn't runtime-reconfigurable and is never shut down. Matches the plan's design; worth a one-line note in the docstring.
  • The line numbers cited in .claude/plans/docs/05-config-reference.md for the two keys are stale vs where they actually land on this branch (cosmetic).

- git_fetcher: document that the dedicated scope-git ThreadPoolExecutor
  reads SCOPES_GIT_MAX_WORKERS once on first use, caches for the process
  lifetime (not runtime-reconfigurable), and is never explicitly shut
  down — matches the PR3 design.
- 05-config-reference: fix stale config.py line refs after the master
  merge shifted the keys — SCOPES_GIT_FETCH_TIMEOUT 150-156 -> 196-202,
  SCOPES_GIT_MAX_WORKERS 157-163 -> 203-209.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619

Copy link
Copy Markdown
Author

Thanks @zeevmoney — addressed in bad21c1.

Minor items

  • Executor lifetime docstring — added a note to _get_git_executor: SCOPES_GIT_MAX_WORKERS is read once on first use, the executor is cached for the process lifetime (not runtime-reconfigurable) and is never explicitly shut down, matching the PR3 design.
  • Stale line numbers — good catch. They'd drifted after the master merge shifted config.py. Fixed the 05-config-reference.md refs: SCOPES_GIT_FETCH_TIMEOUT 150-156196-202, SCOPES_GIT_MAX_WORKERS 157-163203-209.

Things to resolve

  1. Overlap with Fix git clone/fetch hanging indefinitely on unreachable repos #875 (PER-13817) — agreed this is the stronger fix (default-on OPAL_SCOPES_GIT_FETCH_TIMEOUT=120, dedicated bounded pool, best-effort boot). Plan is to close Fix git clone/fetch hanging indefinitely on unreachable repos #875 in favor of this.
  2. Regression gate in test(opal-server): git leak/resilience test environment (PR1) #922 — agreed; the fail-now/pass-after gate (test_offline_repo_does_not_block_healthy_scopes) lives on PR1, so the plan is to land test(opal-server): git leak/resilience test environment (PR1) #922 → this and validate end-to-end there.
  3. Check-blocked — required checks (E2E, builds 3.9–3.12, pre-commit) were green on the prior head and are re-running on bad21c1e; the remaining blocker is the required approving review. A review once it's green would unblock it.

Also confirmed the two improvements you flagged are in place: asyncio.TimeoutError → builtin normalization, and repos_last_fetched moved to after a successful fetch.

… concurrent sync

Addresses review findings on PR3 (never stuck on an offline repo):

- CRITICAL: reset the dedicated git ThreadPoolExecutor after fork
  (os.register_at_fork) and shut it down at the end of preload_scopes. A
  pool built in the pre-fork gunicorn master was inherited with dead worker
  threads by every worker, so the leader's scope sync stalled forever
  (silent policy staleness). Verified with a fork repro on 3.12.

- HIGH: never use the non-thread-safe pygit2 Repository from two threads.
  A timed-out clone/fetch keeps running on its pool thread while the
  per-source_id lock is released; a per-source_id in-flight guard now skips
  a cycle while a prior op is still lingering. run_in_git_executor switches
  asyncio.wait_for -> asyncio.wait so a timeout never cancels the future
  (the thread runs to completion and clears the in-flight marker).

- HIGH: sync scopes concurrently, bounded by SCOPES_GIT_MAX_WORKERS, so one
  unreachable repo no longer serially blocks boot and other scopes.

- MEDIUM: daemon-thread pool so a lingering git op can't block interpreter
  shutdown.

- MEDIUM: stamp repos_last_fetched with the fetch start time on success
  (was completion time, which could wrongly suppress a force_fetch whose
  req_time falls within an in-flight fetch).

- rmtree(ignore_errors) for the abandoned-clone race; harden the
  env-sensitive config-defaults test; correct config/doc wording
  ("logged and skipped" instead of "marked failed").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants