feat(client): bound concurrent in-flight data updates to cap peak memory#917
Open
Kyzgor wants to merge 1 commit into
Open
feat(client): bound concurrent in-flight data updates to cap peak memory#917Kyzgor wants to merge 1 commit into
Kyzgor wants to merge 1 commit into
Conversation
`trigger_data_update` fires each update fire-and-forget into an unbounded `TasksPool`, and `_update_policy_data` fetches inside a per-path lock whose waiters queue without bound. Under a burst — a reconnect storm across a fleet, frequent periodic updates, or a slow policy store — the number of in-flight updates (each holding its fetched dataset) is bounded only by the arrival rate, inflating the peak working set. Add an `asyncio.Semaphore` (`DATA_UPDATER_MAX_CONCURRENT_UPDATES`, default 10) acquired around the fetch+write of each update entry, so peak concurrency — and hence peak memory — is bounded. The semaphore blocks rather than drops, so no update is lost, and per-path write ordering (the hierarchical lock) is preserved. Adds a regression test asserting peak concurrent writes stays at or below the cap under a burst. Related to permitio#844, permitio#770
✅ Deploy Preview for opal-docs canceled.
|
1 similar comment
✅ Deploy Preview for opal-docs canceled.
|
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes Issue
Related to #844 and #770. This is hardening; the primary leak in those issues is fixed separately
in the fetch-worker PR (#916). There is no "Closes" here, this is a defense-in-depth bound rather
than the reported-bug fix.
Changes proposed
Bounds how many data-update entries fetch and write to the policy store concurrently, so a burst
of updates cannot stack an unbounded number of in-flight datasets in memory.
opal_client/config.py: newDATA_UPDATER_MAX_CONCURRENT_UPDATES(default10).opal_client/data/updater.py: anasyncio.Semaphoreacquired as the outermost context manageraround each entry's fetch and write in
_update_policy_data, plus a constructor overridemax_concurrent_data_updates.opal_client/tests/data_updater_backpressure_test.py(new): asserts peak concurrent writesstays at or below the cap under a burst, and that every update is still applied (no drops).
Why
trigger_data_updatefires each update fire-and-forget into an unboundedTasksPool, and_update_policy_datafetches inside a per-dst_pathhierarchical lock whose waiters queue withoutbound. Under load (a reconnect storm across a fleet, frequent periodic updates, or a slow policy
store) the count of in-flight updates, each holding its fetched dataset, is bounded only by the
arrival rate, so the peak working set can grow large before the backlog drains.
This is a different failure mode from the per-worker frame retention fixed in #916 (that one is a
steady-state high-water mark; this one is a transient burst spike). They are complementary.
Check List (Check all the applicable boxes)
Screenshots
N/A. The behaviour is memory and concurrency, quantified below.
Note to reviewers
Design choices, and what is preserved:
context manager inside the existing
async with. It does not wrap_send_reports(which usesthe same
TasksPool), so report delivery is never throttled behind data writes.dst_pathwrite ordering) is untouched.Empirical results from a local harness driving the real
DataUpdater-> fetch path:Peak concurrent writes pin to the cap regardless of N, and peak RSS stops scaling with load. A
larger A/B (45 updates x 4 MB with a slow store): 1,202 MB drops to ~340 MB at cap=5. The
regression test fails if the semaphore is removed (observed peak concurrency jumps to about the
burst size).
On the default value:
10is a balance, high enough not to throttle typical multi-entry configsand low enough to bound memory under pathological bursts. It is fully configurable, and I am happy
to adjust the default if you prefer a different point, or to make it opt-in (an unbounded default).
CI note: the
security/snykcheck tends to error on fork PRs (org token/quota) rather thanindicating a code problem.