device/gpu: proactive eviction with adaptive percentage threshold by devreal · Pull Request #773 · ICLDisco/parsec

devreal · 2026-05-11T14:33:01Z

Introduce a two-tier proactive GPU memory eviction mechanism to hide eviction latency and avoid task stalls on allocation failure.

Tier-1 (pre-flight in parsec_device_data_reserve_space): before walking each task's data flows, free clean LRU entries while zone utilisation exceeds mem_evict_threshold percent of total capacity. Uses the new parsec_device_try_evict_lru_one() helper which handles the readers / ref-count / trylock / CAS-readers race conditions that the reactive path already handles, and detects full-cycle scans via a cycling_sentinel.
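To make the tier-1 loop concrete, here is a minimal sketch of "free clean LRU entries while utilisation exceeds the threshold". It is not the PR's code: `toy_zone_t` and the `toy_*` functions are hypothetical stand-ins for the zone allocator and for parsec_device_try_evict_lru_one(), and the sketch leaves out the readers / ref-count / trylock handling and the cycling sentinel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device zone allocator state. */
typedef struct {
    size_t capacity;      /* total zone bytes */
    size_t in_use;        /* bytes currently allocated */
    size_t clean_lru[8];  /* sizes of clean, evictable copies (toy LRU, oldest last) */
    int    nb_clean;      /* number of entries left in clean_lru */
    int    threshold_pct; /* plays the role of mem_evict_threshold */
} toy_zone_t;

/* True while utilisation exceeds threshold_pct percent of capacity. */
static bool toy_zone_over_threshold(const toy_zone_t *z)
{
    return z->in_use * 100 > z->capacity * (size_t)z->threshold_pct;
}

/* Toy "evict one clean LRU entry"; returns false once nothing is left. */
static bool toy_try_evict_lru_one(toy_zone_t *z)
{
    if (z->nb_clean == 0)
        return false;
    z->in_use -= z->clean_lru[--z->nb_clean];
    return true;
}

/* Pre-flight pass: evict clean entries until pressure is at or below the
 * threshold, or the clean LRU runs dry. Returns the number of evictions. */
static int toy_preflight_evict(toy_zone_t *z)
{
    int evicted = 0;
    while (toy_zone_over_threshold(z) && toy_try_evict_lru_one(z))
        evicted++;
    return evicted;
}
```

With a 1000-byte zone at 980 bytes in use and a 95% threshold, a single eviction of a 30-byte clean copy is enough to bring utilisation back to the threshold, so the loop stops after one entry.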

Tier-2 (parsec_device_kernel_scheduler): when the clean LRU is empty and zone pressure is above the threshold, proactively enqueue a D2H writeback task on exec_stream[1] so dirty-page eviction latency overlaps with the upcoming H2D and kernel stages rather than blocking the critical path.

Adaptive threshold: each device carries mem_evict_threshold, initialised to parsec_gpu_mem_evict_upper (default 95%). When parsec_device_progress_stream returns PARSEC_HOOK_RETURN_NEXT for the kernel-push stream -- meaning every queued task failed to acquire memory, a true stall -- the threshold is stepped down by 5 percentage points toward parsec_gpu_mem_evict_lower (default 80%). This is intentionally coarser than per-task adjustment: a single task failing does not signal a stall; only a full pass of the pending queue failing does.
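The step-down arithmetic can be sketched as follows. This is an illustration only: `toy_step_down_threshold` is a hypothetical helper, and it assumes the guard has the same shape as the `threshold - 5 >= lower` check visible in the review diff, so the threshold never drops below the lower bound.

```c
/* Hypothetical helper: lower the eviction threshold by `step` percentage
 * points, but only if the result stays at or above `lower`. */
static int toy_step_down_threshold(int threshold, int lower, int step)
{
    if (threshold - step >= lower)
        return threshold - step;
    return threshold; /* already within one step of the lower bound */
}
```

Starting from the default upper bound of 95 with a lower bound of 80, three stalls take the threshold to 90, 85 and 80, after which it stays at 80.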

New MCA parameters registered by the CUDA and Level Zero components:
device_{cuda,level_zero}_mem_evict_upper (default 95)
device_{cuda,level_zero}_mem_evict_lower (default 80)

All zone-pressure checks are guarded with #if !defined(PARSEC_GPU_ALLOC_PER_TILE) since that mode has no zone allocator; tier-2 falls back to the simple clean-LRU-empty condition in that mode.

Supersedes #763


parsec_gpu_create_w2r_task() now accepts a required_size (bytes) and an out-parameter selected_size so callers know exactly how much dirty data was selected for the D2H transfer. The selection loop stops as soon as the accumulated nb_elts bytes reach required_size (or the max-flows cap is hit, whichever comes first).

A new per-device field mem_evict_in_flight tracks the total bytes of dirty GPU data currently queued or executing on exec_stream[1]:
- incremented in parsec_gpu_create_w2r_task() by the bytes selected
- decremented in parsec_gpu_complete_w2r_task() as each copy finishes

Tier-2 in parsec_device_kernel_scheduler() now:
1. Computes needed = zone_in_use - threshold_bytes
2. Derives still_needed = needed - mem_evict_in_flight (avoiding redundant D2H tasks when enough data is already being evicted)
3. Loops calling parsec_gpu_create_w2r_task(still_needed) and pushing each resulting task to exec_stream[1] until still_needed is satisfied or the owned LRU is exhausted

The reactive fallback (all push tasks stalled) passes SIZE_MAX so it drains as many dirty pages as the max-flows cap allows, unchanged from previous behavior.

PARSEC_GPU_ALLOC_PER_TILE mode (no zone allocator) also uses SIZE_MAX and issues one batch, keeping the prior behavior.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
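The needed / still_needed computation described in this commit can be sketched as below. `toy_still_needed` and its parameters are hypothetical names, not the PR's code; the sketch also assumes the subtractions should saturate at zero, since all quantities are unsigned byte counts.

```c
#include <stddef.h>

/* Hypothetical helper: bytes of dirty data that still need a D2H writeback
 * to bring the zone back under the threshold, given what is already in flight. */
static size_t toy_still_needed(size_t zone_capacity, size_t zone_in_use,
                               int threshold_pct, size_t in_flight)
{
    size_t pct = (size_t)threshold_pct;
    /* threshold_pct percent of capacity, computed without overflowing size_t */
    size_t threshold_bytes = zone_capacity / 100 * pct
                           + (zone_capacity % 100) * pct / 100;

    if (zone_in_use <= threshold_bytes)
        return 0; /* no pressure: nothing to evict */

    size_t needed = zone_in_use - threshold_bytes;
    /* Subtract bytes already queued or executing on the eviction stream. */
    return needed > in_flight ? needed - in_flight : 0;
}
```

For a 1000-byte zone with 980 bytes in use and a 95% threshold, 30 bytes are over the limit; with 10 bytes already in flight, 20 more need to be selected, and with 40 bytes in flight no new D2H task is required.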

The reactive fallback that issues a D2H writeback when all kernel-push tasks stall on memory now checks mem_evict_in_flight first. If evictions are already active, queuing another batch would create a storm of D2H tasks that pile up faster than they complete. Instead, let the in-flight transfers finish and free zone memory before issuing more.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three issues raised in code review:

1. MCA parameter scope
   mem_evict_upper and mem_evict_lower were registered once per GPU backend component, which means the second registration overwrites the first when both CUDA and Level Zero are enabled, and the params end up under backend-specific namespaces. Move the registrations to parsec_mca_device_init() in device.c under the "device" namespace (device_mem_evict_upper / device_mem_evict_lower), guarded by PARSEC_HAVE_CUDA || PARSEC_HAVE_HIP || PARSEC_HAVE_LEVEL_ZERO so the extern references are only present when transfer_gpu.c is compiled.

2. data_avail_epoch style
   Tier-1 used `data_avail_epoch = 1` while the rest of the function uses `data_avail_epoch++`. Changed to `++` for consistency.

3. Threshold step-down condition
   The previous placement fired on every PARSEC_HOOK_RETURN_NEXT from parsec_device_progress_stream, which tries one task per call, not all tasks. This could drive mem_evict_threshold to its minimum rapidly. The step-down is now integrated into the mem_evict_in_flight == 0 guard: we only lower the threshold when there are no active evictions AND parsec_gpu_create_w2r_task also returns NULL (no dirty pages available to queue). That is the true "stuck" condition.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
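The "stuck" condition from item 3 can be written as a small predicate. This only illustrates the condition as the commit message states it; `toy_is_stuck` is a hypothetical name, and the task pointer stands in for the return value of parsec_gpu_create_w2r_task().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: the device is considered stuck only when no
 * evictions are in flight AND no new writeback task could be created. */
static bool toy_is_stuck(size_t mem_evict_in_flight, const void *w2r_task)
{
    return mem_evict_in_flight == 0 && w2r_task == NULL;
}
```

Under this reading, the threshold would be lowered only when the predicate holds; a non-NULL task or any in-flight bytes leaves the threshold unchanged.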

bosilca · 2026-05-11T21:31:36Z

+        if( 0 == gpu_device->mem_evict_in_flight ) {
+            size_t _sel = 0;
+            gpu_task = parsec_gpu_create_w2r_task(gpu_device, es, SIZE_MAX, &_sel);
+            if(gpu_device->mem_evict_threshold - 5 >= parsec_gpu_mem_evict_lower) {


The threshold is lowered regardless of whether parsec_gpu_create_w2r_task was able to create a task. This is very aggressive: the threshold will drop to the minimum very quickly, well before there is any real pressure on the memory. Why lower it so aggressively?

Three issues from code review:

1. Proactive D2H tasks not counted in device->mutex
   Tier-2 proactive w2r tasks were pushed to exec_stream[1]->fifo_pending without incrementing device->mutex, so the GPU manager thread could exit (when the last regular task decremented mutex to zero) while proactive D2H transfers were still queued or in flight. Fix: introduce PARSEC_GPU_TASK_TYPE_PROACTIVE_D2HTRANSFER. When tier-2 creates and pushes such a task it increments device->mutex. At complete_task: the proactive path calls parsec_gpu_complete_w2r_task (which already frees the task) and then decrements mutex via the same exit logic used by regular tasks, including the "last one out" exit path that returns PARSEC_HOOK_RETURN_ASYNC.

2. Parameter validation for mem_evict_upper / mem_evict_lower
   After registering the MCA parameters, look up the actual values (which may have been overridden via env/config), clamp each to [0,100] with a warning, and swap them if lower > upper.

3. Variables moved from transfer_gpu.c to device.c
   parsec_gpu_mem_evict_upper and parsec_gpu_mem_evict_lower are now defined in device.c so they always exist regardless of which GPU backends are compiled in. The MCA registrations are now unconditional (no PARSEC_HAVE_CUDA/HIP/LEVEL_ZERO guard needed). Extern declarations remain in device_gpu.h for use by GPU-side code.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
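The validation in item 2 (clamp to [0,100], then swap if lower > upper) can be sketched like this. The function names are hypothetical and the warning goes to stderr purely for illustration; the actual change presumably uses PaRSEC's own warning facility.

```c
#include <stdio.h>

/* Hypothetical helper: clamp a percentage to [0,100], warning on change. */
static int toy_clamp_pct(const char *name, int value)
{
    int clamped = value < 0 ? 0 : (value > 100 ? 100 : value);
    if (clamped != value)
        fprintf(stderr, "warning: %s=%d out of range, using %d\n",
                name, value, clamped);
    return clamped;
}

/* Hypothetical helper: validate the two eviction bounds in place. */
static void toy_validate_evict_bounds(int *upper, int *lower)
{
    *upper = toy_clamp_pct("mem_evict_upper", *upper);
    *lower = toy_clamp_pct("mem_evict_lower", *lower);
    if (*lower > *upper) { /* bounds given in the wrong order: swap them */
        int tmp = *upper;
        *upper = *lower;
        *lower = tmp;
    }
}
```

Clamping before swapping means two out-of-range values given in the wrong order still end up as a valid, ordered pair.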

Pushing the inputs of a new task does not mean that we have to evict data immediately. If there are still pending tasks in the fifo, we can wait for memory to become available as they complete. This catches cases where we are pushing tasks faster than we can execute them and building up long fifos. If the fifos run empty, we need to react and evict data. Once we evict, we adjust the proactive threshold to avoid that situation in the future.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

bosilca · 2026-05-19T01:00:57Z

+ */
+static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
+{
+    for (int i = 2; i < gpu_device->nb_exec_streams; i++) {


Suggested change

for (int i = 2; i < gpu_device->nb_exec_streams; i++) {

for (int i = 2; i < gpu_device->num_exec_streams; i++) {

bosilca · 2026-05-19T01:01:14Z

+static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
+{
+    for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
+        if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {


Suggested change

if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {

if( !parsec_list_nolock_is_empty(gpu_device->exec_stream[i].fifo_pending) ) {

bosilca · 2026-05-19T01:04:54Z

+            _selected += gpu_copy->original->nb_elts;
            nb_cleaned++;
-            if (MAX_PARAM_COUNT == nb_cleaned)
+            if( MAX_PARAM_COUNT == nb_cleaned || _selected >= required_size )


required_size is in bytes, but if I recall correctly nb_elts is a count of datatype elements, not bytes. There is another similar check in the completion path. Please check both.
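To illustrate the unit concern, here is a sketch of a selection loop that accumulates bytes rather than element counts. Everything in it is hypothetical: `toy_copy_t`, its `elt_size` field and `TOY_MAX_FLOWS` (standing in for MAX_PARAM_COUNT) are assumptions for illustration, and how PaRSEC actually exposes the element size is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_FLOWS 4 /* stands in for MAX_PARAM_COUNT */

/* Hypothetical data copy: element count plus size of one element in bytes. */
typedef struct {
    size_t nb_elts;
    size_t elt_size;
} toy_copy_t;

/* Select copies until the accumulated size in BYTES reaches required_size
 * or the flow cap is hit. Returns the number of copies selected and stores
 * the selected byte count in *selected_size. */
static int toy_select_dirty(const toy_copy_t *copies, int nb_copies,
                            size_t required_size, size_t *selected_size)
{
    int nb_cleaned = 0;
    size_t selected = 0;
    for (int i = 0; i < nb_copies; i++) {
        selected += copies[i].nb_elts * copies[i].elt_size; /* elements -> bytes */
        nb_cleaned++;
        if (TOY_MAX_FLOWS == nb_cleaned || selected >= required_size)
            break;
    }
    *selected_size = selected;
    return nb_cleaned;
}
```

With 100 elements of 8 bytes per copy and a 1000-byte requirement, two copies are selected; comparing the raw element count against the same requirement would have selected more copies than necessary. Passing SIZE_MAX reproduces the reactive fallback, where only the flow cap stops the loop.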

…/parsec-1 into dynamic-eviction-threshold

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

devreal · 2026-06-08T23:47:25Z

Merged in master. Will test it later this week.

bosilca · 2026-06-09T00:18:56Z

You scared me here for a moment. "Merged in master" for a PR that does not build properly ... I guess you meant "rebased on master"

devreal requested a review from a team as a code owner May 11, 2026 14:33

devreal and others added 2 commits May 11, 2026 11:07

bosilca reviewed May 11, 2026


Comment thread parsec/mca/device/cuda/device_cuda_component.c Outdated

Comment thread parsec/mca/device/device_gpu.c Outdated

Comment thread parsec/mca/device/device_gpu.c Outdated

devreal force-pushed the dynamic-eviction-threshold branch from d59d40d to dcf8468 Compare May 11, 2026 18:37

devreal mentioned this pull request May 11, 2026

Throttle evictions to one w2r task at a time #763

Closed

bosilca reviewed May 11, 2026


devreal and others added 3 commits May 11, 2026 17:44

Merge branch 'master' into dynamic-eviction-threshold

ba65e9d

bosilca reviewed May 19, 2026


devreal added 3 commits June 8, 2026 19:43

Merge branch 'master' into dynamic-eviction-threshold

e4985ca

Merge branch 'dynamic-eviction-threshold' of ssh://github.com/devreal…

6f8224f

…/parsec-1 into dynamic-eviction-threshold

Fix typos

a22c93c

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

devreal wants to merge 10 commits into ICLDisco:master from devreal:dynamic-eviction-threshold
