device/gpu: proactive eviction with adaptive percentage threshold #773

Open: devreal wants to merge 10 commits into ICLDisco:master from devreal:dynamic-eviction-threshold

devreal commented May 11, 2026

Introduce a two-tier proactive GPU memory eviction mechanism to hide eviction latency and avoid task stalls on allocation failure.

Tier-1 (pre-flight in parsec_device_data_reserve_space): before walking each task's data flows, free clean LRU entries while zone utilisation exceeds mem_evict_threshold percent of total capacity. Uses the new parsec_device_try_evict_lru_one() helper which handles the readers / ref-count / trylock / CAS-readers race conditions that the reactive path already handles, and detects full-cycle scans via a cycling_sentinel.
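
As a rough illustration of the tier-1 loop, here is a minimal sketch on a toy zone model. All `toy_*` names and the struct layout are hypothetical and do not come from the PaRSEC sources; the real helper also has to deal with readers, reference counts, and locking, which this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone allocator plus its clean LRU. */
typedef struct {
    size_t total_bytes;     /* zone capacity */
    size_t in_use_bytes;    /* bytes currently allocated */
    size_t clean_sizes[8];  /* sizes of clean LRU entries, oldest first */
    int    clean_count;     /* number of valid entries in clean_sizes */
    int    clean_head;      /* index of the oldest remaining entry */
} toy_zone_t;

/* True while utilisation exceeds threshold_pct percent of capacity.
 * Integer comparison of in_use/total > pct/100 (assumes no overflow). */
static bool toy_zone_over_threshold(const toy_zone_t *z, int threshold_pct)
{
    return z->in_use_bytes * 100 > z->total_bytes * (size_t)threshold_pct;
}

/* Free the oldest clean entry; returns false when the clean LRU is empty. */
static bool toy_evict_lru_one(toy_zone_t *z)
{
    if (z->clean_head >= z->clean_count)
        return false;
    z->in_use_bytes -= z->clean_sizes[z->clean_head++];
    return true;
}

/* Pre-flight pass: evict clean entries until utilisation is at or below
 * the threshold, or nothing clean is left. Returns the eviction count. */
static int toy_preflight_evict(toy_zone_t *z, int threshold_pct)
{
    int evicted = 0;
    while (toy_zone_over_threshold(z, threshold_pct) && toy_evict_lru_one(z))
        evicted++;
    return evicted;
}
```

The loop stops as soon as either condition fails, so an empty clean LRU leaves the zone above the threshold, which is the case tier-2 is meant to cover.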

Tier-2 (parsec_device_kernel_scheduler): when the clean LRU is empty and zone pressure is above the threshold, proactively enqueue a D2H writeback task on exec_stream[1] so dirty-page eviction latency overlaps with the upcoming H2D and kernel stages rather than blocking the critical path.

Adaptive threshold: each device carries mem_evict_threshold, initialised to parsec_gpu_mem_evict_upper (default 95%). When parsec_device_progress_stream returns PARSEC_HOOK_RETURN_NEXT for the kernel-push stream -- meaning every queued task failed to acquire memory, a true stall -- the threshold is stepped down by 5 percentage points toward parsec_gpu_mem_evict_lower (default 80%). This is intentionally coarser than per-task adjustment: a single task failing does not signal a stall; only a full pass of the pending queue failing does.
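
The step-down arithmetic could look roughly like the sketch below. The 5-point step and the lower bound are taken from the description; the function name is hypothetical, and clamping to the lower bound (rather than refusing a step that would overshoot it) is an assumption of this sketch.

```c
/* Hypothetical helper: lower the threshold by one 5-point step,
 * never going below lower_pct. */
static int toy_step_down_threshold(int current_pct, int lower_pct)
{
    const int step = 5;
    if (current_pct <= lower_pct)
        return current_pct;              /* already at the floor */
    int next = current_pct - step;
    return (next < lower_pct) ? lower_pct : next;
}
```

With the defaults, repeated stalls would move the threshold 95, 90, 85, 80 and then hold it at 80.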

New MCA parameters registered by the CUDA and Level Zero components:
device_{cuda,level_zero}_mem_evict_upper (default 95)
device_{cuda,level_zero}_mem_evict_lower (default 80)

All zone-pressure checks are guarded with #if !defined(PARSEC_GPU_ALLOC_PER_TILE) since that mode has no zone allocator; tier-2 falls back to the simple clean-LRU-empty condition in that mode.

Supersedes #763

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
devreal requested a review from a team as a code owner May 11, 2026 14:33
devreal and others added 2 commits May 11, 2026 11:07
parsec_gpu_create_w2r_task() now accepts a required_size (bytes) and an
out-parameter selected_size so callers know exactly how much dirty data
was selected for the D2H transfer.  The selection loop stops as soon as
the accumulated nb_elts bytes reach required_size (or the max-flows cap
is hit, whichever comes first).
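
A minimal sketch of that selection loop, assuming each dirty copy's size is already expressed in bytes. `TOY_MAX_FLOWS` and `toy_select_dirty` are hypothetical names standing in for the max-flows cap and the real selection code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_FLOWS 4  /* hypothetical stand-in for the max-flows cap */

/* Walk dirty copies in LRU order (sizes in bytes) and select them until
 * required_size bytes are covered or the cap is reached.
 * Returns the number of copies selected; the selected byte total is
 * written to *selected_size. */
static int toy_select_dirty(const size_t *dirty_bytes, int n,
                            size_t required_size, size_t *selected_size)
{
    size_t selected = 0;
    int count = 0;
    for (int i = 0; i < n; i++) {
        selected += dirty_bytes[i];
        count++;
        if (count == TOY_MAX_FLOWS || selected >= required_size)
            break;
    }
    *selected_size = selected;
    return count;
}
```

Passing `SIZE_MAX` as `required_size` makes the byte condition unreachable in practice, so only the cap ends the loop.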

A new per-device field mem_evict_in_flight tracks the total bytes of
dirty GPU data currently queued or executing on exec_stream[1]:
- incremented in parsec_gpu_create_w2r_task() by the bytes selected
- decremented in parsec_gpu_complete_w2r_task() as each copy finishes

Tier-2 in parsec_device_kernel_scheduler() now:
1. Computes needed = zone_in_use - threshold_bytes
2. Derives still_needed = needed - mem_evict_in_flight (avoiding
   redundant D2H tasks when enough data is already being evicted)
3. Loops calling parsec_gpu_create_w2r_task(still_needed) and pushing
   each resulting task to exec_stream[1] until still_needed is satisfied
   or the owned LRU is exhausted
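
Steps 1 and 2 amount to two saturating subtractions. The sketch below uses hypothetical names and assumes the threshold is converted from a percentage to bytes by dividing the zone size first (which rounds down slightly but avoids overflow).

```c
#include <stddef.h>

/* a - b, clamped at zero so unsigned arithmetic never wraps. */
static size_t toy_sat_sub(size_t a, size_t b)
{
    return a > b ? a - b : 0;
}

/* Bytes that still have to be queued for D2H eviction, given what is
 * already in flight. Returns 0 when no new task is needed. */
static size_t toy_still_needed(size_t zone_in_use, size_t zone_total,
                               int threshold_pct, size_t in_flight)
{
    size_t threshold_bytes = zone_total / 100 * (size_t)threshold_pct;
    size_t needed = toy_sat_sub(zone_in_use, threshold_bytes);
    return toy_sat_sub(needed, in_flight);
}
```

The clamping matters because all quantities are unsigned: without it, a zone below the threshold or a large in-flight count would wrap around to a huge request.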

The reactive fallback (all push tasks stalled) passes SIZE_MAX so it
drains as many dirty pages as the max-flows cap allows, unchanged from
previous behavior.  PARSEC_GPU_ALLOC_PER_TILE mode (no zone allocator)
also uses SIZE_MAX and issues one batch, keeping the prior behavior.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The reactive fallback that issues a D2H writeback when all kernel-push
tasks stall on memory now checks mem_evict_in_flight first.  If evictions
are already active, queuing another batch would create a storm of D2H
tasks that pile up faster than they complete.  Instead, let the in-flight
transfers finish and free zone memory before issuing more.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread parsec/mca/device/cuda/device_cuda_component.c Outdated
Comment thread parsec/mca/device/device_gpu.c Outdated
Comment thread parsec/mca/device/device_gpu.c Outdated
Three issues raised in code review:

1. MCA parameter scope
   mem_evict_upper and mem_evict_lower were registered once per GPU
   backend component, which means the second registration overwrites the
   first when both CUDA and Level Zero are enabled, and the params end up
   under backend-specific namespaces.  Move the registrations to
   parsec_mca_device_init() in device.c under the "device" namespace
   (device_mem_evict_upper / device_mem_evict_lower), guarded by
   PARSEC_HAVE_CUDA || PARSEC_HAVE_HIP || PARSEC_HAVE_LEVEL_ZERO so the
   extern references are only present when transfer_gpu.c is compiled.

2. data_avail_epoch style
   Tier-1 used `data_avail_epoch = 1` while the rest of the function uses
   `data_avail_epoch++`.  Changed to `++` for consistency.

3. Threshold step-down condition
   The previous placement fired on every PARSEC_HOOK_RETURN_NEXT from
   parsec_device_progress_stream, which tries one task per call — not all
   tasks.  This could drive mem_evict_threshold to its minimum rapidly.
   The step-down is now integrated into the mem_evict_in_flight == 0
   guard: we only lower the threshold when there are no active evictions
   AND parsec_gpu_create_w2r_task also returns NULL (no dirty pages
   available to queue).  That is the true "stuck" condition.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread parsec/mca/device/device_gpu.c
if( 0 == gpu_device->mem_evict_in_flight ) {
size_t _sel = 0;
gpu_task = parsec_gpu_create_w2r_task(gpu_device, es, SIZE_MAX, &_sel);
if(gpu_device->mem_evict_threshold - 5 >= parsec_gpu_mem_evict_lower) {

The threshold is lowered regardless of whether parsec_gpu_create_w2r_task was able to create any task. This is very aggressive: the threshold will drop to the minimum very quickly, well before there is any real pressure on memory. Why lower it so aggressively?

Comment thread parsec/mca/device/device.c Outdated
devreal and others added 3 commits May 11, 2026 17:44
Three issues from code review:

1. Proactive D2H tasks not counted in device->mutex
   Tier-2 proactive w2r tasks were pushed to exec_stream[1]->fifo_pending
   without incrementing device->mutex, so the GPU manager thread could
   exit (when the last regular task decremented mutex to zero) while
   proactive D2H transfers were still queued or in flight.

   Fix: introduce PARSEC_GPU_TASK_TYPE_PROACTIVE_D2HTRANSFER. When tier-2
   creates and pushes such a task it increments device->mutex. At
   complete_task: the proactive path calls parsec_gpu_complete_w2r_task
   (which already frees the task) and then decrements mutex via the same
   exit logic used by regular tasks, including the "last one out" exit
   path that returns PARSEC_HOOK_RETURN_ASYNC.

2. Parameter validation for mem_evict_upper / mem_evict_lower
   After registering the MCA parameters, look up the actual values (which
   may have been overridden via env/config), clamp each to [0,100] with a
   warning, and swap them if lower > upper.

3. Variables moved from transfer_gpu.c to device.c
   parsec_gpu_mem_evict_upper and parsec_gpu_mem_evict_lower are now
   defined in device.c so they always exist regardless of which GPU
   backends are compiled in. The MCA registrations are now unconditional
   (no PARSEC_HAVE_CUDA/HIP/LEVEL_ZERO guard needed). Extern declarations
   remain in device_gpu.h for use by GPU-side code.
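
The validation in item 2 could be sketched as follows. The helper names and warning text are hypothetical; only the clamp-to-[0,100] and swap-if-inverted behavior is taken from the description above.

```c
#include <stdio.h>

/* Clamp a percentage to [0,100], warning when the value was out of range. */
static int toy_clamp_pct(const char *name, int value)
{
    if (value < 0 || value > 100) {
        int clamped = value < 0 ? 0 : 100;
        fprintf(stderr, "warning: %s=%d outside [0,100], using %d\n",
                name, value, clamped);
        return clamped;
    }
    return value;
}

/* Validate both bounds in place and swap them if lower > upper. */
static void toy_validate_evict_bounds(int *upper, int *lower)
{
    *upper = toy_clamp_pct("mem_evict_upper", *upper);
    *lower = toy_clamp_pct("mem_evict_lower", *lower);
    if (*lower > *upper) {
        int tmp = *upper;
        *upper = *lower;
        *lower = tmp;
    }
}
```

Clamping before swapping keeps the two checks independent: the swap always sees values that are already in range.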

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pushing the inputs of a new task does not mean that we have to immediately
evict data. If there are still pending tasks in the fifo we can
wait for memory to become available as a result of them completing.
This catches cases where we are pushing tasks faster than we can execute
and building up long fifos.
If the fifos run empty we need to react and thus evict data.
Once we evict, we adjust the proactive threshold to avoid that in the future.
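
A rough sketch of that decision, modelling each stream's fifo as a pending-task count. The names are hypothetical, and starting the scan at index 2 assumes streams 0 and 1 are the transfer streams, so only the compute streams are inspected.

```c
#include <stdbool.h>
#include <stddef.h>

/* pending[i] is the number of tasks waiting on stream i.
 * Returns true when no compute stream (index >= 2) has pending work. */
static bool toy_compute_fifos_empty(const size_t *pending, int num_streams)
{
    for (int i = 2; i < num_streams; i++) {
        if (pending[i] != 0)
            return false;
    }
    return true;
}

/* Evict only when an allocation failed and no pending task could still
 * release memory by completing. */
static bool toy_should_evict_now(const size_t *pending, int num_streams,
                                 bool alloc_failed)
{
    return alloc_failed && toy_compute_fifos_empty(pending, num_streams);
}
```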

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Comment thread parsec/mca/device/device_gpu.c Outdated
*/
static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
{
for (int i = 2; i < gpu_device->nb_exec_streams; i++) {

Suggested change
- for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
+ for (int i = 2; i < gpu_device->num_exec_streams; i++) {

Comment thread parsec/mca/device/device_gpu.c Outdated
static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
{
for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {

Suggested change
- if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {
+ if( !parsec_list_nolock_is_empty(gpu_device->exec_stream[i].fifo_pending) ) {

_selected += gpu_copy->original->nb_elts;
nb_cleaned++;
if (MAX_PARAM_COUNT == nb_cleaned)
if( MAX_PARAM_COUNT == nb_cleaned || _selected >= required_size )

bosilca commented May 19, 2026

required_size is in bytes, but if I recall correctly nb_elts is in number of datatypes. There is another similar check in the completion path. Please check both.

devreal added 3 commits June 8, 2026 19:43
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
devreal commented Jun 8, 2026

Merged in master. Will test it later this week.

bosilca commented Jun 9, 2026

You scared me here for a moment. "Merged in master" for a PR that does not build properly ... I guess you meant "rebased on master"
