device/gpu: proactive eviction with adaptive percentage threshold #773
devreal wants to merge 10 commits into
Introduce a two-tier proactive GPU memory eviction mechanism to hide
eviction latency and avoid task stalls on allocation failure.
Tier-1 (pre-flight in parsec_device_data_reserve_space): before walking
each task's data flows, free clean LRU entries while zone utilisation
exceeds mem_evict_threshold percent of total capacity. Uses the new
parsec_device_try_evict_lru_one() helper which handles the readers /
ref-count / trylock / CAS-readers race conditions that the reactive path
already handles, and detects full-cycle scans via a cycling_sentinel.
Tier-2 (parsec_device_kernel_scheduler): when the clean LRU is empty and
zone pressure is above the threshold, proactively enqueue a D2H writeback
task on exec_stream[1] so dirty-page eviction latency overlaps with the
upcoming H2D and kernel stages rather than blocking the critical path.
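Both tiers compare zone utilisation against a percentage of total capacity. A minimal sketch of that comparison is shown below, assuming integer byte counts; the helper names threshold_bytes() and zone_above_threshold() are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: bytes corresponding to `pct` percent of `total`.
 * Splitting total into quotient and remainder yields exactly
 * floor(total * pct / 100) without overflowing in total * pct. */
static size_t threshold_bytes(size_t total, int pct)
{
    assert(pct >= 0 && pct <= 100);
    return (total / 100) * (size_t)pct + ((total % 100) * (size_t)pct) / 100;
}

/* Hypothetical helper: true when the zone holds more bytes than the
 * threshold allows, i.e. when proactive eviction should kick in. */
static bool zone_above_threshold(size_t in_use, size_t total, int pct)
{
    return in_use > threshold_bytes(total, pct);
}
```

With a 1000-byte zone and a 95% threshold, eviction would start once more than 950 bytes are in use.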
Adaptive threshold: each device carries mem_evict_threshold, initialised
to parsec_gpu_mem_evict_upper (default 95%). When
parsec_device_progress_stream returns PARSEC_HOOK_RETURN_NEXT for the
kernel-push stream -- meaning every queued task failed to acquire memory,
a true stall -- the threshold is stepped down by 5 percentage points
toward parsec_gpu_mem_evict_lower (default 80%). This is intentionally
coarser than per-task adjustment: a single task failing does not signal a
stall; only a full pass of the pending queue failing does.
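The step-down itself is simple arithmetic. The sketch below assumes the step is a fixed 5 points and that the threshold is left unchanged when a full step would cross the lower bound, which mirrors the `mem_evict_threshold - 5 >= parsec_gpu_mem_evict_lower` test visible in the diff; the function name and the SKETCH_EVICT_STEP macro are hypothetical.

```c
#include <assert.h>

#define SKETCH_EVICT_STEP 5  /* assumed fixed step, in percentage points */

/* Hypothetical helper: returns the threshold after one detected stall.
 * The threshold only moves when a full step still stays at or above
 * `lower`; otherwise it is returned unchanged. */
static int step_down_threshold(int current, int lower)
{
    assert(lower >= 0 && current <= 100);
    if (current - SKETCH_EVICT_STEP >= lower)
        return current - SKETCH_EVICT_STEP;
    return current;
}
```

Starting from the defaults, three stalls take the threshold from 95 to 80. Under this reading, a threshold that is not a whole number of steps above the lower bound (for example 83 with a lower bound of 80) stops above the bound rather than being clamped to it.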
New MCA parameters registered by the CUDA and Level Zero components:
device_{cuda,level_zero}_mem_evict_upper (default 95)
device_{cuda,level_zero}_mem_evict_lower (default 80)
All zone-pressure checks are guarded with #if !defined(PARSEC_GPU_ALLOC_PER_TILE)
since that mode has no zone allocator; tier-2 falls back to the simple
clean-LRU-empty condition in that mode.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parsec_gpu_create_w2r_task() now accepts a required_size (bytes) and an out-parameter selected_size so callers know exactly how much dirty data was selected for the D2H transfer. The selection loop stops as soon as the accumulated nb_elts bytes reach required_size (or the max-flows cap is hit, whichever comes first).
A new per-device field mem_evict_in_flight tracks the total bytes of dirty GPU data currently queued or executing on exec_stream[1]:
- incremented in parsec_gpu_create_w2r_task() by the bytes selected
- decremented in parsec_gpu_complete_w2r_task() as each copy finishes
Tier-2 in parsec_device_kernel_scheduler() now:
1. Computes needed = zone_in_use - threshold_bytes
2. Derives still_needed = needed - mem_evict_in_flight (avoiding redundant D2H tasks when enough data is already being evicted)
3. Loops calling parsec_gpu_create_w2r_task(still_needed) and pushing each resulting task to exec_stream[1] until still_needed is satisfied or the owned LRU is exhausted
The reactive fallback (all push tasks stalled) passes SIZE_MAX so it drains as many dirty pages as the max-flows cap allows, unchanged from previous behavior.
PARSEC_GPU_ALLOC_PER_TILE mode (no zone allocator) also uses SIZE_MAX and issues one batch, keeping the prior behavior.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
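The byte accounting described above can be sketched as follows. Because the quantities are unsigned, both subtractions need to saturate at zero. The names eviction_still_needed(), select_batch() and SKETCH_MAX_FLOWS are hypothetical stand-ins, and the array of sizes replaces the owned LRU purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_FLOWS 4  /* stand-in for the max-flows cap */

/* Hypothetical helper: bytes that still have to be queued for eviction,
 * after discounting what is already in flight. Saturates at zero. */
static size_t eviction_still_needed(size_t zone_in_use, size_t threshold_bytes,
                                    size_t in_flight)
{
    if (zone_in_use <= threshold_bytes)
        return 0;
    size_t needed = zone_in_use - threshold_bytes;
    return needed > in_flight ? needed - in_flight : 0;
}

/* Hypothetical stand-in for the selection loop: consumes entries from
 * `sizes` starting at *next until `required` bytes are selected, the
 * flow cap is reached, or the list is exhausted. Returns bytes selected. */
static size_t select_batch(const size_t *sizes, size_t n, size_t *next,
                           size_t required)
{
    assert(sizes != NULL && next != NULL);
    size_t selected = 0, count = 0;
    while (*next < n && count < SKETCH_MAX_FLOWS && selected < required) {
        selected += sizes[(*next)++];
        count++;
    }
    return selected;
}
```

Passing SIZE_MAX as `required` reproduces the reactive behaviour in this sketch: the loop is then bounded only by the flow cap and the length of the list.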
The reactive fallback that issues a D2H writeback when all kernel-push tasks stall on memory now checks mem_evict_in_flight first. If evictions are already active, queuing another batch would create a storm of D2H tasks that pile up faster than they complete. Instead, let the in-flight transfers finish and free zone memory before issuing more.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three issues raised in code review:
1. MCA parameter scope
   mem_evict_upper and mem_evict_lower were registered once per GPU backend component, which means the second registration overwrites the first when both CUDA and Level Zero are enabled, and the params end up under backend-specific namespaces. Move the registrations to parsec_mca_device_init() in device.c under the "device" namespace (device_mem_evict_upper / device_mem_evict_lower), guarded by PARSEC_HAVE_CUDA || PARSEC_HAVE_HIP || PARSEC_HAVE_LEVEL_ZERO so the extern references are only present when transfer_gpu.c is compiled.
2. data_avail_epoch style
   Tier-1 used `data_avail_epoch = 1` while the rest of the function uses `data_avail_epoch++`. Changed to `++` for consistency.
3. Threshold step-down condition
   The previous placement fired on every PARSEC_HOOK_RETURN_NEXT from parsec_device_progress_stream, which tries one task per call, not all tasks. This could drive mem_evict_threshold to its minimum rapidly. The step-down is now integrated into the mem_evict_in_flight == 0 guard: we only lower the threshold when there are no active evictions AND parsec_gpu_create_w2r_task also returns NULL (no dirty pages available to queue). That is the true "stuck" condition.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
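The step-down condition from item 3 can be read as a three-way decision. The sketch below encodes only what this commit message states; the enum, its values and the function name are hypothetical, and later commits in this PR may change the behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of handling a stalled kernel-push stream. */
typedef enum {
    STALL_WAIT_FOR_IN_FLIGHT,  /* evictions already active: do nothing */
    STALL_QUEUE_WRITEBACK,     /* a w2r task was created: push it */
    STALL_LOWER_THRESHOLD      /* nothing in flight, nothing to evict */
} stall_action_t;

/* `w2r_task_created` stands for "parsec_gpu_create_w2r_task returned
 * non-NULL" and is only meaningful when in_flight is zero. */
static stall_action_t on_push_stall(size_t in_flight, bool w2r_task_created)
{
    if (in_flight != 0)
        return STALL_WAIT_FOR_IN_FLIGHT;
    if (w2r_task_created)
        return STALL_QUEUE_WRITEBACK;
    return STALL_LOWER_THRESHOLD;
}
```

Only the last branch corresponds to the "stuck" condition described in item 3.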
d59d40d to dcf8468
    if( 0 == gpu_device->mem_evict_in_flight ) {
        size_t _sel = 0;
        gpu_task = parsec_gpu_create_w2r_task(gpu_device, es, SIZE_MAX, &_sel);
        if(gpu_device->mem_evict_threshold - 5 >= parsec_gpu_mem_evict_lower) {
The threshold manipulation happens irrespective of whether parsec_gpu_create_w2r_task was able to create a task or not. This is very aggressive, because you will lower the threshold to the minimum very quickly, well before there is any real pressure on the memory. Why lower it so aggressively?
Three issues from code review:
1. Proactive D2H tasks not counted in device->mutex
   Tier-2 proactive w2r tasks were pushed to exec_stream[1]->fifo_pending without incrementing device->mutex, so the GPU manager thread could exit (when the last regular task decremented mutex to zero) while proactive D2H transfers were still queued or in flight. Fix: introduce PARSEC_GPU_TASK_TYPE_PROACTIVE_D2HTRANSFER. When tier-2 creates and pushes such a task it increments device->mutex. At complete_task: the proactive path calls parsec_gpu_complete_w2r_task (which already frees the task) and then decrements mutex via the same exit logic used by regular tasks, including the "last one out" exit path that returns PARSEC_HOOK_RETURN_ASYNC.
2. Parameter validation for mem_evict_upper / mem_evict_lower
   After registering the MCA parameters, look up the actual values (which may have been overridden via env/config), clamp each to [0,100] with a warning, and swap them if lower > upper.
3. Variables moved from transfer_gpu.c to device.c
   parsec_gpu_mem_evict_upper and parsec_gpu_mem_evict_lower are now defined in device.c so they always exist regardless of which GPU backends are compiled in. The MCA registrations are now unconditional (no PARSEC_HAVE_CUDA/HIP/LEVEL_ZERO guard needed). Extern declarations remain in device_gpu.h for use by GPU-side code.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
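The validation in item 2 could look roughly like the sketch below. It works on plain integers and prints to stderr; the function names are hypothetical, and the real patch presumably goes through the MCA lookup and PaRSEC's own warning facility instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: clamp a percentage to [0,100], warning when the
 * supplied value had to be adjusted. */
static int clamp_percent(const char *name, int value)
{
    assert(name != NULL);
    if (value < 0 || value > 100) {
        int clamped = value < 0 ? 0 : 100;
        fprintf(stderr, "warning: %s=%d is outside [0,100], using %d\n",
                name, value, clamped);
        return clamped;
    }
    return value;
}

/* Hypothetical helper: clamp both bounds, then swap them if they are
 * given in the wrong order so that *lower <= *upper afterwards. */
static void validate_evict_bounds(int *upper, int *lower)
{
    *upper = clamp_percent("mem_evict_upper", *upper);
    *lower = clamp_percent("mem_evict_lower", *lower);
    if (*lower > *upper) {
        int tmp = *upper;
        *upper = *lower;
        *lower = tmp;
    }
}
```

Clamping before swapping means that out-of-range inputs such as upper=120, lower=-5 end up as 100 and 0 without triggering a swap.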
Pushing the inputs of a new task does not mean that we have to immediately evict data. If there are still pending tasks in the fifo we can wait for memory to become available as a result of them completing. This catches cases where we are pushing tasks faster than we can execute them and building up long fifos. If the fifos run empty we need to react and thus evict data. Once we evict, we adjust the proactive threshold to avoid that in the future.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
     */
    static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
    {
        for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
    -    for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
    +    for (int i = 2; i < gpu_device->num_exec_streams; i++) {
    static bool gpu_device_exec_streams_fifo_empty( parsec_device_gpu_module_t *gpu_device )
    {
        for (int i = 2; i < gpu_device->nb_exec_streams; i++) {
            if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {
    -        if( !parsec_list_nolock_is_empty(gpu_device->exec_streams[i].fifo_pending) ) {
    +        if( !parsec_list_nolock_is_empty(gpu_device->exec_stream[i].fifo_pending) ) {
         _selected += gpu_copy->original->nb_elts;
         nb_cleaned++;
    -    if (MAX_PARAM_COUNT == nb_cleaned)
    +    if( MAX_PARAM_COUNT == nb_cleaned || _selected >= required_size )
required_size is in bytes, but if I recall correctly nb_elts is in number of datatypes. There is another similar check in the completion path. Please check both.
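If nb_elts does turn out to be an element count rather than a byte count, the accumulation would need a conversion along these lines. This is only a sketch of the unit fix: copy_size_bytes() and its elt_size argument are hypothetical, and how the element size would be obtained from the datatype is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts an element count into bytes so it can
 * be compared against a byte budget such as required_size. Saturates
 * at SIZE_MAX instead of wrapping around on overflow. */
static size_t copy_size_bytes(size_t nb_elts, size_t elt_size)
{
    assert(elt_size > 0);
    if (nb_elts > SIZE_MAX / elt_size)
        return SIZE_MAX;
    return nb_elts * elt_size;
}
```

The same conversion would apply on the completion path, so that mem_evict_in_flight is incremented and decremented in the same unit.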
…/parsec-1 into dynamic-eviction-threshold
Merged in master. Will test it later this week.
You scared me here for a moment. "Merged in master" for a PR that does not build properly ... I guess you meant "rebased on master".
Supersedes #763