Use-after-free under memory pressure: buffer-cache trim frees an MTLBuffer still used by an in-flight command buffer (kIOGPUCommandBufferCallbackErrorInvalidResource) #3689

@spokvulcan

Summary

Under sustained allocation + buffer-cache churn, the Metal allocator can free an MTLBuffer that an in-flight command buffer still references, crashing the process with kIOGPUCommandBufferCallbackErrorInvalidResource. Command buffers are created with commandBufferWithUnretainedReferences(), so Metal does not keep referenced buffers alive — and the buffer cache's trim releases them anyway.

Fix proposed in #3688.

Environment

  • MLX: consumed via mlx-swift 0.31.4 → mlx core v0.31.1 (ce45c525). Also reproduced/inspected against current main (which has been refactored to use a CommandEncoder with two unretained command-buffer creation sites).
  • Device: MacBook Pro (Mac15,9), Apple M3 Max, 48 GB unified memory
  • OS: macOS 26.5.1 (25F80)
  • Workload: an on-device, OpenAI-compatible inference server (Tesseract) running a 27B 4-bit vision model behind a tiered RAM+SSD prefix cache that restores KV state and re-runs prefill — i.e. steady allocation and cache eviction/trim while prior command buffers are still executing.

Symptom

Exception Type:  EXC_CRASH (SIGABRT)
Triggered by Thread: com.Metal.CompletionQueueDispatch
  mlx::core::gpu::check_error(...)
Termination: abort() called

With the Metal validation layer (MTL_DEBUG_LAYER=1 MTL_DEBUG_LAYER_ERROR_MODE=assert):

-[MTLDebugCommandBuffer preCommit]: ... failed assertion
'command buffer references deallocated object which previously existed at address 0x...'

The IOGPU error code is kIOGPUCommandBufferCallbackErrorInvalidResource.

Root cause

  1. MetalAllocator::free() recycles a buffer into the reuse cache (recycle_to_cache) as soon as its refcount hits 0 — even if a command buffer that used it is still in flight (e.g. after async_eval).
  2. When MetalAllocator::malloc() finds the cache over max_pool_size_ (or memory pressure past gc_limit_), it calls buffer_cache_.release_cached_buffers(...), whose free callback runs the real buf->release() (and residency_set_.erase(buf)).
  3. Buffers use MTL::ResourceHazardTrackingModeUntracked, and command buffers are created with commandBufferWithUnretainedReferences(), so nothing keeps the MTLBuffer alive — release() deallocates the allocation out from under the GPU.
  4. The in-flight command buffer then fails completion with kIOGPUCommandBufferCallbackErrorInvalidResource.

(All in mlx/backend/metal/allocator.cpp and mlx/backend/metal/device.cpp.)
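
To make the lifetime problem in steps 1-4 concrete, here is a minimal toy model in plain C++. It does not call Metal or MLX. ToyBuffer, ToyCommandBuffer, ToyCache and run_scenario are hypothetical names invented for this illustration, and the std::shared_ptr use count only stands in for the MTLBuffer retain count. It is a sketch of the reasoning, not the allocator's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for an MTLBuffer. The shared_ptr use count plays the role of
// the retain count. Nothing in this file touches Metal.
struct ToyBuffer {
  std::size_t size;
};

// Toy command buffer (hypothetical). retained == true models commandBuffer();
// retained == false models commandBufferWithUnretainedReferences().
class ToyCommandBuffer {
 public:
  explicit ToyCommandBuffer(bool retained) : retained_(retained) {}

  void use(const std::shared_ptr<ToyBuffer>& buf) {
    weak_refs_.push_back(buf);  // what the GPU will touch later
    if (retained_) {
      strong_refs_.push_back(buf);  // extra reference only in retained mode
    }
  }

  // Models the completion callback: true if every referenced buffer still
  // exists, false if any of them was deallocated while in flight.
  bool complete() {
    bool ok = true;
    for (const auto& w : weak_refs_) {
      ok = ok && !w.expired();
    }
    strong_refs_.clear();  // retained references are dropped at completion
    return ok;
  }

 private:
  bool retained_;
  std::vector<std::weak_ptr<ToyBuffer>> weak_refs_;
  std::vector<std::shared_ptr<ToyBuffer>> strong_refs_;
};

// Toy reuse cache (hypothetical): recycle() parks a freed buffer, trim()
// drops the cache's reference, like the real release() in step 2.
class ToyCache {
 public:
  void recycle(std::shared_ptr<ToyBuffer> buf) {
    cached_.push_back(std::move(buf));
  }
  void trim() { cached_.clear(); }

 private:
  std::vector<std::shared_ptr<ToyBuffer>> cached_;
};

// Walks through steps 1-4 and reports whether the in-flight command buffer
// completed with all of its buffers intact.
bool run_scenario(bool retained) {
  ToyCache cache;
  ToyCommandBuffer cb(retained);
  auto buf = std::make_shared<ToyBuffer>(ToyBuffer{1024});
  cb.use(buf);                    // encoded and committed, still in flight
  cache.recycle(std::move(buf));  // step 1: refcount hit 0, buffer is cached
  cache.trim();                   // step 2: malloc() trims the cache
  return cb.complete();           // steps 3-4: is the buffer still alive?
}
```

In this model run_scenario(false) returns false because the trim destroys the only owning reference while the command buffer is still pending, and run_scenario(true) returns true because the command buffer holds its own reference until completion.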

Reproduction

The bug is a timing-sensitive race: malloc-driven cache trimming must release a buffer in the window between a command buffer being committed and completing. Anything that shifts the timing or suppresses the trim hides it: running under a debugger and raising the cache limit so the trim never fires both make the crash disappear, which is itself diagnostic.

Reliable trigger in our app:

  1. Send a large multi-image request to a 27B vision model through the prefix-cache server (cold request — populates + persists a KV snapshot).
  2. Re-send the same request so the warm path restores the snapshot from SSD and re-runs prefill (sustained allocation + cache churn).
  3. The server crashes within ~16 s on the warm request. With concurrent requests it crashes on the first round.

Repro harness (replays recorded requests against the live server and watches for the abort): scripts/repro-image-cache-crash.sh in the linked repo.

I'm happy to put together a minimal standalone MLX repro (allocate under memory pressure + async_eval + cache trim while command buffers are in flight) if that would help — just let me know.

Fix

#3688 — use retained references (commandBuffer()) at the command-buffer creation site(s) so Metal holds referenced buffers until completion; the cache trim's release() then only drops MLX's own reference. With the fix the previously-crashing workload runs to completion, and warm prefix-cache restores are faster than cold runs (confirming this is a correctness fix, not a slowdown that masks the race).

If a lower-overhead approach is preferred over global retained references (e.g. deferring release of cached buffers still referenced by in-flight command buffers), I'm glad to rework the PR along those lines.
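
For discussion, the deferred-release alternative could look roughly like the sketch below. Everything in it is hypothetical: DeferredReleaseQueue and its methods do not exist in MLX, buffers are represented by plain integer ids rather than MTL::Buffer pointers, and it assumes that command buffers on a single queue complete in commit order and that every command buffer referencing a trimmed buffer has already been committed when the trim runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical sketch: parks buffers released by a cache trim until every
// command buffer committed before the trim has completed.
class DeferredReleaseQueue {
 public:
  // Call when a command buffer is committed. Returns its sequence number.
  std::uint64_t on_commit() { return ++last_committed_; }

  // Call from the completion handler of command buffer `seq`.
  // Returns the ids of buffers that are now safe to release.
  std::vector<int> on_complete(std::uint64_t seq) {
    if (seq > last_completed_) {
      last_completed_ = seq;
    }
    std::vector<int> ready;
    // pending_ is ordered by sequence number, so stop at the first entry
    // that still waits on an in-flight command buffer.
    while (!pending_.empty() && pending_.front().first <= last_completed_) {
      ready.push_back(pending_.front().second);
      pending_.pop_front();
    }
    return ready;
  }

  // Call from the cache trim instead of releasing right away. Returns true
  // if nothing is in flight and the caller may release immediately;
  // otherwise queues the buffer and returns false.
  bool release_or_defer(int buffer_id) {
    if (last_completed_ == last_committed_) {
      return true;
    }
    // Conservative tag: wait for the newest committed command buffer.
    pending_.emplace_back(last_committed_, buffer_id);
    return false;
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  std::uint64_t last_committed_ = 0;
  std::uint64_t last_completed_ = 0;
  std::deque<std::pair<std::uint64_t, int>> pending_;
};
```

The tag is deliberately conservative: it does not track which command buffers actually use a given buffer, so a release may be delayed longer than strictly necessary, in exchange for no per-buffer bookkeeping at encode time. A real implementation would also need locking, since completion handlers run on a Metal dispatch thread.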
