Summary
Under sustained allocation + buffer-cache churn, the Metal allocator can free an MTLBuffer that an in-flight command buffer still references, crashing the process with kIOGPUCommandBufferCallbackErrorInvalidResource. Command buffers are created with commandBufferWithUnretainedReferences(), so Metal does not keep referenced buffers alive — and the buffer cache's trim releases them anyway.
Fix proposed in #3688.
Environment
- MLX: consumed via
mlx-swift 0.31.4 → mlx core v0.31.1 (ce45c525). Also reproduced/inspected against current main (which has refactored to a CommandEncoder with two unretained command-buffer creation sites).
- Device: MacBook Pro (
Mac15,9), Apple M3 Max, 48 GB unified memory
- OS: macOS 26.5.1 (25F80)
- Workload: an on-device, OpenAI-compatible inference server (Tesseract) running a 27B 4-bit vision model behind a tiered RAM+SSD prefix cache that restores KV state and re-runs prefill — i.e. steady allocation and cache eviction/trim while prior command buffers are still executing.
Symptom
Exception Type: EXC_CRASH (SIGABRT)
Triggered by Thread: com.Metal.CompletionQueueDispatch
mlx::core::gpu::check_error(...)
Termination: abort() called
With the Metal validation layer (MTL_DEBUG_LAYER=1 MTL_DEBUG_LAYER_ERROR_MODE=assert):
-[MTLDebugCommandBuffer preCommit]: ... failed assertion
'command buffer references deallocated object which previously existed at address 0x...'
The IOGPU error code is kIOGPUCommandBufferCallbackErrorInvalidResource.
Root cause
MetalAllocator::free() recycles a buffer into the reuse cache (recycle_to_cache) as soon as its refcount hits 0 — even if a command buffer that used it is still in flight (e.g. after async_eval).
- When
MetalAllocator::malloc() finds the cache over max_pool_size_ (or memory pressure past gc_limit_), it calls buffer_cache_.release_cached_buffers(...), whose free callback runs the real buf->release() (and residency_set_.erase(buf)).
- Buffers use
MTL::ResourceHazardTrackingModeUntracked, and command buffers are created with commandBufferWithUnretainedReferences(), so nothing keeps the MTLBuffer alive — release() deallocates the allocation out from under the GPU.
- The in-flight command buffer then fails completion with
kIOGPUCommandBufferCallbackErrorInvalidResource.
(All in mlx/backend/metal/allocator.cpp and mlx/backend/metal/device.cpp.)
Reproduction
The bug is a timing-sensitive race: it needs malloc-driven cache trimming to release a buffer in the window between a command buffer being committed and completing. Anything that slows that window down hides it — running under a debugger, or raising the cache limit so the trim never fires (both make it disappear, which is itself diagnostic).
Reliable trigger in our app:
- Send a large multi-image request to a 27B vision model through the prefix-cache server (cold request — populates + persists a KV snapshot).
- Re-send the same request so the warm path restores the snapshot from SSD and re-runs prefill (sustained allocation + cache churn).
- Crashes within ~16 s on the warm request. Concurrency makes it crash on the first round.
Repro harness (replays recorded requests against the live server and watches for the abort): scripts/repro-image-cache-crash.sh in the linked repo.
I'm happy to put together a minimal standalone MLX repro (allocate under memory pressure + async_eval + cache trim while command buffers are in flight) if that would help — just let me know.
Fix
#3688 — use retained references (commandBuffer()) at the command-buffer creation site(s) so Metal holds referenced buffers until completion; the cache trim's release() then only drops MLX's own reference. With the fix the previously-crashing workload runs to completion, and warm prefix-cache restores are faster than cold runs (confirming this is a correctness fix, not a slowdown that masks the race).
If a lower-overhead approach is preferred over global retained references (e.g. deferring release of cached buffers still referenced by in-flight command buffers), I'm glad to rework the PR along those lines.
Summary
Under sustained allocation + buffer-cache churn, the Metal allocator can free an
MTLBufferthat an in-flight command buffer still references, crashing the process withkIOGPUCommandBufferCallbackErrorInvalidResource. Command buffers are created withcommandBufferWithUnretainedReferences(), so Metal does not keep referenced buffers alive — and the buffer cache's trim releases them anyway.Fix proposed in #3688.
Environment
mlx-swift0.31.4 → mlx corev0.31.1(ce45c525). Also reproduced/inspected against currentmain(which has refactored to aCommandEncoderwith two unretained command-buffer creation sites).Mac15,9), Apple M3 Max, 48 GB unified memorySymptom
With the Metal validation layer (
MTL_DEBUG_LAYER=1 MTL_DEBUG_LAYER_ERROR_MODE=assert):The IOGPU error code is
kIOGPUCommandBufferCallbackErrorInvalidResource.Root cause
MetalAllocator::free()recycles a buffer into the reuse cache (recycle_to_cache) as soon as its refcount hits 0 — even if a command buffer that used it is still in flight (e.g. afterasync_eval).MetalAllocator::malloc()finds the cache overmax_pool_size_(or memory pressure pastgc_limit_), it callsbuffer_cache_.release_cached_buffers(...), whose free callback runs the realbuf->release()(andresidency_set_.erase(buf)).MTL::ResourceHazardTrackingModeUntracked, and command buffers are created withcommandBufferWithUnretainedReferences(), so nothing keeps theMTLBufferalive —release()deallocates the allocation out from under the GPU.kIOGPUCommandBufferCallbackErrorInvalidResource.(All in
mlx/backend/metal/allocator.cppandmlx/backend/metal/device.cpp.)Reproduction
The bug is a timing-sensitive race: it needs malloc-driven cache trimming to release a buffer in the window between a command buffer being committed and completing. Anything that slows that window down hides it — running under a debugger, or raising the cache limit so the trim never fires (both make it disappear, which is itself diagnostic).
Reliable trigger in our app:
Repro harness (replays recorded requests against the live server and watches for the abort):
scripts/repro-image-cache-crash.shin the linked repo.I'm happy to put together a minimal standalone MLX repro (allocate under memory pressure +
async_eval+ cache trim while command buffers are in flight) if that would help — just let me know.Fix
#3688 — use retained references (
commandBuffer()) at the command-buffer creation site(s) so Metal holds referenced buffers until completion; the cache trim'srelease()then only drops MLX's own reference. With the fix the previously-crashing workload runs to completion, and warm prefix-cache restores are faster than cold runs (confirming this is a correctness fix, not a slowdown that masks the race).If a lower-overhead approach is preferred over global retained references (e.g. deferring release of cached buffers still referenced by in-flight command buffers), I'm glad to rework the PR along those lines.