Use-after-free under memory pressure: buffer-cache trim frees an MTLBuffer still used by an in-flight command buffer (kIOGPUCommandBufferCallbackErrorInvalidResource) #3689

### Summary

Under sustained allocation and buffer-cache churn, the Metal allocator can free an `MTLBuffer` that an **in-flight command buffer still references**, crashing the process with `kIOGPUCommandBufferCallbackErrorInvalidResource`. Command buffers are created with `commandBufferWithUnretainedReferences()`, so Metal does not keep referenced buffers alive, and the buffer cache's trim releases them without checking whether they are still in use.

Fix proposed in #3688.

### Environment

- **MLX**: consumed via `mlx-swift` 0.31.4 → mlx core `v0.31.1` (`ce45c525`). Also reproduced/inspected against current `main` (which has refactored to a `CommandEncoder` with two unretained command-buffer creation sites).
- **Device**: MacBook Pro (`Mac15,9`), **Apple M3 Max, 48 GB** unified memory
- **OS**: macOS 26.5.1 (25F80)
- **Workload**: an on-device, OpenAI-compatible inference server ([Tesseract](https://github.com/spokvulcan/tesseract)) running a 27B 4-bit **vision** model behind a tiered RAM+SSD prefix cache that restores KV state and re-runs prefill — i.e. steady allocation and cache eviction/trim while prior command buffers are still executing.

### Symptom

```
Exception Type:  EXC_CRASH (SIGABRT)
Triggered by Thread: com.Metal.CompletionQueueDispatch
  mlx::core::gpu::check_error(...)
Termination: abort() called
```

With the Metal validation layer (`MTL_DEBUG_LAYER=1 MTL_DEBUG_LAYER_ERROR_MODE=assert`):

```
-[MTLDebugCommandBuffer preCommit]: ... failed assertion
'command buffer references deallocated object which previously existed at address 0x...'
```

The IOGPU error code is `kIOGPUCommandBufferCallbackErrorInvalidResource`.

### Root cause

1. `MetalAllocator::free()` recycles a buffer into the reuse cache (`recycle_to_cache`) as soon as its refcount hits 0 — even if a command buffer that used it is still in flight (e.g. after `async_eval`).
2. When `MetalAllocator::malloc()` finds the cache over `max_pool_size_` (or memory pressure past `gc_limit_`), it calls `buffer_cache_.release_cached_buffers(...)`, whose free callback runs the real `buf->release()` (and `residency_set_.erase(buf)`).
3. Command buffers are created with `commandBufferWithUnretainedReferences()`, so Metal does not retain the resources they reference. Buffers also use `MTL::ResourceHazardTrackingModeUntracked`, so Metal does not track their use for synchronization either. Nothing keeps the `MTLBuffer` alive, and `release()` deallocates it out from under the GPU.
4. The in-flight command buffer then fails completion with `kIOGPUCommandBufferCallbackErrorInvalidResource`.

(All in `mlx/backend/metal/allocator.cpp` and `mlx/backend/metal/device.cpp`.)
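
To make the ordering in steps 1 to 4 concrete, here is a minimal, Metal-free sketch of the same lifetime problem in plain C++. It is not MLX code: `ToyBuffer`, `ToyCache`, `ToyCommandBuffer`, and `survives_trim` are hypothetical names, and `std::shared_ptr`/`std::weak_ptr` stand in for retained and unretained references so the model can detect the dangling reference without actually touching freed memory.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for an MTLBuffer; only its lifetime matters here.
struct ToyBuffer {
  std::size_t size;
};

// Stand-in for the allocator's reuse cache: it owns every recycled buffer
// and drops that ownership when trimmed.
struct ToyCache {
  std::vector<std::shared_ptr<ToyBuffer>> cached;
  void recycle(std::shared_ptr<ToyBuffer> buf) { cached.push_back(std::move(buf)); }
  void trim() { cached.clear(); }  // models release_cached_buffers(...)
};

// Stand-in for a command buffer. With retain == true it holds a strong
// reference to each buffer (modeling commandBuffer()); otherwise it only
// observes them (modeling commandBufferWithUnretainedReferences()).
class ToyCommandBuffer {
 public:
  explicit ToyCommandBuffer(bool retain) : retain_(retain) {}

  void use(const std::shared_ptr<ToyBuffer>& buf) {
    if (retain_) strong_.push_back(buf);
    weak_.push_back(buf);
  }

  // Models the completion check: true only if every referenced buffer
  // still exists.
  bool resources_valid() const {
    for (const auto& w : weak_) {
      if (w.expired()) return false;
    }
    return true;
  }

 private:
  bool retain_;
  std::vector<std::shared_ptr<ToyBuffer>> strong_;
  std::vector<std::weak_ptr<ToyBuffer>> weak_;
};

// Replays the sequence from the list above: encode, free into the cache,
// trim the cache, then complete the command buffer.
bool survives_trim(bool retain) {
  ToyCache cache;
  ToyCommandBuffer cb(retain);
  auto buf = std::make_shared<ToyBuffer>(ToyBuffer{4096});
  cb.use(buf);                    // committed, still in flight
  cache.recycle(std::move(buf));  // step 1: caller's last reference goes away
  cache.trim();                   // step 2: a later malloc() trims the cache
  return cb.resources_valid();    // steps 3 and 4: completion
}
```

In this model `survives_trim(false)` reports an invalid resource because the cache held the only owning reference, while `survives_trim(true)` does not. The real failure is a GPU-side use-after-free rather than an expired `weak_ptr`, so treat this purely as an illustration of who owns what at each step.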

### Reproduction

The bug is a **timing-sensitive race**: malloc-driven cache trimming has to release a buffer in the window between a command buffer being committed and completing. Anything that shifts the timing or prevents the trim hides it. Running under a debugger and raising the cache limit so the trim never fires both make the crash disappear, which is itself diagnostic.

Reliable trigger in our app:

1. Send a large multi-image request to a 27B vision model through the prefix-cache server (cold request — populates + persists a KV snapshot).
2. Re-send the same request so the **warm** path restores the snapshot from SSD and re-runs prefill (sustained allocation + cache churn).
3. The process crashes within ~16 s on the warm request. With concurrency it crashes on the first round.

Repro harness (replays recorded requests against the live server and watches for the abort): [`scripts/repro-image-cache-crash.sh`](https://github.com/spokvulcan/tesseract/blob/main/scripts/repro-image-cache-crash.sh) in the linked repo.

I'm happy to put together a **minimal standalone MLX repro** (allocate under memory pressure + `async_eval` + cache trim while command buffers are in flight) if that would help — just let me know.

### Fix

#3688 — use retained references (`commandBuffer()`) at the command-buffer creation site(s) so Metal holds referenced buffers until completion; the cache trim's `release()` then only drops MLX's own reference. With the fix the previously-crashing workload runs to completion, and warm prefix-cache restores are *faster* than cold runs (confirming this is a correctness fix, not a slowdown that masks the race).

If a lower-overhead approach is preferred over global retained references (e.g. deferring release of cached buffers still referenced by in-flight command buffers), I'm glad to rework the PR along those lines.
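
As a rough illustration of that deferred-release alternative, the sketch below keeps a cached buffer out of the trim until the last command buffer that used it has completed. It is a standalone model, not a patch against MLX: `DeferredCache`, `CachedBuffer`, `recycle`, `on_completed`, and the integer command-buffer ids are all hypothetical, and it assumes command buffers complete in submission order so a single "highest completed id" is enough.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a cached MTLBuffer.
struct CachedBuffer {
  std::size_t size;
};

class DeferredCache {
 public:
  // Recycle a buffer. last_use is the id of the last command buffer that
  // referenced it (0 means it was never used on the GPU).
  void recycle(std::unique_ptr<CachedBuffer> buf, std::uint64_t last_use) {
    entries_.push_back({std::move(buf), last_use});
  }

  // Would be called from a command-buffer completion handler.
  // Assumption: ids are monotonically increasing and complete in order.
  void on_completed(std::uint64_t id) {
    if (id > completed_) completed_ = id;
  }

  // Release only buffers whose last user has completed. Buffers still
  // referenced by an in-flight command buffer stay cached for a later trim.
  // Returns the number of bytes released.
  std::size_t trim() {
    std::size_t freed = 0;
    std::vector<Entry> keep;
    for (auto& e : entries_) {
      if (e.last_use <= completed_) {
        freed += e.buf->size;  // destroyed when entries_ is replaced below
      } else {
        keep.push_back(std::move(e));
      }
    }
    entries_ = std::move(keep);
    return freed;
  }

  std::size_t cached_count() const { return entries_.size(); }

 private:
  struct Entry {
    std::unique_ptr<CachedBuffer> buf;
    std::uint64_t last_use;
  };
  std::vector<Entry> entries_;
  std::uint64_t completed_ = 0;
};
```

The trade-off in this design is that a trim under memory pressure may release less than requested while work is in flight, so a real implementation would probably need to either wait for completion or retry the trim afterwards.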
