Skip to content

Health check reports tiers 'healthy' while the vLLM engine loop crashes on every request #52

Description

@weklund

Summary

mlx-stack up/status report a tier as healthy even when its engine loop is erroring on every inference request. The health probe appears to check only /v1/models (which responds even when the generation engine is dead), so a fully-broken server shows green.

How I hit it

With continuous_batching enabled (see #51), the draft tier logged Engine loop error: ArraysCache... on every request and all completions hung — yet:

draft   qwen3.5-9b                  8000  healthy   3m 46s
judge   qwen3.5-27b-opus-distilled  8001  healthy   3m 38s
litellm proxy                       4000  healthy   3m 31s

A user sees all-green and has no signal that inference is 100% broken.

Suggested fix

Health check should perform a minimal generation probe (e.g. max_tokens: 1) and require a valid completion before marking a tier healthy — not just a /v1/models (or port-open) check. This would also let up fail fast on issues like #51.

Severity

High — it silently masks complete inference failure, which is the hardest class of problem for a user to diagnose.

Environment

  • mlx-stack 0.3.8
  • vllm-mlx v0.2.6
  • mlx 0.31.1
  • macOS 26.2 (arm64), Apple M4 Pro, 64 GB

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions