Summary
mlx-stack up/status report a tier as healthy even when its engine loop is erroring on every inference request. The health probe appears to check only /v1/models (which responds even when the generation engine is dead), so a fully-broken server shows green.
How I hit it
With continuous_batching enabled (see #51), the draft tier logged Engine loop error: ArraysCache... on every request and all completions hung — yet:
draft qwen3.5-9b 8000 healthy 3m 46s
judge qwen3.5-27b-opus-distilled 8001 healthy 3m 38s
litellm proxy 4000 healthy 3m 31s
A user sees all-green and has no signal that inference is 100% broken.
Suggested fix
Health check should perform a minimal generation probe (e.g. max_tokens: 1) and require a valid completion before marking a tier healthy — not just a /v1/models (or port-open) check. This would also let up fail fast on issues like #51.
Severity
High — it silently masks complete inference failure, which is the hardest class of problem for a user to diagnose.
Environment
- mlx-stack 0.3.8
- vllm-mlx v0.2.6
- mlx 0.31.1
- macOS 26.2 (arm64), Apple M4 Pro, 64 GB
Summary
mlx-stack up/statusreport a tier ashealthyeven when its engine loop is erroring on every inference request. The health probe appears to check only/v1/models(which responds even when the generation engine is dead), so a fully-broken server shows green.How I hit it
With
continuous_batchingenabled (see #51), the draft tier loggedEngine loop error: ArraysCache...on every request and all completions hung — yet:A user sees all-green and has no signal that inference is 100% broken.
Suggested fix
Health check should perform a minimal generation probe (e.g.
max_tokens: 1) and require a valid completion before marking a tier healthy — not just a/v1/models(or port-open) check. This would also letupfail fast on issues like #51.Severity
High — it silently masks complete inference failure, which is the hardest class of problem for a user to diagnose.
Environment