coreai-build compile SIGSEGVs in MPSGraph anePreCompileBinary on a static-shape LLM with linear INT4 weights (palettized compiles fine) #55

Description

@john-rocky

Summary

Compiling a static-shape (iOS) LLM program whose weights are linear blockwise-INT4
(blockwise_shift_scale) for the Neural Engine crashes coreai-build with a
SIGSEGV inside MPSGraph's anePreCompileBinary. The byte-for-byte-identical
program structure with palettized weights (lut_to_dense) compiles cleanly to the
ANE. So the ANE pre-compiler cannot legalize a linear-INT4 static program and
segfaults instead of failing gracefully.

This blocks a matched-quantization ANE-vs-GPU comparison on Qwen3-0.6B: the dynamic
(--platform macOS) export ships linear INT4, the static (--platform iOS) export
ships palettized weights, and there is currently no way to put the GPU export's
exact INT4 scheme on the ANE — the attempt crashes the compiler.

Environment

  • macOS 27.0 (26A5353q), Apple M4 Max (Mac16,9)
  • coreai-build: Metal toolchain v27.1.5194.15, build 3600.67.5.8.1
  • coreai-core 1.0.0b1, coreai-torch 0.4.0, coreai-opt 0.2.0
  • Target arch h18p (iPhone 17 Pro)
  • Model: Qwen/Qwen3-0.6B via coreai.llm.export

Reproduction

Control (works). Uniform 4-bit palettized static export → compiles to ANE:

uv run coreai.llm.export qwen3-0.6b --platform iOS \
  --compression 4bit_weight_palettized_group32 --output-name qwen3_0_6b_ios_pure4bit
xcrun coreai-build compile exports/qwen3_0_6b_ios_pure4bit/qwen3_0_6b_ios_pure4bit.aimodel \
  --platform iOS --preferred-compute neural-engine --architecture h18p --output /tmp/ok
# OK — compiled .aimodelc has 31 `*_ANE_region_*` segments, 0 non-ANE.

Crash. The same static structure with linear INT4 weights. The CLI couples the
quant scheme to the platform (--platform iOS --compression 4bit fails with
RuntimeError: macOS quantization preset provided, but platform is iOS), so the linear
INT4 is applied at the MLIR level via the same quantize_weights primitive the diffusion
pipeline uses (coreai_models/export/compiler.py::apply_mlir_quantization):

# repro_ane_int4_crash.py  —  run from a coreai-models checkout:  uv run python repro_ane_int4_crash.py
import asyncio, torch
from transformers import AutoConfig
from coreai_models.export.pipeline import ExportConfig
from coreai_models.export.ios import export_ios_model
from coreai_models.export.metadata import build_aimodel_metadata
from coreai_models.models.registry import get_model_entry
from coreai_opt.coreai_utils import CompressionGranularity, DType, quantize_weights
from coreai_opt.coreai_utils.common import QScheme

async def main():
    hf_id, ctx = "Qwen/Qwen3-0.6B", 4096
    cfg = AutoConfig.from_pretrained(hf_id); cfg.max_position_embeddings = ctx
    entry = get_model_entry(cfg.model_type)
    model = entry.ios_class.from_hf(hf_id, max_context_length=ctx, target_dtype=torch.float16).eval()
    ec = ExportConfig(hf_model_id=hf_id, variant="iOS", max_context_length=ctx,
                      compute_precision="float16", compression="int4_linear",
                      output_dir="exports", output_name="qwen3_0_6b_ios_int4linear")
    prog = await export_ios_model(model, cfg, ec)
    # linear symmetric INT4, per-block 32 — same scheme as the macOS `4bit` preset
    prog = quantize_weights(prog, dtype=DType.INT4, qscheme=QScheme.SYMMETRIC,
                            granularity=CompressionGranularity.PER_BLOCK, block_size=32,
                            weight_num_threshold=32768, in_place=True)
    prog.optimize()
    out = "exports/qwen3_0_6b_ios_int4linear/qwen3_0_6b_ios_int4linear.aimodel"
    prog.save_asset(out, build_aimodel_metadata(hf_id))
    print("saved", out)

asyncio.run(main())

Then compile the saved asset for the ANE:

xcrun coreai-build compile exports/qwen3_0_6b_ios_int4linear/qwen3_0_6b_ios_int4linear.aimodel \
  --platform iOS --preferred-compute neural-engine --architecture h18p --output /tmp/crash

The produced .aimodel is valid (coreai-build inspect shows the same 34 static-shape
functions as the palettized control — extend_{256..4096}_{8,16,64}, prompt_opt_*,
gather_embeddings_*, load_embeddings — with weight op blockwise_shift_scale, dtype
Int4). Only the AOT compile crashes.
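
When scripting the two compiles side by side, it helps to tell "the compiler returned an error" apart from "the compiler was killed by a signal". The following is a minimal sketch, not part of the original repro; `describe_exit` and `run_and_report` are hypothetical helper names. It relies only on standard behavior: Python's `subprocess` reports death by signal as a negative return code, while a shell reports it as 128 plus the signal number (139 for SIGSEGV).

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Translate a process return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    if returncode > 128:
        # Shell convention: 128 + signal number (139 == 128 + SIGSEGV).
        try:
            name = signal.Signals(returncode - 128).name
            return f"killed by {name} (shell-style code)"
        except ValueError:
            pass
    return f"failed with exit code {returncode}"


def run_and_report(cmd: list[str]) -> str:
    """Run `cmd`, capture its output, and summarize how it exited."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return describe_exit(proc.returncode)


# Example (paths taken from the repro above):
# run_and_report([
#     "xcrun", "coreai-build", "compile",
#     "exports/qwen3_0_6b_ios_int4linear/qwen3_0_6b_ios_int4linear.aimodel",
#     "--platform", "iOS", "--preferred-compute", "neural-engine",
#     "--architecture", "h18p", "--output", "/tmp/crash",
# ])
```

Running the same helper on the control command and the crash command gives a one-line status for each, which is convenient when re-testing on later toolchain builds.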

Expected

Compile to the ANE, or fail with a diagnostic (e.g. "INT4 linear weights are not
supported on the ANE; use palettization"). A segfault is never acceptable.

Actual

coreai-build runs for ~5 min at 100% CPU, then terminates with SIGSEGV (exit 139),
printing no stdout/stderr diagnostic and producing no .aimodelc. Reproduced in 2 of
2 runs. Crash report (~/Library/Logs/DiagnosticReports/coreai-build-*.ips):

Exception:  EXC_BAD_ACCESS (SIGSEGV) — KERN_INVALID_ADDRESS
Crashing thread: MPSGraphExecutable_queue
  0  libobjc.A.dylib                    objc_release
  1  MetalPerformanceShadersGraph_host  GPU::anePreCompileBinary(MPSGraphExecutable*, llvm::SmallVectorImpl<mlir::…>)
  2  MetalPerformanceShadersGraph_host  BaseModuleRef::compileAndLoadANE()
  3  MetalPerformanceShadersGraph_host  -[MPSGraphExecutable specializedModuleWithDevice:shapedEntryPoints:compilationDescriptor:…]
  4  MetalPerformanceShadersGraph_host  -[MPSGraphExecutable specializedModuleWithDevice:shapedEntryPoints:compilationDescriptor:…]
  5  MetalPerformanceShadersGraph_host  __89-[MPSGraphExecutable specializeWithDevice:shapedEntryPoints:compilationDescriptor:…]_block_invoke
  6  libdispatch.dylib                  _dispatch_call_block_and_release
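
For anyone comparing crash reports across runs, the top frames of the crashed thread can be pulled out of an .ips file with a few lines of Python. This is a sketch under assumptions: it assumes the JSON-style .ips layout (a one-line JSON header followed by a JSON body) and the field names `threads`, `triggered`, `faultingThread`, `frames`, `imageIndex`, `symbol`, and `usedImages`, which have not been checked against this specific toolchain's reports. `crashing_frames` is a hypothetical helper name.

```python
import json


def crashing_frames(ips_text: str, limit: int = 8) -> list[str]:
    """Return 'image  symbol' lines for the crashed thread of a JSON-style .ips report."""
    # Assumed layout: first line is a one-line JSON header, the rest is the JSON body.
    header_line, _, body_text = ips_text.partition("\n")
    json.loads(header_line)  # validated only; the header is not otherwise used
    body = json.loads(body_text)

    images = body.get("usedImages", [])
    threads = body.get("threads", [])

    # Prefer the thread flagged as triggered; otherwise fall back to faultingThread.
    crashed = next((t for t in threads if t.get("triggered")), None)
    if crashed is None:
        index = body.get("faultingThread", 0)
        if 0 <= index < len(threads):
            crashed = threads[index]
    if crashed is None:
        return []

    lines = []
    for frame in crashed.get("frames", [])[:limit]:
        idx = frame.get("imageIndex", -1)
        image = images[idx].get("name", "?") if 0 <= idx < len(images) else "?"
        lines.append(f"{image}  {frame.get('symbol', '<unsymbolicated>')}")
    return lines
```

Reading a report with `open(path).read()` and passing the text to `crashing_frames` should yield lines in the same "image, symbol" shape as the trace above, provided the layout assumptions hold.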

Notes

  • The control (palettized) and crash (linear-INT4) .aimodels differ only in the
    weight encoding (lut_to_dense vs blockwise_shift_scale); structure, shapes, and the
    fp16 embedding front-end are identical. So the trigger is specifically the linear
    blockwise-INT4 weight form on the ANE pre-compile path.
  • The dynamic (GPU) --platform macOS export lowers to the same blockwise_shift_scale
    form and compiles/runs fine on the GPU MPSGraph path — only the ANE pre-compiler
    crashes on it.
  • --preferred-compute neural-engine on the dynamic export is a no-op (still a GPU
    MPSGraph delegate, 0 ANE regions), so recompiling the existing GPU INT4 export onto the
    ANE is not an alternative.
  • Likely related but distinct: #5, "Official iOS static-shape decode path crashes at runtime on the macOS 27 / iOS 27 beta — MPSGraph can't lower the data-indexed KV-cache slice_update" (a runtime slice_update lowering crash on the static path).
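
To make the difference between the two weight encodings concrete, here is a small NumPy sketch of what each form stores and how it is dequantized. It is illustrative only and rests on assumptions: that blockwise_shift_scale means a per-block `(q - shift) * scale` (with zero shift in the symmetric case) and that lut_to_dense means a per-group 16-entry table lookup. The actual coreai op semantics and the palettization algorithm (the table here is a uniform grid for brevity, whereas real palettization commonly uses clustering) may differ. All function names are hypothetical.

```python
import numpy as np

BLOCK = 32  # matches group32 / block_size=32 in the exports above


def quantize_linear_int4(w: np.ndarray, block: int = BLOCK):
    """Symmetric per-block INT4: one scale per `block` consecutive weights, zero shift."""
    blocks = w.reshape(-1, block)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # avoid dividing by zero
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize_linear_int4(q, scale, shape):
    # Symmetric scheme: the shift is zero, so dequantization is q * scale.
    return (q.astype(np.float32) * scale).reshape(shape).astype(np.float32)


def quantize_palettized_4bit(w: np.ndarray, block: int = BLOCK):
    """Per-group 16-entry lookup table (uniform grid here, a simplification)."""
    groups = w.reshape(-1, block)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    lut = lo + (hi - lo) * np.linspace(0.0, 1.0, 16)[None, :]  # (n_groups, 16)
    # Index of the nearest table entry for every weight in every group.
    idx = np.abs(groups[:, :, None] - lut[:, None, :]).argmin(axis=2).astype(np.uint8)
    return idx, lut


def dequantize_palettized(idx, lut, shape):
    # Dense weights are recovered by looking each 4-bit index up in its group's table.
    dense = np.take_along_axis(lut, idx.astype(np.int64), axis=1)
    return dense.reshape(shape).astype(np.float32)
```

Both forms store 4 bits per weight plus per-block side data; the linear form needs only arithmetic at dequantization time, while the palettized form needs a table lookup. That is the only difference between the control and crash assets described above.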

Happy to attach the full .ips crash reports.
