coreai-build compile SIGSEGVs in MPSGraph anePreCompileBinary on a static-shape LLM with linear INT4 weights (palettized compiles fine)

## Summary

Compiling a **static-shape (iOS) LLM program whose weights are linear blockwise-INT4**
(`blockwise_shift_scale`) for the Neural Engine crashes `coreai-build` with a
**`SIGSEGV` inside MPSGraph's `anePreCompileBinary`**. A **structurally identical
program** with **palettized** weights (`lut_to_dense`) compiles cleanly to the
ANE. The ANE pre-compile path therefore appears unable to handle a linear-INT4 static
program, and it segfaults instead of failing gracefully.

This blocks a matched-quantization ANE-vs-GPU comparison on Qwen3-0.6B: the dynamic
(`--platform macOS`) export ships **linear INT4**, the static (`--platform iOS`) export
ships **palettized** weights, and there is currently no way to put the GPU export's
exact INT4 scheme on the ANE — the attempt crashes the compiler.

## Environment

- macOS **27.0 (26A5353q)**, Apple M4 Max (Mac16,9)
- `coreai-build`: Metal toolchain **v27.1.5194.15**, build **3600.67.5.8.1**
- `coreai-core 1.0.0b1`, `coreai-torch 0.4.0`, `coreai-opt 0.2.0`
- Target arch **h18p** (iPhone 17 Pro)
- Model: `Qwen/Qwen3-0.6B` via `coreai.llm.export`

## Reproduction

**Control (works).** Uniform 4-bit *palettized* static export → compiles to ANE:

```bash
uv run coreai.llm.export qwen3-0.6b --platform iOS \
  --compression 4bit_weight_palettized_group32 --output-name qwen3_0_6b_ios_pure4bit
xcrun coreai-build compile exports/qwen3_0_6b_ios_pure4bit/qwen3_0_6b_ios_pure4bit.aimodel \
  --platform iOS --preferred-compute neural-engine --architecture h18p --output /tmp/ok
# OK — compiled .aimodelc has 31 `*_ANE_region_*` segments, 0 non-ANE.
```

**Crash.** The same static structure with **linear INT4** weights. The CLI couples the
quantization scheme to the platform (`--platform iOS --compression 4bit` fails with
`RuntimeError: macOS quantization preset provided, but platform is iOS`), so the repro
applies linear INT4 at the MLIR level via the same `quantize_weights` primitive the
diffusion pipeline uses (`coreai_models/export/compiler.py::apply_mlir_quantization`):

```python
# repro_ane_int4_crash.py  —  run from a coreai-models checkout:  uv run python repro_ane_int4_crash.py
import asyncio, torch
from transformers import AutoConfig
from coreai_models.export.pipeline import ExportConfig
from coreai_models.export.ios import export_ios_model
from coreai_models.export.metadata import build_aimodel_metadata
from coreai_models.models.registry import get_model_entry
from coreai_opt.coreai_utils import CompressionGranularity, DType, quantize_weights
from coreai_opt.coreai_utils.common import QScheme

async def main():
    hf_id, ctx = "Qwen/Qwen3-0.6B", 4096
    cfg = AutoConfig.from_pretrained(hf_id); cfg.max_position_embeddings = ctx
    entry = get_model_entry(cfg.model_type)
    model = entry.ios_class.from_hf(hf_id, max_context_length=ctx, target_dtype=torch.float16).eval()
    ec = ExportConfig(hf_model_id=hf_id, variant="iOS", max_context_length=ctx,
                      compute_precision="float16", compression="int4_linear",
                      output_dir="exports", output_name="qwen3_0_6b_ios_int4linear")
    prog = await export_ios_model(model, cfg, ec)
    # linear symmetric INT4, per-block 32 — same scheme as the macOS `4bit` preset
    prog = quantize_weights(prog, dtype=DType.INT4, qscheme=QScheme.SYMMETRIC,
                            granularity=CompressionGranularity.PER_BLOCK, block_size=32,
                            weight_num_threshold=32768, in_place=True)
    prog.optimize()
    out = "exports/qwen3_0_6b_ios_int4linear/qwen3_0_6b_ios_int4linear.aimodel"
    prog.save_asset(out, build_aimodel_metadata(hf_id))
    print("saved", out)

asyncio.run(main())
```

```bash
xcrun coreai-build compile exports/qwen3_0_6b_ios_int4linear/qwen3_0_6b_ios_int4linear.aimodel \
  --platform iOS --preferred-compute neural-engine --architecture h18p --output /tmp/crash
```

The produced `.aimodel` is valid (`coreai-build inspect` shows the same 34 static-shape
functions as the palettized control — `extend_{256..4096}_{8,16,64}`, `prompt_opt_*`,
`gather_embeddings_*`, `load_embeddings` — with weight op `blockwise_shift_scale`, dtype
`Int4`). Only the AOT compile crashes.
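
For triage, it can help to record exactly how the compile terminated. The sketch below is a hypothetical wrapper (`describe_exit` and `run_compile` are illustrative names, not part of any coreai tooling) that runs the same command as above and decodes the status. It relies on two conventions: Python's `subprocess` reports death by signal as a negative return code, while a shell reports it as 128 plus the signal number, which is where exit 139 for `SIGSEGV` (signal 11) comes from.

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Translate a process return code into a readable status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess convention: -N means "killed by signal N"
        signum = -returncode
    elif returncode > 128:
        # shell convention: 128 + N means "killed by signal N"
        signum = returncode - 128
    else:
        return f"exit {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number, so report the raw code instead
        return f"exit {returncode}"
    return f"killed by {name} (shell exit {128 + signum})"


def run_compile(aimodel: str, output: str) -> str:
    """Run the ANE compile from this report and describe how it ended."""
    cmd = [
        "xcrun", "coreai-build", "compile", aimodel,
        "--platform", "iOS",
        "--preferred-compute", "neural-engine",
        "--architecture", "h18p",
        "--output", output,
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    status = describe_exit(proc.returncode)
    # The crash described here prints nothing, so empty streams are expected
    return f"{status}; stdout={len(proc.stdout)}B stderr={len(proc.stderr)}B"
```

Calling `run_compile` with the linear-INT4 `.aimodel` path and `/tmp/crash` would exercise the failing case; `describe_exit` handles both the negative-code and the 128-plus-signal forms, since `xcrun` may surface either.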

## Expected

Compile to the ANE, **or** fail with a diagnostic (e.g. "INT4 linear weights are not
supported on the ANE; use palettization"). A segfault is never acceptable.

## Actual

`coreai-build` runs for ~5 min at 100% CPU, then terminates with **`SIGSEGV` (exit 139)**,
no stdout/stderr diagnostic, and **no `.aimodelc`**. Reproduced in 2 of 2 runs. Crash report
(`~/Library/Logs/DiagnosticReports/coreai-build-*.ips`):

```
Exception:  EXC_BAD_ACCESS (SIGSEGV) — KERN_INVALID_ADDRESS
Crashing thread: MPSGraphExecutable_queue
  0  libobjc.A.dylib                    objc_release
  1  MetalPerformanceShadersGraph_host  GPU::anePreCompileBinary(MPSGraphExecutable*, llvm::SmallVectorImpl<mlir::…>)
  2  MetalPerformanceShadersGraph_host  BaseModuleRef::compileAndLoadANE()
  3  MetalPerformanceShadersGraph_host  -[MPSGraphExecutable specializedModuleWithDevice:shapedEntryPoints:compilationDescriptor:…]
  4  MetalPerformanceShadersGraph_host  -[MPSGraphExecutable specializedModuleWithDevice:shapedEntryPoints:compilationDescriptor:…]
  5  MetalPerformanceShadersGraph_host  __89-[MPSGraphExecutable specializeWithDevice:shapedEntryPoints:compilationDescriptor:…]_block_invoke
  6  libdispatch.dylib                  _dispatch_call_block_and_release
```
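
To pull the same summary out of a report without opening Console, a small parser along these lines may be enough. It is a sketch that assumes the usual layout of recent macOS `.ips` files: a one-line JSON header followed by a JSON body with `exception`, `threads` (the crashing one marked `triggered`), and `usedImages`. Those key names are an assumption about the report format rather than something verified against these specific reports, and `crashing_frames` is a hypothetical helper name.

```python
import json


def crashing_frames(ips_text: str, limit: int = 8) -> dict:
    """Summarize the crashing thread of an .ips crash report.

    Assumes line 1 is a JSON header and the remainder is a JSON body.
    """
    header_line, _, body = ips_text.partition("\n")
    header = json.loads(header_line)
    report = json.loads(body)

    images = report.get("usedImages", [])
    # The crashing thread is expected to carry "triggered": true
    thread = next(
        (t for t in report.get("threads", []) if t.get("triggered")), None
    )

    frames = []
    if thread is not None:
        for frame in thread.get("frames", [])[:limit]:
            idx = frame.get("imageIndex", -1)
            # Map the frame's image index back to a binary name
            if 0 <= idx < len(images):
                image = images[idx].get("name", "?")
            else:
                image = "?"
            frames.append((image, frame.get("symbol", "?")))

    return {
        "process": header.get("app_name"),
        "signal": report.get("exception", {}).get("signal"),
        "queue": thread.get("queue") if thread is not None else None,
        "frames": frames,
    }
```

Feeding it the text of the newest `coreai-build-*.ips` file under `~/Library/Logs/DiagnosticReports/` should return the signal, the queue name, and the top frames as `(image, symbol)` pairs, provided the layout assumption holds.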

## Notes

- The control (palettized) and crash (linear-INT4) `.aimodel`s differ **only** in the
  weight encoding (`lut_to_dense` vs `blockwise_shift_scale`); structure, shapes, and the
  fp16 embedding front-end are identical. So the trigger appears to be specifically the
  linear blockwise-INT4 weight form on the ANE pre-compile path.
- The dynamic (GPU) `--platform macOS` export lowers to the **same** `blockwise_shift_scale`
  form and compiles/runs fine on the GPU MPSGraph path — only the **ANE** pre-compiler
  crashes on it.
- `--preferred-compute neural-engine` on the *dynamic* export is a no-op (still a GPU
  MPSGraph delegate, 0 ANE regions), so recompiling the existing GPU INT4 export onto the
  ANE is not an alternative.
- Likely related but distinct: #5 (runtime `slice_update` lowering crash on the static path).
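
For readers unfamiliar with the two weight encodings contrasted above, here is a NumPy sketch of the general idea. It is illustrative only and does not reproduce the actual `blockwise_shift_scale` or `lut_to_dense` op semantics: it assumes symmetric INT4 with one scale per block of 32 input elements (matching the `block_size=32` in the repro) and, for the palettized side, a single shared 16-entry lookup table rather than the per-group tables a real export would use. All function names are hypothetical.

```python
import numpy as np

BLOCK = 32  # block size used by the repro's quantize_weights call


def quantize_int4_symmetric(w: np.ndarray, block: int = BLOCK):
    """Linear symmetric INT4: one scale per block, codes in [-8, 7]."""
    rows, cols = w.shape
    blocks = w.reshape(rows, cols // block, block)
    # Map the largest magnitude in each block to code 7
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize_linear(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dense weight = code * scale (no zero point in the symmetric case)."""
    return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)


def dequantize_lut(indices: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Palettized: each 4-bit index selects one of 16 table entries."""
    return lut[indices]
```

Both forms store 4 bits per weight and expand to a dense tensor at compile or run time; the difference is whether the expansion is an arithmetic rescale or a table lookup, which is the only thing that varies between the control and the crashing program.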

Happy to attach the full `.ips` crash reports.

