Apple Silicon port: simde + Accelerate (scope A — local build) #3

Closed
christhechris wants to merge 10 commits into main from claude/wonderful-neumann-091991

Summary

Builds digHolo natively on Apple Silicon (arm64 macOS) using simde to translate AVX2/FMA3 intrinsics to NEON, and Apple Accelerate as a drop-in BLAS/LAPACK replacement for Intel MKL. Smoke tests pass on an M-series Mac. Linux/Windows x86-64 builds are byte-unchanged (all new code paths are gated behind DIGHOLO_IS_APPLE_SILICON / DIGHOLO_USE_ACCELERATE / DIGHOLO_USE_SIMDE / __aarch64__).

Scope A only: local dev build. No CI runner, no release artefacts, no macOS Python wheels — those are natural follow-ups.

Usage

brew install fftw
cmake --preset macos-release
cmake --build --preset macos-release
ctest --preset macos-release

Requires macOS 13.3+ (for Accelerate's modern LAPACK interface).

What lands

  • New: src/digholo_simd_compat.h — SIMD include shim. Routes to <immintrin.h> on x86; routes to <simde/x86/avx2.h> + <simde/x86/fma.h> with SIMDE_ENABLE_NATIVE_ALIASES on arm64. Also polyfills _mm_malloc / _mm_free via posix_memalign and aliases SSE rounding-mode constants.
  • New: tests/test_reference.cpp + tests/reference/README.md — numerical regression test scaffolding, gated on committed reference data (to be generated on Linux/x86 in a follow-up).
  • Modified: CMakeLists.txt — widened arch guard for arm64, find_library(Accelerate), FetchContent pulling simde pinned to v0.8.2, Apple Silicon branch for link libraries (FFTW3 + FFTW3 threads + Accelerate + simde) and compile defs. Linux/Windows paths untouched.
  • Modified: CMakePresets.json, adding macos-release / macos-debug presets (Ninja, arm64, deployment target 13.3, Darwin-host-only).
  • Modified: src/digHolo.cpp — three narrow edits: swap <immintrin.h> for the shim, add DIGHOLO_USE_ACCELERATE BLAS include branch, stub the CPUID query on arm64.
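
As a rough illustration of the include routing described for src/digholo_simd_compat.h, a shim along the following lines would behave as stated. This is a sketch under assumptions, not the file from the PR: DIGHOLO_SIMD_BACKEND and simd_backend_name() are hypothetical names added only so the selected branch can be observed, and only two of the rounding-mode aliases are shown.

```cpp
#include <cassert>
#include <cstring>

#if defined(DIGHOLO_USE_SIMDE)
  // arm64 path: simde re-implements the x86 intrinsics on top of NEON.
  // The alias macro must be visible before the first simde header.
  #ifndef SIMDE_ENABLE_NATIVE_ALIASES
    #define SIMDE_ENABLE_NATIVE_ALIASES
  #endif
  #include <simde/x86/avx2.h>
  #include <simde/x86/fma.h>
  // Rounding-mode constants the PR reports as not aliased automatically.
  #ifndef _MM_FROUND_NO_EXC
    #define _MM_FROUND_NO_EXC SIMDE_MM_FROUND_NO_EXC
  #endif
  #ifndef _MM_ROUND_NEAREST
    #define _MM_ROUND_NEAREST SIMDE_MM_ROUND_NEAREST
  #endif
  #define DIGHOLO_SIMD_BACKEND "simde"      // hypothetical marker
#elif defined(__x86_64__) || defined(_M_X64)
  // x86-64 path: unchanged behaviour, native intrinsics header.
  #include <immintrin.h>
  #define DIGHOLO_SIMD_BACKEND "immintrin"  // hypothetical marker
#else
  // Any other target: no intrinsic header is pulled in by this sketch.
  #define DIGHOLO_SIMD_BACKEND "none"       // hypothetical marker
#endif

// Hypothetical helper: reports which include branch was compiled.
inline const char* simd_backend_name() { return DIGHOLO_SIMD_BACKEND; }
```

Because the x86-64 branch reduces to the original include, a build without DIGHOLO_USE_SIMDE should see no difference from including <immintrin.h> directly.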

Gotchas discovered during implementation

Each of these cost a build iteration, so they are worth noting to help the next person porting find them faster:

  1. __CLPK_complex is gone under ACCELERATE_NEW_LAPACK. The modern LAPACK interface exposes __LAPACK_float_complex instead (which in C++ is std::complex<float>, layout-compatible with digHolo's complex64).
  2. <Accelerate/Accelerate.h> transitively pulls in Carbon / fp.h, which declares extern const double_t pi and so clashes with digHolo's own const float pi at line 109. Fix: use the narrow includes <vecLib/cblas_new.h> + <vecLib/lapack.h>, which skip Carbon entirely.
  3. Accelerate's CBLAS uses the historical CBLAS_ORDER enum name; digHolo uses the newer CBLAS_LAYOUT. Added a typedef in the Accelerate branch.
  4. Accelerate's cblas_cgemm / cblas_cgemv under modern LAPACK are strongly typed (const __LAPACK_float_complex*) whereas MKL's are const void*. Added BLAS_COMPLEXTYPE casts at the three CBLAS call sites — no-ops on the MKL path.
  5. simde doesn't cover _mm_malloc / _mm_free (those are allocator macros from <xmmintrin.h>, not SIMD intrinsics). Polyfilled in the shim via posix_memalign.
  6. simde doesn't auto-alias the SSE rounding-mode constants (_MM_FROUND_NO_EXC, _MM_ROUND_NEAREST, etc) under SIMDE_ENABLE_NATIVE_ALIASES. Added #defines in the shim mapping back to the SIMDE_MM_* prefixed versions.
  7. libfftw3f_threads is a separate dylib on Homebrew macOS; the Linux CI build (FFTW from source without --enable-threads) silently inlines no-op thread stubs into libfftw3f so no explicit threads link is needed there. Explicit find_library + link added on the Apple Silicon branch only.
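
Gotcha 5 can be sketched as follows. The function names are hypothetical; per the PR, the real shim exposes the polyfill under the _mm_malloc / _mm_free names, which this sketch avoids so that it also compiles on x86 where those names already exist.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX) and free

// Hypothetical stand-in for _mm_malloc: returns nullptr on failure.
inline void* digholo_mm_malloc(std::size_t size, std::size_t alignment) {
    // posix_memalign needs a power-of-two alignment that is also a
    // multiple of sizeof(void*), so small alignments are rounded up.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (alignment < sizeof(void*)) alignment = sizeof(void*);
    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) return nullptr;
    return ptr;
}

// Hypothetical stand-in for _mm_free: posix_memalign memory is released with free().
inline void digholo_mm_free(void* ptr) { free(ptr); }

// Small helper used to check the result.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

A 32-byte alignment matches the width of an AVX2 register, which is the usual request from code written against _mm_malloc.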

Test plan

  • Local build on arm64 macOS (M-series, macOS 15+)
  • ctest --preset macos-release — smoke test passes
  • Reviewer sanity-checks that Linux/Windows x86 diffs are unchanged (all new code is behind guards — git diff isolates it clearly)
  • Reference-output regression data generated + committed in a follow-up PR

Not in scope

  • macOS CI runner
  • Release artefacts (digholo-macos-arm64 zip)
  • macOS Python wheels (cibuildwheel)
  • x86_64 macOS support (not worth the MKL-redistribution pain)

🤖 Generated with Claude Code

christhechris and others added 10 commits April 19, 2026 18:42
Keep Claude Code workspace artefacts and superpowers specs local-only;
they should not flow into the upstream repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the direct <immintrin.h> include in src/digHolo.cpp with a
conditional shim. When DIGHOLO_USE_SIMDE is defined (Apple Silicon
branch, wired up in a later commit), the shim pulls in simde's AVX2/FMA
headers with SIMDE_ENABLE_NATIVE_ALIASES so the ~912 intrinsic call
sites compile unchanged. On x86-64 the shim falls through to
<immintrin.h>, so this commit is a no-op on Linux and Windows.

The macOS presets use Ninja + arm64 + CMAKE_OSX_DEPLOYMENT_TARGET=13.3 (required for the
Accelerate modern-LAPACK interface used by the simde + Accelerate
port landing in follow-up commits). Preset currently short-circuits
at the existing APPLE fatal-error guard — that is removed in a
subsequent commit.

Accept arm64/aarch64 as a valid processor when APPLE is set. Narrow
the Apple fatal-error to fire only on x86_64 macOS, where MKL truly
isn't available. Linux and Windows x86_64 configures are unchanged.
The macOS preset now progresses to find_package(MKL), which is
replaced with Accelerate in the next commit.

Adds a DIGHOLO_USE_ACCELERATE include branch in src/digHolo.cpp that
pulls in <Accelerate/Accelerate.h> with the modern LAPACK interface
(ACCELERATE_NEW_LAPACK, set by CMake; minimum macOS 13.3). The
cgesvd / sgels / cblas_cgemv / cblas_cgemm call sites are unchanged —
Accelerate exposes standard LAPACK/CBLAS symbol names.

On CMake's side the Apple-Silicon branch skips find_package(MKL)
entirely and links -framework Accelerate. Linux/Windows configure
paths are unchanged.

__CLPK_complex is only exposed by the legacy clapack.h interface; when
ACCELERATE_NEW_LAPACK is defined (which we require for the port, matching
macOS 13.3+), the legacy types are hidden and the new types in
lapack_types.h take over. Use __LAPACK_float_complex — which in C++
resolves to std::complex<float>, layout-compatible with digHolo's
internal complex64 so it continues to work as a pure cast target at
the BLAS/LAPACK call boundary.

Verified with a minimal cgesvd_ test compile against Accelerate with
ACCELERATE_NEW_LAPACK defined.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

simde is pulled in via FetchContent, pinned to v0.8.2. It is header-only and exposed as the digholo_simde INTERFACE
target. Apple-Silicon targets pick up DIGHOLO_USE_SIMDE, which causes
src/digholo_simd_compat.h (added earlier) to route intrinsic includes
through simde instead of <immintrin.h>. The ~912 AVX2/FMA3 call sites
in digHolo.cpp compile unchanged via SIMDE_ENABLE_NATIVE_ALIASES.

The initial simde + Accelerate wiring compiled the AVX2 intrinsic
layer cleanly but hit several gaps the plan didn't anticipate. All
fixed here:

1. _mm_malloc / _mm_free: x86-only aligned allocators exposed by
   Intel's <xmmintrin.h>, not by simde's AVX2 translation surface.
   digHolo.cpp aliases alignedAllocate/alignedFree onto them.
   Polyfilled in digholo_simd_compat.h via posix_memalign.

2. _MM_FROUND_TO_NEAREST_INT / _MM_FROUND_NO_EXC / _MM_ROUND_NEAREST:
   simde defines the SIMDE_MM_* variants but doesn't auto-alias them
   back under SIMDE_ENABLE_NATIVE_ALIASES. Added #defines in the shim.

3. <Accelerate/Accelerate.h> transitively pulls in Carbon/fp.h, which
   declares 'extern const double_t pi'. Clashes with digHolo's
   'const float pi' at line 109. Switched to narrow includes
   (<vecLib/cblas_new.h> + <vecLib/lapack.h>) which avoid Carbon.

4. Accelerate's CBLAS uses the historical enum name CBLAS_ORDER; the
   code at three call sites uses the newer CBLAS_LAYOUT. Added a
   typedef in the Accelerate branch.

5. Accelerate's cblas_cgemm / cblas_cgemv (under ACCELERATE_NEW_LAPACK)
   take strongly-typed const __LAPACK_float_complex* rather than
   MKL-style const void*. Added BLAS_COMPLEXTYPE casts at the three
   CBLAS call sites (digHolo.cpp:3117, 5056, 9624). Casts are no-ops
   on the MKL path because MKL_Complex8 is layout-compatible with
   complex64 and MKL's CBLAS accepts void*.

After these fixes the arm64 build proceeds cleanly through the SIMD
and BLAS/LAPACK layers; the only remaining compile errors are the
<cpuid.h> / __cpuid_count issues at digHolo.cpp:583,588 — exactly
the scope of Task 6.
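
The cast described in item 5 can be sketched roughly as below. Several things here are assumptions rather than facts from the PR: complex64 is taken to be an interleaved (re, im) float pair, BLAS_COMPLEXTYPE is defined as on a typed (Accelerate-style) build, and typed_dotu is a stand-in routine, not a real Accelerate or MKL call.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Assumption: digHolo's complex64 is an interleaved (re, im) float pair.
typedef float complex64[2];

// Assumption: the typed build casts to the complex type; a void*-based
// BLAS (MKL-style) could define this as void, leaving the cast inert.
#define BLAS_COMPLEXTYPE std::complex<float>

static_assert(sizeof(complex64) == sizeof(std::complex<float>),
              "complex64 and std::complex<float> must have the same size");

// Stand-in for a strongly typed CBLAS routine: unconjugated dot product.
inline std::complex<float> typed_dotu(std::size_t n,
                                      const std::complex<float>* x,
                                      const std::complex<float>* y) {
    std::complex<float> sum(0.0f, 0.0f);
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// Call site written against complex64; the cast happens at the boundary.
inline std::complex<float> dot_from_complex64(std::size_t n,
                                              const complex64* x,
                                              const complex64* y) {
    return typed_dotu(n,
                      reinterpret_cast<const BLAS_COMPLEXTYPE*>(x),
                      reinterpret_cast<const BLAS_COMPLEXTYPE*>(y));
}
```

Keeping the cast behind one macro means the call sites read the same on both back ends, and only the macro definition differs per platform.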

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes needed to get the arm64 build linking and passing smoke tests:

1. CPUID (src/digHolo.cpp):
   <intrin.h> and <cpuid.h> are x86-only. Added an __aarch64__/__arm64__
   branch that skips both header includes, and a matching branch in
   cpuInfoGet() that reports AVX2/FMA3 as available (since simde provides
   them at compile time), sets avx512f=0, and fills the brand string with
   "Apple Silicon (arm64)" so downstream consumers (FFTW wisdom filename,
   logging) have a stable key.

2. libfftw3f_threads (CMakeLists.txt):
   digHolo.cpp calls fftwf_init_threads / fftwf_plan_with_nthreads
   unconditionally. On the Linux CI build FFTW is compiled without
   --enable-threads, so these symbols live as no-op stubs inside the
   main libfftw3f.a — no separate link needed. Homebrew's FFTW on macOS
   is built with threads enabled and splits them into
   libfftw3f_threads.dylib. Added a find_library + link on the Apple
   Silicon branch only, leaving Linux/Windows link lines untouched.

With these in place:
  cmake --preset macos-release && cmake --build --preset macos-release
  ctest --preset macos-release
passes the existing smoke test on arm64 macOS.
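
The CPUID stub from fix 1 might look roughly like this. The CpuInfo struct and both function names are hypothetical, since the PR does not show the layout used by cpuInfoGet(); the reported values and the brand string are the ones stated in the commit message. The x86 branch is deliberately left as a placeholder.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical layout; the structure digHolo actually fills is not shown in the PR.
struct CpuInfo {
    bool avx2;
    bool fma3;
    bool avx512f;
    char brand[64];
};

// Values the commit message describes for the arm64 branch.
inline CpuInfo cpu_info_arm64_stub() {
    CpuInfo info{};          // zero-initialised, so brand stays null-terminated
    info.avx2 = true;        // provided by simde at compile time
    info.fma3 = true;        // provided by simde at compile time
    info.avx512f = false;
    std::strncpy(info.brand, "Apple Silicon (arm64)", sizeof(info.brand) - 1);
    return info;
}

inline CpuInfo cpu_info_get() {
#if defined(__aarch64__) || defined(__arm64__)
    // No <cpuid.h> or <intrin.h> on this path.
    return cpu_info_arm64_stub();
#else
    // x86 path: the real code queries CPUID here; omitted in this sketch.
    CpuInfo info{};
    std::strncpy(info.brand, "x86 (CPUID query omitted)", sizeof(info.brand) - 1);
    return info;
#endif
}
```

A fixed brand string gives consumers such as the FFTW wisdom filename a stable key, at the cost of not distinguishing between M-series generations.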

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds test_reference.cpp against the committed digholo library and
diffs its output against reference binaries (to be generated on
Linux/x86 in a follow-up change). Tolerance is 1e-4 relative, 1e-6
absolute — tight enough to catch a misrouted intrinsic or swapped
BLAS convention, loose enough to tolerate simde vs native AVX2 last-
bit differences and Accelerate vs MKL last-bit differences.

Registration is gated on tests/reference/reference_settings.txt
existing, so this scaffolding is a no-op on all platforms until the
reference data lands. Verified: cmake --preset macos-release configures
cleanly with the skip message, smoke test continues to pass.
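
For orientation, a comparison with these tolerances could be written as below. How test_reference.cpp actually combines the relative and absolute terms is not stated, so the additive form used here is an assumption, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kRelTol = 1e-4;  // relative tolerance from the commit message
constexpr double kAbsTol = 1e-6;  // absolute tolerance from the commit message

// Assumed combination: |value - reference| <= abs + rel * |reference|.
// A NaN on either side fails, because the comparison evaluates to false.
inline bool within_tolerance(double value, double reference) {
    return std::fabs(value - reference) <= kAbsTol + kRelTol * std::fabs(reference);
}

// Counts elements outside tolerance; a length mismatch counts every
// element of the longer vector as a mismatch.
inline std::size_t count_mismatches(const std::vector<float>& output,
                                    const std::vector<float>& reference) {
    if (output.size() != reference.size()) {
        return output.size() > reference.size() ? output.size() : reference.size();
    }
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < output.size(); ++i) {
        if (!within_tolerance(output[i], reference[i])) ++mismatches;
    }
    return mismatches;
}
```

The absolute term keeps values near zero from failing on rounding noise alone, while the relative term scales with the magnitude of the reference value.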

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>