Apple Silicon port: simde + Accelerate (scope A — local build)#3
Closed
christhechris wants to merge 10 commits into
Keep Claude Code workspace artefacts and superpowers specs local-only; they should not flow into the upstream repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the direct <immintrin.h> include in src/digHolo.cpp with a conditional shim. When DIGHOLO_USE_SIMDE is defined (Apple Silicon branch, wired up in a later commit), the shim pulls in simde's AVX2/FMA headers with SIMDE_ENABLE_NATIVE_ALIASES so the ~912 intrinsic call sites compile unchanged. On x86-64 the shim falls through to <immintrin.h>, so this commit is a no-op on Linux and Windows.
Ninja + arm64 + CMAKE_OSX_DEPLOYMENT_TARGET=13.3 (required for the Accelerate modern-LAPACK interface used by the simde + Accelerate port landing in follow-up commits). Preset currently short-circuits at the existing APPLE fatal-error guard — that is removed in a subsequent commit.
Accept arm64/aarch64 as a valid processor when APPLE is set. Narrow the Apple fatal-error to fire only on x86_64 macOS, where MKL truly isn't available. Linux and Windows x86_64 configures are unchanged. The macOS preset now progresses to find_package(MKL), which is replaced with Accelerate in the next commit.
Adds a DIGHOLO_USE_ACCELERATE include branch in src/digHolo.cpp that pulls in <Accelerate/Accelerate.h> with the modern LAPACK interface (ACCELERATE_NEW_LAPACK, set by CMake; minimum macOS 13.3). The cgesvd / sgels / cblas_cgemv / cblas_cgemm call sites are unchanged — Accelerate exposes standard LAPACK/CBLAS symbol names. On CMake's side the Apple-Silicon branch skips find_package(MKL) entirely and links -framework Accelerate. Linux/Windows configure paths are unchanged.
__CLPK_complex is only exposed by the legacy clapack.h interface; when ACCELERATE_NEW_LAPACK is defined (which we require for the port, matching macOS 13.3+), the legacy types are hidden and the new types in lapack_types.h take over. Use __LAPACK_float_complex — which in C++ resolves to std::complex<float>, layout-compatible with digHolo's internal complex64 so it continues to work as a pure cast target at the BLAS/LAPACK call boundary. Verified with a minimal cgesvd_ test compile against Accelerate with ACCELERATE_NEW_LAPACK defined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
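The layout-compatibility claim above can be checked at compile time. The following is a minimal sketch, not digHolo's code: `complex64_sketch` is a hypothetical stand-in for digHolo's `complex64` (assumed here to be two packed floats, since its definition is not shown in this PR), and `lapack_complex_sketch` stands in for `__LAPACK_float_complex`, which the commit message says resolves to `std::complex<float>` in C++.

```cpp
#include <complex>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for digHolo's internal complex64 (assumed: two floats).
struct complex64_sketch {
    float re;
    float im;
};

// Stand-in for __LAPACK_float_complex under ACCELERATE_NEW_LAPACK in C++.
using lapack_complex_sketch = std::complex<float>;

// A pointer cast at the BLAS/LAPACK boundary is only sensible if size and
// alignment agree and the source type is a plain standard-layout aggregate.
static_assert(sizeof(complex64_sketch) == sizeof(lapack_complex_sketch),
              "size mismatch between complex64 and the LAPACK complex type");
static_assert(alignof(complex64_sketch) == alignof(lapack_complex_sketch),
              "alignment mismatch between complex64 and the LAPACK complex type");
static_assert(std::is_standard_layout<complex64_sketch>::value,
              "complex64 must be standard layout");

// Byte-wise conversion used here only to demonstrate that the two
// representations agree (real part first, imaginary part second).
inline lapack_complex_sketch toLapack(const complex64_sketch& z) {
    lapack_complex_sketch out;
    std::memcpy(&out, &z, sizeof(out));
    return out;
}
```

If either `static_assert` fired on a given toolchain, the cast at the call boundary would need to be replaced by an explicit copy.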
Pinned to v0.8.2. Header-only; exposed as digholo_simde INTERFACE target. Apple-Silicon targets pick up DIGHOLO_USE_SIMDE, which causes src/digholo_simd_compat.h (added earlier) to route intrinsic includes through simde instead of <immintrin.h>. The ~912 AVX2/FMA3 call sites in digHolo.cpp compile unchanged via SIMDE_ENABLE_NATIVE_ALIASES.
The initial simde + Accelerate wiring compiled the AVX2 intrinsic layer cleanly but hit several gaps the plan didn't anticipate. All fixed here:

1. _mm_malloc / _mm_free: x86-only aligned allocators exposed by Intel's <xmmintrin.h>, not by simde's AVX2 translation surface. digHolo.cpp aliases alignedAllocate/alignedFree onto them. Polyfilled in digholo_simd_compat.h via posix_memalign.
2. _MM_FROUND_TO_NEAREST_INT / _MM_FROUND_NO_EXC / _MM_ROUND_NEAREST: simde defines the SIMDE_MM_* variants but doesn't auto-alias them back under SIMDE_ENABLE_NATIVE_ALIASES. Added #defines in the shim.
3. <Accelerate/Accelerate.h> transitively pulls in Carbon/fp.h, which declares 'extern const double_t pi'. Clashes with digHolo's 'const float pi' at line 109. Switched to narrow includes (<vecLib/cblas_new.h> + <vecLib/lapack.h>) which avoid Carbon.
4. Accelerate's CBLAS uses the historical enum name CBLAS_ORDER; the code at three call sites uses the newer CBLAS_LAYOUT. Added a typedef in the Accelerate branch.
5. Accelerate's cblas_cgemm / cblas_cgemv (under ACCELERATE_NEW_LAPACK) take strongly-typed const __LAPACK_float_complex* rather than MKL-style const void*. Added BLAS_COMPLEXTYPE casts at the three CBLAS call sites (digHolo.cpp:3117, 5056, 9624). Casts are no-ops on the MKL path because MKL_Complex8 is layout-compatible with complex64 and MKL's CBLAS accepts void*.

After these fixes the arm64 build proceeds cleanly through the SIMD and BLAS/LAPACK layers; the only remaining compile errors are the <cpuid.h> / __cpuid_count issues at digHolo.cpp:583,588, which is exactly the scope of Task 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
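As an illustration of the first fix, an aligned-allocation polyfill built on `posix_memalign` might look like the sketch below. The names `sketch_mm_malloc` and `sketch_mm_free` are hypothetical (the shim itself defines `_mm_malloc` / `_mm_free`, and its exact contents are not shown in this PR), and the code assumes a POSIX system such as macOS or Linux.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign and free on POSIX systems

// Hypothetical polyfill with the same argument order as _mm_malloc(size, align).
inline void* sketch_mm_malloc(std::size_t size, std::size_t alignment) {
    // posix_memalign requires the alignment to be a power of two and a
    // multiple of sizeof(void*); round tiny alignments up to that minimum.
    if (alignment < sizeof(void*)) {
        alignment = sizeof(void*);
    }
    void* ptr = nullptr;
    // Returns non-zero on failure (for example EINVAL for a non-power-of-two
    // alignment, or ENOMEM); report that as a null pointer like _mm_malloc.
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}

// Memory from posix_memalign is released with plain free().
inline void sketch_mm_free(void* ptr) {
    free(ptr);
}
```

A 32-byte alignment, as used for AVX2 loads and stores, would be requested with `sketch_mm_malloc(n * sizeof(float), 32)`.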
Two fixes needed to get the arm64 build linking and passing smoke tests:

1. CPUID (src/digHolo.cpp): <intrin.h> and <cpuid.h> are x86-only. Added an __aarch64__/__arm64__ branch that skips both header includes, and a matching branch in cpuInfoGet() that reports AVX2/FMA3 as available (since simde provides them at compile time), sets avx512f=0, and fills the brand string with "Apple Silicon (arm64)" so downstream consumers (FFTW wisdom filename, logging) have a stable key.
2. libfftw3f_threads (CMakeLists.txt): digHolo.cpp calls fftwf_init_threads / fftwf_plan_with_nthreads unconditionally. On the Linux CI build FFTW is compiled without --enable-threads, so these symbols live as no-op stubs inside the main libfftw3f.a and no separate link is needed. Homebrew's FFTW on macOS is built with threads enabled and splits them into libfftw3f_threads.dylib. Added a find_library + link on the Apple Silicon branch only, leaving Linux/Windows link lines untouched.

With these in place:

    cmake --preset macos-release && cmake --build --preset macos-release
    ctest --preset macos-release

passes the existing smoke test on arm64 macOS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
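The CPUID stub described in the first fix could be sketched roughly as follows. The struct `CpuInfoSketch` and the function `cpuInfoGetSketch` are hypothetical; the real `cpuInfoGet()` signature and fields are not shown in this PR, and the x86 branch here deliberately omits the actual CPUID query.

```cpp
#include <cstring>

// Hypothetical container for the feature flags the commit message mentions.
struct CpuInfoSketch {
    int avx2;
    int fma3;
    int avx512f;
    char brand[64];
};

inline void cpuInfoGetSketch(CpuInfoSketch& info) {
    std::memset(&info, 0, sizeof(info));
#if defined(__aarch64__) || defined(__arm64__)
    // No CPUID instruction on arm64. AVX2/FMA3 are reported as available
    // because simde supplies them at compile time; AVX-512 is not provided.
    info.avx2 = 1;
    info.fma3 = 1;
    info.avx512f = 0;
    // Fixed brand string so consumers keyed on it get a stable value.
    std::strncpy(info.brand, "Apple Silicon (arm64)", sizeof(info.brand) - 1);
#else
    // x86 path: the real code would query CPUID here. This sketch leaves the
    // feature flags at zero and only fills a placeholder brand string.
    std::strncpy(info.brand, "x86 (CPUID query omitted in sketch)",
                 sizeof(info.brand) - 1);
#endif
}
```

Note that the preprocessor test matches any arm64 target, not only macOS, which mirrors the `__aarch64__`/`__arm64__` condition named in the commit message.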
Builds test_reference.cpp against the committed digholo library and diffs its output against reference binaries (to be generated on Linux/x86 in a follow-up change). Tolerance is 1e-4 relative, 1e-6 absolute: tight enough to catch a misrouted intrinsic or swapped BLAS convention, loose enough to tolerate simde vs native AVX2 last-bit differences and Accelerate vs MKL last-bit differences.

Registration is gated on tests/reference/reference_settings.txt existing, so this scaffolding is a no-op on all platforms until the reference data lands.

Verified: cmake --preset macos-release configures cleanly with the skip message, smoke test continues to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
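A comparison using those two tolerances might look like the sketch below. How `test_reference.cpp` actually combines the relative and absolute terms is not stated in the commit message; this sketch assumes the common additive form `|a - e| <= abs + rel * |e|`, and the names `withinTolerance` and `countMismatches` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Assumed combination rule: pass if |actual - expected| <= absTol + relTol * |expected|.
inline bool withinTolerance(double actual, double expected,
                            double relTol = 1e-4, double absTol = 1e-6) {
    const double diff = std::fabs(actual - expected);
    // A NaN on either side makes diff NaN, so the comparison is false and
    // the value is counted as a mismatch.
    return diff <= absTol + relTol * std::fabs(expected);
}

// Counts elements of a float buffer that fall outside the tolerance.
inline std::size_t countMismatches(const float* actual, const float* expected,
                                   std::size_t count) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (!withinTolerance(actual[i], expected[i])) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The absolute term keeps values near zero from failing on noise alone, while the relative term scales the allowance with the magnitude of the reference value.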
Summary
Builds digHolo natively on Apple Silicon (arm64 macOS) using simde to translate AVX2/FMA3 intrinsics to NEON, and Apple Accelerate as a drop-in BLAS/LAPACK replacement for Intel MKL. Smoke tests pass on an M-series Mac. Linux/Windows x86-64 builds are byte-unchanged (all new code paths are gated behind `DIGHOLO_IS_APPLE_SILICON` / `DIGHOLO_USE_ACCELERATE` / `DIGHOLO_USE_SIMDE` / `__aarch64__`).

Scope A only: local dev build. No CI runner, no release artefacts, no macOS Python wheels; those are natural follow-ups.
Usage
Requires macOS 13.3+ (for Accelerate's modern LAPACK interface).
What lands
- `src/digholo_simd_compat.h`: SIMD include shim. Routes to `<immintrin.h>` on x86; routes to `<simde/x86/avx2.h>` + `<simde/x86/fma.h>` with `SIMDE_ENABLE_NATIVE_ALIASES` on arm64. Also polyfills `_mm_malloc` / `_mm_free` via `posix_memalign` and aliases SSE rounding-mode constants.
- `tests/test_reference.cpp` + `tests/reference/README.md`: numerical regression test scaffolding, gated on committed reference data (to be generated on Linux/x86 in a follow-up).
- `CMakeLists.txt`: widened arch guard for arm64, `find_library(Accelerate)`, `FetchContent` pulling simde pinned to `v0.8.2`, Apple Silicon branch for link libraries (FFTW3 + FFTW3 threads + Accelerate + simde) and compile defs. Linux/Windows paths untouched.
- `CMakePresets.json`: `macos-release` / `macos-debug` presets (Ninja, arm64, deployment target 13.3, Darwin-host-only).
- `src/digHolo.cpp`: three narrow edits: swap `<immintrin.h>` for the shim, add `DIGHOLO_USE_ACCELERATE` BLAS include branch, stub the CPUID query on arm64.

Gotchas discovered during implementation
Each of these cost a build iteration; they are worth noting so the next person porting finds them faster:
- `__CLPK_complex` is gone under `ACCELERATE_NEW_LAPACK`. The modern LAPACK interface exposes `__LAPACK_float_complex` instead (which in C++ is `std::complex<float>`, layout-compatible with digHolo's `complex64`).
- `<Accelerate/Accelerate.h>` transitively pulls Carbon / `fp.h`, which declares `extern const double_t pi` and clashes with digHolo's own `const float pi` at line 109. Fix: narrow includes `<vecLib/cblas_new.h>` + `<vecLib/lapack.h>`, which skip Carbon entirely.
- Accelerate's CBLAS uses the historical `CBLAS_ORDER` enum name; digHolo uses the newer `CBLAS_LAYOUT`. Added a typedef in the Accelerate branch.
- `cblas_cgemm` / `cblas_cgemv` under modern LAPACK are strongly typed (`const __LAPACK_float_complex*`) whereas MKL's are `const void*`. Added `BLAS_COMPLEXTYPE` casts at the three CBLAS call sites; they are no-ops on the MKL path.
- simde's AVX2 translation surface does not provide `_mm_malloc` / `_mm_free` (those are allocator macros from `<xmmintrin.h>`, not SIMD intrinsics). Polyfilled in the shim via `posix_memalign`.
- simde does not auto-alias the rounding-mode constants (`_MM_FROUND_NO_EXC`, `_MM_ROUND_NEAREST`, etc.) under `SIMDE_ENABLE_NATIVE_ALIASES`. Added `#define`s in the shim mapping back to the `SIMDE_MM_*` prefixed versions.
- `libfftw3f_threads` is a separate dylib on Homebrew macOS; the Linux CI build (FFTW from source without `--enable-threads`) silently inlines no-op thread stubs into `libfftw3f`, so no explicit threads link is needed there. Explicit `find_library` + link added on the Apple Silicon branch only.

Test plan
- `ctest --preset macos-release`: smoke test passes
- [...] `git diff` isolates it clearly)

Not in scope
- [...] `digholo-macos-arm64` zip)
- [...] `cibuildwheel`)

🤖 Generated with Claude Code