Skip to content

theNetworkChuck/mac-studio-cluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mac Studio AI Cluster Guide

Build a 2TB Unified Memory AI Supercomputer with 4 Mac Studios

YouTube Discord

Watch the full video: https://youtu.be/bFgTxr5yst0

This is the companion guide to my YouTube video where I cluster 4 Mac Studios together to create a local AI powerhouse capable of running trillion-parameter models.


The Setup

Component Specs
Nodes 4x Mac Studio M4 Ultra
RAM per Node 512GB Unified Memory
Total Memory 2TB (GPU-accessible)
GPU Cores 320 (80 per node)
Storage 32TB total (8TB per node)
Interconnect Thunderbolt 5 Mesh + Ethernet

Cost Comparison

Setup Memory Approx. Cost
This Mac Cluster 2TB ~$50,000
Equivalent NVIDIA H100s 2TB (26x 80GB) $780,000+

Requirements

Hardware

  • 4x Mac Studio (M4 Ultra recommended, M4 Max works too)
    • 512GB unified memory each (for 2TB total)
    • Or mix configs (minimum 2 nodes for meaningful clustering)
  • Thunderbolt 5 cables (6 cables for full mesh with 4 nodes)
  • Ethernet switch (2.5GbE minimum, 10GbE recommended for model downloads)
  • Ethernet cables (1 per node)

Software

  • macOS Sequoia 15.3+ (Tahoe 26.2 beta or later with RDMA support)
  • Exo Labs v1.1+ (Download)

Network Topology

Thunderbolt Mesh Connection

Connect all 4 Mac Studios in a mesh topology using Thunderbolt 5:

        Mac 1
       /  |  \
      /   |   \
   Mac 2--+--Mac 4
      \   |   /
       \  |  /
        Mac 3

Cable Connections (6 total):

  • Mac 1 ↔ Mac 2
  • Mac 1 ↔ Mac 3
  • Mac 1 ↔ Mac 4
  • Mac 2 ↔ Mac 3
  • Mac 2 ↔ Mac 4
  • Mac 3 ↔ Mac 4

Ethernet Connection

Connect all nodes to the same Ethernet switch for:

  • Node discovery
  • Model downloads
  • API access

Setup Instructions

Step 1: Enable RDMA

This is the secret sauce. RDMA (Remote Direct Memory Access) reduces latency from 300μs to 3μs - a 100x improvement.

  1. Boot into Recovery Mode

    • Shut down your Mac
    • Press and hold the power button until "Loading startup options" appears
    • Click Options → Continue
  2. Open Terminal (Utilities → Terminal)

  3. Enable RDMA

    rdma_ctl enable
  4. Restart and repeat for each node

Step 2: Install Exo Labs

Download and install from: https://github.com/exo-explore/exo

Or via Homebrew:

brew install exo

Step 3: Connect Hardware

  1. Connect all Thunderbolt 5 cables in mesh configuration
  2. Connect all nodes to Ethernet switch
  3. Power on all nodes

Step 4: Launch Cluster

  1. Open Exo Labs app on all nodes
  2. Nodes should auto-discover via Ethernet
  3. Select your parallelism mode:
    • Tensor + RDMA (recommended) - All nodes work on every layer together
    • Pipeline (legacy) - Sequential layer processing

Step 5: Verify Cluster

In Exo Labs, you should see all 4 nodes connected with their combined resources displayed.


Understanding Parallelism

Pipeline Parallelism (Old Way)

Each Mac processes different layers sequentially:

Prompt → [Mac 1: Layers 1-20] → [Mac 2: Layers 21-40] → [Mac 3: Layers 41-60] → [Mac 4: Layers 61-80] → Response

Problem: Sequential = waiting. Each node waits for the previous one.

Tensor Parallelism (New Way with RDMA)

All Macs work on every layer together:

Layer 1: Mac1(25%) + Mac2(25%) + Mac3(25%) + Mac4(25%) → Combine → Layer 2...

Benefit: True parallel processing. ~3.5x faster than pipeline.

Why RDMA Matters

Metric Without RDMA With RDMA
Latency 300 μs 3 μs
Improvement - 100x
Parallelism Pipeline only Tensor enabled

RDMA bypasses the traditional TCP/IP stack, allowing direct GPU-to-GPU memory access over Thunderbolt.


Benchmarks

Model Performance (4-Node Cluster with RDMA)

Model Parameters Size Tokens/sec Notes
Llama 3.2 3B 3B ~2GB 240 Small model baseline
Llama 3.3 70B FP16 70B ~140GB 16 Full precision
Qwen 3 Coder 480B 480B (MoE) ~280GB 40 Mixture of Experts
Kimi K2 1T (MoE) ~658GB 28-30 Thinking model
DeepSeek V3.1 671B 671B ~713GB 26-27 8-bit quantized

Single Node vs Cluster

Model 1 Node 4 Nodes Speedup
Llama 3.2 3B 147 tok/s 240 tok/s 1.6x
Llama 3.3 70B 5 tok/s 16 tok/s 3.2x
Qwen 3 Coder 480B 27 tok/s 40 tok/s 1.5x

Pipeline vs Tensor (Same Cluster)

Mode Llama 70B Improvement
Pipeline (no RDMA) 5 tok/s baseline
Tensor (no RDMA) 3 tok/s slower (too much chatter)
Tensor + RDMA 16 tok/s 3.2x faster

Running Multiple Models

One of the coolest features - run multiple models simultaneously:

Loaded Models (tested simultaneously):
├── Kimi K2 (1T params) - 33% RAM per node
├── DeepSeek V3.1 671B - ~18% RAM per node
├── Llama 3.3 70B FP16 - ~9% RAM per node
├── Llama 3.3 70B 4-bit - ~5% RAM per node
└── Llama 3.2 3B - <1% RAM per node

Total: 5 models loaded and responsive simultaneously.


Integration with Apps

Open WebUI

Exo Labs exposes an OpenAI-compatible API. Point Open WebUI to your cluster:

# docker-compose.yml addition
environment:
  - OPENAI_API_BASE=http://<mac-studio-ip>:8000/v1
  - OPENAI_API_KEY=not-needed

Xcode

Works with Xcode's AI coding features when configured as a local model endpoint.

Claude Code / Cursor / Continue

Any tool that supports OpenAI-compatible endpoints can use your cluster:

export OPENAI_API_BASE="http://<cluster-ip>:8000/v1"
export OPENAI_API_KEY="local"

Troubleshooting

Nodes Not Discovering Each Other

Cycle the Ethernet interfaces to trigger discovery:

# Run on each node (or use the script in /scripts)
ETH=$(networksetup -listallnetworkservices | grep -i ethernet | head -1)
sudo networksetup -setnetworkserviceenabled "$ETH" off
sleep 2
sudo networksetup -setnetworkserviceenabled "$ETH" on

Cluster Crashes Under Load

This can happen with beta software. Full restart sequence:

# See scripts/restart-cluster.sh for full automation

Model Loading Fails

  • Ensure all nodes have the model downloaded
  • Check available memory across cluster
  • Try with fewer nodes first

Thunderbolt Connection Issues

  • Use Apple-certified Thunderbolt 5 cables
  • Check System Information → Thunderbolt for connection status
  • Reconnect cables if nodes drop out

Scripts

See the /scripts directory for automation helpers:

  • check-status.sh - Check if all nodes are online
  • restart-cluster.sh - Full cluster reboot sequence
  • fix-discovery.sh - Cycle network for node discovery
  • start-exo.sh - Launch Exo on all nodes

Power Consumption

State Per Node 4-Node Cluster
Idle ~30W ~120W
Light Load ~80W ~320W
Full Inference ~130-150W ~520-600W

Resources

Related Videos


FAQ

Q: Do I need 4 identical Mac Studios? A: No, you can mix configurations. Even 2 nodes with different RAM amounts will work.

Q: Can I use M3 or M2 Macs? A: Yes, but you need macOS with RDMA support (Sequoia 15.3+) and Thunderbolt 4/5.

Q: Is RDMA available on all Macs? A: Currently requires Apple Silicon with Thunderbolt and the appropriate macOS version.

Q: Can I add more than 4 nodes? A: Exo Labs supports larger clusters, but Thunderbolt mesh topology becomes complex beyond 4 nodes.

Q: What about fine-tuning? A: This setup is optimized for inference. Fine-tuning workflows are still evolving for MLX.


Credits

  • Apple - For enabling RDMA over Thunderbolt
  • Exo Labs - For the clustering software
  • MLX Team - For the machine learning framework that makes this possible

License

MIT License - Feel free to use, modify, and share.


Built by NetworkChuck
YouTubeTwitterDiscord

About

Build a 2TB Unified Memory AI Supercomputer with Mac Studios - Companion guide to NetworkChuck's YouTube video

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages