Build a 2TB Unified Memory AI Supercomputer with 4 Mac Studios
Watch the full video: https://youtu.be/bFgTxr5yst0
This is the companion guide to my YouTube video where I cluster 4 Mac Studios together to create a local AI powerhouse capable of running trillion-parameter models.
| Component | Specs |
|---|---|
| Nodes | 4x Mac Studio M4 Ultra |
| RAM per Node | 512GB Unified Memory |
| Total Memory | 2TB (GPU-accessible) |
| GPU Cores | 320 (80 per node) |
| Storage | 32TB total (8TB per node) |
| Interconnect | Thunderbolt 5 Mesh + Ethernet |
| Setup | Memory | Approx. Cost |
|---|---|---|
| This Mac Cluster | 2TB | ~$50,000 |
| Equivalent NVIDIA H100s | 2TB (26x 80GB) | $780,000+ |
- 4x Mac Studio (M4 Ultra recommended, M4 Max works too)
- 512GB unified memory each (for 2TB total)
- Or mix configs (minimum 2 nodes for meaningful clustering)
- Thunderbolt 5 cables (6 cables for full mesh with 4 nodes)
- Ethernet switch (2.5GbE minimum, 10GbE recommended for model downloads)
- Ethernet cables (1 per node)
- macOS Sequoia 15.3+ (Tahoe 26.2 beta or later with RDMA support)
- Exo Labs v1.1+ (Download)
Connect all 4 Mac Studios in a mesh topology using Thunderbolt 5:
Mac 1
/ | \
/ | \
Mac 2--+--Mac 4
\ | /
\ | /
Mac 3
Cable Connections (6 total):
- Mac 1 ↔ Mac 2
- Mac 1 ↔ Mac 3
- Mac 1 ↔ Mac 4
- Mac 2 ↔ Mac 3
- Mac 2 ↔ Mac 4
- Mac 3 ↔ Mac 4
Connect all nodes to the same Ethernet switch for:
- Node discovery
- Model downloads
- API access
This is the secret sauce. RDMA (Remote Direct Memory Access) reduces latency from 300μs to 3μs - a 100x improvement.
-
Boot into Recovery Mode
- Shut down your Mac
- Press and hold the power button until "Loading startup options" appears
- Click Options → Continue
-
Open Terminal (Utilities → Terminal)
-
Enable RDMA
rdma_ctl enable -
Restart and repeat for each node
Download and install from: https://github.com/exo-explore/exo
Or via Homebrew:
brew install exo- Connect all Thunderbolt 5 cables in mesh configuration
- Connect all nodes to Ethernet switch
- Power on all nodes
- Open Exo Labs app on all nodes
- Nodes should auto-discover via Ethernet
- Select your parallelism mode:
- Tensor + RDMA (recommended) - All nodes work on every layer together
- Pipeline (legacy) - Sequential layer processing
In Exo Labs, you should see all 4 nodes connected with their combined resources displayed.
Each Mac processes different layers sequentially:
Prompt → [Mac 1: Layers 1-20] → [Mac 2: Layers 21-40] → [Mac 3: Layers 41-60] → [Mac 4: Layers 61-80] → Response
Problem: Sequential = waiting. Each node waits for the previous one.
All Macs work on every layer together:
Layer 1: Mac1(25%) + Mac2(25%) + Mac3(25%) + Mac4(25%) → Combine → Layer 2...
Benefit: True parallel processing. ~3.5x faster than pipeline.
| Metric | Without RDMA | With RDMA |
|---|---|---|
| Latency | 300 μs | 3 μs |
| Improvement | - | 100x |
| Parallelism | Pipeline only | Tensor enabled |
RDMA bypasses the traditional TCP/IP stack, allowing direct GPU-to-GPU memory access over Thunderbolt.
| Model | Parameters | Size | Tokens/sec | Notes |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2GB | 240 | Small model baseline |
| Llama 3.3 70B FP16 | 70B | ~140GB | 16 | Full precision |
| Qwen 3 Coder 480B | 480B (MoE) | ~280GB | 40 | Mixture of Experts |
| Kimi K2 | 1T (MoE) | ~658GB | 28-30 | Thinking model |
| DeepSeek V3.1 671B | 671B | ~713GB | 26-27 | 8-bit quantized |
| Model | 1 Node | 4 Nodes | Speedup |
|---|---|---|---|
| Llama 3.2 3B | 147 tok/s | 240 tok/s | 1.6x |
| Llama 3.3 70B | 5 tok/s | 16 tok/s | 3.2x |
| Qwen 3 Coder 480B | 27 tok/s | 40 tok/s | 1.5x |
| Mode | Llama 70B | Improvement |
|---|---|---|
| Pipeline (no RDMA) | 5 tok/s | baseline |
| Tensor (no RDMA) | 3 tok/s | slower (too much chatter) |
| Tensor + RDMA | 16 tok/s | 3.2x faster |
One of the coolest features - run multiple models simultaneously:
Loaded Models (tested simultaneously):
├── Kimi K2 (1T params) - 33% RAM per node
├── DeepSeek V3.1 671B - ~18% RAM per node
├── Llama 3.3 70B FP16 - ~9% RAM per node
├── Llama 3.3 70B 4-bit - ~5% RAM per node
└── Llama 3.2 3B - <1% RAM per node
Total: 5 models loaded and responsive simultaneously.
Exo Labs exposes an OpenAI-compatible API. Point Open WebUI to your cluster:
# docker-compose.yml addition
environment:
- OPENAI_API_BASE=http://<mac-studio-ip>:8000/v1
- OPENAI_API_KEY=not-neededWorks with Xcode's AI coding features when configured as a local model endpoint.
Any tool that supports OpenAI-compatible endpoints can use your cluster:
export OPENAI_API_BASE="http://<cluster-ip>:8000/v1"
export OPENAI_API_KEY="local"Cycle the Ethernet interfaces to trigger discovery:
# Run on each node (or use the script in /scripts)
ETH=$(networksetup -listallnetworkservices | grep -i ethernet | head -1)
sudo networksetup -setnetworkserviceenabled "$ETH" off
sleep 2
sudo networksetup -setnetworkserviceenabled "$ETH" onThis can happen with beta software. Full restart sequence:
# See scripts/restart-cluster.sh for full automation- Ensure all nodes have the model downloaded
- Check available memory across cluster
- Try with fewer nodes first
- Use Apple-certified Thunderbolt 5 cables
- Check System Information → Thunderbolt for connection status
- Reconnect cables if nodes drop out
See the /scripts directory for automation helpers:
check-status.sh- Check if all nodes are onlinerestart-cluster.sh- Full cluster reboot sequencefix-discovery.sh- Cycle network for node discoverystart-exo.sh- Launch Exo on all nodes
| State | Per Node | 4-Node Cluster |
|---|---|---|
| Idle | ~30W | ~120W |
| Light Load | ~80W | ~320W |
| Full Inference | ~130-150W | ~520-600W |
- Exo Labs: https://github.com/exo-explore/exo
- MLX Framework: https://github.com/ml-explore/mlx
- NetworkChuck Discord: https://discord.gg/networkchuck
- NetworkChuck Academy: https://academy.networkchuck.com
Q: Do I need 4 identical Mac Studios? A: No, you can mix configurations. Even 2 nodes with different RAM amounts will work.
Q: Can I use M3 or M2 Macs? A: Yes, but you need macOS with RDMA support (Sequoia 15.3+) and Thunderbolt 4/5.
Q: Is RDMA available on all Macs? A: Currently requires Apple Silicon with Thunderbolt and the appropriate macOS version.
Q: Can I add more than 4 nodes? A: Exo Labs supports larger clusters, but Thunderbolt mesh topology becomes complex beyond 4 nodes.
Q: What about fine-tuning? A: This setup is optimized for inference. Fine-tuning workflows are still evolving for MLX.
- Apple - For enabling RDMA over Thunderbolt
- Exo Labs - For the clustering software
- MLX Team - For the machine learning framework that makes this possible
MIT License - Feel free to use, modify, and share.