msplat

A 3D Gaussian Splatting training engine for Apple Silicon, built entirely on Metal. No external dependencies beyond system frameworks.

The entire training pipeline (projection, sorting, rasterization, SSIM loss, backward pass, Adam optimizer, and densification) runs as fused Metal compute shaders.

The result is a self-contained engine that trains a full-resolution Mip-NeRF 360 scene in ~70 seconds and renders it at ~350 FPS on an M4 Max.

Python and Swift bindings are provided, as well as a standalone C++ CLI built for automated pipelines: streaming machine-readable progress, quality presets, a hard splat/memory cap, deterministic exit codes, and clean SIGINT/SIGTERM handling. It trains comfortably on a base M4 / 16 GB MacBook Pro.

demo.mp4

Why this exists

The original 3D Gaussian Splatting implementation is CUDA-only. Ports to other frameworks (gsplat, taichi-3dgs, etc.) still depend on PyTorch for autograd, optimizer state, and tensor management. That dependency means ~2 GB of framework overhead, Python GIL contention, and no straightforward path to native macOS/iOS integration.

Architecture

core/metal/msplat_metal.metal    ← Compute kernels
core/src/                        ← C++ training loop, dataset loaders, SSIM eval
core/include/                    ← MTensor (lightweight GPU tensor), Model, API headers
python/bindings.cpp              ← nanobind Python module
swift/Sources/Msplat/            ← Swift package (via C API bridge)
cli/msplat.cpp                   ← C++ CLI

Training pipeline (single iteration)

Each training step dispatches all work into one Metal command encoder:

Forward:
  project_and_sh_forward     ← fused 3D→2D projection + spherical harmonics
  prefix_sum + scatter       ← gaussian→tile intersection mapping
  bitonic_sort_per_tile      ← tile-local depth sort + inline data packing
  nd_rasterize_forward       ← per-pixel alpha compositing (16x16 tiles)
  ssim_h_fwd + ssim_v_fwd   ← separable 11-tap SSIM + L1 loss

Backward:
  ssim_h_bwd + ssim_v_bwd   ← separable SSIM gradient
  rasterize_backward         ← per-pixel backward compositing
  project_and_sh_backward    ← fused projection + SH VJP + SH Adam update
  fused_adam (×4 groups)     ← optimizer step (means, scales, quats, opacity)
  accumulate_grad_stats      ← gradient norms for densification
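
The prefix_sum + scatter step named in the forward listing follows a common count / prefix-sum / scatter pattern. Below is a minimal CPU sketch of that pattern in Python, assuming each gaussian has already been reduced to an inclusive range of tile coordinates. The function name and data layout are hypothetical and are not taken from msplat's kernels.

```python
def build_tile_lists(bboxes, tiles_x, tiles_y):
    """Map gaussians to tiles.

    bboxes: one (x0, y0, x1, y1) inclusive tile range per gaussian (assumed layout).
    Returns (offsets, counts, gaussian_ids): tile t owns
    gaussian_ids[offsets[t] : offsets[t] + counts[t]].
    """
    num_tiles = tiles_x * tiles_y

    # Pass 1: count how many gaussians touch each tile.
    counts = [0] * num_tiles
    for x0, y0, x1, y1 in bboxes:
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                counts[ty * tiles_x + tx] += 1

    # Pass 2: exclusive prefix sum turns counts into start offsets.
    offsets = [0] * num_tiles
    running = 0
    for t in range(num_tiles):
        offsets[t] = running
        running += counts[t]

    # Pass 3: scatter each gaussian id into its tiles' slots.
    cursor = list(offsets)
    gaussian_ids = [-1] * running
    for g, (x0, y0, x1, y1) in enumerate(bboxes):
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                t = ty * tiles_x + tx
                gaussian_ids[cursor[t]] = g
                cursor[t] += 1
    return offsets, counts, gaussian_ids
```

On a GPU the three passes would typically be separate dispatches, with atomics standing in for the serial cursor.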

Key design decisions

Tile-local bitonic sort instead of global radix sort. Each 16x16 tile independently sorts its gaussians (up to 2048) in threadgroup shared memory. The sort kernel also packs per-gaussian data (xy, opacity, conic, color) inline, eliminating a separate scatter dispatch.
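
To illustrate the sorting network itself, here is a small Python sketch of a bitonic sort, assuming the input is padded to a power-of-two length with infinite depths. It is a serial model of the compare-exchange schedule, not the Metal kernel; the helper names are hypothetical.

```python
def bitonic_sort(items):
    """Sort a power-of-two-length list ascending with a bitonic network."""
    n = len(items)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(items)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # compare-exchange distance within this merge
            for i in range(n):   # on a GPU, one thread per index i
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def sort_tile_by_depth(depths):
    """Return gaussian indices ordered front to back for one tile."""
    pairs = [(d, i) for i, d in enumerate(depths)]
    size = 1
    while size < len(pairs):
        size *= 2
    # Pad with +inf depth so padding sinks to the end of the sorted list.
    pairs += [(float("inf"), -1)] * (size - len(pairs))
    return [i for _, i in bitonic_sort(pairs) if i >= 0]
```

Every pass of the inner `for` loop touches independent pairs, which is why the network maps well onto a threadgroup with a barrier between passes.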

GPU-resident densification. The split/clone/cull cycle never leaves the GPU. Classification, growth, and compaction are all compute kernels operating on device buffers. No CPU readback of gradient statistics or gaussian counts.

Fused kernels. Projection and spherical harmonic evaluation share registers (avoid a device memory round-trip for world-space position). The backward pass recomputes 3D covariance from scales/quaternions on-the-fly rather than storing it. SH backward gradients are computed in registers and fed directly into Adam updates, eliminating a separate gradient buffer write/read cycle. The remaining four parameter groups use fused Adam dispatches.
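
Recomputing the 3D covariance is cheap because it is a closed-form function of the scale and rotation: Σ = R S Sᵀ Rᵀ. A minimal Python sketch follows, assuming a (w, x, y, z) quaternion order and scales that have already been passed through their activation; both are assumptions, not confirmed details of msplat's storage format.

```python
import math


def quat_to_rotmat(q):
    """Rotation matrix from a quaternion in (w, x, y, z) order (assumed)."""
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance_3d(scale, quat):
    """Sigma = (R S)(R S)^T for S = diag(scale)."""
    R = quat_to_rotmat(quat)
    # M = R @ diag(scale): scale each column of R.
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    return [
        [sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Storing seven floats (three scales, four quaternion components) and rebuilding Σ on demand trades a few dozen arithmetic operations for one fewer buffer round-trip.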

Separable SSIM. The 11x11 Gaussian-weighted SSIM window decomposes into two 1D passes (horizontal then vertical), reducing per-pixel work from 121 to 22 multiply-adds. Forward and backward each take two kernels, using threadgroup shared memory for the intermediate statistics.
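
The 121-to-22 reduction holds because a 2D Gaussian window is the outer product of two 1D windows. The Python sketch below checks that a horizontal pass followed by a vertical pass matches the direct 11x11 filter. The sigma of 1.5 and the clamped border handling are assumptions made for the example, not confirmed details of the msplat kernels.

```python
import math


def gaussian_1d(size=11, sigma=1.5):
    """Normalized 1D Gaussian window (sigma is an assumed value)."""
    half = size // 2
    w = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(w)
    return [v / total for v in w]


def _clamp(i, n):
    return min(max(i, 0), n - 1)


def blur_horizontal(img, w):
    half, H, W = len(w) // 2, len(img), len(img[0])
    return [[sum(w[k] * img[y][_clamp(x + k - half, W)] for k in range(len(w)))
             for x in range(W)] for y in range(H)]


def blur_vertical(img, w):
    half, H, W = len(w) // 2, len(img), len(img[0])
    return [[sum(w[k] * img[_clamp(y + k - half, H)][x] for k in range(len(w)))
             for x in range(W)] for y in range(H)]


def blur_direct(img, w):
    """Reference: full 2D window, len(w)**2 multiply-adds per pixel."""
    half, H, W = len(w) // 2, len(img), len(img[0])
    n = len(w)
    return [[sum(w[r] * w[c] * img[_clamp(y + r - half, H)][_clamp(x + c - half, W)]
                 for r in range(n) for c in range(n))
             for x in range(W)] for y in range(H)]


def blur_separable(img, w):
    """Two 1D passes, 2 * len(w) multiply-adds per pixel."""
    return blur_vertical(blur_horizontal(img, w), w)
```

SSIM needs this filter applied to several quantities (the two images, their squares, and their product), so the saving applies to each of those statistics.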

Depth-chunked rasterization. For tiles with extreme gaussian counts, the forward pass splits into 512-gaussian chunks with a merge kernel that reconstructs absolute transmittance. The backward pass uses precomputed prefix/suffix transmittance to avoid re-traversal.
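
The merge step works because front-to-back compositing is associative: each chunk can be composited as if it started at full transmittance, then scaled by the transmittance of everything in front of it. A minimal Python sketch for one pixel with scalar color follows. The function names are hypothetical, and early termination at low transmittance is left out.

```python
def composite(splats):
    """Front-to-back alpha compositing of (alpha, color) pairs for one pixel.

    Returns (color, transmittance remaining behind the last splat).
    """
    color, transmittance = 0.0, 1.0
    for alpha, c in splats:
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color, transmittance


def composite_chunked(splats, chunk=512):
    """Composite depth-ordered chunks independently, then merge."""
    # Each chunk is composited locally, as if transmittance started at 1.
    partials = [composite(splats[i:i + chunk])
                for i in range(0, len(splats), chunk)]

    # Merge: rescale each chunk by the absolute transmittance in front of it.
    color, transmittance = 0.0, 1.0
    for local_color, local_transmittance in partials:
        color += transmittance * local_color
        transmittance *= local_transmittance
    return color, transmittance
```

The list comprehension over chunks has no dependency between iterations, which is the part that can run in parallel; only the short merge loop is sequential.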

Installation & Usage

Python

pip install msplat

import msplat

dataset = msplat.load_dataset("path/to/colmap/", eval_mode=True)
config = msplat.TrainingConfig(iterations=7000, num_downscales=0)
trainer = msplat.GaussianTrainer(dataset, config)

trainer.train(lambda s: print(f"step={s.iteration} splats={s.splat_count:,}"),
              callback_every=100)

trainer.export_ply("output.ply")
trainer.save_checkpoint("checkpoint.msplat")  # save/resume training
metrics = trainer.evaluate()
print(f"PSNR: {metrics['psnr']:.2f}  SSIM: {metrics['ssim']:.3f}")

# Render from arbitrary viewpoints
pose = dataset.camera_pose(0)   # (4, 4) cam-to-world matrix
img = trainer.render_from_pose(pose)  # numpy (H, W, 3) float32

Supported dataset formats: COLMAP, Nerfstudio, Polycam.

Type stubs (_core.pyi) are included for IDE autocompletion.

CLI

pip install msplat[cli]
msplat-train path/to/dataset -n 7000 --eval

The standalone C++ binary (./build/msplat, or bundled in a pipeline) is built for automation; see docs/ and man docs/msplat.1:

# Quality presets (explicit flags override the preset):
msplat path/to/scene --preset draft        # 7000 iters, half-res, ≤1M splats
msplat path/to/scene --preset production    # 100000 iters, ≤6M splats

# Pipeline use: streaming machine-readable progress, bounded memory, black bg:
msplat path/to/scene --preset balanced --progress-format jsonl --max-splats 3000000 -o out.ply
# {"step":1000,"total":30000,"splats":289114,"loss":0.0543,"ms_per_step":10.7}
# ...
# Done: 30000 iters, 2.1M Gaussians, PSNR 27.3, wrote /abs/out.ply

Progress lines stream line-buffered even when piped; SIGINT/SIGTERM save a partial *_interrupted.ply and exit 130/143; exit codes are deterministic (0 ok · 3 load · 5 write). Defaults to a black background (--debug-bg for the magenta debug view). The CLI is OpenSplat-compatible (positional path + additive flags), so it drops into existing 3DGS pipelines. Full reference: man docs/msplat.1.
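
A pipeline can consume the jsonl progress stream with a few lines of Python. The sketch below relies only on the fields visible in the sample line above (step, total, ms_per_step) and on the exit codes listed in the preceding paragraph. The helper names are hypothetical, and any line that is not a JSON object is skipped.

```python
import json

# Exit codes as listed above; 130/143 follow SIGINT/SIGTERM.
EXIT_MEANING = {
    0: "ok",
    3: "load error",
    5: "write error",
    130: "interrupted (SIGINT)",
    143: "terminated (SIGTERM)",
}


def parse_progress(line):
    """Return the progress record for a jsonl line, or None for other output."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "step" not in record:
        return None
    return record


def eta_seconds(record):
    """Rough time remaining, assuming ms_per_step stays constant."""
    remaining_steps = record["total"] - record["step"]
    return remaining_steps * record["ms_per_step"] / 1000.0


def describe_exit(code):
    return EXIT_MEANING.get(code, f"unexpected exit code {code}")
```

In practice these helpers would be fed from a subprocess pipe attached to msplat. Whether progress is written to stdout or stderr is not stated here, so check man docs/msplat.1 before wiring it up.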

Swift

Requires Xcode and CMake (brew install cmake).

// Package.swift
dependencies: [
    .package(url: "https://github.com/SeedeXR/msplat.git", from: "1.2.0")
]

Build the XCFramework (one-time, from repo root):

./scripts/build-xcframework.sh

import Msplat

let dataset = GaussianDataset(path: "path/to/colmap/", downscaleFactor: 4.0)
let trainer = GaussianTrainer(dataset: dataset)

for _ in 0..<1000 {
    let stats = trainer.step()
    print("step=\(stats.iteration) splats=\(stats.splatCount)")
}

trainer.exportPly(to: "output.ply")

// Render from arbitrary viewpoints
let pose = dataset.cameraPose(at: 0)  // [Float] cam-to-world matrix
let img = trainer.renderFromPose(camToWorld: pose)

C++ CLI

./scripts/build.sh                  # Release build + Metal-toolchain check + unit tests
./build/msplat path/to/dataset -n 7000 --eval

Or directly with CMake:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build            # C++ unit tests (doctest)

First-time Metal toolchain (recent Xcode ships it as a separate component):

xcodebuild -downloadComponent MetalToolchain

Build from source

git clone https://github.com/SeedeXR/msplat.git && cd msplat

# Python
pip install -e .

# C++ CLI + static lib
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Swift XCFramework
./scripts/build-xcframework.sh
cd swift && swift build

Requires macOS 14+, Apple Silicon. No external dependencies.

Datasets

Training scenes live in a Hugging Face dataset, not in this repo: alexmkwizu/gaussian_training_datasets (Mip-NeRF 360, Tanks & Temples, Deep Blending — COLMAP layout). A small garden scene is included in-repo for quickstart/CI.

Create a datasets/ folder and download everything from Hugging Face (uses the hf CLI from huggingface_hub):

pip install -U "huggingface_hub[cli]"
hf download alexmkwizu/gaussian_training_datasets --repo-type dataset --local-dir datasets
# → datasets/mipnerf360/<scene>/, datasets/tandt/<scene>/, datasets/db/<scene>/

Grab a single scene instead of everything:

hf download alexmkwizu/gaussian_training_datasets --repo-type dataset \
    --include "tandt/truck/*" --local-dir datasets
msplat datasets/tandt/truck -n 7000 --eval        # small images → train at native -d 1

Resolution matters. Pick --downscale-factor by the native image size, aiming for a ~1 MP render. Mip-NeRF 360 ships ~16 MP images → use -d 4; Tanks & Temples / Deep Blending ship ~1 MP images → use -d 1. The CLI warns if the render is too small (over-downscaling small images destabilizes training).
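
The rule of thumb can be written down as arithmetic: a downscale factor d shrinks each image side by d, so the pixel count falls by d². The Python helper below is a hypothetical sketch of that reasoning, not the CLI's own heuristic, and the image sizes in the usage lines are illustrative.

```python
import math


def suggest_downscale(width, height, target_megapixels=1.0):
    """Pick an integer factor that brings the render close to the target size."""
    megapixels = width * height / 1e6
    # Pixel count scales with 1/d^2, so d is about sqrt(native / target).
    factor = round(math.sqrt(megapixels / target_megapixels))
    return max(1, factor)  # never upscale


# Illustrative sizes: a ~16 MP capture and a ~1 MP capture.
large = suggest_downscale(4946, 3286)
small = suggest_downscale(1000, 1000)
```

Under these assumptions the ~16 MP image maps to a factor of 4 and the ~1 MP image stays at 1, matching the guidance above.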

Everything under datasets/ except the bundled garden is git-ignored — datasets are cached locally and pulled from Hugging Face, never committed to this repo.

Adding a new dataset (push to Hugging Face)

Put the scene under datasets/<group>/<scene>/ (COLMAP sparse/0/ + images/), then upload it to the dataset repo (needs write access — hf auth login):

hf upload alexmkwizu/gaussian_training_datasets \
    datasets/tandt/mynewscene tandt/mynewscene --repo-type dataset

Do not git add datasets into this repo — they belong on Hugging Face.

Pre-trained splats — download, view & push

Ready-made .ply splats trained by msplat (7 Mip-NeRF 360 + Tanks & Temples + Deep Blending scenes, indoor PSNR 27–30) live under tested_outputs/ in the same dataset repo:

hf download alexmkwizu/gaussian_training_datasets --repo-type dataset \
    --include "tested_outputs/*" --local-dir .

These are standard 3DGS PLYs — drag any .ply into a web viewer to inspect:

SuperSplat — https://superspl.at/editor (no install; also cleans/edits splats)
antimatter15/splat — https://antimatter15.com/splat/ (expects .splat; produce one with msplat <scene> -o out.splat, or export from your own training run)

Or render from any pose with the Python/Swift API (render_from_pose). Per-scene metrics + how each was trained: tested_outputs/SUMMARY.md.

Push your own trained splats (through hf; needs write access — hf auth login):

# one file
hf upload alexmkwizu/gaussian_training_datasets out.ply tested_outputs/myscene.ply \
    --repo-type dataset
# or a whole local output folder → tested_outputs/
hf upload alexmkwizu/gaussian_training_datasets my_outputs tested_outputs \
    --repo-type dataset --commit-message "Add myscene splat"

Documentation

docs/ — getting started, datasets, building & testing, internals, and the optimization roadmap.
man docs/msplat.1 — full CLI reference: every flag, exit codes, signals, environment variables, and examples.
docs/benchmarks/ — dated measurement ledger (per-stage GPU profile, memory footprint) on a base M4 / 16 GB.

Benchmarks

Mip-NeRF 360 scenes on an M4 Max. msplat runs 7K iterations with no downscales:

msplat-train path/to/scene -n 7000 --num-downscales 0 --eval

Scene	msplat PSNR	msplat SSIM	msplat wall time	gsplat PSNR	gsplat SSIM	gsplat wall time
bicycle	23.23	0.602	59s	23.71	0.668	~335s
counter	27.45	0.880	80s	27.14	0.878	~335s
garden	25.68	0.783	77s	26.30	0.833	~335s
room	30.12	0.897	74s	29.21	0.893	~335s

30K iterations (garden)

msplat-train path/to/garden -n 30000 --num-downscales 0 --eval

Metric	msplat	gsplat
PSNR	27.14	27.32
SSIM	0.853	0.865
Gaussians	3.51M	—
Wall time	700s	~2149s

gsplat numbers are from docs.gsplat.studio (measured on a TITAN RTX), so the wall-time columns compare different hardware. gsplat wall times are the reported average across all Mip-NeRF 360 scenes (per-scene times are not published).

Performance history (wall time, M4 Max)

Scene	v1.0	v1.1.3	Speedup
bicycle 7K	82s	59s	1.39x
counter 7K	91s	80s	1.14x
garden 7K	107s	77s	1.39x
room 7K	85s	74s	1.15x
garden 30K	1039s	700s	1.48x

v1.1.3 fuses SH backward gradients into Adam optimizer updates, fuses the SSIM vertical-forward and horizontal-backward passes into a single kernel, and replaces the count→prefix-sum→scatter intersection pipeline with pre-allocated per-tile bins. Speedup scales with gaussian count.

Validated on a base M4 / 16 GB

The full dataset suite (all 7 Mip-NeRF 360 + Tanks & Temples + Deep Blending) trains end-to-end on a 16 GB MacBook Pro (M4). Indoor scenes reach PSNR 27–30; resident memory stays ~2–8 GB. Choose --downscale-factor by native image size (Mip-NeRF 360 ~16 MP → -d 4; Tanks & Temples / Deep Blending ~1 MP → -d 1). Over-downscaling small images destabilizes training, and the CLI warns when the render is too small. Full-resolution Mip-NeRF 360 (-d 1) needs more than 16 GB because images are decoded up-front. See docs/benchmarks/ and docs/datasets.md.

v1.2 adds the pipeline-friendly CLI (progress/jsonl, presets, --max-splats, signals, exit codes, man page, completions) and robustness fixes (resume-step, PLY validation, coarse-render warning, NaN guard). Training speed is unchanged.

License

Apache 2.0
