Fix per-iteration memory growth in training loop (missing @autoreleasepool) #6

Open
imdangernoodle wants to merge 1 commit into rayanht:main from imdangernoodle:fix/training-loop-autoreleasepool

@imdangernoodle

Problem

Long training runs are OOM-killed partway through. On a cloud of roughly 800k-1.5M gaussians, msplat -n 30000 dies around step 12k with no error message and no saved scene: the process is SIGKILL'd by macOS memory pressure.

The tell that it's a per-step leak rather than cloud growth: gaussian count stays stable (~800k) while RSS climbs linearly with step count until the kill. The death step scales with -d downscale (≈3k at full res, ≈9-12k at -d 2) — consistent with a fixed per-iteration allocation that's never freed.
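A rough sanity check of that reading, as a model rather than anything measured in this PR: assume the bytes leaked per step, ℓ, scale with the rendered pixel count, that -d 2 halves each image dimension, and that B is the (unknown) memory headroom before macOS kills the process. The kill step is then about B/ℓ, and downscaling by d should push it out by d²:

$$
N_{\text{kill}}(d) \approx \frac{B}{\ell(d)}, \qquad \ell(d) = \frac{\ell(1)}{d^{2}} \quad\Rightarrow\quad \frac{N_{\text{kill}}(2)}{N_{\text{kill}}(1)} \approx 4
$$

With the observed ≈3k at full res this predicts ≈12k at -d 2, the upper end of the observed ≈9-12k. Any part of the leak that does not scale with resolution (command buffers, encoders) would pull the ratio below 4.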

Cause

The training loop in cli/msplat.cpp creates autoreleased Metal objects every iteration (command buffers, encoders, transient textures via the msplat_* entry points) but there's no per-iteration autorelease pool to drain them. They accumulate for the entire run.

Fix

Renamed cli/msplat.cpp → cli/msplat.mm (the project already declares LANGUAGES CXX OBJCXX, so no CMake language changes beyond the source filename) and wrapped the per-iteration loop body in @autoreleasepool { … }. Each step's autoreleased objects now drain at the end of that step.
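To illustrate the mechanism, here is a toy model in plain C++. It is not Metal or Objective-C++ code and is not taken from cli/msplat.mm. `Transient`, `Pool`, `train_step`, and `peak_live` are hypothetical names, and `Pool` only mimics the one property that matters here: objects handed to a pool stay alive until the pool itself goes away.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Number of toy transient objects currently alive.
static std::size_t g_live = 0;

// Hypothetical stand-in for an autoreleased object (e.g. a command buffer).
struct Transient {
    Transient() { ++g_live; }
    ~Transient() { --g_live; }
    Transient(const Transient&) = delete;
    Transient& operator=(const Transient&) = delete;
};

// Toy model of an autorelease pool: everything handed to it is kept alive
// until the pool is destroyed, then released all at once.
class Pool {
public:
    void autorelease(std::unique_ptr<Transient> obj) {
        held_.push_back(std::move(obj));
    }
private:
    std::vector<std::unique_ptr<Transient>> held_;
};

// Hypothetical training step that creates three transient objects.
void train_step(Pool& pool) {
    for (int i = 0; i < 3; ++i) {
        pool.autorelease(std::unique_ptr<Transient>(new Transient()));
    }
}

// Runs `steps` iterations and returns the peak number of live transients.
// pool_per_step == false models the old loop (one pool for the whole run);
// pool_per_step == true models wrapping the loop body in a per-step pool.
std::size_t peak_live(int steps, bool pool_per_step) {
    std::size_t peak = 0;
    Pool outer;  // drained only when the run ends
    for (int s = 0; s < steps; ++s) {
        if (pool_per_step) {
            Pool inner;  // drained at the end of this iteration
            train_step(inner);
            peak = std::max(peak, g_live);
        } else {
            train_step(outer);
            peak = std::max(peak, g_live);
        }
    }
    return peak;
}
```

In this model, `peak_live(1000, false)` returns 3000 and `peak_live(1000, true)` returns 3: with a single long-lived pool the live count grows linearly with the step count, while a per-step pool keeps it at one step's worth of objects.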

Verification

Same dataset + command that previously died at step ~12100 now runs to completion at 30000 and saves:

Opacity reset at step 3100
Opacity reset at step 6100
Opacity reset at step 9100
Opacity reset at step 12100      # <- prior runs died here
Saved splat.ply
  PSNR:  18.23  SSIM:  0.81  Gaussians: 801851

RSS holds flat across the run instead of climbing to the kill ceiling.

Notes

  • One-line behavioural change (the @autoreleasepool wrapper) + a file rename so it compiles as Obj-C++.
  • No change to training math or output format.

…rowth

The training loop in cli/msplat created autoreleased Metal objects
(command buffers, encoders, transient textures) every step but never
drained them — they accumulated for the entire run and OOM-killed the
process around step ~12k on a ~800k-1.5M gaussian cloud (no error, no
saved scene, just SIGKILL from memory pressure).

Renamed cli/msplat.cpp → cli/msplat.mm (project already enables OBJCXX)
and wrapped the per-iteration loop body in @autoreleasepool so each
step's autoreleased objects drain at the end of that step instead of
living until process exit.

Repro before fix: `msplat -n 30000 -d 2 <407-frame-nerfstudio>` dies
~step 12100. Death step scaled with downscale (3100 at full res, 9100
at -d 2 earlier, 12100 with a smaller seed) — classic signature of a
fixed per-step leak rather than cloud growth (gaussian count was stable
at ~800k while memory still climbed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rayanht

rayanht commented May 21, 2026

Owner

Hey, thanks for the contribution!

I'll pull this locally real quick and re-run benchmarks to ensure that the GC cycles don't impact performance too much and this will likely get merged tonight.

@TaehoKim86, I expect this will help #5
