Fix per-iteration memory growth in training loop (missing @autoreleasepool) #6
Open
imdangernoodle wants to merge 1 commit into
…rowth

The training loop in cli/msplat created autoreleased Metal objects (command buffers, encoders, transient textures) every step but never drained them. They accumulated for the entire run and OOM-killed the process around step ~12k on a ~800k-1.5M gaussian cloud (no error, no saved scene, just SIGKILL from memory pressure).

Renamed cli/msplat.cpp → cli/msplat.mm (project already enables OBJCXX) and wrapped the per-iteration loop body in @autoreleasepool so each step's autoreleased objects drain at the end of that step instead of living until process exit.

Repro before fix: `msplat -n 30000 -d 2 <407-frame-nerfstudio>` dies ~step 12100. Death step scaled with downscale (3100 at full res, 9100 at -d 2 earlier, 12100 with a smaller seed), the classic signature of a fixed per-step leak rather than cloud growth (gaussian count was stable at ~800k while memory still climbed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner
Hey, thanks for the contribution! I'll pull this locally real quick and re-run benchmarks to ensure that the GC cycles don't impact performance too much, and this will likely get merged tonight. @TaehoKim86, I expect this will help #5.
Problem
Long training runs OOM-crash partway through. On a ~800k-1.5M gaussian cloud, `msplat -n 30000` dies around step ~12k with no error message and no saved scene: the process is SIGKILL'd by macOS memory pressure.

The tell that it's a per-step leak rather than cloud growth: gaussian count stays stable (~800k) while RSS climbs linearly with step count until the kill. The death step scales with `-d` downscale (≈3k at full res, ≈9-12k at `-d 2`), consistent with a fixed per-iteration allocation that's never freed.
Cause
The training loop in `cli/msplat.cpp` creates autoreleased Metal objects every iteration (command buffers, encoders, transient textures via the `msplat_*` entry points) but there's no per-iteration autorelease pool to drain them. They accumulate for the entire run.
Fix
Renamed `cli/msplat.cpp` → `cli/msplat.mm` (the project already declares `LANGUAGES CXX OBJCXX`, so no CMake language changes beyond the source filename) and wrapped the per-iteration loop body in `@autoreleasepool { … }`. Each step's autoreleased objects now drain at the end of that step.
Verification
Same dataset + command that previously died at step ~12100 now runs to completion at 30000 and saves:
RSS holds flat across the run instead of climbing to the kill ceiling.
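The leak signature described under Problem, and the flat RSS reported here, fit a simple linear model. This is a back-of-the-envelope sketch rather than a measurement: $R_0$ (baseline RSS), $b$ (memory leaked per step) and $M$ (the RSS at which macOS kills the process) are symbols introduced here for illustration only.

$$
\mathrm{RSS}(t) \approx R_0 + b\,t
\qquad\Longrightarrow\qquad
t_{\text{kill}} \approx \frac{M - R_0}{b}
$$

If the headroom $M - R_0$ were identical across runs, the kill steps quoted in the commit message would imply a per-step leak at full resolution roughly $9100 / 3100 \approx 2.9$ to $12100 / 3100 \approx 3.9$ times larger than at `-d 2`. The headroom was probably not identical (the 12100 run also used a smaller seed), so these ratios are indicative only. They do suggest that the leak shrinks with image size, as one would expect if per-frame transient resources dominate it. With the per-step pool in place, $b \approx 0$ and $t_{\text{kill}}$ moves beyond the 30000-step run.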
Notes
`@autoreleasepool` wrapper) + a file rename so it compiles as Obj-C++.
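For readers unfamiliar with the pattern, the wrapper described under Fix has roughly the following shape. This is a minimal sketch, not the actual diff: `run_step`, `train` and the `PER_STEP_POOL` macro are hypothetical names introduced here. The real change presumably writes `@autoreleasepool` directly in `cli/msplat.mm`; the macro guard exists only so the sketch also builds as plain C++ off macOS.

```cpp
// Sketch only. run_step() is a hypothetical stand-in for one training
// iteration; it is not a function from msplat.

#if defined(__OBJC__)
// Built as Objective-C++ (.mm): open a real autorelease pool around each step.
#define PER_STEP_POOL @autoreleasepool
#else
// Built as plain C++: there is no pool, so the braces below are an ordinary block.
#define PER_STEP_POOL
#endif

// Placeholder for the work of one iteration. In the real loop this is where
// autoreleased Metal objects (command buffers, encoders, ...) get created.
static void run_step(int step) {
    (void)step;
}

// Runs num_steps iterations and returns how many completed.
int train(int num_steps) {
    int completed = 0;
    for (int step = 0; step < num_steps; ++step) {
        PER_STEP_POOL {
            run_step(step);
            ++completed;
        }  // In Objective-C++ the pool drains here, once per step.
    }
    return completed;
}
```

The point of scoping the pool to the loop body is that objects autoreleased during step N are released before step N+1 starts, so peak memory is bounded by one step's worth of transient objects rather than by the length of the run.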