feat(trt): 30s decoder profile for the loop-focused workflow#265
Draft
leszko wants to merge 1 commit into
Acts on the latent-size experiment finding (loop window of 20-30s is viable). Adds a 30.0 TRT engine profile so the loop-focused / short-source path runs on a decoder opt-tuned for a 750-frame (30s) window instead of borrowing the 60s engine.

- The walk-window path (acestep/streaming/session.py) already resolves decoder + vae_decode at walk_window_s; with this profile, walk_window_s=30 now picks a real 30s decoder. Sources <=30s also pick it directly.
- Decoder-only profile: it reuses the existing 60s VAE engines. The walk path keeps vae_encode at full-song size, and the 60s vae_decode accepts <=1500-frame inputs, so 750-frame decodes need no dedicated engine. A sub-60s VAE engine is also not buildable today (TRT optimization-profile failure on Blackwell, see docs/TRT.md).
- Build: python -m acestep.engine.trt.build --all --decoder-only --duration 30
  Engine binaries live under ~/.daydream-scope (not in the repo), same as the 60s/120s/240s engines.

Validated on a 5090: 30s decoder loads via TRT, generates a 750-frame latent, decodes OK, LoRA refit works. vs the same 750-frame window on the 60s engine: ~12% faster warm generate (0.121s vs 0.137s), ~0.25 GB less process VRAM.

resolve(): <=30s -> 30s engine, 31-60s -> 60s (unchanged).

Quality at a 30s loop window is the ear test this unblocks (set walk_window=true, walk_window_s=30 in web/public/config.json).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Acts on the latent-size experiment finding (a 20–30 s loop window is viable; 60 s isn't required). Registers a 30 s TRT decoder profile so the loop-focused / short-source path runs on a decoder opt-tuned for a 750-frame window instead of borrowing the 60 s engine.
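The selection behavior this PR describes (sources ≤ 30 s pick the 30 s engine, 31–60 s stay on the 60 s engine) can be illustrated with a small sketch. This is not the repository's `resolve()` implementation: `PROFILE_DURATIONS_S`, `frames_for`, and `resolve_profile` are hypothetical names, the 25 frames-per-second rate is inferred from the 750-frame / 30 s and 1500-frame / 60 s figures quoted here, and the fallback for inputs longer than the largest profile is an assumption.

```python
from bisect import bisect_left

# Assumption: 25 latent frames per second, inferred from the PR's figures
# (750 frames for 30 s, 1500 frames for 60 s).
FRAMES_PER_SECOND = 25

# Hypothetical stand-in for the real profile table. Only the durations
# (30/60/120/240 s) come from the PR text.
PROFILE_DURATIONS_S = [30.0, 60.0, 120.0, 240.0]


def frames_for(duration_s: float) -> int:
    """Number of latent frames for a window of the given length."""
    return round(duration_s * FRAMES_PER_SECOND)


def resolve_profile(duration_s: float) -> float:
    """Pick the smallest profile duration that covers duration_s.

    Clamping to the largest profile for longer inputs is an assumption
    made for this sketch, not documented behavior.
    """
    idx = bisect_left(PROFILE_DURATIONS_S, duration_s)
    return PROFILE_DURATIONS_S[min(idx, len(PROFILE_DURATIONS_S) - 1)]


if __name__ == "__main__":
    for seconds in (20, 30, 31, 60):
        print(seconds, resolve_profile(seconds), frames_for(seconds))
```

Before this profile existed, the same lookup over a table starting at 60 s would send every short source to the 60 s engine, which is the "borrowing" the description refers to.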
One-line runtime change: a 30.0 entry in _TRT_ENGINE_PROFILES (acestep/paths.py) + docs.

Why this is the whole change
- seq_min = 126 on the existing engines, so the 60 s decoder already accepts a 750-frame (30 s) input. 30 s "worked" before, just on an engine opt-tuned for 1500 frames and reserving 60 s of workspace. This profile gives the window its own decoder (seq_opt = seq_max = 750).
- The walk-window path (acestep/streaming/session.py) resolves decoder + vae_decode at walk_window_s while keeping vae_encode full-song. So with this profile, walk_window_s = 30 now picks a real 30 s decoder. Sources ≤ 30 s pick it directly too.
- Decoder-only profile: the walk path keeps vae_encode at full-song size, and the 60 s vae_decode already accepts ≤ 1500-frame inputs, so 750-frame decodes need no dedicated VAE engine. A sub-60 s VAE engine also isn't buildable today: it fails the TRT optimization-profile check on Blackwell (documented in docs/TRT.md).

Build (binaries are not in the repo)
python -m acestep.engine.trt.build --all --decoder-only --duration 30

Engine lands under ~/.daydream-scope/.../trt_engines/spectral_decoder_mixed_refit_b8_30s/, same as the 60s/120s/240s engines. Built in ~48 s here (reuses the cached decoder ONNX).

Validation (RTX 5090)
- 30 s decoder loads via TRT, generates a [1, 750, 64] latent, decodes OK, LoRA refit works.
- vs the same 750-frame window on the 60 s engine: ~12% faster warm generate (0.121 s vs 0.137 s), ~0.25 GB less process VRAM.
- resolve(): ≤30s → 30s engine, 31–60s → 60s (existing paths unchanged).

Honest scope: the deltas are modest. The dominant VRAM lever was always vae_encode (which the walk path keeps full-size), and per-pass generate at 750 frames is already fast. The real prize from the experiment is quality at a shorter loop (shorter sections sustain musical content down to ~15–20 s before dead-space rises), which is a listening call.

How to ear-test 30 s
In demos/realtime_motion_graph_web/web/public/config.json:

{ "engine": { "walk_window": true, "walk_window_s": 30 } }

Load a source longer than 30 s and listen to the looped 30 s sections. (Pairs naturally with the loop-phrase toggle in #264.)
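For switching the window on and off between listening passes, a small helper can set the two keys without touching the rest of the file. This is only a sketch: `enable_walk_window` is a hypothetical helper, not part of the repository, and it assumes the config is a plain JSON object with an optional `engine` section as shown above.

```python
import json
from pathlib import Path


def enable_walk_window(config_path, window_s=30):
    """Set engine.walk_window and engine.walk_window_s, keeping other keys.

    Hypothetical helper: it assumes the file holds a single JSON object
    and creates the "engine" section if it is missing.
    """
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    engine = config.setdefault("engine", {})
    engine["walk_window"] = True
    engine["walk_window_s"] = window_s
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `enable_walk_window("demos/realtime_motion_graph_web/web/public/config.json", 30)` from the repository root would apply the setting described above. Passing a different `window_s` only has a dedicated engine behind it if a profile for that duration has been built.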
Follow-ups (not in this PR)
- 30 s profile for XL (_XL_TURBO_TRT_ENGINE_PROFILES table): resolve() falls back to 60 s for XL until built.
- Make walk_window_s a first-class UI control.

🤖 Generated with Claude Code