feat(trt): 30s decoder profile for the loop-focused workflow by leszko · Pull Request #265 · daydreamlive/DEMON

leszko · 2026-06-15T14:17:26Z

Draft — the 30s decoder is built and validated on a 5090, but the point of this PR is to enable an ear test of 30s loop quality on the real engine path before we commit to a default. Promote out of draft once 30s sounds good live.

What

Acts on the latent-size experiment finding (a 20–30 s loop window is viable; 60 s isn't required). Registers a 30 s TRT decoder profile so the loop-focused / short-source path runs on a decoder opt-tuned for a 750-frame window instead of borrowing the 60 s engine.

One-line runtime change: a 30.0 entry in _TRT_ENGINE_PROFILES (acestep/paths.py) + docs.

Why this is the whole change

seq_min = 126 on the existing engines, so the 60 s decoder already accepts a 750-frame (30 s) input — 30 s "worked" before, just on an engine opt-tuned for 1500 frames and reserving 60 s of workspace. This profile gives the window its own decoder (seq_opt = seq_max = 750).
The walk-window path (acestep/streaming/session.py) resolves decoder + vae_decode at walk_window_s while keeping vae_encode full-song. So with this profile, walk_window_s = 30 now picks a real 30 s decoder. Sources ≤ 30 s pick it directly too.
Decoder-only on purpose. The profile reuses the existing 60 s VAE engines: the walk path keeps vae_encode at full-song size, and the 60 s vae_decode already accepts ≤ 1500-frame inputs, so 750-frame decodes need no dedicated VAE engine. A sub-60 s VAE engine also isn't buildable today — it fails the TRT optimization-profile check on Blackwell (documented in docs/TRT.md).
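The frame arithmetic behind these points can be sketched as below. This is an illustration, not code from the repo: the 25 frames/s rate is inferred from the 750-frame / 30 s and 1500-frame / 60 s figures above, `EngineShape` and `frames_for` are hypothetical names, and the 60 s decoder's seq_max and the 30 s profile's seq_min are assumed.

```python
from dataclasses import dataclass

# Assumption: 25 latent frames per second, inferred from 750 frames = 30 s
# and 1500 frames = 60 s as quoted in this PR.
FRAMES_PER_SECOND = 25


@dataclass(frozen=True)
class EngineShape:
    """Hypothetical stand-in for an engine's accepted sequence range."""
    seq_min: int
    seq_opt: int
    seq_max: int

    def accepts(self, frames: int) -> bool:
        # An input fits if its frame count lies inside [seq_min, seq_max].
        return self.seq_min <= frames <= self.seq_max


def frames_for(seconds: float) -> int:
    """Convert a window length in seconds to a latent frame count."""
    return round(seconds * FRAMES_PER_SECOND)


# seq_min = 126 and seq_opt = 1500 (60 s) / 750 (30 s) come from the PR text.
# seq_max = 1500 for the 60 s decoder and seq_min = 126 for the 30 s profile
# are assumptions made for this sketch.
DECODER_60S = EngineShape(seq_min=126, seq_opt=1500, seq_max=1500)
DECODER_30S = EngineShape(seq_min=126, seq_opt=750, seq_max=750)

window = frames_for(30)
print(window, DECODER_60S.accepts(window), DECODER_30S.accepts(window))
print(DECODER_30S.accepts(frames_for(60)))
```

Under these assumptions a 750-frame window fits both engines, which is why 30 s already ran on the 60 s decoder; the new profile only changes which shape the engine is tuned for.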

Build (binaries are not in the repo)

python -m acestep.engine.trt.build --all --decoder-only --duration 30

Engine lands under ~/.daydream-scope/.../trt_engines/spectral_decoder_mixed_refit_b8_30s/, same as the 60s/120s/240s engines. Built in ~48 s here (reuses the cached decoder ONNX).

Validation (RTX 5090)

30 s decoder loads via TRT, generates a [1, 750, 64] latent, decodes OK, LoRA refit works.
Profile resolution: ≤30s → 30s engine, 31–60s → 60s (existing paths unchanged).
30 s window vs the same 750-frame window on the 60 s engine: ~12% faster warm generate (0.121 s vs 0.137 s), ~0.25 GB less process VRAM (15.88 vs 16.13 GB).
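The resolution rule in the second bullet amounts to "pick the smallest profile that covers the duration". A minimal sketch follows, assuming a flat tuple of durations; the real table is _TRT_ENGINE_PROFILES in acestep/paths.py and its structure may differ. `PROFILE_DURATIONS_S` and `resolve_profile` are hypothetical names, and the behaviour above the largest profile is a guess.

```python
from typing import Optional

# Hypothetical list of profile durations in seconds. The 30/60/120/240 values
# are the engine sizes named in this PR; the container shape is an assumption.
PROFILE_DURATIONS_S = (30.0, 60.0, 120.0, 240.0)


def resolve_profile(duration_s: float) -> Optional[float]:
    """Return the smallest profile covering duration_s, or None if none does."""
    for profile_s in sorted(PROFILE_DURATIONS_S):
        if duration_s <= profile_s:
            return profile_s
    # Assumption: nothing is returned for inputs longer than every profile;
    # the real resolve() may behave differently here.
    return None


for duration in (12.0, 30.0, 31.0, 60.0):
    print(duration, "->", resolve_profile(duration))
```

Without the 30.0 entry the same loop would send every duration up to 60 s to the 60 s engine, which is the prior behaviour described above.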

Honest scope: the deltas are modest — the dominant VRAM lever was always vae_encode (which the walk path keeps full-size), and per-pass generate at 750 frames is already fast. The real prize from the experiment is quality at a shorter loop (shorter sections sustain musical content down to ~15–20 s before dead-space rises), which is a listening call.
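For reference, the quoted deltas can be reproduced from the raw measurements. The numbers are the ones reported in the validation above; the variable names are illustrative only.

```python
# Warm generate time in seconds and process VRAM in GB, as reported in this PR.
warm_30s, warm_60s = 0.121, 0.137
vram_30s, vram_60s = 15.88, 16.13

# Relative time saved by the 30 s engine, measured against the 60 s engine.
time_saved_pct = (warm_60s - warm_30s) / warm_60s * 100
# Absolute VRAM difference between the two runs.
vram_saved_gb = vram_60s - vram_30s

print(f"{time_saved_pct:.1f}% less time, {vram_saved_gb:.2f} GB less VRAM")
```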

How to ear-test 30 s

In demos/realtime_motion_graph_web/web/public/config.json:

{ "engine": { "walk_window": true, "walk_window_s": 30 } }

Load a source longer than 30 s and listen to the looped 30 s sections. (Pairs naturally with the loop-phrase toggle in #264.)
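If you would rather script the config change than edit the file by hand, something like the following should work. It is a sketch: `enable_walk_window` is a hypothetical helper, not part of the repo, and it assumes config.json is plain JSON with an optional top-level "engine" object.

```python
import json
from pathlib import Path


def enable_walk_window(config_path: Path, window_s: float = 30) -> dict:
    """Set engine.walk_window and engine.walk_window_s, keeping other keys."""
    # Start from the existing config if there is one, otherwise from scratch.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    engine = config.setdefault("engine", {})
    engine["walk_window"] = True
    engine["walk_window_s"] = window_s
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Example call, run from the repo root:
# enable_walk_window(Path("demos/realtime_motion_graph_web/web/public/config.json"))
```

Other keys in the file are left untouched, so the change is easy to revert by flipping walk_window back to false.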

Follow-ups (not in this PR)

XL-turbo 30 s decoder (the _XL_TURBO_TRT_ENGINE_PROFILES table) — resolve() falls back to 60 s for XL until built.
If 30 s wins by ear, consider a 20 s profile and/or making walk_window_s a first-class UI control.

🤖 Generated with Claude Code

Acts on the latent-size experiment finding (loop window of 20-30s is viable). Adds a 30.0 TRT engine profile so the loop-focused / short-source path runs on a decoder opt-tuned for a 750-frame (30s) window instead of borrowing the 60s engine.

- The walk-window path (acestep/streaming/session.py) already resolves decoder + vae_decode at walk_window_s; with this profile, walk_window_s=30 now picks a real 30s decoder. Sources <=30s also pick it directly.
- Decoder-only profile: it reuses the existing 60s VAE engines. The walk path keeps vae_encode at full-song size, and the 60s vae_decode accepts <=1500-frame inputs, so 750-frame decodes need no dedicated engine. A sub-60s VAE engine is also not buildable today (TRT optimization-profile failure on Blackwell, see docs/TRT.md).
- Build: python -m acestep.engine.trt.build --all --decoder-only --duration 30
  Engine binaries live under ~/.daydream-scope (not in the repo), same as the 60s/120s/240s engines.

Validated on a 5090: 30s decoder loads via TRT, generates a 750-frame latent, decodes OK, LoRA refit works. vs the same 750-frame window on the 60s engine: ~12% faster warm generate (0.121s vs 0.137s), ~0.25 GB less process VRAM. resolve(): <=30s -> 30s engine, 31-60s -> 60s (unchanged).

Quality at a 30s loop window is the ear test this unblocks (set walk_window=true, walk_window_s=30 in web/public/config.json).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

leszko wants to merge 1 commit into main from rafal/feat/trt-30s-loop-profile
