feat(trt): 30s decoder profile for the loop-focused workflow #265

Draft
leszko wants to merge 1 commit into main from rafal/feat/trt-30s-loop-profile

@leszko leszko commented Jun 15, 2026

Collaborator

Draft — the 30s decoder is built and validated on a 5090, but the point of this PR is to enable an ear test of 30s loop quality on the real engine path before we commit to a default. Promote out of draft once 30s sounds good live.

What

Acts on the latent-size experiment finding (a 20–30 s loop window is viable; 60 s isn't required). Registers a 30 s TRT decoder profile so the loop-focused / short-source path runs on a decoder opt-tuned for a 750-frame window instead of borrowing the 60 s engine.

One-line runtime change: a 30.0 entry in _TRT_ENGINE_PROFILES (acestep/paths.py) + docs.

Why this is the whole change

  • seq_min = 126 on the existing engines, so the 60 s decoder already accepts a 750-frame (30 s) input — 30 s "worked" before, just on an engine opt-tuned for 1500 frames and reserving 60 s of workspace. This profile gives the window its own decoder (seq_opt = seq_max = 750).
  • The walk-window path (acestep/streaming/session.py) resolves decoder + vae_decode at walk_window_s while keeping vae_encode full-song. So with this profile, walk_window_s = 30 now picks a real 30 s decoder. Sources ≤ 30 s pick it directly too.
  • Decoder-only on purpose. The profile reuses the existing 60 s VAE engines: the walk path keeps vae_encode at full-song size, and the 60 s vae_decode already accepts ≤ 1500-frame inputs, so 750-frame decodes need no dedicated VAE engine. A sub-60 s VAE engine also isn't buildable today — it fails the TRT optimization-profile check on Blackwell (documented in docs/TRT.md).
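The shape-range reasoning in the bullets above can be sketched in a few lines of Python. This is an illustration only, not the code in acestep/paths.py: the DecoderProfile class, the PROFILES table layout, the resolve_duration() helper, and the 60 s seq_max of 1500 are assumptions made for the example. Only seq_min = 126, the 750-frame 30 s window, and the 1500-frame opt point of the 60 s engine come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderProfile:
    """Hypothetical stand-in for one decoder entry in the profile table."""

    seq_min: int
    seq_opt: int
    seq_max: int

    def accepts(self, frames: int) -> bool:
        # An engine can run any sequence length inside its [min, max] range.
        return self.seq_min <= frames <= self.seq_max


# Hypothetical table keyed by duration in seconds.
# The real _TRT_ENGINE_PROFILES structure may look different.
PROFILES = {
    30.0: DecoderProfile(seq_min=126, seq_opt=750, seq_max=750),
    60.0: DecoderProfile(seq_min=126, seq_opt=1500, seq_max=1500),  # seq_max assumed
}


def resolve_duration(duration_s: float) -> float:
    """Return the smallest registered profile duration that covers duration_s."""
    for key in sorted(PROFILES):
        if duration_s <= key:
            return key
    raise ValueError(f"no profile covers {duration_s} s")


if __name__ == "__main__":
    # The 60 s engine already accepted a 750-frame input before this change.
    print(PROFILES[60.0].accepts(750))
    # With the 30 s entry registered, a 30 s window resolves to its own profile.
    print(resolve_duration(30.0), resolve_duration(45.0))
```

Under these assumptions, removing the 30.0 entry makes resolve_duration(30.0) fall through to 60.0, which matches the "borrowing the 60 s engine" behaviour described above.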

Build (binaries are not in the repo)

python -m acestep.engine.trt.build --all --decoder-only --duration 30

Engine lands under ~/.daydream-scope/.../trt_engines/spectral_decoder_mixed_refit_b8_30s/, same as the 60s/120s/240s engines. Built in ~48 s here (reuses the cached decoder ONNX).

Validation (RTX 5090)

  • 30 s decoder loads via TRT, generates a [1, 750, 64] latent, decodes OK, LoRA refit works.
  • Profile resolution: ≤30s → 30s engine, 31–60s → 60s (existing paths unchanged).
  • 30 s window vs the same 750-frame window on the 60 s engine: ~12% faster warm generate (0.121 s vs 0.137 s), ~0.25 GB less process VRAM (15.88 vs 16.13 GB).
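For readers who want to check the deltas, here is a small arithmetic sketch using only the numbers quoted in the last bullet. The variable names are made up for the example. It reports the speedup two ways, since "~12% faster" could mean either time saved per pass or extra passes per second.

```python
# Warm generate time per pass, in seconds, as quoted above.
t_30s_engine = 0.121
t_60s_engine = 0.137

# Process VRAM in GB, as quoted above.
vram_30s_engine = 15.88
vram_60s_engine = 16.13

# Fraction of wall time saved per pass by the 30 s engine.
time_saved_pct = (t_60s_engine - t_30s_engine) / t_60s_engine * 100

# Equivalent gain in passes per second.
throughput_gain_pct = (t_60s_engine / t_30s_engine - 1) * 100

vram_saved_gb = vram_60s_engine - vram_30s_engine

print(f"time saved per pass: {time_saved_pct:.1f}%")
print(f"throughput gain:     {throughput_gain_pct:.1f}%")
print(f"VRAM saved:          {vram_saved_gb:.2f} GB")
```

Both readings land near the quoted ~12%, and the VRAM difference matches the quoted ~0.25 GB.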

Honest scope: the deltas are modest — the dominant VRAM lever was always vae_encode (which the walk path keeps full-size), and per-pass generate at 750 frames is already fast. The real prize from the experiment is quality at a shorter loop (shorter sections sustain musical content down to ~15–20 s before dead-space rises), which is a listening call.

How to ear-test 30 s

In demos/realtime_motion_graph_web/web/public/config.json:

{ "engine": { "walk_window": true, "walk_window_s": 30 } }

Load a source longer than 30 s and listen to the looped 30 s sections. (Pairs naturally with the loop-phrase toggle in #264.)
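If editing the file by hand is inconvenient, the same change can be scripted. The sketch below is a hypothetical helper, not part of the repo: enable_walk_window() is a name invented here, and it assumes config.json is a plain JSON object whose "engine" key, if present, holds an object. It sets only the two keys shown above and leaves everything else in the file untouched.

```python
import json
from pathlib import Path


def enable_walk_window(config_path: Path, window_s: float = 30) -> dict:
    """Set engine.walk_window and engine.walk_window_s, preserving other keys."""
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}

    engine = config.setdefault("engine", {})
    engine["walk_window"] = True
    engine["walk_window_s"] = window_s

    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Path taken from the instructions above, relative to the repo root.
    path = Path("demos/realtime_motion_graph_web/web/public/config.json")
    if path.exists():
        print(enable_walk_window(path)["engine"])
    else:
        print(f"{path} not found; run this from the repo root")
```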

Follow-ups (not in this PR)

  • XL-turbo 30 s decoder (the _XL_TURBO_TRT_ENGINE_PROFILES table) — resolve() falls back to 60 s for XL until built.
  • If 30 s wins by ear, consider a 20 s profile and/or making walk_window_s a first-class UI control.

🤖 Generated with Claude Code

Acts on the latent-size experiment finding (loop window of 20-30s is
viable). Adds a 30.0 TRT engine profile so the loop-focused / short-
source path runs on a decoder opt-tuned for a 750-frame (30s) window
instead of borrowing the 60s engine.

- The walk-window path (acestep/streaming/session.py) already resolves
  decoder + vae_decode at walk_window_s; with this profile,
  walk_window_s=30 now picks a real 30s decoder. Sources <=30s also pick
  it directly.
- Decoder-only profile: it reuses the existing 60s VAE engines. The walk
  path keeps vae_encode at full-song size, and the 60s vae_decode accepts
  <=1500-frame inputs, so 750-frame decodes need no dedicated engine. A
  sub-60s VAE engine is also not buildable today (TRT optimization-profile
  failure on Blackwell, see docs/TRT.md).
- Build: python -m acestep.engine.trt.build --all --decoder-only --duration 30
  Engine binaries live under ~/.daydream-scope (not in the repo), same as
  the 60s/120s/240s engines.

Validated on a 5090: 30s decoder loads via TRT, generates a 750-frame
latent, decodes OK, LoRA refit works. vs the same 750-frame window on the
60s engine: ~12% faster warm generate (0.121s vs 0.137s), ~0.25 GB less
process VRAM. resolve(): <=30s -> 30s engine, 31-60s -> 60s (unchanged).

Quality at a 30s loop window is the ear test this unblocks (set
walk_window=true, walk_window_s=30 in web/public/config.json).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>