feat: text-to-music mode (generate from prompt alone, no input audio) #255

Open
leszko wants to merge 1 commit into main from rafal/feat/text2music

@leszko leszko commented Jun 12, 2026

Collaborator

Summary

Adds a text-to-music mode: generate music in realtime from the text prompt alone, with no input audio — selectable at session start and switchable mid-session, in both the backend and the web app.

ACE-Step's trained "no reference audio" signal is the checkpoint's canonical silence latent (the model forward uses it to simulate text2music mode), so a text-only session is a normal streaming session whose source latent and structure context are the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes — only session construction and wiring.

Changes

Contract (registry-first):

  • SessionConfig gains text2music: bool and text2music_duration_s: float = 60.0 — auto-projected into GET /api/protocol and the generated TS types.
  • swap_source command gains a text2music field (no binary PCM frame, mirrors use_server_source).
  • wireContract.gen.ts regenerated.
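
As a rough illustration of the contract change, the two new fields might sit on a dataclass-style config like this. Only the field names, their types, and the 60.0 default come from this PR; the `False` default for `text2music`, the `prompt` field, and the `to_protocol()` helper are assumptions made for the sketch, not the registry's actual projection code.

```python
from dataclasses import dataclass, asdict


@dataclass
class SessionConfig:
    # Hypothetical stand-in for the real SessionConfig; only the two
    # text2music fields below are taken from this PR.
    prompt: str = ""
    text2music: bool = False  # default value is an assumption
    text2music_duration_s: float = 60.0

    def to_protocol(self) -> dict:
        # Hypothetical helper: a flat dict that a protocol endpoint could serve.
        return asdict(self)


config = SessionConfig(prompt="warm lo-fi house", text2music=True)
payload = config.to_protocol()
```

Because the fields live on the config object, anything generated from it (protocol description, TS types) picks them up without a second definition.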

Engine (acestep/streaming):

  • source.py: adds two functions. text2music_waveform() returns a silent placeholder that seeds the playback ring (the user hears generated slices stream in over silence). resolve_text2music_source() builds a canonical-silence PreparedSource via EmptyLatent, with fixed conditioning defaults of 120 BPM, C major, and time signature 4. It skips VAE encode, semantic extraction, librosa beat-tracking (which returns 0 BPM on silence and would poison the text conditioning), CNN key detection, and stem extraction.
  • session.py: create() and the swap path branch on the flag; swap_source() accepts text2music.
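
For readers unfamiliar with the placeholder idea, here is a minimal sketch of a silent waveform of a given duration. It is not the code in source.py: the function names, the 48 kHz stereo default, and the list-of-lists layout are all assumptions chosen to keep the example dependency-free.

```python
def silent_waveform(duration_s: float, sample_rate: int = 48_000,
                    channels: int = 2) -> list:
    """Return `channels` lists of zeros covering `duration_s` seconds.

    Hypothetical helper; sample rate and channel count are assumptions.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    num_samples = int(round(duration_s * sample_rate))
    return [[0.0] * num_samples for _ in range(channels)]


def rms(samples: list) -> float:
    """Root-mean-square level, a cheap check that a buffer is silent."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


placeholder = silent_waveform(duration_s=1.0)
```

The same `rms` check, applied to generated slices, is the kind of measurement behind the "non-silent slices out" observation in the testing notes.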

Transport + SDK:

  • ws_adapter.py synthesizes the silent source server-side for both the init handshake and t2m swaps; no audio frame crosses the wire.
  • web/sdk/protocol.ts: connect() skips the PCM frame when config.text2music is set; new sendSwapTextToMusic().
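
The handshake shape can be sketched as follows. The real client lives in web/sdk/protocol.ts; this Python version only illustrates the branching, and the frame order (JSON config first, binary PCM second) and the `build_handshake_frames` name are assumptions.

```python
import json
from typing import Optional


def build_handshake_frames(config: dict, pcm: Optional[bytes] = None) -> list:
    """Return the frames a client would send to open a session.

    The first frame is the JSON config. A binary PCM frame follows only
    when the session is not text-to-music.
    """
    frames: list = [json.dumps(config)]
    if config.get("text2music"):
        # Text-only session: the server synthesizes the silent source,
        # so no audio frame is sent.
        return frames
    if pcm is None:
        raise ValueError("audio sessions need a PCM frame")
    frames.append(pcm)
    return frames


text_frames = build_handshake_frames({"text2music": True})
audio_frames = build_handshake_frames({"text2music": False}, b"\x00\x00")
```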

Web app:

  • New client-side sentinel source (web/lib/text2music.ts) so every picker surface works unchanged: pinned TEXT TO MUSIC sleeve in the crate fan, pinned entry in the CORE-tab track picker, option in the lite select.
  • useStartSession / useFixtureSwap translate the sentinel into the wire flag; text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.

Testing

  • tests/unit (149 passed; contract drift guards included), npm run typecheck, npm run build.
  • Headless WS session against a live pod: silent initial buffer in → non-silent generated slices out (rms ~0.27), live prompt re-encode applied mid-stream.
  • Browser: start session → swap to Text to music (denoise snaps to 1, music fills the silent buffer) → live prompt change via Tags panel → swap back to a fixture. Reconnect path re-derives the t2m config from the store sentinel.

🤖 Generated with Claude Code

@leszko leszko marked this pull request as draft June 12, 2026 14:51
@leszko leszko marked this pull request as ready for review June 15, 2026 13:11
@leszko leszko requested a review from ryanontheinside June 15, 2026 13:11
@ryanontheinside

Collaborator

Thanks for pushing this forward. The UI/protocol direction here is useful as a first sketch of a "prompt-only source" mode, but I don't think this should merge as ACE-Step text-to-music yet. The missing piece is not a small bug in the PR; it is the core ACE-Step text-to-music path.

Right now the PR mostly does three things:

  • Adds a text2music flag to the startup and swap_source wire protocol.
  • Skips sending PCM from the browser when that flag is set.
  • Synthesizes a silent source on the backend and creates an EmptyLatent for the streaming session.

That is a reasonable placeholder for "no uploaded source audio", but it is not how ACE-Step's text-to-music generation works.

In this branch, resolve_text2music_source() returns an EmptyLatent as both source.latent and context_latent, and the rest of the session continues through the existing streaming path. Separately, encode_cond_pair() still hardcodes TASK_INSTRUCTIONS["cover"] and [Instrumental]. So the model is not receiving LM-generated semantic/audio codes. It is essentially the existing cover/remix engine running against a silent placeholder.

The important distinction is that "no reference audio" is not the same thing as "text-to-music semantic plan". ACE-Step text-to-music involves a planning stage that generates 5 Hz semantic/audio codes from the prompt, lyrics, and metadata. Those codes then feed the diffusion model. This PR skips that stage entirely.
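
To give a sense of scale for the planning stage, here is a small sketch of how many codes and hint frames a session would involve. The 5 Hz rate is stated above and the 25 Hz rate comes from the `precomputed_lm_hints_25Hz` hook name; treating both rates as exact and truncating to whole frames are assumptions, and the sketch says nothing about how codes are actually converted into hints.

```python
LM_CODE_RATE_HZ = 5     # semantic/audio code rate named in this review
DIT_HINT_RATE_HZ = 25   # hint rate implied by precomputed_lm_hints_25Hz


def plan_lengths(duration_s: float) -> tuple:
    """Return (number of 5 Hz codes, number of 25 Hz hint frames)."""
    codes = int(duration_s * LM_CODE_RATE_HZ)
    hints = int(duration_s * DIT_HINT_RATE_HZ)
    return codes, hints


# The PR's default text2music_duration_s is 60.0 seconds.
codes, hints = plan_lengths(60.0)
```

Under these assumptions the default 60-second session needs 300 codes from the LM before streaming can start, which is why planning deserves its own state.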

Relevant ACE-Step references:

What I think a real implementation needs:

  1. Backend LM integration

We need a real ACE-Step LM runtime in the backend, likely via LLMHandler or a local wrapper around the same components. The streaming session currently owns the DiT/VAE/text encoder path only. Real text-to-music needs to load/select the 5 Hz LM model, manage its memory, and expose its capability in the server contract.

  2. A planning phase before streaming generation

For text-to-music, the prompt/lyrics/BPM/key/time signature/duration should first produce audio semantic codes. That may take a while and should be represented as its own state, not hidden inside swap_source. The UI should expect "planning/generating semantic codes" before the live stream can meaningfully start.

  3. Feeding generated codes into the DiT path

The ACE-Step model code already has hooks for this: audio_codes and precomputed_lm_hints_25Hz flow into condition preparation. A correct implementation should feed the LM output there, or convert/cache it into the equivalent 25 Hz hints, rather than cloning an EmptyLatent into context_latent.

  4. Task-specific conditioning

The text-to-music path should not reuse the cover instruction by accident. TASK_INSTRUCTIONS["text2music"] may be part of the final path, but changing the instruction alone is not sufficient. The key missing work is still the semantic-code generation and wiring.

  5. Runtime lifecycle and UX

Because code generation can be slow, the backend should publish explicit progress/status/failure events. We probably need cancellation, timeout handling, and a clear transition from "planning" to "streaming". This should not feel like an instant fixture swap, because it is not doing the same amount of work.

  6. Capability/version guarding

The client currently skips the binary PCM frame when config.text2music is true. That is a new handshake shape. Server-side fixtures already have a probe because mixed frontend/backend deploys can otherwise hang. Text-to-music needs the same kind of capability/version guard; otherwise an older pod will ignore text2music and block waiting for the audio frame.

  7. Clear product semantics

If we keep a placeholder mode around, it should be named as such. Calling this "text-to-music" will mislead future work because the hard part is not present yet. The reusable pieces here are probably the picker affordance, the no-PCM handshake idea, and some UI state transitions, but the engine path needs a larger design pass.
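
On the capability/version guarding point, a client-side check could look roughly like the sketch below. The capability string, the `choose_handshake` name, and the idea that the server advertises a set of capability names are all hypothetical; the existing server-fixture probe may work differently.

```python
def choose_handshake(server_capabilities: set, wants_text2music: bool) -> str:
    """Decide which handshake shape is safe to send.

    Returns "pcm" for the normal audio handshake or "no_pcm" for a
    text-only handshake. Refuses to skip the PCM frame when the server
    has not advertised support, since an older pod would block waiting
    for audio.
    """
    if not wants_text2music:
        return "pcm"
    if "text2music" in server_capabilities:
        return "no_pcm"
    raise RuntimeError(
        "server does not advertise text2music; refusing to skip the PCM frame"
    )


mode = choose_handshake({"server_fixtures", "text2music"}, wants_text2music=True)
```

Failing loudly on the client is one option; falling back to a normal audio session is another, and which is better is a product decision.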

So my recommendation is: do not merge this as text-to-music. Either re-scope it as UI/protocol scaffolding behind a disabled/experimental flag, or replace the backend path with a real LM/audio-code implementation. The next PR should start from the ACE-Step LM planning flow and then decide how that output fits into DEMON's streaming engine.
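
As a starting point for that design pass, the planning and lifecycle points above could be modeled as an explicit state machine. This is only a sketch: the phase names, the allowed transitions, and the `t2m_status` event type are hypothetical and not part of the current wire contract.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"
    PLANNING = "planning"      # LM is generating semantic codes
    STREAMING = "streaming"    # DiT is producing audio slices
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    Phase.IDLE: {Phase.PLANNING},
    Phase.PLANNING: {Phase.STREAMING, Phase.FAILED, Phase.CANCELLED},
    Phase.STREAMING: {Phase.IDLE, Phase.FAILED},
    Phase.FAILED: {Phase.IDLE},
    Phase.CANCELLED: {Phase.IDLE},
}


class TextToMusicLifecycle:
    """Tracks the current phase and records one status event per change."""

    def __init__(self) -> None:
        self.phase = Phase.IDLE
        self.events = []

    def advance(self, target: Phase, detail: str = "") -> None:
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.value} -> {target.value}"
            )
        self.phase = target
        # Hypothetical event shape a server could publish to the client.
        self.events.append(
            {"type": "t2m_status", "phase": target.value, "detail": detail}
        )


lifecycle = TextToMusicLifecycle()
lifecycle.advance(Phase.PLANNING, "generating semantic codes")
lifecycle.advance(Phase.STREAMING)
```

Timeouts and cancellation would then be transitions out of PLANNING rather than special cases inside swap_source.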

  • Codex

@ryanontheinside ryanontheinside left a comment


please see main comment

The model already treats the checkpoint's canonical silence latent as
"no reference audio" (its trained text2music conditioning), so a
text-only session is a normal streaming session whose source latent and
structure context are both the silence latent, diffusing from pure
noise at denoise=1.0. No model or pipeline changes — only session
construction and wiring:

- SessionConfig: text2music + text2music_duration_s fields (projected
  into /api/protocol and the generated TS types); swap_source command
  gains a text2music flag for mid-session switches, no binary frame.
- streaming/source.py: text2music_waveform() (silent placeholder that
  seeds the playback ring) and resolve_text2music_source() (canonical
  silence PreparedSource + fixed 120 BPM / C major / 4 defaults —
  librosa beat-tracking on silence returns 0 BPM and would poison the
  text conditioning, so detection is skipped, as is stem extraction).
- ws_adapter: synthesizes the silent source server-side; no PCM upload
  on the wire in either the init handshake or the swap path.
- SDK: connect() skips the binary frame when config.text2music is set
  (mirrors use_server_fixture); new sendSwapTextToMusic().
- Web app: "Text to music" appears as a pinned source in the crate fan,
  the CORE-tab track picker, and the lite select, via a client-side
  sentinel source name. Text sessions bypass the hear-the-source-first
  denoise gate (the source is silence) and snap denoise to 1.0.

Verified end-to-end on GPU: headless WS session (silent initial buffer
in, non-silent slices out, live prompt re-encode applied) and in the
browser (swap to text mode, prompt change, swap back to a fixture).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ryanontheinside ryanontheinside force-pushed the rafal/feat/text2music branch from a8a14dc to 64b72e9 Compare June 15, 2026 17:50

2 participants