feat: text-to-music mode (generate from prompt alone, no input audio) #255
leszko wants to merge 1 commit into
Thanks for pushing this forward. The UI/protocol direction here is useful as a first sketch of a "prompt-only source" mode, but I don't think this should merge as ACE-Step text-to-music yet. The missing piece is not a small bug in the PR; it is the core ACE-Step text-to-music path. Right now the PR mostly does three things:
That is a reasonable placeholder for "no uploaded source audio", but it is not how ACE-Step's text-to-music generation works. In this branch,
The important distinction is that "no reference audio" is not the same thing as a "text-to-music semantic plan". ACE-Step text-to-music involves a planning stage that generates 5 Hz semantic/audio codes from the prompt, lyrics, and metadata. Those codes then feed the diffusion model. This PR skips that stage entirely. Relevant ACE-Step references:
What I think a real implementation needs:
We need a real ACE-Step LM runtime in the backend, likely via
For text-to-music, the prompt/lyrics/BPM/key/time signature/duration should first produce audio semantic codes. That may take a while and should be represented as its own state, not hidden inside
The ACE-Step model code already has hooks for this:
The text-to-music path should not reuse the cover instruction by accident.
Because code generation can be slow, the backend should publish explicit progress/status/failure events. We probably need cancellation, timeout handling, and a clear transition from "planning" to "streaming". This should not feel like an instant fixture swap, because it is not doing the same amount of work.
The client currently skips the binary PCM frame when
If we keep a placeholder mode around, it should be named as such. Calling this "text-to-music" will mislead future work because the hard part is not present yet. The reusable pieces here are probably the picker affordance, the no-PCM handshake idea, and some UI state transitions, but the engine path needs a larger design pass.
So my recommendation is: do not merge this as text-to-music. Either re-scope it as UI/protocol scaffolding behind a disabled/experimental flag, or replace the backend path with a real LM/audio-code implementation. The next PR should start from the ACE-Step LM planning flow and then decide how that output fits into DEMON's streaming engine.
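To make the planning/streaming split above concrete, here is a minimal Python sketch of an explicit planning phase with status events, a timeout, and cancellation. Everything in it is hypothetical: `Phase`, `run_planning`, the event dict shape, and the `plan` callable (standing in for whatever ACE-Step LM call ends up producing the audio codes) are illustrative names, not existing DEMON or ACE-Step APIs.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, List


class Phase(str, Enum):
    # Hypothetical session phases; the names are illustrative only.
    PLANNING = "planning"
    STREAMING = "streaming"
    FAILED = "failed"
    CANCELLED = "cancelled"


async def run_planning(
    plan: Callable[[], Awaitable[List[int]]],
    publish: Callable[[dict], None],
    timeout_s: float,
) -> List[int]:
    """Run the (assumed) code-generation step as its own visible phase."""
    publish({"type": "status", "phase": Phase.PLANNING.value})
    try:
        # wait_for cancels the inner task if the timeout elapses.
        codes = await asyncio.wait_for(plan(), timeout=timeout_s)
    except asyncio.TimeoutError:
        publish({"type": "status", "phase": Phase.FAILED.value, "reason": "timeout"})
        raise
    except asyncio.CancelledError:
        # Raised when the caller cancels the surrounding task.
        publish({"type": "status", "phase": Phase.CANCELLED.value})
        raise
    except Exception as exc:
        publish({"type": "status", "phase": Phase.FAILED.value, "reason": str(exc)})
        raise
    # Only announce streaming once the plan actually exists.
    publish({"type": "status", "phase": Phase.STREAMING.value, "num_codes": len(codes)})
    return codes
```

The point of the sketch is only that "planning" is a first-class state the client can render, and that failure, timeout, and cancellation each produce an event before the error propagates. How the resulting codes feed the streaming engine is a separate design question.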
ryanontheinside left a comment
please see main comment
The model already treats the checkpoint's canonical silence latent as "no reference audio" (its trained text2music conditioning), so a text-only session is a normal streaming session whose source latent and structure context are both the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes, only session construction and wiring:
- SessionConfig: text2music + text2music_duration_s fields (projected into /api/protocol and the generated TS types); swap_source command gains a text2music flag for mid-session switches, no binary frame.
- streaming/source.py: text2music_waveform() (silent placeholder that seeds the playback ring) and resolve_text2music_source() (canonical silence PreparedSource + fixed 120 BPM / C major / 4 defaults; librosa beat-tracking on silence returns 0 BPM and would poison the text conditioning, so detection is skipped, as is stem extraction).
- ws_adapter: synthesizes the silent source server-side; no PCM upload on the wire in either the init handshake or the swap path.
- SDK: connect() skips the binary frame when config.text2music is set (mirrors use_server_fixture); new sendSwapTextToMusic().
- Web app: "Text to music" appears as a pinned source in the crate fan, the CORE-tab track picker, and the lite select, via a client-side sentinel source name. Text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.
Verified end-to-end on GPU: headless WS session (silent initial buffer in, non-silent slices out, live prompt re-encode applied) and in the browser (swap to text mode, prompt change, swap back to a fixture).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
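As a rough illustration of the source-resolution idea in this commit message, the Python sketch below builds a silent placeholder buffer and returns fixed conditioning defaults without ever calling a detector. It is not the code from streaming/source.py: `Conditioning`, `silent_waveform`, `resolve_conditioning`, the 48 kHz stereo defaults, and reading the trailing "4" as a time signature are all assumptions made for the example.

```python
from array import array
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Conditioning:
    # Fixed defaults named in the commit message (120 BPM / C major / 4).
    # Interpreting "4" as the time signature is an assumption.
    bpm: float = 120.0
    key: str = "C major"
    time_signature: int = 4


def silent_waveform(duration_s: float, sample_rate: int = 48000, channels: int = 2) -> array:
    """Interleaved float32 zeros; the sample rate and channel count are assumed."""
    n_samples = int(round(duration_s * sample_rate)) * channels
    return array("f", [0.0]) * n_samples


def resolve_conditioning(text2music: bool, detect: Callable[[], Conditioning]) -> Conditioning:
    """Skip detection entirely for text sessions; otherwise defer to the detector."""
    if text2music:
        # Beat tracking on silence would report 0 BPM, so it is never run here.
        return Conditioning()
    return detect()
```

The design choice this mirrors is that the text path never runs analysis on a buffer known to be silent, so a meaningless 0 BPM reading cannot leak into the text conditioning.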
a8a14dc to
64b72e9
Summary
Adds a text-to-music mode: generate music in realtime from the text prompt alone, with no input audio — selectable at session start and switchable mid-session, in both the backend and the web app.
ACE-Step's trained "no reference audio" signal is the checkpoint's canonical silence latent (the model forward uses it to simulate text2music mode), so a text-only session is a normal streaming session whose source latent and structure context are the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes, only session construction and wiring.
Changes
Contract (registry-first):
- SessionConfig gains text2music: bool and text2music_duration_s: float = 60.0, auto-projected into GET /api/protocol and the generated TS types.
- swap_source command gains a text2music field (no binary PCM frame, mirrors use_server_source).
- wireContract.gen.ts regenerated.
Engine (acestep/streaming):
- source.py: text2music_waveform(), a silent placeholder that seeds the playback ring (the user hears generated slices stream in over silence), and resolve_text2music_source(), a canonical-silence PreparedSource via EmptyLatent plus fixed 120 BPM / C major / 4 conditioning defaults. Skips VAE encode, semantic extract, librosa beat-tracking (returns 0 BPM on silence, which would poison the text conditioning), CNN key detection, and stem extraction.
- session.py: create() and the swap path branch on the flag; swap_source() accepts text2music.
Transport + SDK:
- ws_adapter.py synthesizes the silent source server-side for both the init handshake and t2m swaps; no audio frame crosses the wire.
- web/sdk/protocol.ts: connect() skips the PCM frame when config.text2music is set; new sendSwapTextToMusic().
Web app:
- Text mode is represented by a client-side sentinel source name (web/lib/text2music.ts) so every picker surface works unchanged: pinned TEXT TO MUSIC sleeve in the crate fan, pinned entry in the CORE-tab track picker, option in the lite select.
- useStartSession / useFixtureSwap translate the sentinel into the wire flag; text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.
Testing
tests/unit (149 passed; contract drift guards included), npm run typecheck, npm run build.
🤖 Generated with Claude Code
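For readers skimming the wire-level behaviour described above, this short Python sketch shows the shape of a handshake that omits the binary PCM frame in text mode, plus the denoise snap. It is an illustration only: the real client is the TypeScript SDK, and `build_init_frames`, `effective_denoise`, and the "one JSON text frame followed by one binary frame" layout are assumptions rather than the actual protocol.

```python
import json
from typing import List, Optional, Union

Frame = Union[str, bytes]  # text frame (JSON) or binary frame (PCM)


def build_init_frames(config: dict, pcm: Optional[bytes]) -> List[Frame]:
    """Return the frames an init handshake would send (assumed layout)."""
    frames: List[Frame] = [json.dumps(config)]
    if config.get("text2music"):
        # Text mode: the server synthesizes the silent source, so no PCM is sent.
        return frames
    if pcm is None:
        raise ValueError("source audio is required unless text2music is set")
    frames.append(pcm)
    return frames


def effective_denoise(requested: float, text2music: bool) -> float:
    """Text sessions snap to full denoise because the source is silence."""
    return 1.0 if text2music else requested
```

For example, `build_init_frames({"text2music": True, "text2music_duration_s": 60.0}, None)` yields a single JSON frame, while a normal session yields the JSON frame followed by the PCM bytes.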