sts-web

Browser-native speech-to-speech running 100% client-side via Rust/WASM + WebGPU.

Model: idle-intelligence/personaplex-24L-q4_k-webgpu — a layer-pruned (32L → 24L), LoRA-recovered, Q4_K-quantized derivative of nvidia/personaplex-7b-v1, built specifically to run in this WebGPU/native runtime. Pruning + recovery + quantization are by @idle-intelligence; the base PersonaPlex weights are NVIDIA's. See the HF discussion for usage Q&A.

Status

Work in progress. The pipeline runs end-to-end in Chrome/Edge with WebGPU: walkie-talkie mode is functional and voice presets are available. Audio quality is currently poor. Native generation on an RTX 3080 takes ~63 ms per frame against the 80 ms of audio that each 12.5 Hz frame covers (a real-time factor of about 0.8, so slightly faster than real time); prefill (~3.4 s for the system prompt plus a few seconds of user audio) currently dominates time-to-first-frame.
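The real-time figure follows from the frame rate. As a rough sketch (the function names below are illustrative and not part of this repo), assuming the 12.5 Hz Mimi frame rate listed under Architecture:

```rust
/// Mimi frame rate from the Architecture section, in frames per second.
const FRAME_RATE_HZ: f64 = 12.5;

/// Milliseconds of audio covered by a single frame (1000 / 12.5 = 80).
fn frame_duration_ms() -> f64 {
    1000.0 / FRAME_RATE_HZ
}

/// Real-time factor: compute time per frame divided by audio time per frame.
/// Values below 1.0 mean generation keeps ahead of playback.
fn real_time_factor(compute_ms_per_frame: f64) -> f64 {
    compute_ms_per_frame / frame_duration_ms()
}
```

With the quoted ~63 ms/frame, `real_time_factor(63.0)` evaluates to 0.7875, which rounds to the 0.8 quoted above. Prefill is not included in this figure.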

Microphone → AudioWorklet (24 kHz mono) → Mimi encoder (WASM) → Temporal transformer (WASM/WebGPU) → Depth transformer (WASM/WebGPU) → Mimi decoder (WASM) → AudioWorklet playback
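To make the hand-off between the AudioWorklet and the Mimi encoder concrete: at 24 kHz and 12.5 frames per second, one codec frame spans 1920 samples. The following is a minimal sketch of cutting a captured buffer into whole frames. `split_into_frames` is a hypothetical helper, not a function from this repo, and the actual buffering in `web/` may differ.

```rust
/// Capture sample rate used by the pipeline (24 kHz mono).
const SAMPLE_RATE_HZ: usize = 24_000;

/// Samples per Mimi frame: 24_000 / 12.5, written as 24_000 * 2 / 25
/// to stay in integer arithmetic. Evaluates to 1920.
const SAMPLES_PER_FRAME: usize = SAMPLE_RATE_HZ * 2 / 25;

/// Split a mono buffer into complete frames plus the leftover tail.
/// The tail would be kept and prepended to the next captured buffer.
fn split_into_frames(samples: &[f32]) -> (Vec<&[f32]>, &[f32]) {
    let chunks = samples.chunks_exact(SAMPLES_PER_FRAME);
    let remainder = chunks.remainder();
    (chunks.collect(), remainder)
}
```

For example, a 4000-sample buffer yields two full frames and a 160-sample remainder.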

Requirements

Chrome 113+ or Edge 113+ (WebGPU required)
HTTPS (WebGPU is only exposed in secure contexts; the dev server uses a self-signed certificate)
Microphone access for voice input

Quick Start

# 1. Clone the repo
git clone https://github.com/idle-intelligence/sts-web.git
cd sts-web

# 2. Build WASM
wasm-pack build crates/sts-wasm --target web --no-default-features --features wasm

# 3. Start dev server
node web/serve.mjs

# 4. Open https://localhost:8443

Native CLI (`sts`)

Run the model from a terminal — useful for trying personaplex-24L-q4_k-webgpu without a browser, scripting batch inference, or smoke-testing changes against joke.wav.

# 1. Download the model (~3.8 GB)
huggingface-cli download idle-intelligence/personaplex-24L-q4_k-webgpu \
    --local-dir personaplex-24L-q4_k-webgpu

# 2. Build and run (release; first build pulls Burn + cubecl, ~5 min)
cargo run --release --features "wgpu,cli" --bin sts -- \
    --model-dir ./personaplex-24L-q4_k-webgpu \
    --input  my_question.wav \
    --output response.wav \
    --voice  NATF2

The CLI loads the sharded GGUF, the Mimi codec safetensors, the SentencePiece tokenizer, and a .pt voice preset directly — no Python preprocessing. It runs an end-to-end speech-to-speech turn (voice prefill → system prompt → user audio prefill → response generation → Mimi decode) and writes a 24 kHz mono WAV. The model's inner-monologue text is also printed to stdout.

Requirements: Vulkan (Linux/Windows) or Metal (macOS) — wgpu auto-selects. ~4 GB VRAM. Input WAV must be mono; any sample rate is accepted (resampled to 24 kHz). Other voices: NATF0..3, NATM0..3, VARF0..4, VARM0..4. Run sts --help for all options (sampling temperatures, max frame count, layer count for non-default checkpoints).
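Since the CLI only accepts mono input, a stereo recording has to be downmixed first. The sketch below shows one common approach, averaging the channels of each sample frame. It assumes the audio has already been decoded to interleaved `f32` samples; `downmix_to_mono` is an illustrative helper and not part of `sts`.

```rust
/// Average interleaved multi-channel samples down to a single channel.
/// `channels` is the channel count of the input (2 for stereo).
/// Any trailing partial sample frame is dropped.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be at least 1");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

The resulting buffer can then be written out as a mono WAV at its original sample rate, since the CLI handles resampling to 24 kHz itself.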

Architecture

crates/sts-wasm/ — Temporal transformer (24L pruned, 32L upstream) + depth transformer (6L × 16 steps, 8 generated) in Burn + wgpu, Q4_K GGUF quantization
Mimi codec (mimi-rs) — Audio tokenizer/detokenizer, 8 codebooks at 12.5 Hz
web/ — Standalone demo, Web Workers for inference + Mimi decode, AudioWorklet for playback
Model weights fetched from HuggingFace at runtime, cached via Cache API
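The codec figures above translate directly into token counts. As a back-of-the-envelope sketch (names are illustrative, and the numbers are per audio stream), 8 codebooks at 12.5 Hz give 100 audio tokens per second:

```rust
/// Mimi frame rate in frames per second.
const FRAME_RATE_HZ: f64 = 12.5;

/// Number of codebooks per frame.
const CODEBOOKS: u32 = 8;

/// Audio tokens produced per second of a single audio stream.
fn audio_tokens_per_second() -> f64 {
    FRAME_RATE_HZ * f64::from(CODEBOOKS)
}

/// Number of frames needed to cover `seconds` of audio, rounded up.
fn frames_for_duration(seconds: f64) -> u64 {
    (seconds * FRAME_RATE_HZ).ceil() as u64
}
```

A 10-second response therefore needs 125 frames, which is a useful reference point when setting a maximum frame count for generation.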

