MM-Speech/VoxMind

🎙️ VoxMind

arXiv | Model | Dataset: AgentChat | Dataset: VoxMind-jsonl | ModelScope


✨ Overview

Recent end-to-end spoken dialogue models have made natural voice interaction increasingly practical. However, as user requests become more complex and task-oriented, conversational ability alone is often not enough. To address real-world spoken tasks, these models must be equipped with agentic capabilities such as structured reasoning, tool use, and dynamic access to external functions.

VoxMind is an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Built around a Think-before-Speak paradigm, VoxMind enables the model to internalize structured reasoning before response generation, which improves planning, tool selection, and spoken answer quality. In addition, to alleviate the latency bottleneck introduced by large-scale tool integration, VoxMind includes a Multi-Agent Dynamic Tool Management architecture that asynchronously delegates tool retrieval to an auxiliary agent aligned with the main model's reasoning trajectory.


🔥 News

  • 📄 Our paper has been accepted to ACL 2026 Main Conference!
  • 💻 We have open-sourced the inference and training code.
  • 📚 We have released the training data.
  • 🧠 We have made the model weights publicly available.
  • 📄 Our preprint is now available on arXiv.

🔥 Highlights

  • 🎧 Unified audio / text input workflow
  • 🧠 Built-in reasoning structure with <|THINK_START|> ... <|THINK_END|>
  • 🛠️ Structured parsing from <tool_call>...</tool_call> blocks
  • 🔁 Multi-round observation feedback and follow-up reasoning
  • 🔎 Missing-tool discovery and dynamic tool injection
  • ⚡ Parallel retrieval design to reduce dynamic-agent latency
  • 🏋️ Includes training-related scripts and launcher template
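
As a rough illustration of the two structural highlights, the sketch below splits a raw generation into its reasoning span and its answer, then extracts tool calls. Only the tag names come from the list above. The JSON payload shape (`name` plus `arguments`) and the helper names `split_think` and `extract_tool_calls` are assumptions for illustration and may differ from what `runtime/model.py` actually does.

```python
import json
import re

# Tag names are taken from the highlights; the JSON payload shape
# ({"name": ..., "arguments": {...}}) is an assumption for illustration.
THINK_RE = re.compile(r"<\|THINK_START\|>(.*?)<\|THINK_END\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def split_think(text):
    """Return (think, answer): the reasoning span and everything outside it."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    think = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return think, answer


def extract_tool_calls(answer):
    """Collect every <tool_call> block whose body is valid JSON."""
    calls = []
    for body in TOOL_CALL_RE.findall(answer):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
    return calls


raw = (
    "<|THINK_START|>The user wants weather, so call the weather tool."
    "<|THINK_END|>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
)
think, answer = split_think(raw)
print(think)
print(extract_tool_calls(answer))
```

Skipping malformed blocks rather than raising is a design choice of this sketch: a spoken agent can usually recover by asking again, whereas an exception would end the turn.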

🗂️ Project Structure

voxmind/
├── assets/                      # demo audio files
├── runtime/                     # runtime implementation
│   ├── model.py                 # VoxMind model wrapper
│   ├── response.py              # VoxMindResponse definition
│   └── prompts.py               # default system prompt
├── scripts/                     # training-related scripts
│   ├── think_train.py           # training entry
│   ├── think_dataset.py         # training dataset processing
│   └── think_dataset_s2s.py     # seq2seq dataset processing
├── think.sh                     # example multi-GPU training launcher
├── tools.json                   # base tool definitions for agent_demo.py
├── 15tools.json                 # initial local tool cache for dynamic demo
├── 100tools.json                # global tool pool for retrieval demo
├── agent_demo.py                # fixed toolset multi-case demo
├── dynamic_tool_agent_demo.py   # dynamic retrieval + cache injection demo
└── README.md

🤗 Model & Datasets

Model

Training data

These resources correspond to the released model weights, structured JSONL data, and speech-side training assets used in the VoxMind pipeline.


🚀 Installation

Before running the demos or training scripts, prepare your environment first.

Step 1: Create environment

conda create -n voxmind python=3.10 -y
conda activate voxmind

Step 2: Install dependencies

pip install -r requirements.txt

If some runtime components or local model wrappers have extra dependencies in your environment, install them separately as needed.

Step 3: Prepare local model files

Both demos rely on a local VoxMind model directory. Please update the model path in the scripts according to your own machine.


🎯 Inference

VoxMind currently provides two main demo scripts.

Part 1: Fixed-tool agent reasoning

agent_demo.py demonstrates a standard fixed-tool workflow for speech / text reasoning agents.

Covered scenarios

  • single tool call
  • multi-tool decomposition in one request
  • repeated use of the same tool with different arguments
  • request outside the current toolset
  • missing-tool suggestion followed by second-round reasoning

Built-in mock tools

The demo contains several local mock tools, including:

  • Get Weather
  • Search Flights
  • Search Hotels
  • searchTools

Among them, searchTools is used to simulate the case where the current tool inventory is insufficient and the model needs a missing capability to continue.
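
To make the role of the mock tools concrete, here is a minimal sketch of a local tool registry and dispatcher. The function bodies, the registry keys, and the call format (`name` plus `arguments`) are all hypothetical and are not copied from `agent_demo.py`. The point is only that a `searchTools`-style entry can return a "missing capability" signal instead of a real result.

```python
def get_weather(city):
    """Mock tool: returns canned data instead of calling a real service."""
    return {"city": city, "forecast": "sunny"}


def search_tools(query):
    """Mock discovery tool: signals that a capability is missing."""
    return {"status": "missing_tool", "query": query}


# Registry keys are hypothetical identifiers, not the names used by agent_demo.py.
TOOL_REGISTRY = {"get_weather": get_weather, "searchTools": search_tools}


def dispatch(call):
    """Run one parsed tool call of the assumed form {"name": ..., "arguments": {...}}."""
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        return {"error": "unknown tool: " + call["name"]}
    return func(**call.get("arguments", {}))


print(dispatch({"name": "get_weather", "arguments": {"city": "Beijing"}}))
print(dispatch({"name": "searchTools", "arguments": {"query": "currency conversion"}}))
```

The observation returned by `dispatch` would then be fed back to the model for the second reasoning round described in the scenarios above.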

Run

python agent_demo.py

Part 2: Dynamic tool retrieval

dynamic_tool_agent_demo.py demonstrates a more realistic dynamic tool-management pipeline.

Instead of exposing a full tool universe to the model at once, the script uses:

  • a small local cache (15tools.json)
  • a large global tool pool (100tools.json)
  • an auxiliary retrieval model to recall the most relevant missing tools

Core idea

  1. the model reasons using a limited local tool window
  2. the reasoning trace becomes the query signal for retrieval
  3. an auxiliary model retrieves top candidate tools from the global pool
  4. the retrieved tools are injected into the local cache
  5. the model performs a second reasoning stage with the updated tool context
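
The retrieval step (item 3) can be sketched with a deliberately simple stand-in. The demo uses an auxiliary retrieval model. The version below swaps that for plain word overlap between the reasoning trace and each tool's name and description, so it illustrates the data flow and not the actual ranking method. The tool schema (`name`, `description`) and the function names are assumptions.

```python
import re


def tokenize(text):
    """Lowercase word tokens; deliberately simple."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_top_k(think_trace, tool_pool, cached_names, k=2):
    """Rank uncached tools by word overlap between the trace and the tool text."""
    query = tokenize(think_trace)
    scored = []
    for tool in tool_pool:
        if tool["name"] in cached_names:
            continue  # already visible to the model
        overlap = len(query & tokenize(tool["name"] + " " + tool["description"]))
        if overlap > 0:
            scored.append((overlap, tool["name"]))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


pool = [
    {"name": "get_weather", "description": "current weather for a city"},
    {"name": "convert_currency", "description": "convert an amount between currencies"},
    {"name": "translate_text", "description": "translate text into another language"},
]
trace = "I need to convert 100 USD into euros, but no currency tool is available."
print(retrieve_top_k(trace, pool, cached_names={"get_weather"}, k=1))
```

Tools already in the local cache are excluded from the ranking, which matches the idea that only missing tools need to be recalled from the global pool.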

Key component: ToolCache

ToolCache is implemented with OrderedDict and is responsible for:

  • maintaining local tool-window size
  • refreshing recency after use
  • injecting only newly retrieved tools
  • evicting older tools when capacity is exceeded
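
The four responsibilities above map naturally onto `OrderedDict` operations. The class below is a minimal sketch under that assumption. The class name `ToolCacheSketch`, its method names, and the tool schema are hypothetical, and the repository's `ToolCache` may differ in detail.

```python
from collections import OrderedDict


class ToolCacheSketch:
    """LRU-style tool window; a hypothetical stand-in, not the repository class."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._tools = OrderedDict()  # tool name -> tool definition

    def touch(self, name):
        """Mark a tool as recently used so it is evicted last."""
        if name in self._tools:
            self._tools.move_to_end(name)

    def inject(self, tools):
        """Add only tools not already cached; return the names actually added."""
        added = []
        for tool in tools:
            name = tool["name"]
            if name in self._tools:
                continue
            self._tools[name] = tool
            added.append(name)
            while len(self._tools) > self.capacity:
                self._tools.popitem(last=False)  # drop the least recently used
        return added

    def names(self):
        return list(self._tools)


cache = ToolCacheSketch(capacity=3)
cache.inject([{"name": "get_weather"}, {"name": "search_flights"}, {"name": "search_hotels"}])
cache.touch("get_weather")                    # get_weather becomes most recent
cache.inject([{"name": "convert_currency"}])  # evicts search_flights
print(cache.names())  # ['search_hotels', 'get_weather', 'convert_currency']
```

`move_to_end` handles the recency refresh and `popitem(last=False)` handles eviction, so both run in constant time regardless of window size.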

Parallel retrieval

A notable feature is that retrieval runs in parallel with first-stage answer generation:

  1. produce think trace
  2. start retrieval thread using the capability trace
  3. continue answer generation
  4. wait for retrieval completion
  5. inject top-k tools if necessary

This makes the script a useful reference for latency-aware dynamic agents.
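
The five steps can be sketched with the standard `threading` module. Both `generate_answer` and `retrieve_tools` are placeholders that sleep briefly to simulate work. Their names, the shared `results` dict, and the two-second timeout are assumptions made for this example, not details of `dynamic_tool_agent_demo.py`.

```python
import threading
import time


def generate_answer(think_trace):
    """Placeholder for first-stage answer generation (hypothetical)."""
    time.sleep(0.05)
    return "Draft answer based on: " + think_trace


def retrieve_tools(think_trace, results):
    """Placeholder for the auxiliary retriever; writes into a shared dict."""
    time.sleep(0.05)
    results["tools"] = ["convert_currency"]


def run_turn(think_trace):
    results = {}
    # Step 2: start retrieval as soon as the think trace exists.
    worker = threading.Thread(target=retrieve_tools, args=(think_trace, results))
    worker.start()
    # Step 3: keep generating while retrieval runs in the background.
    answer = generate_answer(think_trace)
    # Step 4: wait for retrieval; the timeout keeps a slow retriever from stalling the turn.
    worker.join(timeout=2.0)
    # Step 5: an empty list means there is nothing to inject.
    return answer, results.get("tools", [])


start = time.perf_counter()
answer, tools = run_turn("need a currency conversion tool")
elapsed = time.perf_counter() - start
print(answer, tools, round(elapsed, 2))
```

Because the two simulated tasks overlap, the elapsed time should be close to the longer of the two and not their sum, which is the latency benefit the parallel design aims for.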

Run

python dynamic_tool_agent_demo.py

⚙️ Configuration

Input mode

Both demos default to audio input:

INPUT_MODE = "audio"

If you want plain-text testing, change it to:

INPUT_MODE = "text"

Extra context

Both scripts inject runtime metadata through extra_context, for example:

EXTRA_CONTEXT = {
    "current_city": "Beijing",
    "user_language": "en",
}

This is useful when building system prompts with current context information.
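
One plausible way such metadata ends up in a system prompt is sketched below. The functions `render_context` and `build_prompt` and the "Current context:" layout are illustrative assumptions. The repository's own `build_system_prompt` may format the context differently.

```python
def render_context(extra_context):
    """Turn a metadata dict into a small block of '- key: value' lines."""
    lines = [f"- {key}: {value}" for key, value in sorted(extra_context.items())]
    return "Current context:\n" + "\n".join(lines)


def build_prompt(base_prompt, extra_context=None):
    """Append the context block only when metadata is provided."""
    if not extra_context:
        return base_prompt
    return base_prompt.rstrip() + "\n\n" + render_context(extra_context)


EXTRA_CONTEXT = {"current_city": "Beijing", "user_language": "en"}
print(build_prompt("You are a helpful voice assistant.", EXTRA_CONTEXT))
```

Sorting the keys keeps the rendered prompt stable across runs, which makes prompt-level debugging and caching easier.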


🔑 DashScope API Key

Please configure your own key locally through the DASHSCOPE_API_KEY environment variable rather than hardcoding it in the scripts.

Linux / macOS

export DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

Windows PowerShell

$env:DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

Optional model override

export QWEN_MODEL_NAME="qwen-plus"

Recommended usage in code:

import os
api_key = os.getenv("DASHSCOPE_API_KEY")
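
The environment lookup can be wrapped in a small helper that fails early with a readable message when the key is missing. This is a sketch: the helper name `load_dashscope_settings` is hypothetical, and falling back to "qwen-plus" when QWEN_MODEL_NAME is unset is an assumption, not a documented default.

```python
import os


def load_dashscope_settings(environ=None):
    """Read the key and optional model name; raise early if the key is missing."""
    env = os.environ if environ is None else environ
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set; export it before running the demo.")
    # Falling back to "qwen-plus" is an assumption, not a documented default.
    model_name = env.get("QWEN_MODEL_NAME", "qwen-plus")
    return api_key, model_name


# A plain dict stands in for os.environ so the example runs without a real key.
print(load_dashscope_settings({"DASHSCOPE_API_KEY": "your_api_key_here"}))
```

Accepting an optional mapping instead of always reading `os.environ` makes the helper easy to test without touching the real environment.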

🏋️ Training

This repository also contains training-related scripts and a launcher template.

Training files

  • scripts/think_train.py: training entry script
  • scripts/think_dataset.py: dataset loading / preprocessing
  • scripts/think_dataset_s2s.py: seq2seq-style dataset handling
  • think.sh: example distributed training launcher

Step 1: Prepare your dataset

Prepare your own training JSONL file and corresponding audio directory according to your local project layout.

Step 2: Modify think.sh

Before training, open think.sh and manually fill in all required paths for your own environment:

ROOT_DIR=""
MODEL_DIR=""
TOKEN2WAV_DIR=""
DATASET_PATH=""
AUDIO_ROOT=""
OUTPUT_DIR=""
LOG_DIR=""
DEEPSPEED_CONFIG=""
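
Since every variable starts out empty, a quick pre-flight check can catch a forgotten path before a multi-GPU job is launched. The sketch below scans launcher text for the eight variables listed above and reports any that are absent or still empty. The helper name `unset_variables` and the sample path `/data/voxmind` are hypothetical, and the parsing assumes simple `NAME="value"` lines without trailing comments.

```python
import re

REQUIRED = [
    "ROOT_DIR", "MODEL_DIR", "TOKEN2WAV_DIR", "DATASET_PATH",
    "AUDIO_ROOT", "OUTPUT_DIR", "LOG_DIR", "DEEPSPEED_CONFIG",
]


def unset_variables(script_text):
    """Return required variables that are absent or assigned an empty string."""
    assigned = {}
    for name, value in re.findall(r"^\s*(\w+)=(.*)$", script_text, re.MULTILINE):
        assigned[name] = value.strip().strip('"').strip("'")
    return [name for name in REQUIRED if not assigned.get(name)]


# Hypothetical launcher fragment: one path filled in, one left empty.
sample = 'ROOT_DIR="/data/voxmind"\nMODEL_DIR=""\n'
print(unset_variables(sample))
```

In practice the text would come from reading think.sh, for example with `pathlib.Path("think.sh").read_text()`, and training would only be started once the returned list is empty.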

Step 3: Start training

After completing the path configuration, run:

bash think.sh

Notes

  • think.sh no longer contains private hardcoded absolute paths.
  • Please manually configure all local paths according to your machine.
  • If you are not using multi-GPU training, you can simplify the launcher further.

🧪 Minimal Usage Example

from runtime import DEFAULT_SYSTEM_PROMPT, VoxMind

model = VoxMind("/path/to/VoxMind")

tools = []
system_prompt = model.build_system_prompt(
    DEFAULT_SYSTEM_PROMPT,
    tools,
    extra_context={"current_city": "Beijing", "user_language": "en"},
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather like in Beijing today?"},
]

response = model.generate(
    messages,
    post_think_prefix="After careful reasoning, here is my detailed answer:\n",
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)

print(response.think)
print(response.answer)
print(model.parse_tool_calls(response.answer))

📚 Citation

If this repository or its workflow design is helpful to your research, please cite or reference it appropriately.

@misc{liang2026voxmindendtoendagenticspoken,
      title={VoxMind: An End-to-End Agentic Spoken Dialogue System}, 
      author={Tianle Liang and Yifu Chen and Shengpeng Ji and Yijun Chen and Zhiyang Jia and Jingyu Lu and Fan Zhuo and Xueyi Pu and Yangzhuo Li and Zhou Zhao},
      year={2026},
      eprint={2604.15710},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2604.15710}, 
}

About

[ACL 2026] VoxMind: An End-to-End Agentic Spoken Dialogue System
