🎙️ VoxMind

✨ Overview

Recent end-to-end spoken dialogue models have made natural voice interaction increasingly practical. However, as user requests become more complex and task-oriented, conversational ability alone is often not enough. To address real-world spoken tasks, these models must be equipped with agentic capabilities such as structured reasoning, tool use, and dynamic access to external functions.

VoxMind is an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Built around a Think-before-Speak paradigm, VoxMind enables the model to internalize structured reasoning before response generation, which improves planning, tool selection, and spoken answer quality. In addition, to alleviate the latency bottleneck introduced by large-scale tool integration, VoxMind includes a Multi-Agent Dynamic Tool Management architecture that asynchronously delegates tool retrieval to an auxiliary agent aligned with the main model’s reasoning trajectory.

🔥 News

📄 Our paper has been accepted to the ACL 2026 Main Conference!
💻 We have open-sourced the inference and training code.
📚 We have released the training data.
🧠 We have made the model weights publicly available.
📄 Our preprint is now available on arXiv.

🔥 Highlights

🎧 Unified audio / text input workflow
🧠 Built-in reasoning structure with <|THINK_START|> ... <|THINK_END|>
🛠️ Structured parsing from <tool_call>...</tool_call> blocks
🔁 Multi-round observation feedback and follow-up reasoning
🔎 Missing-tool discovery and dynamic tool injection
⚡ Parallel retrieval design to reduce dynamic-agent latency
🏋️ Includes training-related scripts and launcher template
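
To illustrate the two output markers listed above, here is a minimal, standalone sketch of how a raw generation could be split into a reasoning trace and tool calls. It is not the repository's parser: the helper names are hypothetical, and the JSON payload shape inside `<tool_call>` (`name` plus `arguments`) is an assumption made for this example.

```python
import json
import re

THINK_RE = re.compile(r"<\|THINK_START\|>(.*?)<\|THINK_END\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def split_think(text):
    """Return (think, answer): the reasoning trace and the text after it."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def parse_tool_calls(answer):
    """Extract every <tool_call> payload that decodes as a JSON object."""
    calls = []
    for payload in TOOL_CALL_RE.findall(answer):
        try:
            call = json.loads(payload)  # assumed payload format: JSON
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
        if isinstance(call, dict):
            calls.append(call)
    return calls


raw = (
    "<|THINK_START|>The user wants weather, so I need Get Weather.<|THINK_END|>"
    '<tool_call>{"name": "Get Weather", "arguments": {"city": "Beijing"}}</tool_call>'
)
think, answer = split_think(raw)
print(think)
print(parse_tool_calls(answer))
```

Malformed blocks are skipped rather than raised, so one bad block does not discard the other calls in the same turn.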

🗂️ Project Structure

voxmind/
├── assets/                      # demo audio files
├── runtime/                     # runtime implementation
│   ├── model.py                 # VoxMind model wrapper
│   ├── response.py              # VoxMindResponse definition
│   └── prompts.py               # default system prompt
├── scripts/                     # training-related scripts
│   ├── think_train.py           # training entry
│   ├── think_dataset.py         # training dataset processing
│   └── think_dataset_s2s.py     # seq2seq dataset processing
├── think.sh                     # example multi-GPU training launcher
├── tools.json                   # base tool definitions for agent_demo.py
├── 15tools.json                 # initial local tool cache for dynamic demo
├── 100tools.json                # global tool pool for retrieval demo
├── agent_demo.py                # fixed toolset multi-case demo
├── dynamic_tool_agent_demo.py   # dynamic retrieval + cache injection demo
└── README.md

🤗 Model & Datasets

Model

🤗 Hugging Face: leungtianle/VoxMind

Training data

🤗 JSONL annotations: leungtianle/VoxMind-jsonl
🤗 AgentChat dataset: leungtianle/AgentChat
🔷 ModelScope speech data: BEISHUI/AgentChat

These resources correspond to the released model weights, structured JSONL data, and speech-side training assets used in the VoxMind pipeline.

🚀 Installation

Prepare your environment before running the demos or training scripts.

Step 1: Create environment

conda create -n voxmind python=3.10 -y
conda activate voxmind

Step 2: Install dependencies

pip install -r requirements.txt

If some runtime components or local model wrappers have extra dependencies in your environment, install them separately as needed.

Step 3: Prepare local model files

Both demos rely on a local VoxMind model directory. Please update the model path in the scripts according to your own machine.

🎯 Inference

VoxMind currently provides two main demo scripts.

Part 1: Fixed-tool agent reasoning

agent_demo.py demonstrates a standard fixed-tool workflow for speech / text reasoning agents.

Covered scenarios

single tool call
multi-tool decomposition in one request
repeated use of the same tool with different arguments
request outside the current toolset
missing-tool suggestion followed by second-round reasoning

Built-in mock tools

The demo contains several local mock tools, including:

Get Weather
Search Flights
Search Hotels
searchTools

Among them, searchTools is used to simulate the case where the current tool inventory is insufficient and the model needs a missing capability to continue.
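
As a rough illustration of how local mock tools can be wired up, the sketch below keeps a name-to-function registry and turns a parsed call into an observation. The function names, argument names, and return values are hypothetical and are not taken from agent_demo.py.

```python
def get_weather(city):
    # Mock: returns canned data instead of calling a real weather service.
    return {"city": city, "forecast": "sunny", "temperature_c": 25}


def search_tools(query):
    # Mock: pretends to look up a capability the current toolset lacks.
    return {"query": query, "suggested_tools": []}


MOCK_TOOLS = {"Get Weather": get_weather, "searchTools": search_tools}


def dispatch(call):
    """Run one parsed tool call and return an observation dict."""
    handler = MOCK_TOOLS.get(call.get("name"))
    if handler is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    try:
        return handler(**call.get("arguments", {}))
    except TypeError as exc:  # wrong or missing arguments
        return {"error": str(exc)}


print(dispatch({"name": "Get Weather", "arguments": {"city": "Beijing"}}))
print(dispatch({"name": "Book Taxi", "arguments": {}}))
```

Returning errors as observations, rather than raising, lets the model see the failure in the next round and react to it.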

Run

python agent_demo.py

Part 2: Dynamic tool retrieval

dynamic_tool_agent_demo.py demonstrates a more realistic dynamic tool-management pipeline.

Instead of exposing a full tool universe to the model at once, the script uses:

a small local cache (15tools.json)
a large global tool pool (100tools.json)
an auxiliary retrieval model to recall the most relevant missing tools

Core idea

the model reasons using a limited local tool window
the reasoning trace becomes the retrieval signal
an auxiliary model retrieves top candidate tools from the global pool
the retrieved tools are injected into the local cache
the model performs a second reasoning stage with the updated tool context
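
The retrieval step can be pictured with a deliberately simple stand-in. The demo uses an auxiliary model for recall. The sketch below swaps in plain word overlap between the reasoning trace and each tool description, only to show the data flow from trace to top candidates. The tool schema (`name`, `description`) and the function names are assumptions.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top_k(think_trace, tool_pool, k=3):
    """Rank tools in the pool by word overlap with the reasoning trace."""
    query = tokenize(think_trace)
    scored = []
    for tool in tool_pool:
        overlap = len(query & tokenize(tool["name"] + " " + tool["description"]))
        if overlap > 0:
            scored.append((overlap, tool["name"]))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


pool = [
    {"name": "Convert Currency", "description": "convert an amount between two currencies"},
    {"name": "Get Weather", "description": "get the weather forecast for a city"},
    {"name": "Translate Text", "description": "translate text into another language"},
]
trace = "I need to convert 100 dollars to euros, but no currency tool is available."
print(retrieve_top_k(trace, pool, k=1))  # ['Convert Currency']
```

A learned retriever would replace `retrieve_top_k`, and the surrounding flow of trace in and tool names out would stay the same.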

Key component: `ToolCache`

ToolCache is implemented with OrderedDict and is responsible for:

maintaining local tool-window size
refreshing recency after use
injecting only newly retrieved tools
evicting older tools when capacity is exceeded
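
The four responsibilities above map naturally onto an LRU-style structure. The following is a minimal sketch built on `collections.OrderedDict`, not the repository's `ToolCache`: the method names (`touch`, `inject`, `names`) and the tool dict shape are hypothetical.

```python
from collections import OrderedDict


class ToolCache:
    """Fixed-capacity tool window with least-recently-used eviction."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._tools = OrderedDict()  # oldest entry first, newest last

    def touch(self, name):
        """Mark a cached tool as just used."""
        if name in self._tools:
            self._tools.move_to_end(name)

    def inject(self, tools):
        """Add tools that are not cached yet; return the names evicted."""
        evicted = []
        for tool in tools:
            name = tool["name"]
            if name in self._tools:
                continue  # only newly retrieved tools are injected
            self._tools[name] = tool
            while len(self._tools) > self.capacity:
                old_name, _ = self._tools.popitem(last=False)
                evicted.append(old_name)
        return evicted

    def names(self):
        return list(self._tools)


cache = ToolCache(capacity=2)
cache.inject([{"name": "Get Weather"}, {"name": "Search Flights"}])
cache.touch("Get Weather")                        # Get Weather becomes most recent
print(cache.inject([{"name": "Search Hotels"}]))  # ['Search Flights']
print(cache.names())                              # ['Get Weather', 'Search Hotels']
```

`OrderedDict` gives constant-time `move_to_end` and `popitem(last=False)`, so refreshing recency and evicting the oldest entry stay cheap as the window grows.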

Parallel retrieval

A notable feature is that retrieval runs in parallel with first-stage answer generation:

produce think trace
start retrieval thread using the capability trace
continue answer generation
wait for retrieval completion
inject top-k tools if necessary

This makes the script a useful reference for latency-aware dynamic agents.
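
The overlap between retrieval and generation can be sketched with the standard library. Both worker functions below are stand-ins that sleep instead of calling a model, and `concurrent.futures` is used here as a convenient way to start a thread. The demo's actual threading code may look different.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def retrieve_tools(trace):
    # Stand-in for the auxiliary retrieval model (assumed slow and I/O-bound).
    time.sleep(0.2)
    return ["Convert Currency"]


def generate_answer(trace):
    # Stand-in for first-stage answer generation by the main model.
    time.sleep(0.2)
    return "I need a currency conversion tool to continue."


def first_stage(trace):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(retrieve_tools, trace)  # start retrieval in the background
        answer = generate_answer(trace)              # keep generating meanwhile
        candidates = future.result(timeout=10)       # wait for retrieval to finish
    return answer, candidates


start = time.perf_counter()
answer, candidates = first_stage("need currency conversion")
elapsed = time.perf_counter() - start
print(answer, candidates, f"{elapsed:.2f}s")
```

Because the two steps overlap, the wall-clock time is close to the slower of the two rather than their sum.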

Run

python dynamic_tool_agent_demo.py

⚙️ Configuration

Input mode

Both demos default to audio input:

INPUT_MODE = "audio"

If you want plain-text testing, change it to:

INPUT_MODE = "text"

Extra context

Both scripts inject runtime metadata through extra_context, for example:

EXTRA_CONTEXT = {
    "current_city": "Beijing",
    "user_language": "en",
}

This is useful when building system prompts with current context information.
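
How the context ends up in the prompt depends on the model wrapper. As a purely illustrative sketch, a helper like the hypothetical `render_context` below could append the key/value pairs to a base prompt. The output layout is an assumption and does not describe what `build_system_prompt` actually produces.

```python
def render_context(base_prompt, extra_context=None):
    """Append key/value context lines to a base system prompt."""
    if not extra_context:
        return base_prompt
    lines = [f"- {key}: {value}" for key, value in extra_context.items()]
    return base_prompt.rstrip() + "\n\nCurrent context:\n" + "\n".join(lines)


EXTRA_CONTEXT = {"current_city": "Beijing", "user_language": "en"}
print(render_context("You are a helpful voice assistant.", EXTRA_CONTEXT))
```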

🔑 DashScope API Key

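Configure your own DashScope API key locally through the DASHSCOPE_API_KEY environment variable before running dynamic_tool_agent_demo.py.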

Linux / macOS

export DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

Windows PowerShell

$env:DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

Optional model override

export QWEN_MODEL_NAME="qwen-plus"

Recommended usage in code:

import os
api_key = os.getenv("DASHSCOPE_API_KEY")
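
A slightly more defensive variant fails early when the key is missing instead of passing `None` further down. The helper name is hypothetical, and using "qwen-plus" as the fallback model is an assumption borrowed from the override example above, not a documented default.

```python
import os


def load_dashscope_settings(env=None):
    """Read the API key and optional model name from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set; export it before running the demo")
    # Assumed fallback: "qwen-plus", mirroring the override example.
    model_name = env.get("QWEN_MODEL_NAME", "qwen-plus")
    return api_key, model_name


print(load_dashscope_settings({"DASHSCOPE_API_KEY": "your_api_key_here"}))
```

Accepting a mapping argument keeps the helper easy to test without touching the real process environment.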

🏋️ Training

This repository also contains training-related scripts and a launcher template.

Training files

scripts/think_train.py — training entry script
scripts/think_dataset.py — dataset loading / preprocessing
scripts/think_dataset_s2s.py — seq2seq-style dataset handling
think.sh — example distributed training launcher

Step 1: Prepare your dataset

Prepare your own training JSONL file and corresponding audio directory according to your local project layout.

Step 2: Modify `think.sh`

Before training, open think.sh and manually fill in all required paths for your own environment:

ROOT_DIR=""
MODEL_DIR=""
TOKEN2WAV_DIR=""
DATASET_PATH=""
AUDIO_ROOT=""
OUTPUT_DIR=""
LOG_DIR=""
DEEPSPEED_CONFIG=""

Step 3: Start training

After completing the path configuration, run:

bash think.sh

Notes

think.sh no longer contains private hardcoded absolute paths.
Please manually configure all local paths according to your machine.
If you are not using multi-GPU training, you can simplify the launcher further.

🧪 Minimal Usage Example

from runtime import DEFAULT_SYSTEM_PROMPT, VoxMind

model = VoxMind("/path/to/VoxMind")

tools = []
system_prompt = model.build_system_prompt(
    DEFAULT_SYSTEM_PROMPT,
    tools,
    extra_context={"current_city": "Beijing", "user_language": "en"},
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather like in Beijing today?"},
]

response = model.generate(
    messages,
    post_think_prefix="After careful reasoning, here is my detailed answer:\n",
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)

print(response.think)
print(response.answer)
print(model.parse_tool_calls(response.answer))
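
The example above stops after a single generation. A multi-round loop with observation feedback could look roughly like the sketch below, which runs against a stub instead of a real model so it can be executed anywhere. The JSON payload format, the `Observation:` prefix, and feeding observations back as a "user" turn are all assumptions. `run_agent` and `fake_generate` are hypothetical names.

```python
import json
import re
from types import SimpleNamespace

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def run_agent(generate, messages, tools, max_rounds=3):
    """Alternate generation and tool execution until no tool call remains."""
    messages = list(messages)
    for _ in range(max_rounds):
        response = generate(messages)
        payloads = TOOL_CALL_RE.findall(response.answer)
        if not payloads:
            return response.answer
        messages.append({"role": "assistant", "content": response.answer})
        for payload in payloads:
            call = json.loads(payload)
            observation = tools[call["name"]](**call["arguments"])
            # Assumption: observations are fed back as a "user" turn.
            messages.append({"role": "user", "content": "Observation: " + json.dumps(observation)})
    return response.answer


def fake_generate(messages):
    # Stub model: asks for the weather once, then answers from the observation.
    if any(m["content"].startswith("Observation:") for m in messages):
        return SimpleNamespace(think="", answer="It is sunny in Beijing.")
    call = {"name": "Get Weather", "arguments": {"city": "Beijing"}}
    return SimpleNamespace(think="", answer="<tool_call>" + json.dumps(call) + "</tool_call>")


tools = {"Get Weather": lambda city: {"city": city, "forecast": "sunny"}}
print(run_agent(fake_generate, [{"role": "user", "content": "Weather in Beijing?"}], tools))
```

With a real model, `generate` would wrap `model.generate(...)` with the sampling arguments shown earlier. `max_rounds` bounds the loop in case the model keeps requesting tools.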

📚 Citation

If this repository or its workflow design is helpful to your research, please cite or reference it appropriately.

@misc{liang2026voxmindendtoendagenticspoken,
      title={VoxMind: An End-to-End Agentic Spoken Dialogue System}, 
      author={Tianle Liang and Yifu Chen and Shengpeng Ji and Yijun Chen and Zhiyang Jia and Jingyu Lu and Fan Zhuo and Xueyi Pu and Yangzhuo Li and Zhou Zhao},
      year={2026},
      eprint={2604.15710},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2604.15710}, 
}
