Recent end-to-end spoken dialogue models have made natural voice interaction increasingly practical. However, as user requests become more complex and task-oriented, conversational ability alone is often not enough. To address real-world spoken tasks, these models must be equipped with agentic capabilities such as structured reasoning, tool use, and dynamic access to external functions.
VoxMind is an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Built around a Think-before-Speak paradigm, VoxMind enables the model to internalize structured reasoning before response generation, which improves planning, tool selection, and spoken answer quality. In addition, to alleviate the latency bottleneck introduced by large-scale tool integration, VoxMind includes a Multi-Agent Dynamic Tool Management architecture that asynchronously delegates tool retrieval to an auxiliary agent aligned with the main model's reasoning trajectory.
- Our paper has been accepted to ACL 2026 Main Conference!
- We have open-sourced the inference and training code.
- We have released the training data.
- We have made the model weights publicly available.
- Our preprint is now available on arXiv.
- Unified audio/text input workflow
- Built-in reasoning structure with <|THINK_START|> ... <|THINK_END|>
- Structured parsing from <tool_call>...</tool_call> blocks
- Multi-round observation feedback and follow-up reasoning
- Missing-tool discovery and dynamic tool injection
- Parallel retrieval design to reduce dynamic-agent latency
- Includes training-related scripts and a launcher template
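The two markers named in the feature list can be parsed along the following lines. This is a minimal sketch, not the repository's parser: the marker strings come from the list above, while the helper names `split_think` and `extract_tool_calls`, the JSON payload inside `<tool_call>`, and the `get_weather` tool name are assumptions made for illustration.

```python
import json
import re

# Marker strings are taken from the feature list; everything else is assumed.
THINK_RE = re.compile(r"<\|THINK_START\|>(.*?)<\|THINK_END\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def split_think(text):
    """Return (think, answer); think is '' when no reasoning block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def extract_tool_calls(answer):
    """Parse each <tool_call> body as JSON (assumed payload format)."""
    calls = []
    for body in TOOL_CALL_RE.findall(answer):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            continue  # assumption: malformed blocks are skipped, not raised
    return calls


raw = (
    "<|THINK_START|>The user wants weather, so call the weather tool.<|THINK_END|>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
)
think, answer = split_think(raw)
calls = extract_tool_calls(answer)
print(think)
print(calls)
```

In the repository itself, the equivalent step is exposed through `response.think`, `response.answer`, and `model.parse_tool_calls`, whose exact behavior may differ from this sketch.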
voxmind/
├── assets/                        # demo audio files
├── runtime/                       # runtime implementation
│   ├── model.py                   # VoxMind model wrapper
│   ├── response.py                # VoxMindResponse definition
│   └── prompts.py                 # default system prompt
├── scripts/                       # training-related scripts
│   ├── think_train.py             # training entry
│   ├── think_dataset.py           # training dataset processing
│   └── think_dataset_s2s.py       # seq2seq dataset processing
├── think.sh                       # example multi-GPU training launcher
├── tools.json                     # base tool definitions for agent_demo.py
├── 15tools.json                   # initial local tool cache for dynamic demo
├── 100tools.json                  # global tool pool for retrieval demo
├── agent_demo.py                  # fixed toolset multi-case demo
├── dynamic_tool_agent_demo.py     # dynamic retrieval + cache injection demo
└── README.md
- Hugging Face: leungtianle/VoxMind
- JSONL annotations: leungtianle/VoxMind-jsonl
- AgentChat dataset: leungtianle/AgentChat
- ModelScope speech data: BEISHUI/AgentChat
These resources correspond to the released model weights, structured JSONL data, and speech-side training assets used in the VoxMind pipeline.
Before running the demos or training scripts, prepare your environment first.
conda create -n voxmind python=3.10 -y
conda activate voxmind
pip install -r requirements.txt

If some runtime components or local model wrappers have extra dependencies in your environment, install them separately as needed.
Both demos rely on a local VoxMind model directory. Please update the model path in the scripts according to your own machine.
VoxMind currently provides two main demo scripts.
agent_demo.py demonstrates a standard fixed-tool workflow for speech / text reasoning agents. It covers the following cases:
- single tool call
- multi-tool decomposition in one request
- repeated use of the same tool with different arguments
- request outside the current toolset
- missing-tool suggestion followed by second-round reasoning
The demo contains several local mock tools, including:
- Get Weather
- Search Flights
- Search Hotels
- searchTools
Among them, searchTools is used to simulate the case where the current tool inventory is insufficient and the model needs a missing capability to continue.
python agent_demo.py

dynamic_tool_agent_demo.py demonstrates a more realistic dynamic tool-management pipeline.
Instead of exposing a full tool universe to the model at once, the script uses:
- a small local cache (15tools.json)
- a large global tool pool (100tools.json)
- an auxiliary retrieval model to recall the most relevant missing tools

The pipeline proceeds as follows:
- the model reasons using a limited local tool window
- the reasoning trace becomes a signal
- an auxiliary model retrieves top candidate tools from the global pool
- the retrieved tools are injected into the local cache
- the model performs a second reasoning stage with the updated tool context
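The retrieval step in this pipeline can be illustrated with a small stand-in. The sketch below does not use the auxiliary model: it ranks tools from the global pool by plain word overlap with the reasoning trace, which is a deliberately simple substitute. The function name `retrieve_top_k`, the `name` / `description` tool fields, and the example tools are all assumptions; the actual schema of 100tools.json may differ.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top_k(trace, global_pool, local_names, k=2):
    """Rank tools that are not cached yet by word overlap with the trace."""
    trace_tokens = tokenize(trace)
    scored = []
    for tool in global_pool:
        if tool["name"] in local_names:
            continue  # already visible to the model, nothing to retrieve
        tool_tokens = tokenize(tool["name"] + " " + tool["description"])
        overlap = len(trace_tokens & tool_tokens)
        if overlap > 0:
            scored.append((overlap, tool["name"], tool))
    # Highest overlap first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tool for _, _, tool in scored[:k]]


# Hypothetical tool definitions, for illustration only.
global_pool = [
    {"name": "get_weather", "description": "Look up the weather forecast for a city"},
    {"name": "convert_currency", "description": "Convert an amount between two currencies"},
    {"name": "book_taxi", "description": "Book a taxi ride to a destination"},
]
local_names = {"get_weather"}
trace = "No available tool can convert currency; I need a currency conversion capability."

hits = retrieve_top_k(trace, global_pool, local_names, k=1)
print([tool["name"] for tool in hits])
```

The retrieved tools would then be injected into the local cache before the second reasoning stage.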
ToolCache is implemented with OrderedDict and is responsible for:
- maintaining local tool-window size
- refreshing recency after use
- injecting only newly retrieved tools
- evicting older tools when capacity is exceeded
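The four responsibilities above map naturally onto an LRU-style cache. The following is a minimal sketch of that idea built on `collections.OrderedDict`; it is not the repository's ToolCache. The method names `touch`, `inject`, and `names`, and the assumption that each tool is a dict with a `name` key, are illustrative.

```python
from collections import OrderedDict


class ToolCache:
    """LRU-style tool window (sketch; method names are assumptions)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._tools = OrderedDict()  # oldest entry first, newest last

    def touch(self, name):
        """Refresh recency after use; return False if the tool is not cached."""
        if name not in self._tools:
            return False
        self._tools.move_to_end(name)
        return True

    def inject(self, tools):
        """Add only tools that are not cached yet; return the names added."""
        added = []
        for tool in tools:
            name = tool["name"]
            if name in self._tools:
                continue  # already present, so neither re-added nor reordered
            self._tools[name] = tool
            added.append(name)
            while len(self._tools) > self.capacity:
                self._tools.popitem(last=False)  # evict the least recently used
        return added

    def names(self):
        return list(self._tools)


cache = ToolCache(capacity=3)
cache.inject([{"name": "get_weather"}, {"name": "search_flights"}, {"name": "search_hotels"}])
cache.touch("get_weather")  # get_weather becomes the most recent entry
added = cache.inject([{"name": "get_weather"}, {"name": "convert_currency"}])
print(added, cache.names())
```

Here `search_flights` is evicted because it is the least recently used entry once `convert_currency` pushes the cache past its capacity.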
A notable feature is that retrieval runs in parallel with first-stage answer generation:
- produce think trace
- start retrieval thread using the capability trace
- continue answer generation
- wait for retrieval completion
- inject top-k tools if necessary
This makes the script a useful reference for latency-aware dynamic agents.
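The overlap between retrieval and answer generation can be sketched with a background thread. This is a simplified illustration under stated assumptions: `retrieve_tools` and `generate_answer` are hypothetical stand-ins that only sleep to simulate latency, and the returned tool is invented for the example.

```python
import threading
import time


def retrieve_tools(trace, result):
    """Stand-in for the auxiliary retrieval agent (hypothetical)."""
    time.sleep(0.05)  # simulated retrieval latency
    result["tools"] = [{"name": "convert_currency"}]


def generate_answer(trace):
    """Stand-in for first-stage answer generation (hypothetical)."""
    time.sleep(0.05)  # simulated decoding latency
    return "I cannot convert currency with the current tools."


def first_stage(trace, top_k=1):
    result = {}
    worker = threading.Thread(target=retrieve_tools, args=(trace, result))
    worker.start()                    # start retrieval once the think trace exists
    answer = generate_answer(trace)   # answer generation overlaps with retrieval
    worker.join()                     # wait for retrieval completion
    return answer, result.get("tools", [])[:top_k]


start = time.perf_counter()
answer, tools = first_stage("need a currency conversion capability")
elapsed = time.perf_counter() - start
print(answer)
print([tool["name"] for tool in tools], f"{elapsed:.2f}s")
```

Because the two simulated steps run concurrently, the total time is close to the longer of the two rather than their sum. Whether the retrieved tools are injected afterwards is left to the caller, matching the "if necessary" step above.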
python dynamic_tool_agent_demo.py

Both demos default to audio input:
INPUT_MODE = "audio"

If you want plain-text testing, change it to:
INPUT_MODE = "text"

Both scripts inject runtime metadata through extra_context, for example:
EXTRA_CONTEXT = {
    "current_city": "Beijing",
    "user_language": "en",
}

This is useful when building system prompts with current context information.
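One way such metadata could end up in a system prompt is shown below. This is only a sketch of the general idea: `render_context` is a hypothetical helper, and the "Current context" layout is an assumption, not the format produced by the repository's `build_system_prompt`.

```python
def render_context(base_prompt, extra_context=None):
    """Append runtime metadata to a prompt as 'key: value' lines (assumed layout)."""
    if not extra_context:
        return base_prompt
    lines = [f"- {key}: {value}" for key, value in extra_context.items()]
    return base_prompt.rstrip() + "\n\nCurrent context:\n" + "\n".join(lines)


EXTRA_CONTEXT = {
    "current_city": "Beijing",
    "user_language": "en",
}

prompt = render_context("You are a helpful voice assistant.", EXTRA_CONTEXT)
print(prompt)
```

Keeping the context in a separate dict means the base prompt stays reusable while per-session details change.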
The dynamic demo reads a DashScope API key from the DASHSCOPE_API_KEY environment variable. Please configure your own key locally.

Linux / macOS:
export DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

Windows PowerShell:
$env:DASHSCOPE_API_KEY="your_api_key_here"
python dynamic_tool_agent_demo.py

The Qwen model name can be set the same way:
export QWEN_MODEL_NAME="qwen-plus"

Recommended usage in code:
api_key = os.getenv("DASHSCOPE_API_KEY")

This repository also contains training-related scripts and a launcher template.
- scripts/think_train.py: training entry script
- scripts/think_dataset.py: dataset loading / preprocessing
- scripts/think_dataset_s2s.py: seq2seq-style dataset handling
- think.sh: example distributed training launcher
Prepare your own training JSONL file and corresponding audio directory according to your local project layout.
Before training, open think.sh and manually fill in all required paths for your own environment:
ROOT_DIR=""
MODEL_DIR=""
TOKEN2WAV_DIR=""
DATASET_PATH=""
AUDIO_ROOT=""
OUTPUT_DIR=""
LOG_DIR=""
DEEPSPEED_CONFIG=""

After completing the path configuration, run:
bash think.sh

- think.sh no longer contains private hardcoded absolute paths.
- Please manually configure all local paths according to your machine.
- If you are not using multi-GPU training, you can simplify the launcher further.
A minimal usage example:
from runtime import DEFAULT_SYSTEM_PROMPT, VoxMind
model = VoxMind("/path/to/VoxMind")
tools = []
system_prompt = model.build_system_prompt(
DEFAULT_SYSTEM_PROMPT,
tools,
extra_context={"current_city": "Beijing", "user_language": "en"},
)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What's the weather like in Beijing today?"},
]
response = model.generate(
messages,
post_think_prefix="After careful reasoning, here is my detailed answer:\n",
max_new_tokens=512,
temperature=0.6,
top_p=0.9,
do_sample=True,
)
print(response.think)
print(response.answer)
print(model.parse_tool_calls(response.answer))

If this repository or its workflow design is helpful to your research, please cite or reference it appropriately.
@misc{liang2026voxmindendtoendagenticspoken,
title={VoxMind: An End-to-End Agentic Spoken Dialogue System},
author={Tianle Liang and Yifu Chen and Shengpeng Ji and Yijun Chen and Zhiyang Jia and Jingyu Lu and Fan Zhuo and Xueyi Pu and Yangzhuo Li and Zhou Zhao},
year={2026},
eprint={2604.15710},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2604.15710},
}