Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Zhe Hu¹, Jing Li¹, Zhongzhu Pu²,³, Hou Pong Chan⁴, Yu Yin⁵

¹The Hong Kong Polytechnic University, ²Tsinghua University, ³InspireOmni AI

⁴Alibaba Group, ⁵Case Western Reserve University

🌟 This repo contains code and data for Praxis-VLM, which leverages textual GRPO training for vision-grounded decision making.

🎉 Updates

[2025-09] Praxis-VLM is accepted at NeurIPS 2025!
[2025-06] Training code of Praxis-VLM is released.
[2025-05] Check out our paper on arXiv.

Overview

We introduce Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM applies the GRPO algorithm to textual scenarios to instill robust reasoning capabilities. These reasoning skills, acquired purely from text, transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Praxis-VLM outperforms both vanilla VLMs and SFT baselines, and generalizes well across the VIVA, PCA-Bench, and EgoNormia benchmarks.
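For readers unfamiliar with GRPO, the sketch below illustrates its central idea: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation to obtain an advantage. This is a minimal illustration and not the training code in this repo; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one textual scenario: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative advantages; when every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal.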

📚 Training Data Curation

Praxis-VLM's text-driven training relies on a carefully curated dataset designed to instill robust reasoning and decision-making skills. The dataset has the following key features:

Challenging Scenarios: The situations and questions are crafted to be sufficiently complex, necessitating multi-step reasoning to arrive at the optimal decision.
Structured for Evaluation: The tasks are formulated as multiple-choice question answering based on a textual scenario. This structure allows straightforward evaluation with rule-based metrics, which avoids complex reward modeling and reduces the risk of reward hacking.
Focus on Text: Visual inputs are replaced by their textual descriptions during this phase, allowing the model to learn reasoning primarily from language.
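To make the multiple-choice structure concrete, here is a minimal sketch of what a text-only sample and a rule-based correctness check could look like. The sample content, the field names, and the `<answer>...</answer>` output format are all assumptions made for illustration; they are not the released data schema or the reward code shipped in this repo.

```python
import re

# Hypothetical sample: the field names and content are illustrative only.
sample = {
    "scenario": "A cyclist has fallen on a busy road and is not moving.",
    "question": "What is the most appropriate action?",
    "options": {
        "A": "Keep walking.",
        "B": "Call emergency services and warn oncoming traffic.",
        "C": "Take a photo.",
    },
    "answer": "B",
}

# Assumed output format: the chosen option letter appears at the start of
# an <answer> tag, optionally in parentheses, and is not part of a word.
_CHOICE = re.compile(r"<answer>\s*\(?([A-Z])\)?(?![A-Za-z])")


def extract_choice(response):
    """Return the option letter found in the response, or None."""
    match = _CHOICE.search(response)
    return match.group(1) if match else None


def accuracy_reward(response, gold):
    """Rule-based reward: 1.0 if the extracted letter equals gold, else 0.0."""
    return 1.0 if extract_choice(response) == gold else 0.0


print(accuracy_reward("<think>They may be hurt.</think><answer>B</answer>", sample["answer"]))
```

Because the check is a plain string comparison, no learned reward model is involved, which is the property the list above highlights.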

✨ Model Training

We use Qwen2.5-VL 3B and 7B as the base models and rely on Easy-R1 for the GRPO implementation. For installation, please refer to the original Easy-R1 library.

For model training:

Math Cold-start Training:

bash examples/qwen2_5_vl_7b_geo3k_grpo.sh

Text-driven RL Training:

bash examples/qwen2_5_vl_7b_mcq_grpo.sh

To change the reward function or the weight of each reward component, modify the reward/mcq.py file.
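As a rough illustration of how weighted reward components can be combined into a single scalar, consider the sketch below. It does not reproduce the contents of reward/mcq.py: the component names (`accuracy`, `format`), the weight values, and the function name are hypothetical.

```python
def combine_rewards(components, weights):
    """Return the weighted sum of named reward components.

    Both arguments map a component name to a float. Every component
    must have a weight; otherwise a KeyError is raised.
    """
    missing = set(components) - set(weights)
    if missing:
        raise KeyError(f"no weight for components: {sorted(missing)}")
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical weights; the values are placeholders, not the repo's settings.
WEIGHTS = {"accuracy": 0.8, "format": 0.2}

# A response with the right answer but a malformed output format.
score = combine_rewards({"accuracy": 1.0, "format": 0.0}, WEIGHTS)
print(score)
```

Keeping the weights in one dictionary makes it easy to try different trade-offs between components without touching the scoring logic.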

Model Inference

Here we use the VIVA benchmark as an example. For PCA-Bench and EgoNormia, you can download the data from the original hub. We use vLLM for inference.

cd scripts
python3 predict_praxis_vlm_vllm.py

Evaluation

We report accuracy as the evaluation metric:

cd scripts
python3 evaluation.py
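For reference, accuracy on a multiple-choice benchmark is the fraction of questions where the predicted option matches the gold option. The sketch below is a minimal stand-alone version and not the logic of evaluation.py; the function name and the handling of missing predictions (counted as wrong) are assumptions.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the gold option letters.

    A prediction of None (no answer could be parsed) counts as wrong.
    The comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    correct = sum(
        pred is not None and pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, references)
    )
    return correct / len(references)


result = accuracy(["A", "b", None, "D"], ["A", "B", "C", "C"])
print(result)
```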

Citation

@article{hu2025praxis,
  title={Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning},
  author={Hu, Zhe and Li, Jing and Pu, Zhongzhu and Chan, Hou Pong and Yin, Yu},
  journal={arXiv preprint arXiv:2503.16965},
  year={2025}
}
