🌟 This repo contains the code and data for Praxis-VLM, which leverages textual GRPO training for vision-grounded decision making.
- [2025-09] Praxis-VLM is accepted at NeurIPS 2025!
- [2025-06] Training code of Praxis-VLM is released.
- [2025-05] Check out our paper on arXiv.
We introduce Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Praxis-VLM outperforms both vanilla VLMs and SFT baselines and generalizes well across the VIVA, PCA-Bench, and EgoNormia benchmarks.
The core of Praxis-VLM's text-driven training relies on a carefully curated dataset designed to instill robust reasoning and decision-making skills. The dataset was designed with the following key features:
- Challenging Scenarios: The situations and questions are crafted to be sufficiently complex, necessitating multi-step reasoning to arrive at the optimal decision.
- Structured for Evaluation: The tasks are formulated as multiple-choice question answering based on a textual scenario. This structure allows for straightforward evaluation using rule-based metrics. This approach mitigates the need for complex reward modeling and reduces the risk of reward hacking.
- Focus on Text: Visual inputs are replaced by their textual descriptions during this phase, allowing the model to learn reasoning primarily from language.
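Because each task is a multiple-choice question, a rule-based check is enough to score a response. The snippet below is a minimal sketch of that idea, not the reward used in this repo: the `Answer: X` response format, the option range A to E, and the function names `extract_choice` and `accuracy_reward` are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical answer format: the response states its choice as "Answer: B"
# or "Answer: (B)". The format used by Praxis-VLM may differ.
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*:\s*\(?([A-E])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted option equals the gold option, else 0.0."""
    predicted = extract_choice(response)
    return 1.0 if predicted == gold.strip().upper() else 0.0
```

Taking the last stated option means a response that revises its choice mid-reasoning is scored on its final answer. A response with no parsable option scores 0.0.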
We use Qwen2.5-VL 3B and 7B as the base models. For model training, we rely on Easy-R1 for the GRPO implementation. For installation, please refer to the original Easy-R1 library.
For model training:
- Math Cold-start Training:
bash examples/qwen2_5_vl_7b_geo3k_grpo.sh
- Text-driven RL Training:
bash examples/qwen2_5_vl_7b_mcq_grpo.sh
To change the reward function or the weight of each reward component, edit the reward/mcq.py file.
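As a rough illustration of what weighting reward components can look like, the sketch below combines two components into one scalar. It does not reproduce reward/mcq.py: the component names (`format`, `accuracy`), the weight values, the `<think>...</think>` response template, and the function signatures are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical weights; names and values are illustrative only.
REWARD_WEIGHTS: Dict[str, float] = {"format": 0.2, "accuracy": 0.8}

# Hypothetical template: reasoning inside <think> tags, then one option letter.
FORMAT_PATTERN = re.compile(r"<think>.+?</think>\s*\(?([A-E])\)?", re.DOTALL)


def component_scores(response: str, gold: str) -> Dict[str, float]:
    """Score each component separately, each in {0.0, 1.0}."""
    match = FORMAT_PATTERN.fullmatch(response.strip())
    format_score = 1.0 if match else 0.0
    # Accuracy is only credited when the response follows the template.
    accuracy_score = 1.0 if match and match.group(1) == gold else 0.0
    return {"format": format_score, "accuracy": accuracy_score}


def total_reward(response: str, gold: str,
                 weights: Dict[str, float] = REWARD_WEIGHTS) -> float:
    """Weighted sum of the component scores."""
    scores = component_scores(response, gold)
    return sum(weights[name] * scores[name] for name in weights)
```

Under these assumed weights, a well-formatted but wrong response earns only the format share, so changing the values in `REWARD_WEIGHTS` shifts how strongly correctness is rewarded relative to following the template.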
Here we use the VIVA benchmark as an example. For PCA-Bench and EgoNormia, you can download the data from the original hub. We use vLLM for inference.
cd scripts
python3 predict_praxis_vlm_vllm.py
We use accuracy for model performance evaluation:
cd scripts
python3 evaluation.py
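For reference, accuracy over multiple-choice predictions is the fraction of examples whose predicted option matches the gold option. The sketch below shows that computation only; it is not the content of evaluation.py, and the record layout with `prediction` and `answer` keys is an assumption.

```python
from typing import Iterable, Mapping


def accuracy(records: Iterable[Mapping[str, str]]) -> float:
    """Fraction of records whose predicted option equals the gold option."""
    records = list(records)
    if not records:
        raise ValueError("no records to evaluate")
    correct = sum(
        # Compare option letters case-insensitively, ignoring stray whitespace.
        r["prediction"].strip().upper() == r["answer"].strip().upper()
        for r in records
    )
    return correct / len(records)


# Hypothetical predictions: two of four match.
example = [
    {"prediction": "A", "answer": "A"},
    {"prediction": "b", "answer": "B"},
    {"prediction": "C", "answer": "D"},
    {"prediction": "E", "answer": "A"},
]
print(accuracy(example))  # 0.5
```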
@article{hu2025praxis,
title={Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning},
author={Hu, Zhe and Li, Jing and Pu, Zhongzhu and Chan, Hou Pong and Yin, Yu},
journal={arXiv preprint arXiv:2503.16965},
year={2025}
}
