Derekkk/Praxis-VLM

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Zhe Hu1,  Jing Li1,  Zhongzhu Pu2,3,  Hou Pong Chan4,  Yu Yin5
1The Hong Kong Polytechnic University, 2Tsinghua University, 3InspireOmni AI
4Alibaba Group, 5Case Western Reserve University

🌟 This repo contains the code and data for Praxis-VLM, which leverages text-based GRPO training for vision-grounded decision making.

🎉 Updates

  • [2025-09] Praxis-VLM is accepted at NeurIPS 2025!
  • [2025-06] Training code of Praxis-VLM is released.
  • [2025-05] Check out our paper on arXiv.

Overview

We introduce Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM applies the GRPO algorithm to textual scenarios to instill robust reasoning capabilities. These reasoning skills, acquired purely from text, transfer to multimodal inference with visual inputs, which significantly reduces reliance on scarce paired image-text training data. Praxis-VLM outperforms both the vanilla VLMs and SFT baselines, and generalizes well across the VIVA, PCA-Bench, and EgoNormia benchmarks.
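As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several responses sampled for the same scenario against that group's own mean and standard deviation. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this example; they are not taken from this repo's training code, and actual implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed value) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one textual scenario; the first and last were correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, so no separate value model is needed to provide a baseline.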

📚 Training Data Curation

Praxis-VLM's text-driven training relies on a carefully curated dataset designed to instill robust reasoning and decision-making skills. The dataset has the following key features:

  • Challenging Scenarios: The situations and questions are crafted to be sufficiently complex, necessitating multi-step reasoning to arrive at the optimal decision.
  • Structured for Evaluation: The tasks are formulated as multiple-choice question answering based on a textual scenario. This structure allows straightforward evaluation with rule-based metrics, which avoids the need for complex reward modeling and reduces the risk of reward hacking.
  • Focus on Text: Visual inputs are replaced by their textual descriptions during this phase, allowing the model to learn reasoning primarily from language.
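The rule-based evaluation described above can be sketched as a simple answer-matching reward. This is a minimal sketch under stated assumptions: the `<answer>…</answer>` tag format, the helper names `extract_choice` and `accuracy_reward`, and the 0/1 reward values are all hypothetical and chosen for illustration; the repo's actual reward logic lives in its own reward file and may differ.

```python
import re

# Assumed output format: the final choice letter is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*\(?([A-Z])\)?\s*</answer>", re.IGNORECASE
)


def extract_choice(response):
    """Return the last choice letter found in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def accuracy_reward(response, gold):
    """Return 1.0 if the extracted choice equals the gold letter, else 0.0."""
    return 1.0 if extract_choice(response) == gold.strip().upper() else 0.0
```

Because the reward depends only on an exact match against a known answer letter, there is no learned reward model for the policy to exploit.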

✨ Model Training

We use Qwen2.5-VL 3B and 7B as the base models and Easy-R1 for the GRPO implementation. For installation, please refer to the original Easy-R1 library.

For model training:

  • Math Cold-start Training:
bash examples/qwen2_5_vl_7b_geo3k_grpo.sh
  • Text-driven RL Training:
bash examples/qwen2_5_vl_7b_mcq_grpo.sh

To change the reward function or the weight of each reward component, modify the reward/mcq.py file.
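To illustrate what weighting reward components can look like, here is a minimal sketch that combines a format check with an accuracy score. Everything in it is an assumption for illustration: the `<think>`/`<answer>` template, the function names `format_reward` and `combine_rewards`, and the example weights are hypothetical and do not describe the actual contents of reward/mcq.py.

```python
import re

# Assumed response template: reasoning in <think>, final choice in <answer>.
THINK_ANSWER = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_reward(response):
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if THINK_ANSWER.fullmatch(response) else 0.0


def combine_rewards(components, weights):
    """Weighted average of named reward components.

    `components` and `weights` are dicts keyed by component name; the
    result is normalized by the total weight so it stays in [0, 1] when
    each component is in [0, 1].
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total_weight


# Hypothetical weights: accuracy dominates, format acts as a small bonus.
score = combine_rewards(
    {"accuracy": 1.0, "format": 0.0},
    {"accuracy": 0.9, "format": 0.1},
)
```

Normalizing by the total weight keeps the overall reward scale stable when the individual weights are tuned.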

Model Inference

Here we use the VIVA benchmark as an example. For PCA-Bench and EgoNormia, you can download the data from their original hubs. We use vLLM for inference.

cd scripts
python3 predict_praxis_vlm_vllm.py

Evaluation

We use accuracy for model performance evaluation:

cd scripts
python3 evaluation.py
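For reference, accuracy over multiple-choice predictions can be computed as in the sketch below. The function name `accuracy` and the case-insensitive comparison of choice letters are assumptions for this example and are not taken from evaluation.py.

```python
def accuracy(predictions, references):
    """Fraction of predicted choice letters that match the gold letters."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, references)
    )
    return correct / len(references)


# Two of three predictions match the gold answers.
result = accuracy(["A", "b", "C"], ["A", "B", "D"])
```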

Citation

@article{hu2025praxis,
  title={Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning},
  author={Hu, Zhe and Li, Jing and Pu, Zhongzhu and Chan, Hou Pong and Yin, Yu},
  journal={arXiv preprint arXiv:2503.16965},
  year={2025}
}

About

[NeurIPS 2025] Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
