ReMax is a lightweight policy-gradient method that uses a single greedy-decoded baseline response per prompt to reduce variance without a critic.
Reference: ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models.
| Script | Infer | Train | Platform |
|---|---|---|---|
run_qwen3_8b_fsdp.sh |
vLLM | FSDP | NVIDIA |
run_qwen2.5_math_7b_fsdp_sync.sh |
vLLM | FSDP+Sync | NVIDIA |
Override any argument via env vars at the top of the script.
algorithm.adv_estimator=remaxactor_rollout_ref.actor.use_kl_loss=Falseandalgorithm.use_kl_in_reward=True