Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
run_qwen2.5_math_7b_sync_fsdp.sh	run_qwen2.5_math_7b_sync_fsdp.sh
run_qwen3_8b_fsdp.sh	run_qwen3_8b_fsdp.sh

Name

Last commit message

Last commit date

ReMax

ReMax is a lightweight policy-gradient method that uses a single greedy-decoded baseline response per prompt to reduce variance without a critic.

Canonical Scripts

Script	Infer	Train	Platform
`run_qwen3_8b_fsdp.sh`	vLLM	FSDP	NVIDIA
`run_qwen2.5_math_7b_fsdp_sync.sh`	vLLM	FSDP+Sync	NVIDIA

Override any argument via env vars at the top of the script.

algorithm.adv_estimator=remax
actor_rollout_ref.actor.use_kl_loss=False and algorithm.use_kl_in_reward=True