Question regarding reward calculation #151

@Derekkk

Thanks for your great work! From the code, it appears that during training the answer is extracted directly with the regex r"<answer>(.*?)</answer>" and the reward is computed from accuracy alone. A format reward (e.g., one enforcing the <think></think><answer></answer> structure) does not seem to be part of the reward function.

If this is the case, would applying RL directly to Qwen/Qwen2.5-7B without an explicit format reward reduce training efficiency or stability? Thanks!
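
To make the question concrete, here is a minimal sketch of the two reward styles being compared. It is not the repository's code: only the answer regex comes from the source, while the function names, the exact-match comparison, the re.DOTALL flag, and the format_weight value are assumptions for illustration.

```python
import re

# Answer regex as quoted in the issue; re.DOTALL is an assumption so that
# multi-line answers are also captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Hypothetical format check: the whole response must start with a <think>
# block and end with an <answer> block, with only whitespace around them.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def extract_answer(response):
    """Return the stripped text of the first <answer> block, or None."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None


def accuracy_reward(response, gold):
    """1.0 if the extracted answer equals the gold answer exactly, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0


def format_reward(response):
    """1.0 if the response follows the <think>/<answer> layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def combined_reward(response, gold, format_weight=0.1):
    """Accuracy plus a small format bonus (the weight is an arbitrary choice)."""
    return accuracy_reward(response, gold) + format_weight * format_reward(response)
```

With accuracy_reward alone, a response such as "<answer>42</answer>" with no <think> block earns full reward, whereas combined_reward would additionally favor responses that keep the full structure. Note that even the accuracy-only variant rewards the <answer> tags implicitly, since a response without them scores 0.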
