Hi team,
I'm currently working on RL fine-tuning using the V-Interaction-400K dataset, and I encountered a critical format issue that blocks the training process. I would greatly appreciate your help with the following:
Background:
During the SFT phase, I used a single-turn conversation format for the messages field, and the model could output expected responses without any errors.
However, when switching to RL fine-tuning (following the same single-turn structure but replacing the assistant role content with the solution field from the dataset), the training consistently throws errors (related to message structure validation and reward calculation failures).
Core Question:
Could you confirm the correct messages format for V-Interaction-400K in RL training? Specifically:
Should it use single-turn conversation (like SFT) or multi-turn conversation?
If single-turn is required, are there any differences from the SFT format (e.g., role naming, content structure, or additional fields)?
Example of My Current Single-Turn Format (SFT-Successful)
For reference, here's the format that worked in SFT:
{
"messages": [
{
"role": "user",
"content": "
\n[Problem description with mathematical expressions and choices]"
},
{
"role": "assistant",
"content": "[Geometric reasoning process]\n\n\n'''python\n[Image processing/geometry visualization code]\n'''\n\n<sandbox_output>
</sandbox_output>\n\n[Calculation and conclusion logic]\n[Reasoning content]\n\n[final answer]
Hi team,
I'm currently working on RL fine-tuning using the V-Interaction-400K dataset, and I encountered a critical format issue that blocks the training process. I would greatly appreciate your help with the following:
Background:
During the SFT phase, I used a single-turn conversation format for the messages field, and the model could output expected responses without any errors.
However, when switching to RL fine-tuning (following the same single-turn structure but replacing the assistant role content with the solution field from the dataset), the training consistently throws errors (related to message structure validation and reward calculation failures).
Core Question:
Could you confirm the correct messages format for V-Interaction-400K in RL training? Specifically:
Should it use single-turn conversation (like SFT) or multi-turn conversation?
If single-turn is required, are there any differences from the SFT format (e.g., role naming, content structure, or additional fields)?
Example of My Current Single-Turn Format (SFT-Successful)
For reference, here's the format that worked in SFT:
{
\n[Problem description with mathematical expressions and choices]"
</sandbox_output>\n\n[Calculation and conclusion logic]\n[Reasoning content]\n\n[final answer]
"messages": [
{
"role": "user",
"content": "
},
{
"role": "assistant",
"content": "[Geometric reasoning process]\n\n
\n'''python\n[Image processing/geometry visualization code]\n'''\n\n<sandbox_output>