json/jsonl example of selected 3000 examples from V-Interaction-400K for RL training #4

@sjy-1995

Hi team,

I'm currently working on RL fine-tuning using the V-Interaction-400K dataset, and I encountered a critical format issue that blocks the training process. I would greatly appreciate your help with the following:

Background:
During the SFT phase, I used a single-turn conversation format for the messages field, and the model could output expected responses without any errors.
However, when I switched to RL fine-tuning (keeping the same single-turn structure but replacing the assistant content with the solution field from the dataset), training consistently fails with errors in message structure validation and reward calculation.
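
For concreteness, the conversion described above can be sketched roughly as follows. This is only an illustration, not the exact script used: the helper names `to_rl_record` and `check_single_turn` are hypothetical, and it assumes each record is a dict with a `messages` list and that the dataset's `solution` field is a plain string. The check only confirms the user-then-assistant shape; it says nothing about what the RL trainer actually expects.

```python
import copy
import json


def to_rl_record(sft_record, solution):
    """Copy an SFT record and replace the assistant content with `solution`.

    Hypothetical helper: field names follow the example in this issue.
    """
    record = copy.deepcopy(sft_record)
    for message in record["messages"]:
        if message["role"] == "assistant":
            message["content"] = solution
    return record


def check_single_turn(record):
    """Return a list of structural problems; an empty list means user -> assistant."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]
    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")
    for i, m in enumerate(messages):
        if not isinstance(m, dict) or not isinstance(m.get("content"), str):
            problems.append(f"message {i} has no string 'content'")
    return problems


# Placeholder record mirroring the SFT format shown in this issue.
sft_example = {
    "messages": [
        {"role": "user", "content": "[Problem description]"},
        {"role": "assistant", "content": "[Reasoning and final answer]"},
    ]
}

rl_example = to_rl_record(sft_example, "[solution field from the dataset]")
print(json.dumps(rl_example, indent=2))
print(check_single_turn(rl_example))
```

Running a check like this over the converted file at least separates malformed records from a genuine format mismatch with the RL pipeline.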

Core Question:
Could you confirm the correct messages format for V-Interaction-400K in RL training? Specifically:
Should it use single-turn conversation (like SFT) or multi-turn conversation?
If single-turn is required, are there any differences from the SFT format (e.g., role naming, content structure, or additional fields)?

Example of My Current Single-Turn Format (SFT-Successful)
For reference, here's the format that worked in SFT:

{
  "messages": [
    {
      "role": "user",
      "content": "\n[Problem description with mathematical expressions and choices]"
    },
    {
      "role": "assistant",
      "content": "[Geometric reasoning process]\n\n\n'''python\n[Image processing/geometry visualization code]\n'''\n\n<sandbox_output></sandbox_output>\n\n[Calculation and conclusion logic]\n[Reasoning content]\n\n[final answer]
