Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 62 additions & 18 deletions fern/versions/latest/pages/reference/cli-commands.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -252,23 +252,6 @@ gym dataset migrate \
--revision 0.0.1
```

### `gym dataset render`

Generate a dataset preview by materializing prompts from a raw input file and a prompt template.

| Option | Description |
| --- | --- |
| `--input`, `-i` | Raw input JSONL file. |
| `--prompt-config` | Prompt template YAML to apply. |
| `--output`, `-o` | Output JSONL file. |

```bash
gym dataset render \
--input raw.jsonl \
--prompt-config prompt.yaml \
--output preview.jsonl
```

### `gym dataset collate`

Validate and collate a dataset, generating metrics and statistics.
Expand All @@ -289,6 +272,33 @@ gym dataset collate \
--mode example_validation
```

### `gym dataset render`

Generate a dataset preview by materializing prompts from a raw input file and a prompt template, producing JSONL with populated `responses_create_params.input` for RL training.

Each input row must **not** already have a populated `responses_create_params.input`; the command applies the prompt template from `--prompt-config` to each row, fills in the input, and preserves the row's other fields.

| Option | Description |
| --- | --- |
| `--input`, `-i` | Raw input JSONL file (rows without `responses_create_params.input`). |
| `--prompt-config` | Prompt template YAML to apply. |
| `--output`, `-o` | Output JSONL file. |

```bash
gym dataset render \
--input raw.jsonl \
--prompt-config prompt.yaml \
--output preview.jsonl
```

<Note>
**Which data-preparation command should I use?**

- **`gym dataset render`** — a focused, standalone step that applies a prompt template to raw rows to populate `responses_create_params.input`. No servers are started. Use it when you have raw data and just need to turn it into prompt-ready rows.
- **`gym dataset collate`** — the full preparation pipeline for training: it can download missing datasets, validate data, and compute dataset metrics, writing train/validation splits and metrics artifacts. Use it to prepare and validate datasets for training or PR submission.
</Note>


---

## Environments
Expand Down Expand Up @@ -444,7 +454,7 @@ Collate data, start the servers, and collect rollouts. This is the main evaluati
| `--model-type NAME` | Load the named model server type config. |
| `--search-dir DIR` | Extra root directory to search for named components. Repeatable. |
| `--no-serve` | Collect against already-running servers instead of starting them. |
| `--resume` | Resume from cached rollouts instead of recollecting. |
| `--resume` | Resume an interrupted run, skipping rows already completed. See [Resume Interrupted Runs](#resume-interrupted-runs). |
| `--agent`, `-a` | Agent to collect rollouts with. |
| `--input`, `-i` | Input tasks JSONL file. |
| `--output`, `-o` | Output rollouts JSONL file. |
Expand Down Expand Up @@ -488,6 +498,40 @@ gym eval run --no-serve \
--num-repeats '{agent_alpha: 4, agent_beta: 1, _default: 1}'
```

#### Generation Parameters

The most common sampling parameters have dedicated flags on `gym eval run`: `--temperature`, `--top-p`, and `--max-output-tokens`. These map onto `responses_create_params.temperature`, `.top_p`, and `.max_output_tokens` respectively.

Any other `responses_create_params` field that has no dedicated flag can be set with a raw Hydra override using the `++responses_create_params.<field>` syntax (see [Hydra overrides](#hydra-overrides-escape-hatch)). Overrides are merged into each input row's existing `responses_create_params` with a **shallow** merge (top-level keys only):

```bash
gym eval run --no-serve \
--agent example_single_tool_call_simple_agent \
--input weather_query.jsonl \
--output weather_rollouts.jsonl \
--temperature 1.0 \
--top-p 1.0 \
--max-output-tokens 4096 \
++responses_create_params.reasoning.effort=low
```

Because the merge is shallow, setting a field inside a nested object (for example, `++responses_create_params.reasoning.effort=low`) replaces the row's entire nested dict at that key — other fields under the same nested object are not preserved.

#### Resume Interrupted Runs

Passing `--resume` to `gym eval run` lets you restart the **same command** after a crash or interruption and pick up only the rows that have not finished yet. It works whether or not you pass `--no-serve`.

How it works:

- **Materialized inputs.** On the first run, the fully expanded input rows (after `--num-repeats`, `--limit`, `--prompt-config`, and any overrides) are written to a sidecar file next to your output. The path is derived from the output path by appending `_materialized_inputs` to the stem — so `rollouts.jsonl` produces `rollouts_materialized_inputs.jsonl`.
- **Incremental output.** Results are flushed to the output file after each completed rollout, so partial output survives a crash.
- **Matching.** On resume, completed work is matched by `(task_index, rollout_index)` against the materialized inputs, and already-completed rows are skipped. The run prints a summary such as the number of original input rows, rows already done, and rows that still need to be run.
- **Fallback.** If either the materialized inputs or the output file is missing, resume is skipped and the run starts fresh. Without `--resume`, existing output is cleared before the run.

<Warning>
If you change the config, schema, or data between runs, the materialized inputs become stale and resume will diff against the old expansion. Delete the `*_materialized_inputs.jsonl` file (and the output file) to start fresh.
</Warning>

### `gym eval aggregate`

Merge sharded rollout results into a single rollouts file with aggregate metrics.
Expand Down
Loading