Add curator-to-SFT JSONL converter#62

Merged
lfengad merged 2 commits into main from add-curator-to-sft-jsonl on Jun 29, 2026

Conversation

@lfengad (Collaborator) commented Jun 26, 2026

Add cosmos_framework.scripts.curator_to_sft_jsonl, which converts cosmos-curator splitting-pipeline metas_jsonl output directly into the SFT training JSONL format. The converter applies the same hard filters that sft_dataset.py applies silently at train time (duration > 61.0s, per-window frames < 61, optional short-edge), so dataset counts match what training actually consumes. It emits a sidecar <output>.summary.json with per-reason drop counts and rewrites vision_path relative to the JSONL so datasets stay portable across mounts.
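
For readers unfamiliar with the flow, here is a minimal sketch of the filter-and-summarize idea described above. It is not the code in this PR. The two thresholds come from the description; the meta keys (`path`, `duration`, `num_frames`, `width`, `height`, `caption`), the drop-reason names, the single per-clip frame count, and the choice to append `.summary.json` to the full output path are all assumptions made for illustration.

```python
import json
import os
from collections import Counter

# Thresholds quoted in the PR description; everything else is illustrative.
MAX_DURATION_S = 61.0
MIN_WINDOW_FRAMES = 61


def drop_reason(meta, min_short_edge=None):
    """Return a drop-reason string, or None if the clip is kept.

    The meta keys used here are hypothetical, not the curator schema.
    """
    if meta["duration"] > MAX_DURATION_S:
        return "duration_too_long"
    if meta["num_frames"] < MIN_WINDOW_FRAMES:
        return "too_few_frames"
    if min_short_edge is not None:
        if min(meta["width"], meta["height"]) < min_short_edge:
            return "short_edge_too_small"
    return None


def convert(metas, out_jsonl, min_short_edge=None):
    """Write kept records to out_jsonl and drop counts to a sidecar file."""
    out_dir = os.path.dirname(os.path.abspath(out_jsonl))
    drops = Counter()
    kept = 0
    with open(out_jsonl, "w") as f:
        for meta in metas:
            reason = drop_reason(meta, min_short_edge)
            if reason is not None:
                drops[reason] += 1
                continue
            # Store the video path relative to the JSONL's directory so the
            # dataset can be moved or mounted elsewhere as a unit.
            rel = os.path.relpath(os.path.abspath(meta["path"]), out_dir)
            record = {"vision_path": rel, "caption": meta.get("caption", "")}
            f.write(json.dumps(record) + "\n")
            kept += 1
    summary = {"kept": kept, "dropped": dict(drops)}
    # Assumed sidecar naming: the output path with ".summary.json" appended.
    with open(out_jsonl + ".summary.json", "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```

Counting drops per reason at conversion time is what lets the JSONL line count be compared directly with the number of samples training will see.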

Document this conversion workflow as a new "Create Dataset from a Cosmos-Curator output directory" section in docs/dataset_jsonl.md.

Ported from imaginaire4 MR 9217: cosmos3.scripts -> cosmos_framework.scripts, OSS SPDX header, and stale sft_dataset.py line refs corrected to 548-550.

Verified: 24/24 tests pass, ruff check/format clean, CLI --help imports.
from MR 9217

Add cosmos_framework.scripts.curator_to_sft_jsonl, which converts
cosmos-curator splitting-pipeline metas_jsonl output directly into the
SFT training JSONL format, applying the same hard filters sft_dataset.py
applies silently at train time (duration > 61.0s, per-window frames < 61,
optional short-edge) so dataset counts match what training consumes. Emits
a sidecar <output>.summary.json with per-reason drop counts and rewrites
vision_path relative to the JSONL so datasets stay portable across mounts.

Document the path as a new "Create Dataset from a Cosmos-Curator output
directory" section in docs/dataset_jsonl.md.

Ported from imaginaire4 MR 9217: cosmos3.scripts -> cosmos_framework.scripts,
OSS SPDX header, and stale sft_dataset.py line refs corrected to 548-550.

Verified: 24/24 tests pass, ruff check/format clean, CLI --help imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lfengad changed the title from "Add curator-to-SFT JSONL converter (ported from i4 MR 9217)" to "Add curator-to-SFT JSONL converter" Jun 26, 2026
@lfengad lfengad closed this Jun 26, 2026
@lfengad lfengad reopened this Jun 27, 2026
@lfengad lfengad merged commit f095eb8 into main Jun 29, 2026
8 checks passed
@lfengad lfengad deleted the add-curator-to-sft-jsonl branch June 29, 2026 08:32
lfengad added a commit that referenced this pull request Jun 29, 2026
…ge (#68)

## Summary

Ports the control-input CFG feature (from i4 commit `f11349b`) into the
transfer inference path, reconciling with logic already synced into this
repo, and adds CD smoke-test coverage for transfer inference.

- **`omni_mot_model.py`** already carries the
`velocity_postprocess_builder` hook — no model change needed.
- **`transfer.py`**: add `_build_no_control_inference_state` and
`build_control_cfg_postprocess`, wired through
`generate_samples_from_batch` via `velocity_postprocess_builder`.
Previously `transfer.py` passed
`control_guidance`/`control_guidance_interval` directly, where they were
silently dropped by `**kwargs` (control-CFG was a no-op).
- **`args.py`**: add `emphasize_control_in_prompt`
(`TransferDataArgs`/`Overrides` + `_TRANSFER_SAMPLE_DEFAULTS`) to match
the ported prompt-emphasis logic.
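
A toy sketch of the two behaviours described in the list above, for context only. It does not reproduce the repo's `transfer.py` or `omni_mot_model.py`: all function names below are hypothetical, velocities are plain lists of floats, and the blend is the standard classifier-free-guidance extrapolation, assumed here rather than confirmed from the source.

```python
def sample_swallowing_kwargs(x, **kwargs):
    """Toy sampler: unrecognized keyword arguments are accepted and ignored.

    This mirrors how a guidance argument can become a silent no-op.
    """
    return [2.0 * v for v in x]


def make_control_cfg_postprocess(predict_without_control, control_guidance):
    """Build a hook that blends control and no-control velocities.

    Assumed formula: v = v_without + w * (v_with - v_without).
    """
    def postprocess(x, v_with_control):
        v_without = predict_without_control(x)
        return [
            u + control_guidance * (c - u)
            for c, u in zip(v_with_control, v_without)
        ]
    return postprocess


def sample_with_hook(x, predict, velocity_postprocess=None):
    """Toy sampler that applies an explicit postprocess hook if one is given."""
    v = predict(x)
    if velocity_postprocess is not None:
        v = velocity_postprocess(x, v)
    return v
```

With `control_guidance=1.0` the blend returns the control-conditioned velocity unchanged, so only values above 1 push the sample further toward the control signal.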

## Test coverage

Extends `tests/nano_inference_smoke_test.py` (the
`generator-inference-smoke` CD job) to also run a `video2video` edge
transfer with `control_guidance=1.5` in the same Nano inference call:

- Spec is built inline (`_TRANSFER_SPEC`, written to a temp file — not
committed under `inputs/`), pulling the control video from the public
`NVIDIA/cosmos` GitHub raw URL (same file the cookbook `edge.json`
uses), downscaled for a fast smoke run (480p / 10 steps / single
29-frame chunk).
- Validates transfer-specific attributes (edge `control_path`,
`control_guidance > 1`, `guidance > 1`) and a non-degenerate output clip
via a new `_assert_video_has_content` helper (frame count + pixel
variation).
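
As an illustration of the "frame count + pixel variation" check mentioned above, a minimal stdlib-only sketch follows. It is not the `_assert_video_has_content` helper from the test file: the function name, the flat-list frame representation, and the `min_pixel_std` threshold are assumptions.

```python
from statistics import pstdev


def assert_video_has_content(frames, expected_frames, min_pixel_std=1.0):
    """Check that a decoded clip has the right length and is not flat.

    frames: list of frames, each a flat sequence of 0-255 pixel values.
    min_pixel_std is a hypothetical threshold, not taken from the repo.
    """
    assert len(frames) == expected_frames, (
        f"expected {expected_frames} frames, got {len(frames)}"
    )
    pixels = [p for frame in frames for p in frame]
    std = pstdev(pixels)
    # A constant (all-black or all-gray) clip has zero standard deviation.
    assert std >= min_pixel_std, f"pixel std {std:.2f} below {min_pixel_std}"
    return std
```

A global standard deviation is a cheap guard against blank or constant output; it does not say anything about visual quality.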

## Verification

Verified end-to-end on a GB200 node with this repo's `.venv`:
- README Nano edge transfer (`control_guidance=1.5`) → valid 121-frame
720p video; `emphasize_control_in_prompt` and the control-CFG path both
run.
- Inline smoke spec → `status: success`, `control_guidance=1.5`,
`guidance=3.0`, non-degenerate output (`frames=29`, pixel std ≈ 68).

> Note: the branch name predates this work; after rebase it carries only
the transfer commit (the curator-to-SFT converter already merged as
#62).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 participants