34 commits
905bf99
First draft of indexed lhotse datasets integration + checkpointable d…
pzelasko May 1, 2026
48818f5
Support new Lhotse's indexed iterators across NeMo
pzelasko May 5, 2026
8a482e4
refactor read_batch
pzelasko May 5, 2026
086f0e3
refactor/cleanup
pzelasko May 5, 2026
69ad897
Merge branch 'main' into stateful-restorable-lhotse-dataloader
pzelasko May 6, 2026
ad6861a
Documentation update to reflect indexed/checkpointable things + gener…
pzelasko May 6, 2026
ffc53c7
Add supported for external index directory
pzelasko May 7, 2026
d7b6855
total token/examples logging; force individual GET instead of GetBatc…
pzelasko May 12, 2026
d057dee
Agentic skill for performing migration to the new dataloader
pzelasko May 12, 2026
31de727
iterable+indexed glue; ais_force_individual rename; skill doc updates
pzelasko May 12, 2026
fc9f366
Fix DP×worker dedup in indexed adapters via PartitionedIndexedIterato…
pzelasko May 13, 2026
ca40489
Proper lhotse shar indexing support and refactoring
pzelasko May 13, 2026
89c7b51
Add dataloader validator under scripts/dataloading
pzelasko May 13, 2026
3e8dab8
dataloder checkpoints save/load correct per-rank information; statele…
pzelasko May 28, 2026
563a81e
Fix dataloader_iter resumability on preemption
pzelasko Jun 3, 2026
685995b
Merge remote-tracking branch 'origin/main' into stateful-restorable-l…
pzelasko Jun 5, 2026
993c246
Support text-only data loading
pzelasko Jun 9, 2026
0f4a125
Fix ShareGPT multimodal resumable dataloading
pzelasko Jun 11, 2026
151d216
Merge remote-tracking branch 'origin/main' into stateful-restorable-l…
pzelasko Jun 12, 2026
2ba49f9
Fix CodeQL review feedback
pzelasko Jun 12, 2026
64fae40
Fix common test failures
pzelasko Jun 12, 2026
9f47392
Fix CI checks
pzelasko Jun 12, 2026
c61b126
Fix remaining callback lint
pzelasko Jun 12, 2026
cb3d81d
Address CodeQL compatibility comments
pzelasko Jun 12, 2026
ae0a131
Document Lhotse compatibility shims
pzelasko Jun 12, 2026
19a4945
Update the resumable dataloader migration skill description
pzelasko Jun 15, 2026
75170c6
Script for analysis of resumable checkpoint dataset tree progress and…
pzelasko Jun 15, 2026
eef6656
Add ability to gracefully skip over malformed JSON lines
pzelasko Jun 22, 2026
e393ed6
Merge remote-tracking branch 'origin/main' into stateful-restorable-l…
pzelasko Jun 22, 2026
0e35bab
Fix PR CI lint checks
pzelasko Jun 22, 2026
a6278c9
Pin Lhotse to 2.0.0a2
pzelasko Jun 22, 2026
3c2ad5d
Fix speechlm2 dataset docs list formatting
pzelasko Jun 22, 2026
26901b5
Fix dataloader iterator resume after skipped validation
pzelasko Jun 23, 2026
a4638fa
Fix ASR Lhotse AIS batch loader tests
pzelasko Jun 23, 2026
166 changes: 166 additions & 0 deletions .claude/skills/migrate-to-resumable-dataloader/SKILL.md
@@ -0,0 +1,166 @@
---
name: migrate-to-resumable-dataloader
description: This skill should be used when the user asks to "migrate to the resumable dataloader", "switch to indexed Lhotse", "adopt the indexed + resumable pipeline", "make my training resumable", "set up StatefulDataLoader for NeMo/Lhotse", "use AIStore GetBatch", or "convert this YAML to the resumable path". Walks a NeMo training YAML and optional launcher, data blend, and runtime context through the indexed + resumable Lhotse migration; lints interacting fields; auto-patches safe YAML changes; emits a migration report, pre-flight checklist, and index-build command. Static analysis only; never launches training.
argument-hint: '<config.yaml> [launcher.py] [blend.yaml] [runtime-notes]'
---

# Migrate a NeMo training YAML to indexed + resumable Lhotse

Use this skill to port a NeMo training config from streaming/replay-style Lhotse
loading to indexed access plus `torchdata.StatefulDataLoader` checkpoint/restore.
The migration is fragile because YAML flags, launcher seed policy, index paths,
storage backend, and resume topology all interact.

## Core concepts

- Indexed sources need `.idx` sidecars for random access into JSONL, tar, and
supported Shar-style data. Build these once per blend/source set.
- `use_stateful_dataloader: true` lets Lightning checkpoint the dataloader
iterator state, but only if seeds, worker counts, and distributed topology are
stable across chunks.
- Training configs must use `force_map_dataset: false` so indexed sources
partition across data-parallel ranks and workers without map-style sampler
overhead. Treat `force_map_dataset: true` for training as not launch-ready
unless the user explicitly approves a temporary exception; every source in the
training iteration graph must be indexed and partition-compatible before
launch.
- Remote audio on AIStore/S3 generally needs `USE_AIS_GET_BATCH=true` so audio is
fetched lazily, per minibatch, instead of constructing eager tar readers for
every shard.

## Inputs

| input | required | source | purpose |
|---|---|---|---|
| Training YAML | yes | argument or `--config=` | Inspect `data.train_ds`, `data.validation_ds`, `trainer`, `exp_manager`, and any model fields that affect resume. |
| Launcher script | no | argument or auto-detect from project conventions | Check per-chunk seed policy, resume topology invariance, Python path setup, AIStore env vars, and optional index staging. |
| Data-blend YAML | no | resolved from `data.train_ds.input_cfg` when possible | Check indexability: compressed paths, non-seekable paths, unsupported `extra_fields`, `slice_length`, and mixed indexed/non-indexed chains. |
| Runtime context | no | argument, config file, or user-provided notes | Detect storage backend, AIStore endpoint availability, container constraints, and index mirror destination. |

## Outputs

Every output lands in `migrate-resumable/<config-stem>/` in the current repo:

| output | purpose |
|---|---|
| `migration-report.md` | Findings, rationale, patched fields, and unresolved blockers. |
| `<config-stem>-resumable.yaml` | Patched training config when safe automatic edits are possible. |
| `<blend-stem>-resumable.yaml` | Patched blend, only when a blend was inspected and safe changes are possible. |
| `pre-flight-checklist.md` | User-run steps before submitting training. |
| `build-indexes-cmd.sh` | One-shot index-build command using the project wrapper when available, otherwise the generic NeMo/Lhotse index builder. |
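
As a rough illustration of the output layout above, the paths can be derived from the config stem. This is a minimal sketch: `output_paths` and its dictionary keys are hypothetical names chosen for the example, and the optional blend output is omitted.

```python
from pathlib import Path


def output_paths(config_path: str, repo_root: str = ".") -> dict[str, Path]:
    """Map a training YAML path to the output files listed in the table above."""
    # Hypothetical helper: "conf/asr_train.yaml" has the stem "asr_train".
    stem = Path(config_path).stem
    out_dir = Path(repo_root) / "migrate-resumable" / stem
    return {
        "report": out_dir / "migration-report.md",
        "patched_config": out_dir / f"{stem}-resumable.yaml",
        "checklist": out_dir / "pre-flight-checklist.md",
        "index_cmd": out_dir / "build-indexes-cmd.sh",
    }


paths = output_paths("conf/asr_train.yaml")
```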

## Workflow

### 1. Discover and parse inputs

1. Resolve the training YAML path and read it with OmegaConf or a
comment-preserving YAML parser.
2. Resolve any referenced blend YAMLs from `data.*.input_cfg`. Prefer project
conventions when obvious, but fall back to paths relative to the config.
3. If a launcher path is supplied, read it. Otherwise inspect likely project
launchers (`train.py`, `pretrain.py`, shell wrappers, or raw `torchrun` /
`python` commands) and pick the closest match.
4. If runtime context is supplied, read it for container image, environment
variables, filesystem mounts, worker counts, and AIStore endpoint settings.
5. Detect remote storage from source paths (`s3://`, `ais://`, `http(s)://`) and
local filesystem storage from ordinary absolute or relative paths.

### 2. Run lint pipeline

Run every relevant check in:

- `references/option-reference.md`
- `references/conflict-matrix.md`
- `references/failure-modes.md`
- `references/aistore-vs-non-aistore.md` when remote storage is present

Each finding should include severity, field/path, current value, recommended
value, and a short rationale.

Severities:

- **fatal**: automatic patching is not possible; user must preprocess data or
change the source layout.
- **error**: automatic patching is safe and should be applied.
- **warning**: context-dependent; emit a report item and optional YAML comment.
- **note**: informational; no patch.
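
One possible in-memory shape for a finding, shown only as a sketch. The `Finding` class and `summarize` helper are illustrative names, not part of NeMo or Lhotse, and treating any `fatal` finding as a launch blocker is an assumption about how readiness is reported.

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("fatal", "error", "warning", "note")


@dataclass(frozen=True)
class Finding:
    severity: str
    field: str
    current: object
    recommended: object
    rationale: str

    def __post_init__(self):
        # Reject severities outside the four levels defined above.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def summarize(findings):
    """Return per-severity counts, the auto-patchable findings, and a blocked flag."""
    counts = Counter(f.severity for f in findings)
    auto_patch = [f for f in findings if f.severity == "error"]
    blocked = counts["fatal"] > 0  # assumption: any fatal finding blocks launch
    return {s: counts[s] for s in SEVERITIES}, auto_patch, blocked


findings = [
    Finding("error", "data.train_ds.force_map_dataset", True, False,
            "Training should use iterable, partitioned indexed sources."),
    Finding("fatal", "data.train_ds.input_cfg[0].slice_length", 100, None,
            "Slicing has no stable sidecar; preprocess offline."),
]
counts, auto_patch, blocked = summarize(findings)
```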

### 3. Emit patched YAML and blend

Apply safe `error`-severity patches. Preserve comments when possible with
`ruamel.yaml`; otherwise serialize with OmegaConf/YAML and rely on the report for
rationale. For blend edits, never silently drop data: leave an explicit report
entry and comment for every excluded or rewritten source.

### 4. Generate `migration-report.md`

Use `templates/migration-report.md`. Include:

1. Summary of storage workflow, counts by severity, and readiness.
2. Inputs inspected.
3. Findings table.
4. Walkthrough for train data, validation data, trainer/exp manager, launcher,
and storage backend.
5. Data-blend audit.
6. Verification and pre-flight steps.

### 5. Generate `pre-flight-checklist.md`

Use `templates/pre-flight-checklist.md` when present. Required steps:

- Build `.idx` sidecars for every training/validation/test blend involved.
- Verify `indexes_root` points at the same stable mirror used by the runtime, or
that explicit node-local index staging populates it before training starts.
- If AIStore is in play: verify `aistore` SDK availability, `AIS_ENDPOINT`, and
whether `USE_AIS_GET_BATCH` or `USE_AIS_INDIVIDUAL_GETS` is required.
- Verify one invariant seed across resumable chunks.
- Verify `num_workers`, `world_size`, and relevant distributed topology do not
change across resume boundaries.
- Recommend a small smoke ladder: single-node single chunk, single-node resume,
then full topology.

### 6. Generate `build-indexes-cmd.sh`

Prefer a project-provided wrapper when one is clearly present. Otherwise emit a
generic command using:

```bash
python <NeMo>/scripts/dataloading/build_indexes.py \
  --indexes-root <shared-index-mirror> \
  --workers <N> \
  <blend>.yaml [<validation-blend>.yaml ...]
```

If running through a managed runtime or container wrapper, include comments for required
container image, mounts, environment variables, worker count, and any CPU/GPU
container-hook workaround the project requires.

### 7. Print final summary to chat

Keep the final chat response under 10 lines: output directory, finding counts,
report path, and the next command the user should run.

## Knowledge base

- `references/option-reference.md`: field-by-field reference for YAML and
launcher settings.
- `references/failure-modes.md`: known failure signatures, triggers, and fixes.
- `references/conflict-matrix.md`: incompatible option pairs.
- `references/best-practices.md`: priority-ordered checklist.
- `references/aistore-vs-non-aistore.md`: storage workflow selection.
- `templates/migration-report.md`: report template.
- `templates/pre-flight-checklist.md`: checklist template, when present.
- `scripts/analyze.py`: optional static-analysis helper, when present.

## Constraints

- Prefer static analysis. Do not launch training, build indexes, prefetch data, or
modify external runtime state unless the user explicitly asks.
- Cross-check recommendations against the actual NeMo/Lhotse code in the user's
checkout when paths are available. Relevant areas are common Lhotse dataloader
config, indexed adapters, `lhotse.indexing`, AIStore batch loading, and NeMo
dataloader construction.
- Treat project wrappers as optional conveniences, not as part of the generic
migration contract.
- When evidence is missing, say so. Do not encode project-specific run history
or local experiment names as general guidance.
@@ -0,0 +1,79 @@
# AIStore vs filesystem workflows

Indexed + resumable Lhotse can read audio/tar sources from a local filesystem or
from AIStore-compatible URLs. Manifests/cuts may be on disk in either workflow.
Choose the workflow from source path schemes, not from where the process runs.

## Detection

| signal | workflow |
|---|---|
| `tarred_audio_filepaths: s3://...`, `ais://...`, or `http(s)://...` | AIStore/remote workflow |
| `tarred_audio_filepaths: /path/...` or relative filesystem path | filesystem workflow |
| mixed local and remote paths | remote workflow, because it has the stricter requirements |

`AIS_ENDPOINT` in the environment is necessary for AIStore access, but it is not
sufficient evidence that the blend uses AIStore.
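
The detection table can be sketched as a scheme check over the source paths. The names `REMOTE_SCHEMES`, `is_remote`, and `detect_workflow` are illustrative and not part of NeMo or Lhotse. Note that the sketch never consults `AIS_ENDPOINT`.

```python
from urllib.parse import urlparse

# URL schemes listed in the detection table above.
REMOTE_SCHEMES = {"s3", "ais", "http", "https"}


def is_remote(path: str) -> bool:
    """True when the path carries one of the remote URL schemes."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def detect_workflow(paths) -> str:
    """Mixed local and remote paths count as remote (the stricter workflow)."""
    return "remote" if any(is_remote(p) for p in paths) else "filesystem"


workflow = detect_workflow(["/data/shard_0.tar", "ais://bucket/shard_1.tar"])
```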

## Remote AIStore workflow

Required setup:

- `aistore` SDK installed in the build/training container.
- `AIS_ENDPOINT` exported into the process that reads remote sources.
- `USE_AIS_GET_BATCH=true` when remote tar/audio should be fetched lazily by
minibatch instead of opening every shard eagerly.

Optional setup:

- `USE_AIS_INDIVIDUAL_GETS=true` to bypass the batch endpoint and fetch each
object individually. This is slower but useful when the batch endpoint is
unavailable or returns empty content for some objects.

Index building:

- The index builder reads remote tar files via AIStore paths that support
byte-range requests and writes `.idx` sidecars to the configured index mirror.
- A successful index build proves byte-range access worked for the indexed
source paths. It does not prove the batch endpoint will later serve every
object successfully.

Runtime data access:

1. Keep manifests/cuts on a local/shared filesystem when random access would be
inefficient from remote storage.
2. Point `data.*.indexes_root` at a persistent index mirror by default.
3. Use node-local index staging only when direct mirror reads are too slow or
metadata-heavy; make the YAML path match the staged destination.
4. Use manifest prefetch only as a fallback for remote manifest paths that
cannot be cached persistently.

## Filesystem-only workflow

Required setup:

- All audio/tar paths resolve through the local filesystem visible in the
container/process.
- AIStore env vars are unset or ignored when no remote paths are present.
- `USE_AIS_GET_BATCH=false` unless a mixed remote source requires it.

Index building:

- The index builder reads local files directly.
- Filesystem throughput and metadata behavior determine the best worker count.

Runtime data access:

1. Keep manifests/cuts on a local/shared filesystem.
2. Point `data.*.indexes_root` at a persistent index mirror.
3. Stage indexes to node-local SSD only when needed and only with matching YAML
paths.

## Common gotchas

- Do not infer workflow from runtime labels alone; inspect the source paths.
- Verify filesystem mounts inside the runtime/container, not only in the host shell.
- Reusing an index mirror requires identical source path strings and unchanged
source contents.
- AIStore individual GETs and batch GETs can exercise different backend paths;
test the exact access mode used by training.
@@ -0,0 +1,79 @@
# Best practices - indexed + resumable Lhotse migration

Prioritized checklist for migrating a NeMo config to indexed access and
checkpointable dataloading.

## Tier 1 - non-negotiable

1. **Pin `seed` and `shard_seed` to fixed integers.** The sampler and model RNG
must resume from a stable state. Avoid `"randomized"` for resumable chains.

2. **Use one seed across every chunk of a resumable chain.** Lightning reseeds
global RNGs at chunk startup. Rotating the seed breaks bit-exact resume even
when dataloader state restores correctly.

3. **Keep `num_workers` and distributed topology invariant.** Changing worker
count, world size, or rank/worker assignment invalidates stateful dataloader
snapshots and iterable partition state.

4. **Build `.idx` sidecars once per stable source path set.** Reuse a persistent
index mirror across experiments. Rebuild only when source contents or path
strings change.

5. **Disable concurrent bucketing for resumable training.** Background producer
threads can advance iterators outside the checkpointed main-thread state.
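
Items 1 to 3 can be checked statically before launch. The sketch below assumes each chunk's effective settings have already been flattened into a plain dictionary; the key names in `INVARIANT_KEYS` and both helper functions are illustrative, and the real values live in the YAML and launcher.

```python
# Settings that must not change across chunks of a resumable chain (items 1-3).
INVARIANT_KEYS = ("seed", "shard_seed", "num_workers", "world_size")


def resume_invariant_violations(chunks):
    """Return (key, chunk_index, first_value, value) for every change vs. chunk 0."""
    first = chunks[0]
    violations = []
    for i, chunk in enumerate(chunks[1:], start=1):
        for key in INVARIANT_KEYS:
            if chunk.get(key) != first.get(key):
                violations.append((key, i, first.get(key), chunk.get(key)))
    return violations


def unpinned_seeds(chunk):
    """Seed fields that are not fixed integers, e.g. "randomized" or missing."""
    return [k for k in ("seed", "shard_seed") if not isinstance(chunk.get(k), int)]


chunks = [
    {"seed": 42, "shard_seed": 42, "num_workers": 4, "world_size": 8},
    {"seed": 43, "shard_seed": 42, "num_workers": 4, "world_size": 8},
]
violations = resume_invariant_violations(chunks)
```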

## Tier 2 - strongly recommended

6. **Run a bit-exact dataloader resume check before sweeping.** Take a few
batches, save dataloader state, take a few more as ground truth, restore in a
fresh process, and compare the restored batches.

7. **Enforce `force_map_dataset: false` for training.** Map-style training keeps
too much sampler/manifest work on the main process. Before launch, confirm every training
source is indexed, multiplexer seeds are fixed, and topology is stable; if a
source cannot be indexed, report it as a migration blocker instead of
silently keeping map-style training.

8. **Use frequent checkpoint triggers.** External termination may not execute a
graceful preemption callback. Step- or time-based saves reduce lost progress.

9. **Smoke test in stages.** Run single-node single-chunk, then single-node
multi-chunk resume, then the intended full topology.

10. **Keep `.idx` files on a persistent filesystem by default.** Stage to
node-local SSD only when direct filesystem reads are proven problematic, and
ensure the YAML `indexes_root` matches the staged destination.

11. **Use AIStore batch fetching deliberately.** For remote tar/audio sources,
`USE_AIS_GET_BATCH=true` avoids eager remote tar-reader construction. If the
batch endpoint fails for a dataset, use `USE_AIS_INDIVIDUAL_GETS=true` as a
slower fallback while investigating storage availability.
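
The resume check from item 6 can be sketched against any loader that exposes `state_dict()` and `load_state_dict()`. `ToyStatefulLoader` below is a stand-in invented for this example, not the real dataloader, and the restore happens in a fresh object rather than a fresh process. Real batches hold tensors, so the final comparison would need a tensor-aware equality check instead of `==`.

```python
import copy


class ToyStatefulLoader:
    """Stand-in for a checkpointable dataloader; yields fixed-size batches of ints."""

    def __init__(self, n_items=20, batch_size=2):
        self.n_items = n_items
        self.batch_size = batch_size
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= self.n_items:
            raise StopIteration
        end = min(self.pos + self.batch_size, self.n_items)
        batch = list(range(self.pos, end))
        self.pos = end
        return batch

    def state_dict(self):
        return {"pos": self.pos}

    def load_state_dict(self, state):
        self.pos = state["pos"]


def resume_matches(make_loader, warmup=3, compare=4):
    """Take batches, snapshot state, take ground truth, restore, and compare."""
    loader = make_loader()
    it = iter(loader)
    for _ in range(warmup):
        next(it)
    state = copy.deepcopy(loader.state_dict())
    expected = [next(it) for _ in range(compare)]

    restored = make_loader()           # fresh object standing in for a fresh process
    restored.load_state_dict(state)    # restore before creating the iterator
    restored_it = iter(restored)
    actual = [next(restored_it) for _ in range(compare)]
    return expected == actual


ok = resume_matches(ToyStatefulLoader)
```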

## Tier 3 - operational hygiene

12. **Tune index-build workers to memory and storage backend.** Many workers can
OOM on large manifests or remote tar headers. Reduce workers or split the
blend when needed.

13. **Keep optional prefetch steps explicit.** Manifest prefetch, index staging,
and model-cache preambles should be visible in the launcher and documented in
the report.

14. **Use CPU-safe container settings for CPU-only index builds.** Some container
runtimes expect GPU hooks by default; bypass or disable them when the index
build runs without GPU access.

## What not to do

- Do not trust `meta.pt` key presence alone as proof of bit-exact resume.
- Do not combine incompatible Lightning checkpoint triggers.
- Do not point `indexes_root` at a node-local path unless the launcher populates
it before every chunk.
- Do not launch iterable training until every source in the chain has been
audited and made partition-compatible.
- Do not use map-style training to bypass indexing blockers; mark the migration
not launch-ready unless the user explicitly approves a temporary exception
with the blocker and expected overhead.
- Do not set `LHOTSE_USE_WORKER_PARTITION` manually; it is an internal signal set
by the dataloader worker initialization path.
@@ -0,0 +1,31 @@
# Conflict matrix - indexed + resumable Lhotse

Table format: `A | B | conflict | severity | resolution`.

Severities:

- **fatal**: automatic patching is impossible; data must be preprocessed or the
launcher/storage setup must change.
- **error**: automatic patching is usually safe.
- **warning**: context-dependent; report clearly.
- **note**: informational.

| A | B | conflict | severity | resolution |
|---|---|---|---|---|
| `data.train_ds.indexed: true` | `extra_fields:` on indexed NeMo entries | Indexed adapters cannot preserve arbitrary runtime field rewrites. | fatal | Preprocess the manifest to materialize fields, then drop `extra_fields`. |
| `data.train_ds.indexed: true` | `slice_length:` on indexed entries | Slicing changes cut/audio access and has no stable sidecar unless preprocessed. | fatal | Re-shard or preprocess offline, then drop `slice_length`. |
| `data.train_ds.indexed: true` | compressed JSONL/Shar cuts or compressed tar paths | Compressed streams do not provide stable seekable offsets for sidecars. | fatal | Re-export uncompressed or materialize seekable sources. |
| `data.train_ds.indexed: true` | `pipe:` paths | Pipes are not seekable. | fatal | Materialize upstream data to files or a seekable backend. |
| `data.train_ds.force_map_dataset: true` | resumable training launch | Map-style training keeps too much sampler/manifest work on the main process. | error | Set `data.train_ds.force_map_dataset: false` after making every training source indexed and partition-compatible. |
| `force_map_dataset: true` | `force_iterable_dataset: true` | Dataset mode selection is contradictory. | error | Keep one mode. For training, use `force_map_dataset: false`; for validation/test, keep map-style unless intentionally testing iterable behavior. |
| `use_stateful_dataloader: true` | per-chunk seed rotation | Model-level RNG diverges across resumed chunks. | error | Pin one seed for the whole chain in YAML and launcher. |
| `use_stateful_dataloader: true` | `num_workers` changes between chunks | Saved dataloader state is incompatible. | error | Keep worker count invariant or restart without dataloader state. |
| `use_stateful_dataloader: true` | `world_size` / rank topology changes | Saved iterator and sampler state are topology-sensitive. | error | Keep topology invariant or restart without dataloader state. |
| `force_map_dataset: false` | any non-indexed source in the chain | Non-indexed sources do not partition and are duplicated across ranks/workers. | fatal | Convert all sources to indexed access or split/remove the non-indexed source. Do not switch to map-style training to bypass this unless the user explicitly approves a temporary exception. |
| `force_map_dataset: false` | multiplexer seed is `"randomized"` | Shards may choose different sources at the same step. | error | Use a fixed integer seed. |
| `force_finite: true` | training dataset | Can cap infinite training mixtures unexpectedly. | error | Use finite mode for validation/test only unless intentionally bounded. |
| Checkpoint cadence absent | external preemption / walltime kill | Chunk progress can be lost without mid-chunk saves. | warning | Add frequent step- or time-based checkpoints. |
| Node-local `indexes_root` | no prefetch/staging before startup | `.idx` files are missing at runtime. | error | Point to a persistent mirror or stage indexes before every chunk. |
| AIStore batch mode | objects unavailable through batch endpoint | Batch loader may return empty content or fail collation. | warning | Verify object availability, replicate data, or set `USE_AIS_INDIVIDUAL_GETS=true`. |
| Container lacks AIStore SDK | AIStore source paths | Remote reads may fall back to the wrong backend or fail. | error | Install a compatible `aistore` SDK in build/training containers. |
| CPU-only index build | GPU container hook requires GPU runtime | Container startup can fail before index build begins. | warning | Use CPU-safe container settings or bypass GPU hooks. |
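
A few rows of the matrix can be expressed as static checks over already-parsed config dictionaries. This is only a sketch: `lint_train_ds`, the `paths` key on each source entry, and the `COMPRESSED_SUFFIXES` list are assumptions made for the example, not the real NeMo blend schema.

```python
# Assumed list of suffixes that indicate a compressed, non-seekable stream.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def lint_train_ds(train_ds, source_entries):
    """Return (severity, field, message) tuples for a subset of the matrix rows."""
    findings = []
    if train_ds.get("force_map_dataset") and train_ds.get("force_iterable_dataset"):
        findings.append(("error", "force_map_dataset",
                         "contradicts force_iterable_dataset; keep one mode"))
    if not train_ds.get("indexed"):
        return findings
    for i, entry in enumerate(source_entries):
        for key in ("extra_fields", "slice_length"):
            if key in entry:
                findings.append(("fatal", f"input_cfg[{i}].{key}",
                                 "preprocess offline, then drop this field"))
        for path in entry.get("paths", []):  # "paths" is a hypothetical key
            if path.startswith("pipe:"):
                findings.append(("fatal", f"input_cfg[{i}].paths",
                                 "pipes are not seekable"))
            elif path.endswith(COMPRESSED_SUFFIXES):
                findings.append(("fatal", f"input_cfg[{i}].paths",
                                 "compressed stream has no stable offsets"))
    return findings


train_ds = {"indexed": True, "force_map_dataset": True, "force_iterable_dataset": True}
entries = [
    {"paths": ["/data/cuts.jsonl.gz"], "slice_length": 100},
    {"paths": ["pipe:cat /data/cuts.jsonl"]},
    {"paths": ["/data/ok.jsonl"]},
]
findings = lint_train_ds(train_ds, entries)
```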