NVIDIA-NeMo · pzelasko · May 1, 2026 · May 5, 2026
diff --git a/.claude/skills/migrate-to-resumable-dataloader/SKILL.md b/.claude/skills/migrate-to-resumable-dataloader/SKILL.md
@@ -0,0 +1,166 @@
+---
+name: migrate-to-resumable-dataloader
+description: This skill should be used when the user asks to "migrate to the resumable dataloader", "switch to indexed Lhotse", "adopt the indexed + resumable pipeline", "make my training resumable", "set up StatefulDataLoader for NeMo/Lhotse", "use AIStore GetBatch", or "convert this YAML to the resumable path". Walks a NeMo training YAML and optional launcher, data blend, and runtime context through the indexed + resumable Lhotse migration; lints interacting fields; auto-patches safe YAML changes; emits a migration report, pre-flight checklist, and index-build command. Static analysis only; never launches training.
+argument-hint: '<config.yaml> [launcher.py] [blend.yaml] [runtime-notes]'
+---
+
+# Migrate a NeMo training YAML to indexed + resumable Lhotse
+
+Use this skill to port a NeMo training config from streaming/replay-style Lhotse
+loading to indexed access plus `torchdata.StatefulDataLoader` checkpoint/restore.
+The migration is fragile because YAML flags, launcher seed policy, index paths,
+storage backend, and resume topology all interact.
+
+## Core concepts
+
+- Indexed sources need `.idx` sidecars for random access into JSONL, tar, and
+  supported Shar-style data. Build these once per blend/source set.
+- `use_stateful_dataloader: true` lets Lightning checkpoint the dataloader
+  iterator state, but only if seeds, worker counts, and distributed topology are
+  stable across chunks.
+- Training configs must use `force_map_dataset: false` so indexed sources
+  partition across data-parallel ranks and workers without map-style sampler
+  overhead. Treat `force_map_dataset: true` for training as not launch-ready
+  unless the user explicitly approves a temporary exception; every source in the
+  training iteration graph must be indexed and partition-compatible before
+  launch.
+- Remote audio on AIStore/S3 generally needs `USE_AIS_GET_BATCH=true` so audio
+  fetches are deferred to sample time instead of constructing eager tar readers
+  for every shard.
+
+## Inputs
+
+| input | required | source | purpose |
+|---|---|---|---|
+| Training YAML | yes | argument or `--config=` | Inspect `data.train_ds`, `data.validation_ds`, `trainer`, `exp_manager`, and any model fields that affect resume. |
+| Launcher script | no | argument or auto-detect from project conventions | Check per-chunk seed policy, resume topology invariance, Python path setup, AIStore env vars, and optional index staging. |
+| Data-blend YAML | no | resolved from `data.train_ds.input_cfg` when possible | Check indexability: compressed paths, non-seekable paths, unsupported `extra_fields`, `slice_length`, and mixed indexed/non-indexed chains. |
+| Runtime context | no | argument, config file, or user-provided notes | Detect storage backend, AIStore endpoint availability, container constraints, and index mirror destination. |
+
+## Outputs
+
+Every output lands in `migrate-resumable/<config-stem>/` in the current repo:
+
+| output | purpose |
+|---|---|
+| `migration-report.md` | Findings, rationale, patched fields, and unresolved blockers. |
+| `<config-stem>-resumable.yaml` | Patched training config when safe automatic edits are possible. |
+| `<blend-stem>-resumable.yaml` | Patched blend, only when a blend was inspected and safe changes are possible. |
+| `pre-flight-checklist.md` | User-run steps before submitting training. |
+| `build-indexes-cmd.sh` | One-shot index-build command using the project wrapper when available, otherwise the generic NeMo/Lhotse index builder. |
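+
+The output layout above can be sketched as a small path helper. This is an
+illustrative sketch only: `output_paths` and its `repo_root` parameter are
+hypothetical names, not part of the skill, and the helper only computes paths
+without writing anything.
+
+```python
+from pathlib import Path
+
+
+def output_paths(config_path, blend_path=None, repo_root="."):
+    """Compute where each migration artifact would land (hypothetical helper)."""
+    config_stem = Path(config_path).stem
+    out_dir = Path(repo_root) / "migrate-resumable" / config_stem
+    paths = {
+        "report": out_dir / "migration-report.md",
+        "config": out_dir / f"{config_stem}-resumable.yaml",
+        "checklist": out_dir / "pre-flight-checklist.md",
+        "index_cmd": out_dir / "build-indexes-cmd.sh",
+    }
+    # The patched blend is only emitted when a blend was actually inspected.
+    if blend_path is not None:
+        blend_stem = Path(blend_path).stem
+        paths["blend"] = out_dir / f"{blend_stem}-resumable.yaml"
+    return paths
+```
+
+Keeping the layout in one function makes it harder for the report and the
+final chat summary to disagree about where files were written.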
+
+## Workflow
+
+### 1. Discover and parse inputs
+
+1. Resolve the training YAML path and read it with OmegaConf or a
+   comment-preserving YAML parser.
+2. Resolve any referenced blend YAMLs from `data.*.input_cfg`. Prefer project
+   conventions when obvious, but fall back to paths relative to the config.
+3. If a launcher path is supplied, read it. Otherwise inspect likely project
+   launchers (`train.py`, `pretrain.py`, shell wrappers, or raw `torchrun` /
+   `python` commands) and pick the closest match.
+4. If runtime context is supplied, read it for container image, environment
+   variables, filesystem mounts, worker counts, and AIStore endpoint settings.
+5. Detect remote storage from source paths (`s3://`, `ais://`, `http(s)://`) and
+   local filesystem storage from ordinary absolute or relative paths.
+
+### 2. Run lint pipeline
+
+Run every relevant check in:
+
+- `references/option-reference.md`
+- `references/conflict-matrix.md`
+- `references/failure-modes.md`
+- `references/aistore-vs-non-aistore.md` when remote storage is present
+
+Each finding should include severity, field/path, current value, recommended
+value, and a short rationale.
+
+Severities:
+
+- **fatal**: automatic patching is not possible; user must preprocess data or
+  change the source layout.
+- **error**: automatic patching is safe and should be applied.
+- **warning**: context-dependent; emit a report item and optional YAML comment.
+- **note**: informational; no patch.
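+
+One possible shape for a finding record and the severity counts used in the
+report summary is sketched below. The `Finding` class, its attribute names, and
+both helper functions are hypothetical; the skill only prescribes which
+information each finding carries.
+
+```python
+from collections import Counter
+from dataclasses import dataclass
+
+SEVERITIES = ("fatal", "error", "warning", "note")
+
+
+@dataclass(frozen=True)
+class Finding:
+    """One lint result (hypothetical structure for illustration)."""
+    severity: str
+    field_path: str
+    current: object
+    recommended: object
+    rationale: str
+
+    def __post_init__(self):
+        if self.severity not in SEVERITIES:
+            raise ValueError(f"unknown severity: {self.severity!r}")
+
+
+def count_by_severity(findings):
+    """Return a count for every severity, including zero counts."""
+    counts = Counter(f.severity for f in findings)
+    return {severity: counts.get(severity, 0) for severity in SEVERITIES}
+
+
+def auto_patchable(findings):
+    """Only `error` findings are treated as safe to patch automatically."""
+    return [f for f in findings if f.severity == "error"]
+```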
+
+### 3. Emit patched YAML and blend
+
+Apply safe `error`-severity patches. Preserve comments when possible with
+`ruamel.yaml`; otherwise serialize with OmegaConf/YAML and rely on the report for
+rationale. For blend edits, never silently drop data: leave an explicit report
+entry and comment for every excluded or rewritten source.
+
+### 4. Generate `migration-report.md`
+
+Use `templates/migration-report.md`. Include:
+
+1. Summary of storage workflow, counts by severity, and readiness.
+2. Inputs inspected.
+3. Findings table.
+4. Walkthrough for train data, validation data, trainer/exp manager, launcher,
+   and storage backend.
+5. Data-blend audit.
+6. Verification and pre-flight steps.
+
+### 5. Generate `pre-flight-checklist.md`
+
+Use `templates/pre-flight-checklist.md` when present. Required steps:
+
+- Build `.idx` sidecars for every training/validation/test blend involved.
+- Verify `indexes_root` points at the same stable mirror used by the runtime, or
+  that explicit node-local index staging populates it before training starts.
+- If AIStore is in play: verify `aistore` SDK availability, `AIS_ENDPOINT`, and
+  whether `USE_AIS_GET_BATCH` or `USE_AIS_INDIVIDUAL_GETS` is required.
+- Verify one invariant seed across resumable chunks.
+- Verify `num_workers`, `world_size`, and relevant distributed topology do not
+  change across resume boundaries.
+- Recommend a small smoke ladder: single-node single chunk, single-node resume,
+  then full topology.
+
+### 6. Generate `build-indexes-cmd.sh`
+
+Prefer a project-provided wrapper when one is clearly present. Otherwise emit a
+generic command using:
+
+```bash
+python <NeMo>/scripts/dataloading/build_indexes.py \
+    --indexes-root <shared-index-mirror> \
+    --workers <N> \
+    <blend>.yaml [<validation-blend>.yaml ...]
+```
+
+If running through a managed runtime or container wrapper, include comments for required
+container image, mounts, environment variables, worker count, and any CPU/GPU
+container-hook workaround the project requires.
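+
+A minimal sketch of rendering that generic command as a single shell line,
+assuming the flags shown above. The function name and its parameters are
+hypothetical; the NeMo checkout path and index mirror must come from the user
+or the inspected config, never from a guess.
+
+```python
+import shlex
+
+
+def build_index_command(nemo_root, indexes_root, workers, blend_paths):
+    """Render the generic index-build command as one shell-safe line."""
+    if not blend_paths:
+        raise ValueError("at least one blend YAML is required")
+    script = f"{nemo_root.rstrip('/')}/scripts/dataloading/build_indexes.py"
+    argv = [
+        "python", script,
+        "--indexes-root", indexes_root,
+        "--workers", str(workers),
+        *blend_paths,
+    ]
+    # shlex.join quotes any argument that contains spaces or shell metacharacters.
+    return shlex.join(argv)
+```
+
+Quoting through `shlex.join` keeps paths with spaces intact when the line is
+written into `build-indexes-cmd.sh`.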
+
+### 7. Print final summary to chat
+
+Keep the final chat response under 10 lines: output directory, finding counts,
+report path, and the next command the user should run.
+
+## Knowledge base
+
+- `references/option-reference.md`: field-by-field reference for YAML and
+  launcher settings.
+- `references/failure-modes.md`: known failure signatures, triggers, and fixes.
+- `references/conflict-matrix.md`: incompatible option pairs.
+- `references/best-practices.md`: priority-ordered checklist.
+- `references/aistore-vs-non-aistore.md`: storage workflow selection.
+- `templates/migration-report.md`: report template.
+- `templates/pre-flight-checklist.md`: checklist template, when present.
+- `scripts/analyze.py`: optional static-analysis helper, when present.
+
+## Constraints
+
+- Prefer static analysis. Do not launch training, build indexes, prefetch data, or
+  modify external runtime state unless the user explicitly asks.
+- Cross-check recommendations against the actual NeMo/Lhotse code in the user's
+  checkout when paths are available. Relevant areas are common Lhotse dataloader
+  config, indexed adapters, `lhotse.indexing`, AIStore batch loading, and NeMo
+  dataloader construction.
+- Treat project wrappers as optional conveniences, not as part of the generic
+  migration contract.
+- When evidence is missing, say so. Do not encode project-specific run history
+  or local experiment names as general guidance.
diff --git a/...ude/skills/migrate-to-resumable-dataloader/references/aistore-vs-non-aistore.md b/...ude/skills/migrate-to-resumable-dataloader/references/aistore-vs-non-aistore.md
@@ -0,0 +1,79 @@
+# AIStore vs filesystem workflows
+
+Indexed + resumable Lhotse can read audio/tar sources from a local filesystem or
+from AIStore-compatible URLs. Manifests/cuts may be on disk in either workflow.
+Choose the workflow from source path schemes, not from where the process runs.
+
+## Detection
+
+| signal | workflow |
+|---|---|
+| `tarred_audio_filepaths: s3://...`, `ais://...`, or `http(s)://...` | AIStore/remote workflow |
+| `tarred_audio_filepaths: /path/...` or relative filesystem path | filesystem workflow |
+| mixed local and remote paths | remote workflow, because it has the stricter requirements |
+
+`AIS_ENDPOINT` in the environment is necessary for AIStore access, but it is not
+sufficient evidence that the blend uses AIStore.
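+
+The detection table can be sketched as a path classifier. This is a simplified
+illustration: the function names are hypothetical, and it only recognizes the
+URL schemes listed in the table, so other non-file sources (for example
+`pipe:` paths) still need their own checks.
+
+```python
+REMOTE_SCHEMES = ("s3://", "ais://", "http://", "https://")
+
+
+def is_remote(path):
+    """True when the path starts with one of the remote URL schemes."""
+    return path.strip().lower().startswith(REMOTE_SCHEMES)
+
+
+def detect_workflow(paths):
+    """Classify a blend from its source paths, not from the environment."""
+    paths = list(paths)
+    if not paths:
+        raise ValueError("no source paths to inspect")
+    # Mixed local and remote paths fall under the stricter remote workflow.
+    return "remote" if any(is_remote(p) for p in paths) else "filesystem"
+```
+
+Note that the sketch never reads `AIS_ENDPOINT`: the variable being set says
+nothing about which sources the blend actually references.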
+
+## Remote AIStore workflow
+
+Required setup:
+
+- `aistore` SDK installed in the build/training container.
+- `AIS_ENDPOINT` exported into the process that reads remote sources.
+- `USE_AIS_GET_BATCH=true` when remote tar/audio should be fetched lazily by
+  minibatch instead of opening every shard eagerly.
+
+Optional setup:
+
+- `USE_AIS_INDIVIDUAL_GETS=true` to bypass the batch endpoint and fetch each
+  object individually. This is slower but useful when the batch endpoint is
+  unavailable or returns empty content for some objects.
+
+Index building:
+
+- The index builder reads remote tar files through byte-range-capable AIStore
+  paths and writes `.idx` sidecars to the configured index mirror.
+- A successful index build proves byte-range access worked for the indexed
+  source paths. It does not prove the batch endpoint will later serve every
+  object successfully.
+
+Runtime data access:
+
+1. Keep manifests/cuts on a local/shared filesystem when random access would be
+   inefficient from remote storage.
+2. Point `data.*.indexes_root` at a persistent index mirror by default.
+3. Use node-local index staging only when direct mirror reads are too slow or
+   metadata-heavy; make the YAML path match the staged destination.
+4. Use manifest prefetch only as a fallback for remote manifest paths that
+   cannot be cached persistently.
+
+## Filesystem-only workflow
+
+Required setup:
+
+- All audio/tar paths resolve through the local filesystem visible in the
+  container/process.
+- AIStore env vars are unset or ignored when no remote paths are present.
+- `USE_AIS_GET_BATCH=false` unless a mixed remote source requires it.
+
+Index building:
+
+- The index builder reads local files directly.
+- Filesystem throughput and metadata behavior determine the best worker count.
+
+Runtime data access:
+
+1. Keep manifests/cuts on a local/shared filesystem.
+2. Point `data.*.indexes_root` at a persistent index mirror.
+3. Stage indexes to node-local SSD only when needed and only with matching YAML
+   paths.
+
+## Common gotchas
+
+- Do not infer workflow from runtime labels alone; inspect the source paths.
+- Verify filesystem mounts inside the runtime/container, not only in the host shell.
+- Reusing an index mirror requires identical source path strings and unchanged
+  source contents.
+- AIStore individual GETs and batch GETs can exercise different backend paths;
+  test the exact access mode used by training.
diff --git a/.claude/skills/migrate-to-resumable-dataloader/references/best-practices.md b/.claude/skills/migrate-to-resumable-dataloader/references/best-practices.md
@@ -0,0 +1,79 @@
+# Best practices - indexed + resumable Lhotse migration
+
+Prioritized checklist for migrating a NeMo config to indexed access and
+checkpointable dataloading.
+
+## Tier 1 - non-negotiable
+
+1. **Pin `seed` and `shard_seed` to fixed integers.** The sampler and model RNG
+   must resume from a stable state. Avoid `"randomized"` for resumable chains.
+
+2. **Use one seed across every chunk of a resumable chain.** Lightning reseeds
+   global RNGs at chunk startup. Rotating the seed breaks bit-exact resume even
+   when dataloader state restores correctly.
+
+3. **Keep `num_workers` and distributed topology invariant.** Changing worker
+   count, world size, or rank/worker assignment invalidates stateful dataloader
+   snapshots and iterable partition state.
+
+4. **Build `.idx` sidecars once per stable source path set.** Reuse a persistent
+   index mirror across experiments. Rebuild only when source contents or path
+   strings change.
+
+5. **Disable concurrent bucketing for resumable training.** Background producer
+   threads can advance iterators outside the checkpointed main-thread state.
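+
+Items 1 to 3 can be checked statically once the per-chunk settings are
+collected. The sketch below assumes each chunk is summarized as a plain dict
+with `seed`, `shard_seed`, `num_workers`, and `world_size` keys; that summary
+format and the function name are hypothetical.
+
+```python
+INVARIANT_KEYS = ("seed", "shard_seed", "num_workers", "world_size")
+
+
+def check_chunk_invariants(chunks):
+    """Return human-readable problems; an empty list means no issue found."""
+    problems = []
+    if not chunks:
+        return problems
+    # Seeds must be fixed integers (bool is rejected even though it is an int).
+    for i, chunk in enumerate(chunks):
+        for key in ("seed", "shard_seed"):
+            value = chunk.get(key)
+            if isinstance(value, bool) or not isinstance(value, int):
+                problems.append(f"chunk {i}: {key}={value!r} is not a fixed integer")
+    # Every later chunk must match the first chunk on all invariant keys.
+    first = chunks[0]
+    for i, chunk in enumerate(chunks[1:], start=1):
+        for key in INVARIANT_KEYS:
+            if chunk.get(key) != first.get(key):
+                problems.append(
+                    f"chunk {i}: {key} changed from "
+                    f"{first.get(key)!r} to {chunk.get(key)!r}"
+                )
+    return problems
+```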
+
+## Tier 2 - strongly recommended
+
+6. **Run a bit-exact dataloader resume check before sweeping.** Take a few
+   batches, save dataloader state, take a few more as ground truth, restore in a
+   fresh process, and compare the restored batches.
+
+7. **Enforce `force_map_dataset: false` for training.** Map-style training
+   keeps too much sampler/manifest work on the main process. Before launch,
+   confirm every training
+   source is indexed, multiplexer seeds are fixed, and topology is stable; if a
+   source cannot be indexed, report it as a migration blocker instead of
+   silently keeping map-style training.
+
+8. **Use frequent checkpoint triggers.** External termination may not execute a
+   graceful preemption callback. Step- or time-based saves reduce lost progress.
+
+9. **Smoke test in stages.** Run single-node single-chunk, then single-node
+   multi-chunk resume, then the intended full topology.
+
+10. **Keep `.idx` files on a persistent filesystem by default.** Stage to
+    node-local SSD only when direct filesystem reads are proven problematic, and
+    ensure the YAML `indexes_root` matches the staged destination.
+
+11. **Use AIStore batch fetching deliberately.** For remote tar/audio sources,
+    `USE_AIS_GET_BATCH=true` avoids eager remote tar-reader construction. If the
+    batch endpoint fails for a dataset, use `USE_AIS_INDIVIDUAL_GETS=true` as a
+    slower fallback while investigating storage availability.
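+
+The resume check from item 6 can be sketched against any loader that exposes
+`state_dict()` and `load_state_dict()`. Everything here is illustrative:
+`CountingLoader` is a toy stand-in rather than a real dataloader, the check
+runs in one process instead of a fresh one, and real tensor batches would need
+an element-wise comparison instead of `==` on lists.
+
+```python
+import copy
+
+
+class CountingLoader:
+    """Toy stateful loader that yields consecutive integers."""
+
+    def __init__(self):
+        self.position = 0
+
+    def __iter__(self):
+        return self
+
+    def __next__(self):
+        value = self.position
+        self.position += 1
+        return value
+
+    def state_dict(self):
+        return {"position": self.position}
+
+    def load_state_dict(self, state):
+        self.position = state["position"]
+
+
+def resume_is_exact(make_loader, warmup=3, compare=4):
+    """Compare batches after a snapshot with batches from a restored loader."""
+    original = make_loader()
+    original_iter = iter(original)
+    for _ in range(warmup):
+        next(original_iter)
+    # Snapshot the state, then keep iterating to collect the ground truth.
+    state = copy.deepcopy(original.state_dict())
+    expected = [next(original_iter) for _ in range(compare)]
+
+    # Restore into a new loader before creating its iterator.
+    restored = make_loader()
+    restored.load_state_dict(state)
+    restored_iter = iter(restored)
+    actual = [next(restored_iter) for _ in range(compare)]
+    return expected == actual
+```
+
+Passing a factory instead of a loader instance keeps the restored loader
+independent of the original, which is the closest in-process approximation of
+restoring in a fresh process.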
+
+## Tier 3 - operational hygiene
+
+12. **Tune index-build workers to memory and storage backend.** Many workers can
+    OOM on large manifests or remote tar headers. Reduce workers or split the
+    blend when needed.
+
+13. **Keep optional prefetch steps explicit.** Manifest prefetch, index staging,
+    and model-cache preambles should be visible in the launcher and documented in
+    the report.
+
+14. **Use CPU-safe container settings for CPU-only index builds.** Some container
+    runtimes expect GPU hooks by default; bypass or disable them when the index
+    build runs without GPU access.
+
+## What not to do
+
+- Do not trust `meta.pt` key presence alone as proof of bit-exact resume.
+- Do not combine incompatible Lightning checkpoint triggers.
+- Do not point `indexes_root` at a node-local path unless the launcher populates
+  it before every chunk.
+- Do not launch iterable training until every source in the chain has been
+  audited and made partition-compatible.
+- Do not use map-style training to bypass indexing blockers; mark the migration
+  not launch-ready unless the user explicitly approves a temporary exception
+  with the blocker and expected overhead.
+- Do not set `LHOTSE_USE_WORKER_PARTITION` manually; it is an internal signal set
+  by the dataloader worker initialization path.
diff --git a/.claude/skills/migrate-to-resumable-dataloader/references/conflict-matrix.md b/.claude/skills/migrate-to-resumable-dataloader/references/conflict-matrix.md
@@ -0,0 +1,31 @@
+# Conflict matrix - indexed + resumable Lhotse
+
+Table format: `A | B | conflict | severity | resolution`.
+
+Severities:
+
+- **fatal**: automatic patching is impossible; data must be preprocessed or the
+  launcher/storage setup must change.
+- **error**: automatic patching is usually safe.
+- **warning**: context-dependent; report clearly.
+- **note**: informational.
+
+| A | B | conflict | severity | resolution |
+|---|---|---|---|---|
+| `data.train_ds.indexed: true` | `extra_fields:` on indexed NeMo entries | Indexed adapters cannot preserve arbitrary runtime field rewrites. | fatal | Preprocess the manifest to materialize fields, then drop `extra_fields`. |
+| `data.train_ds.indexed: true` | `slice_length:` on indexed entries | Slicing changes cut/audio access and has no stable sidecar unless preprocessed. | fatal | Re-shard or preprocess offline, then drop `slice_length`. |
+| `data.train_ds.indexed: true` | compressed JSONL/Shar cuts or compressed tar paths | Compressed streams do not provide stable seekable offsets for sidecars. | fatal | Re-export uncompressed or materialize seekable sources. |
+| `data.train_ds.indexed: true` | `pipe:` paths | Pipes are not seekable. | fatal | Materialize upstream data to files or a seekable backend. |
+| `data.train_ds.force_map_dataset: true` | resumable training launch | Map-style training keeps too much sampler/manifest work on the main process. | error | Set `data.train_ds.force_map_dataset: false` after making every training source indexed and partition-compatible. |
+| `force_map_dataset: true` | `force_iterable_dataset: true` | Dataset mode selection is contradictory. | error | Keep one mode. For training, use `force_map_dataset: false`; for validation/test, keep map-style unless intentionally testing iterable behavior. |
+| `use_stateful_dataloader: true` | per-chunk seed rotation | Model-level RNG diverges across resumed chunks. | error | Pin one seed for the whole chain in YAML and launcher. |
+| `use_stateful_dataloader: true` | `num_workers` changes between chunks | Saved dataloader state is incompatible. | error | Keep worker count invariant or restart without dataloader state. |
+| `use_stateful_dataloader: true` | `world_size` / rank topology changes | Saved iterator and sampler state are topology-sensitive. | error | Keep topology invariant or restart without dataloader state. |
+| `force_map_dataset: false` | any non-indexed source in the chain | Non-indexed sources do not partition and are duplicated across ranks/workers. | fatal | Convert all sources to indexed access or split/remove the non-indexed source. Do not switch to map-style training to bypass this unless the user explicitly approves a temporary exception. |
+| `force_map_dataset: false` | multiplexer seed is `"randomized"` | Shards may choose different sources at the same step. | error | Use a fixed integer seed. |
+| `force_finite: true` | training dataset | Can cap infinite training mixtures unexpectedly. | error | Use finite mode for validation/test only unless intentionally bounded. |
+| Checkpoint cadence absent | external preemption / walltime kill | Chunk progress can be lost without mid-chunk saves. | warning | Add frequent step- or time-based checkpoints. |
+| Node-local `indexes_root` | no prefetch/staging before startup | `.idx` files are missing at runtime. | error | Point to a persistent mirror or stage indexes before every chunk. |
+| AIStore batch mode | objects unavailable through batch endpoint | Batch loader may return empty content or fail collation. | warning | Verify object availability, replicate data, or set `USE_AIS_INDIVIDUAL_GETS=true`. |
+| Container lacks AIStore SDK | AIStore source paths | Remote reads may fall back to the wrong backend or fail. | error | Install a compatible `aistore` SDK in build/training containers. |
+| CPU-only index build | GPU container hook requires GPU runtime | Container startup can fail before index build begins. | warning | Use CPU-safe container settings or bypass GPU hooks. |
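+
+A few rows of the matrix can be checked directly on the `data.train_ds`
+mapping. The sketch below covers only the three rows whose fields are named
+explicitly (`force_map_dataset`, `force_iterable_dataset`, `force_finite`);
+the function name and the tuple layout of its results are hypothetical.
+
+```python
+def lint_train_ds(train_ds):
+    """Return (severity, field, message) tuples for a train_ds mapping."""
+    findings = []
+    force_map = bool(train_ds.get("force_map_dataset", False))
+    force_iterable = bool(train_ds.get("force_iterable_dataset", False))
+
+    if force_map and force_iterable:
+        findings.append((
+            "error", "force_map_dataset",
+            "contradicts force_iterable_dataset: true; keep one mode",
+        ))
+    if force_map:
+        findings.append((
+            "error", "force_map_dataset",
+            "set to false once every training source is indexed",
+        ))
+    if train_ds.get("force_finite", False):
+        findings.append((
+            "error", "force_finite",
+            "use finite mode for validation/test unless intentionally bounded",
+        ))
+    return findings
+```
+
+Rows that depend on the launcher, the blend contents, or the container cannot
+be decided from this mapping alone and need their own inputs.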