Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag by pei-li-hedgehog · Pull Request #230 · NVIDIA/Megatron-Energon

pei-li-hedgehog · 2026-05-19T11:04:35Z

Closes #231.

Summary

SqliteIndexWriter already supports enable_sample_tables=False to skip creating the samples and sample_parts tables, but that knob isn't plumbed through BaseWebdatasetFactory.prepare_dataset() or the energon prepare CLI. On very large datasets (100M+ samples) the SQLite inserts and the post-insert btree builds dominate preparation time. For datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), the samples table isn't queried at training time — checkpoint/resume uses SliceState integer offsets resolved via .info.json + .tar.idx.

Concretely, on a 244M-row / 4000-shard text webdataset we hit a 13-hour K8s pod timeout in the indexing phase. We need a way to skip the SQLite cost for monolithic pretraining datasets while keeping everything the integer-indexed loader needs (.tar.idx, .info.json, split config).

Changes

BaseWebdatasetFactory.prepare_dataset() accepts enable_sample_tables=True (default — backwards compatible), passes through to SqliteIndexWriterAggregator.
energon prepare --no-sample-tables exposes the same option. Mutually exclusive with --tar-index-only (which operates on an already-prepared dataset, different semantics).
append_samples / append_parts become no-ops when the writer was constructed with enable_sample_tables=False, so producers can keep emitting rows unmodified.
SqliteIndexWriter always drops stale samples / sample_parts when reset_tables=True, regardless of whether the new run intends to repopulate them. Avoids stale tables when re-preparing with --no-sample-tables after a previous full prepare.

Out of scope

Sample-key lookups (WebdatasetFileStore / SqliteITarEntryReader), polylithic joins (which INNER JOIN ... ON samples.sample_key = ... at metadataset-prepare time), and aux-data access on CrudeSample datasets all rely on the SQLite sample tables and will not work for datasets prepared with --no-sample-tables. That's the intended trade-off — documented in the new docstring/help text. Failures are loud (clear SQLite no such table errors), not silent.

A larger follow-up would also skip emitting IndexSample / IndexSamplePart items from _preprocess_tar (saving IPC overhead in addition to SQLite cost). Out of scope here.

Testing

All four existing test_prepare_dataset* tests pass (backwards compatibility).

Two new tests:

test_prepare_dataset_no_sample_tables — covers the prepare path:
- --no-sample-tables prepare succeeds; .info.json and per-tar .tar.idx are produced.
- index.sqlite exists but does not contain the samples / sample_parts tables.
- --no-sample-tables + --tar-index-only is rejected with the documented UsageError.
- Constructing a SqliteIndexReader against the prepared dataset and calling sample-key methods (get_sample_pointer_by_key, get_sample_count, get_sample_part) raises MissingSamplesTableError with a descriptive message — not the raw sqlite3.OperationalError: no such table: samples.
test_prepare_dataset_no_sample_tables_save_restore — covers the training-time checkpoint/resume path on a --no-sample-tables dataset:
- Reference: uninterrupted iteration of 20 samples.
- Save state mid-stream after 10 samples via save_state_rank(); continue iterating to capture the next 10 (post_save).
- Build a fresh loader, restore_state_rank(state), iterate 10 samples (post_restore).
- Asserts first_half + post_save == reference and post_restore == post_save.
- This empirically verifies that ShardInfosITarReader + SliceState (the integer-indexed loader/state path) work end-to-end on a dataset prepared without the SQLite samples tables.

ruff check and ruff format --check clean on all touched files.

…les CLI flag The SqliteIndexWriter already supports enable_sample_tables=False to skip the samples and sample_parts tables, but the flag was not plumbed through BaseWebdatasetFactory.prepare_dataset() or the `energon prepare` CLI. On very large datasets (100M+ samples) the SQLite inserts and post-insert btree builds dominate preparation runtime; for datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), the samples table is not needed at training time. Changes: - prepare_dataset() accepts enable_sample_tables=True (default), passes through to SqliteIndexWriterAggregator. - `energon prepare --no-sample-tables` flag exposes the same option. Cannot be combined with --tar-index-only (which operates on an already-prepared dataset). - append_samples/append_parts become no-ops when the writer was constructed with enable_sample_tables=False, so producers can keep emitting rows without changes. - SqliteIndexWriter always drops stale samples/sample_parts on reset_tables=True, regardless of whether the new run intends to repopulate them. This avoids leaving stale tables when re-preparing a dataset with --no-sample-tables. - Adds a unit test covering the CLI flag, table-absence, and the --no-sample-tables vs --tar-index-only conflict. Signed-off-by: Pei Li <pei.li@kaiko.ai>

Adds user-facing docs for the new option to skip the SQLite samples and sample_parts tables during prepare: - docs/source/basic/data_prep.md: subsection under "index.sqlite and index.uuid" explaining when and why to use `energon prepare --no-sample-tables`, including the trade-off (polylithic joins, FileStore access, and `energon mount` won't work). - docs/source/advanced/data_prep_api.md: section showing the equivalent `BaseWebdatasetFactory.prepare_dataset(..., enable_sample_tables=False)` programmatic usage with the same trade-off note. Signed-off-by: Pei Li <pei.li@kaiko.ai>

voegtlel

Hi @pei-li-hedgehog , thank you for this feature! Basically looks good to me, I'd just reduce the docs a bit, to be more on the spot.

voegtlel · 2026-05-19T14:21:18Z

One more thing: We may need to additionally add a proper error message when loading such a dataset as aux fails, i.e. if the sqlite doesn't contain the samples table, but we try to load it, explaining that the dataset was created without an index, and that it needs to be prepared accordingly.

Drop the `:class: warning` admonition (no other doc in the repo uses it that way) and trim both new subsections to focus on the single consequence that matters to users: a dataset prepared with --no-sample-tables / enable_sample_tables=False cannot be used as an auxiliary dataset. Matches the length and style of neighboring subsections. Signed-off-by: Pei Li <pei.li@kaiko.ai>

…e-tables datasets Without this, code paths that load a dataset prepared with --no-sample-tables as auxiliary data (or any other use of the SQLite sample-key index) would fail with the unhelpful raw `sqlite3.OperationalError: no such table: samples`. This adds a new MissingSamplesTableError that names the cause and points to the fix (re-prepare without --no-sample-tables). Every SqliteIndexReader method that queries the samples / sample_parts tables runs a one-time check on first use and raises the descriptive error if the table is absent. get_media_metadata is unaffected since the media_metadata table is independent. Extends the test to verify the error type and message for get_sample_pointer_by_key, get_sample_count, and get_sample_part. Signed-off-by: Pei Li <pei.li@kaiko.ai>

pei-li-hedgehog · 2026-05-19T15:18:10Z

Thanks for the review @voegtlel! All four points addressed.

Good that you mentioned about adding a proper error message when loading such a dataset as aux fails. I've made the following change:

New MissingSamplesTableError in indexing.py. Message names the cause (--no-sample-tables / enable_sample_tables=False) and points to the fix (re-prepare without the flag).
SqliteIndexReader now runs a one-time check on first use of any samples-touching method. get_media_metadata is unaffected — the media_metadata table is independent.
Test extended to cover the error type + message for three of the methods.

Empirically verifies that the integer-indexed loader's checkpoint / resume path works on a dataset prepared with --no-sample-tables. ShardInfosITarReader and SliceState never touch the SQLite samples tables, so the load-bearing claim of the flag is that training-time save/restore still produces the same sample sequence. This test exercises the round-trip: 1. Reference: an uninterrupted iteration of 20 samples. 2. Capture state mid-stream (after 10 samples) via save_state_rank(). 3. Continue iterating to capture the next 10 samples (post_save). 4. Build a fresh loader, restore_state_rank(state), iterate 10 samples (post_restore). 5. Assert first_half + post_save == reference (no divergence from the reference run) and post_restore == post_save (resumed iteration matches continued iteration). Re-prepares the test fixture as CaptioningSample + --no-sample-tables so get_train_dataset returns decodable samples. Signed-off-by: Pei Li <pei.li@kaiko.ai>

…ntryReader - data_prep.md: replace "consumed sequentially" with the precise constraint ("not used as auxiliary or mounted"). - SqliteIndexReader: expose has_sample_tables as a constructor-time attribute (mirrors db_has_sample_parts); drop the per-method _check_samples_table guard. - SqliteITarEntryReader: raise MissingSamplesTableError at __init__ when the samples table is missing — fail fast at the boundary that actually requires it. - Test updated to assert at SqliteITarEntryReader construction. Signed-off-by: Pei Li <pei.li@kaiko.ai>

voegtlel

Just this very small edit

Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>

Signed-off-by: Pei Li <pei.li@kaiko.ai>

pei-li-hedgehog · 2026-05-21T08:08:21Z

Suggestions applied. Thx!

philipp-fischer · 2026-05-26T06:27:40Z

I'd rather vote for improving the speed of creating those sqlite tables instead of introducing a flag to skip them. Skipping them means we will have datasets that are incompatible with some of energon's feature such as energon mount.

psycherun-kaiko · 2026-05-27T12:50:58Z

How about doing both? I agree improving the indexing speed would be very helpful in general, but in some cases the sample-level indices might really be unnecessary.

Moreover, this path is opt-in. We could additionally make the potential failures more clear to the user via docs, or failing fast.

What do you think?

pei-li-hedgehog mentioned this pull request May 19, 2026

Add option to skip SQLite samples/sample_parts tables during prepare (multi-hour cost on 100M+ sample datasets) #231

Open

voegtlel requested changes May 19, 2026

View reviewed changes

Comment thread docs/source/basic/data_prep.md

Comment thread docs/source/advanced/data_prep_api.md

Comment thread docs/source/advanced/data_prep_api.md Outdated

pei-li-hedgehog added 2 commits May 19, 2026 17:02

voegtlel requested changes May 19, 2026

View reviewed changes

Comment thread docs/source/basic/data_prep.md Outdated

Comment thread src/megatron/energon/flavors/webdataset/indexing.py Outdated

pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 4c7ab4a to f205b90 Compare May 19, 2026 19:08

pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from f205b90 to e35c68d Compare May 19, 2026 20:30

pei-li-hedgehog requested a review from voegtlel May 19, 2026 20:34

voegtlel requested changes May 20, 2026

View reviewed changes

Comment thread src/megatron/energon/flavors/webdataset/itar_reader.py Outdated

Comment thread src/megatron/energon/flavors/webdataset/itar_reader.py Outdated

pei-li-hedgehog and others added 2 commits May 20, 2026 09:58

Update src/megatron/energon/flavors/webdataset/itar_reader.py

3a7644a

Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>

Update src/megatron/energon/flavors/webdataset/itar_reader.py

2506865

Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>

pei-li-hedgehog requested a review from voegtlel May 20, 2026 08:21

pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 4bcce3b to 2678503 Compare May 20, 2026 08:24

fix test failure for test_prepare_dataset_no_sample_tables

0914862

Signed-off-by: Pei Li <pei.li@kaiko.ai>

pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 2678503 to 0914862 Compare May 21, 2026 08:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230

Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230
pei-li-hedgehog wants to merge 9 commits into
NVIDIA:developfrom
pei-li-hedgehog:pli-expose-enable-sample-tables

pei-li-hedgehog commented May 19, 2026 •

edited

Loading

Uh oh!

voegtlel left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

voegtlel commented May 19, 2026

Uh oh!

pei-li-hedgehog commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

voegtlel left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

pei-li-hedgehog commented May 21, 2026

Uh oh!

philipp-fischer commented May 26, 2026

Uh oh!

psycherun-kaiko commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

pei-li-hedgehog commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Out of scope

Testing

Uh oh!

voegtlel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

voegtlel commented May 19, 2026

Uh oh!

pei-li-hedgehog commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

voegtlel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pei-li-hedgehog commented May 21, 2026

Uh oh!

philipp-fischer commented May 26, 2026

Uh oh!

psycherun-kaiko commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pei-li-hedgehog commented May 19, 2026 •

edited

Loading

pei-li-hedgehog commented May 19, 2026 •

edited

Loading

voegtlel left a comment •

edited

Loading