Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230
Conversation
…les CLI flag The SqliteIndexWriter already supports enable_sample_tables=False to skip the samples and sample_parts tables, but the flag was not plumbed through BaseWebdatasetFactory.prepare_dataset() or the `energon prepare` CLI. On very large datasets (100M+ samples) the SQLite inserts and post-insert btree builds dominate preparation runtime; for datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), the samples table is not needed at training time. Changes: - prepare_dataset() accepts enable_sample_tables=True (default), passes through to SqliteIndexWriterAggregator. - `energon prepare --no-sample-tables` flag exposes the same option. Cannot be combined with --tar-index-only (which operates on an already-prepared dataset). - append_samples/append_parts become no-ops when the writer was constructed with enable_sample_tables=False, so producers can keep emitting rows without changes. - SqliteIndexWriter always drops stale samples/sample_parts on reset_tables=True, regardless of whether the new run intends to repopulate them. This avoids leaving stale tables when re-preparing a dataset with --no-sample-tables. - Adds a unit test covering the CLI flag, table-absence, and the --no-sample-tables vs --tar-index-only conflict. Signed-off-by: Pei Li <pei.li@kaiko.ai>
Adds user-facing docs for the new option to skip the SQLite samples and sample_parts tables during prepare: - docs/source/basic/data_prep.md: subsection under "index.sqlite and index.uuid" explaining when and why to use `energon prepare --no-sample-tables`, including the trade-off (polylithic joins, FileStore access, and `energon mount` won't work). - docs/source/advanced/data_prep_api.md: section showing the equivalent `BaseWebdatasetFactory.prepare_dataset(..., enable_sample_tables=False)` programmatic usage with the same trade-off note. Signed-off-by: Pei Li <pei.li@kaiko.ai>
voegtlel
left a comment
There was a problem hiding this comment.
Hi @pei-li-hedgehog , thank you for this feature! Basically looks good to me, I'd just reduce the docs a bit, to be more on the spot.
|
One more thing: We may need to additionally add a proper error message when loading such a dataset as aux fails, i.e. if the sqlite doesn't contain the |
Drop the `:class: warning` admonition (no other doc in the repo uses it that way) and trim both new subsections to focus on the single consequence that matters to users: a dataset prepared with --no-sample-tables / enable_sample_tables=False cannot be used as an auxiliary dataset. Matches the length and style of neighboring subsections. Signed-off-by: Pei Li <pei.li@kaiko.ai>
…e-tables datasets Without this, code paths that load a dataset prepared with --no-sample-tables as auxiliary data (or any other use of the SQLite sample-key index) would fail with the unhelpful raw `sqlite3.OperationalError: no such table: samples`. This adds a new MissingSamplesTableError that names the cause and points to the fix (re-prepare without --no-sample-tables). Every SqliteIndexReader method that queries the samples / sample_parts tables runs a one-time check on first use and raises the descriptive error if the table is absent. get_media_metadata is unaffected since the media_metadata table is independent. Extends the test to verify the error type and message for get_sample_pointer_by_key, get_sample_count, and get_sample_part. Signed-off-by: Pei Li <pei.li@kaiko.ai>
|
Thanks for the review @voegtlel! All four points addressed. Good that you mentioned about
|
Empirically verifies that the integer-indexed loader's checkpoint /
resume path works on a dataset prepared with --no-sample-tables.
ShardInfosITarReader and SliceState never touch the SQLite samples
tables, so the load-bearing claim of the flag is that training-time
save/restore still produces the same sample sequence. This test
exercises the round-trip:
1. Reference: an uninterrupted iteration of 20 samples.
2. Capture state mid-stream (after 10 samples) via save_state_rank().
3. Continue iterating to capture the next 10 samples (post_save).
4. Build a fresh loader, restore_state_rank(state), iterate 10 samples
(post_restore).
5. Assert first_half + post_save == reference (no divergence from
the reference run) and post_restore == post_save (resumed
iteration matches continued iteration).
Re-prepares the test fixture as CaptioningSample + --no-sample-tables
so get_train_dataset returns decodable samples.
Signed-off-by: Pei Li <pei.li@kaiko.ai>
4c7ab4a to
f205b90
Compare
…ntryReader
- data_prep.md: replace "consumed sequentially" with the precise constraint
("not used as auxiliary or mounted").
- SqliteIndexReader: expose has_sample_tables as a constructor-time attribute
(mirrors db_has_sample_parts); drop the per-method _check_samples_table guard.
- SqliteITarEntryReader: raise MissingSamplesTableError at __init__ when the
samples table is missing — fail fast at the boundary that actually requires it.
- Test updated to assert at SqliteITarEntryReader construction.
Signed-off-by: Pei Li <pei.li@kaiko.ai>
f205b90 to
e35c68d
Compare
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
4bcce3b to
2678503
Compare
Signed-off-by: Pei Li <pei.li@kaiko.ai>
2678503 to
0914862
Compare
|
Suggestions applied. Thx! |
|
I'd rather vote for improving the speed of creating those sqlite tables instead of introducing a flag to skip them. Skipping them means we will have datasets that are incompatible with some of energon's feature such as |
|
How about doing both? I agree improving the indexing speed would be very helpful in general, but in some cases the sample-level indices might really be unnecessary. Moreover, this path is opt-in. We could additionally make the potential failures more clear to the user via docs, or failing fast. What do you think? |
Closes #231.
Summary
SqliteIndexWriteralready supportsenable_sample_tables=Falseto skip creating thesamplesandsample_partstables, but that knob isn't plumbed throughBaseWebdatasetFactory.prepare_dataset()or theenergon prepareCLI. On very large datasets (100M+ samples) the SQLite inserts and the post-insert btree builds dominate preparation time. For datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), thesamplestable isn't queried at training time — checkpoint/resume usesSliceStateinteger offsets resolved via.info.json+.tar.idx.Concretely, on a 244M-row / 4000-shard text webdataset we hit a 13-hour K8s pod timeout in the indexing phase. We need a way to skip the SQLite cost for monolithic pretraining datasets while keeping everything the integer-indexed loader needs (
.tar.idx,.info.json, split config).Changes
BaseWebdatasetFactory.prepare_dataset()acceptsenable_sample_tables=True(default — backwards compatible), passes through toSqliteIndexWriterAggregator.energon prepare --no-sample-tablesexposes the same option. Mutually exclusive with--tar-index-only(which operates on an already-prepared dataset, different semantics).append_samples/append_partsbecome no-ops when the writer was constructed withenable_sample_tables=False, so producers can keep emitting rows unmodified.SqliteIndexWriteralways drops stalesamples/sample_partswhenreset_tables=True, regardless of whether the new run intends to repopulate them. Avoids stale tables when re-preparing with--no-sample-tablesafter a previous full prepare.Out of scope
Sample-key lookups (
WebdatasetFileStore/SqliteITarEntryReader), polylithic joins (whichINNER JOIN ... ON samples.sample_key = ...at metadataset-prepare time), and aux-data access onCrudeSampledatasets all rely on the SQLite sample tables and will not work for datasets prepared with--no-sample-tables. That's the intended trade-off — documented in the new docstring/help text. Failures are loud (clear SQLiteno such tableerrors), not silent.A larger follow-up would also skip emitting
IndexSample/IndexSamplePartitems from_preprocess_tar(saving IPC overhead in addition to SQLite cost). Out of scope here.Testing
All four existing
test_prepare_dataset*tests pass (backwards compatibility).Two new tests:
test_prepare_dataset_no_sample_tables— covers the prepare path:--no-sample-tablesprepare succeeds;.info.jsonand per-tar.tar.idxare produced.index.sqliteexists but does not contain thesamples/sample_partstables.--no-sample-tables+--tar-index-onlyis rejected with the documentedUsageError.SqliteIndexReaderagainst the prepared dataset and calling sample-key methods (get_sample_pointer_by_key,get_sample_count,get_sample_part) raisesMissingSamplesTableErrorwith a descriptive message — not the rawsqlite3.OperationalError: no such table: samples.test_prepare_dataset_no_sample_tables_save_restore— covers the training-time checkpoint/resume path on a--no-sample-tablesdataset:save_state_rank(); continue iterating to capture the next 10 (post_save).restore_state_rank(state), iterate 10 samples (post_restore).first_half + post_save == referenceandpost_restore == post_save.ShardInfosITarReader+SliceState(the integer-indexed loader/state path) work end-to-end on a dataset prepared without the SQLite samples tables.ruff checkandruff format --checkclean on all touched files.