Speed up UniversalDataset batching for tabular data #175

Open
tippered1-debug wants to merge 1 commit into sb-ai-lab:master from tippered1-debug:perf/universal-dataset-batched-getitems

@tippered1-debug

Summary

This PR adds a batched __getitems__ fast path for UniversalDataset when no tokenizer is used.

For tabular-only data, PyTorch DataLoader can request a whole batch of indices via __getitems__. The previous path fetched every row separately through __getitem__, then rebuilt the batch in collate_dict. The new path slices NumPy arrays once per batch and lets collate_dict handle an already batched dictionary.

The tokenizer/text path keeps the old row-wise behavior.
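To illustrate the difference between the two paths, here is a minimal sketch. It is not the PR's implementation: `ToyTabularDataset`, the `fetch` helper, and the `num` / `cat` field names are hypothetical, and the helper only mimics how a map-style fetcher prefers `__getitems__` when the dataset defines it.

```python
import numpy as np


class ToyTabularDataset:
    """Hypothetical stand-in for a tabular-only dataset (not UniversalDataset)."""

    def __init__(self, num: np.ndarray, cat: np.ndarray):
        # Column name -> array with one row per sample.
        self.data = {"num": num, "cat": cat}

    def __len__(self) -> int:
        return len(self.data["num"])

    def __getitem__(self, idx: int) -> dict:
        # Row-wise path: one small dict per sample.
        return {name: arr[idx] for name, arr in self.data.items()}

    def __getitems__(self, indices) -> dict:
        # Batched path: one fancy-indexing slice per column for the whole batch.
        idx = np.asarray(indices, dtype=np.int64)
        return {name: arr[idx] for name, arr in self.data.items()}


def fetch(dataset, indices):
    """Mimics a fetcher that uses __getitems__ when the dataset provides it."""
    if hasattr(dataset, "__getitems__"):
        return dataset.__getitems__(indices)
    return [dataset[i] for i in indices]


rng = np.random.default_rng(0)
ds = ToyTabularDataset(
    num=rng.normal(size=(1000, 8)).astype(np.float32),
    cat=rng.integers(0, 10, size=(1000, 3)).astype(np.int64),
)

batch_indices = [5, 1, 42, 7]
batch = fetch(ds, batch_indices)          # already batched dict
rows = [ds[i] for i in batch_indices]     # what the row-wise path would produce
```

In this sketch the batched dict holds the same values that stacking the row-wise dicts would give, but it is built with one indexing operation per column instead of one per row.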

Changes

  • Add UniversalDataset.__getitems__ for tabular-only batched fetching.
  • Extend collate_dict to accept both list[dict] and already batched dict.
  • Make collation dtype-aware using _dtypes_mapping.
  • Add focused tests for batched fetching, dtype preservation, and tokenizer fallback.
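The collation change can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's `collate_dict`: the `collate` function and the `DTYPES` table are hypothetical, the real `_dtypes_mapping` may be keyed and valued differently, and NumPy arrays stand in for the tensors a real collate function would return.

```python
import numpy as np

# Hypothetical field -> dtype table; the PR's _dtypes_mapping may differ.
DTYPES = {"num": np.float32, "cat": np.int64}


def collate(batch):
    """Accept list[dict] (row-wise) or dict (already batched)."""
    if isinstance(batch, dict):
        # Fast path: the dataset already returned one array per field.
        columns = batch
    else:
        # Row-wise path: stack each field across the list of samples.
        keys = batch[0].keys()
        columns = {k: np.stack([row[k] for row in batch]) for k in keys}

    out = {}
    for key, value in columns.items():
        arr = np.asarray(value)
        target = DTYPES.get(key)
        # Cast only when a target dtype is known; otherwise keep the dtype as-is.
        out[key] = arr if target is None else arr.astype(target, copy=False)
    return out


rows = [
    {"num": np.array([1.0, 2.0]), "cat": np.array([1, 2], dtype=np.int32)},
    {"num": np.array([3.0, 4.0]), "cat": np.array([3, 4], dtype=np.int32)},
]
batched = {
    "num": np.stack([r["num"] for r in rows]),
    "cat": np.stack([r["cat"] for r in rows]),
}

from_rows = collate(rows)
from_batched = collate(batched)
```

Both inputs should produce the same output dict here, which is the property that lets the tokenizer path keep returning list[dict] while the tabular path returns a single dict.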

Benchmark

Synthetic benchmark, 200k rows:

  • tabular_balanced: 15.239x faster batch fetching
  • tabular_many_cat: 11.083x faster batch fetching
  • tabular_large_batch: 6.944x faster batch fetching

This benchmark measures the UniversalDataset / DataLoader batch fetching path, not full model training time.
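A rough way to time this kind of comparison in isolation is sketched below. It is not the benchmark script used for the numbers above: the column shapes, batch size, and function names are assumptions, and it times plain NumPy fetching rather than UniversalDataset or a DataLoader, so its ratios will not match the reported ones.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n_rows, batch_size = 200_000, 256  # batch size is an arbitrary choice

# Hypothetical tabular columns: 16 numeric and 8 categorical features.
columns = {
    "num": rng.normal(size=(n_rows, 16)).astype(np.float32),
    "cat": rng.integers(0, 100, size=(n_rows, 8)).astype(np.int64),
}
indices = rng.permutation(n_rows)[:batch_size]


def fetch_rowwise():
    # One dict per row, then stack each field back into a batch.
    rows = [{k: v[i] for k, v in columns.items()} for i in indices]
    return {k: np.stack([r[k] for r in rows]) for k in columns}


def fetch_batched():
    # One fancy-indexing slice per column for the whole batch.
    return {k: v[indices] for k, v in columns.items()}


# Take the best of a few repeats to reduce timing noise.
t_row = min(timeit.repeat(fetch_rowwise, number=20, repeat=3))
t_batch = min(timeit.repeat(fetch_batched, number=20, repeat=3))
print(f"row-wise: {t_row:.4f}s  batched: {t_batch:.4f}s  ratio: {t_row / t_batch:.1f}x")
```

The measured ratio depends on batch size, number of columns, and hardware, which is consistent with the three scenarios above reporting different speedups.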

Tests

  • pytest tests/unit/test_text/test_universal_dataset.py -q
