Speed up UniversalDataset batching for tabular data by tippered1-debug · Pull Request #175 · sb-ai-lab/LightAutoML

tippered1-debug · 2026-06-10T22:16:24Z

Summary

This PR adds a batched __getitems__ fast path for UniversalDataset when no tokenizer is used.

When a dataset defines __getitems__, PyTorch's DataLoader (in recent versions) passes it the whole list of batch indices in a single call, which suits tabular-only data. The previous path fetched every row separately through __getitem__, then rebuilt the batch in collate_dict. The new path slices the NumPy arrays once per batch and hands collate_dict an already batched dictionary.

The tokenizer/text path keeps the old row-wise behavior.
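To make the two paths concrete, here is a minimal NumPy-only sketch of the idea. It is not the code from this PR: `TabularDataset`, the `cont`/`cat`/`label` keys, and the `fetch` helper are hypothetical names, and `fetch` only imitates how a DataLoader fetcher prefers `__getitems__` when the dataset defines it.

```python
import numpy as np


class TabularDataset:
    """Hypothetical stand-in for a tabular-only dataset (not the real UniversalDataset)."""

    def __init__(self, cont, cat, label):
        self.data = {"cont": cont, "cat": cat, "label": label}

    def __len__(self):
        return len(self.data["label"])

    def __getitem__(self, index):
        # Row-wise path: build one dictionary per row.
        return {key: arr[index] for key, arr in self.data.items()}

    def __getitems__(self, indices):
        # Batched path: one fancy-indexing slice per array for the whole batch.
        idx = np.asarray(indices)
        return {key: arr[idx] for key, arr in self.data.items()}


def collate_dict(batch):
    """Accept either a list of per-row dicts or an already batched dict."""
    if isinstance(batch, dict):
        return batch
    return {key: np.stack([row[key] for row in batch]) for key in batch[0]}


def fetch(dataset, indices):
    """Imitates a DataLoader fetcher: use __getitems__ when the dataset has it."""
    if hasattr(dataset, "__getitems__"):
        return collate_dict(dataset.__getitems__(indices))
    return collate_dict([dataset[i] for i in indices])
```

Both paths should produce the same arrays for the same indices; the batched one simply avoids building and re-stacking one dictionary per row.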

Changes

Add UniversalDataset.__getitems__ for tabular-only batched fetching.
Extend collate_dict to accept either a list[dict] or an already batched dict.
Make collation dtype-aware using _dtypes_mapping.
Add focused tests for batched fetching, dtype preservation, and tokenizer fallback.
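As an illustration of what dtype-aware collation could look like, the sketch below casts each field to a target dtype looked up by key. The contents of `DTYPES_MAPPING` and the function name `collate_with_dtypes` are assumptions made for this example; the real `_dtypes_mapping` in LightAutoML may be keyed and typed differently.

```python
import numpy as np

# Hypothetical mapping for illustration only; the real _dtypes_mapping may differ.
DTYPES_MAPPING = {"cont": np.float32, "cat": np.int64, "label": np.float32}


def collate_with_dtypes(batch, dtypes=DTYPES_MAPPING):
    """Collate a list of row dicts or an already batched dict, casting by key."""
    if not isinstance(batch, dict):
        # Row-wise input: stack each field into one array first.
        batch = {key: np.stack([row[key] for row in batch]) for key in batch[0]}
    out = {}
    for key, value in batch.items():
        arr = np.asarray(value)
        target = dtypes.get(key)
        # Keys without a mapping entry keep their original dtype.
        out[key] = arr if target is None else arr.astype(target, copy=False)
    return out
```

Using `copy=False` means arrays that already have the target dtype are returned without an extra copy.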

Benchmark

Synthetic benchmark, 200k rows:

tabular_balanced: 15.239x faster batch fetching
tabular_many_cat: 11.083x faster batch fetching
tabular_large_batch: 6.944x faster batch fetching

This benchmark measures the UniversalDataset / DataLoader batch fetching path, not full model training time.
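For readers who want to try a comparable measurement, here is a rough sketch of timing only the fetch step. It is not the benchmark script behind the numbers above: the helper names, data shapes, and batch size are assumptions, and it bypasses DataLoader entirely, so absolute ratios will differ.

```python
import time

import numpy as np


def fetch_rowwise(data, indices):
    # Old-style path: one dict per row, then re-stack into a batch.
    rows = [{k: a[i] for k, a in data.items()} for i in indices]
    return {k: np.stack([r[k] for r in rows]) for k in rows[0]}


def fetch_batched(data, indices):
    # New-style path: slice every array once per batch.
    idx = np.asarray(indices)
    return {k: a[idx] for k, a in data.items()}


def make_batches(n_rows, batch_size, seed=0):
    # Shuffled index batches, similar to what a random sampler would yield.
    order = np.random.default_rng(seed).permutation(n_rows)
    return [order[i:i + batch_size] for i in range(0, n_rows, batch_size)]


def time_fetch(fetch, data, batches):
    # Wall-clock time to fetch every batch once.
    start = time.perf_counter()
    for indices in batches:
        fetch(data, indices)
    return time.perf_counter() - start


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rows = 20_000  # smaller than the 200k rows used in the PR benchmark
    data = {
        "cont": rng.normal(size=(n_rows, 16)).astype(np.float32),
        "cat": rng.integers(0, 100, size=(n_rows, 8)),
        "label": rng.normal(size=n_rows).astype(np.float32),
    }
    batches = make_batches(n_rows, 256)
    slow = time_fetch(fetch_rowwise, data, batches)
    fast = time_fetch(fetch_batched, data, batches)
    print(f"row-wise: {slow:.4f}s, batched: {fast:.4f}s")
```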

Tests

pytest tests/unit/test_text/test_universal_dataset.py -q

Speed up UniversalDataset batching for tabular data

14b8500

tippered1-debug wants to merge 1 commit into sb-ai-lab:master from tippered1-debug:perf/universal-dataset-batched-getitems