improves dataloader performance #687

Merged: LucaMarconato merged 29 commits into main from giovp/dataloader3 on May 21, 2026
@giovp (Member) commented Aug 21, 2024

Replaces #565 and #622.

Iteration over 20 batches, single worker:

New implementation:
image

main:
image

One annoying thing is that the DataFrame `apply` call used to get the bounding box selection is quite slow.

@giovp (Member, Author) commented Sep 3, 2024

Quick push to try #699, where tiling is vectorized, which removes the need for `pandas.DataFrame.apply`. Quite a big speedup.

image
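
To illustrate why vectorizing the tiling helps, here is a minimal sketch, not the code from this PR. It assumes each region is a circle described by a center and a radius, and that a tile is the axis-aligned box around it. The function names are hypothetical, and the Python loop stands in for a row-wise `apply`.

```python
import numpy as np


def bboxes_loop(centers, radii):
    """Row-by-row version: one Python-level iteration per region."""
    out = []
    for (x, y), r in zip(centers, radii):
        out.append((x - r, y - r, x + r, y + r))
    return np.array(out, dtype=float)


def bboxes_vectorized(centers, radii):
    """Same boxes from two broadcasted array operations."""
    r = radii[:, None]  # shape (n, 1) broadcasts against centers of shape (n, 2)
    # Columns are [xmin, ymin, xmax, ymax].
    return np.hstack([centers - r, centers + r])


# Synthetic input loosely modeled on the benchmark setup: 500 circles, radius 32.
rng = np.random.default_rng(0)
centers = rng.uniform(32, 2016, size=(500, 2))
radii = np.full(500, 32.0)

assert np.allclose(bboxes_loop(centers, radii), bboxes_vectorized(centers, radii))
```

Both functions return an `(n, 4)` array; the vectorized one avoids per-row Python overhead, which is the usual reason such a change is faster.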

LucaMarconato and others added 3 commits May 15, 2026 23:16
**Bugs fixed in datasets.py:**
- rasterize=True path was broken: __getitem__ always called image.sel() regardless
  of rasterize flag, bypassing rasterize_fn entirely. Fixed by storing self._rasterize
  and branching in __getitem__.
- ad.concat(*tables_l) unpacked the list as positional args, failing with >1 region.
  Fixed to ad.concat(tables_l).
- Vectorized selection pre-computation was always run even for rasterize=True where
  it is unused. Fixed by guarding with `if not rasterize`.
- Removed stale commented-out pandas.apply fallback code.
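
The star-unpacking bug above can be reproduced without any library. The sketch below uses `concat_like`, a hypothetical stand-in that, by assumption, mirrors the call shape of a function taking one collection as its only positional argument with keyword-only options. It is not `ad.concat` itself.

```python
def concat_like(items, *, join="inner"):
    """Hypothetical stand-in: one positional collection, keyword-only options."""
    merged = []
    for item in items:
        merged.extend(item)
    return merged


# Three "tables", one per region (plain lists here, for illustration only).
tables_l = [[1, 2], [3], [4, 5]]

# Correct: pass the list itself.
merged = concat_like(tables_l)

# Buggy: star-unpacking turns every table into its own positional argument,
# so the call fails as soon as there is more than one table.
try:
    concat_like(*tables_l)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc
```

With a single table the unpacked call still passes exactly one argument, which is why a bug of this kind can go unnoticed until a second region is added.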

**Fixes in _utils.py:**
- Removed redundant nopython=True from @nb.njit (njit implies nopython=True,
  and the argument caused a RuntimeWarning).
- Replaced invalid nb.types.Array[nb.float64, nb.float64] annotations with np.ndarray.

**Fixes in spatial_query.py:**
- Restored BoundingBoxRequest validation that was commented out. The validator's
  __post_init__ already handles both 1-D (single box) and 2-D (multi-box) arrays.
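
As a rough sketch of how one `__post_init__` can validate both shapes, consider the class below. `BoxRequestSketch` and its checks are assumptions made for illustration; the actual `BoundingBoxRequest` in spatialdata may validate differently.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BoxRequestSketch:
    """Hypothetical stand-in, not the spatialdata BoundingBoxRequest."""

    axes: tuple
    min_coordinate: np.ndarray
    max_coordinate: np.ndarray

    def __post_init__(self):
        lo = np.asarray(self.min_coordinate, dtype=float)
        hi = np.asarray(self.max_coordinate, dtype=float)
        if lo.shape != hi.shape:
            raise ValueError("min and max coordinates must have the same shape")
        if lo.ndim not in (1, 2):
            raise ValueError("expected a 1-D (single box) or 2-D (multi-box) array")
        # Checking the last dimension covers both (n_axes,) and (n_boxes, n_axes).
        if lo.shape[-1] != len(self.axes):
            raise ValueError("last dimension must match the number of axes")
        if np.any(lo > hi):
            raise ValueError("min coordinates must not exceed max coordinates")


# A single box and a batch of two boxes pass through the same validation.
single = BoxRequestSketch(("x", "y"), np.array([0, 0]), np.array([10, 10]))
multi = BoxRequestSketch(
    ("x", "y"),
    np.array([[0, 0], [5, 5]]),
    np.array([[10, 10], [20, 20]]),
)
```

Indexing the shape with `-1` is what lets the same checks serve both layouts without a separate branch per dimensionality.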

**Benchmark (benchmark_dataloader.py):**
Synthetic 2048x2048 image, 500 circle regions (32 px radius), 3-channel.

  Phase       main      PR (fixed)  speedup
  init        ~162 ms   ~20 ms      ~8x
  fetch 500   ~618 ms   ~118 ms     ~5x
  per-tile    ~1237 us  ~235 us     ~5x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LucaMarconato
Member

I picked this up; performance indeed improves significantly with the new vectorized bounding box approach. Thanks @giovp!

asv benchmark results:

| Change   | Before [accf496c] <main>   | After [e87d3183] <giovp/dataloader3>   |   Ratio | Benchmark (Parameter)                          |
|----------|----------------------------|----------------------------------------|---------|------------------------------------------------|
| -        | 618±20ms                   | 123±1ms                                |    0.2  | benchmark_dataloader.TimeDataloader.time_fetch |
| -        | 169±20ms                   | 18.7±0.1ms                             |    0.11 | benchmark_dataloader.TimeDataloader.time_init  |
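
For reading the table: the Ratio column is the "after" time divided by the "before" time, so the speedup is its reciprocal. A small check using the mean timings from the table (the error bars are ignored here, and the helper name is hypothetical):

```python
def ratio_and_speedup(before_ms, after_ms):
    """Return (after / before, before / after) for two mean timings."""
    return after_ms / before_ms, before_ms / after_ms


# Mean timings in milliseconds, copied from the asv table.
benchmarks = {
    "time_fetch": (618.0, 123.0),
    "time_init": (169.0, 18.7),
}

for name, (before, after) in benchmarks.items():
    ratio, speedup = ratio_and_speedup(before, after)
    print(f"{name}: ratio {ratio:.2f}, speedup {speedup:.1f}x")
# prints:
# time_fetch: ratio 0.20, speedup 5.0x
# time_init: ratio 0.11, speedup 9.0x
```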

@codecov (bot) commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.55%. Comparing base (accf496) to head (a51cb95).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #687      +/-   ##
==========================================
+ Coverage   92.28%   92.55%   +0.26%     
==========================================
  Files          51       51              
  Lines        7804     7763      -41     
==========================================
- Hits         7202     7185      -17     
+ Misses        602      578      -24     
| Files with missing lines                     | Coverage Δ                   |
|----------------------------------------------|------------------------------|
| src/spatialdata/_core/query/_utils.py        | 93.26% <100.00%> (ø)         |
| src/spatialdata/_core/query/spatial_query.py | 95.52% <ø> (ø)               |
| src/spatialdata/dataloader/datasets.py       | 90.52% <100.00%> (+0.27%) ⬆️ |

... and 3 files with indirect coverage changes

@LucaMarconato LucaMarconato merged commit 9cd4eb7 into main May 21, 2026
9 checks passed
@LucaMarconato LucaMarconato deleted the giovp/dataloader3 branch May 21, 2026 12:06