Support query optimization with Dask expression arrays by mrocklin · Pull Request #11382 · pydata/xarray

mrocklin · 2026-06-12T01:07:37Z

A while ago I finished Dask expression arrays which support query optimization. This PR supports them in Xarray. This required a few things:

Creating a new __dask_exprs__ protocol in Dask (see Support composite expressions with __dask_exprs__ protocol dask/dask#12457)
Implementing that protocol in Xarray (this does most of the lifting)
Building an array chunk manager (this was done in the dask-array project)
Some silliness around xarray's map_blocks function
Changing a few explicit uses of dask.array to instead use the chunk manager (these should probably be changed regardless)

Example

import dask
import dask_array
import xarray as xr
from xarray.namedarray.parallelcompat import get_chunked_array_type


ds = xr.tutorial.scatter_example_dataset(seed=42).chunk({"x": 1, "y": 1, "z": 2, "w": 2})

# The slice and rechunk start above the elementwise operation.  dask-array's
# optimizer can push them down so it only builds the small requested window.
window = (ds.A + ds.B).chunk({"y": 3}).isel(x=slice(0, 1), y=slice(0, 3))

tasks_before = len(window.__dask_graph__())
(optimized_window,) = dask.optimize(window)
optimized_data = window.data.optimize()
tasks_after = len(optimized_data.__dask_graph__())

manager = get_chunked_array_type(ds.A.data)

print(f"xarray chunk manager: {type(manager).__name__}")
print(f"dask.optimize result: {type(optimized_window).__name__}")
print(f"array type: {type(window.data).__module__}.{type(window.data).__name__}")
print(f"graph tasks before optimize: {tasks_before}")
print(f"graph tasks after optimize:  {tasks_after}")
print()
print("Before optimize:")
window.data.pprint()
print()
print("After optimize:")
optimized_data.pprint()

Output

xarray chunk manager: DaskArrayExprManager
dask.optimize result: DataArray
array type: dask_array._collection.Array
graph tasks before optimize: 448
graph tasks after optimize:  12

Before optimize:
  Operation                Shape    Bytes   Chunks
  Getitem           (1, 3, 4, 4)    384 B  1×3×2×2
  └ Rechunk        (3, 11, 4, 4)  4.1 kiB  1×3×2×2
    └ Add          (3, 11, 4, 4)  4.1 kiB  1×1×2×2
      ├ FromArray  (3, 11, 4, 4)  4.1 kiB  1×1×2×2
      └ FromArray  (3, 11, 4, 4)  4.1 kiB  1×1×2×2

After optimize:
  Operation           Shape  Bytes   Chunks
  Add          (1, 3, 4, 4)  384 B  1×3×2×2
  ├ FromArray  (1, 3, 4, 4)  384 B  1×3×2×2
  └ FromArray  (1, 3, 4, 4)  384 B  1×3×2×2
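
For intuition only, the rewrite visible in the two trees above can be mimicked with a toy expression tree. This is a sketch in plain Python, not dask-array's optimizer: the classes `FromArray`, `Add`, and `Getitem` borrow their names from the printed output, `push_down` is a hypothetical helper, arrays are 1-D tuples, and the rechunk step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FromArray:
    data: tuple  # 1-D stand-in for an in-memory array

    def compute(self):
        return list(self.data)


@dataclass(frozen=True)
class Add:
    left: object
    right: object

    def compute(self):
        return [a + b for a, b in zip(self.left.compute(), self.right.compute())]


@dataclass(frozen=True)
class Getitem:
    child: object
    index: slice

    def compute(self):
        return self.child.compute()[self.index]


def push_down(expr):
    """Rewrite Getitem(Add(a, b)) into Add(Getitem(a), Getitem(b)), recursively."""
    if isinstance(expr, Getitem):
        child = expr.child
        if isinstance(child, Add):
            return Add(
                push_down(Getitem(child.left, expr.index)),
                push_down(Getitem(child.right, expr.index)),
            )
        if isinstance(child, FromArray):
            # Slice the source directly so only the window is ever materialized.
            return FromArray(child.data[expr.index])
        return expr
    if isinstance(expr, Add):
        return Add(push_down(expr.left), push_down(expr.right))
    return expr


a = FromArray(tuple(range(10)))
b = FromArray(tuple(range(100, 110)))
window = Getitem(Add(a, b), slice(0, 3))
optimized = push_down(window)

# The rewrite must not change the result, only how much work is done.
assert optimized.compute() == window.compute()
print(type(window).__name__, "->", type(optimized).__name__)
```

As in the real output, the root of the tree changes from a Getitem to an Add whose inputs are already restricted to the requested window.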

norlandrhagen · 2026-06-13T20:00:19Z

Excited to check this out! Fingers crossed it helps the xarray/dask 'large task graph' serialization warning.

mrocklin

To aid review I've added some comments on what I think is essential, and what could be dropped with only slight degradation in functionality (but lots of simplicity in review)

mrocklin · 2026-06-16T13:41:16Z

+    elif module_available("dask", "2024.08.2"):
+        from dask.array import reshape_blockwise as dask_reshape_blockwise
+
+        return dask_reshape_blockwise(x, shape=shape, chunks=chunks)


I think that this change is useful regardless if xarray wants to do the chunk manager thing rather than be tied to dask.array

SGTM. FWIW it's technically not array api (and will never be since it's only valuable for chunked array kinds). Presumably dask_array implements it?

Fair point. I guess the question then becomes "how should xarray handle dispatch along these operations between different chunked APIs". What I've done here doesn't feel quite right. Any suggestions?

We have the "chunk manager" for this.

OK. I understand more clearly now. Thank you. I guess there are two options here then:

Leave it as is. A little unclean because we're kind of abusing a protocol, but in a harmless way

Extend Xarray's chunk manager

From your earlier SGTM comment my sense is that it'd be reasonable to extend the chunk manager, but not a big deal. Planning to leave this as is for now, but let me know if you'd prefer otherwise.
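
To make the second option concrete, here is a rough sketch of what dispatching through a chunk manager could look like. It is plain Python with made-up names (`ToyChunkManager`, `ChunkedList`, `ChunkedListManager`, `get_manager`, and `reshape_blockwise` as a manager method); it does not reproduce xarray's actual chunk manager classes, only the idea that the caller asks whichever manager owns the array type instead of importing dask.array directly.

```python
from abc import ABC, abstractmethod


class ToyChunkManager(ABC):
    """Hypothetical stand-in for a per-backend chunk manager."""

    @abstractmethod
    def is_chunked_array(self, data) -> bool: ...

    @abstractmethod
    def reshape_blockwise(self, data, shape): ...


class ChunkedList(list):
    """Hypothetical chunked array type: a flat list tagged as 'chunked'."""


class ChunkedListManager(ToyChunkManager):
    def is_chunked_array(self, data) -> bool:
        return isinstance(data, ChunkedList)

    def reshape_blockwise(self, data, shape):
        rows, cols = shape
        if rows * cols != len(data):
            raise ValueError("shape does not match number of elements")
        return [list(data[r * cols:(r + 1) * cols]) for r in range(rows)]


MANAGERS = [ChunkedListManager()]


def get_manager(data):
    """Return the first registered manager that recognizes `data`."""
    for manager in MANAGERS:
        if manager.is_chunked_array(data):
            return manager
    raise TypeError(f"no chunk manager recognizes {type(data).__name__}")


def reshape_blockwise(data, shape):
    # Generic entry point: no backend-specific import is needed here.
    return get_manager(data).reshape_blockwise(data, shape)


result = reshape_blockwise(ChunkedList(range(6)), (2, 3))
print(result)
```

Under this layout, supporting another backend means registering another manager rather than adding another `elif` branch at each call site.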

mrocklin · 2026-06-16T13:41:35Z

    #  once https://github.com/pydata/xarray/issues/9229 being implemented

-    pushed_array = da.reductions.cumreduction(
+    pushed_array = cumreduction(


Same here, this change is about keeping things generic.

we have this in the "chunk manager" as scan. Can you have claude make that change please

Yup. Can do. Thanks for the pointer.
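
For readers unfamiliar with the term, a scan is a cumulative reduction that can be evaluated chunk by chunk. The sketch below uses plain Python lists as chunks and is not the chunk manager's implementation; `chunked_scan` is a hypothetical helper that scans each chunk independently and then folds in the running result carried over from the previous chunk. It assumes the binary operation is associative.

```python
from itertools import accumulate
import operator


def chunked_scan(chunks, binop=operator.add):
    """Cumulative reduction over a list of chunks, one chunk at a time."""
    out = []
    carry = None  # running result at the end of the previous non-empty chunk
    for chunk in chunks:
        local = list(accumulate(chunk, binop))  # independent per-chunk scan
        if carry is not None:
            local = [binop(carry, value) for value in local]  # fold in the carry
        if local:
            carry = local[-1]
        out.append(local)
    return out


chunks = [[1, 2, 3], [4, 5], [6]]
print(chunked_scan(chunks))
```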

mrocklin · 2026-06-16T13:43:16Z

@@ -0,0 +1,275 @@
+from __future__ import annotations


The changes in this file are the largest, and also aren't strictly necessary. They're here to support xarray's map_blocks function, which is a little odd in the proposed architecture. I'd be happy to remove these changes if it would accelerate review. In their defense though, they're also pretty isolated from the main codebase and so should have a low blast radius.

mrocklin · 2026-06-16T13:43:52Z

+    def __dask_rebuild_from_exprs__(self, exprs):
+        ds = self._to_temp_dataset().__dask_rebuild_from_exprs__(exprs)
+        return self._from_temp_dataset(ds)
+


This is the core of the change. There's a newly proposed protocol in Dask and this is Xarray supporting that protocol.
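
A toy version of the round trip may help reviewers who have not read the Dask PR. Everything here is an assumption made for illustration: `ToyCollection`, the string "expressions", and the `optimize` helper are hypothetical, and the signature of `__dask_exprs__` is guessed to be a no-argument method returning a sequence. Only the two method names come from this PR and dask/dask#12457.

```python
class ToyCollection:
    """Container holding several named 'expressions' plus ordinary metadata."""

    def __init__(self, variables, attrs):
        self.variables = dict(variables)  # name -> expression (strings here)
        self.attrs = dict(attrs)

    def __dask_exprs__(self):
        # Expose the expressions in a stable order (assumed signature).
        return tuple(self.variables.values())

    def __dask_rebuild_from_exprs__(self, exprs):
        # Same names and metadata, new expressions.
        exprs = list(exprs)
        if len(exprs) != len(self.variables):
            raise ValueError("expected exactly one expression per variable")
        return ToyCollection(dict(zip(self.variables, exprs)), self.attrs)


def optimize(collection, rewrite):
    """What a scheduler-side helper might do: extract, rewrite, rebuild."""
    exprs = collection.__dask_exprs__()
    return collection.__dask_rebuild_from_exprs__([rewrite(e) for e in exprs])


ds = ToyCollection({"A": "getitem(add(a, b))", "B": "b"}, {"title": "demo"})
out = optimize(
    ds,
    lambda e: e.replace("getitem(add(a, b))", "add(getitem(a), getitem(b))"),
)
print(out.variables)
```

The point of the split is that the optimizer never needs to know what the container is; it only sees a flat sequence of expressions and hands back a sequence of the same length.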

mrocklin · 2026-06-16T13:44:48Z

+                    return HighLevelGraph.merge(*graphs.values())
            except ImportError:
-                from dask import sharedict
+                pass


sharedict is pretty ancient

can we remove the try/except then?

Sure. Done.

mrocklin · 2026-06-16T13:45:28Z

+            self._indexes,
+            self._encoding,
+            self._close,
+        )


This is, again, the core of the change I'm looking for. I hope that it's both fairly straightforward and has a low blast radius.

confirmed this is v. similar to existing _dask_postcompute as expected.

mrocklin · 2026-06-16T13:46:03Z

+            wrapper=_wrapper,
+            get_chunk_slicer=_get_chunk_slicer,
+            dataset_to_dataarray=dataset_to_dataarray,
+        )  # type: ignore[return-value]


This is part of the larger change that's not really necessary. It's only here to support xarray's map_blocks function.

dcherian · 2026-06-17T15:51:42Z

+        import dask
+        from dask._collections import new_collection
+
+        exprs_iter = iter(exprs)


hahahah can we just exprs = list(exprs) it and assert len(exprs) == 1? This is some epic Claude nonsense.

I think there needs to be exactly one expression per Dask collection. zip(..., strict=True) would also be a cleaner way to do this.

I'm not sure I follow. I don't think we want to assert that len(exprs) == 1. For context, in a Dataset there are likely to be several exprs, one for each dask array. We want to iterate through them while also iterating through the reconstructed dataset and replace the expressions into the Dataset

Ah, sorry, I was responding here to @dcherian's response, not @shoyer's. My internet access today is a bit spotty and I was responding to outdated information. +1 on zip strict.

dcherian

Generally looks fine to me. I left some minor requests.

I didn't look at the map_blocks stuff too closely. But I can't understand how it works conceptually. It doesn't seem to be an 'expression'. Have we lost culling then (e.g. ds.pipe(xr.map_blocks(...)).sel(...) => ds.sel(...).pipe(xr.map_blocks, ...) )?

Can you add some docs to the top of that dask_array_expr file explaining what it does?

dcherian · 2026-06-17T16:06:53Z

Also we'll need a CI env to test it :). Can we reuse the existing dask.array test suite?

shoyer

Should this wait until the upstream Dask changes go in, or is it safe to merge now?

shoyer · 2026-06-17T16:16:06Z

+        import dask
+        from dask._collections import new_collection
+
+        exprs_iter = iter(exprs)


I think there needs to be exactly one expression per Dask collection. zip(..., strict=True) would also be a cleaner way to do this.

shoyer · 2026-06-17T16:17:58Z

+        for v in self.variables.values():
+            if dask.is_dask_collection(v):
+                if not is_dask_array_expr_array(v._data):
+                    return None


Could you note why falling back to claiming there are no expressions in the mixed case is the right thing to do? Alternatively, I can imagine raising an error might be more user-friendly.

They're both valid options. The mixed case does actually work (at least if you take the dask.compute(...) path). It's entirely possible though that this is indicative of a situation that users would still want to be made aware of and correct. Erring could make sense. So too could warning.

I guess the common way this case might occur is when an external library constructs dask.array.Array and a user combines that with a dask_array.

I can see a warning being useful. Should the choice between silence/warning/error be an option on the dask_array side? An error-by-default policy could push the ecosystem towards using expressions by default. In general, I have developed a strong dislike for this kind of "accept-everything" behaviour, it makes things hard to reason about.

Happy to put a warning in if that's what people want. I think that this isn't a decision that I make. I also think that it's the kind of decision that doesn't need to block this PR. It's low stakes and easy to change in the future.
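
The three behaviours under discussion could be exposed as a single policy switch. This is only a sketch: `check_mixed`, the `on_mixed` argument, and the string tags are hypothetical and do not correspond to an existing option in xarray or dask_array.

```python
import warnings


def check_mixed(kinds, on_mixed="warn"):
    """Decide whether a set of variables can be treated as expression-backed.

    kinds:    iterable of "expr" or "legacy", one tag per Dask-backed variable
    on_mixed: "ignore", "warn", or "raise"
    Returns True only when there is at least one variable and all are "expr".
    """
    if on_mixed not in {"ignore", "warn", "raise"}:
        raise ValueError(f"unknown policy: {on_mixed!r}")
    kinds = set(kinds)
    if kinds <= {"expr"}:
        return bool(kinds)  # no Dask-backed variables means nothing to expose
    if "expr" in kinds:  # both expression and legacy arrays are present
        message = "mixing expression-backed and legacy Dask arrays"
        if on_mixed == "raise":
            raise TypeError(message)
        if on_mixed == "warn":
            warnings.warn(message, UserWarning, stacklevel=2)
    return False


print(check_mixed(["expr", "expr"]))
print(check_mixed(["expr", "legacy"], on_mixed="ignore"))
```

Keeping the decision in one function would make it cheap to change the default later, which matches the point that this choice need not block the PR.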

shoyer · 2026-06-17T16:22:02Z

+
+dask = pytest.importorskip("dask")
+da = pytest.importorskip("dask.array")
+dask_array = pytest.importorskip("dask_array")


I don't think we install dask_array currently in our CI, which would probably be a good idea to ensure this doesn't break.

Could you try adding this into our pixi.toml config?

xarray/pixi.toml

Line 400 in fb20c68

test-py313 = { features = [

I've added dask-array to the dask feature, which I think does the job. Not certain though.

I've removed this for now. I could use guidance on CI.

mrocklin · 2026-06-17T17:30:59Z

Should this wait until the upstream Dask changes go in, or is it safe to merge now?

It's fine to wait. I think it's good to coordinate merging both. I was waiting to push on merging the dask changes until this got some eyes on it. Happy to accelerate merging that PR as needed.

mrocklin · 2026-06-17T17:31:20Z

And thanks for the feedback all. Working on things now.

mrocklin · 2026-06-17T17:54:05Z

I didn't look at the map_blocks stuff too closely. But I can't understand how it works conceptually. It doesn't seem to be an 'expression'. Have we lost culling then (e.g. ds.pipe(xr.map_blocks(...)).sel(...) => ds.sel(...).pipe(xr.map_blocks, ...) )?

My plan is to remove map_blocks from the PR in order to get the more important changes in quickly. However, broadly how we're doing this is creating a composite expression that takes in each of the dask-array expressions in the dataset, and emitting lots of dask-array expressions. The actual task does what xarray.map_blocks has always done (or at least as was my historical understanding): take each set of numpy array chunks, assemble an xarray dataset on the fly, call the user-defined function, and then emit the numpy arrays again. We're just kind of doing that same thing but now at the expr level.
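
The per-chunk flow described above, in miniature. Plain lists stand in for NumPy chunks and a dict stands in for the temporary Dataset; `apply_per_block` and `total` are hypothetical names, and none of this is the code in the PR.

```python
def apply_per_block(chunked_vars, func):
    """chunked_vars: name -> list of chunks (all variables chunked alike).

    For each block index: gather one chunk per variable, build a small
    dataset-like dict, call the user function, and collect the outputs
    back into per-variable chunk lists.
    """
    n_blocks = {len(chunks) for chunks in chunked_vars.values()}
    if len(n_blocks) != 1:
        raise ValueError("all variables must have the same number of chunks")
    out = {}
    for i in range(n_blocks.pop()):
        block = {name: chunks[i] for name, chunks in chunked_vars.items()}
        result = func(block)  # the user function sees an assembled block
        for name, values in result.items():
            out.setdefault(name, []).append(values)
    return out


def total(block):
    # User function: returns one new variable computed from two inputs.
    return {"C": [a + b for a, b in zip(block["A"], block["B"])]}


chunks = {"A": [[1, 2], [3]], "B": [[10, 20], [30]]}
print(apply_per_block(chunks, total))
```

Because `func` is an arbitrary callable, an optimizer cannot see through it, which is why slicing after the call cannot be pushed into the inputs.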

In terms of dask array optimizations yes, you're correct that map_blocks is fairly opaque.

But really, let's just kick that down the road I think. What's here is ok I think, but I'm not keen to push through a complex thing at the moment.

Co-Authored-By: Codex <codex@openai.com>

mrocklin · 2026-06-17T21:36:53Z

For testing I tried briefly to add dask_array to one of the main CI lanes. I was hoping to run into just a few explicit dask.array issues. Turns out that there are many. I don't think that they're issues with code actually, but instead issues with the tests. Many are deeply tied to things like the dask.array.Array constructor.

Nothing that an agent couldn't chew through given time, but it would definitely make this PR much larger, which I'm not sure is appropriate at the moment.

Keep the dask-array chunk-manager fixes in xarray while dropping the dedicated dask-array CI environment. This leaves map_blocks out of scope, keeps optional dask_array discovery localized, and updates the groupby expectation now that it remains expression-backed. Co-Authored-By: Codex <codex@openai.com>

mrocklin · 2026-06-17T23:37:47Z

I don't know how best to handle testing here. I could use help thinking through Xarray's CI system and testing matrix, as well as help thinking through how deeply we want to test this when. Do we want the entire xarray test suite to pass before we merge in the protocol support? That's ok, but it'll probably be a lot of review. If not, then what do we want to make sure is tested before merging?

mrocklin · 2026-06-18T00:04:18Z

Also it's not clear to me that the CI failures here are due to these changes. I suspect it may be the recent pytest 9.1 release (perhaps pinning pytest would be wise if so)

mrocklin · 2026-06-18T00:32:25Z

Thank you for the review @dcherian @shoyer. I think I've handled the comments except for CI.

On CI I don't have strong conviction on any plan due to ignorance of the project, but what I would probably do is the following:

Merge this without CI support
Follow up with work that adds dask_array to one of the lanes (or make a new lane) and runs all existing tests on it. It will fail hard.
Go through all tests and either make them generic, or mark them with a custom pytest.mark for failing with dask_array due to some permissible reason (like they were made a decade ago and deeply tie into dask.array internal details)

I think that this is a lot of busy work that agents can handle pretty well. I'm happy to kick it off and iterate on it. If it were me I wouldn't include it in this PR. Personally I would keep this PR light.

Another option would be to add a CI entry for just the tests added in this PR, and maybe a few more scattered throughout the repo. This feels ephemeral to me though.

Anyway, I've done what I can here. Passing off to you all if you're still interested. Thanks for the time spent so far.

github-actions Bot added the topic-dask label Jun 12, 2026

mrocklin mentioned this pull request Jun 12, 2026

Support composite expressions with __dask_exprs__ protocol dask/dask#12457

Open

mrocklin force-pushed the codex/composite-expr-protocol branch from 147a748 to 3501992 Compare June 12, 2026 02:53

mrocklin commented Jun 16, 2026

dcherian reviewed Jun 17, 2026

shoyer reviewed Jun 17, 2026

mrocklin and others added 7 commits June 17, 2026 11:21

Add Dask expression protocol support

034774b

Use dask_array expressions for map_blocks

59039a0

Isolate dask_array expression integration

87561b8

Co-Authored-By: Codex <codex@openai.com>

Use active chunked array backend

fd7a9c8

Fix dask_array expression typing checks [skip-rtd]

4b17760

Address expression protocol review comments

2804b6a

Co-Authored-By: Codex <codex@openai.com>

Defer dask_array map_blocks integration

584e611

Co-Authored-By: Codex <codex@openai.com>

mrocklin force-pushed the codex/composite-expr-protocol branch from 9b67dd6 to 584e611 Compare June 17, 2026 18:21

Inline dask-array expression check

566da38

github-actions Bot added the Automation Github bots, testing workflows, release automation label Jun 17, 2026

mrocklin force-pushed the codex/composite-expr-protocol branch from 9f63cb1 to e856404 Compare June 17, 2026 23:36
