dataset4dstem support torch array while maintaining backward compatibility #228

Open
bobleesj wants to merge 7 commits into electronmicroscopy:dev from bobleesj:dataset-support-torch
Collaborator

@bobleesj bobleesj commented May 19, 2026

What problem this PR addresses

A long discussion was started in #222, with an action plan in #222 (comment)

tl;dr - allow Dataset4dstem to hold a torch tensor without breaking existing notebooks and code.

API

# Existing — unchanged
Dataset4dstem.from_array(numpy_arr, sampling=..., units=..., name=...)

# New — GPU-resident path
Dataset4dstem.from_tensor(torch_tensor, sampling=..., units=..., name=...)

Access

┌─────────────────────────────┬─────────────────────────────┬───────────────────────────────────────────────────┐
│      Property / Method      │  numpy-backed (from_array)  │            tensor-backed (from_tensor)            │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.array                  │ np.ndarray                  │ None (use .tensor)                                │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.tensor                 │ AttributeError              │ torch.Tensor                                      │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.numpy()                │ np.ndarray (same as .array) │ np.ndarray (CPU copy via .detach().cpu().numpy()) │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.device                 │ "cpu"                       │ "cuda:0" / "mps" / "cpu"                          │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.to(device)             │ AttributeError              │ moves tensor, returns self                        │
├─────────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤
│ dset.shape / .ndim / .dtype │ from numpy                  │ from tensor                                       │
└─────────────────────────────┴─────────────────────────────┴───────────────────────────────────────────────────┘
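
The access rules in the table can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DualSlotDataset` is a hypothetical stand-in for `Dataset4dstem`, it ignores `sampling` and `units`, and torch is imported optionally so that the numpy path runs without it.

```python
import numpy as np

try:  # torch is optional here; only the tensor-backed path needs it
    import torch
except ImportError:
    torch = None


class DualSlotDataset:
    """Hypothetical stand-in for the PR's dataset class (illustration only)."""

    def __init__(self, array=None, tensor=None, name="dataset"):
        # Exactly one of (array, tensor) must be provided.
        if (array is None) == (tensor is None):
            raise ValueError("Provide exactly one of `array` or `tensor`.")
        self._array = array
        self._tensor = tensor
        self.name = name

    @classmethod
    def from_array(cls, array, name="dataset"):
        return cls(array=np.asarray(array), name=name)

    @classmethod
    def from_tensor(cls, tensor, name="dataset"):
        return cls(tensor=tensor, name=name)

    @property
    def array(self):
        # None for tensor-backed datasets, as in the table.
        return self._array

    @property
    def tensor(self):
        if self._tensor is None:
            raise AttributeError(f"Dataset '{self.name}' is numpy-backed; use .array.")
        return self._tensor

    def numpy(self):
        # Works for both backings; tensor-backed data is brought to the CPU first.
        if self._array is not None:
            return self._array
        return self._tensor.detach().cpu().numpy()

    @property
    def device(self):
        return "cpu" if self._tensor is None else str(self._tensor.device)

    def to(self, device):
        if self._tensor is None:
            raise AttributeError(
                f"Cannot .to({device!r}) on numpy-backed Dataset '{self.name}'."
            )
        self._tensor = self._tensor.to(device)
        return self

    @property
    def shape(self):
        data = self._array if self._array is not None else self._tensor
        return tuple(data.shape)


dset = DualSlotDataset.from_array(np.zeros((2, 3, 4, 5)), name="demo")
print(dset.device, dset.shape)
```

Code that needs to work with either backing can call `dset.numpy()` instead of branching on `.array` versus `.tensor`.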

May 22, 2026 update

Verification:

Widget:

Screenshot 2026-05-22 at 11 25 32 PM

Ptycho notebook:

Screenshot 2026-05-22 at 11 23 32 PM Screenshot 2026-05-22 at 11 24 22 PM

Direct ptycho notebook:

Screenshot 2026-05-22 at 11 26 15 PM

What should the reviewer(s) do

Arthur's comment: #222 (comment)

  • non-breaking just adding basic torch support
  • in base Dataset: dset.tensor is None if not set; both .tensor and .array exist, but only one of them will be set (or the unset one raises AttributeError for explicitness)
  • Dataset4dstem: dset.from_tensor classmethod
  • dset.device (just return self.tensor.device)
  • dset.to (just moving dset.tensor)
  • dset.numpy() method (will be required in the future, make it now so people get used to it)

Collaborator Author

@bobleesj bobleesj left a comment

@arthurmccray This is ready for review. I tried to keep the changes as minimal as possible while addressing the comments provided.

dataset3d and others aren't touched on purpose to make this PR easy to review. Happy to iterate a few times or address anything to make this PR more robust.

Comment thread src/quantem/core/datastructures/dataset.py
self._array = arr
super().__init__()
# Dual-slot storage: exactly one of (_array, _tensor) is set.
if array is None and tensor is None:
Collaborator Author

Some conditional checks for now: a dataset can be either numpy-backed OR torch-backed, not both at this stage.

Collaborator

Is there a way that an array or tensor is never initialized for a Dataset? Otherwise, I feel like this first conditional is kind of redundant since everything is instantiated with from_data or from_tensor.

Collaborator

agreed that some of these protections are probably unnecessary, but it's okay to leave them assuming that they will be removed once the transition is complete (maybe with a comment stating as much)

return (array if array is not None else self._tensor).ndim

@property
def dtype(self) -> DTypeLike:
Collaborator Author

dtype - based on the given numpy or torch

Collaborator

Is torch.dtype included in numpy's DTypeLike?

return (array if array is not None else self._tensor).dtype

@property
def device(self) -> str:
Collaborator Author

device - "cpu" for numpy; for torch, it depends on the tensor

Collaborator

you can actually do .device on numpy arrays, np.arange(10).device -> "cpu". it's included to be compatible with other array packages :)
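
A small sketch of how a single helper could read the device from either backing. `ndarray.device` was added in NumPy 2.0 for array-API compatibility, so the fallback below covers older NumPy versions; `device_of` is a hypothetical name, not something in this PR.

```python
import numpy as np


def device_of(data) -> str:
    """Return the device of an array-like as a string, defaulting to "cpu"."""
    # NumPy >= 2.0 arrays and torch tensors both expose `.device`;
    # older NumPy arrays and plain Python objects do not, so fall back to "cpu".
    return str(getattr(data, "device", "cpu"))


print(device_of(np.arange(10)))  # cpu
```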

return "cpu"

For NumPy-only datasets, this is always "cpu".
def numpy(self) -> NDArray:
Collaborator Author

Per @arthurmccray's comment: getting users used to this for explicit array-type conversion.

Collaborator

this looks good to me! Only thing i would add is the flags.writeable thing that Cedric found to the torch tensor output, making it clear that it cannot be written to. I haven't tested this but it seems like what we want: #222 (comment)
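
One way this could look, assuming the suggestion refers to NumPy's `flags.writeable` (the linked comment is not reproduced here). `as_readonly` is a hypothetical helper; in `numpy()` it would wrap the result of `.detach().cpu().numpy()`, which for a CPU tensor shares memory with the tensor.

```python
import numpy as np


def as_readonly(arr: np.ndarray) -> np.ndarray:
    """Return a read-only view of `arr` without copying the data."""
    view = arr.view()
    view.flags.writeable = False  # note the NumPy spelling: "writeable"
    return view


data = np.zeros((2, 2))
out = as_readonly(data)
try:
    out[0, 0] = 1.0
except ValueError as err:
    # NumPy refuses in-place writes to a read-only array.
    print(f"write rejected: {err}")
```

Using a view keeps the original array writable, so only the object handed back to the caller is protected.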

Comment thread src/quantem/core/datastructures/dataset4dstem.py
Comment thread src/quantem/core/datastructures/dataset4dstem.py
Comment thread widget/src/quantem/widget/show4dstem.py
Collaborator

@arthurmccray arthurmccray left a comment

Overall this looks good! A couple questions on dtypes and devices, and a few places where we should at least put comments for temporary things that will be removed once the transition is complete. Once those are addressed I think it should be good to merge

raise AttributeError(
f"Cannot .to({device!r}) on numpy-backed Dataset '{self.name}'."
)
self._tensor = tensor.to(device)
Collaborator

From the config module we have a method for validating and getting canonical names for devices, which might be useful here.

from quantem.core import config

dev, _id = config.validate_device(device)
self._tensor = tensor.to(dev)

Comment on lines +189 to +193
if tensor.ndim != 4:
raise ValueError(
f"Dataset4dstem.from_tensor requires a 4D tensor "
f"(scan_row, scan_col, dp_row, dp_col), got shape {tuple(tensor.shape)}."
)
Collaborator

This is fine for now, but I think we should update the validators (or maybe make a new ensure_valid_tensor to match ensure_valid_array). I generally like having validators as it significantly cuts down on bloat.
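
A rough sketch of what such a validator might look like. The name `ensure_valid_tensor` comes from the comment above, but the signature and behavior here are assumptions; the check is duck-typed on `.ndim` and `.shape` so the example runs without torch, whereas a real version would presumably also check for `torch.Tensor`.

```python
import numpy as np  # only used for the usage example below


def ensure_valid_tensor(tensor, ndim=None, name="tensor"):
    """Hypothetical validator: check that `tensor` is array-like with the expected rank."""
    if not (hasattr(tensor, "ndim") and hasattr(tensor, "shape")):
        raise TypeError(
            f"{name} must expose .ndim and .shape, got {type(tensor).__name__}."
        )
    if ndim is not None and tensor.ndim != ndim:
        raise ValueError(f"{name} must be {ndim}D, got shape {tuple(tensor.shape)}.")
    return tensor


# A NumPy array stands in for a torch tensor so the sketch runs without torch.
stack = ensure_valid_tensor(np.zeros((2, 3, 4, 5)), ndim=4, name="4D-STEM data")
```

With a helper like this, the five-line check in `from_tensor` would reduce to a single call.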

3 participants