dataset4dstem support torch array while maintaining backward compatibility#228
dataset4dstem support torch array while maintaining backward compatibility#228bobleesj wants to merge 7 commits into
dataset4dstem support torch array while maintaining backward compatibility#228Conversation
bobleesj
left a comment
There was a problem hiding this comment.
@arthurmccray This is ready for review - I tried to have as minimal change as possible while catering to the comments provided.
dataset3d and others aren't touched on purpose to make this PR easy to review. Happy to iterative a few times or address anything to make this PR more robust.
| self._array = arr | ||
| super().__init__() | ||
| # Dual-slot storage: exactly one of (_array, _tensor) is set. | ||
| if array is None and tensor is None: |
There was a problem hiding this comment.
Some conditional checks for now - user can either have nupmy-backed OR torch-backed. Not both at this stage
There was a problem hiding this comment.
Is there a way that an array or tensor is never initialized for a Dataset? Otherwise, I feel like this first conditional is kind of redundant since everything is instantiated with from_data or from_tensor.
There was a problem hiding this comment.
agreed that some of these protections are probably unnecessary, but it's okay to leave them assuming that they will be removed once the transition is complete (maybe with a comment stating as much)
| return (array if array is not None else self._tensor).ndim | ||
|
|
||
| @property | ||
| def dtype(self) -> DTypeLike: |
There was a problem hiding this comment.
dtype - based on the given numpy or torch
There was a problem hiding this comment.
Are torch.dtype included in numpys DTypeLike?
| return (array if array is not None else self._tensor).dtype | ||
|
|
||
| @property | ||
| def device(self) -> str: |
There was a problem hiding this comment.
device - cpy for numpy, for torch, depends on the tensor
There was a problem hiding this comment.
you can actually do .device on numpy arrays, np.arange(10).device -> "cpu". it's included to be compatible with other array packages :)
| return "cpu" | ||
|
|
||
| For NumPy-only datasets, this is always "cpu". | ||
| def numpy(self) -> NDArray: |
There was a problem hiding this comment.
@arthurmccray comment on - getting User used to this for explicit array type.
There was a problem hiding this comment.
this looks good to me! Only thing i would add is the flags.writable thing that Cedric found to the torch tensor output, making it clear that it cannot be writable. I haven't tested this but it seems like what we want: #222 (comment)
arthurmccray
left a comment
There was a problem hiding this comment.
Overall this looks good! A couple questions on dtypes and devices, and a few places where we should at least put comments for temporary things that will be removed once the transition is complete. Once those are addressed I think it should be good to merge
| self._array = arr | ||
| super().__init__() | ||
| # Dual-slot storage: exactly one of (_array, _tensor) is set. | ||
| if array is None and tensor is None: |
There was a problem hiding this comment.
agreed that some of these protections are probably unnecessary, but it's okay to leave them assuming that they will be removed once the transition is complete (maybe with a comment stating as much)
| return (array if array is not None else self._tensor).ndim | ||
|
|
||
| @property | ||
| def dtype(self) -> DTypeLike: |
There was a problem hiding this comment.
Are torch.dtype included in numpys DTypeLike?
| return (array if array is not None else self._tensor).dtype | ||
|
|
||
| @property | ||
| def device(self) -> str: |
There was a problem hiding this comment.
you can actually do .device on numpy arrays, np.arange(10).device -> "cpu". it's included to be compatible with other array packages :)
| return "cpu" | ||
|
|
||
| For NumPy-only datasets, this is always "cpu". | ||
| def numpy(self) -> NDArray: |
There was a problem hiding this comment.
this looks good to me! Only thing i would add is the flags.writable thing that Cedric found to the torch tensor output, making it clear that it cannot be writable. I haven't tested this but it seems like what we want: #222 (comment)
| raise AttributeError( | ||
| f"Cannot .to({device!r}) on numpy-backed Dataset '{self.name}'." | ||
| ) | ||
| self._tensor = tensor.to(device) |
There was a problem hiding this comment.
From the config module we have a method for validating and getting canonical names for devices, which might be useful here.
from quantem.core import config
dev, _id = config.validate_device(device)
self._tensor = tensor.to(dev)
| if tensor.ndim != 4: | ||
| raise ValueError( | ||
| f"Dataset4dstem.from_tensor requires a 4D tensor " | ||
| f"(scan_row, scan_col, dp_row, dp_col), got shape {tuple(tensor.shape)}." | ||
| ) |
There was a problem hiding this comment.
This is fine for now, but I think we should update the validators (or maybe make a new ensure_valid_tensor to match ensure_valid_array). I generally like having validators as it significantly cuts down on bloat.
What problem this PR addreseses
A long discussion has been initiated here: #222 with action plan in #222 (comment)
tl;dr - allow
datset4dstemto hold torch tensor, w/o breaking existing notebooks and codes.API
May 22, 2026 update
Verification:
Widget:
Ptycho notebook:
Direct ptycho noteobok:
What should the reviewer(s) do
Arthur's comment: #222 (comment)
.tensorand.array, but only one of them will be set (raisesAttributeErrorfor explicitness)dset.from_tensorclassmethod