EncodeMistralText: move tokens to text_encoder.device before forward#51
Open
genno-whittlery wants to merge 1 commit into
Open
EncodeMistralText: move tokens to text_encoder.device before forward#51genno-whittlery wants to merge 1 commit into
genno-whittlery wants to merge 1 commit into
Conversation
Upstream pipeline stages land tokens (and the attention mask) on cuda:0 by default. When the Mistral text encoder is hosted on CPU -- for model-parallel setups where a single 32 GB consumer card can't fit Mistral-24B's ~48 GB bf16 -- the existing forward call hits a device-mismatch error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select) Move tokens and tokens_attention_mask to self.text_encoder.device just before the forward. .to(device) is a no-op when the tensors are already on that device, so the existing single-GPU happy-path is unaffected. Surfaced while bringing up dual-GPU FLUX.2 LoRA training in OneTrainer on 2x RTX 5090, where the Mistral TE has to stay on CPU and the transformer gets split across the two GPUs. Documented at https://github.com/genno-whittlery/flux2-dual-gpu-lora (OneTrainer port section).
Closed
4 tasks
genno-whittlery
added a commit
to genno-whittlery/flux2-dual-gpu-lora
that referenced
this pull request
May 11, 2026
Filed PR Nerogar/OneTrainer#1450 was closed by a maintainer who flagged that the body referenced issue #588 as "open since 2024-11", but the issue had actually been closed 2025-09-24. Closure happened on that factual error -- the dual-GPU implementation and its validation evidence weren't reviewed. The branch (genno-whittlery/OneTrainer:dual-gpu-flux2) is unchanged and still the canonical place to consume the OneTrainer port. Same for the mgds companion fix (genno-whittlery/mgds:te-device-fix) -- that PR (Nerogar/mgds#51) remains open separately. Dropped the "closes #588" claim from the README description.
c2c0562 to
e03c769
Compare
Collaborator
|
for this error to be triggered, that would require that the tokenizer is on another device than the text encoder. I don't see a usecase for that - why isn't the tokenizer also on CPU? is Mistral fast enough on CPU? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
EncodeMistralText.get_itemnow movestokensandattention_masktoself.text_encoder.devicebefore the forward call. One-line fix (counting the device move + the existing-mask guard).Why
Upstream pipeline stages land
tokensoncuda:0by default. When the Mistral text encoder is hosted on CPU — for example, on model-parallel setups where a single 32 GB consumer card can't fit Mistral-3-Small-24B's ~48 GB bf16 footprint — the existing forward call dies with:.to(device)is a no-op when the tensor is already on that device, so the single-GPU happy-path is unaffected.Where this came up
Building a dual-GPU model-parallel FLUX.2 LoRA training path for OneTrainer on 2× RTX 5090 — the Mistral TE has to stay on CPU because neither GPU can fit half the FLUX.2 transformer and the TE. With the TE on CPU,
EncodeMistralText.get_itemis the last device-mismatch bug before the forward succeeds. Full context: https://github.com/genno-whittlery/flux2-dual-gpu-lora (OneTrainer port section); paired with Nerogar/OneTrainer#1450.Test plan
.to(te_device)is no-op when devices match)