Skip to content

EncodeMistralText: move tokens to text_encoder.device before forward#51

Open
genno-whittlery wants to merge 1 commit into
Nerogar:masterfrom
genno-whittlery:te-device-fix
Open

EncodeMistralText: move tokens to text_encoder.device before forward#51
genno-whittlery wants to merge 1 commit into
Nerogar:masterfrom
genno-whittlery:te-device-fix

Conversation

@genno-whittlery

@genno-whittlery genno-whittlery commented May 11, 2026

Copy link
Copy Markdown

Summary

EncodeMistralText.get_item now moves tokens and attention_mask to self.text_encoder.device before the forward call. One-line fix (counting the device move + the existing-mask guard).

Why

Upstream pipeline stages land tokens on cuda:0 by default. When the Mistral text encoder is hosted on CPU — for example, on model-parallel setups where a single 32 GB consumer card can't fit Mistral-3-Small-24B's ~48 GB bf16 footprint — the existing forward call dies with:

RuntimeError: Expected all tensors to be on the same device, but found at
least two devices, cuda:0 and cpu! (when checking argument for argument index
in method wrapper_CUDA__index_select)

.to(device) is a no-op when the tensor is already on that device, so the single-GPU happy-path is unaffected.

Where this came up

Building a dual-GPU model-parallel FLUX.2 LoRA training path for OneTrainer on 2× RTX 5090 — the Mistral TE has to stay on CPU because neither GPU can fit half the FLUX.2 transformer and the TE. With the TE on CPU, EncodeMistralText.get_item is the last device-mismatch bug before the forward succeeds. Full context: https://github.com/genno-whittlery/flux2-dual-gpu-lora (OneTrainer port section); paired with Nerogar/OneTrainer#1450.

Test plan

  • OneTrainer dual-GPU FLUX.2 LoRA run on 2× RTX 5090 with Mistral TE on CPU — forward succeeds, full 32-step training epoch completes
  • Single-GPU OneTrainer FLUX.2 run unchanged (.to(te_device) is no-op when devices match)

Upstream pipeline stages land tokens (and the attention mask) on
cuda:0 by default. When the Mistral text encoder is hosted on CPU --
for model-parallel setups where a single 32 GB consumer card can't fit
Mistral-24B's ~48 GB bf16 -- the existing forward call hits a
device-mismatch error:

  RuntimeError: Expected all tensors to be on the same device, but
  found at least two devices, cuda:0 and cpu! (when checking argument
  for argument index in method wrapper_CUDA__index_select)

Move tokens and tokens_attention_mask to self.text_encoder.device just
before the forward. .to(device) is a no-op when the tensors are already
on that device, so the existing single-GPU happy-path is unaffected.

Surfaced while bringing up dual-GPU FLUX.2 LoRA training in OneTrainer
on 2x RTX 5090, where the Mistral TE has to stay on CPU and the
transformer gets split across the two GPUs. Documented at
https://github.com/genno-whittlery/flux2-dual-gpu-lora (OneTrainer port
section).
genno-whittlery added a commit to genno-whittlery/flux2-dual-gpu-lora that referenced this pull request May 11, 2026
Filed PR Nerogar/OneTrainer#1450 was closed by a maintainer who flagged
that the body referenced issue #588 as "open since 2024-11", but the
issue had actually been closed 2025-09-24. Closure happened on that
factual error -- the dual-GPU implementation and its validation
evidence weren't reviewed.

The branch (genno-whittlery/OneTrainer:dual-gpu-flux2) is unchanged
and still the canonical place to consume the OneTrainer port. Same
for the mgds companion fix (genno-whittlery/mgds:te-device-fix) --
that PR (Nerogar/mgds#51) remains open separately.

Dropped the "closes #588" claim from the README description.
@dxqb

dxqb commented May 15, 2026

Copy link
Copy Markdown
Collaborator

for this error to be triggered, that would require that the tokenizer is on another device than the text encoder. I don't see a usecase for that - why isn't the tokenizer also on CPU?

is Mistral fast enough on CPU?

@dxqb dxqb added the question Further information is requested label May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

question Further information is requested

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants