
Qwen3 embed#17

Open
DrJesseGlass wants to merge 7 commits into downstream from qwen3-embed


@DrJesseGlass

Owner

No description provided.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 647657e777


// Pad or truncate embeds to seq_len
let embed_len = embeds.dim(0)?;
if embed_len >= seq_len {
    return embeds.narrow(0, 0, seq_len)?.unsqueeze(0);


[P1] Place modality embeddings at masked token indices

`broadcast_embed_to_mask` currently pads or truncates `embeds` starting at position 0 and returns the result as a full-length sequence, but `forward_multimodal` later applies it through a sparse token mask (`mask_expanded * image_embeds_broadcast`). When the image/audio placeholder tokens are not a prefix starting at index 0 (e.g., BOS or text tokens precede the modality tokens), the masked positions pick up zeros or the wrong embedding rows, so the modality tokens are replaced incorrectly.
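
One way to picture the requested behavior is a scatter: the k-th encoder row goes to the k-th masked position, wherever that position falls in the sequence. The sketch below uses plain `Vec`s rather than candle tensors so that it stays self-contained; `scatter_at_mask` and its parameter names are hypothetical and do not appear in the PR.

```rust
/// Replace rows of `hidden` at masked positions with successive rows of
/// `modality`. Hypothetical helper for illustration only; it is not part of
/// the PR or of candle.
fn scatter_at_mask(
    hidden: &[Vec<f32>],
    mask: &[bool],
    modality: &[Vec<f32>],
) -> Result<Vec<Vec<f32>>, String> {
    if hidden.len() != mask.len() {
        return Err(format!(
            "hidden has {} rows but mask has {}",
            hidden.len(),
            mask.len()
        ));
    }
    // The number of placeholder tokens must match the number of encoder rows.
    let n_masked = mask.iter().filter(|&&m| m).count();
    if n_masked != modality.len() {
        return Err(format!(
            "mask selects {} positions but {} modality rows were given",
            n_masked,
            modality.len()
        ));
    }
    // Walk the sequence once, consuming one modality row per masked position.
    let mut next = modality.iter();
    let out: Vec<Vec<f32>> = hidden
        .iter()
        .zip(mask)
        .map(|(row, &is_masked)| {
            if is_masked {
                next.next().expect("count checked above").clone()
            } else {
                row.clone()
            }
        })
        .collect();
    Ok(out)
}
```

With a mask such as `[false, true, true, false]`, the two encoder rows land at indices 1 and 2, and the leading token keeps its text embedding. A tensor version would presumably compute the same index mapping and apply it with a gather or scatter op instead of padding from position 0.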


return padded.unsqueeze(0);
}

Ok(zeros)


[P1] Support batched multimodal embedding broadcast

For `b_sz > 1`, `broadcast_embed_to_mask` unconditionally returns an all-zero tensor, so every masked image/audio token in a batched request is zeroed instead of receiving encoder features. This effectively drops all multimodal information whenever the batch size is greater than 1.
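
A batched variant could follow the same scatter idea, consuming encoder rows across the whole batch instead of returning zeros. The sketch below is only an illustration on plain `Vec`s: `scatter_batched` is a hypothetical name, and it assumes the encoder rows for all batch elements are concatenated in batch order, which the PR does not confirm.

```rust
/// Batched scatter: `hidden[b][t]` is replaced by the next row of `modality`
/// whenever `mask[b][t]` is true. Hypothetical helper for illustration only.
/// Assumes `modality` holds the rows for all batch elements, in batch order.
fn scatter_batched(
    hidden: &[Vec<Vec<f32>>],
    mask: &[Vec<bool>],
    modality: &[Vec<f32>],
) -> Result<Vec<Vec<Vec<f32>>>, String> {
    if hidden.len() != mask.len() {
        return Err(format!(
            "batch size mismatch: hidden {} vs mask {}",
            hidden.len(),
            mask.len()
        ));
    }
    // One iterator shared by every batch element, so rows are never reused.
    let mut next = modality.iter();
    let mut out = Vec::with_capacity(hidden.len());
    for (b, (rows, row_mask)) in hidden.iter().zip(mask).enumerate() {
        if rows.len() != row_mask.len() {
            return Err(format!("sequence length mismatch in batch element {}", b));
        }
        let mut seq = Vec::with_capacity(rows.len());
        for (row, &is_masked) in rows.iter().zip(row_mask) {
            if is_masked {
                let feat = next
                    .next()
                    .ok_or_else(|| format!("ran out of modality rows in batch element {}", b))?;
                seq.push(feat.clone());
            } else {
                seq.push(row.clone());
            }
        }
        out.push(seq);
    }
    // Leftover rows indicate that the mask and the encoder output disagree.
    if next.next().is_some() {
        return Err("more modality rows than masked positions".to_string());
    }
    Ok(out)
}
```

Returning an error on a count mismatch, rather than silently padding with zeros, is a design choice of this sketch; it surfaces the failure described in the comment instead of hiding it.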



2 participants