transformers for quant and full with unified example for smol lm3 #2
DrJesseGlass wants to merge 9 commits into
💡 Codex Review
Here are some automated review suggestions for this pull request.
let scale = 1.0 / (self.head_dim as f64).sqrt();
// Make q contiguous before matmul to avoid stride mismatch
let q = q.contiguous()?;
let attn_weights = (q.matmul(&k.t()?)? * scale)?;
Use proper transpose for attention matmul
The quantized SmolLM3 attention builds Q/K/V as 4D tensors (B, num_heads, seq_len, head_dim) but computes scores with q.matmul(&k.t()?). Tensor::t() only handles 2D tensors, so with 4D k this call will fail at runtime and the quantized model cannot run. The full-precision path uses k.transpose(2, 3)? instead, which is the needed permutation to produce (B, H, L, L) attention scores.
// From candle's tensor.rs
pub fn t(&self) -> Result<Tensor> {
    let rank = self.rank();
    if rank < 2 { /* error */ }
    else { self.transpose(rank - 2, rank - 1) } // Same as transpose(2, 3) for 4D!
}
So you are wrong. LOL
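To make the point above concrete, here is a shape-only sketch in plain Rust. It does not use candle; `transpose_dims`, `t_dims`, and `matmul_dims` are hypothetical helpers that only track dimensions, and the tensor sizes are arbitrary. It assumes `t()` swaps the last two dims for any rank >= 2, as in the snippet quoted above.

```rust
// Shape-only sketch: these helpers are hypothetical and are NOT candle's API.

/// Swap two dimensions of a shape, mirroring what a transpose does to the dims.
fn transpose_dims(shape: &[usize], d1: usize, d2: usize) -> Result<Vec<usize>, String> {
    if d1 >= shape.len() || d2 >= shape.len() {
        return Err(format!("dims ({}, {}) out of range for rank {}", d1, d2, shape.len()));
    }
    let mut out = shape.to_vec();
    out.swap(d1, d2);
    Ok(out)
}

/// Rank-general `t()`: swap the last two dims, error out for rank < 2.
fn t_dims(shape: &[usize]) -> Result<Vec<usize>, String> {
    let rank = shape.len();
    if rank < 2 {
        return Err(format!("expected rank >= 2, got {}", rank));
    }
    transpose_dims(shape, rank - 2, rank - 1)
}

/// Result shape of a batched matmul: equal rank, equal leading dims, matching inner dims.
fn matmul_dims(lhs: &[usize], rhs: &[usize]) -> Result<Vec<usize>, String> {
    let rank = lhs.len();
    if rank < 2 || rhs.len() != rank {
        return Err("rank mismatch".to_string());
    }
    if lhs[..rank - 2] != rhs[..rank - 2] {
        return Err("batch dims mismatch".to_string());
    }
    if lhs[rank - 1] != rhs[rank - 2] {
        return Err("inner dims mismatch".to_string());
    }
    let mut out = lhs[..rank - 1].to_vec();
    out.push(rhs[rank - 1]);
    Ok(out)
}

fn main() {
    // (B, num_heads, seq_len, head_dim); the sizes are arbitrary.
    let q = [2, 4, 8, 16];
    let k = [2, 4, 8, 16];

    // For a 4D shape, swapping the last two dims is the same as transpose(2, 3).
    let kt = t_dims(&k).unwrap();
    assert_eq!(kt, transpose_dims(&k, 2, 3).unwrap());

    // (B, H, L, D) x (B, H, D, L) gives (B, H, L, L) attention scores.
    let scores = matmul_dims(&q, &kt).unwrap();
    println!("k.t() dims: {:?}, scores dims: {:?}", kt, scores);
}
```

This only checks dimensions, not values or strides, so it says nothing about whether `contiguous()` is still needed before the matmul.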
f812d1f to e9cf0e3