Metal backend: Enable Voxtral Realtime #17536
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17536
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures, 14 Pending
As of commit ab8aaa2 with merge base 119a099.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
6ef541f to 5f3ae0b
    max_t_mel = 24000  # 3000 * 8
    sample_mel = torch.randn(
        1, model.config.num_mel_bins, max_t_mel, dtype=param_dtype
    )
    dynamic_shapes = {"mel": {2: Dim.AUTO}}
I wonder if we can just use this for xnnpack?
Pull request overview
This pull request adds Metal backend support for the Voxtral Realtime model, enabling it to run on Apple Silicon GPUs. The implementation automatically switches between custom SDPA operations (for XNNPACK) and standard PyTorch operations (for Metal/AOTI) based on the target backend, while maintaining support for both streaming and offline modes.
Changes:
- Implemented Metal-compatible attention mechanism using standard PyTorch SDPA and StaticKVCache with index_copy_ operations
- Added backend auto-detection and configuration in export and test scripts with support for vr-streaming and vr-offline modes
- Extended CI/CD workflows to test Voxtral Realtime on Metal backend with quantized-int4-metal configuration
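The first change above pairs a fixed-size KV cache updated through `index_copy_` with PyTorch's standard scaled dot-product attention. The PR's own `StaticKVCache` and `StandardSDPA` code is not shown in this conversation, so the following is only a minimal sketch of the general pattern; the class name `StaticKVCacheSketch`, the helper `attend`, and the tensor layout `(batch, heads, seq, head_dim)` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


class StaticKVCacheSketch(torch.nn.Module):
    """Hypothetical fixed-size KV cache; not the PR's StaticKVCache."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int):
        super().__init__()
        # Assumed layout: (batch=1, heads, max_seq_len, head_dim).
        shape = (1, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape))
        self.register_buffer("v_cache", torch.zeros(shape))

    def update(self, input_pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # Write the new keys/values in place along the sequence dim (dim=2)
        # at the slots named by input_pos (a 1-D int64 tensor).
        self.k_cache.index_copy_(2, input_pos, k)
        self.v_cache.index_copy_(2, input_pos, v)
        return self.k_cache, self.v_cache


def attend(cache, input_pos, q, k, v):
    """Update the cache, then run standard SDPA over the whole cache."""
    k_all, v_all = cache.update(input_pos, k, v)
    max_len = k_all.shape[2]
    # Boolean mask of shape (q_len, max_len): the query at position p may
    # attend to cache slots 0..p; unused slots beyond p are masked out.
    mask = torch.arange(max_len).unsqueeze(0) <= input_pos.unsqueeze(1)
    return F.scaled_dot_product_attention(q, k_all, v_all, attn_mask=mask)
```

Because the cache shape never changes and the update is a plain tensor op, a sketch like this avoids any backend-specific custom operator, which is the property the overview attributes to the Metal/AOTI path.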
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| examples/models/voxtral_realtime/model.py | Added use_standard_attention config flag, implemented StaticKVCache and StandardSDPA classes for Metal backend compatibility, updated LMAttention to switch between custom and standard attention based on backend |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Added Metal backend support in export functions with Dim.AUTO for dynamic shapes, implemented linear bias decomposition for Metal, added Metal partitioner configuration, and updated documentation |
| examples/models/voxtral_realtime/CMakePresets.json | Added Metal-specific CMake presets for building the Voxtral Realtime runner with Metal backend support on Darwin platforms |
| .github/workflows/metal.yml | Added Voxtral-Mini-4B-Realtime-2602 to Metal CI test matrix, excluded non-quantized variant due to size constraints |
| .ci/scripts/test_model_e2e.sh | Added mode parameter support for vr-streaming/vr-offline modes with auto-detection (XNNPACK defaults to streaming, others to offline) and validation logic |
| .ci/scripts/export_model_artifact.sh | Added mode parameter support with auto-detection and validation, configured preprocessor arguments based on streaming mode, added fpa4w quantization support for Metal |
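
The mode auto-detection described for the two CI scripts (XNNPACK defaults to streaming, other backends to offline, with validation of explicit values) can be sketched as follows. The real logic lives in shell scripts that are not reproduced here; this Python version, the function name `resolve_mode`, and the backend strings `"xnnpack"` and `"metal"` are assumptions for illustration only.

```python
VALID_MODES = ("vr-streaming", "vr-offline")


def resolve_mode(backend: str, mode: str = "") -> str:
    """Hypothetical helper: pick an export mode for a given backend."""
    if not mode:
        # Auto-detect: XNNPACK defaults to streaming, everything else to offline.
        if backend.lower().startswith("xnnpack"):
            return "vr-streaming"
        return "vr-offline"
    if mode not in VALID_MODES:
        # Validation: reject anything other than the two known modes.
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    return mode
```

An explicit mode always wins over the backend default in this sketch; whether the actual scripts also reject particular backend and mode combinations is not stated in the summary above.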
This pull request adds Metal backend support for the Voxtral Realtime model (offline mode only; streaming support is coming next).
Voxtral Realtime Metal backend support:
- Added Metal backend support to export_voxtral_rt.py, including custom decompositions and dynamic shape handling for Metal/AOTI compatibility. Also added validation for Metal-specific quantization (fpa4w).
- Updated the export_model_artifact.sh and test_model_e2e.sh scripts to support a new mode parameter, which the Voxtral Realtime model uses to select between streaming and offline export modes.

CI/CD and workflow updates:
- Added Voxtral-Mini-4B-Realtime-2602 to Metal workflow matrix.
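
The description mentions custom decompositions, and the file summary names a linear bias decomposition for Metal. The PR's actual decomposition is not shown here; the sketch below only illustrates the underlying identity, that a biased linear layer equals a bias-free matmul followed by a separate add. The function name `linear_without_fused_bias` is hypothetical.

```python
import torch
import torch.nn.functional as F


def linear_without_fused_bias(x, weight, bias=None):
    """Hypothetical decomposition: matmul first, then add the bias separately."""
    # weight has shape (out_features, in_features), as in torch.nn.Linear.
    out = torch.matmul(x, weight.t())
    if bias is not None:
        out = out + bias
    return out
```

For any input this should match `F.linear(x, weight, bias)` up to floating-point rounding, so swapping one form for the other changes only which operators the backend has to support.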