Add llama.cpp MTP speculative decoding #210
Merged
Pull request overview
Adds llama.cpp MTP (multi-token prediction) speculative decoding support behind a backend-neutral API, updates native runtime packaging/pinning to include required wrapper/core libs, and extends tests/docs to cover the new behavior across backends and platforms.
Changes:
- Introduces `SpeculativeDecodingConfig`/`SpeculativeDecodingStrategy` and wires speculative decoding enablement consistently across llama.cpp, LiteRT-LM, and WebGPU/web guards.
- Adds llama.cpp MTP runtime integration (FFI bindings + wrapper symbol resolution, rollback snapshot support via `ModelParams.speculativeRollbackTokenMax`, Android Vulkan guard).
- Updates native bundle/core library classification for the new `llama-common` library and refreshes pinned native artifacts + docs/changelog + benchmark controls.
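
For readers unfamiliar with why speculative decoding needs rollback, here is a minimal sketch of greedy draft verification. It is not the code in this PR: `DraftVerification`, `verifyDraft`, and the `rollbackTokenMax` argument are hypothetical names, and treating the rollback limit as a cap on how many drafted tokens may be speculated per step is an assumption made for illustration only.

```dart
/// Outcome of checking one batch of drafted tokens against the target model.
class DraftVerification {
  DraftVerification(this.emitted, this.rolledBack);

  /// Tokens to append to the output (accepted drafts plus one target token).
  final List<int> emitted;

  /// Number of drafted tokens whose state must be discarded.
  final int rolledBack;
}

/// Hypothetical helper, not part of llamadart.
///
/// `target[i]` is the token the target model picks after the prompt and
/// `draft[0..i-1]`, so `target` usually has one more entry than `draft`.
DraftVerification verifyDraft(
  List<int> draft,
  List<int> target, {
  required int rollbackTokenMax,
}) {
  if (rollbackTokenMax < 0) {
    throw ArgumentError.value(
        rollbackTokenMax, 'rollbackTokenMax', 'must be >= 0');
  }
  // Assumption: never speculate further than the state that can be rolled back.
  final usable =
      draft.length < rollbackTokenMax ? draft.length : rollbackTokenMax;
  final emitted = <int>[];
  var i = 0;
  // Accept drafted tokens while they match what the target model would pick.
  while (i < usable && i < target.length && draft[i] == target[i]) {
    emitted.add(draft[i]);
    i++;
  }
  // The target's own choice at the first unverified position is always valid.
  if (i < target.length) emitted.add(target[i]);
  return DraftVerification(emitted, usable - i);
}

void main() {
  final r = verifyDraft([1, 2, 3], [1, 2, 9, 4], rollbackTokenMax: 8);
  print('${r.emitted} rolledBack=${r.rolledBack}'); // [1, 2, 9] rolledBack=1
}
```

In this sketch a limit of 0 degenerates to ordinary one-token-per-step decoding, which is one way to see why a rollback budget has to be configured before drafting can help.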
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/unit/hook/native_bundle_config_test.dart | Adds coverage for llama-common core-library classification and selection behavior. |
| test/unit/hook/build_hook_linux_integration_test.dart | Ensures Linux build hook emits llama-common (and SONAME variant) alongside other core libs. |
| test/unit/hook/build_hook_android_integration_test.dart | Ensures Android build hook includes llama-common in emitted bundle libs. |
| test/unit/core/models/inference/model_params_test.dart | Adds tests for new speculativeRollbackTokenMax default/copyWith/validation. |
| test/unit/core/models/inference/generation_params_test.dart | Adds tests for resolved speculative decoding config behavior and clearing config via copyWith. |
| test/unit/backends/webgpu/webgpu_backend_test.dart | Verifies WebGPU backend rejects speculative decoding config. |
| test/unit/backends/llama_cpp/llama_cpp_service_test.dart | Adds unit tests for Android Vulkan MTP guard and updates speculative decoding rejection expectations. |
| test/unit/backends/litert_lm/litert_lm_service_test.dart | Verifies LiteRT-LM native maps MTP config to speculative decoding and rejects unsupported config knobs. |
| test/unit/backends/litert_lm/litert_lm_backend_web_test.dart | Verifies LiteRT-LM web rejects speculative decoding config. |
| test/integration/backends/llama_cpp/native_symbol_integration_test.dart | Adds integration checks for MTP symbol declarations/resolution and wrapper availability. |
| README.md | Documents MTP usage, rollback snapshot requirement, Android Vulkan guard, and updates pinned native tag references. |
| lib/src/hook/native_bundle_config.dart | Marks llama-common as a core runtime library for bundling/selection. |
| lib/src/core/models/inference/model_params.dart | Adds speculativeRollbackTokenMax (n_rs_seq) and validation/copyWith wiring. |
| lib/src/core/models/inference/generation_params.dart | Adds backend-neutral speculative decoding API (SpeculativeDecodingConfig + resolution helpers). |
| lib/src/backends/webgpu/webgpu_backend.dart | Switches WebGPU speculative check to unified isSpeculativeDecodingEnabled. |
| lib/src/backends/llama_cpp/llama_cpp_service.dart | Implements llama.cpp MTP speculative decoding path, symbol resolution, rollback logic, Android Vulkan guard, and perf reporting tweaks. |
| lib/src/backends/llama_cpp/bindings.dart | Adds generated FFI bindings for MTP wrapper APIs and opaque llama_dart_mtp type. |
| lib/src/backends/litert_lm/litert_lm_service.dart | Treats config-based speculative decoding as enabling speculative mode; rejects unsupported config knobs explicitly. |
| lib/src/backends/litert_lm/litert_lm_backend_web.dart | Rejects speculativeDecodingConfig on web LiteRT-LM runtime. |
| hook/build.dart | Pins native llama.cpp runtime tag to the MTP-capable build. |
| example/chat_app/pubspec.lock | Bumps local package version to match updated library version. |
| example/chat_app/lib/litert_lm_benchmark_app.dart | Adds benchmark controls/metrics for speculative decoding and token-based output accounting. |
| darwin/llamadart/Package.swift | Updates pinned native artifact tag and checksum for Apple SPM distribution. |
| CHANGELOG.md | Documents new speculative decoding API, MTP support, runtime pin update, rollback param, and Android Vulkan guard. |
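
The per-backend behavior in the table (llama.cpp and native LiteRT-LM accept MTP, while LiteRT-LM web and WebGPU reject it) can be pictured as a single capability check. The sketch below is illustrative only: the `Backend` enum, `mtpCapable` set, and `ensureSpeculativeSupported` function are hypothetical and do not reflect how the package actually structures its guards.

```dart
/// Hypothetical backend identifiers, named after the rows in the table above.
enum Backend { llamaCpp, liteRtLmNative, liteRtLmWeb, webGpu }

/// Backends that, per the table, accept an MTP speculative decoding request.
const Set<Backend> mtpCapable = {Backend.llamaCpp, Backend.liteRtLmNative};

/// Throws if MTP was requested on a backend that cannot honor it.
/// Assumption: rejection is surfaced as an error rather than a silent fallback.
void ensureSpeculativeSupported(Backend backend, {required bool mtpRequested}) {
  if (mtpRequested && !mtpCapable.contains(backend)) {
    throw UnsupportedError(
        '${backend.name} does not support speculative decoding config');
  }
}

void main() {
  ensureSpeculativeSupported(Backend.llamaCpp, mtpRequested: true); // passes
  try {
    ensureSpeculativeSupported(Backend.webGpu, mtpRequested: true);
  } on UnsupportedError catch (e) {
    print(e.message);
  }
}
```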
Summary
Adds llama.cpp MTP speculative decoding support behind the shared speculative decoding API, including native wrapper bindings, rollback safety, runtime feature guards, and benchmark controls.
Changes
- Adds `SpeculativeDecodingConfig.mtp(...)` and llama.cpp runtime wiring for MTP draft contexts.
- Updates the pinned `llamadart-native` build and includes the new native bundle core library entry.

Validation
- `dart analyze`
- `dart test -p vm -j 1 --exclude-tags local-only`
- `dart test -p chrome --exclude-tags local-only`
- `flutter test test/litert_lm_benchmark_app_test.dart` from `example/chat_app`
- `./tool/docs/validate_links.sh`
- `git diff --check`

Notes
Real-model benchmark results depend on the model and backend. In local runs, Qwen3.6 MoE benefited from MTP on macOS Metal, while the smaller Qwen3.5 4B was slower with MTP under the current settings. Android Vulkan MTP remains guarded because it previously hit an upstream Vulkan device-lost crash; Android CPU MTP showed a speedup in earlier real-device runs.
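
As a rough picture of the Android Vulkan guard described above, the check can be a pure function of platform, accelerator, and whether MTP was requested. This is a sketch under assumptions: `Accelerator` and `mtpGuardReason` are hypothetical names, and whether the real guard throws or falls back to non-speculative decoding is not stated here, so the function only returns a reason and leaves that decision to the caller.

```dart
/// Hypothetical accelerator kinds, used only for this illustration.
enum Accelerator { cpu, metal, vulkan }

/// Returns a human-readable reason when MTP should not run, or null when
/// it may proceed. Only the Android + Vulkan combination is blocked.
String? mtpGuardReason({
  required bool isAndroid,
  required Accelerator accelerator,
  required bool mtpRequested,
}) {
  if (!mtpRequested) return null;
  if (isAndroid && accelerator == Accelerator.vulkan) {
    return 'MTP is guarded on Android Vulkan (upstream device-lost crash)';
  }
  return null;
}

void main() {
  final reason = mtpGuardReason(
    isAndroid: true,
    accelerator: Accelerator.vulkan,
    mtpRequested: true,
  );
  print(reason ?? 'MTP allowed');
}
```

Keeping the guard free of platform lookups makes it straightforward to unit-test each combination without an Android device.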