Add llama.cpp MTP speculative decoding by leehack · Pull Request #210 · leehack/llamadart

leehack · 2026-06-08T00:50:42Z

Summary

Adds llama.cpp MTP speculative decoding support behind the shared speculative decoding API, including native wrapper bindings, rollback safety, runtime feature guards, and benchmark controls.

Changes

Adds SpeculativeDecodingConfig.mtp(...) and llama.cpp runtime wiring for MTP draft contexts.
Pins the native runtime to the MTP-capable llamadart-native build and includes the new native bundle core library entry.
Guards unsupported or runtime-risky paths; Android Vulkan MTP is guarded by default.
Updates LiteRT-LM/WebGPU behavior so the high-level speculative decoding API remains runtime-neutral.
Extends README, changelog, and public API docs for MTP usage, benchmark caveats, and platform notes.
Adds/updates unit and integration coverage, including native MTP wrapper symbol resolution.
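
For readers skimming the list above, the guard behaviour can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `SketchSpeculativeConfig`, `Backend`, `GpuApi`, `checkSpeculativeSupport`, and the `allowAndroidVulkan` opt-in are all hypothetical names invented for illustration, and the real `SpeculativeDecodingConfig.mtp(...)` parameters are not shown on this page.

```dart
// Illustrative only: every name here is hypothetical, not the llamadart API.
enum Backend { llamaCpp, liteRtLmNative, liteRtLmWeb, webGpu }

enum GpuApi { none, metal, vulkan }

class SketchSpeculativeConfig {
  const SketchSpeculativeConfig.mtp({this.allowAndroidVulkan = false});

  /// Assumed opt-in that lifts the default Android Vulkan guard.
  final bool allowAndroidVulkan;
}

/// Throws [UnsupportedError] when the requested combination is guarded.
void checkSpeculativeSupport({
  required Backend backend,
  required bool isAndroid,
  required GpuApi gpuApi,
  SketchSpeculativeConfig? config,
}) {
  if (config == null) return; // Speculative decoding was not requested.

  switch (backend) {
    case Backend.webGpu:
    case Backend.liteRtLmWeb:
      // Web runtimes reject the config outright.
      throw UnsupportedError('Speculative decoding is rejected on $backend');
    case Backend.liteRtLmNative:
      return; // Accepted; the runtime maps it to its own speculative mode.
    case Backend.llamaCpp:
      if (isAndroid && gpuApi == GpuApi.vulkan && !config.allowAndroidVulkan) {
        throw UnsupportedError('MTP on Android Vulkan is guarded by default');
      }
      return;
  }
}

void main() {
  const config = SketchSpeculativeConfig.mtp();

  // macOS Metal with llama.cpp passes the guard.
  checkSpeculativeSupport(
    backend: Backend.llamaCpp,
    isAndroid: false,
    gpuApi: GpuApi.metal,
    config: config,
  );

  // Android Vulkan is rejected unless the (assumed) opt-in is set.
  try {
    checkSpeculativeSupport(
      backend: Backend.llamaCpp,
      isAndroid: true,
      gpuApi: GpuApi.vulkan,
      config: config,
    );
  } on UnsupportedError catch (e) {
    print(e.message);
  }
}
```

The point of the shape is that callers pass one runtime-neutral config and each backend decides whether to accept it, so application code does not branch on the runtime.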

Validation

dart analyze
dart test -p vm -j 1 --exclude-tags local-only
dart test -p chrome --exclude-tags local-only
flutter test test/litert_lm_benchmark_app_test.dart (run from example/chat_app)
./tool/docs/validate_links.sh
git diff --check

Notes

Real-model benchmark results depend on the model and backend. In local runs, Qwen3.6 MoE benefited from MTP on macOS Metal, while the smaller Qwen3.5 4B model was slower with MTP under the current settings. Android Vulkan MTP remains guarded because it previously hit an upstream Vulkan device-lost crash; Android CPU MTP previously showed a speedup on real devices.

github-actions · 2026-06-08T00:52:10Z

Chat app preview removed for leehack/llamadart-chat-pr-210.

Copilot

Pull request overview

Adds llama.cpp MTP (multi-token prediction) speculative decoding support behind a backend-neutral API, updates native runtime packaging/pinning to include required wrapper/core libs, and extends tests/docs to cover the new behavior across backends and platforms.

Changes:

Introduces SpeculativeDecodingConfig / SpeculativeDecodingStrategy and wires speculative decoding enablement consistently across llama.cpp, LiteRT-LM, and WebGPU/web guards.
Adds llama.cpp MTP runtime integration (FFI bindings + wrapper symbol resolution, rollback snapshot support via ModelParams.speculativeRollbackTokenMax, Android Vulkan guard).
Updates native bundle/core library classification for the new llama-common library and refreshes pinned native artifacts + docs/changelog + benchmark controls.
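
To make the rollback requirement concrete, here is a minimal sketch of greedy draft verification, the general technique behind speculative decoding. It is not the llama.cpp or llamadart implementation: `VerifyResult` and `verifyDraft` are hypothetical names, the draft tokens are plain integers rather than MTP outputs, and how `ModelParams.speculativeRollbackTokenMax` is enforced is not modelled.

```dart
/// Outcome of checking one batch of draft tokens against the target model.
class VerifyResult {
  const VerifyResult(this.accepted, this.rolledBack);

  /// Tokens that end up in the output for this step.
  final List<int> accepted;

  /// Number of draft tokens that were discarded and must be undone.
  final int rolledBack;
}

/// Greedy verification.
///
/// [target] holds the target model's own choice at each draft position plus
/// one extra position, so it must be exactly one entry longer than [draft].
VerifyResult verifyDraft(List<int> draft, List<int> target) {
  if (target.length != draft.length + 1) {
    throw ArgumentError('target must have one more entry than draft');
  }

  // Count how many leading draft tokens agree with the target model.
  var matched = 0;
  while (matched < draft.length && draft[matched] == target[matched]) {
    matched++;
  }

  // Keep the agreeing prefix, then take one token from the target:
  // a correction at the first mismatch, or a bonus token if all matched.
  final accepted = [...draft.take(matched), target[matched]];
  return VerifyResult(accepted, draft.length - matched);
}

void main() {
  // The third draft token disagrees with the target, so it is rolled back.
  final result = verifyDraft([5, 7, 9], [5, 7, 2, 4]);
  print(result.accepted); // [5, 7, 2]
  print(result.rolledBack); // 1
}
```

Every discarded draft token has already touched the model's sequence state, which is why a runtime needs some way to snapshot and restore that state before it can accept only part of a draft.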

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/unit/hook/native_bundle_config_test.dart | Adds coverage for `llama-common` core-library classification and selection behavior. |
| test/unit/hook/build_hook_linux_integration_test.dart | Ensures Linux build hook emits `llama-common` (and SONAME variant) alongside other core libs. |
| test/unit/hook/build_hook_android_integration_test.dart | Ensures Android build hook includes `llama-common` in emitted bundle libs. |
| test/unit/core/models/inference/model_params_test.dart | Adds tests for new `speculativeRollbackTokenMax` default/copyWith/validation. |
| test/unit/core/models/inference/generation_params_test.dart | Adds tests for resolved speculative decoding config behavior and clearing config via copyWith. |
| test/unit/backends/webgpu/webgpu_backend_test.dart | Verifies WebGPU backend rejects speculative decoding config. |
| test/unit/backends/llama_cpp/llama_cpp_service_test.dart | Adds unit tests for Android Vulkan MTP guard and updates speculative decoding rejection expectations. |
| test/unit/backends/litert_lm/litert_lm_service_test.dart | Verifies LiteRT-LM native maps MTP config to speculative decoding and rejects unsupported config knobs. |
| test/unit/backends/litert_lm/litert_lm_backend_web_test.dart | Verifies LiteRT-LM web rejects speculative decoding config. |
| test/integration/backends/llama_cpp/native_symbol_integration_test.dart | Adds integration checks for MTP symbol declarations/resolution and wrapper availability. |
| README.md | Documents MTP usage, rollback snapshot requirement, Android Vulkan guard, and updates pinned native tag references. |
| lib/src/hook/native_bundle_config.dart | Marks `llama-common` as a core runtime library for bundling/selection. |
| lib/src/core/models/inference/model_params.dart | Adds `speculativeRollbackTokenMax` (`n_rs_seq`) and validation/copyWith wiring. |
| lib/src/core/models/inference/generation_params.dart | Adds backend-neutral speculative decoding API (`SpeculativeDecodingConfig` + resolution helpers). |
| lib/src/backends/webgpu/webgpu_backend.dart | Switches WebGPU speculative check to unified `isSpeculativeDecodingEnabled`. |
| lib/src/backends/llama_cpp/llama_cpp_service.dart | Implements llama.cpp MTP speculative decoding path, symbol resolution, rollback logic, Android Vulkan guard, and perf reporting tweaks. |
| lib/src/backends/llama_cpp/bindings.dart | Adds generated FFI bindings for MTP wrapper APIs and opaque `llama_dart_mtp` type. |
| lib/src/backends/litert_lm/litert_lm_service.dart | Treats config-based speculative decoding as enabling speculative mode; rejects unsupported config knobs explicitly. |
| lib/src/backends/litert_lm/litert_lm_backend_web.dart | Rejects `speculativeDecodingConfig` on web LiteRT-LM runtime. |
| hook/build.dart | Pins native llama.cpp runtime tag to the MTP-capable build. |
| example/chat_app/pubspec.lock | Bumps local package version to match updated library version. |
| example/chat_app/lib/litert_lm_benchmark_app.dart | Adds benchmark controls/metrics for speculative decoding and token-based output accounting. |
| darwin/llamadart/Package.swift | Updates pinned native artifact tag and checksum for Apple SPM distribution. |
| CHANGELOG.md | Documents new speculative decoding API, MTP support, runtime pin update, rollback param, and Android Vulkan guard. |
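
The `llama-common` classification rows above, including the Linux SONAME variant, could be approximated along these lines. This is a sketch under stated assumptions, not the contents of `native_bundle_config.dart`: `coreLibraryNames` and `isCoreLibrary` are hypothetical, and the `llama` and `ggml` entries are assumed placeholders for "other core libs"; only `llama-common` comes from this PR.

```dart
/// Assumed set of core library base names. Only `llama-common` is taken
/// from the PR; the other entries are placeholders for illustration.
const coreLibraryNames = {'llama', 'ggml', 'llama-common'};

/// Matches `lib<name>.so`, versioned SONAME forms such as `lib<name>.so.0`,
/// and `lib<name>.dylib`, capturing `<name>`.
final _libPattern = RegExp(r'^lib(.+?)\.(?:so(?:\.\d+)*|dylib)$');

/// Returns true when [fileName] is a shared library whose base name is in
/// [coreLibraryNames].
bool isCoreLibrary(String fileName) {
  final match = _libPattern.firstMatch(fileName);
  if (match == null) return false;
  return coreLibraryNames.contains(match.group(1));
}

void main() {
  for (final name in [
    'libllama-common.so',
    'libllama-common.so.0',
    'libunrelated.so',
  ]) {
    print('$name core=${isCoreLibrary(name)}');
  }
}
```

Stripping the version suffix before the lookup means a versioned SONAME file and its unversioned name resolve to the same classification.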

leehack force-pushed the add-llama-cpp-mtp branch 8 times, most recently from ab0070b to 595e6f7 · June 8, 2026 02:02

leehack marked this pull request as ready for review June 8, 2026 03:07

Copilot AI review requested due to automatic review settings June 8, 2026 03:07

Copilot started reviewing on behalf of leehack June 8, 2026 03:07

Copilot AI reviewed Jun 8, 2026

Comment thread lib/src/backends/llama_cpp/llama_cpp_service.dart Outdated

Comment thread test/integration/backends/llama_cpp/native_symbol_integration_test.dart Outdated

Add llama.cpp MTP speculative decoding

10a9bcf

leehack force-pushed the add-llama-cpp-mtp branch from 595e6f7 to 10a9bcf · June 8, 2026 03:32

leehack merged commit fd23afa into main Jun 8, 2026
11 checks passed

leehack deleted the add-llama-cpp-mtp branch June 8, 2026 11:52

leehack merged 1 commit into main from add-llama-cpp-mtp
