Add llama.cpp MTP speculative decoding #210

Merged
leehack merged 1 commit into main from add-llama-cpp-mtp on Jun 8, 2026
leehack commented Jun 8, 2026

Owner

Summary

Adds llama.cpp MTP speculative decoding support behind the shared speculative decoding API, including native wrapper bindings, rollback safety, runtime feature guards, and benchmark controls.

Changes

  • Adds SpeculativeDecodingConfig.mtp(...) and llama.cpp runtime wiring for MTP draft contexts.
  • Pins the native runtime to the MTP-capable llamadart-native build and includes the new native bundle core library entry.
  • Guards unsupported/runtime-risk paths, including Android Vulkan MTP by default.
  • Updates LiteRT-LM/WebGPU behavior so the high-level speculative decoding API remains runtime-neutral.
  • Extends README, changelog, and public API docs for MTP usage, benchmark caveats, and platform notes.
  • Adds/updates unit and integration coverage, including native MTP wrapper symbol resolution.
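The draft, verify, and rollback cycle that this change wires up can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the llamadart or llama.cpp implementation: `speculative_step`, `draft_next`, `target_next`, and `max_draft_tokens` are hypothetical names, decoding is assumed to be greedy, and the target model is called once per token here, whereas a real runtime would typically verify the whole draft in one batched pass.

```python
from typing import Callable, List

# Hypothetical type: maps a token context to the next token id (greedy decoding).
NextToken = Callable[[List[int]], int]


def speculative_step(
    context: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    max_draft_tokens: int,
) -> List[int]:
    """Run one draft-then-verify round and return the tokens to commit."""
    # Draft phase: the cheap predictor proposes up to max_draft_tokens tokens.
    drafted: List[int] = []
    for _ in range(max_draft_tokens):
        drafted.append(draft_next(context + drafted))

    # Verify phase: the target model checks each drafted token in order.
    committed: List[int] = []
    for token in drafted:
        expected = target_next(context + committed)
        if token != expected:
            # Mismatch: discard (roll back) the rest of the draft and
            # commit the target model's own token instead.
            committed.append(expected)
            return committed
        committed.append(token)

    # Every drafted token was accepted, so the target adds one bonus token.
    committed.append(target_next(context + committed))
    return committed
```

Each round commits at least one token, so output matches plain greedy decoding with the target model; the draft only changes how many tokens are committed per round. The rollback branch is the reason a runtime needs to be able to restore state for rejected draft tokens.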

Validation

  • dart analyze
  • dart test -p vm -j 1 --exclude-tags local-only
  • dart test -p chrome --exclude-tags local-only
  • flutter test test/litert_lm_benchmark_app_test.dart from example/chat_app
  • ./tool/docs/validate_links.sh
  • git diff --check

Notes

Real-model benchmark results are model/backend dependent. Local runs showed Qwen3.6 MoE benefits from MTP on macOS Metal, while smaller Qwen3.5 4B MTP was slower with current settings. Android Vulkan MTP remains guarded because it previously hit an upstream Vulkan device-lost crash; Android CPU MTP had prior real-device speedup evidence.
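One common way to reason about why speculative decoding helps some model/backend pairs and hurts others is a simplified cost model. The sketch below is illustrative only and uses no measurements from this PR: it assumes each drafted token is accepted independently with a fixed probability, that a verify pass costs the same as one normal decode step, and that each drafted token costs a fixed fraction of a target step. The function and parameter names are hypothetical.

```python
def expected_tokens_per_round(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens committed per verify round under independent acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target's bonus token.
        return float(draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k.
    return (1.0 - acceptance_rate ** (draft_tokens + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(
    acceptance_rate: float, draft_tokens: int, relative_draft_cost: float
) -> float:
    """Tokens per unit cost relative to plain decoding (1.0 means break-even)."""
    # One target verify pass plus draft_tokens draft passes per round.
    round_cost = 1.0 + draft_tokens * relative_draft_cost
    return expected_tokens_per_round(acceptance_rate, draft_tokens) / round_cost


if __name__ == "__main__":
    # Made-up inputs, chosen only to show both sides of break-even.
    print(estimated_speedup(0.8, 3, 0.1))  # above 1.0: drafting pays off
    print(estimated_speedup(0.2, 3, 0.3))  # below 1.0: drafting costs more than it saves
```

Under this model, a low acceptance rate or a relatively expensive draft pass pushes the estimate below 1.0, which is one possible reason a configuration can end up slower with speculative decoding enabled. Real runs should still be benchmarked per model and backend.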

github-actions Bot commented Jun 8, 2026

Contributor

Chat app preview removed for leehack/llamadart-chat-pr-210.

leehack force-pushed the add-llama-cpp-mtp branch 8 times, most recently from ab0070b to 595e6f7 on June 8, 2026 02:02
leehack marked this pull request as ready for review on June 8, 2026 03:07
Copilot AI review requested due to automatic review settings on June 8, 2026 03:07

Copilot AI left a comment

Pull request overview

Adds llama.cpp MTP (multi-token prediction) speculative decoding support behind a backend-neutral API, updates native runtime packaging/pinning to include required wrapper/core libs, and extends tests/docs to cover the new behavior across backends and platforms.

Changes:

  • Introduces SpeculativeDecodingConfig / SpeculativeDecodingStrategy and wires speculative decoding enablement consistently across llama.cpp, LiteRT-LM, and WebGPU/web guards.
  • Adds llama.cpp MTP runtime integration (FFI bindings + wrapper symbol resolution, rollback snapshot support via ModelParams.speculativeRollbackTokenMax, Android Vulkan guard).
  • Updates native bundle/core library classification for the new llama-common library and refreshes pinned native artifacts + docs/changelog + benchmark controls.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/unit/hook/native_bundle_config_test.dart | Adds coverage for llama-common core-library classification and selection behavior. |
| test/unit/hook/build_hook_linux_integration_test.dart | Ensures Linux build hook emits llama-common (and SONAME variant) alongside other core libs. |
| test/unit/hook/build_hook_android_integration_test.dart | Ensures Android build hook includes llama-common in emitted bundle libs. |
| test/unit/core/models/inference/model_params_test.dart | Adds tests for new speculativeRollbackTokenMax default/copyWith/validation. |
| test/unit/core/models/inference/generation_params_test.dart | Adds tests for resolved speculative decoding config behavior and clearing config via copyWith. |
| test/unit/backends/webgpu/webgpu_backend_test.dart | Verifies WebGPU backend rejects speculative decoding config. |
| test/unit/backends/llama_cpp/llama_cpp_service_test.dart | Adds unit tests for Android Vulkan MTP guard and updates speculative decoding rejection expectations. |
| test/unit/backends/litert_lm/litert_lm_service_test.dart | Verifies LiteRT-LM native maps MTP config to speculative decoding and rejects unsupported config knobs. |
| test/unit/backends/litert_lm/litert_lm_backend_web_test.dart | Verifies LiteRT-LM web rejects speculative decoding config. |
| test/integration/backends/llama_cpp/native_symbol_integration_test.dart | Adds integration checks for MTP symbol declarations/resolution and wrapper availability. |
| README.md | Documents MTP usage, rollback snapshot requirement, Android Vulkan guard, and updates pinned native tag references. |
| lib/src/hook/native_bundle_config.dart | Marks llama-common as a core runtime library for bundling/selection. |
| lib/src/core/models/inference/model_params.dart | Adds speculativeRollbackTokenMax (n_rs_seq) and validation/copyWith wiring. |
| lib/src/core/models/inference/generation_params.dart | Adds backend-neutral speculative decoding API (SpeculativeDecodingConfig + resolution helpers). |
| lib/src/backends/webgpu/webgpu_backend.dart | Switches WebGPU speculative check to unified isSpeculativeDecodingEnabled. |
| lib/src/backends/llama_cpp/llama_cpp_service.dart | Implements llama.cpp MTP speculative decoding path, symbol resolution, rollback logic, Android Vulkan guard, and perf reporting tweaks. |
| lib/src/backends/llama_cpp/bindings.dart | Adds generated FFI bindings for MTP wrapper APIs and opaque llama_dart_mtp type. |
| lib/src/backends/litert_lm/litert_lm_service.dart | Treats config-based speculative decoding as enabling speculative mode; rejects unsupported config knobs explicitly. |
| lib/src/backends/litert_lm/litert_lm_backend_web.dart | Rejects speculativeDecodingConfig on web LiteRT-LM runtime. |
| hook/build.dart | Pins native llama.cpp runtime tag to the MTP-capable build. |
| example/chat_app/pubspec.lock | Bumps local package version to match updated library version. |
| example/chat_app/lib/litert_lm_benchmark_app.dart | Adds benchmark controls/metrics for speculative decoding and token-based output accounting. |
| darwin/llamadart/Package.swift | Updates pinned native artifact tag and checksum for Apple SPM distribution. |
| CHANGELOG.md | Documents new speculative decoding API, MTP support, runtime pin update, rollback param, and Android Vulkan guard. |

Comment thread lib/src/backends/llama_cpp/llama_cpp_service.dart Outdated
Comment thread test/integration/backends/llama_cpp/native_symbol_integration_test.dart Outdated
leehack force-pushed the add-llama-cpp-mtp branch from 595e6f7 to 10a9bcf on June 8, 2026 03:32
leehack merged commit fd23afa into main on Jun 8, 2026
11 checks passed
leehack deleted the add-llama-cpp-mtp branch on June 8, 2026 11:52