leehack · leehack · Jun 8, 2026 · Jun 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -45,6 +45,23 @@
     default; iOS, macOS, Linux, and Windows now default to `llama_cpp` only.
   * Added native release pin automation so the maintainer sync workflow updates
     Apple SPM checksums from published native release asset digests.
+  * Added `SpeculativeDecodingConfig` as a backend-neutral generation option for
+    selecting speculative decoding strategies such as MTP while keeping the
+    existing `GenerationParams.speculativeDecoding` flag as a compatibility
+    switch.
+  * Added llama.cpp native MTP speculative decoding for compatible GGUF models
+    through `SpeculativeDecodingConfig.mtp(...)`, defaulting to a conservative
+    one-token draft depth unless callers tune `draftTokenMax`.
+  * Updated the default llama.cpp native runtime pin to
+    `leehack/llamadart-native@b9547`, including the MTP wrapper exports and
+    `llama-common` runtime packaging.
+  * Added `ModelParams.speculativeRollbackTokenMax` so llama.cpp contexts can
+    reserve recurrent-state rollback snapshots required by Qwen3.5 MTP-style
+    models.
+  * Guarded llama.cpp MTP on Android Vulkan by default because the upstream
+    `draft-mtp` backend-sampling path can abort with `vk::DeviceLostError`;
+    CPU and other supported backends remain available, and a dart-define debug
+    override is available for reproductions.
 * **CI reliability**:
   * Cached and retried tiny GGUF test-model downloads used by VM integration
     tests so main-branch CI is less exposed to Hugging Face 429 rate limits.

diff --git a/README.md b/README.md
@@ -114,7 +114,7 @@ hooks:
     llamadart:
       # Optional. Defaults to llamadart's tested native runtime pin.
       # Use a leehack/llamadart-native release tag when testing another build.
-      llamadart_native_tag: b9536
+      llamadart_native_tag: b9547
 
       # Optional. GitHub repository slug or github.com URL.
       llamadart_native_repository: leehack/llamadart-native
@@ -142,7 +142,7 @@ the native-assets hook fails while downloading that asset.
 
 Native source overrides are for compatibility testing. They do not regenerate
 Dart FFI bindings or symbol lookups, so the selected binary still must be ABI-
-and symbol-compatible with the default `leehack/llamadart-native@b9536` runtime.
+and symbol-compatible with the default `leehack/llamadart-native@b9547` runtime.
 
 Available native tags are published on the
 [`leehack/llamadart-native` releases page](https://github.com/leehack/llamadart-native/releases).
@@ -154,7 +154,7 @@ gh release list --repo leehack/llamadart-native --limit 20
 
 Before overriding, confirm the release includes the asset for your target. The
 hook downloads files named `llamadart-native-<bundle>-<tag>.tar.gz`, for example
-`llamadart-native-windows-x64-b9536.tar.gz`.
+`llamadart-native-windows-x64-b9547.tar.gz`.
 For local testing, `llamadart_native_path` may point directly at a bundle
 archive, at an extracted bundle directory, or at a directory containing
 `<tag>/<bundle>/`, `<bundle>/`, or the expected archive file.
@@ -204,6 +204,44 @@ other models, or to override detection, pass `ModelParams.chatTemplate`. See
 for the template support matrix, real-model smoke commands, and how to add a
 family.
 
+llama.cpp MTP speculative decoding is available for compatible GGUF models. For
+Qwen3.5 MTP-style models, reserve rollback snapshots on the context and enable
+MTP on the generation request:
+
+```dart
+await engine.loadModel(
+  'path/to/Qwen3.5-0.8B-MTP-Q4_K_M.gguf',
+  modelParams: const ModelParams(
+    contextSize: 2048,
+    batchSize: 512,
+    microBatchSize: 512,
+    speculativeRollbackTokenMax: 1,
+  ),
+);
+
+await for (final token in engine.generate(
+  'Explain local inference in one paragraph.',
+  params: const GenerationParams(
+    maxTokens: 128,
+    speculativeDecodingConfig: SpeculativeDecodingConfig.mtp(
+      draftTokenMax: 1,
+    ),
+  ),
+)) {
+  stdout.write(token);
+}
+```
+
+Higher `draftTokenMax` values can be faster on some models/devices, but they
+should be benchmarked with the target model because excess draft depth can add
+verification overhead.
+
+Android Vulkan MTP is currently disabled by default. The upstream llama.cpp
+`draft-mtp` backend-sampling path can abort Android Vulkan processes with
+`vk::DeviceLostError`; use CPU for Android MTP validation, or rebuild with
+`--dart-define=LLAMADART_ANDROID_VULKAN_ALLOW_MTP=true` only when reproducing
+or benchmarking that upstream path.
+
 ### 6. Download and cache a remote model file
 
 ```dart
@@ -372,7 +410,7 @@ overrides are rejected instead of being silently ignored. `.litertlm`
 generation honors `GenerationParams`
 `maxTokens`, `temp`, `topK`, `topP`, and `seed` on native and web, with
 `stopSequences` enforced by llamadart. Native LiteRT-LM also honors stream
-batching thresholds and the opt-in `speculativeDecoding` flag; Web LiteRT-LM
+batching thresholds and the opt-in speculative decoding APIs; Web LiteRT-LM
 rejects speculative decoding until the browser runtime exposes an equivalent
 control. llama.cpp-only sampling and constrained-decoding controls
 such as Min-P, repeat penalty overrides, grammar/lazy grammar triggers,
@@ -384,7 +422,7 @@ the current strict structured-output boundary.
 <details>
 <summary>Full module matrix (available modules by target)</summary>
 
-Available llama.cpp module matrix from the default native tag `b9536`:
+Available llama.cpp module matrix from the default native tag `b9547`:
 
 | Target | Available backend modules in bundle |
 |--------|-------------------------------------|
@@ -504,6 +542,8 @@ Notes:
 - `ModelParams.splitMode` passes through to llama.cpp `split_mode`; it defaults to upstream `layer` behavior.
 - `ModelParams.mainGpu` passes through to llama.cpp `main_gpu`. To select one GPU for the full model, use `splitMode: ModelSplitMode.none` with the desired `mainGpu` index.
 - `ModelParams.batchSize` (`n_batch`) and `ModelParams.microBatchSize` (`n_ubatch`) can be set independently for memory/performance tuning; defaults keep legacy behavior (`n_batch = n_ctx`, `n_ubatch = n_batch`).
+- `ModelParams.speculativeRollbackTokenMax` passes through to llama.cpp `n_rs_seq`. Keep the default `0` for normal generation; set it to at least the MTP draft token max when a llama.cpp MTP model needs bounded rollback snapshots, such as Qwen3.5 MTP.
+- Android Vulkan MTP is guarded by default because the upstream llama.cpp MTP backend-sampling path can crash the process. The debug-only escape hatch is `--dart-define=LLAMADART_ANDROID_VULKAN_ALLOW_MTP=true`.
 - `ModelParams.preferMemory64` and `ModelParams.modelBytesHint` are web/WebGPU only (ignored on native). They select the 64-bit (wasm64/mem64) bridge core so models larger than the ~4 GiB wasm32 address space (for example Gemma 4 E2B) can load; `null` auto-decides from the size hint (size-driven, no hardcoded model names). See the [WebGPU bridge docs](https://leehack.github.io/llamadart/docs/platforms/webgpu-bridge).
 - Apple targets use consolidated llama.cpp native libraries, so
   `llamadart_native_backends` does not split Apple backend modules. Use
@@ -700,9 +740,9 @@ Current pinned runtime artifacts:
 
 | Runtime path | Published artifact |
 |--------------|--------------------|
-| Native llama.cpp / GGUF | `leehack/llamadart-native@b9536` |
+| Native llama.cpp / GGUF | `leehack/llamadart-native@b9547` |
 | Native LiteRT-LM / `.litertlm` | `leehack/litert-lm-native@v0.13.1` |
-| Apple SPM llama.cpp / GGUF | `leehack/llamadart-native@b9536` Apple XCFramework |
+| Apple SPM llama.cpp / GGUF | `leehack/llamadart-native@b9547` Apple XCFramework |
 | Apple SPM LiteRT-LM / `.litertlm` | `leehack/litert-lm-native@v0.13.1` Apple XCFrameworks |
 | Web llama.cpp / GGUF | `leehack/llama-web-bridge-assets@v0.1.16` |
 | Web LiteRT-LM / `.litertlm` | App-provided `@litert-lm/core` module URL; the chat app defaults to jsDelivr `@litert-lm/core/+esm` |
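Review note on the README changes in this file: the new `ModelParams.speculativeRollbackTokenMax` bullet says to keep `0` for normal generation and to set it to at least the MTP draft token max when the model needs rollback snapshots. A minimal sketch of that sizing rule follows. `rollbackTokenMaxFor` and its parameters are hypothetical and not part of llamadart. The fallback depth of 1 is taken from the changelog's "one-token draft depth", and the lower bound of 1 is an assumption.

```dart
/// Hypothetical helper (not a llamadart API): returns the smallest
/// `speculativeRollbackTokenMax` that satisfies the README rule
/// "at least the MTP draft token max".
int rollbackTokenMaxFor({
  required bool needsRollbackSnapshots,
  int? draftTokenMax,
}) {
  // Normal generation keeps the documented default of 0.
  if (!needsRollbackSnapshots) return 0;

  // Assumed fallback: the changelog describes a one-token draft depth
  // when callers do not tune draftTokenMax.
  final depth = draftTokenMax ?? 1;
  if (depth < 1) {
    // Assumed lower bound; the real API may validate differently.
    throw ArgumentError.value(draftTokenMax, 'draftTokenMax', 'must be >= 1');
  }
  return depth;
}

void main() {
  print(rollbackTokenMaxFor(needsRollbackSnapshots: false)); // 0
  print(rollbackTokenMaxFor(needsRollbackSnapshots: true)); // 1
  print(rollbackTokenMaxFor(needsRollbackSnapshots: true, draftTokenMax: 4)); // 4
}
```

Deriving both values from one place would keep the load-time reservation and the per-request draft depth from drifting apart, since one lives on `ModelParams` and the other on `GenerationParams`.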

diff --git a/darwin/llamadart/Package.swift b/darwin/llamadart/Package.swift
@@ -4,7 +4,7 @@ import PackageDescription
 
 let packageRoot = URL(fileURLWithPath: #filePath).deletingLastPathComponent()
 let artifactsRoot = packageRoot.appendingPathComponent("Artifacts")
-let llamaCppTag = "b9536"
+let llamaCppTag = "b9547"
 let liteRtLmTag = "v0.13.1"
 
 func localArtifactPath(_ name: String) -> String? {
@@ -54,7 +54,7 @@ let package = Package(
             repository: "leehack/llamadart-native",
             artifactName: "llamadart-native-apple-xcframework-\(llamaCppTag).zip",
             tag: llamaCppTag,
-            checksum: "e71058acca310999c1c5ee03e52e1992bd4c31b528d97ca019e2ea132fc79ae8"
+            checksum: "df326c10018c0ac739560d0744db52598b7ea8158fd935b02f769d3ac2905237"
         ),
         nativeRepoBinaryTarget(
             name: "LiteRtLm",

diff --git a/example/chat_app/lib/litert_lm_benchmark_app.dart b/example/chat_app/lib/litert_lm_benchmark_app.dart
@@ -183,7 +183,12 @@ Map<String, Object?> _summarizeRuns(List<Map<String, Object?>> runs) {
       'decodeWithSamplingTokensPerSecond',
     ),
     'wallMilliseconds': _numericSummary(runs, 'wallMilliseconds'),
+    'outputTokens': _numericSummary(runs, 'outputTokens'),
     'evalTokens': _numericSummary(runs, 'evalTokens'),
+    'targetWallTokensPerSecond': _numericSummary(
+      runs,
+      'targetWallTokensPerSecond',
+    ),
   };
 }
 
@@ -479,6 +484,7 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
           contextSize: _maxTokens,
           gpuLayers: ModelParams.maxGpuLayers,
           preferredBackend: backendPreference,
+          speculativeRollbackTokenMax: _speculative ? 1 : 0,
         ),
       );
       loadSw.stop();
@@ -492,7 +498,13 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
         await engine
             .generate(
               _promptController.text,
-              params: GenerationParams(maxTokens: _outputTokens, seed: 1),
+              params: GenerationParams(
+                maxTokens: _outputTokens,
+                seed: 1,
+                speculativeDecodingConfig: _speculative
+                    ? const SpeculativeDecodingConfig.mtp()
+                    : null,
+              ),
             )
             .drain<void>();
       }
@@ -506,22 +518,31 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
         final sw = Stopwatch()..start();
         await for (final chunk in engine.generate(
           _promptController.text,
-          params: GenerationParams(maxTokens: _outputTokens, seed: 1),
+          params: GenerationParams(
+            maxTokens: _outputTokens,
+            seed: 1,
+            speculativeDecodingConfig: _speculative
+                ? const SpeculativeDecodingConfig.mtp()
+                : null,
+          ),
         )) {
           buffer.write(chunk);
         }
         sw.stop();
         wallMs = sw.elapsedMilliseconds;
         lastText = buffer.toString();
+        final outputTokenCount = lastText.isEmpty
+            ? 0
+            : await engine.getTokenCount(lastText);
         perf = await engine.getPerformanceContext();
         final runMetrics = {
           'index': i,
           'wallMilliseconds': wallMs,
+          'speculativeDecoding': _speculative,
+          'outputTokens': outputTokenCount,
           'promptEvalTokens': perf?.promptEvalTokens,
           'evalTokens': perf?.evalTokens,
-          'hitEosBeforeTarget': perf == null
-              ? null
-              : perf.evalTokens < _outputTokens,
+          'hitEosBeforeTarget': outputTokenCount < _outputTokens,
           'promptEvalMs': perf?.promptEvalMs,
           'evalMs': perf?.evalMs,
           'sampleMs': perf?.sampleMs,
@@ -535,9 +556,12 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
               perf == null || perf.evalMs + perf.sampleMs <= 0
               ? null
               : perf.evalTokens / ((perf.evalMs + perf.sampleMs) / 1000.0),
-          'wallTokensPerSecond': wallMs <= 0 || perf == null
+          'wallTokensPerSecond': wallMs <= 0 || outputTokenCount <= 0
               ? null
-              : perf.evalTokens / (wallMs / 1000.0),
+              : outputTokenCount / (wallMs / 1000.0),
+          'targetWallTokensPerSecond': wallMs <= 0
+              ? null
+              : _outputTokens / (wallMs / 1000.0),
         };
         runsDetail.add(runMetrics);
         _append('RUN llamadart ${jsonEncode(runMetrics)}');
@@ -550,11 +574,15 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
         'backendName': backendName,
         'resolvedGpuLayers': resolvedGpuLayers,
         'targetDecodeTokens': _outputTokens,
+        'speculativeDecoding': _speculative,
+        'outputTokens': runsDetail.isEmpty
+            ? null
+            : runsDetail.last['outputTokens'],
         'promptEvalTokens': perf?.promptEvalTokens,
         'evalTokens': perf?.evalTokens,
-        'hitEosBeforeTarget': perf == null
+        'hitEosBeforeTarget': runsDetail.isEmpty
             ? null
-            : perf.evalTokens < _outputTokens,
+            : runsDetail.last['hitEosBeforeTarget'],
         'promptEvalMs': perf?.promptEvalMs,
         'evalMs': perf?.evalMs,
         'sampleMs': perf?.sampleMs,
@@ -568,9 +596,12 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
             perf == null || perf.evalMs + perf.sampleMs <= 0
             ? null
             : perf.evalTokens / ((perf.evalMs + perf.sampleMs) / 1000.0),
-        'wallTokensPerSecond': wallMs <= 0 || perf == null
+        'wallTokensPerSecond': runsDetail.isEmpty
             ? null
-            : perf.evalTokens / (wallMs / 1000.0),
+            : runsDetail.last['wallTokensPerSecond'],
+        'targetWallTokensPerSecond': runsDetail.isEmpty
+            ? null
+            : runsDetail.last['targetWallTokensPerSecond'],
         'runs': _runs,
         'warmups': _warmups,
         'measured': _summarizeRuns(runsDetail),
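Review note on the benchmark changes in this file: the per-run metrics now derive throughput from a re-tokenized count of the output text instead of `perf.evalTokens`, and add a second rate based on the requested token budget. A standalone sketch of that arithmetic may make the difference between the two rates easier to see. `wallThroughput` is a hypothetical function name; only the three map keys and the guard conditions come from the diff.

```dart
/// Hypothetical standalone version of the per-run throughput fields.
/// Guards mirror the diff: no rate without elapsed time, and no
/// output-based rate when nothing was emitted.
Map<String, Object?> wallThroughput({
  required int wallMs,
  required int outputTokenCount,
  required int targetTokens,
}) {
  final seconds = wallMs / 1000.0;
  return {
    // True when generation stopped before reaching the requested budget.
    'hitEosBeforeTarget': outputTokenCount < targetTokens,
    // Rate based on tokens actually produced.
    'wallTokensPerSecond': wallMs <= 0 || outputTokenCount <= 0
        ? null
        : outputTokenCount / seconds,
    // Rate based on the requested budget, regardless of actual output.
    'targetWallTokensPerSecond': wallMs <= 0 ? null : targetTokens / seconds,
  };
}

void main() {
  final m = wallThroughput(
    wallMs: 2000,
    outputTokenCount: 100,
    targetTokens: 128,
  );
  print(m['wallTokensPerSecond']); // 50.0 on the Dart VM
  print(m['targetWallTokensPerSecond']); // 64.0 on the Dart VM
  print(m['hitEosBeforeTarget']); // true
}
```

In this example the run stopped early, so `targetWallTokensPerSecond` (64.0) overstates the real rate (50.0). The target-based figure is only comparable across runs when `hitEosBeforeTarget` is false. Re-tokenizing the output text is also an approximation, because it may not reproduce the exact token sequence the model emitted.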

diff --git a/example/chat_app/pubspec.lock b/example/chat_app/pubspec.lock
@@ -349,7 +349,7 @@ packages:
       path: "../.."
       relative: true
     source: path
-    version: "0.7.1"
+    version: "0.7.2"
   logging:
     dependency: transitive
     description:

diff --git a/hook/build.dart b/hook/build.dart
@@ -11,7 +11,7 @@ import 'package:path/path.dart' as path;
 
 import 'package:llamadart/src/hook/native_bundle_config.dart';
 
-const _llamaCppTag = 'b9536';
+const _llamaCppTag = 'b9547';
 const _nativeRepoSlug = 'leehack/llamadart-native';
 
 const _packageName = 'llamadart';

diff --git a/lib/src/backends/litert_lm/litert_lm_backend_web.dart b/lib/src/backends/litert_lm/litert_lm_backend_web.dart
@@ -961,6 +961,9 @@ class LiteRtLmBackend
     if (params.speculativeDecoding) {
       unsupported.add('speculativeDecoding');
     }
+    if (params.speculativeDecodingConfig != null) {
+      unsupported.add('speculativeDecodingConfig');
+    }
     if (params.streamBatchTokenThreshold !=
         defaults.streamBatchTokenThreshold) {
       unsupported.add('streamBatchTokenThreshold');

diff --git a/lib/src/backends/litert_lm/litert_lm_service.dart b/lib/src/backends/litert_lm/litert_lm_service.dart
@@ -522,7 +522,7 @@ class LiteRtLmService {
   ) {
     return _ensureClientForRuntime(
       outputTokens: params.maxTokens,
-      speculativeDecoding: params.speculativeDecoding,
+      speculativeDecoding: params.isSpeculativeDecodingEnabled,
     );
   }
 
@@ -843,6 +843,7 @@ class LiteRtLmService {
     if (params.grammarRoot != defaults.grammarRoot) {
       unsupported.add('grammarRoot');
     }
+    _addUnsupportedSpeculativeDecodingOptions(params, unsupported);
 
     if (unsupported.isEmpty) {
       return;
@@ -851,10 +852,30 @@ class LiteRtLmService {
       'LiteRtLmBackend does not support llama.cpp-specific GenerationParams: '
       '${unsupported.join(', ')}. Supported LiteRT-LM generation options are '
       'maxTokens, temp, topK, topP, seed, stopSequences, '
-      'speculativeDecoding, and native stream batching thresholds.',
+      'speculativeDecoding, speculativeDecodingConfig, and native stream '
+      'batching thresholds.',
     );
   }
 
+  void _addUnsupportedSpeculativeDecodingOptions(
+    GenerationParams params,
+    List<String> unsupported,
+  ) {
+    final config = params.resolvedSpeculativeDecodingConfig;
+    if (config == null) {
+      return;
+    }
+    if (config.draftTokenMax != null) {
+      unsupported.add('speculativeDecodingConfig.draftTokenMax');
+    }
+    if (config.draftTokenMin != null) {
+      unsupported.add('speculativeDecodingConfig.draftTokenMin');
+    }
+    if (config.minProbability != null) {
+      unsupported.add('speculativeDecodingConfig.minProbability');
+    }
+  }
+
   int _defaultSamplerSeed() {
     return DateTime.now().microsecondsSinceEpoch & 0x7fffffff;
   }
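Review note on `_addUnsupportedSpeculativeDecodingOptions`: native LiteRT-LM accepts speculative decoding as an on/off choice but reports any llama.cpp-style tuning field as unsupported. The same check can be sketched in isolation with a collection literal. `SpecConfig` is a hypothetical stand-in for `SpeculativeDecodingConfig`; the field names come from the diff, while the field types are assumptions.

```dart
/// Hypothetical stand-in for the three config fields the diff inspects.
/// The real SpeculativeDecodingConfig may have other fields and types.
class SpecConfig {
  const SpecConfig({
    this.draftTokenMax,
    this.draftTokenMin,
    this.minProbability,
  });

  final int? draftTokenMax; // assumed type
  final int? draftTokenMin; // assumed type
  final double? minProbability; // assumed type
}

/// Returns the option names a backend without draft tuning would reject.
/// A null config, or a config with no tuning fields set, yields no entries.
List<String> unsupportedSpeculativeOptions(SpecConfig? config) {
  if (config == null) return const [];
  return [
    if (config.draftTokenMax != null) 'speculativeDecodingConfig.draftTokenMax',
    if (config.draftTokenMin != null) 'speculativeDecodingConfig.draftTokenMin',
    if (config.minProbability != null)
      'speculativeDecodingConfig.minProbability',
  ];
}

void main() {
  print(unsupportedSpeculativeOptions(null)); // []
  print(unsupportedSpeculativeOptions(const SpecConfig())); // []
  print(unsupportedSpeculativeOptions(const SpecConfig(draftTokenMax: 2)));
  // [speculativeDecodingConfig.draftTokenMax]
}
```

Returning a list instead of mutating a shared one makes the check easy to unit test. The diff's version appends to the caller's `unsupported` list so that all rejected options end up in a single error message.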