Merged
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -45,6 +45,23 @@
default; iOS, macOS, Linux, and Windows now default to `llama_cpp` only.
* Added native release pin automation so the maintainer sync workflow updates
Apple SPM checksums from published native release asset digests.
* Added `SpeculativeDecodingConfig` as a backend-neutral generation option for
selecting speculative decoding strategies such as MTP while keeping the
existing `GenerationParams.speculativeDecoding` flag as a compatibility
switch.
* Added llama.cpp native MTP speculative decoding for compatible GGUF models
through `SpeculativeDecodingConfig.mtp(...)`, defaulting to a conservative
one-token draft depth unless callers tune `draftTokenMax`.
* Updated the default llama.cpp native runtime pin to
`leehack/llamadart-native@b9547`, including the MTP wrapper exports and
`llama-common` runtime packaging.
* Added `ModelParams.speculativeRollbackTokenMax` so llama.cpp contexts can
reserve recurrent-state rollback snapshots required by Qwen3.5 MTP-style
models.
* Guarded llama.cpp MTP on Android Vulkan by default because the upstream
`draft-mtp` backend-sampling path can abort with `vk::DeviceLostError`;
CPU and other supported backends remain available, and a dart-define debug
override is available for reproductions.
* **CI reliability**:
* Cached and retried tiny GGUF test-model downloads used by VM integration
tests so main-branch CI is less exposed to Hugging Face 429 rate limits.
54 changes: 47 additions & 7 deletions README.md
@@ -114,7 +114,7 @@ hooks:
llamadart:
# Optional. Defaults to llamadart's tested native runtime pin.
# Use a leehack/llamadart-native release tag when testing another build.
- llamadart_native_tag: b9536
+ llamadart_native_tag: b9547

# Optional. GitHub repository slug or github.com URL.
llamadart_native_repository: leehack/llamadart-native
@@ -142,7 +142,7 @@ the native-assets hook fails while downloading that asset.

Native source overrides are for compatibility testing. They do not regenerate
Dart FFI bindings or symbol lookups, so the selected binary still must be ABI-
- and symbol-compatible with the default `leehack/llamadart-native@b9536` runtime.
+ and symbol-compatible with the default `leehack/llamadart-native@b9547` runtime.

Available native tags are published on the
[`leehack/llamadart-native` releases page](https://github.com/leehack/llamadart-native/releases).
@@ -154,7 +154,7 @@ gh release list --repo leehack/llamadart-native --limit 20

Before overriding, confirm the release includes the asset for your target. The
hook downloads files named `llamadart-native-<bundle>-<tag>.tar.gz`, for example
- `llamadart-native-windows-x64-b9536.tar.gz`.
+ `llamadart-native-windows-x64-b9547.tar.gz`.
For local testing, `llamadart_native_path` may point directly at a bundle
archive, at an extracted bundle directory, or at a directory containing
`<tag>/<bundle>/`, `<bundle>/`, or the expected archive file.
@@ -204,6 +204,44 @@ other models, or to override detection, pass `ModelParams.chatTemplate`. See
for the template support matrix, real-model smoke commands, and how to add a
family.

llama.cpp MTP speculative decoding is available for compatible GGUF models. For
Qwen3.5 MTP-style models, reserve rollback snapshots on the context and enable
MTP on the generation request:

```dart
await engine.loadModel(
  'path/to/Qwen3.5-0.8B-MTP-Q4_K_M.gguf',
  modelParams: const ModelParams(
    contextSize: 2048,
    batchSize: 512,
    microBatchSize: 512,
    speculativeRollbackTokenMax: 1,
  ),
);

await for (final token in engine.generate(
  'Explain local inference in one paragraph.',
  params: const GenerationParams(
    maxTokens: 128,
    speculativeDecodingConfig: SpeculativeDecodingConfig.mtp(
      draftTokenMax: 1,
    ),
  ),
)) {
  stdout.write(token);
}
```
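
The parameter notes elsewhere in this README say `speculativeRollbackTokenMax` should be at least the MTP draft token max. A minimal sketch of a pre-flight check an app could run before loading the model follows. `rollbackCoversDraft` is a hypothetical helper and not part of llamadart. Treating an unset `draftTokenMax` as 1 is an assumption based on the changelog's "one-token draft depth" default.

```dart
/// Hypothetical helper (not part of llamadart): returns true when the
/// rollback budget reserved on the context covers the requested draft depth.
bool rollbackCoversDraft({
  required int speculativeRollbackTokenMax,
  int? draftTokenMax,
}) {
  // Assumption: an unset draft depth means the conservative default of 1.
  final effectiveDraft = draftTokenMax ?? 1;
  return speculativeRollbackTokenMax >= effectiveDraft;
}
```

With the values from the example above (`speculativeRollbackTokenMax: 1`, `draftTokenMax: 1`) the check passes. Raising the draft depth without raising the rollback budget would fail it.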

Higher `draftTokenMax` values can be faster on some models/devices, but they
should be benchmarked with the target model because excess draft depth can add
verification overhead.
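
One way to compare draft depths is to measure wall-clock throughput from the tokens actually emitted, which is the approach the benchmark app in this change takes. The sketch below mirrors that arithmetic in isolation. The function names are illustrative rather than llamadart APIs, and the token count and elapsed time are assumed to come from your own measurement loop.

```dart
/// Wall-clock decode throughput in tokens per second, or null when the
/// measurement is unusable (no elapsed time or no output tokens).
double? wallTokensPerSecond({
  required int outputTokens,
  required int wallMilliseconds,
}) {
  if (wallMilliseconds <= 0 || outputTokens <= 0) return null;
  return outputTokens / (wallMilliseconds / 1000.0);
}

/// Ratio of speculative to baseline throughput; values above 1.0 mean the
/// chosen draft depth helped. Returns null if either measurement is missing.
double? speedup(double? baseline, double? speculative) {
  if (baseline == null || speculative == null || baseline <= 0) return null;
  return speculative / baseline;
}
```

Run the same prompt, seed, and `maxTokens` for each `draftTokenMax` candidate and compare the ratios. A ratio below 1.0 suggests the extra verification work outweighs the accepted drafts on that model and device.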

Android Vulkan MTP is currently disabled by default. The upstream llama.cpp
`draft-mtp` backend-sampling path can abort Android Vulkan processes with
`vk::DeviceLostError`; use CPU for Android MTP validation, or rebuild with
`--dart-define=LLAMADART_ANDROID_VULKAN_ALLOW_MTP=true` only when reproducing
or benchmarking that upstream path.
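
For readers unfamiliar with `--dart-define`, the following is a sketch of how a compile-time override of this kind is typically read in Dart. It is not llamadart's actual guard: `shouldEnableMtp` and its parameters are hypothetical, and only the flag name comes from the text above.

```dart
// Compile-time flag; false unless the app is built with
// --dart-define=LLAMADART_ANDROID_VULKAN_ALLOW_MTP=true.
const bool allowAndroidVulkanMtp = bool.fromEnvironment(
  'LLAMADART_ANDROID_VULKAN_ALLOW_MTP',
);

/// Hypothetical guard: MTP is blocked on Android + Vulkan unless the
/// override is set; every other platform/backend combination is allowed.
bool shouldEnableMtp({
  required bool isAndroid,
  required bool isVulkan,
  bool overrideAllowed = allowAndroidVulkanMtp,
}) {
  if (isAndroid && isVulkan) return overrideAllowed;
  return true;
}
```

Because `bool.fromEnvironment` is resolved at compile time, the override cannot be flipped at runtime, which fits a debug-only escape hatch.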

### 6. Download and cache a remote model file

```dart
@@ -372,7 +410,7 @@ overrides are rejected instead of being silently ignored. `.litertlm`
generation honors `GenerationParams`
`maxTokens`, `temp`, `topK`, `topP`, and `seed` on native and web, with
`stopSequences` enforced by llamadart. Native LiteRT-LM also honors stream
- batching thresholds and the opt-in `speculativeDecoding` flag; Web LiteRT-LM
+ batching thresholds and the opt-in speculative decoding APIs; Web LiteRT-LM
rejects speculative decoding until the browser runtime exposes an equivalent
control. llama.cpp-only sampling and constrained-decoding controls
such as Min-P, repeat penalty overrides, grammar/lazy grammar triggers,
@@ -384,7 +422,7 @@ the current strict structured-output boundary.
<details>
<summary>Full module matrix (available modules by target)</summary>

- Available llama.cpp module matrix from the default native tag `b9536`:
+ Available llama.cpp module matrix from the default native tag `b9547`:

| Target | Available backend modules in bundle |
|--------|-------------------------------------|
@@ -504,6 +542,8 @@ Notes:
- `ModelParams.splitMode` passes through to llama.cpp `split_mode`; it defaults to upstream `layer` behavior.
- `ModelParams.mainGpu` passes through to llama.cpp `main_gpu`. To select one GPU for the full model, use `splitMode: ModelSplitMode.none` with the desired `mainGpu` index.
- `ModelParams.batchSize` (`n_batch`) and `ModelParams.microBatchSize` (`n_ubatch`) can be set independently for memory/performance tuning; defaults keep legacy behavior (`n_batch = n_ctx`, `n_ubatch = n_batch`).
- `ModelParams.speculativeRollbackTokenMax` passes through to llama.cpp `n_rs_seq`. Keep the default `0` for normal generation; set it to at least the MTP draft token max when a llama.cpp MTP model needs bounded rollback snapshots, such as Qwen3.5 MTP.
- Android Vulkan MTP is guarded by default because the upstream llama.cpp MTP backend-sampling path can crash the process. The debug-only escape hatch is `--dart-define=LLAMADART_ANDROID_VULKAN_ALLOW_MTP=true`.
- `ModelParams.preferMemory64` and `ModelParams.modelBytesHint` are web/WebGPU only (ignored on native). They select the 64-bit (wasm64/mem64) bridge core so models larger than the ~4 GiB wasm32 address space (for example Gemma 4 E2B) can load; `null` auto-decides from the size hint (size-driven, no hardcoded model names). See the [WebGPU bridge docs](https://leehack.github.io/llamadart/docs/platforms/webgpu-bridge).
- Apple targets use consolidated llama.cpp native libraries, so
`llamadart_native_backends` does not split Apple backend modules. Use
@@ -700,9 +740,9 @@ Current pinned runtime artifacts:

| Runtime path | Published artifact |
|--------------|--------------------|
- | Native llama.cpp / GGUF | `leehack/llamadart-native@b9536` |
+ | Native llama.cpp / GGUF | `leehack/llamadart-native@b9547` |
| Native LiteRT-LM / `.litertlm` | `leehack/litert-lm-native@v0.13.1` |
- | Apple SPM llama.cpp / GGUF | `leehack/llamadart-native@b9536` Apple XCFramework |
+ | Apple SPM llama.cpp / GGUF | `leehack/llamadart-native@b9547` Apple XCFramework |
| Apple SPM LiteRT-LM / `.litertlm` | `leehack/litert-lm-native@v0.13.1` Apple XCFrameworks |
| Web llama.cpp / GGUF | `leehack/llama-web-bridge-assets@v0.1.16` |
| Web LiteRT-LM / `.litertlm` | App-provided `@litert-lm/core` module URL; the chat app defaults to jsDelivr `@litert-lm/core/+esm` |
4 changes: 2 additions & 2 deletions darwin/llamadart/Package.swift
@@ -4,7 +4,7 @@ import PackageDescription

let packageRoot = URL(fileURLWithPath: #filePath).deletingLastPathComponent()
let artifactsRoot = packageRoot.appendingPathComponent("Artifacts")
- let llamaCppTag = "b9536"
+ let llamaCppTag = "b9547"
let liteRtLmTag = "v0.13.1"

func localArtifactPath(_ name: String) -> String? {
@@ -54,7 +54,7 @@ let package = Package(
repository: "leehack/llamadart-native",
artifactName: "llamadart-native-apple-xcframework-\(llamaCppTag).zip",
tag: llamaCppTag,
- checksum: "e71058acca310999c1c5ee03e52e1992bd4c31b528d97ca019e2ea132fc79ae8"
+ checksum: "df326c10018c0ac739560d0744db52598b7ea8158fd935b02f769d3ac2905237"
),
nativeRepoBinaryTarget(
name: "LiteRtLm",
53 changes: 42 additions & 11 deletions example/chat_app/lib/litert_lm_benchmark_app.dart
@@ -183,7 +183,12 @@ Map<String, Object?> _summarizeRuns(List<Map<String, Object?>> runs) {
'decodeWithSamplingTokensPerSecond',
),
'wallMilliseconds': _numericSummary(runs, 'wallMilliseconds'),
'outputTokens': _numericSummary(runs, 'outputTokens'),
'evalTokens': _numericSummary(runs, 'evalTokens'),
'targetWallTokensPerSecond': _numericSummary(
runs,
'targetWallTokensPerSecond',
),
};
}

@@ -479,6 +484,7 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
contextSize: _maxTokens,
gpuLayers: ModelParams.maxGpuLayers,
preferredBackend: backendPreference,
speculativeRollbackTokenMax: _speculative ? 1 : 0,
),
);
loadSw.stop();
@@ -492,7 +498,13 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
await engine
.generate(
_promptController.text,
- params: GenerationParams(maxTokens: _outputTokens, seed: 1),
+ params: GenerationParams(
+   maxTokens: _outputTokens,
+   seed: 1,
+   speculativeDecodingConfig: _speculative
+       ? const SpeculativeDecodingConfig.mtp()
+       : null,
+ ),
)
.drain<void>();
}
@@ -506,22 +518,31 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
final sw = Stopwatch()..start();
await for (final chunk in engine.generate(
_promptController.text,
- params: GenerationParams(maxTokens: _outputTokens, seed: 1),
+ params: GenerationParams(
+   maxTokens: _outputTokens,
+   seed: 1,
+   speculativeDecodingConfig: _speculative
+       ? const SpeculativeDecodingConfig.mtp()
+       : null,
+ ),
)) {
buffer.write(chunk);
}
sw.stop();
wallMs = sw.elapsedMilliseconds;
lastText = buffer.toString();
final outputTokenCount = lastText.isEmpty
? 0
: await engine.getTokenCount(lastText);
perf = await engine.getPerformanceContext();
final runMetrics = {
'index': i,
'wallMilliseconds': wallMs,
'speculativeDecoding': _speculative,
'outputTokens': outputTokenCount,
'promptEvalTokens': perf?.promptEvalTokens,
'evalTokens': perf?.evalTokens,
- 'hitEosBeforeTarget': perf == null
-     ? null
-     : perf.evalTokens < _outputTokens,
+ 'hitEosBeforeTarget': outputTokenCount < _outputTokens,
'promptEvalMs': perf?.promptEvalMs,
'evalMs': perf?.evalMs,
'sampleMs': perf?.sampleMs,
@@ -535,9 +556,12 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
perf == null || perf.evalMs + perf.sampleMs <= 0
? null
: perf.evalTokens / ((perf.evalMs + perf.sampleMs) / 1000.0),
- 'wallTokensPerSecond': wallMs <= 0 || perf == null
+ 'wallTokensPerSecond': wallMs <= 0 || outputTokenCount <= 0
? null
- : perf.evalTokens / (wallMs / 1000.0),
+ : outputTokenCount / (wallMs / 1000.0),
'targetWallTokensPerSecond': wallMs <= 0
? null
: _outputTokens / (wallMs / 1000.0),
};
runsDetail.add(runMetrics);
_append('RUN llamadart ${jsonEncode(runMetrics)}');
@@ -550,11 +574,15 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
'backendName': backendName,
'resolvedGpuLayers': resolvedGpuLayers,
'targetDecodeTokens': _outputTokens,
'speculativeDecoding': _speculative,
'outputTokens': runsDetail.isEmpty
? null
: runsDetail.last['outputTokens'],
'promptEvalTokens': perf?.promptEvalTokens,
'evalTokens': perf?.evalTokens,
- 'hitEosBeforeTarget': perf == null
+ 'hitEosBeforeTarget': runsDetail.isEmpty
? null
- : perf.evalTokens < _outputTokens,
+ : runsDetail.last['hitEosBeforeTarget'],
'promptEvalMs': perf?.promptEvalMs,
'evalMs': perf?.evalMs,
'sampleMs': perf?.sampleMs,
@@ -568,9 +596,12 @@ class _LiteRtLmBenchmarkAppState extends State<LiteRtLmBenchmarkApp> {
perf == null || perf.evalMs + perf.sampleMs <= 0
? null
: perf.evalTokens / ((perf.evalMs + perf.sampleMs) / 1000.0),
- 'wallTokensPerSecond': wallMs <= 0 || perf == null
+ 'wallTokensPerSecond': runsDetail.isEmpty
? null
- : perf.evalTokens / (wallMs / 1000.0),
+ : runsDetail.last['wallTokensPerSecond'],
'targetWallTokensPerSecond': runsDetail.isEmpty
? null
: runsDetail.last['targetWallTokensPerSecond'],
'runs': _runs,
'warmups': _warmups,
'measured': _summarizeRuns(runsDetail),
2 changes: 1 addition & 1 deletion example/chat_app/pubspec.lock
@@ -349,7 +349,7 @@ packages:
path: "../.."
relative: true
source: path
- version: "0.7.1"
+ version: "0.7.2"
logging:
dependency: transitive
description:
2 changes: 1 addition & 1 deletion hook/build.dart
@@ -11,7 +11,7 @@ import 'package:path/path.dart' as path;

import 'package:llamadart/src/hook/native_bundle_config.dart';

- const _llamaCppTag = 'b9536';
+ const _llamaCppTag = 'b9547';
const _nativeRepoSlug = 'leehack/llamadart-native';

const _packageName = 'llamadart';
3 changes: 3 additions & 0 deletions lib/src/backends/litert_lm/litert_lm_backend_web.dart
@@ -961,6 +961,9 @@ class LiteRtLmBackend
if (params.speculativeDecoding) {
unsupported.add('speculativeDecoding');
}
if (params.speculativeDecodingConfig != null) {
unsupported.add('speculativeDecodingConfig');
}
if (params.streamBatchTokenThreshold !=
defaults.streamBatchTokenThreshold) {
unsupported.add('streamBatchTokenThreshold');
25 changes: 23 additions & 2 deletions lib/src/backends/litert_lm/litert_lm_service.dart
@@ -522,7 +522,7 @@ class LiteRtLmService {
) {
return _ensureClientForRuntime(
outputTokens: params.maxTokens,
- speculativeDecoding: params.speculativeDecoding,
+ speculativeDecoding: params.isSpeculativeDecodingEnabled,
);
}

@@ -843,6 +843,7 @@ class LiteRtLmService {
if (params.grammarRoot != defaults.grammarRoot) {
unsupported.add('grammarRoot');
}
_addUnsupportedSpeculativeDecodingOptions(params, unsupported);

if (unsupported.isEmpty) {
return;
@@ -851,10 +852,30 @@ class LiteRtLmService {
'LiteRtLmBackend does not support llama.cpp-specific GenerationParams: '
'${unsupported.join(', ')}. Supported LiteRT-LM generation options are '
'maxTokens, temp, topK, topP, seed, stopSequences, '
- 'speculativeDecoding, and native stream batching thresholds.',
+ 'speculativeDecoding, speculativeDecodingConfig, and native stream '
+ 'batching thresholds.',
);
}

void _addUnsupportedSpeculativeDecodingOptions(
GenerationParams params,
List<String> unsupported,
) {
final config = params.resolvedSpeculativeDecodingConfig;
if (config == null) {
return;
}
if (config.draftTokenMax != null) {
unsupported.add('speculativeDecodingConfig.draftTokenMax');
}
if (config.draftTokenMin != null) {
unsupported.add('speculativeDecodingConfig.draftTokenMin');
}
if (config.minProbability != null) {
unsupported.add('speculativeDecodingConfig.minProbability');
}
}

int _defaultSamplerSeed() {
return DateTime.now().microsecondsSinceEpoch & 0x7fffffff;
}