Releases: leehack/llamadart
v0.7.2
Metadata patch release.
- Added explicit pub.dev platform metadata for Android, iOS, Linux, macOS, web, and Windows so the package listing reflects actual cross-platform support.
- Hardened the native prompt reuse CI model download to use authenticated Hugging Face requests that fail on HTTP errors.
v0.7.1
- Apple native runtime packaging:
  - Added Flutter iOS/macOS Swift Package Manager integration so Apple apps link
    the pinned `leehack/llamadart-native` and `leehack/litert-lm-native`
    XCFramework artifacts through `darwin/llamadart/Package.swift`.
  - Disabled the legacy hook-managed Apple bundle path for Flutter iOS/macOS
    builds, avoiding wrapper/framework `MinimumOSVersion` mismatches in App
    Store uploads. Standalone Dart macOS keeps the native-assets dylib fallback.
  - Raised the Flutter Apple runtime floors to iOS 16.4 and macOS 14.0 to match
    the published XCFramework artifacts.
- Runtime defaults and release automation:
  - Android native builds still include both `llama_cpp` and `litert_lm` by
    default; iOS, macOS, Linux, and Windows now default to `llama_cpp` only.
  - Added native release pin automation so the maintainer sync workflow updates
    Apple SPM checksums from published native release asset digests.
- CI reliability:
  - Cached and retried tiny GGUF test-model downloads used by VM integration
    tests so main-branch CI is less exposed to Hugging Face 429 rate limits.
  - Excluded local SwiftPM artifact caches from pub archives; Flutter Apple
    consumers still resolve the published remote XCFramework targets.
- Compatibility note: no Dart API breaking changes. Flutter Apple apps must
  target iOS 16.4/macOS 14.0 or newer, and non-Android native apps that ship
  `.litertlm` models should opt in with `llamadart_native_runtimes`.
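The runtime defaults described for v0.7.1 can be pictured as a small lookup with an explicit opt-in. The sketch below is illustrative only and is written in Python for brevity: `select_runtimes`, `DEFAULT_RUNTIMES`, and the list-shaped `requested` argument are hypothetical names, not the package's build-hook API, and the real value format of `llamadart_native_runtimes` may differ.

```python
# Illustrative model of the v0.7.1 per-platform runtime defaults.
# All names here are hypothetical; this is not the package's hook code.
from typing import Iterable, Optional, Tuple

KNOWN_RUNTIMES = {"llama_cpp", "litert_lm"}

# Defaults as stated in the release notes: Android bundles both runtimes,
# every other native platform bundles llama_cpp only.
DEFAULT_RUNTIMES = {
    "android": ("llama_cpp", "litert_lm"),
    "ios": ("llama_cpp",),
    "macos": ("llama_cpp",),
    "linux": ("llama_cpp",),
    "windows": ("llama_cpp",),
}


def select_runtimes(
    platform: str, requested: Optional[Iterable[str]] = None
) -> Tuple[str, ...]:
    """Return the runtimes to bundle, honoring an explicit opt-in list."""
    if platform not in DEFAULT_RUNTIMES:
        raise ValueError(f"unsupported platform: {platform}")
    if requested is None:
        return DEFAULT_RUNTIMES[platform]
    # Preserve the caller's order while dropping duplicates.
    chosen = tuple(dict.fromkeys(requested))
    if not chosen:
        raise ValueError("at least one runtime must be requested")
    unknown = set(chosen) - KNOWN_RUNTIMES
    if unknown:
        raise ValueError(f"unknown runtimes: {sorted(unknown)}")
    return chosen
```

Under this model, a macOS app that ships `.litertlm` models would request both runtimes explicitly, while an Android app gets both without any configuration.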
v0.7.0
- LiteRT-LM backend and runtime selection:
  - Added first-class `.litertlm` routing through `LlamaBackend()` on native and
    web targets, with native bundle downloads from `leehack/litert-lm-native`
    and web loading through `@litert-lm/core`.
  - Added `ModelParams.liteRtLmBackend` so callers can select LiteRT-LM CPU,
    GPU, or Android NPU execution where supported. `auto` chooses GPU on
    Android/macOS and CPU elsewhere on native targets.
  - Added cached Hugging Face `.litertlm` loading through
    `loadModelSource(...)`, preserving the selected LiteRT-LM backend after the
    cache manager resolves the local file.
  - Added native LiteRT-LM tokenization, detokenization, log-level control,
    runtime metrics, and high-level `ChatSession` token counting support.
  - Added `hooks.user_defines.llamadart.llamadart_native_tag`,
    `llamadart_native_repository`, and `llamadart_native_path` so apps can test
    a different compatible native runtime source without patching `llamadart`.
  - Updated the Windows runtime fallback scanner to discover custom GitHub and
    local archive cache namespaces when `.dart_tool/lib` is unavailable.
- LiteRT-LM chat, templates, and generation quality:
  - Added `GenerationParams.speculativeDecoding` for native LiteRT-LM. The
    default remains disabled; llama.cpp, WebGPU, and LiteRT-LM web reject the
    option until their speculative paths are implemented.
  - Fixed Gemma 4 `.litertlm` thinking and tool calling by replacing the stub
    template with the canonical Gemma 4 chat template, parsing the runtime
    thought channel as reasoning, and suppressing reasoning deltas when callers
    set `enableThinking: false`.
  - Added a filename-keyed `.litertlm` chat-template registry seeded with
    Gemma 4/3/3n and Qwen 2.5/3. Pass `ModelParams.chatTemplate` to override
    detection for other models.
  - Fixed LiteRT-LM tool calling for grammar-using handlers by forwarding
    `supportsGrammarConstraints` from the active `NativeAutoBackend` delegate.
  - Stopped structured tool-call streams from leaking raw Hermes/Qwen JSON or
    Gemma `<|tool_call>` markers as assistant content before the final
    `tool_calls` chunk.
- Web and chat-app support:
  - Added `ModelParams.preferMemory64` and `ModelParams.modelBytesHint` so
    large WebGPU GGUF models such as Gemma 4 E2B can choose the 64-bit bridge
    core before hitting the wasm32 address-space limit.
  - Fixed web `.litertlm` chat-app turns by swallowing unsupported token-count
    refreshes, avoiding unsupported `minP`/`penalty` parameters for LiteRT-LM
    web generation, and replacing the stuck "Loading model 0%" label with an
    indeterminate load message.
  - Halved web `.litertlm` load time by skipping WebGPU `CacheStorage` prefetch
    for LiteRT-LM models, which are fetched directly by `@litert-lm/core`.
  - Fixed web GGUF downloads reporting success before the bridge was ready by
    awaiting `window.__llamadartBridgeReadyPromise`, requiring the bridge
    prefetch API, and surfacing actionable errors for old bridge assets.
  - Allowed benign Hugging Face `?download=true` URLs to be prefetched into the
    browser cache while still skipping credentialed or signed URLs.
- Lifecycle, cancellation, and native stability:
  - Fixed iOS `.litertlm` loading by resolving embedded `LiteRtLm` and
    `StreamProxy` frameworks from the app bundle, matching the macOS runtime
    path behavior.
  - Improved LiteRT-LM diagnostics before model load, including selected
    CPU/GPU/NPU backend reporting, platform availability errors, and complete
    dynamic-library candidate failures.
  - Validated platform-specific LiteRT-LM companion libraries during
    native-asset setup so incomplete runtime bundles fail at build time.
  - Hardened native and LiteRT-LM cancellation/disposal so in-flight generation
    no longer races token release, worker teardown, engine deletion, closed
    response ports, or stream writes after cancellation.
  - Freed multimodal prompt buffers on tokenize/eval error paths, serialized
    multimodal projector load/unload, and closed a leaked native-backend
    handshake reply port.
- Correctness and download resilience:
  - `ChatSession` now forwards empty-choices completion chunks instead of
    throwing, strips multiple `<think>` blocks, and trims history only on
    user-message turn boundaries.
  - `LlamaEngine.generate` wraps unexpected backend errors in
    `LlamaInferenceException` so callers catching `LlamaException` see the
    documented error type.
  - Tool-call parsing now uses stable fallback ids, tolerates code-fence
    language tokens without trailing delimiters, and keeps commas inside quoted
    argument values.
  - JSON-schema-to-GBNF conversion now resolves `$ref`s nested inside other
    `$ref` targets and fails loudly on unresolvable or external `$ref`s.
  - Array grammar generation validates `minItems`/`maxItems`, model downloads
    use connection and idle-read timeouts, and partial-download resume is
    restricted to files with stored validators.
- Benchmarks, docs, and validation:
  - Added fair Gemma 4 LiteRT-LM versus llama.cpp/GGUF benchmark tooling for
    Android, macOS, and web, with speculative-decoding metrics, Pixel benchmark
    failure detection, and target-specific timeouts.
  - Added `tool/gguf_chat_features_smoke.dart` and the
    `chat-app-web-gemma4-webgpu-smoke` E2E scenario for real-model parser and
    WebGPU mem64 validation.
  - Updated README, website docs, and `doc/litert_lm_templates.md` for backend
    selection, platform/runtime support, package-size controls, benchmark
    results, model templates, and current LiteRT-LM capability limits.
- Compatibility note: no public API breaking changes for existing GGUF /
  llama.cpp callers. LiteRT-LM support is additive, with deprecated benchmark
  wrappers retained for compatibility; unsupported llama.cpp-only parameters are
  rejected for `.litertlm` loads instead of being silently ignored.
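The v0.7.0 note on JSON-schema-to-GBNF conversion (nested `$ref` resolution with loud failures) describes a technique that is easy to get subtly wrong. The following is a minimal Python sketch of that idea, not the package's Dart implementation: `resolve_refs`, `lookup_pointer`, and `UnresolvableRefError` are hypothetical names. It assumes same-document JSON Pointer refs only, ignores sibling keywords next to `$ref`, and treats cyclic refs as errors, which a real grammar converter may handle differently.

```python
# Sketch: expand "$ref"s in a JSON schema, including refs nested inside
# other ref targets, and fail loudly on external or unresolvable refs.
# Hypothetical helper names; not the llamadart implementation.
from typing import Any, Tuple


class UnresolvableRefError(ValueError):
    """Raised when a $ref is external, cyclic, or points at nothing."""


def lookup_pointer(root: Any, ref: str) -> Any:
    # Only same-document refs of the form "#" or "#/a/b" are accepted.
    if not ref.startswith("#"):
        raise UnresolvableRefError(f"external $ref not supported: {ref}")
    pointer = ref[1:]
    if pointer == "":
        return root
    if not pointer.startswith("/"):
        raise UnresolvableRefError(f"unsupported $ref form: {ref}")
    node = root
    for raw in pointer.split("/")[1:]:
        # Undo JSON Pointer escaping (RFC 6901): "~1" -> "/", "~0" -> "~".
        key = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, dict) and key in node:
            node = node[key]
        elif isinstance(node, list) and key.isdigit() and int(key) < len(node):
            node = node[int(key)]
        else:
            raise UnresolvableRefError(f"unresolvable $ref: {ref}")
    return node


def resolve_refs(node: Any, root: Any, active: Tuple[str, ...] = ()) -> Any:
    """Return a copy of `node` with every $ref replaced by its target."""
    if isinstance(node, list):
        return [resolve_refs(item, root, active) for item in node]
    if not isinstance(node, dict):
        return node
    ref = node.get("$ref")
    if isinstance(ref, str):
        if ref in active:
            raise UnresolvableRefError(f"cyclic $ref: {ref}")
        target = lookup_pointer(root, ref)
        # Recurse into the target so refs nested inside it are expanded too.
        return resolve_refs(target, root, active + (ref,))
    return {key: resolve_refs(value, root, active) for key, value in node.items()}
```

Raising instead of emitting a permissive fallback rule keeps a broken schema from silently producing a grammar that accepts more than the caller intended.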
v0.6.17
- Native runtime sync:
  - Updated native hook pinning and regenerated bindings through
    `leehack/llamadart-native@b9371`, picking up llama.cpp `b9371`.
  - Picked up the Apple mobile Metal stability fix that disables Metal
    residency sets on iOS/tvOS/visionOS native bundles, avoiding affected
    device context-creation failures such as `MTLLibraryErrorDomain Code=3`.
- Compatibility note: no public API breaking changes in `0.6.17`;
  existing `0.6.16` callers remain compatible. The release only refreshes
  the pinned native runtime and generated low-level bindings.
v0.6.16
- Native runtime diagnostics:
  - Fixed native `getVramInfo()` so it reports free/total VRAM from
    llama.cpp GPU-class backend devices when available, using props-based
    memory reporting first and the legacy memory probe as a fallback.
  - Routed native VRAM probing through the ggml registry fallback path so
    Windows split bundles resolve backend-device symbols from the runtime that
    owns the device registry.
- WebGPU and chat app fixes:
  - Improved browser recovery for large remote WebGPU model/projector loads by
    retrying wasm32 model-staging aborts with the wasm64 core before surfacing
    memory-pressure failures.
  - Improved the runnable chat app's web remote-model startup path so model
    assets are prefetched into browser cache when available, browser
    `CacheStorage` failures fall back to direct network loading, and
    credentialed/signed model URLs skip persistent browser cache storage.
- Model download UX:
  - Improved the runnable chat app's mobile download behavior so lifecycle
    pauses no longer deliberately cancel active foreground downloads; the app
    now lets short screen-lock/background interruptions continue when the OS
    permits and still keeps explicit pause/dispose cancellation paths.
  - Added in-app and docs guidance for mobile large-model downloads, including
    resumable partial files, foreground Dart lifecycle limits, and the need for
    opt-in native background download/model-store integrations for robust
    cross-app GGUF management.
- Compatibility note: no public API breaking changes in `0.6.16`;
  existing `0.6.15` callers remain compatible. The changes improve native VRAM
  diagnostics, WebGPU browser recovery, and chat app download lifecycle
  behavior.
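The v0.6.16 rule that credentialed or signed model URLs skip persistent browser cache storage can be approximated with a simple URL check. This Python sketch is an assumption-heavy illustration, not the chat app's actual logic: `is_cacheable_model_url` and the `SIGNED_QUERY_KEYS` set are hypothetical, and the query keys listed are only common markers of pre-signed or tokenized URLs.

```python
# Sketch: decide whether a remote model URL is safe to keep in a
# persistent cache. Hypothetical helper; the key list is an assumption.
from urllib.parse import parse_qsl, urlsplit

# Query parameters that commonly indicate a pre-signed or tokenized URL.
SIGNED_QUERY_KEYS = {
    "x-amz-signature",
    "x-amz-credential",
    "signature",
    "sig",
    "token",
    "access_token",
    "expires",
}


def is_cacheable_model_url(url: str) -> bool:
    """Return True when the URL carries no embedded credentials or signature."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # "user:password@host" embeds credentials in the URL itself.
    if parts.username or parts.password:
        return False
    keys = {key.lower() for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    # Plain flags such as "download=true" do not match the signed-key set.
    return keys.isdisjoint(SIGNED_QUERY_KEYS)
```

A check like this errs on the side of not caching: a URL that merely looks signed is still loaded, just fetched directly from the network each time.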
v0.6.15
What's Changed
- chore(deps): bump path-to-regexp from 0.1.12 to 0.1.13 in /website in the npm_and_yarn group across 1 directory by @dependabot[bot] in #100
- chore(website): refresh docs dependencies by @leehack in #154
- test: add local E2E scenario runner by @leehack in #155
- Fix GLM-OCR multimodal prompt rendering by @leehack in #157
- Refactor chat template render context serialization by @leehack in #158
- Prepare llamadart 0.6.15 release by @leehack in #159
Full Changelog: v0.6.14...v0.6.15
v0.6.14
What's Changed
- docs: document WebGPU readiness checks by @leehack in #148
- chore(native): sync native release b9159 by @github-actions[bot] in #150
- feat: add model download controller by @leehack in #149
- ci: update actions for Node 24 by @leehack in #151
- chore: update WebGPU bridge assets to v0.1.16 by @leehack in #152
- chore: prepare v0.6.14 release by @leehack in #153
Full Changelog: v0.6.13...v0.6.14
v0.6.13
What's Changed
- fix: cascade WebGPU batch defaults by @leehack in #121
- feat: add model download cache manager by @leehack in #129
- feat(api): expose state_save_file / state_load_file on LlamaEngine by @thereisnotime in #123
- fix: serialize same-key model downloads by @leehack in #139
- fix: recover cache metadata sidecars by @leehack in #140
- docs: codify agent quality guidelines by @leehack in #142
- feat(webgpu): wire bridge state persistence by @leehack in #141
- fix: tolerate missing optional webgpu checksum assets by @leehack in #143
- docs: add production-readiness PR checklist by @leehack in #144
- fix(models): reject remote-only options for local sources by @leehack in #145
- feat(models): improve Hugging Face source ergonomics by @leehack in #146
- chore: prepare v0.6.13 release by @leehack in #147
Full Changelog: v0.6.12...v0.6.13
v0.6.12
What's Changed
- fix(native): test b9016 runtime and expose mainGpu by @leehack in #113
- Filter backend-owned runtime DLLs by @leehack in #115
- feat(api): expose load-time tuning knobs on ModelParams by @thereisnotime in #116
- Sync WebGPU bridge assets to llama.cpp b9016 by @leehack in #117
- Update WebGPU bridge assets to v0.1.14 by @leehack in #118
- Prepare 0.6.12 release by @leehack in #119
New Contributors
- @thereisnotime made their first contribution in #116
Full Changelog: v0.6.11...v0.6.12
v0.6.11
What's Changed
- chore(native): sync native release b8778 by @github-actions[bot] in #106
- fix(gemma4): stream thought channels as thinking by @leehack in #108
- chore(native): sync native release b8955 by @github-actions[bot] in #109
- chore(release): prepare 0.6.11 by @leehack in #110
Full Changelog: v0.6.10...v0.6.11