diff --git a/CHANGELOG.md b/CHANGELOG.md index 9a048de5..812cb10f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,16 @@ ## Unreleased +* **LiteRT-LM LoRA adapters**: + * Added native `.litertlm` LoRA wiring for one LiteRT-LM adapter at scale + `1.0`, including `ModelParams.loras`, `setLora(...)`, + `removeLora(...)`, and `clearLoras()` on native LiteRT-LM contexts. + * Added Dart FFI coverage for + `litert_lm_session_config_set_lora_file`. Generation with a LoRA adapter + requires a matching `litert-lm-native` runtime that exports that C symbol; + older local runtime libraries fail with a focused unsupported-runtime + message. + * Kept multiple weighted adapters on the llama.cpp/GGUF path and kept + LiteRT-LM web rejecting LoRA explicitly. * **Structured output**: * Added `responseFormat` routing to `LlamaEngine.create(...)` for grammar-capable backends, deprecated the legacy `chatTemplate(...)` diff --git a/README.md b/README.md index 616f04c4..1447e7dd 100644 --- a/README.md +++ b/README.md @@ -353,8 +353,13 @@ to unconstrained output. LiteRT-LM web is currently limited to single-turn text prompts through `@litert-lm/core`; it does not yet preserve structured chat history, system prompts, tool declarations, or thinking/tool-call parsing with the same semantics as native. The current -implementation does not expose embeddings, state persistence, LoRA, or -multimodal operations through LiteRT-LM. `ChatSession` uses a conservative +implementation does not expose embeddings, state persistence, or multimodal +operations through LiteRT-LM. Native `.litertlm` LoRA supports one LiteRT-LM +adapter at scale `1.0` through `ModelParams.loras`, `setLora(...)`, +`removeLora(...)`, and `clearLoras()` when the loaded `litert-lm-native` +runtime exports `litert_lm_session_config_set_lora_file`. Multiple weighted +adapters remain llama.cpp/GGUF-only, and LiteRT-LM web does not expose LoRA. +`ChatSession` uses a conservative prompt-size estimate for history pruning only when exact tokenization is unavailable. `LiteRtLmBackendPreference.auto` chooses GPU on Android/macOS/web and CPU on @@ -367,8 +372,8 @@ rejects native-only tuning fields. The native fields cover activation data type, CPU dynamic-model prefill chunk size, parallel `.litertlm` file-section loading, and Android NPU dispatch library directory. llama.cpp-only tuning knobs such as partial GPU layer offload, batch/micro-batch sizing, KV-cache type, -flash attention, mmap/mlock, thread counts, LoRA load configs, and rope -overrides are rejected instead of being silently ignored. `.litertlm` +flash attention, mmap/mlock, thread counts, weighted/multiple LoRA configs, +and rope overrides are rejected instead of being silently ignored. `.litertlm` generation honors `GenerationParams` `maxTokens`, `temp`, `topK`, `topP`, and `seed` on native and web, with `stopSequences` enforced by llamadart. Native LiteRT-LM also honors stream @@ -953,11 +958,17 @@ void dispose() { ## 🎨 Low-Rank Adaptation (LoRA) -`llamadart` supports applying multiple LoRA adapters dynamically at runtime. +`llamadart` supports applying LoRA adapters dynamically at runtime. -- **Dynamic Scaling**: Adjust the strength (`scale`) of each adapter on the fly. +- **Dynamic Scaling**: Adjust the strength (`scale`) of each adapter on GGUF + llama.cpp backends. - **Isolate-Safe**: Native adapters are managed in a background Isolate to prevent UI jank. -- **Efficient**: Multiple LoRAs share the memory of a single base model. +- **Efficient**: Adapter state is tied to the active model context. + +GGUF llama.cpp backends support multiple weighted adapters. Native LiteRT-LM +`.litertlm` currently supports one LiteRT-LM adapter at scale `1.0` and +requires a `litert-lm-native` runtime that exports +`litert_lm_session_config_set_lora_file`. LiteRT-LM web does not expose LoRA. Check out our [LoRA Training Notebook](https://github.com/leehack/llamadart/blob/main/example/training_notebook/lora_training.ipynb) to learn how to train and convert your own adapters. diff --git a/lib/src/backends/litert_lm/litert_lm_runtime.dart b/lib/src/backends/litert_lm/litert_lm_runtime.dart index 433bbf04..e224efa1 100644 --- a/lib/src/backends/litert_lm/litert_lm_runtime.dart +++ b/lib/src/backends/litert_lm/litert_lm_runtime.dart @@ -9,7 +9,9 @@ import 'package:path/path.dart' as path; import '../../core/models/inference/model_params.dart'; -const _litertLmVersion = '0.13.1'; +/// Pinned litert-lm-native runtime version used by bundled native assets. +const liteRtLmNativeRuntimeVersion = '0.13.1'; +const _litertLmVersion = liteRtLmNativeRuntimeVersion; const _litertLmLibDirEnv = 'LLAMADART_LITERT_LM_LIB_DIR'; const _liteRtLmIosNativeAsset = 'package:llamadart/litert_lm_LiteRtLm'; const _processLibraryCandidate = ''; @@ -555,6 +557,7 @@ class LiteRtLmRuntimeClient { List>? messages, List>? tools, Map? extraContext, + String? loraPath, double temperature = 0.8, int topK = 40, double topP = 0.95, @@ -580,7 +583,7 @@ class LiteRtLmRuntimeClient { bindings.sessionConfigSetSamplerParams(sessionConfig, sampler); calloc.free(sampler); } - + Pointer? loraPtr; final systemPtr = systemMessage == null ? nullptr : _systemMessageJson(systemMessage).toNativeUtf8(allocator: calloc); @@ -595,6 +598,25 @@ class LiteRtLmRuntimeClient { : jsonEncode(extraContext).toNativeUtf8(allocator: calloc); Pointer<_LiteRtLmConversationConfig> config = nullptr; try { + if (loraPath != null) { + if (loraPath.trim().isEmpty) { + throw ArgumentError.value( + loraPath, + 'loraPath', + 'must be non-empty when provided', + ); + } + loraPtr = loraPath.toNativeUtf8(allocator: calloc); + final loraSet = bindings.sessionConfigSetLoraFile( + sessionConfig, + loraPtr.cast(), + ); + if (!loraSet) { + throw StateError( + 'litert_lm_session_config_set_lora_file failed for $loraPath', + ); + } + } config = bindings.conversationConfigCreate(); if (config == nullptr) { throw StateError('litert_lm_conversation_config_create returned null'); @@ -638,6 +660,9 @@ class LiteRtLmRuntimeClient { if (extraContextPtr != nullptr) { calloc.free(extraContextPtr); } + if (loraPtr != null) { + calloc.free(loraPtr); + } } } @@ -1391,7 +1416,8 @@ class LiteRtLmRuntimeClient { final envPath = Platform.environment[_litertLmLibDirEnv]; if (envPath != null && envPath.isNotEmpty) { final dir = Directory(envPath); - if (liteRtLmIsCacheDirectoryForAbi(dir, abi)) { + if (liteRtLmIsCacheDirectoryForAbi(dir, abi) || + _liteRtLmOverrideDirectoryForAbi(dir, abi)) { return dir.absolute; } } @@ -1418,6 +1444,11 @@ class LiteRtLmRuntimeClient { return null; } + bool _liteRtLmOverrideDirectoryForAbi(Directory dir, Abi abi) { + final library = _liteRtLmLibraryFileNameForAbi(abi); + return library != null && File('${dir.path}/$library').existsSync(); + } + List _candidateSearchRoots() { final roots = {Directory.current.path}; final scriptPath = Platform.script.toFilePath(); @@ -1991,6 +2022,26 @@ class _LiteRtLmBindings { ) >('litert_lm_session_config_set_sampler_params'); + late final _sessionConfigSetLoraFile = _library + .lookupFunction< + Bool Function(Pointer<_LiteRtLmSessionConfig>, Pointer), + bool Function(Pointer<_LiteRtLmSessionConfig>, Pointer) + >('litert_lm_session_config_set_lora_file'); + + bool sessionConfigSetLoraFile( + Pointer<_LiteRtLmSessionConfig> config, + Pointer path, + ) { + try { + return _sessionConfigSetLoraFile(config, path); + } on ArgumentError catch (error) { + throw UnsupportedError( + 'LiteRT-LM LoRA requires litert-lm-native with ' + 'litert_lm_session_config_set_lora_file. $error', + ); + } + } + late final conversationConfigCreate = _library .lookupFunction< Pointer<_LiteRtLmConversationConfig> Function(), diff --git a/lib/src/backends/litert_lm/litert_lm_runtime_stub.dart b/lib/src/backends/litert_lm/litert_lm_runtime_stub.dart index c2f4fedd..9f10da30 100644 --- a/lib/src/backends/litert_lm/litert_lm_runtime_stub.dart +++ b/lib/src/backends/litert_lm/litert_lm_runtime_stub.dart @@ -96,6 +96,7 @@ class LiteRtLmRuntimeClient { List>? messages, List>? tools, Map? extraContext, + String? loraPath, double temperature = 0.8, int topK = 40, double topP = 0.95, diff --git a/lib/src/backends/litert_lm/litert_lm_service.dart b/lib/src/backends/litert_lm/litert_lm_service.dart index 332e20a7..05b232ac 100644 --- a/lib/src/backends/litert_lm/litert_lm_service.dart +++ b/lib/src/backends/litert_lm/litert_lm_service.dart @@ -8,6 +8,7 @@ import '../../core/models/chat/chat_role.dart'; import '../../core/models/config/flash_attention.dart'; import '../../core/models/config/gpu_backend.dart'; import '../../core/models/config/kv_cache_type.dart'; +import '../../core/models/config/lora_config.dart'; import '../../core/models/config/log_level.dart'; import '../../core/models/inference/generation_params.dart'; import '../../core/models/inference/model_params.dart'; @@ -19,6 +20,10 @@ import 'litert_lm_chat_templates.dart'; import 'litert_lm_platform.dart'; import 'litert_lm_runtime.dart'; +const _liteRtLmLoraLimitMessage = + 'LiteRtLmBackend supports one LiteRT-LM LoRA adapter at scale 1.0. ' + 'Multiple weighted adapters remain llama.cpp/GGUF-only.'; + /// Worker-owned service for the LiteRT-LM backend. /// /// This keeps all LiteRT-LM FFI state inside the backend worker isolate. The @@ -40,6 +45,7 @@ class LiteRtLmService { int _nextContextHandle = 1; int? _modelHandle; int? _contextHandle; + LoraAdapterConfig? _activeLora; LiteRtLmRuntimeMetrics? _lastMetrics; LlamaLogLevel _logLevel = LlamaLogLevel.warn; bool _modelLoaded = false; @@ -82,6 +88,7 @@ class LiteRtLmService { _activeSpeculativeDecoding = null; _modelHandle = _nextModelHandle++; _contextHandle = null; + _activeLora = null; _lastMetrics = null; _cancelRequested = false; _modelLoaded = true; @@ -101,6 +108,7 @@ class LiteRtLmService { _activeSpeculativeDecoding = null; _modelHandle = null; _contextHandle = null; + _activeLora = null; _lastMetrics = null; _cancelRequested = false; _modelLoaded = false; @@ -115,6 +123,7 @@ class LiteRtLmService { _disposeContextRuntimeState(); _modelParams = params; _contextHandle = _nextContextHandle++; + _activeLora = _loraForParams(params); _contextCreated = true; return _contextHandle!; } @@ -124,6 +133,7 @@ class LiteRtLmService { _checkContextHandle(contextHandle); _disposeContextRuntimeState(); _contextHandle = null; + _activeLora = null; _contextCreated = false; } @@ -164,6 +174,7 @@ class LiteRtLmService { topP: params.topP, seed: params.seed ?? _defaultSamplerSeed(), npuBackend: backend == 'npu', + loraPath: _activeLora?.path, ); if (_cancelRequested) { client.cancel(); @@ -273,6 +284,7 @@ class LiteRtLmService { topP: params.topP, seed: params.seed ?? _defaultSamplerSeed(), npuBackend: backend == 'npu', + loraPath: _activeLora?.path, ); if (_cancelRequested) { client.cancel(); @@ -391,7 +403,28 @@ class LiteRtLmService { /// Handles LiteRT-LM LoRA operations. void handleLora(int contextHandle, String? path, double? scale, String op) { _checkContextHandle(contextHandle); - throw UnsupportedError('LiteRtLmBackend does not support LoRA adapters.'); + switch (op) { + case 'set': + if (path == null) { + throw ArgumentError('LiteRT-LM LoRA set requires an adapter path.'); + } + _activeLora = _validateLoraAdapter( + LoraAdapterConfig(path: path, scale: scale ?? 1.0), + ); + case 'remove': + if (path == null) { + throw ArgumentError( + 'LiteRT-LM LoRA remove requires an adapter path.', + ); + } + if (_activeLora?.path == path) { + _activeLora = null; + } + case 'clear': + _activeLora = null; + default: + throw ArgumentError('Unsupported LiteRT-LM LoRA operation: $op'); + } } /// Returns the active backend name. @@ -504,6 +537,7 @@ class LiteRtLmService { _activeBackend = null; _modelHandle = null; _contextHandle = null; + _activeLora = null; _modelLoaded = false; _contextCreated = false; } @@ -709,6 +743,13 @@ class LiteRtLmService { params.validate(); final unsupported = []; + String? loraError; + try { + _loraForParams(params); + } on ArgumentError catch (error) { + unsupported.add('loras'); + loraError = error.message.toString(); + } if (params.contextSize <= 0) { unsupported.add('contextSize=${params.contextSize}'); } @@ -721,9 +762,6 @@ class LiteRtLmService { if (params.mainGpu != 0) { unsupported.add('mainGpu'); } - if (params.loras.isNotEmpty) { - unsupported.add('loras'); - } if (params.numberOfThreads != 0) { unsupported.add('numberOfThreads'); } @@ -773,10 +811,39 @@ class LiteRtLmService { 'contextSize, chatTemplate, preferredBackend, all-or-CPU gpuLayers ' 'hints, liteRtLmBackend for explicit CPU/GPU/NPU selection, ' 'liteRtLmActivationDataType, liteRtLmPrefillChunkSize, ' - 'liteRtLmParallelFileSectionLoading, and liteRtLmDispatchLibDir.', + 'liteRtLmParallelFileSectionLoading, liteRtLmDispatchLibDir, and ' + 'one LiteRT-LM LoRA adapter at scale 1.0.' + '${loraError == null ? '' : ' $loraError'}', ); } + LoraAdapterConfig? _loraForParams(ModelParams params) { + if (params.loras.isEmpty) { + return null; + } + if (params.loras.length > 1) { + throw ArgumentError(_liteRtLmLoraLimitMessage); + } + return _validateLoraAdapter(params.loras.single); + } + + LoraAdapterConfig _validateLoraAdapter(LoraAdapterConfig adapter) { + if (adapter.scale != 1.0) { + throw ArgumentError( + '$_liteRtLmLoraLimitMessage Requested scale: ${adapter.scale}.', + ); + } + if (adapter.path.trim().isEmpty) { + throw ArgumentError('LiteRT-LM LoRA adapter path must be non-empty.'); + } + if (!File(adapter.path).existsSync()) { + throw ArgumentError( + 'LiteRT-LM LoRA adapter does not exist: ${adapter.path}', + ); + } + return adapter; + } + void _validateContextBackendParams(ModelParams params) { final requestedBackend = _explicitContextBackendName(params); if (requestedBackend == null) { diff --git a/test/unit/backends/litert_lm/litert_lm_backend_test.dart b/test/unit/backends/litert_lm/litert_lm_backend_test.dart index ad1a8fe2..46494a28 100644 --- a/test/unit/backends/litert_lm/litert_lm_backend_test.dart +++ b/test/unit/backends/litert_lm/litert_lm_backend_test.dart @@ -228,10 +228,12 @@ void main() { } }); - test('rejects unsupported load and llama.cpp-specific operations', () async { + test('supports LoRA state and rejects unsupported operations', () async { final backend = LiteRtLmBackend(); final wrongFormat = File('${tempDir.path}/model.gguf'); await wrongFormat.writeAsString('fake model'); + final adapterFile = File('${tempDir.path}/adapter.lora'); + await adapterFile.writeAsBytes(const [1, 2, 3]); try { await expectLater( @@ -255,17 +257,12 @@ void main() { const ModelParams(), ); - expect( - () => backend.setLoraAdapter(contextHandle, 'adapter.bin', 1.0), - throwsUnsupportedError, - ); - expect( - () => backend.removeLoraAdapter(contextHandle, 'adapter.bin'), - throwsUnsupportedError, - ); - expect( - () => backend.clearLoraAdapters(contextHandle), - throwsUnsupportedError, + await backend.setLoraAdapter(contextHandle, adapterFile.path, 1.0); + await backend.removeLoraAdapter(contextHandle, adapterFile.path); + await backend.clearLoraAdapters(contextHandle); + await expectLater( + backend.setLoraAdapter(contextHandle, adapterFile.path, 0.5), + throwsArgumentError, ); await expectLater( backend.multimodalContextCreate(handle, 'mmproj.bin'), diff --git a/test/unit/backends/litert_lm/litert_lm_service_test.dart b/test/unit/backends/litert_lm/litert_lm_service_test.dart index eac59502..46cff357 100644 --- a/test/unit/backends/litert_lm/litert_lm_service_test.dart +++ b/test/unit/backends/litert_lm/litert_lm_service_test.dart @@ -517,7 +517,7 @@ void main() { const ModelParams( splitMode: ModelSplitMode.none, mainGpu: 1, - loras: [LoraAdapterConfig(path: 'adapter.bin')], + loras: [LoraAdapterConfig(path: 'adapter.bin', scale: 0.5)], numberOfThreads: 2, numberOfThreadsBatch: 3, microBatchSize: 64, @@ -535,24 +535,28 @@ void main() { isA().having( (error) => error.message.toString(), 'message', - predicate( - (message) => const [ - 'splitMode', - 'mainGpu', - 'loras', - 'numberOfThreads', - 'numberOfThreadsBatch', - 'microBatchSize', - 'maxParallelSequences', - 'useMmap=false', - 'flashAttention', - 'cacheTypeK', - 'cacheTypeV', - 'kvUnified', - 'ropeFrequencyBase', - 'ropeFrequencyScale', - ].every(message.contains), - 'contains every unsupported ModelParams field', + allOf( + predicate( + (message) => const [ + 'splitMode', + 'mainGpu', + 'loras', + 'numberOfThreads', + 'numberOfThreadsBatch', + 'microBatchSize', + 'maxParallelSequences', + 'useMmap=false', + 'flashAttention', + 'cacheTypeK', + 'cacheTypeV', + 'kvUnified', + 'ropeFrequencyBase', + 'ropeFrequencyScale', + ].every(message.contains), + 'contains every unsupported ModelParams field', + ), + contains('scale 1.0'), + contains('Requested scale: 0.5'), ), ), ), @@ -628,16 +632,62 @@ void main() { expect( () => service.handleLora(contextHandle, 'adapter.bin', 1.0, 'set'), - throwsUnsupportedError, + throwsA( + isA().having( + (error) => error.message.toString(), + 'message', + contains('does not exist'), + ), + ), ); expect( - () => - service.handleLora(contextHandle, 'adapter.bin', null, 'remove'), - throwsUnsupportedError, + () => service.handleLora(contextHandle, 'adapter.bin', 0.5, 'set'), + throwsA( + isA().having( + (error) => error.message.toString(), + 'message', + allOf(contains('scale 1.0'), contains('0.5')), + ), + ), + ); + + final adapterFile = File('${tempDir.path}/adapter.lora'); + await adapterFile.writeAsBytes(const [1, 2, 3]); + expect( + () => service.handleLora(contextHandle, adapterFile.path, 1.0, 'set'), + returnsNormally, + ); + expect( + () => service.handleLora( + contextHandle, + '${tempDir.path}/other.lora', + null, + 'remove', + ), + returnsNormally, + ); + expect( + () => service.handleLora( + contextHandle, + adapterFile.path, + null, + 'remove', + ), + returnsNormally, ); expect( () => service.handleLora(contextHandle, null, null, 'clear'), - throwsUnsupportedError, + returnsNormally, + ); + expect( + () => service.handleLora(contextHandle, null, null, 'unknown'), + throwsA( + isA().having( + (error) => error.message.toString(), + 'message', + contains('Unsupported LiteRT-LM LoRA operation'), + ), + ), ); await expectLater( service.generate( @@ -645,7 +695,13 @@ void main() { 'hello', const GenerationParams(grammar: 'root ::= "x"'), ), - emitsError(isA()), + emitsError( + isA().having( + (error) => error.message.toString(), + 'message', + contains('GenerationParams: grammar'), + ), + ), ); } finally { service.dispose(); @@ -653,6 +709,42 @@ void main() { }, ); + test('passes LiteRT-LM LoRA path to native conversations', () async { + final fakeClient = _FakeLiteRtLmRuntimeClient(); + final service = LiteRtLmService(clientFactory: () => fakeClient); + final adapterFile = File('${tempDir.path}/conversation.lora'); + await adapterFile.writeAsBytes(const [1, 2, 3]); + + try { + final modelHandle = await service.loadModel( + modelFile.path, + const ModelParams(preferredBackend: GpuBackend.cpu), + ); + final contextHandle = service.createContext( + modelHandle, + ModelParams( + preferredBackend: GpuBackend.cpu, + loras: [LoraAdapterConfig(path: adapterFile.path)], + ), + ); + + fakeClient.generatedOverride = Stream.value('ok'); + final chunks = await service + .generate( + contextHandle, + 'hello', + const GenerationParams(maxTokens: 4), + ) + .map(utf8.decode) + .toList(); + + expect(chunks, ['ok']); + expect(fakeClient.lastLoraPath, adapterFile.path); + } finally { + service.dispose(); + } + }); + test('rejects media parts before native runtime initialization', () async { final fakeClient = _FakeLiteRtLmRuntimeClient(); final service = LiteRtLmService(clientFactory: () => fakeClient); @@ -1721,6 +1813,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { final Completer initializeStarted = Completer(); final Completer generateStarted = Completer(); final StreamController generated = StreamController(); + Stream? generatedOverride; final Completer? _initializeBlocker; final Object? initializeError; String? lastModelPath; @@ -1744,6 +1837,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { List>? lastMessages; List>? lastTools; Map? lastExtraContext; + String? lastLoraPath; String? lastTokenizeText; bool? lastTokenizeAddSpecial; List? lastDetokenizeTokens; @@ -1813,6 +1907,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { List>? messages, List>? tools, Map? extraContext, + String? loraPath, double temperature = 0.8, int topK = 40, double topP = 0.95, @@ -1833,6 +1928,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { lastExtraContext = extraContext == null ? null : Map.from(extraContext); + lastLoraPath = loraPath; createConversationCount += 1; onCreateConversation?.call(); } @@ -1859,7 +1955,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { if (!generateStarted.isCompleted) { generateStarted.complete(); } - return generated.stream; + return generatedOverride ?? generated.stream; } @override @@ -1876,7 +1972,7 @@ class _FakeLiteRtLmRuntimeClient extends LiteRtLmRuntimeClient { if (!generateStarted.isCompleted) { generateStarted.complete(); } - return generated.stream; + return generatedOverride ?? generated.stream; } @override diff --git a/test/unit/backends/litert_lm/worker_test.dart b/test/unit/backends/litert_lm/worker_test.dart index 7dbe626b..bda1e1a4 100644 --- a/test/unit/backends/litert_lm/worker_test.dart +++ b/test/unit/backends/litert_lm/worker_test.dart @@ -184,14 +184,7 @@ void main() { (sendPort) => LiteRtLmLoraRequest(contextHandle, 'clear', sendPort: sendPort), ); - expect( - clearLora, - isA().having( - (response) => response.kind, - 'kind', - 'unsupported', - ), - ); + expect(clearLora, isA()); final specialDetokenize = await _sendRequest( worker.sendPort, diff --git a/test/unit/backends/native/native_backend_test.dart b/test/unit/backends/native/native_backend_test.dart index cc2a5930..4be3145c 100644 --- a/test/unit/backends/native/native_backend_test.dart +++ b/test/unit/backends/native/native_backend_test.dart @@ -644,13 +644,15 @@ void main() { }); test( - 'high-level engine rejects LoRA operations for litertlm bundles', + 'high-level engine supports LoRA operations for litertlm bundles', () async { final tempDir = await Directory.systemTemp.createTemp( 'llamadart_native_auto_lora_litert_', ); final modelFile = File('${tempDir.path}/gemma-4-E2B-it.litertlm'); await modelFile.writeAsString('fake model'); + final adapterFile = File('${tempDir.path}/adapter.lora'); + await adapterFile.writeAsBytes(const [1, 2, 3]); final engine = LlamaEngine(LlamaBackend()); try { @@ -659,18 +661,9 @@ void main() { modelParams: const ModelParams(preferredBackend: GpuBackend.cpu), ); - await expectLater( - engine.setLora('adapter.bin'), - throwsA(isA()), - ); - await expectLater( - engine.removeLora('adapter.bin'), - throwsA(isA()), - ); - await expectLater( - engine.clearLoras(), - throwsA(isA()), - ); + await engine.setLora(adapterFile.path); + await engine.removeLora(adapterFile.path); + await engine.clearLoras(); } finally { await engine.dispose(); await tempDir.delete(recursive: true); diff --git a/tool/litert_lm_engine_smoke.dart b/tool/litert_lm_engine_smoke.dart index 798e4ff8..e4320f3f 100644 --- a/tool/litert_lm_engine_smoke.dart +++ b/tool/litert_lm_engine_smoke.dart @@ -5,6 +5,8 @@ import 'package:llamadart/llamadart.dart'; const _defaultPrompt = 'What is 2+2? Answer only with the number.'; +enum _LoraMode { params, set } + Future main(List args) async { final modelPath = args.isNotEmpty ? args[0] : _env('LITERT_LM_MODEL'); if (modelPath == null || modelPath.trim().isEmpty) { @@ -15,7 +17,9 @@ Future main(List args) async { 'Optional env: LITERT_LM_ACTIVATION_DATA_TYPE=float32|float16|int16|int8, ' 'LITERT_LM_PREFILL_CHUNK_SIZE=, ' 'LITERT_LM_PARALLEL_FILE_SECTION_LOADING=true|false, ' - 'LITERT_LM_DISPATCH_LIB_DIR=', + 'LITERT_LM_DISPATCH_LIB_DIR=, ' + 'LITERT_LM_LORA=, ' + 'LITERT_LM_LORA_MODE=params|set', ); exitCode = 64; return; @@ -39,6 +43,8 @@ Future main(List args) async { 'LITERT_LM_PARALLEL_FILE_SECTION_LOADING', ); final dispatchLibDir = _env('LITERT_LM_DISPATCH_LIB_DIR'); + final loraPath = _env('LITERT_LM_LORA'); + final loraMode = _parseLoraMode(_env('LITERT_LM_LORA_MODE')); final backend = _parseBackend(backendArg); final engine = LlamaEngine(LlamaBackend()); @@ -53,9 +59,15 @@ Future main(List args) async { liteRtLmPrefillChunkSize: prefillChunkSize, liteRtLmParallelFileSectionLoading: parallelFileSectionLoading, liteRtLmDispatchLibDir: dispatchLibDir, + loras: loraPath == null || loraMode == _LoraMode.set + ? const [] + : [LoraAdapterConfig(path: loraPath)], ), ); loadSw.stop(); + if (loraPath != null && loraMode == _LoraMode.set) { + await engine.setLora(loraPath); + } final promptTokens = await engine.tokenize(prompt, addSpecial: false); final promptTokensWithSpecial = await engine.tokenize( @@ -84,6 +96,8 @@ Future main(List args) async { 'liteRtLmPrefillChunkSize': prefillChunkSize, 'liteRtLmParallelFileSectionLoading': parallelFileSectionLoading, 'liteRtLmDispatchLibDir': dispatchLibDir, + 'loraPath': loraPath, + 'loraMode': loraPath == null ? null : loraMode.name, 'targetDecodeTokens': outputTokens, 'promptTokenCount': promptTokens.length, 'promptTokenCountWithSpecial': promptTokensWithSpecial.length, @@ -166,6 +180,24 @@ bool? _parseOptionalBool(String? value, String name) { } } +_LoraMode _parseLoraMode(String? value) { + if (value == null) { + return _LoraMode.params; + } + switch (value.trim().toLowerCase()) { + case 'params': + return _LoraMode.params; + case 'set': + return _LoraMode.set; + default: + throw ArgumentError.value( + value, + 'LITERT_LM_LORA_MODE', + 'Expected params or set.', + ); + } +} + LiteRtLmBackendPreference _parseBackend(String value) { switch (value.trim().toLowerCase()) { case 'auto': diff --git a/website/docs/changelog/recent-releases.md b/website/docs/changelog/recent-releases.md index 34bfa945..16286ebe 100644 --- a/website/docs/changelog/recent-releases.md +++ b/website/docs/changelog/recent-releases.md @@ -9,6 +9,15 @@ For canonical full release notes, use: ## Unreleased +- Added native `.litertlm` LoRA wiring for one LiteRT-LM adapter at scale + `1.0`, including `ModelParams.loras`, `setLora(...)`, `removeLora(...)`, and + `clearLoras()` on native LiteRT-LM contexts. +- Added Dart FFI coverage for `litert_lm_session_config_set_lora_file`. + Generation with a LoRA adapter requires a matching `litert-lm-native` + runtime that exports that C symbol; older local runtime libraries fail with a + focused unsupported-runtime message. +- Kept multiple weighted adapters on the llama.cpp/GGUF path and kept + LiteRT-LM web rejecting LoRA explicitly. - Added opt-in native `.litertlm` `ModelParams` for activation data type, prefill chunk size, parallel file-section loading, and Android NPU LiteRT dispatch library directory, forwarding the pinned LiteRT-LM `v0.13.1` diff --git a/website/docs/guides/backend-selection.md b/website/docs/guides/backend-selection.md index 1d24629c..f8bed1a9 100644 --- a/website/docs/guides/backend-selection.md +++ b/website/docs/guides/backend-selection.md @@ -72,7 +72,7 @@ JavaScript runtime. | Web | llama.cpp WebGPU/CPU bridge for GGUF URLs | `@litert-lm/core` for web-compatible `.litertlm` URLs | | Embeddings | Supported on native; supported on web bridge assets with embedding APIs | Not exposed by current LiteRT-LM APIs | | KV-cache state persistence | Supported on native; supported on WebGPU bridge assets that expose state APIs | Not exposed | -| LoRA adapters | Supported on native GGUF flows | Not exposed | +| LoRA adapters | Supported on native GGUF flows, including multiple weighted adapters | Native only: one LiteRT-LM adapter at scale `1.0` with a compatible `litert-lm-native` runtime; not exposed on LiteRT-LM web | | Thinking and tool-call parsing | Supported through template handlers | Native: supported through the high-level `LlamaEngine` parser for compatible templates; LiteRT-native constrained tool execution is not wired yet. Web: single-turn text only; no structured chat/tool forwarding yet. | | Grammar / constrained decoding | Supported by llama.cpp-backed paths | llama.cpp GBNF is not supported; template-generated tool grammar is skipped, strict `responseFormat` requests fail early, and explicit grammar params are rejected | | Multimodal projectors | Supported through llama.cpp `mtmd` paths where the model/projector supports it | Not exposed through llamadart today | @@ -124,7 +124,7 @@ For GGUF / llama.cpp, common load-time controls include: - `numberOfThreads` / `numberOfThreadsBatch` - `batchSize` / `microBatchSize` - `splitMode` / `mainGpu` -- LoRA and state-persistence APIs +- Multiple weighted LoRA adapters and state-persistence APIs For `.litertlm` / LiteRT-LM, use: @@ -136,6 +136,11 @@ For `.litertlm` / LiteRT-LM, use: - `liteRtLmParallelFileSectionLoading`: native `.litertlm` file-section loading override - `liteRtLmDispatchLibDir`: Android NPU LiteRT dispatch library directory +- `ModelParams.loras`: one LiteRT-LM adapter at scale `1.0` on native + LiteRT-LM when the runtime exports + `litert_lm_session_config_set_lora_file` +- `setLora(...)`, `removeLora(...)`, and `clearLoras()` with the same + one-adapter, scale-`1.0` native LiteRT-LM limit - `GenerationParams.maxTokens`, `temp`, `topK`, `topP`, and `seed` - `GenerationParams.speculativeDecoding` on native LiteRT-LM only - `stopSequences`, enforced by `llamadart` diff --git a/website/docs/guides/lora-adapters.md b/website/docs/guides/lora-adapters.md index 87c56200..ba59ea3a 100644 --- a/website/docs/guides/lora-adapters.md +++ b/website/docs/guides/lora-adapters.md @@ -53,6 +53,9 @@ await engine.setLora('/models/lora/domain.gguf', scale: 0.70); - Use `removeLora(path)` to disable one adapter. - Use `clearLoras()` to reset to base model behavior. +Stacking and non-`1.0` scales are GGUF llama.cpp features. Native LiteRT-LM +`.litertlm` currently accepts one LiteRT-LM adapter at scale `1.0`. + ## Training your own LoRA adapters For end-to-end training + conversion, start with the official notebook: @@ -88,13 +91,21 @@ Practical compatibility checks: ## Platform notes -- Native backends implement runtime LoRA operations. -- Web bridge runtime currently exposes no-op LoRA operations in this release; - do not assume LoRA effect on web targets yet. +- Native llama.cpp/GGUF backends implement runtime LoRA operations with + multiple weighted adapters. +- Native LiteRT-LM `.litertlm` supports one LiteRT-LM adapter at scale `1.0` + through `ModelParams.loras` or the runtime LoRA APIs when the loaded + `litert-lm-native` library exports + `litert_lm_session_config_set_lora_file`. +- Web bridge and LiteRT-LM web runtimes do not expose LoRA effects. ## Troubleshooting - If `setLora(...)` fails, verify the adapter path is accessible at runtime. - Ensure adapter/base-model compatibility (architecture/family alignment). +- If a `.litertlm` model reports that the LoRA C symbol is missing, use a + `litert-lm-native` build that exports + `litert_lm_session_config_set_lora_file` or switch to a GGUF model on a + llama.cpp backend. - When behavior seems unchanged, confirm you are testing on a native target and not a web fallback path. diff --git a/website/docs/platforms/support-matrix.md b/website/docs/platforms/support-matrix.md index 41ed3620..723bd394 100644 --- a/website/docs/platforms/support-matrix.md +++ b/website/docs/platforms/support-matrix.md @@ -130,8 +130,12 @@ load `.litertlm` models. | Windows x64 | `windows-x64` | `cpu` | Supported | | Web (browser) | N/A (`@litert-lm/core`) | `cpu`, `gpu` | Experimental; web-compatible `.litertlm` URLs only | -LiteRT-LM does not currently expose embeddings, state persistence, LoRA, or -multimodal projector APIs through llamadart. On native LiteRT-LM targets, +LiteRT-LM does not currently expose embeddings, state persistence, or +multimodal projector APIs through llamadart. Native `.litertlm` LoRA supports +one LiteRT-LM adapter at scale `1.0` when the loaded `litert-lm-native` +runtime exports `litert_lm_session_config_set_lora_file`; multiple weighted +adapters remain llama.cpp/GGUF-only, and LiteRT-LM web does not expose LoRA. On +native LiteRT-LM targets, high-level thinking and tool-call parsing still run through `LlamaEngine` for compatible templates, but llama.cpp-style GBNF grammar constraints are not supported for `.litertlm` generation. Native LiteRT-LM can opt into runtime