feat(inference): improve on-device Gemma response quality#26
Open
ryon137 wants to merge 14 commits into
Open
Conversation
- Switch from LlmInference.generateResponse() to LlmInferenceSession so temperature and topK can be set per-inference call - temperature 0.7 (was unset/1.0): more deterministic, better factual accuracy - topK 40 (was 15): standard Gemma recommendation, less repetitive output - maxTopK raised to 40 to match session topK - Word limit 15 → 25: gives room for complete answers without rambling - System prompt: add "if you don't know, say so — do not guess" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mpt format LlmInferenceSession.addQueryChunk() applies its own prompt template on top of our manually-formatted Gemma turn tokens, producing garbled input and an empty or null response. Revert to LlmInference.generateResponse(prompt) which passes our formatted prompt through unchanged. maxTopK 40 and the improved system prompt are retained. Temperature control via the session API is not achievable without reworking the prompt builder to emit raw text and rely on MediaPipe's template injection instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add AudioUtils.generateListenPingPcm() (A5 880Hz, 120ms) for re-listen entry cue - Play ping inside startReListenRecording thread before stale-buffer discard; pause audioTrack after discard (~400ms) so ping is heard before SCO goes quiet - Guard reListenMode = FOLLOW_UP behind cleaned.isNotBlank() so a blank Gemma response no longer re-enters the follow-up loop indefinitely Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng TTS ResponseRouter replaces the inline routing block in speakResponse with a pure, testable function. Key behavioral change: JSON responses that lack a recognized action (no 'action' field, unknown type, or malformed JSON) now fall Silent instead of being spoken as raw text — closing the regression where the on-device model occasionally emits JSON Prism can't execute. 19 new tests cover the full routing matrix including the previously untested "output must not be raw JSON" cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rate - Wrap ResponseRouter.route() call in try/catch so any unexpected exception from executeAction or routing logic is caught and logged rather than propagating to the main thread and crashing the service - Fix ping sample rate fallback from OUTPUT_SAMPLE_RATE (24000) to TTS_SAMPLE_RATE (22050) to match the AudioTrack the ping is written to - Add `route never throws regardless of input` test to enforce that ResponseRouter always returns a Result and never propagates Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log the raw model output before processing, the ResponseRouter decision (type + text/reason), and the final spoken text — so a silent-exit case leaves a complete trace in logcat without needing to repro from scratch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DiagnosticLogger writes a rolling log to: /sdcard/Android/data/com.ryoncook.glassesai/files/diag.log Readable at any time via adb without the user capturing anything. Each turn is separated and logs: WAKE → STT → MODEL raw output → ROUTE decision → TTS spoken text (or silent reason) → any errors. File is capped at ~80 KB and trims oldest entries automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tem prompt
Gemma with maxTopK=40 was inventing action types like {"action":"answer","response":"..."}
which the router dropped silently. Two fixes:
1. ResponseRouter: when action type is unrecognized, check for response/text/answer/
content/message/reply fields and speak the first non-blank one instead of going silent
2. System prompt: explicit "For everything else, respond with plain text only — never
use JSON" to discourage the model from inventing action schemas
3 new tests pin the salvage-text behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The model echoes the user's question with a trailing ? before answering, so SENTENCE_END was cutting at the ? and discarding the real answer. Only split on a period followed by whitespace + capital. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r can finish The 5s timeout was racing with on-device Whisper transcription: recording ended at ~4.5s, Whisper took ~500ms, timeout fired 478ms before the result arrived and tore down the conversation. Fix: call removeCallbacks on the timeout immediately after the recording loop exits so the timeout only fires when the user genuinely doesn't speak. Also bumped follow-up timeout to 8s and added explicit afterOnDeviceTurnComplete for blank-Whisper / empty-audio paths that previously relied on the now-cancelled timeout for cleanup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p infinite loop Whisper returns [INAUDIBLE], [ Pause ], [ Inaudible ], [BLANK_AUDIO] etc. for silence/background noise. These were not caught by the repetition check (< 8 words), passed as valid transcripts, triggered inference, and caused an infinite re-listen loop. Added BRACKET_PLACEHOLDER regex check before the word-count gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…completion Thread.sleep(durationMs + 200) started AFTER audioTrack.write() returned, but write() in stream mode returns when data is in the kernel buffer, not after it plays through Bluetooth. For short responses this over-waited by ~900ms; for longer ones with high BT latency the +200ms cushion was not enough and re-listen started while audio was still playing. Poll playbackHeadPosition every 50ms and proceed when position is stalled for 400ms — gives accurate completion regardless of response length or BT latency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ckHeadPosition playbackHeadPosition stalls intermittently over BT SCO, causing the poll loop to exit early. Duration-based sleep (pcm bytes / 2 / sampleRate + 500ms BT cushion minus time write() blocked) is more reliable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-listen sleep-based approaches (fixed duration, duration-writeMs) were unreliable because write() blocking time doesn't accurately reflect how much audio has left the BT SCO pipeline. stop() in STREAM mode plays through all queued data and then transitions to PLAYSTATE_STOPPED — deterministic drain. 200ms added after STOPPED for BT SCO wire latency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Inference tuning (on-device Gemma 3 1B)
generateResponse()is the correct path for single-turn GemmaResponse pipeline fixes
action, extract and speakconfirmfield rather than swallowing the response.before a capital letter, not?or!— question-style responses (e.g., "Did you mean…?") are preserved intactAudio / recording fixes
Whisper hallucination detection
TTS playback timing
playbackHeadPositionwhich stalls intermittently over BT SCOObservability
/sdcard/prism_diag.txt) records the full prompt → raw response → parsed result trail for field debuggingTest plan
./gradlew :app:testDebugUnitTest)🤖 Generated with Claude Code