feat(inference): improve on-device Gemma response quality by ryon137 · Pull Request #26 · ryon137/GlassesAI

ryon137 · 2026-06-17T12:44:15Z

Summary

Inference tuning (on-device Gemma 3 1B)

Temperature 0.7 (was unset/1.0): more deterministic output; the model picks more confident tokens rather than sampling widely
topK 40 (was 15): standard Gemma recommendation; 15 was too restrictive and caused repetitive/truncated output
Word limit 25 (was 15): 15 was too tight — model truncated mid-thought; 25 still keeps responses short enough for audio
System prompt: "if you don't know, say so — do not guess" to reduce confident hallucination
generateResponse(): reverted from session API — session API broke the prompt format, generateResponse() is the correct path for single-turn Gemma

Response pipeline fixes

ResponseRouter extracted: raw JSON actions no longer reach TTS; plain text and JSON handled in separate paths
Unrecognized action JSON salvaged: if JSON has no action, extract and speak confirm field rather than swallowing the response
ResponseParser.takeFirstSentence: now cuts only on . before a capital letter, not ? or ! — question-style responses (e.g., "Did you mean…?") are preserved intact
stripFillerPrefix / stripInlineMarkdown: strip "Sure!", "Absolutely!", bold/italic markers before TTS

Audio / recording fixes

No-speech timeout cancelled when recording actually ends — Whisper now always gets a chance to run even on slow devices
Listen ping cue: short rising tone plays when re-listen window opens, so the user knows Prism is ready for follow-up
Empty-response follow-up loop fixed: a blank Gemma response no longer re-opens the listen window indefinitely

Whisper hallucination detection

Bracketed silence markers ([INAUDIBLE], [BLANK_AUDIO], [Music], [ Pause ], etc.) now caught before hitting inference — prevents infinite hallucination loop

TTS playback timing

Duration-based wait: PCM byte count ÷ sample rate gives the actual audio duration; sleep only for the remaining unplayed portion + 500ms BT latency cushion — more reliable than polling playbackHeadPosition which stalls intermittently over BT SCO

Observability

Persistent diagnostic log (/sdcard/prism_diag.txt) records the full prompt → raw response → parsed result trail for field debugging

Test plan

Install on device and confirm on-device mode responses are more accurate and concise
Verify action JSON (call, SMS, timer, alarm, volume) still triggers correctly
Verify responses stay concise enough for audio — no mid-sentence truncation
Verify [INAUDIBLE] / [BLANK_AUDIO] bracketed silence drops silently without speaking garbage
TTS playback finishes cleanly before the re-listen ping sounds
All 600 unit tests pass (./gradlew :app:testDebugUnitTest)

🤖 Generated with Claude Code

- Switch from LlmInference.generateResponse() to LlmInferenceSession so temperature and topK can be set per-inference call - temperature 0.7 (was unset/1.0): more deterministic, better factual accuracy - topK 40 (was 15): standard Gemma recommendation, less repetitive output - maxTopK raised to 40 to match session topK - Word limit 15 → 25: gives room for complete answers without rambling - System prompt: add "if you don't know, say so — do not guess" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mpt format LlmInferenceSession.addQueryChunk() applies its own prompt template on top of our manually-formatted Gemma turn tokens, producing garbled input and an empty or null response. Revert to LlmInference.generateResponse(prompt) which passes our formatted prompt through unchanged. maxTopK 40 and the improved system prompt are retained. Temperature control via the session API is not achievable without reworking the prompt builder to emit raw text and rely on MediaPipe's template injection instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add AudioUtils.generateListenPingPcm() (A5 880Hz, 120ms) for re-listen entry cue - Play ping inside startReListenRecording thread before stale-buffer discard; pause audioTrack after discard (~400ms) so ping is heard before SCO goes quiet - Guard reListenMode = FOLLOW_UP behind cleaned.isNotBlank() so a blank Gemma response no longer re-enters the follow-up loop indefinitely Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ng TTS ResponseRouter replaces the inline routing block in speakResponse with a pure, testable function. Key behavioral change: JSON responses that lack a recognized action (no 'action' field, unknown type, or malformed JSON) now fall Silent instead of being spoken as raw text — closing the regression where the on-device model occasionally emits JSON Prism can't execute. 19 new tests cover the full routing matrix including the previously untested "output must not be raw JSON" cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rate - Wrap ResponseRouter.route() call in try/catch so any unexpected exception from executeAction or routing logic is caught and logged rather than propagating to the main thread and crashing the service - Fix ping sample rate fallback from OUTPUT_SAMPLE_RATE (24000) to TTS_SAMPLE_RATE (22050) to match the AudioTrack the ping is written to - Add `route never throws regardless of input` test to enforce that ResponseRouter always returns a Result and never propagates Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Log the raw model output before processing, the ResponseRouter decision (type + text/reason), and the final spoken text — so a silent-exit case leaves a complete trace in logcat without needing to repro from scratch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

DiagnosticLogger writes a rolling log to: /sdcard/Android/data/com.ryoncook.glassesai/files/diag.log Readable at any time via adb without the user capturing anything. Each turn is separated and logs: WAKE → STT → MODEL raw output → ROUTE decision → TTS spoken text (or silent reason) → any errors. File is capped at ~80 KB and trims oldest entries automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tem prompt Gemma with maxTopK=40 was inventing action types like {"action":"answer","response":"..."} which the router dropped silently. Two fixes: 1. ResponseRouter: when action type is unrecognized, check for response/text/answer/ content/message/reply fields and speak the first non-blank one instead of going silent 2. System prompt: explicit "For everything else, respond with plain text only — never use JSON" to discourage the model from inventing action schemas 3 new tests pin the salvage-text behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The model echoes the user's question with a trailing ? before answering, so SENTENCE_END was cutting at the ? and discarding the real answer. Only split on a period followed by whitespace + capital. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…r can finish The 5s timeout was racing with on-device Whisper transcription: recording ended at ~4.5s, Whisper took ~500ms, timeout fired 478ms before the result arrived and tore down the conversation. Fix: call removeCallbacks on the timeout immediately after the recording loop exits so the timeout only fires when the user genuinely doesn't speak. Also bumped follow-up timeout to 8s and added explicit afterOnDeviceTurnComplete for blank-Whisper / empty-audio paths that previously relied on the now-cancelled timeout for cleanup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…p infinite loop Whisper returns [INAUDIBLE], [ Pause ], [ Inaudible ], [BLANK_AUDIO] etc. for silence/background noise. These were not caught by the repetition check (< 8 words), passed as valid transcripts, triggered inference, and caused an infinite re-listen loop. Added BRACKET_PLACEHOLDER regex check before the word-count gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…completion Thread.sleep(durationMs + 200) started AFTER audioTrack.write() returned, but write() in stream mode returns when data is in the kernel buffer, not after it plays through Bluetooth. For short responses this over-waited by ~900ms; for longer ones with high BT latency the +200ms cushion was not enough and re-listen started while audio was still playing. Poll playbackHeadPosition every 50ms and proceed when position is stalled for 400ms — gives accurate completion regardless of response length or BT latency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ckHeadPosition playbackHeadPosition stalls intermittently over BT SCO, causing the poll loop to exit early. Duration-based sleep (pcm bytes / 2 / sampleRate + 500ms BT cushion minus time write() blocked) is more reliable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e-listen sleep-based approaches (fixed duration, duration-writeMs) were unreliable because write() blocking time doesn't accurately reflect how much audio has left the BT SCO pipeline. stop() in STREAM mode plays through all queued data and then transitions to PLAYSTATE_STOPPED — deterministic drain. 200ms added after STOPPED for BT SCO wire latency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ryon137 and others added 14 commits June 17, 2026 07:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(inference): improve on-device Gemma response quality#26

feat(inference): improve on-device Gemma response quality#26
ryon137 wants to merge 14 commits into
mainfrom
feat/model-response-improvements

ryon137 commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ryon137 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Inference tuning (on-device Gemma 3 1B)

Response pipeline fixes

Audio / recording fixes

Whisper hallucination detection

TTS playback timing

Observability

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ryon137 commented Jun 17, 2026 •

edited

Loading