Skip to content

feat(inference): improve on-device Gemma response quality#26

Open
ryon137 wants to merge 14 commits into
mainfrom
feat/model-response-improvements
Open

feat(inference): improve on-device Gemma response quality#26
ryon137 wants to merge 14 commits into
mainfrom
feat/model-response-improvements

Conversation

@ryon137

@ryon137 ryon137 commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

Inference tuning (on-device Gemma 3 1B)

  • Temperature 0.7 (was unset/1.0): more deterministic output; the model picks more confident tokens rather than sampling widely
  • topK 40 (was 15): standard Gemma recommendation; 15 was too restrictive and caused repetitive/truncated output
  • Word limit 25 (was 15): 15 was too tight — model truncated mid-thought; 25 still keeps responses short enough for audio
  • System prompt: "if you don't know, say so — do not guess" to reduce confident hallucination
  • generateResponse(): reverted from session API — session API broke the prompt format, generateResponse() is the correct path for single-turn Gemma

Response pipeline fixes

  • ResponseRouter extracted: raw JSON actions no longer reach TTS; plain text and JSON handled in separate paths
  • Unrecognized action JSON salvaged: if JSON has no action, extract and speak confirm field rather than swallowing the response
  • ResponseParser.takeFirstSentence: now cuts only on . before a capital letter, not ? or ! — question-style responses (e.g., "Did you mean…?") are preserved intact
  • stripFillerPrefix / stripInlineMarkdown: strip "Sure!", "Absolutely!", bold/italic markers before TTS

Audio / recording fixes

  • No-speech timeout cancelled when recording actually ends — Whisper now always gets a chance to run even on slow devices
  • Listen ping cue: short rising tone plays when re-listen window opens, so the user knows Prism is ready for follow-up
  • Empty-response follow-up loop fixed: a blank Gemma response no longer re-opens the listen window indefinitely

Whisper hallucination detection

  • Bracketed silence markers ([INAUDIBLE], [BLANK_AUDIO], [Music], [ Pause ], etc.) now caught before hitting inference — prevents infinite hallucination loop

TTS playback timing

  • Duration-based wait: PCM byte count ÷ sample rate gives the actual audio duration; sleep only for the remaining unplayed portion + 500ms BT latency cushion — more reliable than polling playbackHeadPosition which stalls intermittently over BT SCO

Observability

  • Persistent diagnostic log (/sdcard/prism_diag.txt) records the full prompt → raw response → parsed result trail for field debugging

Test plan

  • Install on device and confirm on-device mode responses are more accurate and concise
  • Verify action JSON (call, SMS, timer, alarm, volume) still triggers correctly
  • Verify responses stay concise enough for audio — no mid-sentence truncation
  • Verify [INAUDIBLE] / [BLANK_AUDIO] bracketed silence drops silently without speaking garbage
  • TTS playback finishes cleanly before the re-listen ping sounds
  • All 600 unit tests pass (./gradlew :app:testDebugUnitTest)

🤖 Generated with Claude Code

ryon137 and others added 14 commits June 17, 2026 07:44
- Switch from LlmInference.generateResponse() to LlmInferenceSession
  so temperature and topK can be set per-inference call
- temperature 0.7 (was unset/1.0): more deterministic, better factual accuracy
- topK 40 (was 15): standard Gemma recommendation, less repetitive output
- maxTopK raised to 40 to match session topK
- Word limit 15 → 25: gives room for complete answers without rambling
- System prompt: add "if you don't know, say so — do not guess"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mpt format

LlmInferenceSession.addQueryChunk() applies its own prompt template on top
of our manually-formatted Gemma turn tokens, producing garbled input and an
empty or null response. Revert to LlmInference.generateResponse(prompt) which
passes our formatted prompt through unchanged.

maxTopK 40 and the improved system prompt are retained. Temperature control
via the session API is not achievable without reworking the prompt builder to
emit raw text and rely on MediaPipe's template injection instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add AudioUtils.generateListenPingPcm() (A5 880Hz, 120ms) for re-listen entry cue
- Play ping inside startReListenRecording thread before stale-buffer discard;
  pause audioTrack after discard (~400ms) so ping is heard before SCO goes quiet
- Guard reListenMode = FOLLOW_UP behind cleaned.isNotBlank() so a blank Gemma
  response no longer re-enters the follow-up loop indefinitely

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng TTS

ResponseRouter replaces the inline routing block in speakResponse with a
pure, testable function. Key behavioral change: JSON responses that lack a
recognized action (no 'action' field, unknown type, or malformed JSON) now
fall Silent instead of being spoken as raw text — closing the regression
where the on-device model occasionally emits JSON Prism can't execute.

19 new tests cover the full routing matrix including the previously
untested "output must not be raw JSON" cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rate

- Wrap ResponseRouter.route() call in try/catch so any unexpected exception
  from executeAction or routing logic is caught and logged rather than
  propagating to the main thread and crashing the service
- Fix ping sample rate fallback from OUTPUT_SAMPLE_RATE (24000) to
  TTS_SAMPLE_RATE (22050) to match the AudioTrack the ping is written to
- Add `route never throws regardless of input` test to enforce that
  ResponseRouter always returns a Result and never propagates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log the raw model output before processing, the ResponseRouter decision
(type + text/reason), and the final spoken text — so a silent-exit case
leaves a complete trace in logcat without needing to repro from scratch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DiagnosticLogger writes a rolling log to:
  /sdcard/Android/data/com.ryoncook.glassesai/files/diag.log

Readable at any time via adb without the user capturing anything.
Each turn is separated and logs: WAKE → STT → MODEL raw output →
ROUTE decision → TTS spoken text (or silent reason) → any errors.
File is capped at ~80 KB and trims oldest entries automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tem prompt

Gemma with maxTopK=40 was inventing action types like {"action":"answer","response":"..."}
which the router dropped silently. Two fixes:

1. ResponseRouter: when action type is unrecognized, check for response/text/answer/
   content/message/reply fields and speak the first non-blank one instead of going silent
2. System prompt: explicit "For everything else, respond with plain text only — never
   use JSON" to discourage the model from inventing action schemas

3 new tests pin the salvage-text behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The model echoes the user's question with a trailing ? before answering,
so SENTENCE_END was cutting at the ? and discarding the real answer. Only
split on a period followed by whitespace + capital.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r can finish

The 5s timeout was racing with on-device Whisper transcription: recording
ended at ~4.5s, Whisper took ~500ms, timeout fired 478ms before the result
arrived and tore down the conversation. Fix: call removeCallbacks on the
timeout immediately after the recording loop exits so the timeout only fires
when the user genuinely doesn't speak. Also bumped follow-up timeout to 8s
and added explicit afterOnDeviceTurnComplete for blank-Whisper / empty-audio
paths that previously relied on the now-cancelled timeout for cleanup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p infinite loop

Whisper returns [INAUDIBLE], [ Pause ], [ Inaudible ], [BLANK_AUDIO] etc.
for silence/background noise. These were not caught by the repetition check
(< 8 words), passed as valid transcripts, triggered inference, and caused an
infinite re-listen loop. Added BRACKET_PLACEHOLDER regex check before the
word-count gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…completion

Thread.sleep(durationMs + 200) started AFTER audioTrack.write() returned,
but write() in stream mode returns when data is in the kernel buffer, not
after it plays through Bluetooth. For short responses this over-waited by
~900ms; for longer ones with high BT latency the +200ms cushion was not
enough and re-listen started while audio was still playing. Poll
playbackHeadPosition every 50ms and proceed when position is stalled for
400ms — gives accurate completion regardless of response length or BT latency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ckHeadPosition

playbackHeadPosition stalls intermittently over BT SCO, causing the poll
loop to exit early. Duration-based sleep (pcm bytes / 2 / sampleRate +
500ms BT cushion minus time write() blocked) is more reliable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-listen

sleep-based approaches (fixed duration, duration-writeMs) were unreliable
because write() blocking time doesn't accurately reflect how much audio has
left the BT SCO pipeline. stop() in STREAM mode plays through all queued
data and then transitions to PLAYSTATE_STOPPED — deterministic drain.
200ms added after STOPPED for BT SCO wire latency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant