
ryon137/GlassesAI


GlassesAI

A self-hosted AI assistant for Ray-Ban smart glasses. Replaces Meta AI with a fully local pipeline — no cloud, no Meta account required.

Tested on Ray-Ban Meta glasses + Pixel 9 Pro running Android 14.


Architecture

GlassesAI supports two inference modes selectable in-app:

Server Mode (default)

┌─────────────────────────────────────────────────────────────────┐
│                        Ray-Ban Glasses                          │
│              Bluetooth A2DP (audio) / SCO (voice call)          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────────┐
│                    Android Phone (GlassesAI)                    │
│                                                                 │
│  ┌─────────────────┐    wake word     ┌──────────────────────┐  │
│  │   Phone Mic     │ ───────────────► │  Vosk (on-device)    │  │
│  │  (always-on)    │                  │  grammar, offline    │  │
│  └─────────────────┘                  └──────────┬───────────┘  │
│                                                  │ detected     │
│  ┌───────────────────────────────────────────────▼───────────┐  │
│  │                  GlassesAIService (FGS)                   │  │
│  │  • SCO connects → glasses mic becomes active              │  │
│  │  • Streams PCM audio to server via WebSocket              │  │
│  │  • Plays server audio response via AudioTrack (SCO)       │  │
│  │  • Pauses media (Spotify, etc.) during conversation       │  │
│  └──────────────────────────┬────────────────────────────────┘  │
└─────────────────────────────┼───────────────────────────────────┘
                              │ WebSocket (wss://)
                              │ PCM audio up / PCM audio down
┌─────────────────────────────▼───────────────────────────────────┐
│                      Linux Server (server.py)                   │
│                                                                 │
│  ┌──────────────────┐   ┌─────────────────┐   ┌──────────────┐  │
│  │  faster-whisper  │   │   Mistral 7B    │   │  Kokoro TTS  │  │
│  │  (CUDA, int8)    │──►│  via Ollama     │──►│  (CPU)       │  │
│  │  STT  ~0.2s      │   │  LLM inference  │   │  24kHz PCM   │  │
│  └──────────────────┘   └─────────────────┘   └──────────────┘  │
│                                                                 │
│  Implements the Gemini Live WebSocket protocol                  │
└─────────────────────────────────────────────────────────────────┘
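Both directions of the server-mode link carry raw PCM over the WebSocket. The README does not document frame sizes or the uplink sample rate, so the helpers below are a minimal sketch under stated assumptions: 16-bit mono samples, a 16 kHz uplink (the rate Whisper models operate at) and the 24 kHz Kokoro downlink shown in the diagram. The function names are hypothetical, not taken from AudioUtils.kt.

```kotlin
// Minimal sketch, not the app's actual code. Assumed formats (hypothetical):
// uplink = 16 kHz mono 16-bit PCM, downlink = 24 kHz mono 16-bit PCM.

/** Bytes in one frame of 16-bit PCM lasting [frameMs] milliseconds. */
fun pcm16FrameBytes(sampleRate: Int, channels: Int, frameMs: Int): Int =
    sampleRate * channels * 2 * frameMs / 1000

/** Playback duration, in milliseconds, of a 16-bit PCM buffer. */
fun pcm16DurationMs(byteCount: Int, sampleRate: Int, channels: Int): Long =
    byteCount.toLong() * 1000 / (sampleRate.toLong() * channels * 2)

/** Splits a captured buffer into fixed-size frames; the last one may be shorter. */
fun chunkPcm(pcm: ByteArray, frameBytes: Int): List<ByteArray> {
    require(frameBytes > 0) { "frameBytes must be positive" }
    return (pcm.indices step frameBytes).map { start ->
        pcm.copyOfRange(start, minOf(start + frameBytes, pcm.size))
    }
}
```

At 16 kHz mono, a 20 ms frame is 640 bytes; one second of 24 kHz downlink audio is 48,000 bytes. Fixed-size frames keep upstream latency predictable, at the cost of one WebSocket message per frame.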

On-Device Mode (fully offline)

┌─────────────────────────────────────────────────────────────────┐
│                    Android Phone (GlassesAI)                    │
│                                                                 │
│  ┌──────────────┐  wake word  ┌──────────────────────────────┐  │
│  │  Phone Mic   │ ──────────► │  Vosk (grammar, offline)     │  │
│  └──────────────┘             └───────────────┬──────────────┘  │
│                                               │ detected        │
│  ┌────────────────────────────────────────────▼──────────────┐  │
│  │                  GlassesAIService (FGS)                   │  │
│  │                                                           │  │
│  │  Vosk free-form STT  ──►  Gemma 3 1B INT4 (MediaPipe)     │  │
│  │                                         │                 │  │
│  │               Piper TTS (bundled) ◄─────┘                 │  │
│  │               22050 Hz PCM → AudioTrack (SCO)             │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

No network connection is required once the models have been downloaded.
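On-device mode plays Piper output as 22050 Hz PCM through an AudioTrack. Assuming the TTS engine returns mono float samples in [-1.0, 1.0], the conversion to 16-bit PCM is a clamp and a scale. This is a hypothetical helper, not the app's TtsManager code.

```kotlin
// Minimal sketch, assuming the TTS engine returns mono float samples in [-1.0, 1.0]
// (hypothetical helper, not the app's TtsManager). Output is 16-bit PCM suitable for
// an AudioTrack configured with ENCODING_PCM_16BIT at the engine's sample rate.

/** Converts normalized float samples to 16-bit PCM, clamping out-of-range values. */
fun floatToPcm16(samples: FloatArray): ShortArray =
    ShortArray(samples.size) { i ->
        val clamped = samples[i].coerceIn(-1f, 1f)
        // Symmetric scaling: +1.0 maps to 32767 and -1.0 maps to -32767.
        (clamped * Short.MAX_VALUE).toInt().toShort()
    }

/** Duration in milliseconds of [sampleCount] mono samples at [sampleRate] Hz. */
fun durationMs(sampleCount: Int, sampleRate: Int): Long =
    sampleCount.toLong() * 1000 / sampleRate
```

Alternatively, an AudioTrack created with AudioFormat.ENCODING_PCM_FLOAT can take the float samples directly and skip this step.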


Features

  • Two inference modes — Server (your self-hosted Linux box) or On-Device (fully offline)
  • Fully offline wake word detection (OpenWakeWord, on-device) — wake word: "Hey Prism"
  • Bluetooth SCO for full-duplex voice via glasses mic + speaker
  • Configurable server URL, auth token, and TTS voice
  • 11 Kokoro voices in server mode; Piper TTS bundled for on-device mode
  • Media pause/resume during conversations (Spotify, podcasts, etc.)
  • In-app status display — mirrors the notification, always in sync
  • Auto-reconnects on glasses Bluetooth connect
  • Bearer token auth on WebSocket connection (token configured in Settings)
  • Feature toggles for optional permissions (Phone Calls, SMS)
  • Voice confirmation for calls and SMS — Prism reads back the action and listens for yes/no before executing
  • Multi-turn follow-up — brief re-listen window after each response with rolling conversation context
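The multi-turn follow-up keeps a short rolling context for the LLM. One way to bound it is a fixed-size queue of recent exchanges that evicts the oldest turn. The class name, turn limit, and plain "User:/Prism:" prompt format are assumptions; the README specifies none of them.

```kotlin
// Minimal sketch of a rolling conversation window (hypothetical class; the README
// does not say how many turns GlassesAI keeps or how the prompt is formatted).

class ConversationWindow(private val maxTurns: Int) {
    private data class Turn(val user: String, val assistant: String)
    private val turns = ArrayDeque<Turn>()

    init { require(maxTurns > 0) { "maxTurns must be positive" } }

    /** Records a completed exchange, evicting the oldest once the window is full. */
    fun add(user: String, assistant: String) {
        turns.addLast(Turn(user, assistant))
        while (turns.size > maxTurns) turns.removeFirst()
    }

    /** Forgets all context, e.g. when the follow-up listening window times out. */
    fun clear() = turns.clear()

    val size: Int get() = turns.size

    /** Builds a plain-text prompt: recent turns first, then the new request. */
    fun buildPrompt(newUserText: String): String = buildString {
        for (t in turns) {
            append("User: ").append(t.user).append('\n')
            append("Prism: ").append(t.assistant).append('\n')
        }
        append("User: ").append(newUserText).append('\n')
        append("Prism:")
    }
}
```

A small window keeps the prompt short, which matters for a 1B-parameter on-device model with limited context and latency budget.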

Requirements

Phone:

  • Android 12+ (API 31)
  • Bluetooth-capable (SCO support)

Glasses:

  • Ray-Ban Meta smart glasses (any generation)
  • Paired to the phone via the Meta View app (used for pairing only; the Meta app does not need to be running)

Server (Server mode only):

  • Linux machine; an NVIDIA GPU with CUDA is recommended for Whisper
  • Python 3.10+
  • server.py, from the separate GlassesAI-server repository

On-Device mode:

  • ~800 MB free storage for the Gemma model
  • A Hugging Face account with access to litert-community/Gemma3-1B-IT (free; requires accepting the model terms)

Setup

1. Android App

Clone and open in Android Studio:

git clone https://github.com/ryon137/GlassesAI.git
cd GlassesAI
cp secrets.properties.example secrets.properties

Edit secrets.properties (optional — sets the default server URL pre-filled in Settings):

DEFAULT_SERVER_URL=ws://your-server-ip:9073

Use wss:// if your server is behind a TLS reverse proxy (recommended for remote access).

Build and install via Android Studio, or:

./gradlew installDebug

On first launch, open Settings and enter your server URL and auth token.
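ConfigValidator checks the server config before the service starts. A hypothetical version of that check, assuming it accepts only ws:// or wss:// URLs with a host and requires a non-blank auth token, can lean on java.net.URI:

```kotlin
import java.net.URI
import java.net.URISyntaxException

// Minimal sketch of pre-start validation (hypothetical; not the app's actual
// ConfigValidator). Returns a list of problems; an empty list means "OK to start".
fun validateServerConfig(url: String, authToken: String): List<String> {
    val errors = mutableListOf<String>()
    val uri = try {
        URI(url.trim())
    } catch (e: URISyntaxException) {
        null // e.g. spaces or other illegal characters
    }
    when {
        uri == null -> errors += "Server URL is not a valid URI"
        uri.scheme != "ws" && uri.scheme != "wss" ->
            errors += "Server URL must start with ws:// or wss://"
        uri.host.isNullOrEmpty() -> errors += "Server URL has no host"
    }
    if (authToken.isBlank()) errors += "Auth token is empty"
    return errors
}
```

Returning a list of messages rather than a boolean lets Settings show every problem at once.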

2. Server (Server mode only)

See GlassesAI-server for server setup instructions.

3. On-Device Model (On-Device mode only)

The app downloads the Gemma model (~800 MB) from Hugging Face on first use. You need a read token with access to litert-community/Gemma3-1B-IT:

  1. Accept the model terms at huggingface.co/litert-community/Gemma3-1B-IT
  2. Generate a read token at huggingface.co/settings/tokens
  3. Enter the token in the app when prompted
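The download step can be sketched with the JDK's HttpURLConnection, which Android also ships. Hugging Face serves repository files at https://huggingface.co/<repo>/resolve/<revision>/<file> and accepts a read token as an `Authorization: Bearer` header. This is a hypothetical stand-in for ModelDownloadManager; the model file name is left as a parameter because the README does not name it.

```kotlin
import java.io.File
import java.net.HttpURLConnection
import java.net.URL

// Minimal sketch of an authenticated download with progress (hypothetical; not the
// app's ModelDownloadManager). Function names and the buffer size are assumptions.

fun hfResolveUrl(repo: String, fileName: String, revision: String = "main"): String =
    "https://huggingface.co/$repo/resolve/$revision/$fileName"

/** Percentage 0..100, or -1 when the server sent no Content-Length. */
fun progressPercent(bytesRead: Long, totalBytes: Long): Int =
    if (totalBytes <= 0) -1 else (bytesRead * 100 / totalBytes).toInt().coerceIn(0, 100)

fun downloadWithProgress(
    url: String,
    hfToken: String,
    dest: File,
    onProgress: (Int) -> Unit,
) {
    val conn = URL(url).openConnection() as HttpURLConnection
    conn.setRequestProperty("Authorization", "Bearer $hfToken")
    conn.instanceFollowRedirects = true // resolve URLs redirect to a file host
    try {
        val code = conn.responseCode
        check(code == HttpURLConnection.HTTP_OK) { "Download failed: HTTP $code" }
        val total = conn.contentLengthLong
        var read = 0L
        conn.inputStream.use { input ->
            dest.outputStream().use { output ->
                val buf = ByteArray(64 * 1024)
                while (true) {
                    val n = input.read(buf)
                    if (n < 0) break
                    output.write(buf, 0, n)
                    read += n
                    onProgress(progressPercent(read, total))
                }
            }
        }
    } finally {
        conn.disconnect()
    }
}
```

Reporting on every 64 KiB buffer is chatty for an ~800 MB file; a foreground-service notification would likely throttle updates to whole-percent changes.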

The Piper TTS voice model is bundled in the APK — no separate download needed.

4. First Run

  1. Open the app and grant microphone and Bluetooth permissions
  2. Select Server or On-Device mode in Settings
  3. Set your server URL and auth token (Server mode) or download the Gemma model (On-Device mode), then tap Save
  4. Tap Start — the Vosk wake word model downloads on first launch (~40 MB)
  5. Say "Hey Prism"; you'll hear a chime when the app is listening

Configuration

| Setting        | Where             | Description                              | Default                  |
|----------------|-------------------|------------------------------------------|--------------------------|
| Server URL     | Settings → Server | WebSocket URL of your server             | ws://YOUR_SERVER_IP:9073 |
| Auth Token     | Settings → Server | Bearer token matching your server config |                          |
| Inference mode | Settings → Mode   | Server or On-Device                      | Server                   |
| Voice          | Settings → Server | 11 Kokoro voices (server mode only)      | Heart (American, female) |

Wake word is "Hey Prism" — hardcoded to eliminate false-trigger variance.
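The architecture diagrams (and WakeWordMatcher.kt) describe Vosk in grammar mode. Vosk reports final results as JSON with a "text" field, and a grammar can restrict recognition to the wake phrase plus "[unk]" for everything else. Assuming that setup, a hypothetical matcher only needs to accept results whose whole text is the phrase:

```kotlin
// Minimal sketch of matching a Vosk grammar-mode result (hypothetical; not the
// app's WakeWordMatcher). Assumes a grammar such as ["hey prism", "[unk]"], so a
// final result looks like {"text" : "hey prism"}.

private val TEXT_FIELD = Regex("\"text\"\\s*:\\s*\"([^\"]*)\"")

/** Extracts the "text" field from a Vosk result JSON string, or null if absent. */
fun voskResultText(json: String): String? = TEXT_FIELD.find(json)?.groupValues?.get(1)

/** True only when the whole utterance is the wake phrase (case/space-insensitive). */
fun isWakeWord(json: String, phrase: String = "hey prism"): Boolean {
    val text = voskResultText(json) ?: return false
    val words = text.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }
    return words == phrase.lowercase().split(" ")
}
```

Requiring an exact whole-utterance match means a result like "hey prism [unk]" is rejected too; whether the app is that strict is not stated here.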


Project Structure

app/src/main/java/com/ryoncook/glassesai/
├── GlassesAIService.kt     # Core foreground service — audio, BT, WebSocket, on-device inference
├── MainActivity.kt         # Navigation host (bottom nav, permission requests)
├── HomeFragment.kt         # Status display, Start/Stop button
├── SettingsFragment.kt     # All settings — mode, server, wake word, voice, features
├── InfoFragment.kt         # App info, how-to, privacy/permissions summary
├── InferenceManager.kt     # On-device LLM via MediaPipe (Gemma 3 1B INT4)
├── TtsManager.kt           # On-device TTS via sherpa-onnx + Piper (asset extraction + synthesis)
├── ResponseParser.kt       # Strips markdown code fences from LLM output
├── ActionParser.kt         # Parses structured JSON actions (call, SMS) from LLM response
├── FeatureGate.kt          # Gates feature execution behind Settings toggles
├── ConfigValidator.kt      # Validates server config before service start
├── ModelDownloadManager.kt # Downloads Gemma model from HuggingFace with progress reporting
├── ModelDownloadService.kt # Foreground service wrapper for model download
├── WakeWordMatcher.kt      # Vosk grammar-mode wake word detection helper
├── AudioUtils.kt           # PCM audio helpers
├── BluetoothReceiver.kt    # BroadcastReceiver — auto-reconnect on BT connect
└── Config.kt               # App-wide constants
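ResponseParser strips markdown code fences so that ActionParser can read the JSON inside. A minimal sketch, assuming fences always sit on their own lines (the function name is hypothetical):

```kotlin
// Minimal sketch of fence stripping (hypothetical; not the app's ResponseParser).
// The triple-backtick marker is built with repeat() so this snippet can itself
// live inside a fenced block.
private val FENCE = "`".repeat(3)

/** Drops lines that open or close a fence (with or without a language tag). */
fun stripCodeFences(text: String): String =
    text.lines()
        .filterNot { it.trimStart().startsWith(FENCE) }
        .joinToString("\n")
        .trim()
```

Because it drops any line that starts with a fence, this also removes fence lines from ordinary prose replies, which is harmless for spoken output.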

app/src/main/assets/
└── tts-model/              # Bundled Piper TTS (vits-piper-en_US-lessac-medium-int8)

Roadmap

  • Phone actions — make calls and send SMS by voice
  • Expanded voice commands — volume control, timers/alarms
  • End-of-turn descending beep — audio cue when AI finishes responding
  • Status transition timing — fix "Responding…" lingering after AI voice stops
  • On-device mode — fully offline with Gemma 3 1B + Piper TTS
  • Per-user auth token — configured in Settings, not baked into the build
  • Play Store basics — app icon, R8 minification, privacy policy
  • Default phone and clock app — pre-select in Settings to avoid disambiguation popups
  • Custom wake word — "Hey Prism" OpenWakeWord model with in-app threshold calibration
  • UI redesign — system-settings style, value-driven descriptions, on-device as default mode
  • Call/SMS confirmation — Prism reads back "Call Mom?" and listens immediately for verbal yes/no before executing; auto-cancels on timeout
  • Multi-turn follow-up — stay in a brief listening window after a response; short conversation context for the LLM
  • Model response improvements — tune system prompt and inference parameters for on-device Gemma
  • Separate media volume control — "set volume" controls Prism's voice (TTS track); add a distinct command for the glasses media (A2DP) stream
  • Read notifications and calendar by voice
  • OpenClaw integration — smart home, WhatsApp, web search
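The call/SMS confirmation item describes a read-back followed by a yes/no listen that auto-cancels on timeout. A hypothetical classifier for the reply transcript; the word lists and the 5-second default timeout are assumptions, not values from the app:

```kotlin
// Minimal sketch of the confirm-before-acting step (hypothetical). The reply
// transcript is classified as yes, no, or unclear; no reply or a late one times out.

enum class Confirmation { CONFIRMED, DECLINED, UNCLEAR, TIMED_OUT }

private val YES_WORDS = setOf("yes", "yeah", "yep", "sure", "confirm", "do it", "go ahead")
private val NO_WORDS = setOf("no", "nope", "cancel", "stop", "don't", "never mind")

/** Whole-word (or whole-phrase) match on space-normalized text. */
private fun containsPhrase(text: String, phrase: String): Boolean =
    " $text ".contains(" $phrase ")

fun classifyReply(transcript: String?, elapsedMs: Long, timeoutMs: Long = 5_000): Confirmation {
    if (transcript == null || elapsedMs > timeoutMs) return Confirmation.TIMED_OUT
    // Lowercase, turn punctuation (except apostrophes) into spaces, collapse whitespace.
    val text = transcript.lowercase()
        .replace(Regex("[^a-z' ]"), " ")
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }
        .joinToString(" ")
    val yes = YES_WORDS.any { containsPhrase(text, it) }
    val no = NO_WORDS.any { containsPhrase(text, it) }
    return when {
        yes && !no -> Confirmation.CONFIRMED
        no && !yes -> Confirmation.DECLINED
        else -> Confirmation.UNCLEAR // neither, or contradictory ("yes, no, wait")
    }
}
```

Treating contradictory or unrecognized replies as UNCLEAR, rather than guessing, fits the cautious intent of confirming before a call or SMS; the caller can re-prompt once or cancel.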

License

MIT
