A deep dive into building an AI-powered visual inventory system
You know that feeling when you're tearing apart your house looking for your keys, your charger, or that random cable you swore was "right there"? This app fixes that.
The pitch: Snap photos of your stuff. Later, just ask "Where's my red notebook?" and the app tells you—with a photo to prove it.
It's like having a photographic memory, except the memory belongs to GPT-4 and lives in the cloud.
Think of this system as a three-act play:
You → Camera → Supabase Storage → Database → Webhook → GPT-4o Vision → Embeddings
When you take a photo:
- Expo Camera captures the image
- Supabase Storage stores the file (the `images` bucket)
- A row gets inserted into the `scans` table with just the image URL
- A database webhook fires automatically
- The `analyze-image` Edge Function wakes up and asks GPT-4o: "What am I looking at?"
- GPT returns a description, detected objects, and room type
- The description gets converted to a vector embedding (more on this below)
- Everything gets saved back to the database
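The webhook step in the list above can be sketched as a small guard at the top of the Edge Function. This is a minimal sketch, assuming the payload carries `type`, `table`, and `record` fields the way Supabase database webhooks do for inserts; the `ScanRecord` type and the `extractNewScan` name are hypothetical and not taken from the project.

```typescript
// Hypothetical shape of the inserted row (only the fields used here).
interface ScanRecord {
  id: string;
  image_url: string;
}

// Assumed shape of a database webhook payload.
interface WebhookPayload {
  type: string;
  table: string;
  record: ScanRecord | null;
}

// Returns the new scan if the payload is an INSERT on `scans`, otherwise null,
// so the function can skip events it should not process.
function extractNewScan(payload: WebhookPayload): ScanRecord | null {
  if (payload.type !== "INSERT" || payload.table !== "scans") return null;
  if (!payload.record || !payload.record.image_url) return null;
  return payload.record;
}
```

Because the function later writes its results back to the same row, ignoring anything other than an INSERT is a cheap way to avoid reprocessing if the webhook were ever configured to fire on updates as well.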
"Where's my charger?" → GPT extracts room → Vectorize query → Find matches → GPT picks best one → Answer
When you search:
- GPT-4o-mini checks if you mentioned a room ("in the bedroom")
- Your query becomes a vector embedding
- pgvector finds the most similar image descriptions
- GPT-4o looks at the top 3 candidates and picks the winner
- You get a natural language answer + the photo
"Your charger is on the nightstand in your bedroom" + a photo of exactly that.
| Layer | Tech | Why |
|---|---|---|
| Mobile | React Native + Expo SDK 54 | Write once, run everywhere. Expo handles the camera/photo permissions nightmare. |
| Backend | Supabase | Postgres + Storage + Edge Functions + Realtime in one package. No separate servers to manage. |
| Edge Functions | Deno | Runs on Supabase's edge network. Fast cold starts. TypeScript native. |
| AI - Vision | GPT-4o | Best-in-class image understanding. Can describe scenes, identify objects, detect room types. |
| AI - Search | GPT-4o-mini | Fast and cheap for simple extraction tasks (pulling room names from queries). |
| Embeddings | text-embedding-3-small | OpenAI's embedding model. Converts text to 1536-dimensional vectors. |
| Vector Search | pgvector | PostgreSQL extension for similarity search. No separate vector database needed. |
- Postgres is powerful. Triggers, RPC functions, pgvector—it's a real database.
- Edge Functions use Deno. Cleaner than Firebase Functions' Node.js setup.
- SQL is readable. Firestore's NoSQL queries can get ugly fast.
- Row-level security. Built-in, not bolted on.
Cost and speed optimization:
- GPT-4o ($5/M input tokens): Used for complex tasks—image analysis, final answer generation
- GPT-4o-mini ($0.15/M input tokens): Used for simple extraction—"Is there a room name in this query?"
Running GPT-4o for every tiny task would burn money. GPT-4o-mini is 33x cheaper for simple jobs.
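The arithmetic behind that ratio, as a quick sketch. The prices are the ones quoted above and may have changed since; `inputCostUsd` is a hypothetical helper, not part of the codebase.

```typescript
// Prices in USD per one million input tokens, as quoted above.
const PRICE_PER_MILLION = {
  "gpt-4o": 5.0,
  "gpt-4o-mini": 0.15,
} as const;

type Model = keyof typeof PRICE_PER_MILLION;

// Input-token cost of a single call, in USD.
function inputCostUsd(model: Model, inputTokens: number): number {
  return (inputTokens / 1e6) * PRICE_PER_MILLION[model];
}

// How many times cheaper the mini model is for the same input.
const ratio = PRICE_PER_MILLION["gpt-4o"] / PRICE_PER_MILLION["gpt-4o-mini"];
console.log(ratio.toFixed(1)); // prints "33.3"
```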
CREATE TABLE scans (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
image_url TEXT NOT NULL,
description TEXT,
room_label TEXT,
detected_objects JSONB,
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT now()
);
The embedding column is the secret sauce. It stores a 1536-dimensional vector representing the semantic meaning of each image's description. When you search, your query becomes a vector too, and pgvector finds the closest matches using cosine similarity.
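For intuition, here is what "closest by cosine similarity" means in plain TypeScript. This is an illustrative sketch only: in the app the comparison happens inside Postgres, where pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity). The `cosineDistance` function is a hypothetical helper.

```typescript
// Cosine distance between two equal-length vectors: 1 - cos(angle).
// 0 means same direction, 1 means orthogonal, 2 means opposite.
// A zero-length vector has no direction, so the result is NaN in that case.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Only the direction of the vectors matters, not their length, which is why `[1, 2, 3]` and `[2, 4, 6]` count as identical here.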
visual-memory/
├── App.tsx # The entire app UI (simple, one file)
├── lib/
│ └── supabase.ts # Supabase client initialization
└── supabase/
└── functions/
├── analyze-image/ # Webhook: processes new photos
└── search-items/ # API: handles search queries
Notice there's no components/ folder. This is an MVP. The whole app is one file. Don't over-engineer early.
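The problem: Room inheritance logic checks if the last scan was within 10 minutes. But new Date("08:52:27.046579") has no date part to work with. In V8-based runtimes it evaluates to Invalid Date, so the age comparison never passes.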
Root cause: The created_at column was type time instead of timestamptz. We were storing just the time, not the date.
The fix:
ALTER TABLE scans DROP COLUMN created_at;
ALTER TABLE scans ADD COLUMN created_at TIMESTAMPTZ DEFAULT now();
Lesson: Always use TIMESTAMPTZ for timestamps. Never time or timestamp (without timezone). PostgreSQL timezone handling is a minefield, and TIMESTAMPTZ is the safe choice.
The problem: User takes a close-up photo of a desk. GPT says "Office" because desks are in offices... except this desk was in the bedroom.
Root cause: The GPT prompt listed valid rooms but didn't say what to do when the room wasn't visible.
The fix: Updated the prompt:
IMPORTANT: Only assign a specific room if you can clearly see room context
(walls, furniture layout, multiple room features). If the image is a close-up
of an object/surface without visible room context, use 'Close-up' - do NOT
guess the room based on the object type alone.
Lesson: LLMs do what you tell them. If you don't explicitly say "don't guess," they'll guess. Be specific about edge cases.
The problem: Fallback logic checked for labels like "Close-up" or "Surface", but the GPT prompt only allowed "Unknown" as the fallback.
Root cause: The ambiguous labels list and the GPT prompt were out of sync.
The fix: Added "Close-up" to the valid room options in the prompt.
Lesson: When you have validation/categorization logic in multiple places, they must stay synchronized. One source of truth is better than two.
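One way to get that single source of truth is to define the labels once and derive both the prompt text and the fallback check from the same list. A minimal sketch, assuming hypothetical label values (the project's full room list is not shown here) and hypothetical helper names:

```typescript
// Every label the model may return. Hypothetical values for illustration.
const ROOM_LABELS = ["Bedroom", "Kitchen", "Office", "Living Room", "Close-up", "Unknown"] as const;
type RoomLabel = (typeof ROOM_LABELS)[number];

// Labels that mean "the model could not tell". Typing this as RoomLabel[]
// makes it a compile error to list a label the prompt does not offer.
const AMBIGUOUS_LABELS: readonly RoomLabel[] = ["Close-up", "Unknown"];

// The prompt fragment is generated from the same list, so it cannot drift.
function roomInstruction(): string {
  return `Valid room labels: ${ROOM_LABELS.join(", ")}.`;
}

// Used by the fallback logic to decide whether to look for a better label.
function isAmbiguous(label: string): boolean {
  return (AMBIGUOUS_LABELS as readonly string[]).includes(label);
}
```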
Instead of searching the entire database:
-- BAD: Searches everything
SELECT * FROM scans
ORDER BY embedding <=> query_embedding
LIMIT 3;
Filter first, then search:
-- GOOD: Filters first, then searches
SELECT * FROM scans
WHERE room_label = 'Bedroom' -- Hard filter
ORDER BY embedding <=> query_embedding -- Then vector search
LIMIT 3;
Why it matters: If the user says "red thing in the bedroom," a fuzzy vector search might return a red cup from the kitchen. The hard filter guarantees bedroom-only results.
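The same behavior can be illustrated in memory. This sketch is not how the app runs the search (that happens in Postgres); it only shows why the filter changes the outcome. The `ScoredScan` type, the `filterThenRank` function, and the sample distances are made up for the example.

```typescript
interface ScoredScan {
  id: string;
  room_label: string;
  distance: number; // cosine distance to the query, smaller is closer
}

// Hard-filter by room first (when a room was mentioned), then rank by distance.
function filterThenRank(rows: ScoredScan[], room: string | null, limit: number): ScoredScan[] {
  const pool = room === null ? rows : rows.filter((r) => r.room_label === room);
  return [...pool].sort((a, b) => a.distance - b.distance).slice(0, limit);
}

const sampleRows: ScoredScan[] = [
  { id: "red-cup", room_label: "Kitchen", distance: 0.1 },
  { id: "red-pillow", room_label: "Bedroom", distance: 0.25 },
  { id: "lamp", room_label: "Bedroom", distance: 0.6 },
];

// Without a filter the kitchen cup wins; with one it is never considered.
console.log(filterThenRank(sampleRows, null, 1)[0].id); // prints "red-cup"
console.log(filterThenRank(sampleRows, "Bedroom", 1)[0].id); // prints "red-pillow"
```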
When GPT can't determine the room from a close-up shot, check which room the user photographed most recently, and reuse it if that scan is less than 10 minutes old:
if (ambiguousLabels.includes(finalRoom)) {
const { data: history } = await supabase
.from('scans')
.select('room_label, created_at')
.not('room_label', 'in', '("Unknown","Close-up")')
.order('created_at', { ascending: false })
.limit(1)
.single();
// getTime() keeps the subtraction numeric, which TypeScript requires
if (history && Date.now() - new Date(history.created_at).getTime() < 10 * 60 * 1000) {
finalRoom = history.room_label; // Inherit from recent scan
}
}
Why it works: Users typically photograph multiple things in the same room. If they took a photo of the bedroom 2 minutes ago, the close-up of a drawer is probably also in the bedroom.
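Pulling the decision into a pure function makes the rule easy to test without a database. A minimal sketch, assuming the most recent non-ambiguous scan has already been fetched; `resolveRoom` and `LastScan` are hypothetical names.

```typescript
const AMBIGUOUS = ["Unknown", "Close-up"];
const WINDOW_MS = 10 * 60 * 1000; // 10 minutes

interface LastScan {
  room_label: string;
  created_at: string; // ISO timestamp, as returned for a TIMESTAMPTZ column
}

// Returns the room to store: the model's answer, or the previous scan's room
// when the model was unsure and that scan is recent enough.
function resolveRoom(modelRoom: string, last: LastScan | null, nowMs: number): string {
  if (!AMBIGUOUS.includes(modelRoom) || last === null) return modelRoom;
  const ageMs = nowMs - new Date(last.created_at).getTime();
  // An unparseable timestamp yields NaN, which fails both comparisons below.
  return ageMs >= 0 && ageMs < WINDOW_MS ? last.room_label : modelRoom;
}
```

Passing the current time in as `nowMs` is a design choice for testability; the caller would pass `Date.now()`.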
Stage 1: Vector similarity finds candidates (fast, cheap, fuzzy)
Stage 2: LLM picks the winner (slow, smart, accurate)
This is like Google showing you 10 results, then you pick the best one—except the "you" is GPT-4.
// Stage 1: Get candidates via embeddings
// supabase.rpc resolves to { data, error }, so destructure the rows out of it
const { data: candidates, error } = await supabase.rpc('match_images', {
  query_embedding: queryVector,
  match_count: 3
});
if (error) throw error;
// Stage 2: Let GPT pick the best one
// Assumes `openai` is a client from the official openai package
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{
    role: "system",
    content: "Pick the best match from these 3 candidates..."
  }]
});
Why not just use GPT for everything? Cost. Embedding search costs ~$0.02 per 1M tokens. GPT-4o costs $5 per 1M input tokens. Use the cheap tool for filtering, expensive tool for precision.
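The snippet above leaves out how the candidates reach the model. One possible way to assemble the Stage 2 messages is sketched below. The `Candidate` shape, the `buildRerankMessages` name, and the JSON reply format are all assumptions for illustration, not the project's actual prompt.

```typescript
interface Candidate {
  id: string;
  description: string;
  room_label: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds the Stage 2 messages: a numbered candidate list plus the user's question.
function buildRerankMessages(query: string, candidates: Candidate[]): ChatMessage[] {
  const listing = candidates
    .map((c, i) => `${i + 1}. [${c.room_label}] ${c.description}`)
    .join("\n");
  return [
    {
      role: "system",
      content:
        "Pick the best match for the user's question from the numbered candidates. " +
        'Reply in JSON as {"choice": <number>, "answer": <string>}.',
    },
    { role: "user", content: `Question: ${query}\n\nCandidates:\n${listing}` },
  ];
}
```

Numbering the candidates lets the model answer with an index, which is easier to map back to a row than a free-text description.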
response_format: { type: "json_object" }
This makes GPT return syntactically valid JSON, as long as the prompt itself mentions JSON. No more parsing nightmares. No more "Sure! Here's the JSON:" preamble.
Always use this when you need structured data from an LLM.
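JSON mode covers syntax, not shape, so it still pays to check the fields before writing them to the database. A minimal sketch, assuming the model is asked for the three fields stored in the scans table and that detected_objects is a list of strings (both assumptions); `parseAnalysis` is a hypothetical helper.

```typescript
interface ImageAnalysis {
  description: string;
  room_label: string;
  detected_objects: string[];
}

// Parses the model's reply and checks the fields we rely on.
// Returns null instead of throwing so the caller decides how to report it.
function parseAnalysis(raw: string): ImageAnalysis | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, for example a truncated reply
  }
  if (typeof value !== "object" || value === null) return null;
  const { description, room_label, detected_objects } = value as Record<string, unknown>;
  if (typeof description !== "string" || typeof room_label !== "string") return null;
  if (!Array.isArray(detected_objects)) return null;
  if (!detected_objects.every((o) => typeof o === "string")) return null;
  return { description, room_label, detected_objects: detected_objects as string[] };
}
```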
Right now the UI just shows "Searching..." text. A skeleton loader would feel faster.
If a user uploads 50 photos, each triggers a separate webhook. A queue system would be smarter.
Photos should save locally first, then sync. Currently requires internet.
When GPT-4o fails, the user sees "Error processing image." That's not helpful. Surface the actual problem.
Think of GPT as an extremely capable but literal intern:
- They'll do exactly what you say
- They won't do what you don't say
- They'll make assumptions you didn't expect
- Clear instructions = good results
The prompt is your spec. If the spec is vague, the output is unpredictable.
Embeddings are compressed meaning. When you convert "red notebook on the kitchen counter" to a vector, you're compressing that sentence into 1536 numbers that capture its semantic essence. Similar meanings = similar vectors.
Vector search finds "vibes." It doesn't match keywords—it matches meaning. "Where's my charger" finds "phone cable on nightstand" because they're semantically close.
# Start the app
npm start
# Deploy a function after changes
npx supabase functions deploy analyze-image
npx supabase functions deploy search-items
# Link to your Supabase project (first time)
npx supabase login
npx supabase link --project-ref YOUR_PROJECT_REF
# Check function logs
npx supabase functions logs analyze-image
This project is a great example of modern AI-native architecture:
- The AI isn't a feature—it's the foundation
- The database isn't just storage—it's a vector search engine
- The backend isn't servers—it's edge functions that wake up on demand
The whole thing costs pennies to run (Supabase free tier + pay-per-use OpenAI) and could scale to millions of images without architecture changes.
That's the power of building on modern infrastructure. You get to focus on the product, not the plumbing.
Written for future-me and anyone else who inherits this codebase. May your items never be lost again.