Semantic image clustering that runs entirely on your local machine. Upload a collection of photos, get back labeled groups organized by what they're about — not just visual similarity.
Clusters 20 images in ~50 seconds on an M4 Mac Mini (no GPU, no cloud).
flowchart TB
Up["Upload images"] --> CLIP["CLIP ViT-L/14 (ONNX)<br/>768-dim semantic embeddings per image"]
CLIP --> Ward["Ward hierarchical clustering<br/>min/max size-constrained groups"]
Ward --> Rep["Centroid-based selection<br/>3 images closest to each cluster centroid"]
Rep --> VLM["Ollama vision LLM<br/>title + catchy phrase per cluster"]
VLM --> API["JSON API"]
API --> UI["React frontend renders inline"]
CLIP (Contrastive Language–Image Pre-training) is an OpenAI model trained on 400 million (image, text) pairs using a contrastive objective: the image encoder and text encoder are trained jointly so that matching pairs have high cosine similarity in a shared embedding space.
This project uses the vision-only half: CLIP ViT-L/14 in ONNX format from Xenova/clip-vit-large-patch14. The model takes a 224×224 image and produces a 768-dimensional vector. Each dimension encodes abstract semantic content learned from the web-scale training data.
Why CLIP over alternatives:
| Model | Training objective | Best for | Weakness here |
|---|---|---|---|
| CLIP ViT-L/14 | Contrastive (image + text) | Semantic concept clustering | Slightly slower than smaller models |
| ResNet-50/101 | ImageNet classification | Visual feature extraction | Groups visually similar, not semantically similar images |
| DINOv2 | Self-supervised distillation | Re-identification, fine-grained similarity | No semantic grounding from text |
| CLIP ViT-B/32 | Contrastive (image + text) | Faster alternative | 512-dim, lower capacity than L/14 |
CLIP's contrastive training means images of the same concept cluster together even when they look different visually — a photograph of a dog and a cartoon of a dog end up near each other; two photographs of different fields of grass do too.
Before inference, each image is put through the standard CLIP preprocessing pipeline:
- Resize to 224×224 using CatmullRom interpolation (Lanczos-quality bicubic, implemented via golang.org/x/image/draw) — better edge preservation than bilinear at a moderate cost.
- Channel layout — pixel values are rearranged from HWC (height × width × channels) to NCHW (batch × channels × height × width), which is what the ONNX model expects.
- Normalization — each channel is normalized with CLIP's specific mean and std:
R: (pixel/255 - 0.48145466) / 0.26862954
G: (pixel/255 - 0.45782750) / 0.26130258
B: (pixel/255 - 0.40821073) / 0.27577711
- L2 normalization — the output 768-dim vector is L2-normalized so all embeddings lie on the unit hypersphere. This means cosine similarity = dot product, which simplifies downstream distance math.
The ONNX session is wrapped in a Model struct with pre-allocated input/output tensors (3×224×224 float32 in, 768 float32 out). Tensor memory is shared between Go and the C ONNX Runtime via a backing slice, avoiding allocations on the hot path. A sync.Mutex serializes all calls into the single ORT session — ONNX Runtime itself is thread-safe per-session, but the backing-slice reuse pattern requires exclusive access during inference.
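The locking pattern described above might look roughly like the following. This sketch does not call ONNX Runtime at all: the `run` field is a hypothetical stand-in for the session call, and `NewModel` and the constants are assumptions made for illustration.

```go
package main

import (
	"errors"
	"sync"
)

const (
	inputLen  = 3 * 224 * 224 // CHW float32 input
	outputLen = 768           // embedding size
)

// Model guards one inference session and its reusable buffers.
// run is a hypothetical stand-in for the ONNX Runtime session call:
// it reads from in and writes the embedding into out.
type Model struct {
	mu     sync.Mutex
	input  []float32
	output []float32
	run    func(in, out []float32) error
}

// NewModel allocates the shared buffers once, up front.
func NewModel(run func(in, out []float32) error) *Model {
	return &Model{
		input:  make([]float32, inputLen),
		output: make([]float32, outputLen),
		run:    run,
	}
}

// Embed copies the preprocessed tensor into the shared input buffer, runs
// inference under the lock, and returns a private copy of the output so the
// caller never sees a buffer that a later call overwrites.
func (m *Model) Embed(tensor []float32) ([]float32, error) {
	if len(tensor) != inputLen {
		return nil, errors.New("unexpected tensor length")
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	copy(m.input, tensor)
	if err := m.run(m.input, m.output); err != nil {
		return nil, err
	}
	emb := make([]float32, outputLen)
	copy(emb, m.output)
	return emb, nil
}
```

The only per-call allocation in this sketch is the returned copy; the input and output buffers are reused, which is why the mutex has to cover the whole copy-run-copy sequence.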
Clustering is done via agglomerative hierarchical clustering using Ward's linkage criterion, implemented from scratch in Go. The algorithm runs on the L2-normalized CLIP embeddings.
The algorithm starts with each image as its own cluster and iteratively merges the two closest clusters until reaching the target number of clusters. This is the opposite of divisive approaches (start with one cluster, split down). Agglomerative methods produce a dendrogram — a tree of all possible merge decisions — and you cut it at any depth.
Ward's method minimizes the total within-cluster variance at each merge step. The distance between two clusters A and B under Ward's criterion is:
d(A, B) = (|A| * |B|) / (|A| + |B|) * ||centroid(A) - centroid(B)||²
Where |A| and |B| are cluster sizes. The size-weighting term means Ward penalizes merges that would create large, spread-out clusters — it naturally produces compact, similarly-sized groups. This is better than single-linkage (which produces chains) or complete-linkage (which can break apart natural groups).
Centroid updates are computed incrementally as a weighted average of the two merged centroids:
centroid(merged) = (|A|*centroid(A) + |B|*centroid(B)) / (|A| + |B|)
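Both formulas translate directly into Go. The following is a small sketch with hypothetical function names (`wardDistance`, `mergeCentroids`); it assumes both centroids have the same length.

```go
package main

// wardDistance returns |A||B| / (|A|+|B|) * ||cA - cB||^2 for two clusters
// with centroids cA, cB and sizes nA, nB.
func wardDistance(cA, cB []float32, nA, nB int) float32 {
	var sq float32
	for i := range cA {
		d := cA[i] - cB[i]
		sq += d * d
	}
	return float32(nA*nB) / float32(nA+nB) * sq
}

// mergeCentroids returns the size-weighted average of two centroids,
// which is the centroid of the merged cluster.
func mergeCentroids(cA, cB []float32, nA, nB int) []float32 {
	out := make([]float32, len(cA))
	total := float32(nA + nB)
	for i := range cA {
		out[i] = (float32(nA)*cA[i] + float32(nB)*cB[i]) / total
	}
	return out
}
```

For example, a single point at (0, 0) and a 3-member cluster centered at (2, 0) have Ward distance (1·3)/(1+3) · 4 = 3, and their merged centroid is (1.5, 0).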
Standard Ward clustering doesn't support min/max cluster size constraints. imageclust enforces them with a four-phase approach:
Phase 1 — target cluster count:
Given totalImages, minSize, and maxSize, the feasible range of cluster counts is:
nMin = ceil(totalImages / maxSize) # fewest clusters that fit under maxSize
nMax = floor(totalImages / minSize) # most clusters that fit above minSize
nTarget = (nMin + nMax) / 2 # midpoint heuristic
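As a worked sketch of the Phase 1 arithmetic, the function below computes the three values with integer math. The name `targetClusters`, the integer-division midpoint, and the error returned for an infeasible range are assumptions; the source does not say how that case is handled.

```go
package main

import "errors"

// targetClusters returns nMin, nMax and the midpoint nTarget for the given
// constraints. Returning an error when no cluster count satisfies both
// bounds is an assumption made for this sketch.
func targetClusters(totalImages, minSize, maxSize int) (nMin, nMax, nTarget int, err error) {
	if totalImages <= 0 || minSize <= 0 || maxSize < minSize {
		return 0, 0, 0, errors.New("invalid sizes")
	}
	nMin = (totalImages + maxSize - 1) / maxSize // ceil(totalImages / maxSize)
	nMax = totalImages / minSize                 // floor(totalImages / minSize)
	if nMin > nMax {
		return 0, 0, 0, errors.New("no feasible cluster count")
	}
	return nMin, nMax, (nMin + nMax) / 2, nil
}
```

With 20 images, minSize 3 and maxSize 6, this gives nMin = 4, nMax = 6 and nTarget = 5.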
Phase 2 — merge-time max enforcement:
During agglomeration, before each merge, the algorithm checks whether |A| + |B| > maxSize. If it would exceed the limit, that pair is marked as non-mergeable (distance set to math.MaxFloat32) and the next-closest pair is tried instead.
Phase 3 — post-hoc split:
If any cluster still exceeds maxSize after agglomeration completes (possible when most pairs are blocked), the cluster is recursively split using the same Ward algorithm on its sub-embeddings.
Phase 4 — min enforcement:
Clusters smaller than minSize after all merges are dropped from the final output.
- Initial distance matrix: O(n²) pairwise Ward distances
- Each merge iteration: O(n²) scan to find the minimum (naive; a priority queue would give O(n log n) but n is small here)
- Overall: O(n³) in the worst case — negligible for the image counts this tool is designed for
After clustering, the pipeline needs to pick 3 images per cluster to send to the vision LLM for labeling. It selects the images closest to the cluster centroid using cosine similarity.
Since all embeddings are L2-normalized (unit vectors), cosine similarity equals the dot product:
cosine_sim(image, centroid) = image · centroid (when both are unit vectors)
The centroid is computed as the mean of the cluster's embedding vectors, then L2-normalized. The top-k images by dot product score are selected via partial selection sort (O(n·k) — fine for small n).
This ensures the most "representative" images go to the LLM — the ones that best capture the semantic center of the cluster — rather than random picks or outliers.
Each cluster is labeled by sending its representative images to a locally-running vision model via the Ollama REST API (/api/generate). Images are base64-encoded and embedded in the request body.
The prompt asks the model to return strict JSON:
{"title": "short title here", "catchy_phrase": "catchy phrase here"}

Title is capped at 25 characters; catchy phrase at 100.
Retry logic uses exponential backoff with jitter:
backoff(attempt) = initialBackoff * 2^attempt * (1 + 0.3 * rand())
Starting at 2 seconds, capped at 30 seconds, with up to 3 attempts. Jitter prevents thundering-herd if multiple clusters retry simultaneously.
Supported vision models (via Ollama):
| Model | Size | Speed | Quality | Pull command |
|---|---|---|---|---|
| llava:7b | 4.7 GB | Fast | Good | ollama pull llava:7b |
| llama3.2-vision:11b | 8.0 GB | Medium | Better | ollama pull llama3.2-vision:11b |
| moondream | 1.7 GB | Fastest | Lower | ollama pull moondream |
Set OLLAMA_MODEL to switch. llava:7b is the best speed/quality tradeoff for most use cases.
Context propagation — the request's context.Context is forwarded through to each Ollama HTTP call, so if the user cancels the browser request, in-flight LLM work is aborted cleanly.
The pipeline has two parallel stages:
CLIP embedding (worker pool):
A bounded goroutine pool of runtime.NumCPU() workers fans out across all images. Each worker calls Model.Embed(), which preprocesses its image (decode, resize, normalize) in parallel with the other workers and then acquires the mutex for the ORT session. The bottleneck is the single serialized inference session, so extra workers do not speed up inference itself, but they do let preprocessing overlap with it.
flowchart LR
images(["images"]) --> jobs["job channel"]
jobs --> w1["worker 1"]
jobs --> w2["worker 2"]
jobs --> wN["worker N"]
w1 --> mutex["ORT mutex"]
w2 --> mutex
wN --> mutex
mutex --> results["result channel"]
results --> slice(["ordered slice"])
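The fan-out in the diagram above can be sketched with two channels and a WaitGroup. The names `job`, `result` and `embedAll` are hypothetical, and returning only the first error is an assumption. The `embed` callback is expected to be safe for concurrent use, for example by taking its own mutex around inference.

```go
package main

import (
	"runtime"
	"sync"
)

type job struct {
	index int
	image []byte
}

type result struct {
	index int
	emb   []float32
	err   error
}

// embedAll fans images out to a bounded worker pool and returns the
// embeddings in input order, together with the first error seen, if any.
func embedAll(images [][]byte, embed func([]byte) ([]float32, error)) ([][]float32, error) {
	workers := runtime.NumCPU()
	if workers > len(images) {
		workers = len(images)
	}
	jobs := make(chan job)
	results := make(chan result)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				emb, err := embed(j.image)
				results <- result{j.index, emb, err}
			}
		}()
	}

	// Feed jobs, then close results once every worker has finished.
	go func() {
		for i, img := range images {
			jobs <- job{i, img}
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	out := make([][]float32, len(images))
	var firstErr error
	for r := range results {
		if r.err != nil && firstErr == nil {
			firstErr = r.err
		}
		out[r.index] = r.emb
	}
	return out, firstErr
}
```

Carrying the index through the job and result is what lets results arrive in any order while the output slice stays aligned with the input.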
Cluster title generation (unbounded parallel):
All clusters are titled concurrently — one goroutine per cluster. Since Ollama queues requests it can't serve immediately, this is safe. Set OLLAMA_NUM_PARALLEL on the Ollama server side to control how many vision inference slots it allocates.
flowchart TD
subgraph server ["Go HTTP Server — gorilla/mux"]
post["POST /api/cluster"] --> handler["handlers.ClusterAndGenerate()"]
handler --> store["session store<br/>in-memory, 1h TTL"]
store --> run["workflow.ImageCluster.Run()"]
run --> embed["embed workers<br/>NumCPU goroutines"]
run --> titles["cluster titles<br/>1 goroutine per cluster"]
embed --> clip["clip.Model.Embed()<br/>ONNX AdvancedSession, mutex-serialized"]
titles --> oc["ollama.Client<br/>HTTP /api/generate, backoff"]
clip --> cluster["clustering.Perform()<br/>Ward + size constraints"]
cluster --> rep["selectRepresentatives()<br/>cosine similarity ranking"]
titles --> rep
spa["GET / serves React SPA"]
imgreq["GET /api/image/{name}?session=id"]
end
clip -. inference .-> onnx[("ONNX Runtime")]
oc -. vision LLM .-> ollamasrv[("Ollama :11434")]
macOS:
brew install onnxruntime ollama
ollama pull llava:7b # 4.7 GB vision model
bash scripts/download_model.sh # ~1.2 GB CLIP model

Linux: Download ONNX Runtime from github.com/microsoft/onnxruntime/releases (v1.20.1, linux-x64 or linux-aarch64), extract the .so, set ONNXRUNTIME_LIB_PATH. Then install Ollama and run the model download script.
go build -o imageclust .
OLLAMA_MODEL=llava:7b ./imageclust
# open http://localhost:8080

Environment variables (all optional):
| Variable | Default | Description |
|---|---|---|
| ONNXRUNTIME_LIB_PATH | /opt/homebrew/lib/libonnxruntime.dylib | Path to ORT shared library |
| CLIP_MODEL_PATH | models/clip-vit-large-patch14/vision_model.onnx | CLIP ONNX model |
| OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_MODEL | llama3.2-vision:11b | Vision-capable model name |
The Dockerfile builds a self-contained image with the Go server and React frontend. Ollama must run on the host (or in another container); inside the image, OLLAMA_HOST defaults to http://host.docker.internal:11434 instead of http://localhost:11434.
docker build -t imageclust .
docker run -p 8080:8080 \
-v /path/to/models:/app/models \
imageclust

The CLIP model (~1.2 GB) is mounted at runtime via the volume. To bake it in instead, uncomment the COPY models/ line in the Dockerfile.
Cross-platform builds with --platform work correctly (arm64 → aarch64, amd64 → x64 ORT release naming).
Hardware: Apple M4 Pro, 14-core, 64 GB RAM. CPU inference only (no GPU/CoreML EP).
| CLIP embedding metric | Value |
|---|---|
| Time per image | ~432 ms |
| Throughput | ~2.3 images/sec |
| Memory per call | ~3.7 MB |
Inference is serialized (one ORT session, mutex-protected). Preprocessing (decode → resize → NCHW normalization) runs in parallel across the worker pool; the ORT session is the bottleneck.
| Images | Time | Memory |
|---|---|---|
| 10 | 0.15 ms | 0.3 MB |
| 20 | 0.54 ms | 1.2 MB |
| 50 | 3.5 ms | 7.4 MB |
| 100 | 14 ms | 29 MB |
| 200 | 55 ms | 115 MB |
O(n²) distance matrix. Negligible relative to CLIP and Ollama.
| Images | Clusters | Total time | CLIP share | Ollama share |
|---|---|---|---|---|
| 10 | 2 | ~23 s | ~4 s | ~19 s |
| 20 | 4 | ~51 s | ~9 s | ~42 s |
Bottleneck is Ollama (~10 s/cluster, sequential per inference slot). CLIP is ~17% of total time for 20 images. To speed things up: run a smaller vision model (llava:7b is already fast; moondream is faster but lower quality), or set OLLAMA_NUM_PARALLEL on the server to allow concurrent cluster labeling.
internal/
  clip/ — CLIP ViT-L/14 ONNX inference (AdvancedSession, mutex-serialized)
  ollama/ — Direct HTTP client for Ollama /api/generate (no SDK)
  workflow/ — Pipeline orchestration: embed → cluster → title
  clustering/ — Ward hierarchical clustering with min/max size constraints
  handlers/ — HTTP layer: multipart upload, JSON API, session store
  models/ — Shared types (UploadedImage, ClusterDetails)
  utils/ — Filename sanitization
frontend/
  src/components/
    ImageUploadForm.jsx — Upload form with drag-and-drop
    ClusterResults.jsx — Inline cluster grid renderer
scripts/
  download_model.sh — Fetch CLIP ONNX from HuggingFace
  benchmark.sh — End-to-end pipeline timing script
POST /api/cluster — multipart form
| Field | Type | Description |
|---|---|---|
| images | file (multiple) | Image files to cluster |
| minClusterSize | int | Minimum images per cluster (default 3) |
| maxClusterSize | int | Maximum images per cluster (default 6) |
Response:
{
"status": "success",
"sessionId": "abc123",
"clusters": [
{
"id": "Cluster-0",
"title": "Serene rural sunset",
"catchy_phrase": "Nature's canvas of tranquility",
"images": ["img_0.jpg", "img_3.jpg", "img_7.jpg"]
}
]
}

GET /api/image/{filename}?session=<sessionId> — serves an uploaded image. Sessions expire after 1 hour; the background cleanup goroutine removes temp directories every 10 minutes.
MIT