Browser-resident Gemma 4 inference in pure Rust → WebAssembly + WebGPU. Loads the same GGUF blobs Ollama already has on disk, runs the forward pass on your local GPU through hand-written WGSL, never touches a remote server.
The intent is a PWA-pluggable inference engine, not a port of Ollama-the-server. Ollama has 275K LOC of Go that wraps llama.cpp via CGO plus model registry, CLI, conversion tooling, multimodal pipelines — almost none of which apply to a browser library. What survives the scope cut is the core inference path over Ollama's storage format.
A two-crate Cargo workspace:
| Crate | Path | Target | Status |
|---|---|---|---|
| `rullama` | `crates/rullama` | wasm + native | release-track |
| `rullama-finetune` | `crates/rullama-finetune` | wasm + native | LoRA SGD over the same wgpu kernels; PWA exposes `TrainingSession` in the Fine-tune tab |
The iOS bench harness (tools/ios-bench) is a sibling crate excluded from the
workspace so cargo build --workspace --target wasm32-unknown-unknown doesn't
try to compile its staticlib for wasm.
- ✅ `gemma4:e2b` text inference on the desktop loads end-to-end and generates greedy output bit-identical to Ollama. (`gemma4:e4b` is shape-compatible; pull and try it.)
- ✅ `gemma4:e2b` text inference on iPhone: the full Q4_K_M model loads on an iPhone 16e (A18, 8 GB shared RAM) and streams tokens at ~4.65 tok/s via a Dedicated Worker + sync OPFS path. Multimodal towers stay Mac-only for now; mobile picks the text-only loader (max_context=512).
- ✅ Vision + audio multimodal on the desktop. ViT (16 blocks, 768 hidden) + Conformer (12 blocks, 1024 hidden) towers run on the same wgpu device as the text path; soft tokens splice into the prompt via `<|image>` / `<|audio>` sentinels. Validated bit-identical to Ollama on a fixed image and a 30-second pangram WAV.
- ✅ Q4_K + Q6_K + F16 + F32 quants (the actual mix in `gemma4:e2b` Q4_K_M).
- ✅ Streaming load via HTTP byte-range requests or OPFS sync access handles: the 7 GB GGUF never enters wasm linear memory in bulk. The PWA writes to OPFS once via `FileSystemSyncAccessHandle.write()` in a worker, and reads tile-by-tile during inference, so the wasm peak stays in the tens of MiB regardless of model size.
- ✅ Multi-turn chat with system prompt, mid-generation Stop, persistent KV cache.
- ✅ Encoder chained + per-layer submits (M7 + M15): one CommandEncoder spans each transformer layer, submitted incrementally so the GPU drains smoothly even on tight-RAM phones.
- ✅ In-browser LoRA fine-tuning (`rullama-finetune`, wasm + native). Backward kernels for matmul Q4_K / Q6_K, rmsnorm, rope, geglu, attention, cross-entropy; Adam optimizer over GPU buffers; rank-r LoRA on attention + FFN projections. 200-step overfit-one drops loss from ~17.7 → 0 on the dev fixture. Trained adapters export as safetensors and load back into the inference `Model` via `loadAdapter`, with no roundtrip through native. The PWA's Fine-tune tab drives all of this in the foreground tab.
- ❌ MoE `gemma4:26b` / `gemma4:31b`: out of scope.
- ❌ Other architectures (llama, mistral, qwen, phi).
- 🛠️ Mobile multimodal: desktop multimodal works; the iPhone loader currently skips the vision/audio towers to fit in shared RAM. Lazy upload for those is a follow-up.
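
The tile-by-tile read path described in the list above comes down to byte-range arithmetic over the quantized tensor layout. The following is a minimal sketch, assuming row-major tensors and the standard ggml block sizes (Q4_K: 144 bytes per 256 weights, Q6_K: 210 bytes per 256 weights). `Dtype`, `row_bytes`, and `tile_range` are hypothetical names chosen for illustration and are not taken from the rullama source.

```rust
/// Element encodings found in the Q4_K_M mix.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Dtype {
    F32,
    F16,
    Q4K,
    Q6K,
}

impl Dtype {
    /// (elements per block, bytes per block) in the ggml layout.
    pub fn block(self) -> (u64, u64) {
        match self {
            Dtype::F32 => (1, 4),
            Dtype::F16 => (1, 2),
            Dtype::Q4K => (256, 144),
            Dtype::Q6K => (256, 210),
        }
    }
}

/// Bytes occupied by one row of `cols` elements, or `None` if `cols`
/// is not a whole number of blocks.
pub fn row_bytes(dtype: Dtype, cols: u64) -> Option<u64> {
    let (elems, bytes) = dtype.block();
    if cols % elems != 0 {
        return None;
    }
    Some(cols / elems * bytes)
}

/// Half-open byte range `[start, end)` covering rows `row0..row0 + rows`
/// of a tensor whose data begins at `tensor_offset` in the file.
pub fn tile_range(
    dtype: Dtype,
    cols: u64,
    tensor_offset: u64,
    row0: u64,
    rows: u64,
) -> Option<(u64, u64)> {
    let rb = row_bytes(dtype, cols)?;
    let start = tensor_offset + row0 * rb;
    Some((start, start + rows * rb))
}
```

A range computed this way maps onto either an OPFS read at `start` for `end - start` bytes or an HTTP `Range: bytes=start-(end-1)` header, since HTTP ranges use an inclusive end.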
You need:
- Rust ≥ 1.91 + `wasm-pack` (`cargo install wasm-pack --locked --version 0.13.1`)
- A WebGPU-capable browser (Chrome 113+, Edge 113+, recent Firefox; iOS Safari 17.4+ for phones)
- Ollama installed locally with `gemma4:e2b` pulled (`ollama pull gemma4:e2b`)
# Unified bundle — exposes both inference (`Model`) and training
# (`TrainingSession`) wasm-bindgen surfaces. Built from `rullama-finetune`
# because that's the crate that re-exports both. `--out-name rullama` keeps
# the JS entry at `pkg/rullama.js` for PWA import compatibility.
wasm-pack build crates/rullama-finetune --target web --release \
--out-dir ../../pkg --out-name rullama
# Inference-only variant (smaller bundle, no TrainingSession). Use when
# shipping a chat-only deployment.
wasm-pack build crates/rullama --target web --release --out-dir ../../pkg

This emits pkg/rullama.js + pkg/rullama_bg.wasm + TypeScript typings.
The user-facing browser app lives in web/, a production-quality React + Vite + Tailwind + Workbox chat PWA (service-worker offline shell, restart dialog on deploys, attachment UI, conversation history in OPFS + SQLite via rsqlite-wasm) built against the shared wasm bundle.
# React / Vite PWA — auto-runs the wasm bundle build via `pnpm dev`.
cd web
pnpm install
pnpm dev # https://localhost:5173/

The first load streams the ~7 GB blob from the local Ollama install (or an R2
mirror; see scripts/upload-models-to-r2.sh) through a Dedicated Worker that
owns a FileSystemSyncAccessHandle over OPFS. Bytes go network → sync handle
→ disk without ever pinning a Blob in the JS heap. Subsequent loads (within
the same Safari session) reuse the cached file.
The PWA is fully drivable from the Mac via Apple's safaridriver:
# One-time setup on the phone:
# Settings → Safari → Advanced → Remote Automation = on
# Web Inspector = on
# Feature Flags → WebGPU = on
# Then on the Mac:
safaridriver -p 4444 &
./web/serve-iphone.sh # HTTPS serve reachable from the phone's Safari
./web/test/iphone-test.sh # navigate → Load → chat → log perf

/tmp/rullama-page.log collects beacon traces from the page ([chat],
[pe], [tg], [gen], [wkr], [rs]) so any regression in a phone
run leaves a server-side trail even after a WebContent crash.
compose.yaml packages the built PWA + a model-blob HTTP service behind
nginx, designed to sit behind Cloudflare. The Cargo workspace ships
cargo docker:* aliases (dispatched through the xtask crate) so the
deploy loop doesn't need shell aliases:
| Alias | Effective command |
|---|---|
| `cargo docker:build` | `docker compose build` |
| `cargo docker:start` | `docker compose up -d` |
| `cargo docker:stop` | `docker compose down` |
| `cargo docker:restart` | `docker compose build --no-cache` then `docker compose up -d --force-recreate` |
| `cargo docker:logs` | `docker compose logs -f --tail=200` |
| `cargo docker:ps` | `docker compose ps` |
First run compiles xtask (~1 s); subsequent invocations reuse the cached
binary. Add new tasks by appending a match arm in xtask/src/main.rs and
a corresponding line in .cargo/config.toml. The compose file's
OLLAMA_MODELS_DIR env var picks the host's model store; defaults to
/usr/share/ollama/.ollama/models.
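
To make the "append a match arm" step concrete, here is a rough sketch of how such a dispatcher could look. It mirrors the alias table above, but the function names `compose_steps` and `run_task` and the overall shape are assumptions for illustration; the real xtask/src/main.rs may be organized differently.

```rust
use std::process::Command;

/// Map a `docker:*` task name to the `docker` invocations it runs.
/// Each inner Vec is the argument list for one `docker` process.
/// (Hypothetical helper; adding a task means adding one match arm.)
pub fn compose_steps(task: &str) -> Option<Vec<Vec<&'static str>>> {
    let steps = match task {
        "docker:build" => vec![vec!["compose", "build"]],
        "docker:start" => vec![vec!["compose", "up", "-d"]],
        "docker:stop" => vec![vec!["compose", "down"]],
        "docker:restart" => vec![
            vec!["compose", "build", "--no-cache"],
            vec!["compose", "up", "-d", "--force-recreate"],
        ],
        "docker:logs" => vec![vec!["compose", "logs", "-f", "--tail=200"]],
        "docker:ps" => vec![vec!["compose", "ps"]],
        _ => return None,
    };
    Some(steps)
}

/// Run each step in order, stopping at the first failure.
pub fn run_task(task: &str) -> Result<(), String> {
    let steps = compose_steps(task).ok_or_else(|| format!("unknown task: {task}"))?;
    for args in steps {
        let status = Command::new("docker")
            .args(&args)
            .status()
            .map_err(|e| format!("failed to spawn docker: {e}"))?;
        if !status.success() {
            return Err(format!("docker {} exited with {status}", args.join(" ")));
        }
    }
    Ok(())
}
```

Keeping the name-to-arguments mapping in a pure function separate from process spawning means the table can be unit-tested without Docker installed.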
The same code paths run natively against host wgpu (Metal on macOS, Vulkan on Linux). Useful for parity testing without a browser:
# Greedy parity vs Ollama (CPU oracle)
cargo run -p rullama --release --features cpu-reference --example greedy_parity -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" 5
# Full-stack chat through the public Model API
cargo run -p rullama --release --features cpu-reference --example model_api -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" --greedy --max=16
# Standalone chained forward (M7 perf path)
cargo run -p rullama --release --features cpu-reference --example chained_smoke -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" --max=8

--features cpu-reference is now a no-op (the f32 oracle is always built); the
flag is kept so existing scripts keep working.
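
Greedy parity means picking the highest logit at every step and comparing the resulting token IDs against the reference run. The sketch below shows that comparison in isolation; `argmax` and `first_divergence` are hypothetical helpers written for this illustration, not functions from the greedy_parity example.

```rust
/// Index of the largest logit; ties resolve to the lowest index and
/// NaN entries are ignored. Returns `None` for an empty or all-NaN slice.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            // Keep the earlier index unless `v` is strictly larger.
            Some((_, b)) if v <= b => {}
            _ if v.is_nan() => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// First position where two token sequences differ, or `None` if they
/// are identical (including length).
pub fn first_divergence(ours: &[u32], reference: &[u32]) -> Option<usize> {
    let common = ours.len().min(reference.len());
    for i in 0..common {
        if ours[i] != reference[i] {
            return Some(i);
        }
    }
    if ours.len() != reference.len() {
        Some(common)
    } else {
        None
    }
}
```

Reporting the first diverging position, rather than a plain pass/fail, makes it easier to tell a tokenizer mismatch (divergence at position 0) from accumulated numeric drift (divergence late in the sequence).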
rullama-finetune runs LoRA SGD against the live wgpu kernels — no Burn, no
PyTorch, no separate runtime. Scope: rank-r LoRAs on
attn_q / attn_k / attn_v / attn_o and the FFN projections, Adam, global
L2 grad clipping, gradient accumulation, mixed precision, gradient
checkpointing. PerPosition CE is a single-forward variant with a ~C/2 speedup
vs. the multi-forward path.
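
As a reference for what global L2 grad clipping computes, here is a CPU-side sketch. rullama-finetune performs this over GPU buffers; the `clip_global_l2` function below is a hypothetical plain-Rust equivalent for illustration only.

```rust
/// Scale every gradient buffer in place so that the L2 norm taken over
/// all buffers together is at most `max_norm`.
/// Returns the norm measured before clipping.
pub fn clip_global_l2(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Accumulate in f64 to limit rounding error across many parameters.
    let sum_sq: f64 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|&x| (x as f64) * (x as f64))
        .sum();
    let norm = sum_sq.sqrt() as f32;

    if norm > max_norm && norm > 0.0 {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            for x in g.iter_mut() {
                *x *= scale;
            }
        }
    }
    norm
}
```

The norm is taken over every LoRA parameter at once, so the direction of the combined update is preserved; clipping each buffer separately would change the relative weighting between adapters.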
In the browser: the unified wasm bundle (see Build)
exposes TrainingSession to JS alongside Model. The Fine-tune tab in
web/ drives a full session — dataset upload, hyperparam UI, live
loss chart, save adapter to OPFS as safetensors. The same Model that's
loaded for inference accepts the trained adapter via Model.loadAdapter(bytes)
(re-runs in the chat tab against the adapted weights).
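
For readers unfamiliar with what "adapted weights" means numerically, the sketch below applies a rank-r LoRA pair to one projection output on the CPU. It assumes the conventional LoRA formulation y = Wx + (alpha / r) · B(Ax) with row-major matrices; whether rullama uses this exact scaling is not stated here, and `LoraPair` and `apply_lora` are hypothetical names.

```rust
/// Row-major matrix-vector product: `m` is `rows x cols`, `x` has `cols` entries.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            m[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// One low-rank adapter: `a` is `rank x d_in`, `b` is `d_out x rank`.
pub struct LoraPair {
    pub a: Vec<f32>,
    pub b: Vec<f32>,
    pub rank: usize,
    pub alpha: f32,
}

/// Add the LoRA delta to an already-computed base projection `base_out = W x`.
pub fn apply_lora(base_out: &[f32], x: &[f32], lora: &LoraPair) -> Vec<f32> {
    let down = matvec(&lora.a, lora.rank, x.len(), x); // A x, length = rank
    let up = matvec(&lora.b, base_out.len(), lora.rank, &down); // B (A x)
    let scale = lora.alpha / lora.rank as f32;
    base_out.iter().zip(&up).map(|(y, d)| y + scale * d).collect()
}
```

Because the delta is added to the output of the base projection, the quantized base weights never need to be rewritten, which is consistent with loading and clearing adapters on an already-loaded model.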
Native:
# Overfit a single (prompt, target) pair — acceptance test that the
# backward path and Adam are wired correctly.
cargo run -p rullama-finetune --release --example overfit_one -- \
~/.ollama/models/blobs/sha256-<digest>
# Train on a JSONL dataset. See `crates/rullama-finetune/examples/data/echo.jsonl`
# for the format; env knobs documented in the example's docstring.
cargo run -p rullama-finetune --release --example train_jsonl -- \
~/.ollama/models/blobs/sha256-<digest> \
crates/rullama-finetune/examples/data/echo.jsonl
# End-to-end smoke: train an adapter, save safetensors, reload via the
# public Model API, run a generation against the adapted weights.
cargo run -p rullama-finetune --release --example eval_adapter -- \
~/.ollama/models/blobs/sha256-<digest> /path/to/adapter.safetensors

PWA (host page) ──┐
▼ postMessage RPC
┌──────────────────────────────────────────────────────────────────┐
│ inference-worker.js (Dedicated Worker) │
│ ▶ owns FileSystemSyncAccessHandle for the GGUF │
│ ▶ owns the wasm Model handle │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ wasm32 (Rust, the rullama crate) │ │
│ │ Model.loadFromOpfs(read_fn, total) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ GgufReader (header only, ~5 MB) │ │
│ │ │ │ │
│ │ │ TensorFetcher (OPFS sync read | HTTP Range)│
│ │ ▼ │ │
│ │ WeightCache ─────────▶ Forward / VisionForward / │ │
│ │ (lazy GPU upload, GpuAudioForward │ │
│ │ per-tile range fetch (per-layer encoder │ │
│ │ on big tensors) submits, GPU-resident │ │
│ │ KV cache) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ wgpu (WebGPU / Metal / Vulkan) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ WGSL kernels: matmul Q4_K/Q6_K/F16, rmsnorm, │ │
│ │ rmsnorm_per_row, rope_neox, attention (incl. │ │
│ │ HPD-f16 + block-local + subgroup variants), │ │
│ │ conv2d, geglu, softcap, residual_add, scale, │ │
│ │ top_k, quick_gelu, plus backward kernels for │ │
│ │ training (cross_entropy, rmsnorm, rope, geglu, │ │
│ │ attention dQ / dKV, matmul Q4_K / Q6_K, Adam) │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▲ postMessage replies (tokens, errors)
PWA renders tokens, manages chat history, handles attachments.
The Worker move (M15) is what unblocked iPhone inference: iOS Safari only
exposes FileSystemSyncAccessHandle in Worker contexts, and the Worker
isolates inference from main-thread page-watchdog reapers.
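
The fetcher layer in the diagram can be pictured as a small byte-range abstraction with interchangeable backends. The sketch below is a simplified, synchronous stand-in: the trait name `RangeFetcher`, its methods, and `for_each_tile` are assumptions made for illustration, and the real TensorFetcher trait may differ (for example, it may be async for the HTTP path).

```rust
use std::io;

/// Hypothetical byte-range source. An OPFS sync handle, an HTTP Range
/// client, or an in-memory buffer could all sit behind this interface.
pub trait RangeFetcher {
    fn total_len(&self) -> u64;
    /// Fill `buf` with the bytes starting at `offset`.
    fn read_range(&self, offset: u64, buf: &mut [u8]) -> io::Result<()>;
}

/// In-memory backend, useful for tests.
pub struct InMemory(pub Vec<u8>);

impl RangeFetcher for InMemory {
    fn total_len(&self) -> u64 {
        self.0.len() as u64
    }

    fn read_range(&self, offset: u64, buf: &mut [u8]) -> io::Result<()> {
        let start = usize::try_from(offset)
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "offset too large"))?;
        let end = start
            .checked_add(buf.len())
            .filter(|&e| e <= self.0.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))?;
        buf.copy_from_slice(&self.0[start..end]);
        Ok(())
    }
}

/// Stream `len` bytes starting at `offset` through one fixed-size scratch
/// buffer, handing each tile to `sink` (e.g. a GPU upload).
pub fn for_each_tile<F: RangeFetcher>(
    src: &F,
    offset: u64,
    len: u64,
    tile: usize,
    mut sink: impl FnMut(&[u8]),
) -> io::Result<()> {
    assert!(tile > 0, "tile size must be non-zero");
    let mut scratch = vec![0u8; tile];
    let mut done = 0u64;
    while done < len {
        let n = (len - done).min(tile as u64) as usize;
        src.read_range(offset + done, &mut scratch[..n])?;
        sink(&scratch[..n]);
        done += n as u64;
    }
    Ok(())
}
```

Peak memory in this scheme is bounded by the tile size rather than the tensor size, which is the property that keeps the wasm heap small while a multi-gigabyte file is consumed.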
The reference Go implementation lives in Ollama's tree under
model/models/gemma4/. Every op in crates/rullama/src/reference/forward.rs
(CPU oracle), forward_chained.rs (production GPU forward),
multimodal/vision.rs, and multimodal/audio.rs corresponds 1:1.
Measurements as of M15:
| Target | Steady-state tok/s (gen) | Notes |
|---|---|---|
| iPhone 16e (A18, iOS 26) | ~4.65 tok/s | text-only, max_context=512 |
| AMD Radeon Pro 555 (Mac) | ~1 tok/s (M7 baseline) | naive kernels, tiled matmul deferred |
The architectural foundation (chained encoder, GPU-resident KV cache, per-layer submits, per-tile range fetch from OPFS) is in place. Inference kernels are still naive matvec; reaching ≥10 tok/s on both Mac and phone needs tiled matmul + bind-group caching + kernel fusion (the M8 line on the roadmap).
The iPhone A18 advertises 1 GiB for both max_buffer_size and
max_storage_buffer_binding_size (four times the WebGPU default of 256 MiB
for the former, eight times the 128 MiB default for the latter), so
there's real headroom for fewer/larger weight buffers (currently 455 of
them resident, see M15 follow-ups).
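
One way to think about consolidating many small weight buffers into fewer large ones is a sequential packing pass under the binding-size limit. The sketch below is purely illustrative and does not describe an existing rullama feature: `pack` is a hypothetical function, and the 256-byte alignment in the usage note is the WebGPU default for storage buffer offsets.

```rust
/// Round `n` up to the next multiple of `align` (a power of two).
fn align_up(n: u64, align: u64) -> u64 {
    (n + align - 1) & !(align - 1)
}

/// Pack tensors, in order, into buffers no larger than `max_binding` bytes.
/// Returns each tensor's (buffer index, byte offset) plus the buffer count,
/// or `None` if a single tensor exceeds the limit.
pub fn pack(sizes: &[u64], max_binding: u64, align: u64) -> Option<(Vec<(usize, u64)>, usize)> {
    let mut placements = Vec::with_capacity(sizes.len());
    let mut buf = 0usize;
    let mut cursor = 0u64; // first free byte in the current buffer
    for &size in sizes {
        if size > max_binding {
            return None;
        }
        let mut off = align_up(cursor, align);
        if off + size > max_binding {
            // Current buffer is full: start a new one.
            buf += 1;
            off = 0;
        }
        placements.push((buf, off));
        cursor = off + size;
    }
    let count = if sizes.is_empty() { 0 } else { buf + 1 };
    Some((placements, count))
}
```

With `max_binding` set to the 1 GiB the A18 reports and `align` set to 256, the returned buffer count gives a quick estimate of how far the resident-buffer total could drop for a given tensor list.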
Other capability notes captured during iPhone validation:
- `shader-f16` ✓: packed FP16 MADs engage on A18.
- `timestamp-query` ✓: Pro 555 doesn't expose this; could wire GPU-side per-pass timing.
- `subgroups` ✗: A18 has SIMDgroup hardware but Safari's WebGPU doesn't surface WGSL subgroup ops yet. Vision attention falls through to the no-subgroup HPD-f16 kernel automatically.
crates/rullama/
├── src/
│ ├── api.rs # JS-facing Model: load / loadFromUrl / loadFromOpfs[TextOnly] / loadAdapter / clearAdapter
│ ├── lora.rs # InferenceAdapter — parses the safetensors blob TrainingSession writes
│ ├── backend/
│ │ ├── context.rs # WgpuCtx (device, queue, adapter limits)
│ │ ├── dispatch.rs # cached + chained kernel dispatchers (incl. backward + Adam)
│ │ ├── pipelines.rs # one ComputePipeline per kernel (built once)
│ │ ├── weight_cache.rs # lazy GPU upload, per-tile range fetch on big tensors
│ │ ├── matmul.rs / elementwise.rs / spike.rs # one-shot dispatchers (parity tests)
│ ├── gguf/
│ │ ├── reader.rs # GGUF v3 parser (header + tensor descriptors)
│ │ ├── fetcher.rs # TensorFetcher trait + In-memory / HttpRange / Opfs impls
│ │ ├── tensor.rs # dequant_tensor_to_f32 / dequant_row_to_f32 (sync + async)
│ │ ├── quant.rs / dtype.rs / value.rs
│ ├── kernels/wgsl/ # 70+ hand-written compute shaders (text + vision + audio + backward)
│ ├── model/config.rs # Gemma4Config: parses gemma4.* metadata keys
│ ├── multimodal/
│ │ ├── vision.rs # ViT forward (16 blocks, 768d, ClippableLinear)
│ │ ├── audio.rs # Conformer forward (12 blocks, 1024d, block-local attention)
│ │ └── audio_features.rs # WAV → 128-bin log-mel (realfft)
│ ├── reference/
│ │ ├── forward.rs # CPU f32 forward (parity oracle)
│ │ ├── forward_gpu.rs # M3-era GPU forward with per-kernel readbacks (oracle)
│ │ ├── forward_chained.rs # M7 production GPU forward, per-layer submits (M15)
│ │ ├── ops.rs / weights.rs
│ ├── sampling.rs # temperature, top-k, top-p, rep penalty
│ ├── template/gemma4_small.rs # chat-template renderer (matches Ollama)
│ └── tokenizer/ # GGUF BPE tokenizer (Ollama-bit-exact)
└── examples/
├── greedy_parity.rs # CPU forward greedy vs Ollama
├── chained_smoke.rs # standalone Forward driver
├── model_api.rs # public Model API end-to-end
├── vision_parity.rs # vision tower vs Ollama (M11)
├── audio_parity.rs # audio tower vs Ollama (M13)
├── matmul_bench.rs # native wgpu matmul microbench
└── inspect.rs / decode_ids.rs / encode_check.rs / list_tensors.rs / …
crates/rullama-finetune/
├── src/
│ ├── shared/ # vendored config / error / progress types
│ ├── dataset_loader.rs # JSONL parser + Tokenizer trait
│ ├── lr_schedule.rs # warmup + linear / cosine / cosine-warm-restarts
│ ├── lora.rs # per-LoRA GPU state (A / B), grad buffers
│ ├── scratch.rs # per-step GPU scratch buffers for backward
│ ├── wasm_bindgen_api.rs # JS-facing TrainingSession (wasm32 only)
│ └── session.rs # TrainingSession — forward → loss → backward → Adam
└── examples/
├── overfit_one.rs # single-pair acceptance test
├── train_jsonl.rs # JSONL dataset trainer
├── eval_adapter.rs # load a trained safetensors blob and generate
└── data/echo.jsonl
examples/
├── web/ # React + Vite + Tailwind + Workbox SW production demo
│ └── src/components/FineTunePanel.tsx # in-browser LoRA training tab over the loaded Model
└── pwa/ # Vanilla JS bench harness + safaridriver scripts
├── index.html / bench.html
├── inference-worker.js # Dedicated Worker — owns Model + sync OPFS handle
├── opfs-store.js # OPFS download + read API (main-thread)
├── opfs-writer-worker.js # streams GGUF → OPFS via SyncAccessHandle.write
├── serve.sh # dev HTTPS server + /api/log /api/blob endpoints
├── run-on-iphone.sh / iphone-session-keeper.sh / clean-iphone.sh
└── bench-on-iphone.sh
tools/ios-bench/ # staticlib for Xcode — C-ABI rullama_run_bench
docker/ # nginx + R2 mirror configs
scripts/ # ops scripts (model upload, etc.)
Dual-licensed under either of:
- Apache License 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Contributions are accepted under the same dual-license terms.