Deployment repo for a self-hosted Ollama instance. The image ships with two pre-pulled models:
- qwen2.5:3b (default) - general chat, drafting, summarisation
- gemma3:1b - small and fast, for classification, tagging, extraction
This repo only owns the Ollama deployment. The application code that calls it (the /orgs/{orgSlug}/ai/chat API) lives in partners-connect/server — see the spec at partners-connect/server/specs/026-llm-chat-integration/.
cortex/
├── ollama/
│   └── Dockerfile              # builds an Ollama image with qwen2.5:3b + gemma3:1b pre-pulled
├── Dockerfile.deploy           # consumed by Clever Cloud, pulls the image from GHCR
├── docker-compose.yml          # local smoke test
├── .github/workflows/
│   ├── build-push-ollama.yaml  # CI: build & push to ghcr.io on every main push
│   └── deploy-ollama.yaml      # CD: deploy to Clever Cloud
└── DEPLOYMENT.md               # one-time Clever Cloud setup runbook
docker compose up --build

The host-side port is 11435 locally to avoid colliding with a native Ollama install on the dev machine (which uses 11434). Inside the compose network, services still reach Ollama at http://ollama:11434. In production (Clever Cloud), Ollama is forced to listen on 8080 via the OLLAMA_HOST env var (set by the deploy workflow) so that Clever Cloud's healthcheck, which polls 0.0.0.0:8080, succeeds.
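The port layout described above can be summarised as a small lookup. This is a minimal sketch: the function name `ollama_port` and the context names `host`, `compose`, and `production` are hypothetical and not part of this repo; only the port numbers come from the text.

```shell
#!/bin/sh
# Hypothetical helper: map a deployment context to the port Ollama is reachable on.
# The context names are illustrative; only the port numbers come from this README.
ollama_port() {
  case "$1" in
    host)       echo 11435 ;;  # port published on the dev machine by docker compose
    compose)    echo 11434 ;;  # service-to-service traffic inside the compose network
    production) echo 8080 ;;   # Clever Cloud, where OLLAMA_HOST moves the listener
    *)          echo "unknown context: $1" >&2; return 1 ;;
  esac
}

# Example: build the tags URL used for a local smoke test.
echo "http://127.0.0.1:$(ollama_port host)/api/tags"
```

Run as-is, the last line prints `http://127.0.0.1:11435/api/tags`, the same URL used in the smoke test.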
First build takes ~4–5 minutes (pulls Ollama, then pulls qwen2.5:3b ~1.9 GB and gemma3:1b ~815 MB into the image).
Once running:
# List models baked into the image
curl -s http://127.0.0.1:11435/api/tags | jq
# Test inference
curl -s -X POST http://127.0.0.1:11435/api/generate \
-H 'Content-Type: application/json' \
-d '{"model":"qwen2.5:3b","prompt":"Reply with one word: hello","stream":false}'

Stop:

docker compose down

git push main ─▶ build-push-ollama.yaml ─▶ ghcr.io/<owner>/<repo>-ollama:<sha>
                                        └▶ ghcr.io/<owner>/<repo>-ollama:latest
manual trigger ─▶ deploy-ollama.yaml ─▶ Clever Cloud (single instance)
Image builds automatically on every push to main. Deploys are manual via gh workflow run "CD - Deploy Ollama to Clever Cloud" -f image_tag=latest (or the GitHub Actions UI).
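To make the tag naming and the manual deploy step concrete, here is a hedged sketch that only prints what it would do. The helper names `image_ref` and `deploy_cmd`, the owner `example-org`, and the short SHA `abc1234` are placeholders. Passing a SHA as `image_tag` is an assumption based on the `:<sha>` tag that CI pushes; check deploy-ollama.yaml for the accepted values.

```shell
#!/bin/sh
# Compose a GHCR image reference following the naming shown in the pipeline diagram.
image_ref() {
  owner="$1"; repo="$2"; tag="${3:-latest}"
  echo "ghcr.io/${owner}/${repo}-ollama:${tag}"
}

# Print (do not run) the deploy command for a given tag, defaulting to latest.
deploy_cmd() {
  tag="${1:-latest}"
  echo "gh workflow run \"CD - Deploy Ollama to Clever Cloud\" -f image_tag=${tag}"
}

# Placeholder owner, repo, and SHA for illustration only.
image_ref example-org cortex abc1234
deploy_cmd abc1234
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; pipe the output of `deploy_cmd` to `sh` only once the tag has been verified in GHCR.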
See DEPLOYMENT.md for the one-time Clever Cloud setup (create apps, configure network group, set GitHub secrets).
Edit the ollama pull lines in ollama/Dockerfile. Push to main to rebuild the image, then re-trigger the deploy workflow. Models get baked into the image, so a larger model means a larger image (and longer pulls on Clever Cloud).
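Before editing, it can help to list which models the Dockerfile currently bakes in. The sketch below assumes at most one `ollama pull <model>` per line; the actual layout of ollama/Dockerfile may differ, and `baked_models` is a hypothetical helper name. The demo runs against a throwaway file rather than the real Dockerfile.

```shell
#!/bin/sh
# List the model names that appear after `ollama pull` in a Dockerfile.
# Assumes at most one `ollama pull <model>` per line.
baked_models() {
  sed -n 's/.*ollama pull \([^ ;&]*\).*/\1/p' "$1"
}

# Demo against a temporary file that mimics the assumed layout.
demo=$(mktemp)
printf 'RUN ollama pull qwen2.5:3b\nRUN ollama pull gemma3:1b\n' > "$demo"
baked_models "$demo"
rm -f "$demo"
```

Against the real file, the call would be `baked_models ollama/Dockerfile`.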
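Ollama loads models into RAM on demand rather than at startup, so having multiple models on disk doesn't multiply runtime memory: image size grows, but RAM only needs to fit the largest single model in use. This holds as long as only one model is resident at a time; Ollama can keep several loaded when memory allows, and setting OLLAMA_MAX_LOADED_MODELS=1 caps it at one.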
Sizing guide for the Clever Cloud instance (runtime RAM, with 8K context):
| Model | Disk | Runtime RAM | Min instance |
|---|---|---|---|
| gemma3:1b | 815 MB | ~2.0 GB | S (~2 GB), tight |
| qwen2.5:1.5b | 986 MB | ~1.8 GB | S (~2 GB) |
| qwen2.5:3b (default) | 1.9 GB | ~3.5 GB | M (~4 GB) |
| llama3.2:3b | 2.0 GB | ~3.5 GB | M (~4 GB) |
| gemma3:4b | 3.3 GB | ~5 GB | L (~8 GB) |
| qwen2.5:7b | 4.7 GB | ~6 GB | L (~8 GB) |
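The sizing rule (RAM must fit the largest single model in use, not the sum) can be sketched as a lookup plus a maximum. The figures are copied from the table above in tenths of a GB; the function names `model_ram` and `peak_ram` are hypothetical, and the sketch assumes only one model is resident at a time.

```shell
#!/bin/sh
# Approximate runtime RAM per model in tenths of a GB (8K context), from the sizing table.
model_ram() {
  case "$1" in
    gemma3:1b)    echo 20 ;;
    qwen2.5:1.5b) echo 18 ;;
    qwen2.5:3b)   echo 35 ;;
    llama3.2:3b)  echo 35 ;;
    gemma3:4b)    echo 50 ;;
    qwen2.5:7b)   echo 60 ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Print the largest single-model RAM figure among the given models.
peak_ram() {
  peak=0
  for m in "$@"; do
    r=$(model_ram "$m") || return 1
    if [ "$r" -gt "$peak" ]; then peak=$r; fi
  done
  echo "$peak"
}

# The two models baked into the default image: prints 35, i.e. about 3.5 GB.
peak_ram qwen2.5:3b gemma3:1b
```

For the default image this gives roughly 3.5 GB, which matches the M (~4 GB) row for qwen2.5:3b.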
For runtime-pulled models (not baked in), attach a Clever Cloud FS Bucket add-on mounted at /root/.ollama so models persist across restarts.
The Ollama API has no authentication. The Clever Cloud Ollama app must NOT have a public domain — only the partners-connect server (on the same network group) should be able to reach it. See DEPLOYMENT.md.