Note: this repo is a showcase of the LUMINAL setup. It's not a turnkey deployment and isn't meant to be cloned and run as-is.
LUMINAL is a self-hosted AI stack. It runs on a Proxmox VM with an NVIDIA GPU and stitches together workflow automation, local LLM inference, RAG, and smart-home control in one Docker Compose project.
The goal: run a real AI product entirely on self-hosted hardware. Models, auth, data, everything. No hosted inference APIs, no commercial accounts, nothing leaving the network.
Midnight is a custom assistant written on top of OpenWebUI that talks to the HELIOS media library through 7 Python tools (Plex, Radarr, Sonarr, Tautulli, Bazarr, SABnzbd, Seerr). It answers questions by calling live APIs instead of making stuff up.
Everything runs as Docker containers on a Proxmox VM. The stack breaks down like this:
- Auth: Cloudflare Access sits in front. Google OAuth via trusted headers. No local passwords.
- Interface: OpenWebUI is the frontend. It hosts Midnight, does RAG against Qdrant, and sends LLM calls to Ollama.
- Inference & data: Ollama runs three local LLMs with GPU passthrough. Qdrant holds the RAG vectors. n8n handles visual workflow automation.
- Physical world: Home Assistant plus Matter Server, both on host networking so mDNS device discovery works.
Midnight talks to a separate media stack (HELIOS) over HTTP APIs. LUMINAL is the brain, HELIOS is the library.
| Category | Service | Role in the system |
|---|---|---|
| AI Services | n8n | Visual workflow engine. Glue for cross-service automation. |
| AI Services | OpenWebUI | Chat interface and tool runtime. Hosts Midnight, handles RAG. |
| AI Services | Ollama | Local LLM inference with GPU acceleration. |
| AI Infrastructure | Qdrant | Vector DB for OpenWebUI's RAG. |
| AI Infrastructure | SearXNG | Metasearch backend for OpenWebUI's web-search RAG. |
| AI Infrastructure | Docker | Containers, named volumes, GPU passthrough. |
| Home Automation | Home Assistant | Device control hub. Runs on host network for discovery. |
| Home Automation | Matter Server | Matter protocol bridge for HA. |
| Security | Cloudflare Access | Zero Trust SSO. Google OAuth in front of OpenWebUI. |
Three models get pulled on first boot and cached on disk. Each does something different:
- llama3.1:8b (4.9 GB): fast general-purpose model. Good for quick chat and simple tool calls.
- gemma4:e4b (9.6 GB): multimodal with native tool use. This is what Midnight runs on.
- gpt-oss:20b (~13 GB on disk, 20B params): heavier reasoning when capability matters more than latency.
All three share one Ollama instance and one GPU.
Midnight is a custom AI assistant built on top of OpenWebUI that queries the HELIOS media library. It uses function tools for everything. No question gets answered from what the model "knows" if a tool could answer from live data.
| Component | Description |
|---|---|
| Base Model | gemma4:e4b via Ollama |
| Interface | OpenWebUI with a custom system prompt |
| Tools | 7 Python function tools |
| Knowledge | RAG-indexed reference docs |
| Tool | What it does |
|---|---|
| midnight_plex_tool | Library search, recently added, episode details, cast, actor/director lookup |
| midnight_radarr_tool | Movie details, genres, synopses |
| midnight_sonarr_tool | TV show details, upcoming episodes |
| midnight_tautulli_tool | Watch history, current activity, most watched |
| midnight_bazarr_tool | Subtitle status and history |
| midnight_sabnzbd_tool | Download queue and history |
| midnight_seerr_tool | Content requests and search |
- Always calls a tool. Never answers a library question from model knowledge.
- Normalizes curly quotes and special characters before sending to APIs.
- Pulls real episode synopses from Plex instead of guessing plot summaries.
- Returns the actual Plex "added on" date, not the file's download timestamp.
- Says "I don't see that in the library" when something isn't there, instead of inventing a plausible answer.
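The quote-normalization rule above can be sketched roughly as follows. This is a minimal illustration, not Midnight's actual code: the function name `normalize_query` and the exact character map are assumptions.

```python
import unicodedata

# Map typographic punctuation to the ASCII forms most APIs expect.
# The set of characters covered here is an assumption for illustration.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # non-breaking space
})


def normalize_query(text: str) -> str:
    """Return text with curly quotes and similar characters replaced."""
    # NFC keeps accented letters as single code points, so real
    # titles with diacritics pass through unchanged.
    text = unicodedata.normalize("NFC", text)
    return text.translate(_PUNCT_MAP).strip()
```

Without a step like this, a title typed on a phone keyboard with a curly apostrophe may fail to match the same title stored with a straight apostrophe.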
"What movies do we have with Tom Hanks?"
"What's new in the library?"
"What's the Bob's Burgers episode 'It's a Stunterful Life' about?"
"Show me Christmas movies"
"What's currently downloading?"
"Who's watching right now?"
See midnight/README.md for the full system prompt and tool docs.
Why things are set up the way they are.
OpenWebUI doesn't have its own login. Cloudflare Access sits in front, redirects to Google, and passes the authenticated email via a trusted header (Cf-Access-Authenticated-User-Email). OpenWebUI auto-creates the user from that header. No local passwords to manage, and access policy lives in one place instead of scattered across services.
OpenWebUI trusts that header regardless of source IP, which is only safe if nothing untrusted can reach the port. Here cloudflared runs on a separate LAN host and connects to OpenWebUI over the network, so the port has to stay published on the LAN and binding to loopback is not an option. Closing the direct-access/header-spoofing gap instead means restricting port 3000 to the tunnel host at the firewall (a DOCKER-USER iptables allowlist, since Docker's published ports bypass ufw). FORWARDED_ALLOW_IPS pins which upstream uvicorn trusts for X-Forwarded-* headers as defense-in-depth.
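To make the trust problem concrete, here is a small sketch of the rule the firewall enforces: the identity header only means something when the request came from the tunnel host. LUMINAL applies this at the iptables level, not in application code, and OpenWebUI does not perform this check itself. The function name `trusted_email` and the example addresses are hypothetical.

```python
import ipaddress
from typing import Mapping, Optional, Sequence

HEADER = "Cf-Access-Authenticated-User-Email"


def trusted_email(
    headers: Mapping[str, str],
    peer_ip: str,
    allowed: Sequence[str],
) -> Optional[str]:
    """Return the header's email only if the peer is an allowed upstream."""
    peer = ipaddress.ip_address(peer_ip)
    # A bare address such as "192.168.1.10" is treated as a /32 network.
    if not any(peer in ipaddress.ip_network(net) for net in allowed):
        return None  # anyone else could have forged the header
    # HTTP header names are case-insensitive.
    for name, value in headers.items():
        if name.lower() == HEADER.lower():
            return value.strip() or None
    return None
```

A request from any other LAN host carrying the same header would be rejected here, which is the effect the DOCKER-USER allowlist has one layer lower.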
Credentials (n8n encryption key, JWT secret, OpenWebUI session key) are mounted as files via Docker Secrets. They don't show up in env, process dumps, or compose logs. The plaintext files live in a locked-down system directory outside the repo.
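From inside a container, a file-mounted secret is read like any other file. The sketch below assumes Docker's default mount point, /run/secrets. The helper name `read_secret` and the secret name in the usage note are hypothetical.

```python
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Read a Docker secret mounted as a file and strip the trailing newline."""
    path = Path(secrets_dir) / name
    return path.read_text(encoding="utf-8").strip()
```

For example, `read_secret("jwt_secret")` would return the contents of /run/secrets/jwt_secret, assuming a secret with that name is declared in the compose file. The value never passes through an environment variable.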
The real env.sh and secrets/ directory live at /etc/LUMINAL/, symlinked into the project and gitignored. direnv picks them up on cd into the project, so interactive shells, cron jobs, and Docker Compose all see the same values without explicit sourcing. The pattern came after almost committing secrets one too many times.
Every piece of persistent state (n8n workflows, Ollama model cache, Qdrant indices, chat history, HA config) lives in an external Docker named volume. Containers get torn down and recreated without losing anything. Upgrades stop feeling risky.
scripts/docker-rebuild.sh pulls new images first, then runs docker compose up -d so only the services whose images actually changed get recreated. Everything else keeps running. It also runs a health check that skips the one-shot Ollama pullers (they're supposed to exit), retries transient failures, and returns 0/1/2 exit codes so cron can alert properly. --dry-run shows what would change without touching anything.
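The health-check logic described above (skip the one-shot pullers, retry, return a distinct exit code) might look like this in outline. The real script is shell; this Python version is only a sketch. The meaning assigned to each exit code (0 healthy, 1 unhealthy service, 2 check could not run) is an assumption, as are the function and service names.

```python
import time
from typing import Callable, Collection, Iterable

# Assumed meanings; the README only says the script returns 0/1/2.
EXIT_OK, EXIT_UNHEALTHY, EXIT_ERROR = 0, 1, 2


def check_stack(
    services: Iterable[str],
    probe: Callable[[str], bool],
    skip: Collection[str] = (),
    retries: int = 3,
    delay: float = 0.0,
) -> int:
    """Probe each service, retrying failures, and return an exit code."""
    unhealthy = []
    for name in services:
        if name in skip:
            continue  # one-shot containers are expected to have exited
        for attempt in range(retries):
            try:
                if probe(name):
                    break  # healthy, stop retrying
            except Exception:
                return EXIT_ERROR  # the probe itself could not run
            if attempt < retries - 1:
                time.sleep(delay)
        else:
            # The loop never hit break: every attempt failed.
            unhealthy.append(name)
    return EXIT_UNHEALTHY if unhealthy else EXIT_OK
```

Keeping "a service is down" and "the check could not run" as separate codes lets a cron wrapper alert differently on each.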
Every long-running service declares its own Docker healthcheck (Qdrant probes its port via bash's /dev/tcp since its image ships no HTTP client; the rest hit /healthz-style endpoints), and startup ordering is gated on condition: service_healthy instead of fixed sleeps. So the script's health check gets a true signal from the whole stack, and the model pullers wait for Ollama to actually be serving before they run.
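The /dev/tcp trick is a plain TCP connect: if the port accepts a connection, the service counts as up. The same idea in Python, as a sketch rather than the healthcheck the repo uses (the function name `port_open` is hypothetical):

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For Qdrant this could be called as `port_open("localhost", 6333)`. 6333 is Qdrant's default HTTP port; whether LUMINAL keeps that default is an assumption. A connect-only probe shows that the port is listening, not that the service answers requests correctly, which is the trade-off of having no HTTP client in the image.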
Midnight's system prompt assumes the model will hallucinate if allowed to. Every question has to go through a tool call. The prompt explicitly bans answering from model knowledge when a tool could answer instead. It normalizes curly quotes in input and uses RAG against MIDNIGHT_REFERENCE.md to pick the right tool. Trade-off: Midnight is occasionally too strict and refuses things it could reasonably answer. Better than made-up movie titles.
Ollama runs all LLM inference on the NVIDIA GPU at hardware speed: no API costs, no rate limits, nothing leaving the box. OpenWebUI runs the :cuda image so its RAG side (embeddings, reranking, Whisper STT) is GPU-accelerated too; it does not run LLM inference itself, which is Ollama's job. Both reserve the GPU in the compose file (the plain :latest OpenWebUI image is CPU-only and would silently ignore the reservation).
See CHANGELOG.md for version history and evolution.