Ollama: Elite Local AI Development Platform

Full-stack AI infrastructure for building, deploying, and monitoring large language models, with a production web interface

Vision

Ollama is a full-stack AI platform for engineers who need production-grade reliability, security, and performance. It runs state-of-the-art language models entirely on your local infrastructure behind a web interface. All AI workloads run locally in Docker, with an optional GCP Load Balancer for public access.

Architecture:

Backend: Python/FastAPI + PostgreSQL + Redis + Ollama
Frontend: Next.js 14 + React 18 + TypeScript + Firebase OAuth
Deployment: Docker containers + GCP Load Balancer for https://elevatediq.ai/ollama

Target Audience: Elite engineers, research teams, enterprises requiring air-gapped AI systems, and developers building custom AI applications.

Production Status ✅:

Backend: Verified with 50-user load test (7,162 requests, 100% success, 75ms P95 latency)
Frontend: Production-ready Next.js with OAuth, real-time chat, streaming responses
Live Platform: https://elevatediq.ai/ollama
Infrastructure: GCP Landing Zone

Features

Backend (FastAPI)

🚀 High-Performance API: FastAPI with async I/O
🧠 Multi-Model Support: Ollama, OpenAI-compatible APIs
💾 PostgreSQL + Redis: Conversation persistence and caching
🔐 Firebase Authentication: OAuth with Google Sign-In
📊 Prometheus Metrics: Production-grade observability
🔒 Security: Rate limiting, API keys, CORS, TLS 1.3+

Frontend (Next.js)

💬 Real-time Chat: Stream responses from LLMs with conversation history
🔐 OAuth Integration: Secure Google Sign-In via Firebase
🎨 Modern UI: Tailwind CSS with custom dark theme
📱 Responsive Design: Mobile-first, works on all devices
⚡ Optimized Performance: Code splitting, lazy loading, <200KB bundle
🧪 Type Safety: Full TypeScript coverage with strict mode

Documentation

📚 Complete Documentation Portal: docs/shared/README.md

📘 Repository Instructions - canonical instruction index for .github/

Quick Links

Getting Started:

📖 Development Setup Guide - Complete environment setup
🚀 Quick Start Guide - Get running in 10 minutes
🏗️ Architecture Overview - System design and components

Development:

🤝 Contributing Guidelines - How to contribute
📋 Copilot Instructions - AI assistant guidelines
🧪 Testing Guide - Test strategy and coverage

Operations:

🧭 On-Prem Execution Index - Primary target-server-local navigation
🏠 On-Prem Deployment Model - Canonical host inventories and immutable execution rules
🚢 Deployment Guide - Production deployment procedures (reference)
🧭 Shared Documentation Navigation - Canonical shared navigation layer
📚 Documentation SSOT - Canonical docs map and ownership rules
📐 Repository Rules - Canonical repo rules and naming constraints
🧱 Documentation Meta - documentation layers and ownership
🔤 Standard Naming Convention - canonical naming rules
📊 Monitoring & Observability - Metrics, logs, and alerts
📖 Operational Runbooks - Incident response procedures

API Reference:

🔌 API Documentation - Complete REST endpoint reference
🔐 Authentication Guide - Firebase OAuth setup
📡 Public API Access - Using the public endpoint

Compliance & Security:

✅ Landing Zone Compliance - GCP compliance status
🔒 Security Guide - Security best practices
🏛️ Standards Reference - Code quality standards

All Documentation

Browse the complete Shared Documentation Navigation for the canonical guide map, or the Indexed Documentation Hub for legacy compatibility snapshots.

Development & Contributing

New to Ollama development? Start here:

📖 Development Setup Guide - Complete environment setup for developers
🤝 Contributing Guidelines - How to contribute
📋 Standards & Compliance - Development standards
🔍 Shared Documentation Navigation - All documentation organized by topic
📝 Incomplete Tasks - Outstanding work items and roadmap

Quality Assurance

This project uses automated quality checks:

Type Checking: mypy ollama/ --strict (GitHub Actions)
Code Formatting: Black + Ruff (Pre-commit hooks + GitHub Actions)
Testing: 90%+ coverage with pytest (GitHub Actions)
Security: pip-audit, Bandit, CodeQL (GitHub Actions)
Linting: Ruff with strict rules (Pre-commit hooks + GitHub Actions)

Local Checks (before committing):

# Run all quality checks locally
pre-commit run --all-files

# Or run individually:
mypy ollama/ --strict
ruff check ollama/ --fix
black ollama/ tests/ --check
pytest tests/ --cov=ollama
pip-audit

Quick Start

Web Interface

1. Visit the live platform: https://elevatediq.ai/ollama
2. Sign in with Google (Firebase OAuth)
3. Start chatting with LLMs

🌐 Web Interface Features:
✅ Real-time chat with streaming responses
✅ Multiple AI models (llama3.2, mistral, codellama, etc.)
✅ Conversation history and persistence
✅ Markdown rendering with syntax highlighting
✅ Responsive design (mobile, tablet, desktop)
✅ Dark mode optimized for long sessions

API Access

# Use the public API endpoint
curl -H "X-API-Key: your-api-key" \
  https://elevatediq.ai/ollama/health

# Python client with public endpoint
from ollama import Client

client = Client(
    base_url="https://elevatediq.ai/ollama",
    api_key="your-api-key"
)

response = client.generate(
    model="llama2",
    prompt="What is local AI?"
)

Local Development Setup

Backend Setup

# Clone repository
git clone https://github.com/kushin77/ollama.git
cd ollama

# Install backend dependencies
pip install -r requirements/core.txt

# Start backend services
docker-compose up -d

# Run development server
uvicorn ollama.main:app --reload --host 0.0.0.0 --port 8000

Frontend Setup

# Navigate to frontend directory
cd frontend

# Install dependencies
npm ci

# Configure environment
cp .env.example .env.local
# Edit .env.local with your Firebase credentials

# Start development server
npm run dev
# Open http://localhost:3000

Full Documentation:

Backend: docs/DEPLOYMENT.md
Frontend: frontend/README.md

# Clone and initialize
git clone https://github.com/kushin77/ollama.git
cd ollama
./scripts/bootstrap.sh --production

# Start the stack (development uses real IP, NOT localhost)
export REAL_IP=$(hostname -I | awk '{print $1}')
sed -i "s|PUBLIC_API_URL=.*|PUBLIC_API_URL=http://$REAL_IP:8000|" .env.dev
docker-compose -f docker/docker-compose.local.yml up -d

# Verify health via real IP
curl -s http://$REAL_IP:8000/health | jq .

Docker Quick Start

# Production deployment (through GCP Load Balancer)
curl -H "X-API-Key: your-api-key" \
  https://elevatediq.ai/ollama/api/v1/health

# Local development deployment
export REAL_IP=$(hostname -I | awk '{print $1}')
docker run -d \
  --name ollama \
  --gpus all \
  -p $REAL_IP:8000:8000 \
  -v ollama-models:/root/.ollama \
  -e PUBLIC_API_URL="http://$REAL_IP:8000" \
  kushin77/ollama:latest

# Pull a model and test
docker exec ollama ollama pull llama2
docker exec ollama ollama run llama2 "Why is local AI important?"

Architecture

High-Level System Design

Local Deployment

Application → API Server (localhost:8000) → Inference Engine

Public Endpoint via GCP Load Balancer

Client → HTTPS (elevatediq.ai) → GCP LB → API Server (8000) → Inference Engine
                                   ↓
                              TLS Termination
                              Rate Limiting
                              Security Headers

Full Architecture

┌─────────────────────────────────────────────────────────┐
│                    Application Layer                    │
│  (FastAPI, Gradio UI, CLI Tools, Custom Integrations)   │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│                   Ollama API Gateway                    │
│  (Request validation, rate limiting, caching, routing)  │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│                 Inference Engine Layer                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  LLM Worker  │  │  LLM Worker  │  │  LLM Worker  │   │
│  │   (GPU 0)    │  │   (GPU 1)    │  │   (GPU N)    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │          Model Cache & Context Manager           │   │
│  │         (Weights, Embeddings, KV Cache)          │   │
│  └──────────────────────────────────────────────────┘   │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│                  Storage & State Layer                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  PostgreSQL  │  │ Redis Cache  │  │  Vector DB   │   │
│  │  (Metadata)  │  │  (Sessions)  │  │ (Embeddings) │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│            Monitoring & Observability Layer             │
│           (Prometheus, Grafana, Loki, Jaeger)           │
└─────────────────────────────────────────────────────────┘


### Component Breakdown

| Component | Purpose | Technology |
|-----------|---------|-----------|
| **API Gateway** | Request routing, auth, rate limiting | FastAPI, gRPC |
| **Inference Workers** | Model execution with GPU acceleration | PyTorch, vLLM, TensorRT |
| **Model Registry** | Version control and management | Custom + Hugging Face |
| **Cache Layer** | Response and KV-cache optimization | Redis, in-memory |
| **Vector Database** | Semantic search and RAG support | Qdrant, Milvus |
| **Telemetry** | Metrics, traces, logs | Prometheus, Jaeger, Loki |
| **State Store** | Persistent metadata and conversation history | PostgreSQL |

---

## Features

### Core Capabilities
- ✅ **Multi-Model Support**: Run multiple models simultaneously with resource isolation
- ✅ **GPU Acceleration**: Automatic CUDA/Metal/ROCm detection and optimization
- ✅ **Distributed Inference**: Scale across multiple GPUs and machines
- ✅ **Model Quantization**: 4-bit, 8-bit, mixed-precision inference
- ✅ **Context Caching**: Efficient KV-cache management and reuse
- ✅ **RAG Integration**: Built-in vector database for semantic retrieval
- ✅ **Streaming Responses**: Server-sent events for real-time output
- ✅ **Batch Processing**: Efficient inference for multiple requests

### Advanced Features
- 🔒 **Air-Gapped Security**: No phone-home, full data isolation
- 📊 **Comprehensive Observability**: Prometheus metrics, distributed tracing
- 🔄 **Auto-Scaling**: Dynamic resource allocation based on load
- 🎯 **Fine-Tuning Support**: Local model adaptation with training infrastructure
- 🔐 **Multi-Tenant Isolation**: Namespace-based resource segregation
- 📦 **Model Versioning**: Content-addressed model storage with rollback
- 🚀 **Performance Profiling**: Built-in benchmarking and optimization tools

---

## Prerequisites

### Hardware Requirements

**Minimum** (for experimentation):
- GPU: 6GB VRAM (RTX 2060 or equivalent)
- CPU: 4-core modern processor
- RAM: 16GB system memory
- Storage: 100GB NVMe SSD

**Recommended** (production):
- GPU: 24GB+ VRAM (A100, RTX 4090, or enterprise GPU)
- CPU: 16+ cores, high single-thread performance
- RAM: 64GB+ system memory
- Storage: 500GB+ NVMe SSD (fast I/O critical)
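
As a rough way to relate these VRAM figures to model size, the sketch below estimates the memory taken by model weights alone from the parameter count and the bits stored per weight. It is an approximation under stated assumptions: it ignores the KV cache, activations, and runtime overhead, and real quantization formats typically store somewhat more than their nominal bit width per weight. The function name and the 7B example are illustrative, not part of this project.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical example: a 7B-parameter model at 4-bit and at 16-bit precision.
for bits in (4, 16):
    print(f"7B @ {bits}-bit: ~{estimate_weight_memory_gb(7e9, bits):.1f} GB")
```

Leave headroom above the weight estimate for context length and batch size, since both grow the KV cache.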

### Software Requirements

```bash
# Linux (Ubuntu 22.04 LTS or RHEL 9+)
- CUDA 12.1+ OR ROCm 5.6+ (for GPU support)
- Docker 24.0+
- Docker Compose 2.20+
- Python 3.11+
- Git 2.40+

# Optional but recommended
- NVIDIA Container Toolkit (for GPU in Docker)
- Prometheus 2.40+
- Grafana 9.0+
- PostgreSQL 15+
```

Installation

Method 1: Docker Compose (Recommended for Production)

git clone https://github.com/kushin77/ollama.git
cd ollama

# Copy environment template
cp .env.example .env

# Configure for your environment
nano .env  # Set GPU, RAM, model paths

# Start production stack
docker-compose -f docker/docker-compose.prod.yml up -d

# Verify services
docker-compose -f docker/docker-compose.prod.yml ps
curl http://localhost:8000/health

Method 2: Local Development Installation

# Prerequisites
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements/core.txt
pip install -r requirements/dev.txt  # For development

# Initialize database
python scripts/init_db.py

# Download base models
ollama pull llama2
ollama pull mistral
ollama pull neural-chat

# Start development server
python -m ollama.server --config config/development.yaml

Method 3: From Source (Advanced)

git clone https://github.com/kushin77/ollama.git
cd ollama

# Build Docker images
docker build -t ollama:latest -f Dockerfile.prod .
docker build -t ollama-worker:latest -f Dockerfile.worker .

# Run with custom configuration
docker-compose -f docker-compose.custom.yml up

Configuration

Public Endpoint Configuration

For elevatediq.ai/ollama deployments via GCP Load Balancer:

# config/production.yaml
server:
  public_url: "https://elevatediq.ai/ollama"
  domain: "elevatediq.ai"

security:
  api_key_auth_enabled: true
  cors_origins:
    - "https://elevatediq.ai"
    - "https://*.elevatediq.ai"
  tls_enabled: false # TLS handled by GCP LB

# .env
OLLAMA_PUBLIC_URL=https://elevatediq.ai/ollama
OLLAMA_DOMAIN=elevatediq.ai
API_KEY_AUTH_ENABLED=true
CORS_ORIGINS=["https://elevatediq.ai","https://*.elevatediq.ai"]

See docs/gcp-load-balancer.md for complete GCP configuration.
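
The wildcard entry in `cors_origins` implies some form of pattern matching on the request origin. How this project's server interprets the pattern is not stated here, so the following is only a minimal sketch of one possible interpretation using shell-style matching from the standard library. `is_origin_allowed` is a hypothetical helper, not a function from this codebase.

```python
from fnmatch import fnmatchcase

# Origins copied from the production configuration above.
ALLOWED_ORIGINS = ["https://elevatediq.ai", "https://*.elevatediq.ai"]


def is_origin_allowed(origin: str, patterns: list[str] = ALLOWED_ORIGINS) -> bool:
    """Return True if the origin matches an exact entry or a wildcard pattern."""
    # fnmatchcase is case-sensitive and lets "*" match any run of characters.
    return any(fnmatchcase(origin, pattern) for pattern in patterns)


print(is_origin_allowed("https://app.elevatediq.ai"))
print(is_origin_allowed("https://example.com"))
```

Because `*` matches any characters, a stricter check would parse the hostname first. If the API relies on Starlette's `CORSMiddleware`, note that `allow_origins` entries are compared literally and subdomain patterns go through `allow_origin_regex` instead.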

Local Development Configuration

# .env.example
OLLAMA_HOST=0.0.0.0:8000
OLLAMA_MODELS_PATH=/models
OLLAMA_CACHE_SIZE=50G
OLLAMA_GPU_MEMORY=24000  # MB

# Database
DATABASE_URL=postgresql://ollama:password@localhost:5432/ollama
REDIS_URL=redis://localhost:6379/0

# Monitoring
PROMETHEUS_ENABLED=true
JAEGER_ENABLED=true
LOG_LEVEL=INFO

# Security
API_KEY_AUTH_ENABLED=true
CORS_ORIGINS=["http://localhost:3000"]
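
To check a `DATABASE_URL` value before starting the stack, it can help to split it into its parts. The sketch below does this with the standard library only. `parse_database_url` is a hypothetical helper for illustration and the URL is the example value from the template above.

```python
from urllib.parse import urlsplit


def parse_database_url(url: str) -> dict:
    """Split a PostgreSQL-style URL into its connection components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


settings = parse_database_url("postgresql://ollama:password@localhost:5432/ollama")
print(settings["host"], settings["port"], settings["database"])
```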

Model Configuration (`config/models.yaml`)

models:
  llama2:
    source: huggingface # or 'local', 'ollama-registry'
    model_id: meta-llama/Llama-2-7b-chat
    quantization: q4_K_M # q4_K_M, q5_K_M, fp16, bf16
    context_length: 4096
    gpu_memory_reserved: 10G
    batch_size: 8
    max_concurrent: 2

  mistral:
    source: huggingface
    model_id: mistralai/Mistral-7B-Instruct-v0.1
    quantization: q5_K_M
    context_length: 32768
    gpu_memory_reserved: 12G

caching:
  enabled: true
  type: redis # or 'memory'
  ttl: 3600

performance:
  enable_paging: true
  enable_tiling: false
  prefill_batch_size: 16
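
A quick sanity check on this configuration is whether the per-model `gpu_memory_reserved` values fit inside the `OLLAMA_GPU_MEMORY` budget from the environment file. The sketch below works through that arithmetic. It assumes binary units (1G = 1024 MB) and hard-codes the values rather than reading the YAML file, and the helper names are illustrative only.

```python
# Values mirrored from config/models.yaml and .env.example.
MODELS = {
    "llama2": {"gpu_memory_reserved": "10G"},
    "mistral": {"gpu_memory_reserved": "12G"},
}
GPU_MEMORY_MB = 24000  # OLLAMA_GPU_MEMORY

_UNITS_MB = {"M": 1, "G": 1024}  # assumption: binary units


def to_megabytes(size: str) -> int:
    """Convert a size string such as '10G' or '512M' to megabytes."""
    number, unit = size[:-1], size[-1].upper()
    return int(number) * _UNITS_MB[unit]


def total_reserved_mb(models: dict) -> int:
    """Sum the GPU memory reserved across all configured models."""
    return sum(to_megabytes(m["gpu_memory_reserved"]) for m in models.values())


reserved = total_reserved_mb(MODELS)
fits = reserved <= GPU_MEMORY_MB
print(f"reserved={reserved} MB, budget={GPU_MEMORY_MB} MB, fits={fits}")
```

With the example values, the two models reserve 22528 MB of a 24000 MB budget, so both can stay resident with a little room to spare.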

Usage

CLI Usage

# List available models
ollama list

# Pull and run a model
ollama pull llama2
ollama run llama2

# Direct inference with prompts
ollama run llama2 "What are the benefits of local AI?"

# Streaming output
ollama run mistral --stream "Explain quantum computing"

# Use with template
ollama run llama2 --template "Your prompt: {text}"

# Statistics and benchmarks
ollama stats

REST API

# Health check
curl http://localhost:8000/health

# List models
curl http://localhost:8000/api/models

# Create completion (streaming)
curl -X POST http://localhost:8000/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Why is local AI important?",
    "stream": true,
    "context": []
  }'

# Chat completions (OpenAI-compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {"role": "system", "content": "You are an expert engineer"},
      {"role": "user", "content": "Explain RAG"}
    ],
    "temperature": 0.7
  }'

# Embeddings endpoint
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embedding-model",
    "input": "Generate embedding for this text"
  }'

Python Client

from ollama import Client

client = Client(base_url="http://localhost:8000")

# Simple completion
response = client.generate(
    model="llama2",
    prompt="Explain machine learning",
    stream=False
)
print(response.text)

# Chat interface
response = client.chat(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are an AI expert"},
        {"role": "user", "content": "What is RAG?"}
    ],
    temperature=0.7
)
print(response.message.content)

# Embeddings
embeddings = client.embeddings(
    model="embedding-model",
    input="Generate vector representation"
)
print(embeddings.data[0].embedding)

# Streaming
for chunk in client.generate_stream(
    model="llama2",
    prompt="Tell a story about local AI"
):
    print(chunk.response, end="", flush=True)
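
For clients that read the streaming endpoint without the Python client, the sketch below shows one way to assemble text from a stream. It assumes the server emits newline-delimited JSON objects, each with a `response` fragment and a `done` flag. That wire format is an assumption here and may differ if the server uses server-sent events. The sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the server signalled the end of the generation


# Hypothetical sample stream, for illustration only.
sample = [
    b'{"response": "Local ", "done": false}\n',
    b'{"response": "AI", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
text = "".join(iter_response_text(sample))
print(text)
```

The same generator can wrap any iterable of byte lines, such as an HTTP response object read line by line.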

Model Management

Downloading Models

# From Ollama registry
ollama pull llama2
ollama pull mistral

# Specific versions/sizes
ollama pull llama2:7b-chat-q4_0
ollama pull llama2:13b-chat-fp16

# From Hugging Face
python scripts/download_model.py \
  --source huggingface \
  --model meta-llama/Llama-2-7b-chat \
  --quantization q4_K_M

# Custom models
python scripts/import_model.py \
  --path /path/to/gguf/model.gguf \
  --name custom-model

Model Versioning

# List versions
ollama list --versions

# Pin specific version
ollama pull llama2:sha256:abc123def456

# Delete old versions
ollama rm llama2:old-version

# Export for backup
ollama export llama2 > llama2-backup.tar.gz

# Restore from backup
ollama import llama2-backup.tar.gz

Fine-tuning

# Prepare dataset
python scripts/prepare_finetuning_data.py \
  --input training_data.jsonl \
  --output prepared_data

# Fine-tune model
python scripts/finetune.py \
  --model llama2 \
  --data prepared_data \
  --output-dir ./fine-tuned-models \
  --epochs 3 \
  --learning-rate 1e-4

# Merge and quantize
python scripts/merge_lora.py \
  --base llama2 \
  --lora ./fine-tuned-models/lora \
  --output custom-llama2

# Benchmark
python scripts/benchmark.py --model custom-llama2

API Reference

Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/health` | GET | Health check |
| `/api/models` | GET | List available models |
| `/api/generate` | POST | Text generation (streaming) |
| `/api/embedding` | POST | Generate embeddings |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/completions` | POST | OpenAI-compatible completion |
| `/v1/embeddings` | POST | OpenAI-compatible embeddings |
| `/admin/stats` | GET | System metrics |
| `/admin/reload` | POST | Reload configuration |

Authentication

# Set API key
export OLLAMA_API_KEY="your-secret-key"

# Include in requests
curl -H "Authorization: Bearer $OLLAMA_API_KEY" \
  http://localhost:8000/api/models
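
The same bearer-token header can be attached from Python with the standard library. This is a minimal sketch: `build_request` is a hypothetical helper, and the placeholder key is used only when `OLLAMA_API_KEY` is not set.

```python
import os
from urllib.request import Request


def build_request(url: str, api_key: str) -> Request:
    """Create a GET request carrying the API key as a bearer token."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


# Falls back to a placeholder if OLLAMA_API_KEY is unset.
api_key = os.environ.get("OLLAMA_API_KEY", "your-secret-key")
request = build_request("http://localhost:8000/api/models", api_key)
# To send it, pass `request` to urllib.request.urlopen with the API server running.
```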

Monitoring & Observability

Prometheus Metrics

Access dashboard at http://localhost:9090

Key metrics:

ollama_request_duration_seconds: Inference latency
ollama_tokens_generated_total: Cumulative token count
ollama_model_memory_bytes: Per-model memory usage
ollama_gpu_utilization_percent: GPU usage
ollama_queue_depth: Pending requests

# Query example
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(ollama_tokens_generated_total[5m])'
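
PromQL expressions contain characters such as square brackets and parentheses that need percent-encoding when placed in a URL. The sketch below builds the query URL for the Prometheus HTTP API with the standard library. `build_query_url` is an illustrative helper, and the base URL assumes the local Prometheus address given above.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return a query URL with the PromQL expression percent-encoded."""
    return f"{base_url}?{urlencode({'query': promql})}"


url = build_query_url("rate(ollama_tokens_generated_total[5m])")
print(url)
```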

Grafana Dashboards

Pre-built dashboards for:

System resources (CPU, RAM, GPU, Disk)
Model performance (latency, throughput, tokens/sec)
Request patterns (volume, errors, queue depth)
Cost analysis (compute time, energy consumption)

Distributed Tracing (Jaeger)

Access at http://localhost:16686

Traces capture:

Complete request flow from API → model inference
Component latencies (cache lookups, model execution)
Error spans with context
Resource utilization per span

Logging

# View logs with filtering
docker-compose logs -f ollama-api --tail=100
docker-compose logs ollama-worker-1 | grep "ERROR"

# Structured logging export
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="ollama"}'

Performance Tuning

Optimization Checklist

GPU: Ensure CUDA/ROCm properly initialized

python -c "import torch; print(torch.cuda.is_available())"

Quantization: Use q4 for speed, q5/fp16 for quality

# Benchmark
python scripts/benchmark_quantization.py

Batch Size: Profile optimal throughput

# config/models.yaml
llama2:
  batch_size: 8 # Adjust based on VRAM

Context Caching: Enable for chat workflows
```
caching:
  enabled: true
  type: redis
```

Model Pruning: Remove unused weights

python scripts/prune_model.py --model llama2 --ratio 0.1

Benchmarking

# Comprehensive benchmark
python scripts/benchmark.py \
  --models llama2 mistral \
  --batch-sizes 1 2 4 8 \
  --prompt-lengths 100 500 1000

# Memory profiling
python -m memory_profiler scripts/inference.py

# Latency percentiles
python scripts/latency_percentiles.py --duration 3600
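
For reference, a latency percentile can be computed from raw samples in a few lines. The sketch below uses the nearest-rank definition. It is not taken from `scripts/latency_percentiles.py`, whose method is not shown here, and the sample data is synthetic.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer from 1 to 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(i) for i in range(1, 101)]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

Other definitions interpolate between neighbouring samples, so values may differ slightly from tools that use them.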

Security

Best Practices

# Enable authentication
export OLLAMA_API_KEY_AUTH=true
export OLLAMA_API_KEYS="key1:hash1,key2:hash2"

# TLS/HTTPS setup (generates a self-signed certificate)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365
export OLLAMA_TLS_CERT=/path/to/cert.pem
export OLLAMA_TLS_KEY=/path/to/key.pem

# Rate limiting per API key
python scripts/setup_rate_limits.py \
  --key user-key \
  --requests-per-minute 100

# Audit logging
export OLLAMA_AUDIT_LOG=/var/log/ollama/audit.log
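
To illustrate what a per-key limit of 100 requests per minute means in practice, here is a minimal token-bucket sketch. It is one common approach to rate limiting, not a description of what `scripts/setup_rate_limits.py` configures. The class and the injectable `clock` parameter are illustrative.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_minute` requests per minute, with bursts up to that size."""

    def __init__(self, rate_per_minute: int, clock=time.monotonic):
        self.capacity = float(rate_per_minute)
        self.refill_per_second = rate_per_minute / 60.0
        self.clock = clock
        self.tokens = self.capacity  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be throttled."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per API key; the key name and limit mirror the command above.
buckets = {"user-key": TokenBucket(rate_per_minute=100)}
print(buckets["user-key"].allow())
```

Passing a fake `clock` makes the limiter easy to test without sleeping.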

Model Validation

# Verify model integrity
ollama verify llama2

# Scan for vulnerabilities
python scripts/scan_model.py --model llama2

# Validate outputs
python scripts/validate_model_outputs.py \
  --model llama2 \
  --test-cases validation_suite.jsonl
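
Integrity checks of this kind usually come down to comparing a file's digest with an expected value. The sketch below shows that idea with SHA-256 from the standard library. It is an assumption that `ollama verify` works along these lines, and both helper functions are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with an expected hex digest in constant time."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A typical call would be `matches_expected(Path("/path/to/gguf/model.gguf"), expected_digest)`, where the expected digest comes from a trusted source.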

Troubleshooting

Common Issues

GPU Not Detected

# Check CUDA installation
nvidia-smi

# Verify PyTorch support
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Update Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi

Out of Memory

# Check current usage
docker stats

# Switch to a more heavily quantized (lower-precision) model
ollama pull llama2:7b-chat-q4_0  # 4-bit weights use less memory

# Limit batch size in config
# Set batch_size: 1 or 2

Slow Inference

# Profile bottleneck
python scripts/profile_inference.py --model llama2

# Check model is quantized
ollama list  # Look for q4/q5 suffix

# Verify GPU in use
nvidia-smi dmon -s puc

Connection Issues

# Check service is running
docker-compose ps

# Verify port availability
netstat -tulpn | grep 8000

# Check logs for errors
docker-compose logs ollama-api

Development

Setting Up Development Environment

# Clone repository
git clone https://github.com/kushin77/ollama.git
cd ollama

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements/dev.txt
pip install -e .  # Install in editable mode

# Run tests
pytest tests/ -v --cov=ollama

# Format code
black ollama/ tests/
isort ollama/ tests/
ruff check ollama/ tests/

# Type checking
mypy ollama/ --strict

# Run linter
pylint ollama/

Project Structure

ollama/
├── .copilot-instructions      # Elite development instructions
├── .github/
│   └── workflows/             # CI/CD pipelines
├── ollama/
│   ├── api/                   # FastAPI server and routes
│   ├── inference/             # Model execution engine
│   ├── models/                # Model management
│   ├── cache/                 # Caching layer
│   ├── embeddings/            # Embedding generation
│   ├── rag/                   # RAG infrastructure
│   ├── monitoring/            # Observability
│   ├── security/              # Authentication, validation
│   └── utils/                 # Shared utilities
├── scripts/
│   ├── bootstrap.sh           # Setup script
│   ├── download_model.py      # Model downloading
│   ├── benchmark.py           # Performance testing
│   └── ...                    # Utility scripts
├── config/
│   ├── development.yaml       # Dev configuration
│   ├── production.yaml        # Production configuration
│   └── models.yaml            # Model definitions
├── docker/
│   ├── Dockerfile             # Main image
│   ├── Dockerfile.worker      # Worker image
│   └── docker-compose.yml     # Local development
├── tests/
│   ├── unit/                  # Unit tests
│   ├── integration/           # Integration tests
│   └── e2e/                   # End-to-end tests
├── docs/
│   ├── architecture.md        # System design
│   ├── api.md                 # API documentation
│   └── deployment.md          # Deployment guide
├── requirements/
│   ├── core.txt               # Production dependencies
│   ├── dev.txt                # Development dependencies
│   └── test.txt               # Testing dependencies
└── README.md                  # This file

Running Tests

# All tests
pytest

# Specific test file
pytest tests/unit/test_inference.py -v

# With coverage
pytest --cov=ollama --cov-report=html

# Only failed tests from last run
pytest --lf

# With output
pytest -s -vv tests/integration/

Contributing

1. Fork the repository
2. Create a feature branch: git checkout -b feature/your-feature
3. Commit atomically: git commit -S -m "feat: add new feature"
4. Push to the branch: git push origin feature/your-feature
5. Open a pull request with a clear description

See CONTRIBUTING.md for detailed guidelines.

Performance Benchmarks

Latency (p99, seconds)

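| Model | Quantization | Batch=1 (s) | Batch=8 (s) | Tokens/sec |
|-------|--------------|-------------|-------------|------------|
| Llama2 7B | q4_K_M | 0.85 | 2.4 | 180 |
| Llama2 13B | q5_K_M | 1.2 | 3.8 | 120 |
| Mistral 7B | q4_K_M | 0.72 | 2.1 | 200 |
| Neural Chat | q4_K_M | 0.65 | 1.9 | 220 |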

Benchmarks on NVIDIA RTX 4090, Ubuntu 22.04, CUDA 12.1

Roadmap

Q1 2026

Multi-GPU distributed inference
Optimized attention mechanisms (FlashAttention-3)
Enhanced RAG with re-ranking

Q2 2026

Fine-tuning infrastructure (LoRA, QLoRA)
Model marketplace integration
Kubernetes deployment support

Q3 2026

Multimodal model support (vision + text)
Advanced caching strategies (prefix caching)
Cost optimization tools

Support & Community

📚 Documentation: docs/
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions
🤝 Contributing: CONTRIBUTING.md

License

MIT License - See LICENSE for details

Citation

@software{ollama2026,
  author = {Kushin, A.},
  title = {Ollama: Elite Local AI Development Platform},
  url = {https://github.com/kushin77/ollama},
  year = {2026},
  note = {Version 1.0.0}
}

Last Updated: January 12, 2026 | Version: 1.0.0 | Maintainer: @kushin77
