Skip to content

PureBlissAK/ollama

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

350 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ollama: Elite Local AI Development Platform

Full-stack AI infrastructure for building, deploying, and monitoring large language models with production web interface

Status Python TypeScript License Maintained

Vision

Ollama is a sophisticated full-stack AI platform designed for engineers who demand production-grade reliability, security, and performance. Run state-of-the-art language models entirely on your local infrastructure with a beautiful web interface—all AI workloads run locally on Docker, with optional GCP Load Balancer for public access.

Architecture:

  • Backend: Python/FastAPI + PostgreSQL + Redis + Ollama
  • Frontend: Next.js 14 + React 18 + TypeScript + Firebase OAuth
  • Deployment: Docker containers + GCP Load Balancer for https://elevatediq.ai/ollama

Target Audience: Elite engineers, research teams, enterprises requiring air-gapped AI systems, and developers building custom AI applications.

Production Status ✅:

  • Backend: Verified with 50-user load test (7,162 requests, 100% success, 75ms P95 latency)
  • Frontend: Production-ready Next.js with OAuth, real-time chat, streaming responses
  • Live Platform: https://elevatediq.ai/ollama
  • Infrastructure: GCP Landing Zone

Features

Backend (FastAPI)

  • 🚀 High-Performance API: FastAPI with async I/O
  • 🧠 Multi-Model Support: Ollama, OpenAI-compatible APIs
  • 💾 PostgreSQL + Redis: Conversation persistence and caching
  • 🔐 Firebase Authentication: OAuth with Google Sign-In
  • 📊 Prometheus Metrics: Production-grade observability
  • 🔒 Security: Rate limiting, API keys, CORS, TLS 1.3+

Frontend (Next.js)

  • 💬 Real-time Chat: Stream responses from LLMs with conversation history
  • 🔐 OAuth Integration: Secure Google Sign-In via Firebase
  • 🎨 Modern UI: Tailwind CSS with custom dark theme
  • 📱 Responsive Design: Mobile-first, works on all devices
  • Optimized Performance: Code splitting, lazy loading, <200KB bundle
  • 🧪 Type Safety: Full TypeScript coverage with strict mode

Documentation

📚 Complete Documentation Portal: docs/shared/README.md

Quick Links

Getting Started:

Development:

Operations:

API Reference:

Compliance & Security:

All Documentation

Browse the complete Shared Documentation Navigation for the canonical guide map, or the Indexed Documentation Hub for legacy compatibility snapshots.

Development & Contributing

New to Ollama development? Start here:

Quality Assurance

This project uses automated quality checks:

  • Type Checking: mypy ollama/ --strict (GitHub Actions)
  • Code Formatting: Black + Ruff (Pre-commit hooks + GitHub Actions)
  • Testing: 90%+ coverage with pytest (GitHub Actions)
  • Security: pip-audit, Bandit, CodeQL (GitHub Actions)
  • Linting: Ruff with strict rules (Pre-commit hooks + GitHub Actions)

Local Checks (before committing):

# Run all quality checks locally
pre-commit run --all-files

# Or run individually:
mypy ollama/ --strict
ruff check ollama/ --fix
black ollama/ tests/ --check
pytest tests/ --cov=ollama
pip-audit

Table of Contents


Quick Start

Web Interface

  1. Visit the live platform: https://elevatediq.ai/ollama
  2. Sign in with Google (Firebase OAuth)
  3. Start chatting with LLMs instantly
🌐 Web Interface Features:
✅ Real-time chat with streaming responses
✅ Multiple AI models (llama3.2, mistral, codellama, etc.)
✅ Conversation history and persistence
✅ Markdown rendering with syntax highlighting
✅ Responsive design (mobile, tablet, desktop)
✅ Dark mode optimized for long sessions

API Access

# Use the public API endpoint
curl -H "X-API-Key: your-api-key" \
  https://elevatediq.ai/ollama/health

# Python client with public endpoint
from ollama import Client

client = Client(
    base_url="https://elevatediq.ai/ollama",
    api_key="your-api-key"
)

response = client.generate(
    model="llama2",
    prompt="What is local AI?"
)

Local Development Setup

Backend Setup

# Clone repository
git clone https://github.com/kushin77/ollama.git
cd ollama

# Install backend dependencies
pip install -r requirements/base.txt

# Start backend services
docker-compose up -d

# Run development server
uvicorn ollama.main:app --reload --host 0.0.0.0 --port 8000

Frontend Setup

# Navigate to frontend directory
cd frontend

# Install dependencies
npm ci

# Configure environment
cp .env.example .env.local
# Edit .env.local with your Firebase credentials

# Start development server
npm run dev
# Open http://localhost:3000

Full Documentation:

# Clone and initialize
git clone https://github.com/kushin77/ollama.git
cd ollama
./scripts/bootstrap.sh --production

# Start the stack (development uses real IP, NOT localhost)
export REAL_IP=$(hostname -I | awk '{print $1}')
sed -i "s|PUBLIC_API_URL=.*|PUBLIC_API_URL=http://$REAL_IP:8000|" .env.dev
docker-compose -f docker/docker-compose.local.yml up -d

# Verify health via real IP
curl -s http://$REAL_IP:8000/health | jq .

Docker Quick Start

# Production deployment (through GCP Load Balancer)
curl -H "X-API-Key: your-api-key" \
  https://elevatediq.ai/ollama/api/v1/health

# Local development deployment
export REAL_IP=$(hostname -I | awk '{print $1}')
docker run -d \
  --name ollama \
  --gpus all \
  -p $REAL_IP:8000:8000 \
  -v ollama-models:/root/.ollama \
  -e PUBLIC_API_URL="http://$REAL_IP:8000" \
  kushin77/ollama:latest

# Pull a model and test
docker exec ollama ollama pull llama2
docker exec ollama ollama run llama2 "Why is local AI important?"

Architecture

High-Level System Design

Local Deployment

Application → API Server (localhost:8000) → Inference Engine

Public Endpoint via GCP Load Balancer

Client → HTTPS (elevatediq.ai) → GCP LB → API Server (8000) → Inference Engine
                                   ↓
                              TLS Termination
                              Rate Limiting
                              Security Headers

Full Architecture

┌─────────────────────────────────────────────────────────┐ │ Application Layer │ │ (FastAPI, Gradio UI, CLI Tools, Custom Integrations) │ └──────────────────┬──────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────┐ │ Ollama API Gateway │ │ (Request validation, rate limiting, caching, routing) │ └──────────────────┬──────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────┐ │ Inference Engine Layer │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ LLM Worker │ │ LLM Worker │ │ LLM Worker │ │ │ │ (GPU 0) │ │ (GPU 1) │ │ (GPU N) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────┐ │ │ │ Model Cache & Context Manager │ │ │ │ (Weights, Embeddings, KV Cache) │ │ │ └──────────────────────────────────────────────────┘ │ └──────────────────┬──────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────┐ │ Storage & State Layer │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ PostgreSQL │ │ Redis Cache │ │ Vector DB │ │ │ │ (Metadata) │ │ (Sessions) │ │ (Embeddings) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ └──────────────────┬──────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────┐ │ Monitoring & Observability Layer │ │ (Prometheus, Grafana, Loki, Jaeger) │ └─────────────────────────────────────────────────────────┘


### Component Breakdown

| Component | Purpose | Technology |
|-----------|---------|-----------|
| **API Gateway** | Request routing, auth, rate limiting | FastAPI, gRPC |
| **Inference Workers** | Model execution with GPU acceleration | PyTorch, vLLM, TensorRT |
| **Model Registry** | Version control and management | Custom + Hugging Face |
| **Cache Layer** | Response and KV-cache optimization | Redis, in-memory |
| **Vector Database** | Semantic search and RAG support | Qdrant, Milvus |
| **Telemetry** | Metrics, traces, logs | Prometheus, Jaeger, Loki |
| **State Store** | Persistent metadata and conversation history | PostgreSQL |

---

## Features

### Core Capabilities
- ✅ **Multi-Model Support**: Run multiple models simultaneously with resource isolation
- ✅ **GPU Acceleration**: Automatic CUDA/Metal/ROCm detection and optimization
- ✅ **Distributed Inference**: Scale across multiple GPUs and machines
- ✅ **Model Quantization**: 4-bit, 8-bit, mixed-precision inference
- ✅ **Context Caching**: Efficient KV-cache management and reuse
- ✅ **RAG Integration**: Built-in vector database for semantic retrieval
- ✅ **Streaming Responses**: Server-sent events for real-time output
- ✅ **Batch Processing**: Efficient inference for multiple requests

### Advanced Features
- 🔒 **Air-Gapped Security**: No phone-home, full data isolation
- 📊 **Comprehensive Observability**: Prometheus metrics, distributed tracing
- 🔄 **Auto-Scaling**: Dynamic resource allocation based on load
- 🎯 **Fine-Tuning Support**: Local model adaptation with training infrastructure
- 🔐 **Multi-Tenant Isolation**: Namespace-based resource segregation
- 📦 **Model Versioning**: Content-addressed model storage with rollback
- 🚀 **Performance Profiling**: Built-in benchmarking and optimization tools

---

## Prerequisites

### Hardware Requirements

**Minimum** (for experimentation):
- GPU: 6GB VRAM (RTX 2060 or equivalent)
- CPU: 4-core modern processor
- RAM: 16GB system memory
- Storage: 100GB NVMe SSD

**Recommended** (production):
- GPU: 24GB+ VRAM (A100, RTX 4090, or enterprise GPU)
- CPU: 16+ cores, high single-thread performance
- RAM: 64GB+ system memory
- Storage: 500GB+ NVMe SSD (fast I/O critical)

### Software Requirements

```bash
# Linux (Ubuntu 22.04 LTS or RHEL 9+)
- CUDA 12.1+ OR ROCm 5.6+ (for GPU support)
- Docker 24.0+
- Docker Compose 2.20+
- Python 3.11+
- Git 2.40+

# Optional but recommended
- NVIDIA Container Toolkit (for GPU in Docker)
- Prometheus 2.40+
- Grafana 9.0+
- PostgreSQL 15+

Installation

Method 1: Docker Compose (Recommended for Production)

git clone https://github.com/kushin77/ollama.git
cd ollama

# Copy environment template
cp .env.example .env

# Configure for your environment
nano .env  # Set GPU, RAM, model paths

# Start production stack
docker-compose -f docker/docker-compose.prod.yml up -d

# Verify services
docker-compose -f docker/docker-compose.prod.yml ps
curl http://localhost:8000/health

Method 2: Local Development Installation

# Prerequisites
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements/core.txt
pip install -r requirements/dev.txt  # For development

# Initialize database
python scripts/init_db.py

# Download base models
ollama pull llama2 mistral neural-chat

# Start development server
python -m ollama.server --config config/development.yaml

Method 3: From Source (Advanced)

git clone https://github.com/kushin77/ollama.git
cd ollama

# Build Docker images
docker build -t ollama:latest -f Dockerfile.prod .
docker build -t ollama-worker:latest -f Dockerfile.worker .

# Run with custom configuration
docker-compose -f docker-compose.custom.yml up

Configuration

Public Endpoint Configuration

For elevatediq.ai/ollama deployments via GCP Load Balancer:

# config/production.yaml
server:
  public_url: "https://elevatediq.ai/ollama"
  domain: "elevatediq.ai"

security:
  api_key_auth_enabled: true
  cors_origins:
    - "https://elevatediq.ai"
    - "https://*.elevatediq.ai"
  tls_enabled: false # TLS handled by GCP LB
# .env
OLLAMA_PUBLIC_URL=https://elevatediq.ai/ollama
OLLAMA_DOMAIN=elevatediq.ai
API_KEY_AUTH_ENABLED=true
CORS_ORIGINS=["https://elevatediq.ai","https://*.elevatediq.ai"]

See docs/gcp-load-balancer.md for complete GCP configuration.

Local Development Configuration

# .env.example
OLLAMA_HOST=0.0.0.0:8000
OLLAMA_MODELS_PATH=/models
OLLAMA_CACHE_SIZE=50G
OLLAMA_GPU_MEMORY=24000  # MB

# Database
DATABASE_URL=postgresql://ollama:password@localhost:5432/ollama
REDIS_URL=redis://localhost:6379/0

# Monitoring
PROMETHEUS_ENABLED=true
JAEGER_ENABLED=true
LOG_LEVEL=INFO

# Security
API_KEY_AUTH_ENABLED=true
CORS_ORIGINS=["http://localhost:3000"]

Model Configuration (config/models.yaml)

models:
  llama2:
    source: huggingface # or 'local', 'ollama-registry'
    model_id: meta-llama/Llama-2-7b-chat
    quantization: q4_K_M # q4_K_M, q5_K_M, fp16, bf16
    context_length: 4096
    gpu_memory_reserved: 10G
    batch_size: 8
    max_concurrent: 2

  mistral:
    source: huggingface
    model_id: mistralai/Mistral-7B-Instruct-v0.1
    quantization: q5_K_M
    context_length: 32768
    gpu_memory_reserved: 12G

caching:
  enabled: true
  type: redis # or 'memory'
  ttl: 3600

performance:
  enable_paging: true
  enable_tiling: false
  prefill_batch_size: 16

Usage

CLI Usage

# List available models
ollama list

# Pull and run a model
ollama pull llama2
ollama run llama2

# Direct inference with prompts
ollama run llama2 "What are the benefits of local AI?"

# Streaming output
ollama run mistral --stream "Explain quantum computing"

# Use with template
ollama run llama2 --template "Your prompt: {text}"

# Statistics and benchmarks
ollama stats

REST API

# Health check
curl http://localhost:8000/health

# List models
curl http://localhost:8000/api/models

# Create completion (streaming)
curl -X POST http://localhost:8000/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Why is local AI important?",
    "stream": true,
    "context": []
  }'

# Chat completions (OpenAI-compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {"role": "system", "content": "You are an expert engineer"},
      {"role": "user", "content": "Explain RAG"}
    ],
    "temperature": 0.7
  }'

# Embeddings endpoint
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embedding-model",
    "input": "Generate embedding for this text"
  }'

Python Client

from ollama import Client

client = Client(base_url="http://localhost:8000")

# Simple completion
response = client.generate(
    model="llama2",
    prompt="Explain machine learning",
    stream=False
)
print(response.text)

# Chat interface
response = client.chat(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are an AI expert"},
        {"role": "user", "content": "What is RAG?"}
    ],
    temperature=0.7
)
print(response.message.content)

# Embeddings
embeddings = client.embeddings(
    model="embedding-model",
    input="Generate vector representation"
)
print(embeddings.data[0].embedding)

# Streaming
for chunk in client.generate_stream(
    model="llama2",
    prompt="Tell a story about local AI"
):
    print(chunk.response, end="", flush=True)

Model Management

Downloading Models

# From Ollama registry
ollama pull llama2
ollama pull mistral

# Specific versions/sizes
ollama pull llama2:7b-chat-q4_0
ollama pull llama2:13b-chat-fp16

# From Hugging Face
python scripts/download_model.py \
  --source huggingface \
  --model meta-llama/Llama-2-7b-chat \
  --quantization q4_K_M

# Custom models
python scripts/import_model.py \
  --path /path/to/gguf/model.gguf \
  --name custom-model

Model Versioning

# List versions
ollama list --versions

# Pin specific version
ollama pull llama2:sha256:abc123def456

# Delete old versions
ollama rm llama2:old-version

# Export for backup
ollama export llama2 > llama2-backup.tar.gz
ollama import llama2-backup.tar.gz

Fine-tuning

# Prepare dataset
python scripts/prepare_finetuning_data.py \
  --input training_data.jsonl \
  --output prepared_data

# Fine-tune model
python scripts/finetune.py \
  --model llama2 \
  --data prepared_data \
  --output-dir ./fine-tuned-models \
  --epochs 3 \
  --learning-rate 1e-4

# Merge and quantize
python scripts/merge_lora.py \
  --base llama2 \
  --lora ./fine-tuned-models/lora \
  --output custom-llama2

# Benchmark
python scripts/benchmark.py --model custom-llama2

API Reference

Endpoints

Endpoint Method Purpose
/health GET Health check
/api/models GET List available models
/api/generate POST Text generation (streaming)
/api/embedding POST Generate embeddings
/v1/chat/completions POST OpenAI-compatible chat
/v1/completions POST OpenAI-compatible completion
/v1/embeddings POST OpenAI-compatible embeddings
/admin/stats GET System metrics
/admin/reload POST Reload configuration

Authentication

# Set API key
export OLLAMA_API_KEY="your-secret-key"

# Include in requests
curl -H "Authorization: Bearer $OLLAMA_API_KEY" \
  http://localhost:8000/api/models

Monitoring & Observability

Prometheus Metrics

Access dashboard at http://localhost:9090

Key metrics:

  • ollama_request_duration_seconds: Inference latency
  • ollama_tokens_generated_total: Cumulative token count
  • ollama_model_memory_bytes: Per-model memory usage
  • ollama_gpu_utilization_percent: GPU usage
  • ollama_queue_depth: Pending requests
# Query example
curl 'http://localhost:9090/api/v1/query?query=rate(ollama_tokens_generated_total[5m])'

Grafana Dashboards

Pre-built dashboards for:

  • System resources (CPU, RAM, GPU, Disk)
  • Model performance (latency, throughput, tokens/sec)
  • Request patterns (volume, errors, queue depth)
  • Cost analysis (compute time, energy consumption)

Distributed Tracing (Jaeger)

Access at http://localhost:16686

Traces capture:

  • Complete request flow from API → model inference
  • Component latencies (cache lookups, model execution)
  • Error spans with context
  • Resource utilization per span

Logging

# View logs with filtering
docker-compose logs -f ollama-api --tail=100
docker-compose logs ollama-worker-1 | grep "ERROR"

# Structured logging export
curl http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="ollama"}'

Performance Tuning

Optimization Checklist

  • GPU: Ensure CUDA/ROCm properly initialized

    python -c "import torch; print(torch.cuda.is_available())"
  • Quantization: Use q4 for speed, q5/fp16 for quality

    # Benchmark
    python scripts/benchmark_quantization.py
  • Batch Size: Profile optimal throughput

    # config/models.yaml
    llama2:
      batch_size: 8 # Adjust based on VRAM
  • Context Caching: Enable for chat workflows

    caching:
      enabled: true
      type: redis
  • Model Pruning: Remove unused weights

    python scripts/prune_model.py --model llama2 --ratio 0.1

Benchmarking

# Comprehensive benchmark
python scripts/benchmark.py \
  --models llama2 mistral \
  --batch-sizes 1 2 4 8 \
  --prompt-lengths 100 500 1000

# Memory profiling
python -m memory_profiler scripts/inference.py

# Latency percentiles
python scripts/latency_percentiles.py --duration 3600

Security

Best Practices

# Enable authentication
export OLLAMA_API_KEY_AUTH=true
export OLLAMA_API_KEYS="key1:hash1,key2:hash2"

# TLS/HTTPS setup
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365
export OLLAMA_TLS_CERT=/path/to/cert.pem
export OLLAMA_TLS_KEY=/path/to/key.pem

# Rate limiting per API key
python scripts/setup_rate_limits.py \
  --key user-key \
  --requests-per-minute 100

# Audit logging
export OLLAMA_AUDIT_LOG=/var/log/ollama/audit.log

Model Validation

# Verify model integrity
ollama verify llama2

# Scan for vulnerabilities
python scripts/scan_model.py --model llama2

# Validate outputs
python scripts/validate_model_outputs.py \
  --model llama2 \
  --test-cases validation_suite.jsonl

Troubleshooting

Common Issues

GPU Not Detected

# Check CUDA installation
nvidia-smi

# Verify PyTorch support
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Update Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi

Out of Memory

# Check current usage
docker stats

# Reduce model quantization
ollama pull llama2:7b-chat-q4_0  # Lower quantization

# Limit batch size in config
# Set batch_size: 1 or 2

Slow Inference

# Profile bottleneck
python scripts/profile_inference.py --model llama2

# Check model is quantized
ollama list  # Look for q4/q5 suffix

# Verify GPU in use
nvidia-smi dmon -s puc

Connection Issues

# Check service is running
docker-compose ps

# Verify port availability
netstat -tulpn | grep 8000

# Check logs for errors
docker-compose logs ollama-api

Development

Setting Up Development Environment

# Clone repository
git clone https://github.com/kushin77/ollama.git
cd ollama

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements/dev.txt
pip install -e .  # Install in editable mode

# Run tests
pytest tests/ -v --cov=ollama

# Format code
black ollama/ tests/
isort ollama/ tests/
ruff check ollama/ tests/

# Type checking
mypy ollama/ --strict

# Run linter
pylint ollama/

Project Structure

ollama/
├── .copilot-instructions      # Elite development instructions
├── .github/
│   └── workflows/             # CI/CD pipelines
├── ollama/
│   ├── api/                   # FastAPI server and routes
│   ├── inference/             # Model execution engine
│   ├── models/                # Model management
│   ├── cache/                 # Caching layer
│   ├── embeddings/            # Embedding generation
│   ├── rag/                   # RAG infrastructure
│   ├── monitoring/            # Observability
│   ├── security/              # Authentication, validation
│   └── utils/                 # Shared utilities
├── scripts/
│   ├── bootstrap.sh           # Setup script
│   ├── download_model.py      # Model downloading
│   ├── benchmark.py           # Performance testing
│   └── ...                    # Utility scripts
├── config/
│   ├── development.yaml       # Dev configuration
│   ├── production.yaml        # Production configuration
│   └── models.yaml            # Model definitions
├── docker/
│   ├── Dockerfile             # Main image
│   ├── Dockerfile.worker      # Worker image
│   └── docker-compose.yml     # Local development
├── tests/
│   ├── unit/                  # Unit tests
│   ├── integration/           # Integration tests
│   └── e2e/                   # End-to-end tests
├── docs/
│   ├── architecture.md        # System design
│   ├── api.md                 # API documentation
│   └── deployment.md          # Deployment guide
├── requirements/
│   ├── core.txt               # Production dependencies
│   ├── dev.txt                # Development dependencies
│   └── test.txt               # Testing dependencies
└── README.md                  # This file

Running Tests

# All tests
pytest

# Specific test file
pytest tests/unit/test_inference.py -v

# With coverage
pytest --cov=ollama --cov-report=html

# Only failed tests from last run
pytest --lf

# With output
pytest -s -vv tests/integration/

Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/your-feature
  3. Commit atomically: git commit -S -m "feat: add new feature"
  4. Push to branch: git push origin feature/your-feature
  5. Open pull request with clear description

See CONTRIBUTING.md for detailed guidelines.


Performance Benchmarks

Latency (p99, seconds)

Model Quantization Batch=1 Batch=8 Tokens/sec
Llama2 7B q4_K_M 0.85 2.4 180
Llama2 13B q5_K_M 1.2 3.8 120
Mistral 7B q4_K_M 0.72 2.1 200
Neural Chat q4_K_M 0.65 1.9 220

Benchmarks on NVIDIA RTX 4090, Ubuntu 22.04, CUDA 12.1


Roadmap

Q1 2026

  • Multi-GPU distributed inference
  • Optimized attention mechanisms (FlashAttention-3)
  • Enhanced RAG with re-ranking

Q2 2026

  • Fine-tuning infrastructure (LoRA, QLoRA)
  • Model marketplace integration
  • Kubernetes deployment support

Q3 2026

  • Multimodal model support (vision + text)
  • Advanced caching strategies (prefix caching)
  • Cost optimization tools

Support & Community


License

MIT License - See LICENSE for details


Citation

@software{ollama2026,
  author = {Kushin, A.},
  title = {Ollama: Elite Local AI Development Platform},
  url = {https://github.com/kushin77/ollama},
  year = {2026},
  note = {Version 1.0.0}
}

Last Updated: January 12, 2026 Version: 1.0.0 Maintainer: @kushin77


Stats

GitHub Stars GitHub Forks Last Commit

About

Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.

Resources

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Python 64.4%
  • Shell 17.3%
  • HCL 9.1%
  • Go 4.3%
  • JavaScript 1.7%
  • TypeScript 1.2%
  • Other 2.0%