Documentation and concepts for running LLMs locally on CPU hardware, plus a ready-to-use Docker/Podman deployment.
- I want to run the local CPU setup: see deployments/README.md
- I want to understand the stack layers: see software-stack.md
- I want to understand why first-token latency is high on CPU: see prefill.md
If you are unsure where to begin, start with the deployment guide and use the default Docker Compose path.
- you want to run a local coding-agent stack on a CPU-only machine
- you have about 32 GB RAM and around 20 GB free disk
- you accept slow first-token latency in exchange for a fully local setup
- you want a working deployment first, with architecture docs available afterward
Get Gemma 4 26B running locally in three commands (requires Docker or Podman, ~18 GB disk, recommended 32 GB RAM):
cd deployments
bash scripts/setup.sh # one-time: pulls model (~18 GB), builds images
bash omp/scripts/start-omp.sh # launch the interactive agentic harness

See deployments/README.md for the default setup path, alternative Podman methods, config files, and Windows scripts.
- long prefill on CPU
- ~20 GB RAM usage at runtime by Ollama (model + KV cache)
- ~1-2 tokens/sec on CPU (Gemma 4 26B MoE)
- System Prompt — role definition, constraints, output format; the highest-leverage input
- Tools — function schemas the model reads to decide when and how to call external actions
- Delimiter Tokens — chat template tokens (<|system|>, <|user|>, etc.) that separate prompt roles
- Modelfile — per-model sampling params and context window config (Ollama-specific)
- Prefill — the input-processing phase before the first output token; why it's slow on CPU
The local AI hosting stack is composed of several layers, each responsible for a specific part of the system:
- Web UI / API Gateway Layer: Provides a user interface and API endpoints for interacting with the AI models.
- Inference Engine Layer: Loads models, runs GPU/CPU computations, and streams tokens.
- Hardware Layer: The underlying hardware that executes the computations, such as NVIDIA/AMD GPUs, Apple Silicon, or CPUs.
More info: Software Stack
Handles user input, interfaces with tools, and runs the agentic reasoning loop. It can be implemented in various languages (Python, Node.js, Go, Rust, etc.) and can run on the same machine as the inference engine or on a different one.
Examples:
- VS Code + Continue — IDE extension with chat, autocomplete, and tool use
- OpenCode — terminal-native agentic coding assistant
- Claude Code — Anthropic's terminal agent
- oh-my-pi (omp) — the harness used in the CPU deployment
The model never sees raw user messages directly. The harness assembles the entire context window before every inference call. The structure sent to the inference engine looks like this:
┌─────────────────────────────────────────────────────────────┐
│ System Prompt │
│ ├── Persona / Role definition │
│ ├── Behavioral constraints & tone │
│ ├── Output format instructions │
│ ├── Injected skills (per-task modules) │
│ └── Available context summary (e.g. repo, user profile) │
├─────────────────────────────────────────────────────────────┤
│ Tool / Function Definitions (JSON schema) │
│ ├── Tool name │
│ ├── Tool description ← model reads this to decide usage │
│ └── Parameter schema with per-parameter descriptions │
├─────────────────────────────────────────────────────────────┤
│ Conversation History (prior turns, token-budgeted) │
├─────────────────────────────────────────────────────────────┤
│ Retrieved Context (RAG chunks, file contents, memories) │
├─────────────────────────────────────────────────────────────┤
│ User Message │
└─────────────────────────────────────────────────────────────┘
Each section competes for the same fixed context window. The harness decides what to include, how much space to allocate, and how to order it.
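As a rough sketch of what that assembly can look like in harness code (not the omp implementation; the section budgets and the word-count token estimate are illustrative assumptions):

```python
# Minimal sketch of harness-side context assembly. Token counts are
# approximated by word count; a real harness would use the model's
# tokenizer and tuned budgets.
def assemble_request(system_prompt, tool_schemas, history, retrieved_chunks,
                     user_msg, history_budget=2000, rag_budget=1500):
    approx_tokens = lambda text: len(text.split())

    # Drop the oldest turns until history fits its budget.
    trimmed = list(history)
    while trimmed and sum(approx_tokens(m["content"]) for m in trimmed) > history_budget:
        trimmed.pop(0)

    # Keep retrieved chunks only while they fit their budget.
    kept, used = [], 0
    for chunk in retrieved_chunks:
        if used + approx_tokens(chunk) > rag_budget:
            break
        kept.append(chunk)
        used += approx_tokens(chunk)

    messages = [{"role": "system", "content": system_prompt}]
    messages += trimmed
    if kept:
        messages.append({"role": "user",
                         "content": "Retrieved context:\n" + "\n---\n".join(kept)})
    messages.append({"role": "user", "content": user_msg})

    # Tool schemas ride alongside the messages in the request body.
    return {"messages": messages, "tools": tool_schemas}
```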
The system prompt is the single highest-leverage input. The model has no memory between calls; everything it "knows about its job" must be restated here on every turn.
Anatomy of a well-structured system prompt:
## Role
You are a senior backend engineer assistant embedded in a code editor.
Your job is to help the user write, review, and refactor Go and Python code.
## Constraints
- Only suggest changes to the files you are explicitly shown.
- Do not generate code outside the language the user is working in.
- When unsure, ask a clarifying question before writing code.
## Output Format
- Respond in plain prose unless the user asks for code.
- Wrap all code blocks with language-tagged fences.
- Keep answers concise — one paragraph max unless more depth is requested.
## Context
Working repo: {{repo_name}}. Primary language: {{primary_lang}}.
User's skill level: {{skill_level}}.
The harness fills in {{...}} template variables at runtime, making the system prompt dynamic per session or per user.
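A minimal way to do that substitution, assuming plain {{...}} placeholders and an illustrative config path:

```python
import re

def render_system_prompt(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders; unknown variables are left untouched."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables.get(m.group(1), m.group(0))),
                  template)

# Example values only; a real harness would read these from its session state.
prompt = render_system_prompt(
    open("configs/base_system_prompt.md").read(),
    {"repo_name": "my-service", "primary_lang": "Go", "skill_level": "intermediate"},
)
```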
The model decides whether to call a tool, and how, entirely from reading the tool's name and description. This is not documentation for humans; it is model-readable instruction.
Bad tool description:
{
  "name": "search",
  "description": "Search for things."
}

Good tool description:
{
  "name": "search_codebase",
  "description": "Search the current repository for files, functions, or symbols matching a query. Use this when the user asks about existing code, asks you to find something, or before writing code that might duplicate something that already exists. Do NOT use this for web searches.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Natural language description of what to find. Include relevant symbol names, file types, or concepts."
    }
  }
}

Rules for tool descriptions:
- Describe when to use it, not just what it does.
- Describe when NOT to use it if there's a common confusion risk.
- Write parameter descriptions as if coaching a junior developer.
- Keep names verb-first and specific: search_codebase, create_file, run_tests — not tool1, utility, action.
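These descriptions only matter because the harness ships them with every request and then executes whatever call the model returns. A sketch of that round trip against an OpenAI-compatible endpoint follows; the URL, model name, and the local search_codebase stub are assumptions for illustration:

```python
import json, requests

API = "http://localhost:11434/v1/chat/completions"   # assumed local Ollama endpoint
TOOLS = [{"type": "function", "function": {
    "name": "search_codebase",
    "description": "Search the current repository for files, functions, or symbols "
                   "matching a query. Do NOT use this for web searches.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string",
                                            "description": "What to find."}},
                   "required": ["query"]}}}]

def search_codebase(query: str) -> str:
    return f"3 matches for {query!r} under ./internal/"   # stand-in for the real tool

messages = [{"role": "user", "content": "Where is the retry logic defined?"}]
reply = requests.post(API, json={"model": "local-model", "messages": messages,
                                 "tools": TOOLS}).json()["choices"][0]["message"]

if reply.get("tool_calls"):                 # the model chose to call a tool
    messages.append(reply)
    for call in reply["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": search_codebase(**args)})
    # A second call lets the model turn the tool output into a final answer.
    reply = requests.post(API, json={"model": "local-model", "messages": messages,
                                     "tools": TOOLS}).json()["choices"][0]["message"]

print(reply["content"])
```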
A skill is a reusable block of instruction injected into the system prompt conditionally based on the task context. Instead of one giant static system prompt, the harness selects which skills are relevant and composes them at call time.
Example skills:
- skill_code_review — instructs the model to check for security, readability, test coverage
- skill_sql_expert — adds SQL-specific rules, dialect awareness, query plan reasoning
- skill_git_workflow — explains branch conventions, commit message format, PR etiquette
- skill_explain_to_junior — adjusts tone and depth when the user signals they are learning
The harness selects skills based on:
- The active file type / language
- A detected intent from the user's message
- An explicit user-set persona or mode
- The tools currently enabled
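A minimal sketch of that selection logic, assuming file-extension heuristics and the configs/skills/ layout described later in this document:

```python
from pathlib import Path

# Example heuristics only; a real harness would also look at intent and mode.
SKILLS_BY_EXTENSION = {
    ".go":  ["code_review"],
    ".sql": ["sql_expert"],
    ".md":  ["explain_to_junior"],
}

def compose_system_prompt(base_prompt: str, active_file: str,
                          skills_dir: str = "configs/skills") -> str:
    selected = SKILLS_BY_EXTENSION.get(Path(active_file).suffix, [])
    parts = [base_prompt]
    for skill in selected:
        parts.append(Path(skills_dir, f"{skill}.md").read_text())
    return "\n\n".join(parts)
```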
Models are trained on data where special tokens mark role boundaries (<|system|>, <|user|>, <|assistant|>, <|tool_call|>, etc.). The harness must use the correct chat template for the loaded model — getting this wrong silently degrades output quality even if the text looks correct to a human.
Each model family has its own template format:
- ChatML (Qwen, Mistral Small, many fine-tunes): <|im_start|>system ... <|im_end|>
- Llama 3: <|begin_of_text|><|start_header_id|>system<|end_header_id|> ...
- Gemma: <start_of_turn>user\n...
Ollama and vLLM apply the correct chat template automatically from the model's tokenizer_config.json. When calling the raw /v1/chat/completions API, the inference engine handles template application — but if you call /v1/completions (raw text), you must apply the template yourself.
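A sketch of the difference, assuming a local Ollama endpoint on its default port, a placeholder model name, and Gemma-style turn tokens purely as an example:

```python
import requests

BASE = "http://localhost:11434"   # assumed Ollama default port

# 1) Chat endpoint: send role-tagged messages, the engine applies the template.
chat = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain what prefill is."},
    ],
}).json()

# 2) Raw completion endpoint: the template is your responsibility.
raw_prompt = (
    "<start_of_turn>user\n"
    "You are a concise coding assistant.\n\n"
    "Explain what prefill is.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
raw = requests.post(f"{BASE}/v1/completions", json={
    "model": "local-model",
    "prompt": raw_prompt,
}).json()
```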
Different use cases need fundamentally different prompt strategies. Do not use a single generic system prompt across all tasks — it either bloats the context or leaves the model without the constraints it needs.
| Use Case | System Prompt Focus | Key Tools | Context Budget Strategy |
|---|---|---|---|
| Code assistant | Language rules, repo conventions, diff format | search_codebase, read_file, create_file, run_tests | Maximize file context; minimize history |
| Research / Q&A | Citation discipline, uncertainty acknowledgment | web_search, retrieve_document | Maximize retrieved chunks; keep history short |
| Data analysis | Step-by-step reasoning, output as tables/charts | run_python, query_database | Maximize result context; include schema |
| DevOps / Infra | Safety-first: describe before executing, dry-run default | run_command, read_file, write_file | Include current system state; no stale info |
| Writing assistant | Tone matching, length discipline, no hallucination | minimal | Maximize document context |
| Multi-agent orchestrator | Delegation rules, result synthesis, loop termination | spawn_agent, call_agent, collect_results | Keep coordinator context small; agents handle detail |
The context window is finite. Every token spent on boilerplate is a token not available for code, documents, or history. Strategies:
- Front-load critical instructions. In long contexts, models attend to content unevenly and often lose what sits deep in the prompt; place the most important constraints at the top of the system prompt, not the bottom.
- Compress history. Summarize old turns into a rolling summary instead of keeping every raw message. The harness manages this, not the model.
- RAG threshold. Only inject retrieved chunks when retrieval confidence exceeds a threshold. Noisy chunks hurt more than no chunks.
- Tool thinning. Only include tools that are relevant to the current task type. A writing assistant doesn't need run_tests. More tools = more tokens consumed in every call + higher chance of spurious tool calls.
- Lazy skill injection. Skip skill modules that don't apply. A Go file open → inject skill_go + skill_code_review. A markdown file → inject skill_writing. Never inject all skills at once.
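Several of these strategies reduce to small harness-side functions. For instance, the history-compression strategy above might look like the sketch below; the 8-turn threshold and the summarizer callback are illustrative assumptions:

```python
def compress_history(messages, summarize, keep_recent=4, max_turns=8):
    """Replace older turns with one rolling summary once history grows too long."""
    if len(messages) <= max_turns:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))  # e.g. a cheap model call
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```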
A practical approach to building per-use-case configurations:
configs/
├── base_system_prompt.md # shared role + constraints
├── skills/
│ ├── code_review.md
│ ├── sql_expert.md
│ ├── explain_to_junior.md
│ └── devops_safety.md
├── tools/
│ ├── coding_tools.json # search_codebase, read_file, run_tests
│ ├── research_tools.json # web_search, retrieve_document
│ └── infra_tools.json # run_command, read_file, write_file
└── use_cases/
├── code_assistant.yaml # base + [code_review] + coding_tools
├── data_analysis.yaml # base + [sql_expert] + research_tools
└── devops.yaml # base + [devops_safety] + infra_tools
Each use case config declares:
- Which skills to inject
- Which tool set to load
- Any runtime template variables (repo name, language, user level)
- Context window budget allocations (history turns, RAG chunk count, file content limits)
The harness reads the active use case at startup (or on mode switch), composes the system prompt, loads the tool schemas, and applies the template before every inference call.
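A sketch of that startup composition, assuming the configs/ layout above, a PyYAML dependency, and illustrative key names (skills, tool_set, budgets) inside the use-case file:

```python
import json
from pathlib import Path

import yaml   # PyYAML; an assumed dependency for this sketch

def load_use_case(name: str, root: str = "configs") -> dict:
    cfg = yaml.safe_load(Path(root, "use_cases", f"{name}.yaml").read_text())
    parts = [Path(root, "base_system_prompt.md").read_text()]
    for skill in cfg.get("skills", []):
        parts.append(Path(root, "skills", f"{skill}.md").read_text())
    tools = json.loads(Path(root, "tools", cfg["tool_set"]).read_text())
    return {
        # Template variables ({{repo_name}}, etc.) would be rendered here,
        # e.g. with the render_system_prompt sketch shown earlier.
        "system_prompt": "\n\n".join(parts),
        "tools": tools,
        "budgets": cfg.get("budgets", {}),
    }

active = load_use_case("code_assistant")
```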