Documentation and concepts for running LLMs locally on CPU hardware, plus a ready-to-use Docker/Podman deployment.
- I want to run the local CPU setup: see deployments/README.md
- I want to understand the stack layers: see software-stack.md
- I want to understand why first-token latency is high on CPU: see prefill.md
If you are unsure where to begin, start with the deployment guide and use the default Docker Compose path.
- you want to run a local coding-agent stack on a CPU-only machine
- you have about 32 GB RAM and around 20 GB free disk
- you accept slow first-token latency in exchange for a fully local setup
- you want a working deployment first, with architecture docs available afterward
Get Gemma 4 26B running locally in three commands (requires Docker or Podman, ~18 GB disk, recommended 32 GB RAM):
cd deployments
bash scripts/setup.sh # one-time: pulls model (~18 GB), builds images
bash omp/scripts/start-omp.sh # launch the interactive agentic harness

See deployments/README.md for the default setup path, alternative Podman methods, config files, and Windows scripts.
- long prefill on CPU
- ~20 GB RAM usage at runtime by Ollama (model + KV cache)
- ~1-2 tokens/sec on CPU (Gemma 4 26B MoE)
- System Prompt — role definition, constraints, output format; the highest-leverage input
- Tools — function schemas the model reads to decide when and how to call external actions
- Delimiter Tokens — chat template tokens (<|system|>, <|user|>, etc.) that separate prompt roles
- Modelfile — per-model sampling params and context window config (Ollama-specific)
- Prefill — the input-processing phase before the first output token; why it's slow on CPU
The local AI hosting stack is composed of several layers, each responsible for a specific part of the system:
- Web UI / API Gateway Layer: Provides a user interface and API endpoints for interacting with the AI models.
- Inference Engine Layer: Loads models, runs GPU/CPU computations, and streams tokens.
- Hardware Layer: The underlying hardware that executes the computations, such as NVIDIA/AMD GPUs, Apple Silicon, or CPUs.
More info: Software Stack
Handles user input, interfaces with tools, and runs the agentic reasoning loop. It can be implemented in various languages (Python, Node.js, Go, Rust, etc.) and can run on the same machine as the inference engine or on a different one.
Examples:
- VS Code + Continue — IDE extension with chat, autocomplete, and tool use
- OpenCode — terminal-native agentic coding assistant
- Claude Code — Anthropic's terminal agent
- oh-my-pi (omp) — the harness used in the CPU deployment
The model never sees raw user messages directly. The harness assembles the entire context window before every inference call. The structure sent to the inference engine looks like this:
┌─────────────────────────────────────────────────────────────┐
│ System Prompt │
│ ├── Persona / Role definition │
│ ├── Behavioral constraints & tone │
│ ├── Output format instructions │
│ ├── Injected skills (per-task modules) │
│ └── Available context summary (e.g. repo, user profile) │
├─────────────────────────────────────────────────────────────┤
│ Tool / Function Definitions (JSON schema) │
│ ├── Tool name │
│ ├── Tool description ← model reads this to decide usage │
│ └── Parameter schema with per-parameter descriptions │
├─────────────────────────────────────────────────────────────┤
│ Conversation History (prior turns, token-budgeted) │
├─────────────────────────────────────────────────────────────┤
│ Retrieved Context (RAG chunks, file contents, memories) │
├─────────────────────────────────────────────────────────────┤
│ User Message │
└─────────────────────────────────────────────────────────────┘
Each section competes for the same fixed context window. The harness decides what to include, how much space to allocate, and how to order it.
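As a rough sketch of what that assembly can look like in harness code (not the omp implementation; the section budgets and the word-count token estimate are illustrative assumptions):

```python
# Minimal sketch of harness-side context assembly. Token counts are
# approximated by word count; a real harness would use the model's
# tokenizer and tuned budgets.
def assemble_request(system_prompt, tool_schemas, history, retrieved_chunks,
                     user_msg, history_budget=2000, rag_budget=1500):
    approx_tokens = lambda text: len(text.split())

    # Drop the oldest turns until history fits its budget.
    trimmed = list(history)
    while trimmed and sum(approx_tokens(m["content"]) for m in trimmed) > history_budget:
        trimmed.pop(0)

    # Keep retrieved chunks only while they fit their budget.
    kept, used = [], 0
    for chunk in retrieved_chunks:
        if used + approx_tokens(chunk) > rag_budget:
            break
        kept.append(chunk)
        used += approx_tokens(chunk)

    messages = [{"role": "system", "content": system_prompt}]
    messages += trimmed
    if kept:
        messages.append({"role": "user",
                         "content": "Retrieved context:\n" + "\n---\n".join(kept)})
    messages.append({"role": "user", "content": user_msg})

    # Tool schemas ride alongside the messages in the request body.
    return {"messages": messages, "tools": tool_schemas}
```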
The system prompt is the single highest-leverage input. The model has no memory between calls; everything it "knows about its job" must be restated here on every turn.
Anatomy of a well-structured system prompt:
## Role
You are a senior backend engineer assistant embedded in a code editor.
Your job is to help the user write, review, and refactor Go and Python code.
## Constraints
- Only suggest changes to the files you are explicitly shown.
- Do not generate code outside the language the user is working in.
- When unsure, ask a clarifying question before writing code.
## Output Format
- Respond in plain prose unless the user asks for code.
- Wrap all code blocks with language-tagged fences.
- Keep answers concise — one paragraph max unless more depth is requested.
## Context
Working repo: {{repo_name}}. Primary language: {{primary_lang}}.
User's skill level: {{skill_level}}.
The harness fills in {{...}} template variables at runtime, making the system prompt dynamic per session or per user.
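A minimal way to do that substitution, assuming plain {{...}} placeholders and an illustrative config path:

```python
import re

def render_system_prompt(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders; unknown variables are left untouched."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables.get(m.group(1), m.group(0))),
                  template)

# Example values only; a real harness would read these from its session state.
prompt = render_system_prompt(
    open("configs/base_system_prompt.md").read(),
    {"repo_name": "my-service", "primary_lang": "Go", "skill_level": "intermediate"},
)
```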
The model decides whether to call a tool, and how, entirely from reading the tool's name and description. This is not documentation for humans; it is model-readable instruction.
Bad tool description:
{
  "name": "search",
  "description": "Search for things."
}

Good tool description:
{
  "name": "search_codebase",
  "description": "Search the current repository for files, functions, or symbols matching a query. Use this when the user asks about existing code, asks you to find something, or before writing code that might duplicate something that already exists. Do NOT use this for web searches.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Natural language description of what to find. Include relevant symbol names, file types, or concepts."
    }
  }
}

Rules for tool descriptions:
- Describe when to use it, not just what it does.
- Describe when NOT to use it if there's a common confusion risk.
- Write parameter descriptions as if coaching a junior developer.
- Keep names verb-first and specific: search_codebase, create_file, run_tests — not tool1, utility, action.
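These descriptions only matter because the harness ships them with every request and then executes whatever call the model returns. A sketch of that round trip against an OpenAI-compatible endpoint follows; the URL, model name, and the local search_codebase stub are assumptions for illustration:

```python
import json, requests

API = "http://localhost:11434/v1/chat/completions"   # assumed local Ollama endpoint
TOOLS = [{"type": "function", "function": {
    "name": "search_codebase",
    "description": "Search the current repository for files, functions, or symbols "
                   "matching a query. Do NOT use this for web searches.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string",
                                            "description": "What to find."}},
                   "required": ["query"]}}}]

def search_codebase(query: str) -> str:
    return f"3 matches for {query!r} under ./internal/"   # stand-in for the real tool

messages = [{"role": "user", "content": "Where is the retry logic defined?"}]
reply = requests.post(API, json={"model": "local-model", "messages": messages,
                                 "tools": TOOLS}).json()["choices"][0]["message"]

if reply.get("tool_calls"):                 # the model chose to call a tool
    messages.append(reply)
    for call in reply["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": search_codebase(**args)})
    # A second call lets the model turn the tool output into a final answer.
    reply = requests.post(API, json={"model": "local-model", "messages": messages,
                                     "tools": TOOLS}).json()["choices"][0]["message"]

print(reply["content"])
```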
A skill is a reusable block of instruction injected into the system prompt conditionally based on the task context. Instead of one giant static system prompt, the harness selects which skills are relevant and composes them at call time.
Example skills:
- skill_code_review — instructs the model to check for security, readability, test coverage
- skill_sql_expert — adds SQL-specific rules, dialect awareness, query plan reasoning
- skill_git_workflow — explains branch conventions, commit message format, PR etiquette
- skill_explain_to_junior — adjusts tone and depth when the user signals they are learning
The harness selects skills based on:
- The active file type / language
- A detected intent from the user's message
- An explicit user-set persona or mode
- The tools currently enabled
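A minimal sketch of that selection logic, assuming file-extension heuristics and the configs/skills/ layout described later in this document:

```python
from pathlib import Path

# Example heuristics only; a real harness would also look at intent and mode.
SKILLS_BY_EXTENSION = {
    ".go":  ["code_review"],
    ".sql": ["sql_expert"],
    ".md":  ["explain_to_junior"],
}

def compose_system_prompt(base_prompt: str, active_file: str,
                          skills_dir: str = "configs/skills") -> str:
    selected = SKILLS_BY_EXTENSION.get(Path(active_file).suffix, [])
    parts = [base_prompt]
    for skill in selected:
        parts.append(Path(skills_dir, f"{skill}.md").read_text())
    return "\n\n".join(parts)
```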
Models are trained on data where special tokens mark role boundaries (<|system|>, <|user|>, <|assistant|>, <|tool_call|>, etc.). The harness must use the correct chat template for the loaded model — getting this wrong silently degrades output quality even if the text looks correct to a human.
Each model family has its own template format:
- ChatML (Qwen, Mistral Small, many fine-tunes): <|im_start|>system ... <|im_end|>
- Llama 3: <|begin_of_text|><|start_header_id|>system<|end_header_id|> ...
- Gemma: <start_of_turn>user\n...
Ollama and vLLM apply the correct chat template automatically from the model's tokenizer_config.json. When calling the raw /v1/chat/completions API, the inference engine handles template application — but if you call /v1/completions (raw text), you must apply the template yourself.
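A sketch of the difference, assuming a local Ollama endpoint on its default port, a placeholder model name, and Gemma-style turn tokens purely as an example:

```python
import requests

BASE = "http://localhost:11434"   # assumed Ollama default port

# 1) Chat endpoint: send role-tagged messages, the engine applies the template.
chat = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain what prefill is."},
    ],
}).json()

# 2) Raw completion endpoint: the template is your responsibility.
raw_prompt = (
    "<start_of_turn>user\n"
    "You are a concise coding assistant.\n\n"
    "Explain what prefill is.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
raw = requests.post(f"{BASE}/v1/completions", json={
    "model": "local-model",
    "prompt": raw_prompt,
}).json()
```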
Different use cases need fundamentally different prompt strategies. Do not use a single generic system prompt across all tasks — it either bloats the context or leaves the model without the constraints it needs.
| Use Case | System Prompt Focus | Key Tools | Context Budget Strategy |
|---|---|---|---|
| Code assistant | Language rules, repo conventions, diff format | search_codebase, read_file, create_file, run_tests | Maximize file context; minimize history |
| Research / Q&A | Citation discipline, uncertainty acknowledgment | web_search, retrieve_document | Maximize retrieved chunks; keep history short |
| Data analysis | Step-by-step reasoning, output as tables/charts | run_python, query_database | Maximize result context; include schema |
| DevOps / Infra | Safety-first: describe before executing, dry-run default | run_command, read_file, write_file | Include current system state; no stale info |
| Writing assistant | Tone matching, length discipline, no hallucination | minimal | Maximize document context |
| Multi-agent orchestrator | Delegation rules, result synthesis, loop termination | spawn_agent, call_agent, collect_results | Keep coordinator context small; agents handle detail |
The context window is finite. Every token spent on boilerplate is a token not available for code, documents, or history. Strategies:
- Front-load critical instructions. In long contexts, models attend to content unevenly and often lose what sits deep in the prompt; place the most important constraints at the top of the system prompt, not the bottom.
- Compress history. Summarize old turns into a rolling summary instead of keeping every raw message. The harness manages this, not the model.
- RAG threshold. Only inject retrieved chunks when retrieval confidence exceeds a threshold. Noisy chunks hurt more than no chunks.
- Tool thinning. Only include tools that are relevant to the current task type. A writing assistant doesn't need run_tests. More tools = more tokens consumed in every call + higher chance of spurious tool calls.
- Lazy skill injection. Skip skill modules that don't apply. A Go file open → inject skill_go + skill_code_review. A markdown file → inject skill_writing. Never inject all skills at once.
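Several of these strategies reduce to small harness-side functions. For instance, the history-compression strategy above might look like the sketch below; the 8-turn threshold and the summarizer callback are illustrative assumptions:

```python
def compress_history(messages, summarize, keep_recent=4, max_turns=8):
    """Replace older turns with one rolling summary once history grows too long."""
    if len(messages) <= max_turns:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))  # e.g. a cheap model call
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```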
A practical approach to building per-use-case configurations:
configs/
├── base_system_prompt.md # shared role + constraints
├── skills/
│ ├── code_review.md
│ ├── sql_expert.md
│ ├── explain_to_junior.md
│ └── devops_safety.md
├── tools/
│ ├── coding_tools.json # search_codebase, read_file, run_tests
│ ├── research_tools.json # web_search, retrieve_document
│ └── infra_tools.json # run_command, read_file, write_file
└── use_cases/
├── code_assistant.yaml # base + [code_review] + coding_tools
├── data_analysis.yaml # base + [sql_expert] + research_tools
└── devops.yaml # base + [devops_safety] + infra_tools
Each use case config declares:
- Which skills to inject
- Which tool set to load
- Any runtime template variables (repo name, language, user level)
- Context window budget allocations (history turns, RAG chunk count, file content limits)
The harness reads the active use case at startup (or on mode switch), composes the system prompt, loads the tool schemas, and applies the template before every inference call.
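A sketch of that startup composition, assuming the configs/ layout above, a PyYAML dependency, and illustrative key names (skills, tool_set, budgets) inside the use-case file:

```python
import json
from pathlib import Path

import yaml   # PyYAML; an assumed dependency for this sketch

def load_use_case(name: str, root: str = "configs") -> dict:
    cfg = yaml.safe_load(Path(root, "use_cases", f"{name}.yaml").read_text())
    parts = [Path(root, "base_system_prompt.md").read_text()]
    for skill in cfg.get("skills", []):
        parts.append(Path(root, "skills", f"{skill}.md").read_text())
    tools = json.loads(Path(root, "tools", cfg["tool_set"]).read_text())
    return {
        # Template variables ({{repo_name}}, etc.) would be rendered here,
        # e.g. with the render_system_prompt sketch shown earlier.
        "system_prompt": "\n\n".join(parts),
        "tools": tools,
        "budgets": cfg.get("budgets", {}),
    }

active = load_use_case("code_assistant")
```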