bubbuild · PsiACE · May 18, 2026 · May 17, 2026 · May 18, 2026
diff --git a/website/src/content/docs/docs/tutorials/index.mdx b/website/src/content/docs/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ Tutorials are hands-on lessons. Use this section when you want to learn a workfl
 1. [Observe Bub with tapes and Jaeger](/docs/tutorials/observability/) — inspect Bub's own tape first, then export Logfire/OpenTelemetry traces to Jaeger.
 2. [Connect MCP Servers with bub-mcp](/docs/tutorials/mcp/) — install the MCP plugin, wire up a time server, and call it from a Bub turn.
 3. [Persist tapes in SQLAlchemy with SQLite](/docs/tutorials/tapestore-sqlalchemy/) — replace the file-based tape store with a local SQLite database.
+4. [Run Bub with a local llama.cpp model](/docs/tutorials/local-llama-cpp/) — expose a GGUF Gemma model as a local OpenAI-compatible endpoint.
 
 ## Next steps
 

diff --git a/website/src/content/docs/docs/tutorials/local-llama-cpp.mdx b/website/src/content/docs/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
+---
+title: Run Bub with a local llama.cpp model
+description: Start a local llama.cpp server and configure Bub to use it as an OpenAI-compatible model provider.
+sidebar:
+  order: 3
+---
+
+This tutorial shows how to run Bub against a local `llama.cpp` server. By the end, Bub will send model calls to a GGUF Gemma model running on your machine instead of a hosted API.
+
+Use this path when you want a local model for development, private experiments, offline demos, or latency-sensitive tasks near your application. This tutorial does not cover model benchmarking, fine-tuning, production hardening, or choosing the best model for every workload.
+
+The example uses [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF), a GGUF build of Google's Gemma 4 E2B instruction-tuned model. Google's [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) describes E2B and E4B as efficient models for mobile and edge devices, and the [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) documents capabilities, limits, and responsible-use considerations.
+
+## Before you begin
+
+You need:
+
+- Bub installed and runnable with `uv run bub --help`.
+- Docker installed.
+- A GGUF model file under `~/.cache/llama.cpp/`.
+- Enough system memory for the quantization you choose. The Q8_0 Gemma 4 E2B GGUF file is about 5 GB on disk; runtime memory also depends on context size, batching, and GPU offload.
+
+This tutorial uses these file names:
+
+```text
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+If your files use different names, update the `-m` and `--mmproj` paths in the Docker command.
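+
+Before starting the container, it can help to confirm that both files are where the Docker command expects them. The sketch below is one way to do that; `check_gguf_files` is a hypothetical helper name introduced here, not part of Bub or llama.cpp.
+
+```bash
+# Hypothetical helper: report which expected GGUF files are missing.
+# Usage: check_gguf_files <directory> <file name>...
+check_gguf_files() {
+  local dir="$1" name missing=0
+  shift
+  for name in "$@"; do
+    if [ ! -f "$dir/$name" ]; then
+      echo "missing: $dir/$name" >&2
+      missing=1
+    fi
+  done
+  # Returns 0 when every file exists, 1 otherwise.
+  return "$missing"
+}
+
+# Example call with the file names used in this tutorial:
+# check_gguf_files "$HOME/.cache/llama.cpp" \
+#   ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
+#   ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```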
+
+## 1. Start the local server
+
+Set an API key for the local server. The command keeps an existing `LLAMA_API_KEY` and falls back to `test` if none is set:
+
+```bash
+export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
+```
+
+Start `llama-server`:
+
+```bash
+sudo docker run --rm -it \
+  --security-opt label=disable \
+  -p 127.0.0.1:8080:8080 \
+  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
+  ghcr.io/ggml-org/llama.cpp:full \
+  --server \
+  --host 0.0.0.0 \
+  --port 8080 \
+  --api-key "$LLAMA_API_KEY" \
+  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
+  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
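+The Docker port is bound to `127.0.0.1`, so the server is reachable only from the local machine. Inside the container, `--host 0.0.0.0` makes `llama-server` listen on all container interfaces, which the published port needs in order to reach it. Change the `-p` binding only if you intentionally want another machine to reach the server.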
+
+On SELinux systems, `--security-opt label=disable` avoids bind-mount permission failures when the container reads model files from `~/.cache/llama.cpp`. If you only need text input, remove the `--mmproj` line.
+
+If the container logs `no ROCm-capable device is detected`, it can still fall back to CPU inference. That is enough to test the integration, but responses will be slower.
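+
+Loading the model can take a while, so a script that starts the server may want to wait until it is ready. The sketch below polls the `/health` endpoint that `llama-server` provides for readiness checks. `wait_for_llama` is a hypothetical helper name, and the sketch assumes `/health` can be reached without the API key; if your build requires it, add the `Authorization` header to the `curl` call.
+
+```bash
+# Hypothetical helper: poll <base_url>/health until it returns a success
+# status, trying up to <attempts> times (default 30) one second apart.
+wait_for_llama() {
+  local base_url="$1" attempts="${2:-30}" i
+  for ((i = 0; i < attempts; i++)); do
+    # -f makes curl fail on HTTP errors such as 503 while the model loads.
+    if curl -fs -o /dev/null --max-time 2 "$base_url/health"; then
+      return 0
+    fi
+    sleep 1
+  done
+  return 1
+}
+
+# Example call:
+# wait_for_llama "http://localhost:8080" 60 && echo "server is ready"
+```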
+
+## 2. Test the OpenAI-compatible API
+
+In another terminal, send a small chat request:
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Bearer $LLAMA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-4-E2B-it",
+    "messages": [
+      {"role": "user", "content": "hello"}
+    ]
+  }'
+```
+
+A working server returns a `chat.completion` JSON object with an assistant message.
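+
+To check only the reply text, you can pull the assistant message out of the response. The sketch below assumes the standard OpenAI chat completion shape, where the text sits at `choices[0].message.content`, and uses `python3` for JSON parsing. `extract_reply` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: read a chat.completion JSON object on stdin and
+# print the content of the first assistant message.
+extract_reply() {
+  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
+}
+
+# Example call against the local server:
+# curl -s http://localhost:8080/v1/chat/completions \
+#   -H "Authorization: Bearer $LLAMA_API_KEY" \
+#   -H "Content-Type: application/json" \
+#   -d '{"model": "gemma-4-E2B-it", "messages": [{"role": "user", "content": "hello"}]}' \
+#   | extract_reply
+```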
+
+## 3. Configure Bub
+
+Point Bub at the local server:
+
+```bash
+export BUB_API_BASE="http://localhost:8080/v1"
+export BUB_API_KEY="$LLAMA_API_KEY"
+export BUB_MODEL="openai:gemma-4-E2B-it"
+```
+
+Run one Bub turn:
+
+```bash
+uv run bub run "Reply with one short sentence: hello from a local model."
+```
+
+Bub now uses the local OpenAI-compatible endpoint for model calls. The turn pipeline, channels, tools, and tapes are unchanged. A successful run logs the model name and prints the reply, for example:
+
+```bash
+~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
+2026-05-19 01:32:40.601 | INFO     | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
+2026-05-19 01:32:46.747 | INFO     | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
+[cli:local]
+hello from a local model.
+```
+
+## 4. Check the model documentation before changing workloads
+
+When you switch the model or quantization, check the upstream model documentation first:
+
+- The Hugging Face GGUF card lists supported local runtimes and the available quantized files.
+- The Gemma 4 model card documents input modalities, context windows, intended use, license, and risks.
+- Local execution does not remove the need for evaluation. A local model can still produce incorrect, biased, or unsafe output.
+
+Use small local models for workloads where their latency, privacy, cost, or offline behavior matters more than maximum model quality. For higher-stakes or product-facing workflows, evaluate the model on representative tasks before routing real users to it.
+
+## Clean up
+
+Stop the Docker container with `Ctrl-C`. Because it was started with `--rm`, Docker removes the container when it exits.
+
+Unset the Bub overrides and the local API key when you want to return to your previous provider:
+
+```bash
+unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
+```
diff --git a/website/src/content/docs/zh-cn/docs/tutorials/index.mdx b/website/src/content/docs/zh-cn/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ sidebar:
 1. [Observe Bub with tapes and Jaeger](/zh-cn/docs/tutorials/observability/): inspect Bub's own tape first, then export Logfire/OpenTelemetry traces to Jaeger.
 2. [Connect MCP servers with bub-mcp](/zh-cn/docs/tutorials/mcp/): install the MCP plugin, wire up a time server, and call it from a Bub turn.
 3. [Persist tapes with SQLAlchemy and SQLite](/zh-cn/docs/tutorials/tapestore-sqlalchemy/): replace the file-based tape store with a local SQLite database.
+4. [Run Bub with a local llama.cpp model](/zh-cn/docs/tutorials/local-llama-cpp/): expose a GGUF Gemma model as a local OpenAI-compatible endpoint.
 
 ## Next steps
 

diff --git a/website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx b/website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
+---
+title: Run Bub with a local llama.cpp model
+description: Start a local llama.cpp server and configure Bub to use it as an OpenAI-compatible model backend.
+sidebar:
+  order: 3
+---
+
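+This tutorial shows how to make Bub use a local `llama.cpp` server. When you finish, Bub's model calls go to a GGUF Gemma model running on your machine instead of a hosted API.
+
+Use this path when you need a local model for development environments, private experiments, offline demos, or low-latency tasks close to your application. This tutorial does not cover model benchmarking, fine-tuning, or production hardening, and it does not discuss how to choose the best model for every workload.
+
+The example uses [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF), the GGUF version of Google's Gemma 4 E2B instruction-tuned model. Google's [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) describes E2B and E4B as efficient models suited to mobile and edge devices; the [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) documents capabilities, limits, and responsible-use considerations.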
+
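+## Before you begin
+
+You need:
+
+- Bub installed, with `uv run bub --help` able to run.
+- Docker installed.
+- A GGUF model file under `~/.cache/llama.cpp/`.
+- Enough system memory to run the quantization you choose. The Q8 GGUF file for Gemma 4 E2B is about 5 GB; actual runtime memory also depends on settings such as context length, batching, and GPU offload.
+
+This tutorial uses the following two file names: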
+
+```text
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+If your file names differ, update the `-m` and `--mmproj` paths in the Docker command accordingly.
+
+## 1. Start the local server
+
+First, set an API key for the local server:
+
+```bash
+export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
+```
+
+Start `llama-server`:
+
+```bash
+sudo docker run --rm -it \
+  --security-opt label=disable \
+  -p 127.0.0.1:8080:8080 \
+  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
+  ghcr.m.daocloud.io/ggml-org/llama.cpp:full \
+  --server \
+  --host 0.0.0.0 \
+  --port 8080 \
+  --api-key "$LLAMA_API_KEY" \
+  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
+  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
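+The Docker port is bound to `127.0.0.1`, so the server is reachable only from the local machine. Adjust the port binding only when you explicitly need other machines to reach it.
+
+On SELinux systems, `--security-opt label=disable` avoids bind-mount permission problems when the container reads model files under `~/.cache/llama.cpp`. If you only need text input, you can delete the `--mmproj` line.
+
+If Docker prints `no ROCm-capable device is detected`, the container may still fall back to CPU inference. That is enough to verify the integration path, but responses will be slower.
+
+## 2. Test the OpenAI-compatible API
+
+In another terminal, send a small chat request: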
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Bearer $LLAMA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-4-E2B-it",
+    "messages": [
+      {"role": "user", "content": "hello"}
+    ]
+  }'
+```
+
+A working server returns a `chat.completion` JSON object that contains an assistant message.
+
+## 3. Configure Bub
+
+Point Bub at the local server:
+
+```bash
+export BUB_API_BASE="http://localhost:8080/v1"
+export BUB_API_KEY="$LLAMA_API_KEY"
+export BUB_MODEL="openai:gemma-4-E2B-it"
+```
+
+Run one Bub turn:
+
+```bash
+uv run bub run "Reply with one short sentence: hello from a local model."
+```
+
+Bub now makes model calls through the local OpenAI-compatible endpoint. The turn pipeline, channels, tools, and tapes do not need to change.
+
+```bash
+~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
+2026-05-19 01:32:40.601 | INFO     | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
+2026-05-19 01:32:46.747 | INFO     | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
+[cli:local]
+hello from a local model.
+```
+
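+## 4. Check the model documentation before changing workloads
+
+When you switch the model or quantization, check the upstream model documentation first:
+
+- The Hugging Face GGUF card lists the supported local runtimes and the available quantized files.
+- The Gemma 4 model card describes input modalities, context windows, intended use, license, and risks.
+- Running locally does not exempt you from evaluation. A local model can still generate incorrect, biased, or unsafe output.
+
+Small local models are a good first choice for workloads that value latency, privacy, cost, or offline capability over maximum model quality. For high-risk or product-facing workflows, evaluate the model on representative tasks before wiring it into real user paths.
+
+## Clean up
+
+Stop the Docker container with `Ctrl-C`.
+
+To return to your previous provider, clear the Bub override settings: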
+
+```bash
+unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
+```