Merged
1 change: 1 addition & 0 deletions website/src/content/docs/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ Tutorials are hands-on lessons. Use this section when you want to learn a workfl
1. [Observe Bub with tapes and Jaeger](/docs/tutorials/observability/) — inspect Bub's own tape first, then export Logfire/OpenTelemetry traces to Jaeger.
2. [Connect MCP Servers with bub-mcp](/docs/tutorials/mcp/) — install the MCP plugin, wire up a time server, and call it from a Bub turn.
3. [Persist tapes in SQLAlchemy with SQLite](/docs/tutorials/tapestore-sqlalchemy/) — replace the file-based tape store with a local SQLite database.
4. [Run Bub with a local llama.cpp model](/docs/tutorials/local-llama-cpp/) — expose a GGUF Gemma model as a local OpenAI-compatible endpoint.

## Next steps

124 changes: 124 additions & 0 deletions website/src/content/docs/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
---
title: Run Bub with a local llama.cpp model
description: Start a local llama.cpp server and configure Bub to use it as an OpenAI-compatible model provider.
sidebar:
  order: 3
---

This tutorial shows how to run Bub against a local `llama.cpp` server. By the end, Bub will send model calls to a GGUF Gemma model running on your machine instead of a hosted API.

Use this path when you want a local model for development, private experiments, offline demos, or latency-sensitive tasks near your application. This tutorial does not cover model benchmarking, fine-tuning, production hardening, or choosing the best model for every workload.

The example uses [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF), a GGUF build of Google's Gemma 4 E2B instruction-tuned model. Google's [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) describes E2B and E4B as efficient models for mobile and edge devices, and the [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) documents capabilities, limits, and responsible-use considerations.

## Before you begin

You need:

- Bub installed and runnable with `uv run bub --help`.
- Docker installed.
- A GGUF model file under `~/.cache/llama.cpp/`.
- Enough system memory for the quantization you choose. The Q8 Gemma 4 E2B GGUF file is about 5 GB on disk; runtime memory also depends on context size, batching, and GPU offload.

This tutorial uses these file names:

```text
~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
```

If your files use different names, update the `-m` and `--mmproj` paths in the Docker command.
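
Before starting the container, it can help to confirm that the files are where the Docker command expects them. The following is a minimal sketch, not part of Bub or llama.cpp: `check_model_files` is a hypothetical helper name, and it only tests that each file exists in the given directory.

```bash
# Report which of the given model files exist in a directory.
# Returns 0 when every file is present, 1 if any is missing.
check_model_files() {
  local dir="$1"
  shift
  local missing=0 name
  for name in "$@"; do
    if [ -f "$dir/$name" ]; then
      echo "found:   $dir/$name"
    else
      echo "missing: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the file names used in this tutorial:
# check_model_files "$HOME/.cache/llama.cpp" \
#   ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
#   ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
```

If you plan to drop `--mmproj` for text-only use, pass only the first file name.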

## 1. Start the local server

Set an API key for the local server:

```bash
export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
```

Start `llama-server`:

```bash
sudo docker run --rm -it \
  --security-opt label=disable \
  -p 127.0.0.1:8080:8080 \
  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
  ghcr.io/ggml-org/llama.cpp:full \
  --server \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key "$LLAMA_API_KEY" \
  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
```

The Docker port is bound to `127.0.0.1` so the server is available only from the local machine. Change the port binding only if you intentionally want another machine to reach it.

On SELinux systems, `--security-opt label=disable` avoids bind-mount permission failures when the container reads model files from `~/.cache/llama.cpp`. If you only need text input, remove the `--mmproj` line.

If the container logs `no ROCm-capable device is detected`, it can still fall back to CPU inference. That is enough to test the integration, but responses will be slower.
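
Loading a multi-gigabyte model can take a while, especially on CPU, so requests sent immediately after `docker run` may fail. One way to handle this is to poll until a probe command succeeds. The sketch below is an illustration only: `wait_until` is a hypothetical helper, and the example probe assumes that `llama-server` answers `GET /health` with a success status once the model is loaded.

```bash
# Retry a probe command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example probe (assumption: /health reports readiness on this server):
# wait_until 60 2 curl -fsS \
#   -H "Authorization: Bearer $LLAMA_API_KEY" \
#   http://127.0.0.1:8080/health
```

Passing the probe as arguments keeps the helper independent of any one endpoint, so you can swap in a different check if your server version behaves differently.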

## 2. Test the OpenAI-compatible API

In another terminal, send a small chat request:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $LLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-E2B-it",
    "messages": [
      {"role": "user", "content": "hello"}
    ]
  }'
```

A working server returns a `chat.completion` JSON object with an assistant message.

## 3. Configure Bub

Point Bub at the local server:

```bash
export BUB_API_BASE="http://localhost:8080/v1"
export BUB_API_KEY="$LLAMA_API_KEY"
export BUB_MODEL="openai:gemma-4-E2B-it"
```

Run one Bub turn:

```bash
uv run bub run "Reply with one short sentence: hello from a local model."
```

Bub now sends model calls to the local OpenAI-compatible endpoint. The turn pipeline, channels, tools, and tapes are unchanged. A successful run logs the configured model name and prints the reply, for example:

```bash
~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
2026-05-19 01:32:40.601 | INFO | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
2026-05-19 01:32:46.747 | INFO | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
[cli:local]
hello from a local model.
```
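
If you would rather not export the overrides into your whole shell session, you can scope them to a single command. This is a minimal sketch under the same values used above: `with_local_llama` is a hypothetical wrapper name, and it relies only on the standard `env` utility.

```bash
# Run one command with the Bub overrides set only for that command.
# The values mirror the exports above; nothing is added to the calling shell.
with_local_llama() {
  env \
    BUB_API_BASE="http://localhost:8080/v1" \
    BUB_API_KEY="${LLAMA_API_KEY:-test}" \
    BUB_MODEL="openai:gemma-4-E2B-it" \
    "$@"
}

# Example:
# with_local_llama uv run bub run "Reply with one short sentence: hello from a local model."
```

Because the variables exist only in the child process, other Bub invocations in the same terminal keep using your previous provider.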

## 4. Check the model documentation before changing workloads

When you switch the model or quantization, check the upstream model documentation first:

- The Hugging Face GGUF card lists supported local runtimes and the available quantized files.
- The Gemma 4 model card documents input modalities, context windows, intended use, license, and risks.
- Local execution does not remove the need for evaluation. A local model can still produce incorrect, biased, or unsafe output.

Use small local models for workloads where their latency, privacy, cost, or offline behavior matters more than maximum model quality. For higher-stakes or product-facing workflows, evaluate the model on representative tasks before routing real users to it.

## Clean up

Stop the Docker container with `Ctrl-C`.

When you want to return to your previous provider, unset the Bub overrides and the local API key:

```bash
unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
```
1 change: 1 addition & 0 deletions website/src/content/docs/zh-cn/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ sidebar:
1. [Observe Bub with tapes and Jaeger](/zh-cn/docs/tutorials/observability/): inspect Bub's own tape first, then export Logfire/OpenTelemetry traces to Jaeger.
2. [Connect MCP servers with bub-mcp](/zh-cn/docs/tutorials/mcp/): install the MCP plugin, wire up a time server, and call it from a Bub turn.
3. [Persist tapes with SQLAlchemy and SQLite](/zh-cn/docs/tutorials/tapestore-sqlalchemy/): replace the file-based tape store with a local SQLite database.
4. [Run Bub with a local llama.cpp model](/zh-cn/docs/tutorials/local-llama-cpp/): expose a GGUF Gemma model as a local OpenAI-compatible endpoint.

## Next steps

124 changes: 124 additions & 0 deletions website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
---
title: Run Bub with a local llama.cpp model
description: Start a local llama.cpp server and configure Bub to use it as an OpenAI-compatible model backend.
sidebar:
  order: 3
---

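This tutorial shows how to run Bub against a local `llama.cpp` server. When you finish, Bub's model calls go to a GGUF Gemma model running on your machine instead of a hosted API.

Use this path when you need a local model for development, private experiments, offline demos, or low-latency tasks close to your application. This tutorial does not cover model benchmarking, fine-tuning, or production hardening, and it does not discuss how to choose the best model for every workload.

The example uses [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF), a GGUF build of Google's Gemma 4 E2B instruction-tuned model. Google's [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) describes E2B and E4B as efficient models suited to mobile and edge devices; the [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) documents capabilities, limitations, and responsible-use considerations.

## Before you begin

You need:

- Bub installed, with `uv run bub --help` working.
- Docker installed.
- A GGUF model file under `~/.cache/llama.cpp/`.
- Enough system memory for the quantization you choose. The Q8 GGUF file for Gemma 4 E2B is about 5 GB; actual runtime memory also depends on settings such as context length, batching, and GPU offload.

This tutorial uses the following two file names: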

```text
~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
```

If your file names differ, update the `-m` and `--mmproj` paths in the Docker command accordingly.

## 1. Start the local server

First, set an API key for the local server:

```bash
export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
```

Start `llama-server`:

```bash
sudo docker run --rm -it \
  --security-opt label=disable \
  -p 127.0.0.1:8080:8080 \
  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
  ghcr.m.daocloud.io/ggml-org/llama.cpp:full \
  --server \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key "$LLAMA_API_KEY" \
  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
```

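The Docker port is bound to `127.0.0.1`, so the server is reachable only from the local machine. Change the port binding only when you explicitly need other machines to reach it.

On SELinux systems, `--security-opt label=disable` avoids bind-mount permission problems when the container reads model files under `~/.cache/llama.cpp`. If you only need text input, you can delete the `--mmproj` line.

If Docker prints `no ROCm-capable device is detected`, the container may still fall back to CPU inference. That is enough to verify the integration path, but responses will be slower.

## 2. Test the OpenAI-compatible API

In another terminal, send a small chat request: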

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $LLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-E2B-it",
    "messages": [
      {"role": "user", "content": "hello"}
    ]
  }'
```

A working server returns a `chat.completion` JSON object that contains an assistant message.

## 3. Configure Bub

Point Bub at the local server:

```bash
export BUB_API_BASE="http://localhost:8080/v1"
export BUB_API_KEY="$LLAMA_API_KEY"
export BUB_MODEL="openai:gemma-4-E2B-it"
```

Run one Bub turn:

```bash
uv run bub run "Reply with one short sentence: hello from a local model."
```

Bub now makes model calls through the local OpenAI-compatible endpoint. The turn pipeline, channels, tools, and tapes do not need to change.

```bash
~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
2026-05-19 01:32:40.601 | INFO | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
2026-05-19 01:32:46.747 | INFO | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
[cli:local]
hello from a local model.
```

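## 4. Check the model documentation before changing workloads

When you switch the model or quantization, check the upstream model documentation first:

- The Hugging Face GGUF card lists the supported local runtimes and the available quantized files.
- The Gemma 4 model card describes input modalities, context windows, intended use, license, and risks.
- Running locally does not exempt a model from evaluation. A local model can still produce incorrect, biased, or unsafe output.

Small local models are a good first choice for workloads that value latency, privacy, cost, or offline capability over maximum model quality. For high-stakes or product-facing workflows, evaluate the model on representative tasks before routing real users to it.

## Clean up

Stop the Docker container with `Ctrl-C`.

To return to your previous provider, clear the Bub overrides: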

```bash
unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
```