diff --git a/website/src/content/docs/docs/tutorials/index.mdx b/website/src/content/docs/docs/tutorials/index.mdx
index 61f5164f..e7877fdd 100644
--- a/website/src/content/docs/docs/tutorials/index.mdx
+++ b/website/src/content/docs/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ Tutorials are hands-on lessons. Use this section when you want to learn a workfl
 1. [Observe Bub with tapes and Jaeger](/docs/tutorials/observability/) — inspect Bub's own tape first, then export Logfire/OpenTelemetry traces to Jaeger.
 2. [Connect MCP Servers with bub-mcp](/docs/tutorials/mcp/) — install the MCP plugin, wire up a time server, and call it from a Bub turn.
 3. [Persist tapes in SQLAlchemy with SQLite](/docs/tutorials/tapestore-sqlalchemy/) — replace the file-based tape store with a local SQLite database.
+4. [Run Bub with a local llama.cpp model](/docs/tutorials/local-llama-cpp/) — expose a GGUF Gemma model as a local OpenAI-compatible endpoint.
 
 ## Next steps
 
diff --git a/website/src/content/docs/docs/tutorials/local-llama-cpp.mdx b/website/src/content/docs/docs/tutorials/local-llama-cpp.mdx
new file mode 100644
index 00000000..ec99caab
--- /dev/null
+++ b/website/src/content/docs/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
+---
+title: Run Bub with a local llama.cpp model
+description: Start a local llama.cpp server and configure Bub to use it as an OpenAI-compatible model provider.
+sidebar:
+  order: 3
+---
+
+This tutorial shows how to run Bub against a local `llama.cpp` server. By the end, Bub will send model calls to a GGUF Gemma model running on your machine instead of a hosted API.
+
+Use this path when you want a local model for development, private experiments, offline demos, or latency-sensitive tasks near your application. This tutorial does not cover model benchmarking, fine-tuning, production hardening, or choosing the best model for every workload.
+
+The example uses [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF), a GGUF build of Google's Gemma 4 E2B instruction-tuned model. Google's [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) describes E2B and E4B as efficient models for mobile and edge devices, and the [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) documents capabilities, limits, and responsible-use considerations.
+
+## Before you begin
+
+You need:
+
+- Bub installed and runnable with `uv run bub --help`.
+- Docker installed.
+- A GGUF model file under `~/.cache/llama.cpp/`.
+- Enough system memory for the quantization you choose. The Q8 Gemma 4 E2B GGUF file is about 5 GB on disk; runtime memory also depends on context size, batching, and GPU offload.
+
+This tutorial uses these file names:
+
+```text
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+If your files use different names, update the `-m` and `--mmproj` paths in the Docker command.
+
+## 1. Start the local server
+
+Set an API key for the local server:
+
+```bash
+export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
+```
+
+Start `llama-server`:
+
+```bash
+sudo docker run --rm -it \
+  --security-opt label=disable \
+  -p 127.0.0.1:8080:8080 \
+  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
+  ghcr.io/ggml-org/llama.cpp:full \
+  --server \
+  --host 0.0.0.0 \
+  --port 8080 \
+  --api-key "$LLAMA_API_KEY" \
+  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
+  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+The Docker port is bound to `127.0.0.1` so the server is available only from the local machine. Change the port binding only if you intentionally want another machine to reach it.
+
+On SELinux systems, `--security-opt label=disable` avoids bind-mount permission failures when the container reads model files from `~/.cache/llama.cpp`. If you only need text input, remove the `--mmproj` line.
+
+If Docker prints `no ROCm-capable device is detected`, the container can still fall back to CPU inference. That is enough to test the integration, but responses will be slower.
+
+## 2. Test the OpenAI-compatible API
+
+In another terminal, send a small chat request:
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Bearer $LLAMA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-4-E2B-it",
+    "messages": [
+      {"role": "user", "content": "hello"}
+    ]
+  }'
+```
+
+A working server returns a `chat.completion` JSON object with an assistant message.
+
+## 3. Configure Bub
+
+Point Bub at the local server:
+
+```bash
+export BUB_API_BASE="http://localhost:8080/v1"
+export BUB_API_KEY="$LLAMA_API_KEY"
+export BUB_MODEL="openai:gemma-4-E2B-it"
+```
+
+Run one Bub turn:
+
+```bash
+uv run bub run "Reply with one short sentence: hello from a local model."
+```
+
+Bub now uses the local OpenAI-compatible endpoint for model calls. The turn pipeline, channels, tools, and tapes are unchanged.
+
+```bash
+~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
+2026-05-19 01:32:40.601 | INFO | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
+2026-05-19 01:32:46.747 | INFO | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
+[cli:local]
+hello from a local model.
+```
+
+## 4. Check the model documentation before changing workloads
+
+When you switch the model or quantization, check the upstream model documentation first:
+
+- The Hugging Face GGUF card lists supported local runtimes and the available quantized files.
+- The Gemma 4 model card documents input modalities, context windows, intended use, license, and risks.
+- Local execution does not remove the need for evaluation. A local model can still produce incorrect, biased, or unsafe output.
+
+Use small local models for workloads where their latency, privacy, cost, or offline behavior matters more than maximum model quality. For higher-stakes or product-facing workflows, evaluate the model on representative tasks before routing real users to it.
+
+## Clean up
+
+Stop the Docker container with `Ctrl-C`.
+
+Unset the Bub overrides when you want to return to your previous provider:
+
+```bash
+unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
+```
diff --git a/website/src/content/docs/zh-cn/docs/tutorials/index.mdx b/website/src/content/docs/zh-cn/docs/tutorials/index.mdx
index 8c511051..020e1c68 100644
--- a/website/src/content/docs/zh-cn/docs/tutorials/index.mdx
+++ b/website/src/content/docs/zh-cn/docs/tutorials/index.mdx
@@ -12,6 +12,7 @@ sidebar:
 1. [使用 tape 与 Jaeger 观察 Bub](/zh-cn/docs/tutorials/observability/) — 先检查 Bub 自身的 tape,再把 Logfire/OpenTelemetry trace 导出到 Jaeger。
 2. [使用 bub-mcp 连接 MCP 服务器](/zh-cn/docs/tutorials/mcp/) — 安装 MCP 插件,接入时间服务器,并在 Bub turn 中调用。
 3. [用 SQLAlchemy 与 SQLite 持久化 tape](/zh-cn/docs/tutorials/tapestore-sqlalchemy/) — 把基于文件的 tape store 换成本地 SQLite 数据库。
+4. [使用本地 llama.cpp 模型运行 Bub](/zh-cn/docs/tutorials/local-llama-cpp/) — 把 GGUF Gemma 模型暴露成本地 OpenAI-compatible endpoint。
 
 ## 下一步
 
diff --git a/website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx b/website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx
new file mode 100644
index 00000000..0388823b
--- /dev/null
+++ b/website/src/content/docs/zh-cn/docs/tutorials/local-llama-cpp.mdx
@@ -0,0 +1,124 @@
+---
+title: 使用本地 llama.cpp 模型运行 Bub
+description: 启动本地 llama.cpp server,并把 Bub 配置为使用这个 OpenAI-compatible 模型后端。
+sidebar:
+  order: 3
+---
+
+本教程演示如何让 Bub 使用本地 `llama.cpp` server。完成后,Bub 的模型调用会发往你机器上运行的 GGUF Gemma 模型,而不是托管 API。
+
+当你需要在开发环境、私有实验、离线 demo,或贴近应用的低延迟任务中使用本地模型时,可以使用这条路径。本教程不覆盖模型 benchmark、fine-tuning、生产加固,也不讨论如何为所有工作负载选择最佳模型。
+
+示例使用 [`ggml-org/gemma-4-E2B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF),这是 Google Gemma 4 E2B instruction-tuned 模型的 GGUF 版本。Google 的 [Gemma 4 overview](https://deepmind.google/models/gemma/gemma-4/) 把 E2B 和 E4B 描述为适合移动和边缘设备的高效模型;[Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) 记录了能力、限制和负责任使用相关注意事项。
+
+## 开始前
+
+你需要:
+
+- Bub 已安装,且 `uv run bub --help` 可以运行。
+- 已安装 Docker。
+- `~/.cache/llama.cpp/` 下有 GGUF 模型文件。
+- 系统内存足够运行所选量化版本。Gemma 4 E2B 的 Q8 GGUF 文件约 5 GB;实际运行内存还会受上下文长度、batching、GPU offload 等设置影响。
+
+本教程使用下面两个文件名:
+
+```text
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf
+~/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+如果你的文件名不同,需要相应修改 Docker 命令里的 `-m` 和 `--mmproj` 路径。
+
+## 1. 启动本地 server
+
+先为本地 server 设置一个 API key:
+
+```bash
+export LLAMA_API_KEY="${LLAMA_API_KEY:-test}"
+```
+
+启动 `llama-server`:
+
+```bash
+sudo docker run --rm -it \
+  --security-opt label=disable \
+  -p 127.0.0.1:8080:8080 \
+  -v "$HOME/.cache/llama.cpp:/root/.cache/llama.cpp:ro" \
+  ghcr.m.daocloud.io/ggml-org/llama.cpp:full \
+  --server \
+  --host 0.0.0.0 \
+  --port 8080 \
+  --api-key "$LLAMA_API_KEY" \
+  -m /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_gemma-4-E2B-it-Q8_0.gguf \
+  --mmproj /root/.cache/llama.cpp/ggml-org_gemma-4-E2B-it-GGUF_mmproj-gemma-4-E2B-it-Q8_0.gguf
+```
+
+Docker 端口绑定到 `127.0.0.1`,因此这个 server 只在本机可访问。只有在明确需要其他机器访问时,才调整端口绑定。
+
+在 SELinux 系统上,`--security-opt label=disable` 可以避免容器读取 `~/.cache/llama.cpp` 下模型文件时遇到 bind mount 权限问题。如果只需要文本输入,可以删除 `--mmproj` 那一行。
+
+如果 Docker 输出 `no ROCm-capable device is detected`,容器仍然可能回退到 CPU inference。这足够验证接入路径,但响应速度会更慢。
+
+## 2. 测试 OpenAI-compatible API
+
+在另一个终端发送一个小 chat 请求:
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Bearer $LLAMA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-4-E2B-it",
+    "messages": [
+      {"role": "user", "content": "hello"}
+    ]
+  }'
+```
+
+正常工作的 server 会返回一个 `chat.completion` JSON 对象,其中包含 assistant message。
+
+## 3. 配置 Bub
+
+把 Bub 指向本地 server:
+
+```bash
+export BUB_API_BASE="http://localhost:8080/v1"
+export BUB_API_KEY="$LLAMA_API_KEY"
+export BUB_MODEL="openai:gemma-4-E2B-it"
+```
+
+运行一个 Bub turn:
+
+```bash
+uv run bub run "Reply with one short sentence: hello from a local model."
+```
+
+现在 Bub 会通过本地 OpenAI-compatible endpoint 进行模型调用。turn pipeline、channels、tools 和 tapes 都不需要改变。
+
+```bash
+~/bubbuild/bub$ uv run bub run "Reply with one short sentence: hello from a local model."
+2026-05-19 01:32:40.601 | INFO | bub.builtin.agent:_run_tools_with_auto_handoff:271 - loop.step step=1 tape=becda04eb9f7369c__0b871d5e50e7c192 model=openai:gemma-4-E2B-it
+2026-05-19 01:32:46.747 | INFO | bub.builtin.store:fork:122 - Merged 7 entries into tape "becda04eb9f7369c__0b871d5e50e7c192"
+[cli:local]
+hello from a local model.
+```
+
+## 4. 更换工作负载前检查模型文档
+
+当你切换模型或量化版本时,先检查上游模型文档:
+
+- Hugging Face GGUF card 会列出支持的本地运行方式和可用量化文件。
+- Gemma 4 model card 会说明输入模态、上下文窗口、预期用途、许可和风险。
+- 本地运行不等于免评测。本地模型仍然可能生成错误、有偏见或不安全的输出。
+
+适合优先尝试小型本地模型的,是那些更看重延迟、隐私、成本或离线能力,而不是最高模型质量的工作负载。对于高风险或面向产品的工作流,应先在有代表性的任务上评估模型,再接入真实用户路径。
+
+## 清理
+
+用 `Ctrl-C` 停止 Docker 容器。
+
+如果要恢复到之前的 provider,清理 Bub 覆盖配置:
+
+```bash
+unset BUB_API_BASE BUB_API_KEY BUB_MODEL LLAMA_API_KEY
+```
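
A note on step 2 of the tutorial, kept outside the patch itself: the `curl` check against `/v1/chat/completions` could also be sketched in Python with only the standard library, which may be convenient as a smoke test before exporting the `BUB_*` variables. The helper names `build_chat_request` and `extract_reply` are hypothetical and not part of Bub or llama.cpp. The sketch assumes the response follows the usual OpenAI-compatible `chat.completion` shape, with the reply under `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint.

    base_url is the value the tutorial exports as BUB_API_BASE,
    for example "http://localhost:8080/v1".
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a chat.completion JSON document.

    Assumes the OpenAI-compatible response shape; raises ValueError
    if the top-level "object" field is anything else.
    """
    data = json.loads(response_body)
    if data.get("object") != "chat.completion":
        raise ValueError(f"unexpected response object: {data.get('object')!r}")
    return data["choices"][0]["message"]["content"]
```

To run it against the server from step 1, build the request with the tutorial's values (`http://localhost:8080/v1`, the `LLAMA_API_KEY` value, `gemma-4-E2B-it`), pass it to `urllib.request.urlopen(request, timeout=120)`, and feed `response.read()` to `extract_reply`. Keeping request construction and response parsing in separate functions lets both be checked without a running container.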