ncnn_llm

LLM, VLM, OCR, translation, and embedding inference on top of ncnn.

Chinese documentation · Quick Start · Supported Models · Model Zoo


ncnn_llm provides a lightweight C++ runtime for running language models and embedding models with ncnn. It focuses on practical local inference on edge devices, desktop CPUs, and Vulkan-capable GPUs.

The project grew out of nihui's experimental ncnn KV-cache work and expands it into reusable examples, model loaders, tokenizers, vision preprocessing, OCR inference, and embedding APIs.

Highlights

  • Unified CLI runner for chat and vision-language models
  • KV-cache autoregressive decoding with CPU and optional Vulkan execution
  • Qwen / MiniCPM style LLM support
  • Qwen VL image input support
  • GLM-OCR image-to-text example
  • NLLB translation example
  • Text and multimodal embedding APIs
  • BPE and Unigram tokenizer support
  • xmake-based build with small standalone examples

Supported Models

| Category | Model | Status | Notes |
| --- | --- | --- | --- |
| LLM | YoutuLLM | Supported | Chat / text generation |
| LLM | MiniCPM4 | Supported | Chat / text generation |
| LLM | Qwen3 | Supported | Chat / text generation |
| VLM | Qwen3.5 | Supported | Image + text input |
| VLM | Qwen2.5-VL | Supported | Image + text input |
| OCR | GLM-OCR | Supported | OCR |
| Translation | NLLB | Supported | Translation example |
| Embedding | Jina-Embeddings-v5-Text-Nano | Supported | 768-dim text embeddings |
| Embedding | Jina-CLIP-v2 | Supported | 1024-dim text + image embeddings |

Quick Start

1. Requirements

  • xmake
  • ncnn built from the master branch

2. Clone

git clone https://github.com/futz12/ncnn_llm.git
cd ncnn_llm

3. Build

xmake build

Build a single target:

xmake build llm_ncnn_run

4. Download Models

Download converted ncnn model directories from the mirror:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Put the model directory under assets/, for example:

assets/
└── qwen3_0.6b/
    ├── model.json
    ├── *.ncnn.param
    ├── *.ncnn.bin
    └── tokenizer files

CLI Chat

llm_ncnn_run is the main interactive example for text and vision-language models.

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b

With explicit runtime options:

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --threads 4
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --vulkan --vulkan-device 0

Vision-language input:

xmake run llm_ncnn_run --model ./assets/qwen2.5_vl_3b --image ./assets/test.jpg

CLI Options

| Option | Description |
| --- | --- |
| --model | Model directory |
| --threads | CPU thread count |
| --vulkan | Enable Vulkan compute |
| --vulkan-device | Vulkan device index |
| --image | Image path for VL models |
| --builtin-tools | Enable built-in demo tools |

Example session:

llm_ncnn_run (cli). Type 'exit' or 'quit' to end the conversation.
User: Hello
Assistant: Hello! How can I help you today?

OCR

GLM-OCR uses a dedicated image prefill path and the shared text decode runtime.

xmake build ocr_main
xmake run ocr_main --model ./assets/glm_ocr --image ./test_ocr.png --prompt "Read the text in the image."

Example output:

Generating text:
Hello World 123

Embeddings

ncnn_embedding provides a common API for text embeddings and CLIP-style text-image embeddings.

Text Embedding

xmake build embedding_main
xmake run embedding_main --model ./assets/jina-embeddings-v5-text-nano

CLIP Multimodal Embedding

xmake build clip_main
xmake run clip_main --model ./assets/jina_clip_v2 --image ./assets/ganyu.jpg

C++ API

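#include <vector>
#include "ncnn_embedding.h"

// Load the model directory. See ncnn_embedding.h for the meaning of the
// second and third constructor arguments.
ncnn_embedding embed("./assets/jina_clip_v2", false, 4);

// Encode a sentence into an embedding vector.
std::vector<float> text_vec = embed.encode_text("Hello world");

// Only models with an image encoder report supports_image().
if (embed.supports_image()) {
    std::vector<float> image_vec = embed.encode_image_file("./image.jpg");
    // Compare the two vectors in the shared embedding space.
    float score = cosine_similarity(text_vec, image_vec);
}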
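
The snippet above compares embeddings with `cosine_similarity`, which this README does not define. For readers who want to see what such a score computes, here is a minimal standalone sketch. The name `cosine_similarity_sketch` is hypothetical, chosen so it does not clash with whatever helper the project provides, and the zero-vector handling is an assumption of this sketch rather than documented library behavior.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative stand-in, not the ncnn_llm implementation.
// Returns dot(a, b) / (|a| * |b|), a value in [-1, 1].
float cosine_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
    // Both embeddings must come from the same model (same dimension).
    assert(a.size() == b.size());

    // Accumulate in double to limit rounding error on long vectors.
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }

    // Assumption of this sketch: a zero vector has similarity 0.
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0f;
    return static_cast<float>(dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Scores near 1 mean the text and image point in nearly the same direction in the embedding space. If the model already returns L2-normalized vectors, the score reduces to a plain dot product.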

Other Examples

| Target | Purpose |
| --- | --- |
| llm_ncnn_run | Unified chat / VL CLI |
| ocr_main | GLM-OCR inference |
| embedding_main | Text embedding inference |
| clip_main | CLIP text-image embedding inference |
| nllb_main | NLLB translation example |
| unigram_main | Unigram tokenizer example |
| benchllm | LLM benchmark |
| test_llm | Unit tests |

Build and run tests:

xmake build test_llm
xmake run test_llm

Run benchmark:

xmake build benchllm
xmake run benchllm [loop_count] [num_threads] [powersave] [gpu_device] [cooling_down] [seqlen]

Model Zoo

Converted ncnn model weights are available from:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Each downloaded model directory should contain model.json, ncnn param/bin files, and tokenizer files. Put the directory under assets/ or pass its path with --model.

Configuration

Each model directory is described by model.json. The exact fields depend on the model family, but a typical text model contains:

{
  "model_type": "llm",
  "params": {
    "embed_param": "embed.ncnn.param",
    "embed_bin": "embed.ncnn.bin",
    "decoder_param": "decoder.ncnn.param",
    "decoder_bin": "decoder.ncnn.bin",
    "lm_head_param": "lm_head.ncnn.param",
    "lm_head_bin": "lm_head.ncnn.bin"
  },
  "tokenizer": {
    "type": "bbpe",
    "vocab_file": "vocab.txt",
    "merges_file": "merges.txt"
  },
  "setting": {
    "attn_cnt": 32,
    "hidden_size": 1024,
    "rope": {
      "type": "RoPE",
      "rope_head_dim": 64,
      "rope_theta": 1000000.0
    }
  }
}

Embedding and OCR models use their own model_type and parameter sections. See the model files under assets/ for concrete examples.
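
To illustrate how the `rope` settings in the sample model.json are typically used, the sketch below derives rotation frequencies from `rope_head_dim` and `rope_theta` using the standard RoPE formulation. It is an assumption that the runtime follows this convention. The function names are hypothetical, and the way dimensions are paired (interleaved or split halves) differs between model families, so only the rotation of a single pair is shown.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Standard RoPE inverse frequencies: inv_freq[i] = theta^(-2i / d)
// for i in [0, d/2). Hypothetical helper, not an ncnn_llm API.
std::vector<double> rope_inv_freq(int rope_head_dim, double rope_theta) {
    std::vector<double> inv_freq;
    inv_freq.reserve(rope_head_dim / 2);
    for (int i = 0; i < rope_head_dim / 2; ++i) {
        inv_freq.push_back(std::pow(rope_theta, -2.0 * i / rope_head_dim));
    }
    return inv_freq;
}

// Rotates one (x0, x1) pair by the angle position * inv_freq.
// Which two dimensions form a pair depends on the model family.
std::pair<double, double> rope_rotate_pair(double x0, double x1,
                                           int position, double inv_freq) {
    const double angle = position * inv_freq;
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    return {x0 * c - x1 * s, x0 * s + x1 * c};
}
```

With the sample values (`rope_head_dim` 64, `rope_theta` 1000000.0) this yields 32 frequencies, starting at 1.0 and decreasing geometrically. A larger `rope_theta` makes the slowest rotations slower, which is commonly associated with longer context windows.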

Project Layout

ncnn_llm/
├── assets/                 # Local model directories and demo assets
├── benchmark/              # Benchmark entry points
├── examples/               # CLI and feature examples
│   ├── llm_ncnn_run/       # Unified chat / VL runner
│   ├── ocr_main.cpp        # OCR example
│   ├── embedding_main.cpp  # Text embedding example
│   ├── clip_main.cpp       # CLIP example
│   └── nllb_main.cpp       # Translation example
├── export/                 # Export scripts
├── src/                    # Core runtime
│   ├── ncnn_llm_gpt.*      # LLM / VL runtime
│   ├── ncnn_llm_ocr.*      # OCR image prefill + shared decode
│   ├── ncnn_embedding.*    # Embedding runtime
│   ├── ncnn_text_runtime.* # Shared text decode helpers
│   └── utils/              # Tokenizer, image, RoPE, prompt helpers
├── tests/                  # Unit tests
└── xmake.lua               # Build configuration

Roadmap

  • Keep decoder and KV-cache runtime shared across model families
  • Expand supported model architectures and tokenizers
  • Improve Vulkan and CPU performance
  • Add INT8 quantization support
  • Document model export pipelines in more detail

Older export scripts may become outdated as the runtime evolves. Prefer the latest model examples and model.json files as references.

Community

Issues, fixes, converted models, and test results are welcome.

  • QQ group: 767178345

License

Apache License 2.0. See LICENSE.
