llama.cpp使用

前面的一篇文章介绍了Ollama运行DeepSeek-R1，实际上Ollama的后端使用的是llama.cpp。

llama.cpp使用纯C/C++实现，具备非常高效的性能，支持很多种硬件架构和指令集，并且提供了很多预编译好的工具，包括量化工具，Hugging Face上很多的模型都是使用llama.cpp进行量化的。

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

Plain C/C++ implementation without any dependencies

Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks

AVX, AVX2, AVX512 and AMX support for x86 architectures

1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use

Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)

Vulkan and SYCL backend support

CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

安装llama.cpp也非常简单，从llama.cpp Releases页面下载对应硬件的二进制文件，注意依赖的CUDA库文件较大所以单独打包了，将所有的文件解压到同一个目录就可以使用。

以DeepSeek-R1为例，运行 llama-server -m ./DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf --port 11435 --no-mmap -c 16384 -np 4 -t 10 -ngl 99 命令在后端启动模型，此时可以在浏览器打开 http://localhost:11435 ，这是一个简单的聊天界面，也可以安装使用Open WebUI这样功能更强大的应用。

llama.cpp还提供了 llama-cli 工具，可以直接在命令行中和模型进行对话。

在Emacs中，可以通过gptel来对接llama.cpp，添加以下代码注册DeepSeek-R1模型。

(gptel-make-openai "llama-cpp"
  :stream nil
  :protocol "http"
  :host "localhost:11435"
  :models '(DeepSeek-R1))