Using llama.cpp

A previous article covered running DeepSeek-R1 with Ollama; in fact, Ollama uses llama.cpp as its backend.

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

Installing llama.cpp is straightforward: download the prebuilt binaries for your hardware from the llama.cpp Releases page. Note that the CUDA runtime libraries it depends on are large and therefore packaged as a separate archive; extract all the files into the same directory and it is ready to use.
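For example, a CUDA build on Windows ships as two archives, the program binaries plus the separately packaged CUDA runtime (the asset names below are illustrative and vary by release):

unzip llama-bXXXX-bin-win-cuda-cu12.4-x64.zip -d llama.cpp
unzip cudart-llama-bin-win-cu12.4-x64.zip -d llama.cpp   # CUDA runtime, extracted into the same directory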

Using DeepSeek-R1 as an example, run llama-server -m ./DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf --port 11435 --no-mmap -c 16384 -np 4 -t 10 -ngl 99 to start the model as a local server. Here --no-mmap loads the model fully into RAM instead of memory-mapping it, -c 16384 sets the context size, -np 4 handles up to four requests in parallel, -t 10 uses ten CPU threads, and -ngl 99 offloads all layers to the GPU. You can then open http://localhost:11435 in a browser, which serves a simple built-in chat interface; for a more capable frontend, you can install something like Open WebUI.
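llama-server also exposes an OpenAI-compatible HTTP API, which is what frontends like Open WebUI talk to. A quick sanity check with curl (the model field is informational here, since the server already has the model loaded):

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'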

llama.cpp also ships a llama-cli tool for chatting with a model directly from the command line, as sketched below.
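A minimal interactive session with the same model might look like this (-cnv enables conversation mode; exact flags can vary between releases):

llama-cli -m ./DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf -cnv -ngl 99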

In Emacs, you can connect to llama.cpp through gptel; add the following code to register the DeepSeek-R1 model.

(gptel-make-openai "llama-cpp"   ; backend name as shown in gptel
  :stream nil                    ; set to t to stream responses token by token
  :protocol "http"               ; llama-server serves plain HTTP locally
  :host "localhost:11435"        ; must match the --port given to llama-server
  :models '(DeepSeek-R1))        ; advertised model name; llama-server serves whatever model it loaded
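After evaluating this, pick the llama-cpp backend and the DeepSeek-R1 model from gptel's menu (M-x gptel-menu) and chat as usual.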