llama.cpp是一个高性能的CPU/GPU大语言模型推理框架，适用于消费级设备或边缘设备。开发者可以通过工具将各类开源大语言模型转换并量化成gguf格式的文件，然后通过llama.cpp实现本地推理。经过我的调研，相比较其它大模型落地方案，中小型研发企业使用llama.cpp可能是唯一的产品落地方案。关键词：“中小型研发企业”，“产品落地方案”。

中小型研发企业：相较动辄千万+的硬件投入，中小型研发企业只能支撑少量硬件投入，并且也缺少专业的研发人员。

产品落地方案：项目需要具备在垂直领域落地的能力，大多数情况下还需要私有化部署。

网上有不少介绍的文章，B站上甚至有一些收费课程。但是版本落后较多，基本已经没有参考价值。本文采用b3669版本，发布日期是2024年9月，参考代码：examples/main.cpp。由于作者(Georgi Gerganov)没有提供详细的接口文档，examples的代码质量也确实不高，因此学习曲线比较陡峭。本文旨在介绍如何使用llama.cpp进行推理和介绍重点函数，帮助开发人员入门，深入功能还有待研究。

一、推理流程

1. 过程描述

以常见的交互推理为例，程序大概可以分成5个子功能模块。

初始化：模型和系统提示词初始化。其实从程序处理过程上分析，并没有特别区分系统提示词与用户输入，实际项目开发中完全可以放在一起处理。后面会再解释它们在概念上的区别。

用户输入：等待用户输入文本信息。大语言模型其实就是对人类的文本信息进行分析和理解的过程，而产品落地的本质就是借助大模型的理解进一步完成一些指定任务。在这个过程中，互联网上又造了许多概念，什么agent，function等。其实本质上都是在研究如何将大模型与程序进一步结合并完成交互。至少目前，我的观点是：大模型仅具备语义分析，语义推理的能力。

分析预测：这个是大语言模型的核心能力之一，它需要分析上下文（系统提示词、用户输入、已推理的内容）再进一步完成下一个词语（token）的预测。

推理采样：这个是大语言模型的另一个核心能力，它需要从分析预测的结果中随机选择一个token，并将它作为输入反向发送给分析预测模块继续进行，直到输出结束（EOS）。

输出：这个模块严格说不属于大模型，但是它又是完成用户交互必须模块。从产品设计上，可以选择逐字输出（token-by-token）或者一次性输出（token-by-once）。

2. 概念介绍

角色（roles）：大语言模型通常会内置三种角色：系统（system），用户（user），助手（assistant）。这三种角色并非所有模型统一指定，但是基本目前所有开源的大模型都兼容这三种角色的交互，它有助于大模型更好的理解人类语境并完成任务。system表示系统提示词，就是我们常说的prompt。网上有不少课程将写系统提示词描述为提示词工程，还煞有介事的进行分类，其实大可不必。从我的使用经验看，一个好的系统提示词（prompt）应具备三个要点即可：语义明确，格式清晰，任务简单。语义明确即在系统提示词中尽量不要使用模棱两可的词语，用人话说就是“把问题说清楚”。格式清晰即可以使用markdown或者json指定一些重要概念。如果你需要让大模型按照某个固定流程进行分析，可以使用markdown的编号语法，如果你需要将大模型对推理结果进行结构化处理，可以使用json语法。任务简单即不要让大模型处理逻辑太复杂或者流程太多的任务。大模型的推理能力完全基于语义理解，它并不具备严格意义上的程序执行逻辑和数学运算逻辑。这就是为什么，当你问大模型：1.11和1.8谁大的时候，它会一本正经的告诉你，当整数部分一样大的时候，仅需要比较小数部分，因为11大于8，因此1.11大于1.8。那么如果我们现实中确实有一些计算任务或复杂的流程需要处理怎么办？我的解决方案是，与程序交互和动态切换上下文。除了系统角色以外，用户一般代表输入和助手一般代表输出。

token：这里不要理解为令牌，它的正确解释应该是一组向量的id。就是常见的描述大模型上下文长度的单位。一个token代表什么？互联网上有很多错误的解释，比较常见说法是：一个英文单词为1个token，一个中文通常是2-3个token。上面的流程介绍一节，我已经解释了“分析预测”与“采样推理”如何交互。“推理采样”生成1个token，反向输送给“分析预测”进行下一个token的预测，而输出模块可以选择token-by-token的方式向用户输出。实际上，对于中文而言，一个token通常表示一个分词。例如：“我爱中国”可能的分词结果是“我”，“爱”，“中国”也可能是“我”，“爱”，“中”，“国”。前者代表3个token，后者代表4个token。具体如何划分，取决于大模型的中文指令训练。除了常见的代表词语的token以外，还有一类特殊token（special token），例如上文提到的，大模型一个字一个字的进行推理生成，程序怎么知道何时结束？其实是有个eos-token，当读到这个token的时候，即表示本轮推理结束了。

3. 程序结构

llama.cpp的程序结构比较清晰，核心模块是llama和ggmll。ggml通过llama进行调用，开发通常不会直接使用。在llama中定义了常用的结构体和函数。common是对llama中函数功能的再次封装，有时候起到方便调用的目的。但是版本迭代上，common中的函数变化较快，最好的方法是看懂流程后直接调用llama.h中的函数。

4. 源码分析

下面我以examples/main/main.cpp作为基础做重点分析。

(1) 初始化

全局参数，这个结构体主要用来接收用户输入和后续用来初始化模型与推理上下文。

gpt_params params;

系统初始化函数：

llama_backend_init();
llama_numa_init(params.numa);

系统资源释放函数：

llama_backend_free();

创建模型和推理上下文：

llama_init_result llama_init = llama_init_from_gpt_params(params);

llama_model *model = llama_init.model;
llama_context *ctx = llama_init.context;

它声明在common.h中。如果你需要将模型和上下文分开创建可以使用llama.h中的另外两对函数：

llama_model_params model_params = llama_model_params_from_gpt_params(gpt_params_);
llama_model_ = llama_load_model_from_file(param.model.c_str(), model_params);

llama_context_params ctx_eval_params = llama_context_params_from_gpt_params(gpt_params_);
llama_context *ctx_eval = llama_new_context_with_model(llama_model_, ctx_eval_params);

创建ggml的线程池，这个过程可能和模型加速有关，代码中没有对它的详细解释：

struct ggml_threadpool * threadpool = ggml_threadpool_new(&tpp);

llama_attach_threadpool(ctx, threadpool, threadpool_batch);

除了完成一般的推理任务，llama.cpp还实现了上下文存储与读取。上下文切换的前提是不能换模型，且仅首次推理接收用户输入的prompt。利用这个特性，可以实现上下文的动态切换。

std::string path_session = params.path_prompt_cache;
std::vector session_tokens;

至此，有关系统初始化模块的过程已经完成。

(2) 用户输入

为了接收用户输入和推理输出，源码集中定义了几个变量：

std::vector embd_inp;

std::vector embd;

检查编码器，现代模型大多都没有明确定义的encodec

if (llama_model_has_encoder(model)) {
    int enc_input_size = embd_inp.size();
    llama_token * enc_input_buf = embd_inp.data();
    if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
        LOG_TEE("%s : failed to evaln", __func__);
        return 1;
    }
    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == -1) {
        decoder_start_token_id = llama_token_bos(model);
    }

    embd_inp.clear();
    embd_inp.push_back(decoder_start_token_id);
}

(3) 分析预测

分析预测部分的核心代码如下，我将处理关注力和session的逻辑删除，仅保留推理部分的逻辑。

// predict
if (!embd.empty()) {
    // Note: (n_ctx - 4) here is to match the logic for commandline prompt handling via
    // --prompt or --file which uses the same value.
    int max_embd_size = n_ctx - 4;

    // Ensure the input doesn't exceed the context size by truncating embd if necessary.
    if ((int) embd.size() > max_embd_size) {
        const int skipped_tokens = (int) embd.size() - max_embd_size;
        embd.resize(max_embd_size);

        console::set_display(console::error);
        printf(">", skipped_tokens, skipped_tokens != 1 ? "s" : "");
        console::set_display(console::reset);
        fflush(stdout);
    }

    for (int i = 0; i int) embd.size(); i += params.n_batch) {
        int n_eval = (int) embd.size() - i;
        if (n_eval > params.n_batch) {
            n_eval = params.n_batch;
        }

        LOG("eval: %sn", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());

        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
            LOG_TEE("%s : failed to evaln", __func__);
            return 1;
        }

        n_past += n_eval;

        LOG("n_past = %dn", n_past);
        // Display total tokens alongside total time
        if (params.n_print > 0 && n_past % params.n_print == 0) {
            LOG_TEE("n33[31mTokens consumed so far = %d / %d 33[0mn", n_past, n_ctx);
        }
    }
}

embd.clear();

逻辑的重点是：首先，如果推理的上下文长度超限，会丢弃超出部分。实际开发中可以考虑重构这个部分的逻辑。其次，每次推理都有一个处理数量限制（n_batch），这主要是为了当一次性输入的内容太多，系统不至于长时间无响应。最后，每次推理完成，embd都会被清理，推理完成后的信息会保存在ctx中。

(4) 推理采样

采样推理部分的源码分两个部分：

if ((int) embd_inp.size() is_interacting) {
    // optionally save the session on first sample (for faster prompt loading next time)
    if (!path_session.empty() && need_to_save_session && !params.prompt_cache_ro) {
        need_to_save_session = false;
        llama_state_save_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());

        LOG("saved session to %sn", path_session.c_str());
    }

    const llama_token id = llama_sampling_sample(ctx_sampling, ctx, ctx_guidance);

    llama_sampling_accept(ctx_sampling, ctx, id, /* apply_grammar= */ true);

    LOG("last: %sn", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());

    embd.push_back(id);

    // echo this to console
    input_echo = true;

    // decrement remaining sampling budget
    --n_remain;

    LOG("n_remain: %dn", n_remain);
} else {
    // some user input remains from prompt or interaction, forward it to processing
    LOG("embd_inp.size(): %d, n_consumed: %dn", (int) embd_inp.size(), n_consumed);
    while ((int) embd_inp.size() > n_consumed) {
        embd.push_back(embd_inp[n_consumed]);

        // push the prompt in the sampling context in order to apply repetition penalties later
        // for the prompt, we don't apply grammar rules
        llama_sampling_accept(ctx_sampling, ctx, embd_inp[n_consumed], /* apply_grammar= */ false);

        ++n_consumed;
        if ((int) embd.size() >= params.n_batch) {
            break;
        }
    }
}

首先要关注第2部分，这一段的逻辑是将用户的输入载入上下文中，由于用户的输入不需要推理，因此只需要调用llama_sampling_accept函数。第1部分只有当用户输入都完成以后才会进入，每次采样一个token，写进embd。这个过程和分析预测交替进行，直到遇到eos。

if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
    LOG("found an EOG tokenn");

    if (params.interactive) {
        if (params.enable_chat_template) {
            chat_add_and_format(model, chat_msgs, "assistant", assistant_ss.str());
        }
        is_interacting = true;
        printf("n");
    }
}

chat_add_and_format函数只负责将所有交互过程记录在char_msgs中，对整个推理过程没有影响。如果要实现用户输出，可以在这里处理。

二、关键函数

通过gpt_params初始化llama_model_params

struct llama_model_params     llama_model_params_from_gpt_params    (const gpt_params & params);

创建大模型指针

LLAMA_API struct llama_model * llama_load_model_from_file(
                             const char * path_model,
            struct llama_model_params     params);

创建ggml线程池和设置线程池

GGML_API struct ggml_threadpool*         ggml_threadpool_new          (struct ggml_threadpool_params  * params);
LLAMA_API void llama_attach_threadpool(
               struct   llama_context * ctx,
            ggml_threadpool_t   threadpool,
            ggml_threadpool_t   threadpool_batch);

通过gpt_params初始化llama_context_params

struct llama_context_params   llama_context_params_from_gpt_params  (const gpt_params & params);

LLAMA_API struct llama_context * llama_new_context_with_model(
                     struct llama_model * model,
            struct llama_context_params   params);

对输入进行分词并转换成token

std::vector llama_tokenize(
  const struct llama_context * ctx,
           const std::string & text,
                        bool   add_special,
                        bool   parse_special = false);

获取特殊token

LLAMA_API llama_token llama_token_bos(const struct llama_model * model); // beginning-of-sentence
LLAMA_API llama_token llama_token_eos(const struct llama_model * model); // end-of-sentence
LLAMA_API llama_token llama_token_cls(const struct llama_model * model); // classification
LLAMA_API llama_token llama_token_sep(const struct llama_model * model); // sentence separator
LLAMA_API llama_token llama_token_nl (const struct llama_model * model); // next-line
LLAMA_API llama_token llama_token_pad(const struct llama_model * model); // padding

批量处理token并进行预测

LLAMA_API struct llama_batch llama_batch_get_one(
                  llama_token * tokens,
                      int32_t   n_tokens,
                    llama_pos   pos_0,
                 llama_seq_id   seq_id);

LLAMA_API int32_t llama_decode(
            struct llama_context * ctx,
              struct llama_batch   batch);

执行采样和接收采样

llama_token llama_sampling_sample(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        struct llama_context * ctx_cfg,
        int idx = -1);

void llama_sampling_accept(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        llama_token id,
        bool apply_grammar);

将token转成自然语言

std::string llama_token_to_piece(
        const struct llama_context * ctx,
                       llama_token   token,
                       bool          special = true);

判断推理是否结束，注意，这个token可能和llama_token_eos获取的不一致。因此一定要通过这个函数判断

// Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.)
LLAMA_API bool llama_token_is_eog(const struct llama_model * model, llama_token token);

三、总结

本文旨在介绍llama.cpp的基础用法，由于Georgi Gerganov更新较快，且缺少文档。因此可能有些解释不够准确。如果大家对框架和本文敢兴趣可以给我留言深入讨论。

玄机博客

1.本站内容仅供参考，不作为任何法律依据。用户在使用本站内容时，应自行判断其真实性、准确性和完整性，并承担相应风险。

2.本站部分内容来源于互联网，仅用于交流学习研究知识，若侵犯了您的合法权益，请及时邮件或站内私信与本站联系，我们将尽快予以处理。

3.本文采用知识共享署名4.0国际许可协议 [BY-NC-SA] 进行授权

4.根据《计算机软件保护条例》第十七条规定“为了学习和研究软件内含的设计思想和原理，通过安装、显示、传输或者存储软件等方式使用软件的，可以不经软件著作权人许可，不向其支付报酬。”您需知晓本站所有内容资源均来源于网络，仅供用户交流学习与研究使用，版权归属原版权方所有，版权争议与本站无关，用户本人下载后不能用作商业或非法用途，需在24个小时之内从您的电脑中彻底删除上述内容，否则后果均由用户承担责任；如果您访问和下载此文件，表示您同意只将此文件用于参考、学习而非其他用途，否则一切后果请您自行承担，如果您喜欢该程序，请支持正版软件，购买注册，得到更好的正版服务。