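Quick Start Guide

This is the starting point for trying out TensorRT-LLM. Specifically, this Quick Start Guide helps you get set up quickly and send HTTP requests using TensorRT-LLM.

LLM API

The LLM API is a Python API designed to facilitate setup and inference with TensorRT-LLM directly within Python. It enables model optimization by simply specifying a Hugging Face repository name or a model checkpoint. The LLM API streamlines the process by managing checkpoint conversion, engine building, engine loading, and model inference through a single Python object.

Here is a simple example that shows how to use the LLM API with TinyLlama.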

from tensorrt_llm import LLM, SamplingParams


def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()

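You can also load quantized checkpoints published on Hugging Face by TensorRT Model Optimizer directly in the LLM constructor. To learn more about the LLM API, check out the API Introduction and llm-api-examples/index.

Deploy with trtllm-serve

You can use the trtllm-serve command to start an OpenAI-compatible server for interacting with a model. To start the server, run a command such as the following:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

After the server starts, you can access familiar OpenAI endpoints such as v1/chat/completions. You can run inference as in the following example: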

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32,
        "temperature": 0
    }'

Example output:

{
  "id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
  "object": "chat.completion",
  "created": 1741966075,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 43,
    "total_tokens": 69,
    "completion_tokens": 26
  }
}

For examples and command syntax, refer to the trtllm-serve section.
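
The curl request and the JSON response above can also be handled from Python. The following is a minimal sketch using only the standard library. The helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical and not part of TensorRT-LLM, and the host `localhost` is an assumption; only port 8000, the endpoint path, and the payload fields come from the example above.

```python
import json
import urllib.request

# Assumption: the server started by trtllm-serve is reachable on this host.
# Port 8000 and the endpoint path are taken from the curl example.
BASE_URL = "http://localhost:8000"


def build_chat_request(model, user_message,
                       system_message="You are a helpful assistant.",
                       max_tokens=32, temperature=0, base_url=BASE_URL):
    """Build (but do not send) a POST request for v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return (assistant text, usage dict) from a chat.completion response."""
    choice = response_body["choices"][0]
    return choice["message"]["content"], response_body["usage"]


def chat(model, user_message, timeout=60):
    """Send the request to a running server and return (text, usage)."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With a server running, `chat("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "Where is New York? Tell me in a single sentence.")` would return the assistant text together with the `usage` token counts shown in the example output.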

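Model Definition API

Prerequisites

This quick start uses the Meta Llama 3.1 model. This model is subject to a particular license. To download the model files, agree to the terms and authenticate with Hugging Face.
Complete the installation steps.
Pull the weights and tokenizer files for the chat-tuned variant of the Llama 3.1 8B model from the Hugging Face Hub: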
```
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
```

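Compile the Model into a TensorRT Engine

Use the Llama model definition from the examples/llama directory of the GitHub repository. The model definition is a minimal example that shows some of the optimizations available in TensorRT-LLM.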

# From the root of the cloned repository, start the TensorRT-LLM container
make -C docker release_run LOCAL_USER=1

# Log in to huggingface-cli
# You can get your token from huggingface.co/settings/token
huggingface-cli login --token *****

# Convert the model into TensorRT-LLM checkpoint format
cd examples/llama
pip install -r requirements.txt
pip install --upgrade transformers # Llama 3.1 requires transformers version 4.43.0 or later.
python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct --output_dir llama-3.1-8b-ckpt

# Compile model
trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
    --gemm_plugin float16 \
    --output_dir ./llama-3.1-8b-engine
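
The convert-then-build flow above can also be assembled from Python, which makes it explicit that the checkpoint directory written by the first step is the input of the second. This is only a sketch: the helper names are hypothetical, and it uses only the flags that appear in the commands above.

```python
import shlex


def conversion_command(model_dir, checkpoint_dir):
    """Argument list for the checkpoint conversion step (run from examples/llama)."""
    return ["python3", "convert_checkpoint.py",
            "--model_dir", model_dir,
            "--output_dir", checkpoint_dir]


def build_command(checkpoint_dir, engine_dir, gemm_plugin="float16"):
    """Argument list for the engine build step."""
    return ["trtllm-build",
            "--checkpoint_dir", checkpoint_dir,
            "--gemm_plugin", gemm_plugin,
            "--output_dir", engine_dir]


def pipeline(model_dir, checkpoint_dir, engine_dir):
    """Both steps in order; the checkpoint directory links them."""
    return [conversion_command(model_dir, checkpoint_dir),
            build_command(checkpoint_dir, engine_dir)]


if __name__ == "__main__":
    steps = pipeline("Meta-Llama-3.1-8B-Instruct",
                     "llama-3.1-8b-ckpt",
                     "./llama-3.1-8b-engine")
    for step in steps:
        # Print each command; to execute, pass the list to
        # subprocess.run(step, check=True) inside the container.
        print(shlex.join(step))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if a directory name contains spaces.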

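When you create a model definition with the TensorRT-LLM API, you build a graph of operations from NVIDIA TensorRT primitives that form the layers of your neural network. These operations map to specific kernels, which are prewritten programs for the GPU.

In this example, we included the gpt_attention plugin, which implements a FlashAttention-like fused attention kernel, and the gemm plugin, which performs matrix multiplication with FP32 accumulation. We also set the desired precision for the full model to FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantization, refer to the Llama example and the Numerical Precision section.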
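
To illustrate why accumulating in FP32 matters when the inputs are FP16, the sketch below emulates both accumulator widths in pure Python by rounding the running sum after every addition. It is a toy model of rounding behavior only and makes no claim about how the gemm plugin is implemented; the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total to the accumulator precision."""
    total = 0.0
    for value in values:
        total = round_fn(total + value)
    return total


# 10,000 FP16 inputs of roughly 0.001 each; the exact sum is close to 10.
values = [to_fp16(0.001)] * 10_000

fp16_total = accumulate(values, to_fp16)  # FP16 accumulator
fp32_total = accumulate(values, to_fp32)  # FP32 accumulator

print(f"FP16 accumulation: {fp16_total}")
print(f"FP32 accumulation: {fp32_total}")
```

With an FP16 accumulator, the running sum stops growing once the gap between neighboring representable values exceeds twice the addend, so the result falls well short of the true total. The FP32 accumulator stays close to it, which is the motivation for a wider accumulator even when the inputs are FP16.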

Run the Model

Now that you have the model engine, run the engine and perform inference.

python3 ../run.py --engine_dir ./llama-3.1-8b-engine  --max_output_len 100 --tokenizer_dir Meta-Llama-3.1-8B-Instruct --input_text "How do I count to nine in French?"

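Deploy with Triton Inference Server

To create a production-ready deployment of your LLM, use the Triton Inference Server backend for TensorRT-LLM. It uses the TensorRT-LLM C++ runtime for fast inference execution and includes optimizations such as in-flight batching and paged KV caching. Triton Inference Server with the TensorRT-LLM backend is available as a prebuilt container through NVIDIA NGC.

Clone the TensorRT-LLM backend repository: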

cd ..
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

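To deploy the model with Triton Inference Server, refer to the "End to end workflow to run llama 7b" in the TensorRT-LLM backend repository.

Next Steps

In this Quick Start Guide, you:

Saw an example of the LLM API
Learned how to deploy a model with trtllm-serve
Learned about the Model Definition API

For more examples, refer to:

examples/ for showcases of how to run a quick benchmark on the latest LLMs.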