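TensorRT-LLM Checkpoint

Overview

Early versions of TensorRT-LLM (before 0.8) were developed on a very aggressive timeline, and defining a unified workflow was not a priority for those releases. Now that TensorRT-LLM has reached a certain level of feature richness, the development team has decided to put more effort into unifying its APIs and workflow. This document describes the workflow built around the TensorRT-LLM checkpoint and the set of CLI tools used to generate checkpoints, build engines, and evaluate engines.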

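The workflow has three steps, in this order:

Convert weights from different source frameworks into a TensorRT-LLM checkpoint.
Build the TensorRT-LLM checkpoint into TensorRT engines with a unified build command.
Load the engines into the TensorRT-LLM model runner and evaluate them with different evaluation tasks.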

NeMo -------------
                  |
HuggingFace ------
                  |   convert                             build                    load
Modelopt ---------  ----------> TensorRT-LLM Checkpoint --------> TensorRT Engine ------> TensorRT-LLM ModelRunner
                  |
JAX --------------
                  |
DeepSpeed --------

Prepare the TensorRT-LLM Checkpoint

TensorRT-LLM aims to support different sources:

Trained models from NVIDIA NeMo, Microsoft DeepSpeed, and JAX
Quantized models from NVIDIA Modelopt
Popular models from HuggingFace

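TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

One config JSON file, which contains the model hyper-parameters.
One or more rank weight files, each containing a dictionary of tensors (weights). In a multi-GPU (multi-process) scenario, each rank loads its own file.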
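The directory layout described above can be checked with a few lines of Python. This is only a sketch: the helper name `missing_rank_files` is hypothetical, the file names `config.json` and `rank<N>.safetensors` are taken from the OPT example later in this document, and it assumes exactly one weight file per rank.

```python
import json
from pathlib import Path


def missing_rank_files(ckpt_dir):
    """Return the rank weight files that config.json implies but the directory lacks."""
    ckpt_dir = Path(ckpt_dir)
    config = json.loads((ckpt_dir / "config.json").read_text())
    # mapping.world_size defaults to 1 when the field is absent.
    world_size = config.get("mapping", {}).get("world_size", 1)
    # Assumption: one file per rank, named rank0.safetensors, rank1.safetensors, ...
    expected = [f"rank{rank}.safetensors" for rank in range(world_size)]
    return [name for name in expected if not (ckpt_dir / name).exists()]
```

An empty list means every rank has a file to load; any returned name points at a rank that would fail to find its weights.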

Config

Field	Type	Default Value
architecture	string	Required
dtype	string	Required
logits_dtype	string	'float32'
vocab_size	int	Required
max_position_embeddings	int	null
hidden_size	int	Required
num_hidden_layers	int	Required
num_attention_heads	int	Required
num_key_value_heads	int	num_attention_heads
hidden_act	string	Required
intermediate_size	int	null
norm_epsilon	float	1e-5
position_embedding_type	string	'learned_absolute'
mapping.world_size	int	1
mapping.tp_size	int	1
mapping.pp_size	int	1
quantization.quant_algo	str	null
quantization.kv_cache_quant_algo	str	null
quantization.group_size	int	64
quantization.has_zero_point	bool	False
quantization.pre_quant_scale	bool	False
quantization.exclude_modules	list	null

The dotted name mapping.world_size means that mapping is a dictionary containing the world_size sub-field:

{
    "architecture": "OPTForCausalLM",
    "mapping": {
        "world_size": 1
    }
}
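The dotted notation maps directly onto nested dictionary lookups. A minimal sketch follows; `get_field` is a hypothetical helper written for illustration, not part of TensorRT-LLM.

```python
def get_field(config, dotted_name, default=None):
    """Resolve a dotted field name such as 'mapping.world_size' in a nested dict."""
    node = config
    for key in dotted_name.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


config = {"architecture": "OPTForCausalLM", "mapping": {"world_size": 1}}
print(get_field(config, "mapping.world_size"))   # 1
print(get_field(config, "mapping.tp_size", 1))   # 1 (field absent, default used)
```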

Supported quantization algorithms:

W8A16
W4A16
W4A16_AWQ
W4A8_AWQ
W4A16_GPTQ
FP8
W8A8_SQ_PER_CHANNEL

Supported KV cache quantization algorithms:

FP8
INT8
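The two lists above can be turned into a simple check on the quantization part of a config. This is a sketch only: `check_quantization` is a hypothetical helper, and the sets simply mirror the names listed in this document, which may differ between TensorRT-LLM releases.

```python
# Names copied from the lists above; treat them as a snapshot, not an authoritative registry.
QUANT_ALGOS = {
    "W8A16", "W4A16", "W4A16_AWQ", "W4A8_AWQ",
    "W4A16_GPTQ", "FP8", "W8A8_SQ_PER_CHANNEL",
}
KV_CACHE_QUANT_ALGOS = {"FP8", "INT8"}


def check_quantization(quantization):
    """Return a list of problems found in a 'quantization' config dictionary."""
    problems = []
    algo = quantization.get("quant_algo")
    kv_algo = quantization.get("kv_cache_quant_algo")
    # null (None) is the default and means "not quantized".
    if algo is not None and algo not in QUANT_ALGOS:
        problems.append(f"unknown quant_algo: {algo}")
    if kv_algo is not None and kv_algo not in KV_CACHE_QUANT_ALGOS:
        problems.append(f"unknown kv_cache_quant_algo: {kv_algo}")
    return problems
```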

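The config fields are extensible: a model can add its own model-specific fields. For example, the OPT model has a do_layer_norm_before field.

The model-specific config fields are listed below:

Field	Type	Default Value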
OPT
do_layer_norm_before	bool	False

Falcon
bias	bool	True
new_decoder_architecture	bool	False
parallel_attention	bool	False
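One way to picture the extensible config is as a set of common defaults overlaid with per-architecture defaults and then with the values from the file. The sketch below is illustrative only: `with_defaults` is a hypothetical helper, the architecture key "FalconForCausalLM" is an assumed name that does not appear in this document, and only a few common defaults from the table are included.

```python
# A subset of the common defaults from the config table.
COMMON_DEFAULTS = {
    "logits_dtype": "float32",
    "norm_epsilon": 1e-5,
    "position_embedding_type": "learned_absolute",
}

# Model-specific defaults from the table above.
# "FalconForCausalLM" is an assumed architecture name used for illustration.
MODEL_DEFAULTS = {
    "OPTForCausalLM": {"do_layer_norm_before": False},
    "FalconForCausalLM": {
        "bias": True,
        "new_decoder_architecture": False,
        "parallel_attention": False,
    },
}


def with_defaults(config):
    """Overlay common defaults, then model-specific defaults, then the user's values."""
    merged = dict(COMMON_DEFAULTS)
    merged.update(MODEL_DEFAULTS.get(config.get("architecture"), {}))
    merged.update(config)  # explicit values in the config always win
    return merged
```

Calling `with_defaults({"architecture": "OPTForCausalLM", "do_layer_norm_before": True})` keeps the explicit `True` while filling in `logits_dtype` and the other common defaults.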

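Rank Weights

As in PyTorch, a tensor (weight) name is a string that encodes hierarchical information and maps uniquely to one parameter of a TensorRT-LLM model.

For example, each transformer layer of the OPT model contains one Attention layer, one MLP layer, and two LayerNorm layers.

Attention Weights

The Attention layer contains two Linear layers, qkv and dense; each Linear layer has one weight and one bias. That gives four tensors (weights) in total, named: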

transformer.layers.0.attention.qkv.weight
transformer.layers.0.attention.qkv.bias
transformer.layers.0.attention.dense.weight
transformer.layers.0.attention.dense.bias

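Here transformer.layers.0.attention is the prefix name, indicating that the weights and biases belong to the Attention module of the 0th transformer layer.

MLP Weights

The MLP layer also contains two Linear layers, fc and proj; each Linear layer has one weight and one bias. That gives four tensors (weights) in total, named: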

transformer.layers.0.mlp.fc.weight
transformer.layers.0.mlp.fc.bias
transformer.layers.0.mlp.proj.weight
transformer.layers.0.mlp.proj.bias

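Here transformer.layers.0.mlp is the prefix name, indicating that the weights and biases belong to the MLP module of the 0th transformer layer.

LayerNorm Weights

There are two LayerNorm layers, input_layernorm and post_layernorm, each with one weight and one bias. That gives four tensors (weights) in total, named: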

transformer.layers.0.input_layernorm.weight
transformer.layers.0.input_layernorm.bias
transformer.layers.0.post_layernorm.weight
transformer.layers.0.post_layernorm.bias

Here transformer.layers.0.input_layernorm and transformer.layers.0.post_layernorm are the prefix names of the two layernorm modules.
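The naming scheme for one unquantized OPT layer can be reproduced programmatically. The sketch below uses a hypothetical helper, `opt_layer_tensor_names`, and covers only the twelve names listed in the three subsections above.

```python
def opt_layer_tensor_names(layer_idx):
    """Build the weight and bias names for one OPT transformer layer."""
    prefix = f"transformer.layers.{layer_idx}"
    modules = [
        "attention.qkv", "attention.dense",    # Attention: two Linear layers
        "mlp.fc", "mlp.proj",                  # MLP: two Linear layers
        "input_layernorm", "post_layernorm",   # two LayerNorm layers
    ]
    return [f"{prefix}.{module}.{param}"
            for module in modules
            for param in ("weight", "bias")]


names = opt_layer_tensor_names(0)
print(len(names))   # 12
print(names[0])     # transformer.layers.0.attention.qkv.weight
```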

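KV Cache Quantization Scaling Factors

A quantized model contains additional tensors, and which ones depends on the quantization method applied. For example, if the KV cache is quantized, the Attention layer has this extra scaling factor: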

transformer.layers.0.attention.kv_cache_scaling_factor

FP8 Quantization Scaling Factors

Here are the FP8 scaling factors of the attention.qkv linear layer:

transformer.layers.0.attention.qkv.activation_scaling_factor
transformer.layers.0.attention.qkv.weights_scaling_factor

AWQ Quantization Scaling Factors

Here are the AWQ scaling factors of the mlp.fc linear layer:

transformer.layers.0.mlp.fc.weights_scaling_factor
transformer.layers.0.mlp.fc.prequant_scaling_factor
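Because scaling factors follow the same hierarchical naming as ordinary weights, they can be picked out of a tensor-name list by their last component. The sketch below is an illustration only: `group_scaling_factors` is a hypothetical helper, and the suffix tuple covers just the four scaling-factor names shown in this document.

```python
# The four scaling-factor leaf names that appear in this document.
SCALE_SUFFIXES = (
    "kv_cache_scaling_factor",
    "activation_scaling_factor",
    "weights_scaling_factor",
    "prequant_scaling_factor",
)


def group_scaling_factors(tensor_names):
    """Map each module prefix to the scaling-factor tensors stored under it."""
    grouped = {}
    for name in tensor_names:
        module, _, leaf = name.rpartition(".")
        if leaf in SCALE_SUFFIXES:
            grouped.setdefault(module, []).append(leaf)
    return grouped
```

Applied to the AWQ names above, this returns a single entry keyed by `transformer.layers.0.mlp.fc` with the two scaling factors as its value.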

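Note

Linear weights in a TensorRT-LLM checkpoint always have the shape (out_feature, in_feature), whereas some quantized linear layers in TensorRT-LLM that are implemented by plugins may use the shape (in_feature, out_feature). The trtllm-build command adds a transpose operation as a post-processing step.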
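To make the two layouts concrete, here is a tiny pure-Python illustration of the transpose on a weight with out_feature = 2 and in_feature = 3. It only shows the change of layout; how trtllm-build performs the operation internally is not described in this document.

```python
def transpose(matrix):
    """Swap rows and columns of a 2-D list."""
    return [list(column) for column in zip(*matrix)]


# Checkpoint layout: (out_feature, in_feature) = (2, 3)
weight = [
    [1, 2, 3],
    [4, 5, 6],
]

# Plugin layout: (in_feature, out_feature) = (3, 2)
plugin_weight = transpose(weight)
print(plugin_weight)  # [[1, 4], [2, 5], [3, 6]]
```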

Example

Let's take OPT as an example and deploy the model with a tensor parallelism of 2:

cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --tp_size 2 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/

Here is the checkpoint directory:

./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors
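To see which tensors a rank file holds without loading any weights, the file header can be read with the standard library alone. This sketch relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header), which is an assumption brought in from outside this document; `list_tensors` is a hypothetical helper.

```python
import json
import struct


def list_tensors(path):
    """Return {tensor_name: (dtype, shape)} from a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}
```

For instance, `list_tensors("./opt/125M/trt_ckpt/fp16/2-gpu/rank0.safetensors")` would list the names stored for rank 0, which should follow the naming scheme described in the Rank Weights section.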

Here is the config.json:

{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "do_layer_norm_before": true
}
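A config like this one can be sanity-checked before building. The sketch below uses a hypothetical helper, `check_config`; the required-field list comes from the config table earlier in this document, while the rule that world_size equals tp_size times pp_size is an assumption about how the mapping fields relate, not something this document states.

```python
REQUIRED_FIELDS = (
    "architecture", "dtype", "vocab_size", "hidden_size",
    "num_hidden_layers", "num_attention_heads", "hidden_act",
)


def check_config(config):
    """Return a list of problems found in a checkpoint config dictionary."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in config]
    mapping = config.get("mapping", {})
    world_size = mapping.get("world_size", 1)
    tp_size = mapping.get("tp_size", 1)
    pp_size = mapping.get("pp_size", 1)
    # Assumption: every rank belongs to exactly one (tp, pp) coordinate.
    if world_size != tp_size * pp_size:
        problems.append(
            f"world_size {world_size} != tp_size {tp_size} * pp_size {pp_size}")
    return problems
```

For the OPT config above, with world_size 2, tp_size 2, and pp_size left at its default of 1, the function returns an empty list.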

Build the Checkpoint into a TensorRT Engine

TensorRT-LLM provides a unified build command: trtllm-build. Before using it, you may need to add it to your PATH.

export PATH=/usr/local/bin:$PATH

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --gemm_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_seq_len 1024 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/
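The two length flags in the command above are related. Assuming max_seq_len bounds the total number of tokens per sequence (prompt plus generated output), which is an interpretation rather than something this document states, the room left for generation after a maximum-length prompt can be computed as follows. `generation_budget` is a hypothetical helper.

```python
def generation_budget(max_input_len, max_seq_len):
    """Tokens left for generation when the prompt uses the full max_input_len."""
    if max_input_len > max_seq_len:
        raise ValueError("max_input_len must not exceed max_seq_len")
    return max_seq_len - max_input_len


# Values from the trtllm-build command above.
print(generation_budget(924, 1024))  # 100
```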

Run the Evaluation

mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-125m \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=14