TensorRT-LLM Build Workflow

Overview

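The build workflow consists of two major steps:

1. Create a TensorRT-LLM model from an existing model checkpoint exported by a training framework.
2. Build the TensorRT-LLM model into a TensorRT-LLM engine.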

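To generalize TensorRT-LLM optimization features to all models, and to share the same workflow across different models for TensorRT-LLM users, TensorRT-LLM has conventions for how models are defined and how they are imported.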

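The TensorRT-LLM checkpoint convention is documented in TensorRT-LLM Checkpoint, and all decoder-only models have been migrated to adopt it. Model-specific convert_checkpoint.py scripts are shipped as source code in the examples directory, and a trtllm-build CLI tool has been added. However, shipping the checkpoint conversion scripts as examples outside the core TensorRT-LLM library has some disadvantages: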

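- TensorRT-LLM evolves so quickly that a model's definition code may change for better performance, which leaves its convert_checkpoint.py script out of date.
- TensorRT-LLM is creating a new set of high-level APIs that handle model conversion, engine building, and inference in one class for ease of use. These high-level APIs need to call the weight conversion code, so that code should be part of the TensorRT-LLM core library rather than the examples. The conversion code for different models should also expose the same interface, so that the high-level APIs do not need ad-hoc code for each model.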

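To mitigate these issues, the model-specific convert_checkpoint.py scripts are being refactored. Most of the conversion code will move into the core library, next to the model definition; see tensorrt_llm/models/llama/ for an example. A new set of APIs handles importing models and converting weights. The 0.9 release refactored the LLaMA model class to adopt the new APIs, and refactoring of the other models is in progress.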

Conversion APIs

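The weight conversion API for the LLaMA model is shown below. A TopModelMixin class is introduced that declares the from_hugging_face() interface. The LLaMAForCausalLM class inherits from TopModelMixin (not as a direct parent, but somewhere in its base class hierarchy) and implements the interface.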

class TopModelMixin:
    @classmethod
    def from_hugging_face(cls,
                          hf_model_dir: str,
                          dtype: Optional[str] = 'float16',
                          mapping: Optional[Mapping] = None,
                          **kwargs):
        raise NotImplementedError("Subclass shall override this")

# TopModelMixin is part of the base class hierarchy
class LLaMAForCausalLM(DecoderModelForCausalLM):
    @classmethod
    def from_hugging_face(cls,
                          hf_model_dir: str,
                          dtype: Optional[str] = 'float16',
                          mapping: Optional[Mapping] = None) -> 'LLaMAForCausalLM':
        # Create a TensorRT-LLM LLaMA model object
        # Convert the Hugging Face checkpoint to the weights dict TensorRT-LLM expects
        # Load the weights into the LLaMA model object
        ...

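With this in place, the logic in the convert_checkpoint.py script under examples/llama/ in the GitHub repository can be greatly simplified. Even if the model definition code of the TensorRT-LLM LLaMA class changes for some reason, the from_hugging_face API stays the same, so existing workflows that use this interface are not affected.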

# Other arguments are omitted here for simplicity.
llama = LLaMAForCausalLM.from_hugging_face(model_dir, dtype, mapping=mapping)
llama.save_checkpoint(output_dir, save_config=(rank==0))

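The from_hugging_face API deliberately does not save the checkpoint to disk; it returns an in-memory object instead. Call save_checkpoint to save the model. This keeps the API flexible and makes a convert->build flow within a single process faster, because saving and loading large models through disk is typically slow and is best avoided.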

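Because LLaMA models are also released in other formats, such as the Meta checkpoint, the LLaMAForCausalLM class has a from_meta_ckpt function. This function is LLaMA-specific, so it is not declared in the TopModelMixin class and other models do not use it.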

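In the 0.9 release, only LLaMA is refactored. Because the popular LLaMA models (and their variants) are released in the Hugging Face and Meta checkpoint formats, only these two functions are implemented.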

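Future releases might add from_jax, from_nemo, from_keras, or other factory methods for different training checkpoints. For example, the Gemma 2B model and the convert_checkpoint.py file in the examples/gemma directory support the JAX and Keras formats in addition to Hugging Face. Model developers can choose to implement any subset of these factory methods for the models they contribute to TensorRT-LLM.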
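
The factory-method convention described above can be illustrated with a toy sketch that does not import TensorRT-LLM. ToyModelMixin, ToyLlama, ToyOtherModel, and available_factories are hypothetical names invented for this illustration; only the method names from_hugging_face and from_meta_ckpt come from the text, and no real weight conversion happens here.

```python
class ToyModelMixin:
    """Declares the shared factory interface; subclasses override what they support."""

    @classmethod
    def from_hugging_face(cls, hf_model_dir, dtype="float16", **kwargs):
        raise NotImplementedError("Subclass shall override this")


class ToyLlama(ToyModelMixin):
    def __init__(self, source, dtype):
        self.source = source
        self.dtype = dtype

    @classmethod
    def from_hugging_face(cls, hf_model_dir, dtype="float16", **kwargs):
        # A real implementation would convert weights; this only records the origin.
        return cls(source=f"hf:{hf_model_dir}", dtype=dtype)

    @classmethod
    def from_meta_ckpt(cls, meta_ckpt_dir, dtype="float16"):
        # Model-specific factory: intentionally not declared on the mixin.
        return cls(source=f"meta:{meta_ckpt_dir}", dtype=dtype)


class ToyOtherModel(ToyModelMixin):
    """A model that has not implemented any factory yet."""


def available_factories(model_cls):
    """List the from_* factories that model_cls actually implements."""
    names = []
    for name in dir(model_cls):
        if not name.startswith("from_"):
            continue
        impl = getattr(model_cls, name).__func__
        declared = getattr(ToyModelMixin, name, None)
        # Keep the name if it is model-specific or overrides the mixin's stub.
        if declared is None or impl is not declared.__func__:
            names.append(name)
    return names
```

A caller can use a helper like available_factories to decide which checkpoint formats a given model class accepts before attempting a conversion, rather than catching NotImplementedError after the fact.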

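For formats that the TensorRT-LLM model developers do not support, you are still free to implement your own weight conversion outside the core library. The flow looks like this: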

config = read_config_from_the_custom_training_checkpoint(model_dir)
llama = LLaMAForCausalLM(config)

# option 1:
# Create a weights dict, then call LLaMAForCausalLM.load
weights_dict = convert_weights_from_custom_training_checkpoint(model_dir)
llama.load(weights_dict)

# option 2:
# Internally assign the model parameters directly
convert_and_load_weights_into_trtllm_llama(llama, model_dir)
# Use the llama object as usual, to save the checkpoint or build engines

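Such custom weight loading has limitations and pitfalls, however. If the model definition lives inside the TensorRT-LLM core library while the weight loading and conversion code lives outside it, the conversion code might need to be updated whenever a new TensorRT-LLM version is released.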
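
As a rough illustration of what a custom conversion step such as convert_weights_from_custom_training_checkpoint might involve, the sketch below renames the keys of a weights dict with a small rule table. The rule table, the source key names, and the target key names are all placeholders made up for this example; they are not the real TensorRT-LLM weight naming scheme, and real conversions usually also reshape, split, or fuse tensors.

```python
import re

# Hypothetical rename rules: (pattern in the custom checkpoint, target name).
# These names are placeholders, not the real TensorRT-LLM naming scheme.
RENAME_RULES = [
    (re.compile(r"^decoder\.block\.(\d+)\.attn\.(weight|bias)$"), r"layers.\1.attention.\2"),
    (re.compile(r"^decoder\.block\.(\d+)\.ffn\.(weight|bias)$"), r"layers.\1.mlp.\2"),
    (re.compile(r"^embed\.weight$"), "vocab_embedding.weight"),
]


def convert_weight_names(custom_weights):
    """Return a new dict keyed by target names; fail loudly on unknown keys."""
    converted = {}
    for name, tensor in custom_weights.items():
        for pattern, replacement in RENAME_RULES:
            if pattern.match(name):
                converted[pattern.sub(replacement, name)] = tensor
                break
        else:
            # No rule matched: surface the mismatch instead of dropping the weight.
            raise KeyError(f"No conversion rule for weight {name!r}")
    return converted
```

Raising on an unmatched key is a deliberate choice in this sketch: when a new release changes the model definition, a stale rule table then fails immediately instead of silently producing an incomplete weights dict.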

Quantization APIs

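TensorRT-LLM relies on the NVIDIA Modelopt toolkit to support some quantization methods, such as FP8, W4A16_AWQ, and W4A8_AWQ. It also has its own quantization implementations for SmoothQuant, INT8 KV cache, and INT4/INT8 weight-only quantization.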

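In the TensorRT-LLM 0.8 release:

- For quantization algorithms supported by Modelopt, a standalone script, examples/quantization/quantize.py, exports TensorRT-LLM checkpoints, and the trtllm-build command must then be run to build the checkpoints into engines.
- For quantization algorithms not supported by Modelopt, users need to use the per-model convert_checkpoint.py scripts to export TensorRT-LLM checkpoints.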

A quantize() interface is introduced to unify these different quantization flows. A default implementation has been added to the PretrainedModel class.

class PretrainedModel:
    @classmethod
    def quantize(
        cls,
        hf_model_dir,
        output_dir,
        quant_config: QuantConfig,
        mapping: Optional[Mapping] = None):  # some args are omitted here
        # Internally quantize the given Hugging Face model using Modelopt
        # and save the checkpoint to output_dir
        ...

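The default implementation handles only the quantization methods that Modelopt supports. The LLaMA class then inherits from PretrainedModel and dispatches Modelopt quantization to the superclass's default implementation. If Modelopt does not yet support a new model, the model developer raises an error in the subclass implementation.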

# PretrainedModel is part of the base class hierarchy
class LLaMAForCausalLM(DecoderModelForCausalLM):
    @classmethod
    def quantize(
        cls,
        hf_model_dir,
        output_dir,
        quant_config: QuantConfig,
        mapping: Optional[Mapping] = None):  # some args are omitted here
        use_modelopt_quantization = ...  # decide whether to use Modelopt or the native path
        if use_modelopt_quantization:
            super().quantize(hf_model_dir,
                             output_dir,
                             quant_config)
        else:
            # Handle TensorRT-LLM native, model-specific quantization,
            # or raise an exception if it is not supported
            ...
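
The dispatch between the default path and the model-specific path can be tried out with a small stand-alone sketch. ToyPretrainedModel and ToyQuantLlama are hypothetical classes, and the algorithm labels in NATIVE_ALGOS are strings invented for this example; the real quantize() takes a QuantConfig rather than a plain string, and nothing is actually quantized here.

```python
class ToyPretrainedModel:
    # Algorithms the default (toolkit-backed) path is assumed to handle here.
    TOOLKIT_ALGOS = {"FP8", "W4A16_AWQ", "W4A8_AWQ"}

    @classmethod
    def quantize(cls, model_dir, output_dir, quant_algo):
        if quant_algo not in cls.TOOLKIT_ALGOS:
            raise NotImplementedError(f"{quant_algo} is not handled by the default path")
        return f"toolkit:{quant_algo}"


class ToyQuantLlama(ToyPretrainedModel):
    # Hypothetical labels for the model-specific (native) implementations.
    NATIVE_ALGOS = {"SMOOTH_QUANT", "INT8_WEIGHT_ONLY", "INT4_WEIGHT_ONLY"}

    @classmethod
    def quantize(cls, model_dir, output_dir, quant_algo):
        if quant_algo in cls.TOOLKIT_ALGOS:
            # Delegate to the superclass's default implementation.
            return super().quantize(model_dir, output_dir, quant_algo)
        if quant_algo in cls.NATIVE_ALGOS:
            # Model-specific path; a real implementation would do the work here.
            return f"native:{quant_algo}"
        raise ValueError(f"Unsupported quantization algorithm: {quant_algo}")
```

Callers see a single entry point regardless of which path runs, which is the point of unifying the flows behind one quantize() method.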

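The quantize API is designed to take multi-GPU resources internally for quantization. For example, LLaMA 70B in BF16 takes 140 GB of memory, and FP8 quantization needs another 70 GB. At least 210 GB is therefore required, which means 4 x A100 (or H100) GPUs to quantize the LLaMA 70B model. If you call the quantize API from an MPI program, take care that it is called by rank 0 only.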
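
The memory figures above follow from the parameter count and the bytes per parameter. The sketch below reproduces the arithmetic for the weights alone; it assumes 1 GB = 1e9 bytes and 80 GB of device memory per GPU (an assumption, since the text does not state the GPU variant), and it ignores activations and calibration overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9                          # LLaMA 70B
bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # BF16: 2 bytes per parameter
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # FP8: 1 byte per parameter
total_gb = bf16_gb + fp8_gb                # both copies are resident during quantization

GPU_MEMORY_GB = 80                         # assumption: 80 GB per device
NUM_GPUS = 4
fits = NUM_GPUS * GPU_MEMORY_GB >= total_gb
```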

Using the quantize API in an MPI program looks like the following, where only rank 0 calls it. In a non-MPI program, the "if rank == 0" check and the mpi_barrier() call are not needed.

quant_config = QuantConfig()
quant_config.quant_algo = quant_mode.W4A16_AWQ
mapping = Mapping(world_size=tp_size, tp_size=tp_size)
if rank == 0:
    LLaMAForCausalLM.quantize(hf_model_dir,
                              checkpoint_dir,
                              quant_config=quant_config)
mpi_barrier()  # wait for rank 0 to finish the quantization
llama = LLaMAForCausalLM.from_checkpoint(checkpoint_dir, rank)
engine = build(llama, build_config)
engine.save(engine_dir)
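
The rank-0-then-barrier pattern can be exercised without MPI or GPUs. The sketch below simulates ranks with threads and uses threading.Barrier in place of mpi_barrier(); toy_quantize, run_rank, simulate, and the contents of the toy config.json are all invented for this illustration and do not reflect what the real quantize API writes.

```python
import json
import tempfile
import threading
from pathlib import Path

WORLD_SIZE = 4


def toy_quantize(checkpoint_dir):
    """Stand-in for the expensive quantization step; writes a tiny file."""
    config = {"quant_algo": "W4A16_AWQ"}
    (Path(checkpoint_dir) / "config.json").write_text(json.dumps(config))


def run_rank(rank, checkpoint_dir, barrier, results, calls):
    if rank == 0:
        toy_quantize(checkpoint_dir)
        calls.append(rank)  # record which rank ran the quantization
    barrier.wait()  # no rank proceeds until rank 0 has finished writing
    # Every rank, including rank 0, now loads the shared checkpoint.
    text = (Path(checkpoint_dir) / "config.json").read_text()
    results[rank] = json.loads(text)


def simulate(world_size=WORLD_SIZE):
    barrier = threading.Barrier(world_size)
    results, calls = {}, []
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        threads = [
            threading.Thread(target=run_rank,
                             args=(r, checkpoint_dir, barrier, results, calls))
            for r in range(world_size)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results, calls
```

Because rank 0 reaches the barrier only after writing, the other ranks cannot read a missing or partially written checkpoint, which is the role mpi_barrier() plays in the MPI version.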

examples/quantization/quantize.py is kept for backward compatibility.

Build APIs

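The tensorrt_llm.build API builds a TensorRT-LLM model object into a TensorRT-LLM engine. This new API replaces the older flow of creating a builder, creating a network object, tracing the model into the network, and building the TensorRT engine. The API is used as follows: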

llama = ... # create LLaMAForCausalLM object
build_config = BuildConfig(max_batch_size=1)
engine = tensorrt_llm.build(llama, build_config)
engine.save(engine_dir)

The LLaMA object can be created by any of the methods described in the Conversion APIs or Quantization APIs sections.

The trtllm-build CLI tool is a thin wrapper around the tensorrt_llm.build API. The flags of the CLI tool closely mirror the fields of the BuildConfig class.
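
One way to keep CLI flags aligned with a config class is to generate the flags from the class's fields. The sketch below shows that idea with a hypothetical ToyBuildConfig dataclass and a hypothetical toy-build parser; it is not how trtllm-build is implemented, and apart from max_batch_size the field names are illustrative only.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ToyBuildConfig:
    max_batch_size: int = 1
    max_input_len: int = 1024      # illustrative field
    output_dir: str = "engine_out" # illustrative field


def make_parser():
    """Create one --flag per dataclass field, reusing its type and default."""
    parser = argparse.ArgumentParser(prog="toy-build")
    for f in fields(ToyBuildConfig):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


def parse_build_config(argv):
    """Parse command-line arguments straight into a ToyBuildConfig."""
    args = make_parser().parse_args(argv)
    return ToyBuildConfig(**vars(args))
```

For example, parse_build_config(["--max_batch_size", "8"]) yields a config whose max_batch_size is 8 while the other fields keep their defaults. Deriving the flags from the fields means a new config field shows up on the command line without separate parser code.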

If a model is saved to disk first and built into an engine later, TensorRT-LLM provides a from_checkpoint API to deserialize the checkpoint.

# TensorRT-LLM code
class PretrainedModel:
    @classmethod
    def from_checkpoint(cls,
                        ckpt_dir: str,
                        rank: int = 0,
                        config: Optional[PretrainedConfig] = None):
        # Internally load the model weights from the given checkpoint directory
        ...

Call the from_checkpoint API to deserialize the checkpoint into a model object, then call the tensorrt_llm.build API to build the engine.

llama = LLaMAForCausalLM.from_checkpoint(checkpoint_dir)
engine = build(llama, build_config)
engine.save(engine_dir)

CLI Tools

For convenience, all of the weight conversion, quantization, and build APIs described above have corresponding CLI tools:

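- Model-specific convert_checkpoint.py scripts are in the examples/<model xxx>/ folder.
- A unified quantization script is in examples/quantization/quantize.py and can be shared by all **supported** models.
- The trtllm-build CLI tool builds all models from TensorRT-LLM checkpoints.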

Keep the following in mind when using the CLI tools:

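- These scripts and tools are meant to be used for scripting. Do not import the Python functions or classes defined in them. TensorRT-LLM does not guarantee that the contents of these scripts stay compatible with previous versions, and the options of these tools may also change when that cannot be avoided.
- The scripts in the examples folder may use internal or unstable TensorRT-LLM APIs, which are not guaranteed to work if the version of the examples and the installed version of TensorRT-LLM do not match. Several GitHub issues were caused by such version mismatches:
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1293
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1252
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1079
- Always install the same TensorRT-LLM version that is specified in examples/<model xxx>/requirements.txt.
- In the future, the per-model conversion scripts may or may not be unified into a single script shared by all models, given that the attributes of different models can differ inherently. However, the TensorRT-LLM team will try to keep the flags for the same feature consistent across the different scripts.
- The TensorRT-LLM team encourages use of the new low-level conversion, quantization, and build APIs instead of these scripts. The conversion APIs will be added gradually, model by model, which may span several releases.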
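
The version-matching advice in the notes above can be checked mechanically before running an example script. The sketch below parses a requirements file for an exact "==" pin and compares it with an installed version string; pinned_version and versions_match are hypothetical helpers written for this illustration, and the parser handles only simple "name==version" lines, not the full requirements syntax.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for package, or None if not pinned."""
    # Treat '-' and '_' as equivalent when comparing package names.
    wanted = package.lower().replace("-", "_")
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"^([A-Za-z0-9_.\-]+)\s*==\s*([^\s;]+)", line)
        if match and match.group(1).lower().replace("-", "_") == wanted:
            return match.group(2)
    return None


def versions_match(requirements_text, package, installed_version):
    """True only if the package is pinned and the pin equals installed_version."""
    pinned = pinned_version(requirements_text, package)
    return pinned is not None and pinned == installed_version
```

In practice the installed version string could come from a call such as importlib.metadata.version, and the requirements text from the requirements.txt file of the example being run; both are left to the caller here so the helpers stay free of environment dependencies.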