gemm.h

Functions for matrix multiplication.

Functions

void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)

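Computes the matrix product of two matrices, optionally fused with other operations.

Computes:

D = AB if both bias and pre_gelu_out are empty tensors
D = AB + bias if pre_gelu_out is empty and bias is not empty
D = GELU(AB + bias) if both bias and pre_gelu_out are non-empty tensors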
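
To make the three cases concrete, here is a minimal CPU reference sketch in C++. It is not the library's implementation: the `Matrix` type, the `gelu` and `reference_gemm` helpers, the row-major layout, the one-bias-entry-per-output-column convention, and the exact erf-based GELU are all assumptions made for illustration. The list above does not define the combination of an empty bias with a non-empty pre_gelu_out, so the sketch rejects it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical row-major matrix used only in this sketch; unrelated to NVTETensor.
// A default-constructed Matrix models an "empty tensor".
struct Matrix {
  std::size_t rows = 0, cols = 0;
  std::vector<float> data;
  Matrix() = default;
  Matrix(std::size_t r, std::size_t c, std::vector<float> d)
      : rows(r), cols(c), data(std::move(d)) {
    assert(data.size() == rows * cols);
  }
  bool empty() const { return data.empty(); }
  float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Exact erf-based GELU (assumption: the library may use a different variant).
inline float gelu(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// CPU reference for the three documented cases.
// An empty `bias` vector and an empty `pre_gelu_out` stand in for empty tensors.
inline void reference_gemm(const Matrix& A, const Matrix& B, Matrix& D,
                           const std::vector<float>& bias, Matrix& pre_gelu_out) {
  assert(A.cols == B.rows && D.rows == A.rows && D.cols == B.cols);
  assert(bias.empty() || bias.size() == D.cols);
  // Empty bias with non-empty pre_gelu_out is not covered by the documented cases.
  assert(pre_gelu_out.empty() || !bias.empty());
  assert(pre_gelu_out.empty() ||
         (pre_gelu_out.rows == D.rows && pre_gelu_out.cols == D.cols));

  const bool apply_gelu = !pre_gelu_out.empty();
  for (std::size_t i = 0; i < D.rows; ++i) {
    for (std::size_t j = 0; j < D.cols; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < A.cols; ++k) acc += A.at(i, k) * B.at(k, j);
      if (!bias.empty()) acc += bias[j];   // case 2 and case 3
      if (apply_gelu) {                    // case 3
        pre_gelu_out.at(i, j) = acc;       // value before the activation
        D.at(i, j) = gelu(acc);
      } else {                             // case 1 and case 2
        D.at(i, j) = acc;
      }
    }
  }
}
```

With B set to the identity, an empty bias and an empty pre_gelu_out leave D equal to A. Passing a non-empty pre_gelu_out additionally stores AB + bias there before the activation is applied to D.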

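Parameters

A – [in] The A matrix.
B – [in] The B matrix.
D – [inout] Output matrix.
bias – [in] Bias tensor.
pre_gelu_out – [inout] Output matrix before GELU activation.
transa – [in] Whether the A matrix is transposed.
transb – [in] Whether the B matrix is transposed.
grad – [in] Whether this operation is part of the gradient computation.
workspace – [out] Workspace tensor.
accumulate – [in] Whether to accumulate the result into the D matrix.
use_split_accumulator – [in] Whether to use the split accumulator in the FP8 GEMM.
math_sm_count – [in] Number of GPU SMs to use (default = 0: use cuBLAS heuristics).
stream – [in] CUDA stream used for the operation.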
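
As a rough illustration of how the transa, transb, and accumulate flags interact, the following CPU sketch computes D = op(A) op(B), optionally added to the existing contents of D. It is only a mathematical model under stated assumptions: the `op_at` and `reference_gemm_flags` helpers are hypothetical, the buffers are plain row-major `std::vector<float>`, and the actual library follows cuBLAS storage conventions, which may map operands to memory differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reads element (r, c) of op(M), where M is stored row-major with `cols`
// columns and op(M) is the transpose of M when `trans` is true.
inline float op_at(const std::vector<float>& M, std::size_t cols, bool trans,
                   std::size_t r, std::size_t c) {
  return trans ? M[c * cols + r] : M[r * cols + c];
}

// D (m x n) = op(A) (m x k) * op(B) (k x n).
// When `accumulate` is true the product is added to the previous contents of D.
inline void reference_gemm_flags(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::vector<float>& D,
                                 std::size_t m, std::size_t n, std::size_t k,
                                 bool transa, bool transb, bool accumulate) {
  assert(A.size() == m * k && B.size() == k * n && D.size() == m * n);
  // Stored (pre-transpose) column counts: A is k x m if transposed, else m x k.
  const std::size_t a_cols = transa ? m : k;
  const std::size_t b_cols = transb ? k : n;
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = accumulate ? D[i * n + j] : 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += op_at(A, a_cols, transa, i, p) * op_at(B, b_cols, transb, p, j);
      }
      D[i * n + j] = acc;
    }
  }
}
```

In this model, accumulate = false overwrites D, while accumulate = true treats the incoming D as an additive term, which is why D is marked [inout].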

void nvte_cublas_atomic_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, int m_split, int n_split, bool gemm_producer, const NVTETensor counter, cudaStream_t stream)

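Computes the matrix product of two matrices with chunking and atomic counters.

Computes:

D = AB if both bias and pre_gelu_out are empty tensors
D = AB + bias if pre_gelu_out is empty and bias is not empty
D = GELU(AB + bias) if both bias and pre_gelu_out are non-empty tensors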

Warning

cuBLAS atomic GEMM uses a beta API and has not been tested for all use cases.

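Parameters

A – [in] The A matrix.
B – [in] The B matrix.
D – [inout] Output matrix.
bias – [in] Bias tensor.
pre_gelu_out – [inout] Output matrix before GELU activation.
transa – [in] Whether the A matrix is transposed.
transb – [in] Whether the B matrix is transposed.
grad – [in] Whether this operation is part of the gradient computation.
workspace – [out] Workspace tensor.
accumulate – [in] Whether to accumulate the result into the D matrix.
use_split_accumulator – [in] Whether to use the split accumulator in the FP8 GEMM.
math_sm_count – [in] Number of GPU SMs to use (default = 0: use cuBLAS heuristics).
m_split – [in] Number of chunks/splits along the m dimension for Atomic GEMM.
n_split – [in] Number of chunks/splits along the n dimension for Atomic GEMM.
gemm_producer – [in] Whether Atomic GEMM is the producer or the consumer.
counter – [inout] counter[chunk_i]=0 indicates that chunk_i has been produced.
stream – [in] CUDA stream used for the operation.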
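
The counter convention (counter[chunk_i]=0 means chunk_i has been produced) can be modelled on the host as follows. This is a single-threaded sketch of the handshake only, not the cuBLAS mechanism: `ChunkedOutput`, `produce_chunk`, and `try_consume_chunk` are hypothetical names, each chunk is reduced to a single float, and initializing the counters to 1 to mean "not yet produced" is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a chunked output guarded by counters.
struct ChunkedOutput {
  std::vector<float> data;                // one value per chunk, for brevity
  std::vector<std::atomic<int>> counter;  // counter[i] == 0: chunk i is produced
  explicit ChunkedOutput(std::size_t num_chunks)
      : data(num_chunks, 0.0f), counter(num_chunks) {
    // Assumption: any non-zero value marks a chunk as not yet produced.
    for (auto& c : counter) c.store(1);
  }
};

// Producer side: write chunk i, then publish it by clearing its counter.
inline void produce_chunk(ChunkedOutput& out, std::size_t i, float value) {
  out.data[i] = value;
  out.counter[i].store(0, std::memory_order_release);
}

// Consumer side: returns false while chunk i is unpublished; otherwise
// copies the chunk into `value` and returns true.
inline bool try_consume_chunk(const ChunkedOutput& out, std::size_t i, float& value) {
  if (out.counter[i].load(std::memory_order_acquire) != 0) return false;
  value = out.data[i];
  return true;
}
```

A consumer in this model would poll `try_consume_chunk` for each chunk it needs and start work on a chunk as soon as the call returns true, rather than waiting for the whole output to be written.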