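Basics

Initialization

The first time a Warp function such as wp.launch() is called, Warp initializes itself and prints startup information about the available compute devices, the driver versions, and the location of any generated kernel code, for example: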

Warp 1.2.0 initialized:
    CUDA Toolkit 12.5, Driver 12.5
    Devices:
      "cpu"      : "x86_64"
      "cuda:0"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
      "cuda:1"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
    CUDA peer access:
      Supported fully (all-directional)
    Kernel cache:
      /home/nvidia/.cache/warp/1.2.0

Warp can also be initialized explicitly with wp.init():

import warp as wp

wp.init()

Kernels

In Warp, compute kernels are defined as Python functions and annotated with the @wp.kernel decorator:

import warp as wp

@wp.kernel
def simple_kernel(a: wp.array(dtype=wp.vec3),
                  b: wp.array(dtype=wp.vec3),
                  c: wp.array(dtype=float)):

    # get thread index
    tid = wp.tid()

    # load two vec3s
    x = a[tid]
    y = b[tid]

    # compute the dot product between vectors
    r = wp.dot(x, y)

    # write result back to memory
    c[tid] = r

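Conceptually, Warp kernels are similar to CUDA kernels. When a kernel is launched on a GPU, its body is executed in parallel, once for each thread requested at launch.

Because Warp kernels are compiled to native C++/CUDA code, all function input arguments must be statically typed. This allows Warp to generate fast code that executes at essentially native speeds. Because kernels may run on either the CPU or the GPU, they cannot access arbitrary global state from the Python environment. Instead, they must read and write data through their input parameters, such as arrays.

Warp kernels have a one-to-one correspondence with CUDA kernels. To launch a kernel with 1024 threads, use wp.launch():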

wp.launch(kernel=simple_kernel, # kernel to launch
          dim=1024,             # number of threads
          inputs=[a, b, c],     # parameters
          device="cuda")        # execution device

Inside a kernel, the wp.tid() built-in function returns the index of the current thread:

# get thread index
i = wp.tid()

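The Kernel Reference contains a complete list of the built-in functions that can be called from Warp kernels.

Kernels can be launched with 1D, 2D, 3D, or 4D grids of threads. To launch a 2D grid of threads that processes a 1024x1024 image, we can write: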

wp.launch(kernel=compute_image, dim=(1024, 1024), inputs=[img], device="cuda")

Inside the kernel, the 2D thread indices are retrieved by unpacking the result of wp.tid() with multiple assignment:

@wp.kernel
def compute_image(pixel_data: wp.array2d(dtype=wp.vec3)):
    # get thread index
    i, j = wp.tid()

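Arrays

Memory allocations are exposed through the wp.array type. Arrays are fixed-size, multidimensional containers that can store homogeneous elements of any Warp data type in either host (CPU) or device (GPU) memory. Every array has an associated data type, which is either a scalar data type (e.g. wp.float32, wp.int32) or a composite data type (e.g. wp.vec3, wp.mat33). The Data Types section lists all of Warp's built-in data types.

Arrays can be allocated in much the same way as in NumPy and PyTorch: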

import numpy as np
import warp as wp

n = 1024  # number of elements (illustrative value)

# allocate an uninitialized array of vec3s
v = wp.empty(shape=n, dtype=wp.vec3, device="cuda")

# allocate a zero-initialized array of quaternions
q = wp.zeros(shape=n, dtype=wp.quat, device="cuda")

# allocate and initialize an array from a NumPy array
# will be automatically transferred to the specified device
a = np.ones((10, 3), dtype=np.float32)
v = wp.from_numpy(a, dtype=wp.vec3, device="cuda")

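Arrays cannot be created inside Warp kernels.

Arrays of up to four dimensions are supported. The aliases wp.array2d, wp.array3d, and wp.array4d are useful when typing kernel arguments: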

@wp.kernel
def make_field(field: wp.array3d(dtype=float), center: wp.vec3, radius: float):
    i, j, k = wp.tid()

    p = wp.vec3(float(i), float(j), float(k))

    d = wp.length(p - center) - radius

    field[i, j, k] = d

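By default, a Warp array initialized from external data (e.g. NumPy arrays, lists, tuples) creates a copy of the data in new memory on the specified device. However, if the input is contiguous and already resides on the same device, the copy=False argument of the array constructor can be used to make the array alias the external memory instead. See the Interoperability section for more details on sharing memory with external frameworks.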

To read GPU array data back into CPU memory, use array.numpy():

# bring data from device back to host
view = device_array.numpy()

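This automatically synchronizes with the GPU to ensure that all outstanding work has finished, copies the array back to CPU memory, and then passes it to NumPy. Calling array.numpy() on a CPU array returns a zero-copy NumPy view onto the Warp data.

See the Arrays Reference for more details.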

User Functions

Users can write their own functions using the @wp.func decorator, for example:

@wp.func
def square(x: float):
    return x*x

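Kernels can call user functions defined in the same module or in a different module. As the example shows, return type hints are optional for user functions.

Anything that can be done in a Warp kernel can also be done in a user function, with the exception of wp.tid(). If needed, the thread index can be passed in through the function's arguments.

Functions can accept arrays and structs as inputs: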

@wp.func
def lookup(foos: wp.array(dtype=wp.uint32), index: int):
    return foos[index]

Functions can also return multiple values:

@wp.func
def multi_valued_func(a: wp.float32, b: wp.float32):
    return a + b, a - b, a * b, a / b

@wp.kernel
def test_multi_valued_kernel(test_data1: wp.array(dtype=wp.float32), test_data2: wp.array(dtype=wp.float32)):
    tid = wp.tid()
    d1, d2 = test_data1[tid], test_data2[tid]
    a, b, c, d = multi_valued_func(d1, d2)

User functions can also be overloaded by defining multiple function signatures with the same function name:

@wp.func
def custom(x: int):
    return x + 1


@wp.func
def custom(x: float):
    return x + 1.0


@wp.func
def custom(x: wp.vec3):
    return x + wp.vec3(1.0, 0.0, 0.0)

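See Generic Functions for details on using typing.Any in user function signatures.

See Differentiability for details on how to define custom gradient functions, custom replay functions, and custom native functions.

User Structs

Users can define their own structs using the @wp.struct decorator, for example: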

@wp.struct
class MyStruct:

    pos: wp.vec3
    vel: wp.vec3
    active: int
    indices: wp.array(dtype=int)

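As with kernel parameters, every attribute of a struct must have a valid type hint at class definition time.

Structs can be used as the dtype of a wp.array and can be passed directly to kernels as arguments. See the Structs Reference for more details.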

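Python Scope vs. Kernel Scope API

Some Warp APIs can only be called from the Python scope (that is, outside of Warp user functions and kernels), while others can only be called from the kernel scope.

Python-scope APIs are documented in the Runtime Reference, while kernel-scope APIs are documented in the Kernel Reference. In many cases, kernel-scope APIs can also be used from the Python scope.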

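Not all Python language features are supported in kernel scope. Some features have not been implemented yet, while others do not map well to the GPU from a performance perspective.

See the Limitations documentation for more details.

Compilation Model

Warp uses a Python->C++/CUDA compilation model that generates kernel code from Python function definitions. All kernels belonging to a Python module are compiled at runtime into dynamic libraries and PTX. The results are then cached between application restarts for fast startup times.

Note that compilation is triggered by the first kernel launch for a module. Any kernel registered in the module with @wp.kernel is included in the shared library.

By default, a status message is printed after each module is loaded, reporting the following basic information: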

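The name of the module that was just loaded
The first seven characters of the module hash
The device on which the module is being loaded
How long it took to load the module, in milliseconds
Whether the module was compiled (compiled), loaded from the cache (cached), or could not be loaded (error)

For debugging purposes, set wp.config.verbose = True to also get a printout when each module load begins.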

The following example illustrates kernel caching by running python3 -m warp.examples.sim.example_cartpole twice. The first time, we see:

Warp 1.2.0 initialized:
    CUDA Toolkit 12.5, Driver 12.5
    Devices:
      "cpu"      : "x86_64"
      "cuda:0"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
      "cuda:1"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
    CUDA peer access:
      Supported fully (all-directional)
    Kernel cache:
      /home/nvidia/.cache/warp/1.2.0
Module warp.sim.collide 296dfb5 load on device 'cuda:0' took 17982.83 ms (compiled)
Module warp.sim.articulation b2cf0c2 load on device 'cuda:0' took 5686.67 ms (compiled)
Module warp.sim.integrator_euler b87aa18 load on device 'cuda:0' took 7753.78 ms (compiled)
Module warp.sim.integrator 036f39a load on device 'cuda:0' took 456.53 ms (compiled)
step took 0.06 ms
render took 4.63 ms

The second time the example is run, the module load messages say (cached) and the load times are much shorter, because code compilation is skipped:

Warp 1.2.0 initialized:
    CUDA Toolkit 12.5, Driver 12.5
    Devices:
      "cpu"      : "x86_64"
      "cuda:0"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
      "cuda:1"   : "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled)
    CUDA peer access:
      Supported fully (all-directional)
    Kernel cache:
      /home/nvidia/.cache/warp/1.2.0
Module warp.sim.collide 296dfb5 load on device 'cuda:0' took 9.07 ms (cached)
Module warp.sim.articulation b2cf0c2 load on device 'cuda:0' took 4.96 ms (cached)
Module warp.sim.integrator_euler b87aa18 load on device 'cuda:0' took 3.69 ms (cached)
Module warp.sim.integrator 036f39a load on device 'cuda:0' took 0.39 ms (cached)
step took 0.04 ms
render took 5.05 ms

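For more information, see the Code Generation section.

Language Details

To support GPU computation and differentiability, there are some differences from the CPython runtime.

Built-in Types

Warp supports a number of built-in math types similar to those of high-level shading languages, e.g. vec2, vec3, vec4, mat22, mat33, mat44, quat, array. All built-in types have value semantics, so an expression such as a = b creates a copy of the variable b rather than a reference to it.

Strong Typing

Unlike Python, in Warp all variables must be typed. Types are inferred from source expressions and from function signatures written with the Python typing extensions. All kernel parameters must be annotated with the appropriate type, for example: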

@wp.kernel
def simple_kernel(a: wp.array(dtype=wp.vec3),
                  b: wp.array(dtype=wp.vec3),
                  c: float):
    pass  # kernel body omitted

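For convenience, typing.Any may be used in place of the concrete types that appear in function signatures. See the Generics documentation for more information. A generic version of the kernel above might look like this: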

from typing import Any

@wp.kernel
def generic_kernel(a: wp.array(dtype=Any),
                   b: wp.array(dtype=Any),
                   c: Any):
    pass  # kernel body omitted

Tuple initialization is not supported; variables should be explicitly typed instead:

# invalid
a = (1.0, 2.0, 3.0)

# valid
a = wp.vec3(1.0, 2.0, 3.0)

Limitations and Unsupported Features

See Limitations for a list of Warp's limitations and unsupported features.