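Concurrency#

Asynchronous Operations#

Kernel Launches#

Kernels launched on a CUDA device are asynchronous with respect to the host (the CPU Python thread). Launching a kernel schedules its execution on the CUDA device, but the wp.launch() function can return before the kernel has finished executing. This lets us run some CPU computations while the CUDA kernel is executing, which is an easy way to introduce parallelism into a program.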

wp.launch(kernel1, dim=n, inputs=[a], device="cuda:0")

# do some CPU work while the CUDA kernel is running
do_cpu_work()

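Kernels launched on different CUDA devices can execute concurrently. This can be used to tackle independent sub-tasks in parallel on different GPUs while using the CPU to do other useful work.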

# launch concurrent kernels on different devices
wp.launch(kernel1, dim=n, inputs=[a0], device="cuda:0")
wp.launch(kernel2, dim=n, inputs=[a1], device="cuda:1")

# do CPU work while kernels are running on both GPUs
do_cpu_work()

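Launching kernels on the CPU is currently a synchronous operation. In other words, wp.launch() returns only after the kernel has finished executing on the CPU. To run a CUDA kernel and a CPU kernel concurrently, launch the CUDA kernel first.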

# schedule a kernel on a CUDA device
wp.launch(kernel1, ..., device="cuda:0")

# run a kernel on the CPU while the CUDA kernel is running
wp.launch(kernel2, ..., device="cpu")

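Graph Launches#

The concurrency rules for CUDA graph launches are similar to those for CUDA kernel launches, except that graphs are not available on the CPU.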

# capture work on cuda:0 in a graph
with wp.ScopedCapture(device="cuda:0") as capture0:
    do_gpu0_work()

# capture work on cuda:1 in a graph
with wp.ScopedCapture(device="cuda:1") as capture1:
    do_gpu1_work()

# launch captured graphs on the respective devices concurrently
wp.capture_launch(capture0.graph)
wp.capture_launch(capture1.graph)

# do some CPU work while the CUDA graphs are running
do_cpu_work()

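Array Creation#

Creating CUDA arrays is also asynchronous with respect to the host. It involves allocating memory on the device and initializing it, which is done under the hood using a kernel launch or an asynchronous CUDA memset operation.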

a0 = wp.zeros(n, dtype=float, device="cuda:0")
b0 = wp.ones(n, dtype=float, device="cuda:0")

a1 = wp.empty(n, dtype=float, device="cuda:1")
b1 = wp.full(n, 42.0, dtype=float, device="cuda:1")

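In this snippet, arrays a0 and b0 are created on device cuda:0 and arrays a1 and b1 are created on device cuda:1. Operations on the same device are sequential, but each device executes them independently of the other device, so they can run concurrently.

Array Copying#

Copying arrays between devices can also be asynchronous, but there are some details to be aware of.

Copying from host memory to a CUDA device and from a CUDA device to host memory is asynchronous only if the host array is pinned. Pinned memory allows the CUDA driver to use direct memory access (DMA) transfers, which are generally faster and can be done without involving the CPU. There are two drawbacks to using pinned memory: allocation and deallocation are usually slower, and there are system-specific limits on how much pinned memory can be allocated. For this reason, Warp CPU arrays are not pinned by default. You can request a pinned allocation by passing the pinned=True flag when creating a CPU array. This is a good option for arrays that are used to copy data between host and device, especially if asynchronous transfers are desired.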

h = wp.zeros(n, dtype=float, device="cpu")
p = wp.zeros(n, dtype=float, device="cpu", pinned=True)
d = wp.zeros(n, dtype=float, device="cuda:0")

# host-to-device copy
wp.copy(d, h)  # synchronous
wp.copy(d, p)  # asynchronous

# device-to-host copy
wp.copy(h, d)  # synchronous
wp.copy(p, d)  # asynchronous

# wait for asynchronous operations to complete
wp.synchronize_device("cuda:0")

Copying between CUDA arrays on the same device is always asynchronous with respect to the host, since it does not involve the CPU:

a = wp.zeros(n, dtype=float, device="cuda:0")
b = wp.empty(n, dtype=float, device="cuda:0")

# asynchronous device-to-device copy
wp.copy(a, b)

# wait for transfer to complete
wp.synchronize_device("cuda:0")

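Copying between CUDA arrays on different devices is also asynchronous with respect to the host. Peer-to-peer transfers require extra care, because CUDA devices are also asynchronous with respect to each other. When copying an array from one GPU to another, the destination GPU is used to perform the copy, so we need to ensure that prior work on the source GPU completes before the transfer.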

a0 = wp.zeros(n, dtype=float, device="cuda:0")
a1 = wp.empty(n, dtype=float, device="cuda:1")

# wait for outstanding work on the source device to complete to ensure the source array is ready
wp.synchronize_device("cuda:0")

# asynchronous peer-to-peer copy
wp.copy(a1, a0)

# wait for the copy to complete on the destination device
wp.synchronize_device("cuda:1")

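Note that peer-to-peer transfers can be accelerated using memory pool access or peer access, which enables DMA transfers between CUDA devices on supported systems.

Streams#

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations from different streams may run concurrently and may be interleaved by the device scheduler.

Warp automatically creates a stream for each CUDA device during initialization. This becomes the current stream for the device. All kernel launches and memory operations issued on that device are placed on the current stream.

Creating Streams#

A stream is tied to a particular CUDA device. New streams can be created using the wp.Stream constructor: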

s1 = wp.Stream("cuda:0")  # create a stream on a specific CUDA device
s2 = wp.Stream()          # create a stream on the default device

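If the device parameter is omitted, the default device is used, which can be managed using wp.ScopedDevice.

For interoperation with external code, it is possible to pass a CUDA stream handle to wrap an external stream: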

s3 = wp.Stream("cuda:0", cuda_stream=stream_handle)

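The cuda_stream argument must be a native stream handle (cudaStream_t or CUstream) passed as a Python integer. This mechanism is used internally for sharing streams with external frameworks like PyTorch or DLPack. The caller is responsible for ensuring that the external stream is not destroyed while it is referenced by a wp.Stream object.

Using Streams#

Use wp.ScopedStream to temporarily change the current stream on a device and schedule a sequence of operations on that stream: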

stream = wp.Stream("cuda:0")

with wp.ScopedStream(stream):
    a = wp.zeros(n, dtype=float)
    b = wp.empty(n, dtype=float)
    wp.launch(kernel, dim=n, inputs=[a])
    wp.copy(b, a)

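Since streams are tied to a particular device, wp.ScopedStream subsumes the functionality of wp.ScopedDevice. That is why we don't need to specify the device argument explicitly in each of the calls.

An important benefit of streams is that they can be used to overlap compute and data transfer operations on the same device, which can improve the overall throughput of a program by doing those operations in parallel.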

with wp.ScopedDevice("cuda:0"):
    a = wp.zeros(n, dtype=float)
    b = wp.empty(n, dtype=float)
    c = wp.ones(n, dtype=float, device="cpu", pinned=True)

    compute_stream = wp.Stream()
    transfer_stream = wp.Stream()

    # asynchronous kernel launch on a stream
    with wp.ScopedStream(compute_stream):
        wp.launch(kernel, dim=a.size, inputs=[a])

    # asynchronous host-to-device copy on another stream
    with wp.ScopedStream(transfer_stream):
        wp.copy(b, c)

The wp.get_stream() function can be used to get the current stream on a device:

s1 = wp.get_stream("cuda:0")  # get the current stream on a specific device
s2 = wp.get_stream()          # get the current stream on the default device

The wp.set_stream() function can be used to set the current stream on a device:

wp.set_stream(stream, device="cuda:0")  # set the stream on a specific device
wp.set_stream(stream)                   # set the stream on the default device

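In general, we recommend using wp.ScopedStream rather than wp.set_stream().

Synchronization#

The wp.synchronize_stream() function can be used to block the host thread until the given stream completes: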

wp.synchronize_stream(stream)

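In a program that uses multiple streams, this gives finer-grained control over synchronization behavior than wp.synchronize_device(), which synchronizes all streams on the device. For example, if a program has multiple compute and transfer streams, the host might only want to wait for one transfer stream to complete, without waiting for the other streams. By synchronizing only one stream, we allow the others to continue running concurrently with the host thread.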

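Events#

Functions like wp.synchronize_device() or wp.synchronize_stream() block the CPU thread until work completes on a CUDA device, but they are not intended for synchronizing multiple CUDA streams with each other.

CUDA events provide a mechanism for device-side synchronization between streams. This kind of synchronization does not block the host thread, but it allows one stream to wait for work on another stream to complete.

Like streams, events are tied to a particular device: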

e1 = wp.Event("cuda:0")  # create an event on a specific CUDA device
e2 = wp.Event()          # create an event on the default device

To wait for a stream to complete some work, we first record the event on that stream. Then we make another stream wait on that event:

stream1 = wp.Stream("cuda:0")
stream2 = wp.Stream("cuda:0")
event = wp.Event("cuda:0")

stream1.record_event(event)
stream2.wait_event(event)

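Note that when recording events, the event must be from the same device as the recording stream. When waiting for events, the waiting stream can be from another device. This allows using events to synchronize streams on different GPUs.

If the Stream.record_event() method is called without an event argument, a temporary event is created, recorded, and returned: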

event = stream1.record_event()
stream2.wait_event(event)

The Stream.wait_stream() method combines the acts of recording and waiting on an event in one call:

stream2.wait_stream(stream1)

Warp also provides the global functions wp.record_event(), wp.wait_event(), and wp.wait_stream(), which operate on the current stream of the default device:

wp.record_event(event)  # record an event on the current stream
wp.wait_event(event)    # make the current stream wait for an event
wp.wait_stream(stream)  # make the current stream wait for another stream

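These variants are convenient to use inside of wp.ScopedStream and wp.ScopedDevice managers.

Here is a more complete example with a producer stream that copies data into an array and a consumer stream that uses the array in a kernel: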

with wp.ScopedDevice("cuda:0"):
    a = wp.empty(n, dtype=float)
    b = wp.ones(n, dtype=float, device="cpu", pinned=True)

    producer_stream = wp.Stream()
    consumer_stream = wp.Stream()

    with wp.ScopedStream(producer_stream):
        # asynchronous host-to-device copy
        wp.copy(a, b)

        # record an event to create a synchronization point for the consumer stream
        event = wp.record_event()

        # do some unrelated work in the producer stream
        do_other_producer_work()

    with wp.ScopedStream(consumer_stream):
        # do some unrelated work in the consumer stream
        do_other_consumer_work()

        # wait for the producer copy to complete
        wp.wait_event(event)

        # consume the array in a kernel
        wp.launch(kernel, dim=a.size, inputs=[a])

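The function wp.synchronize_event() can be used to block the host thread until a recorded event completes. This is useful when the host wants to wait for a specific synchronization point on a stream while allowing subsequent stream operations to continue executing asynchronously.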

with wp.ScopedDevice("cpu"):
    # CPU buffers for readback
    a_host = wp.empty(N, dtype=float, pinned=True)
    b_host = wp.empty(N, dtype=float, pinned=True)

with wp.ScopedDevice("cuda:0"):
    stream = wp.get_stream()

    # initialize first GPU array
    a = wp.full(N, 17, dtype=float)
    # asynchronous readback
    wp.copy(a_host, a)
    # record event
    a_event = stream.record_event()

    # initialize second GPU array
    b = wp.full(N, 42, dtype=float)
    # asynchronous readback
    wp.copy(b_host, b)
    # record event
    b_event = stream.record_event()

    # wait for first array readback to complete
    wp.synchronize_event(a_event)
    # process first array on the CPU
    assert np.array_equal(a_host.numpy(), np.full(N, fill_value=17.0))

    # wait for second array readback to complete
    wp.synchronize_event(b_event)
    # process second array on the CPU
    assert np.array_equal(b_host.numpy(), np.full(N, fill_value=42.0))

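Querying Stream and Event Status#

The Stream.is_complete and Event.is_complete attributes can be used to query the status of a stream or event. Unlike wp.synchronize_stream() and wp.synchronize_event(), these queries do not block the host thread.

These attributes are useful for running operations on the CPU while waiting for GPU operations to complete: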

@wp.kernel
def test_kernel(sum: wp.array(dtype=wp.uint64)):
    wp.atomic_add(sum, 0, wp.uint64(1))


sum = wp.zeros(1, dtype=wp.uint64)
wp.launch(test_kernel, dim=8 * 1024 * 1024, outputs=[sum])

# Have the CPU do some unrelated work while the GPU is computing
counter = 0
while not wp.get_stream().is_complete:
    print(f"counter: {counter}")
    counter += 1

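Stream.is_complete and Event.is_complete cannot be accessed during a graph capture.

CUDA Default Stream#

Warp avoids using the synchronous CUDA default stream, which is a special stream that synchronizes with all other streams on the same device. This stream is currently only used during readback operations that are provided for convenience, such as array.numpy() and array.list().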

stream1 = wp.Stream("cuda:0")
stream2 = wp.Stream("cuda:0")

with wp.ScopedStream(stream1):
    a = wp.zeros(n, dtype=float)

with wp.ScopedStream(stream2):
    b = wp.ones(n, dtype=float)

print(a)
print(b)

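In the snippet above, two arrays are initialized on different CUDA streams. Printing those arrays triggers a readback, which is done using the array.numpy() method. This readback happens on the synchronous CUDA default stream, which means that no explicit synchronization is required. The reason for this is convenience: printing an array is useful for debugging, so it is nice not to have to worry about synchronization.

The drawback of this approach is that the CUDA default stream (and any methods that use it) cannot be used during graph capture. The regular wp.copy() function should be used to capture readback operations in a graph.

Explicit Stream Arguments#

Several Warp functions accept optional stream arguments. This allows directly specifying the stream without using a wp.ScopedStream manager. There are benefits and drawbacks to both approaches, which are discussed below. Functions that accept stream arguments directly include wp.launch(), wp.capture_launch(), and wp.copy().

To launch a kernel on a specific stream: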

wp.launch(kernel, dim=n, inputs=[...], stream=my_stream)

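When launching a kernel with an explicit stream argument, the device argument should be omitted, since the device is inferred from the stream. If both stream and device are specified, the stream argument takes precedence.

To launch a graph on a specific stream: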

wp.capture_launch(graph, stream=my_stream)

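For both kernel and graph launches, specifying the stream directly can be faster than using wp.ScopedStream. While wp.ScopedStream is useful for scheduling a sequence of operations on a specific stream, there is some overhead in setting and restoring the current stream on the device. This overhead is negligible for larger workloads, but performance-sensitive code may benefit from specifying the stream directly instead of using wp.ScopedStream, especially for a single kernel or graph launch.

In addition to these performance considerations, specifying the stream directly can be useful when copying arrays between two CUDA devices. By default, Warp uses the following rules to determine which stream will be used for the copy:

If the destination array is on a CUDA device, use the current stream on the destination device.
Otherwise, if the source array is on a CUDA device, use the current stream on the source device.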
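
The two rules above can be modeled with a small pure-Python helper. The function `default_copy_stream_device` is hypothetical and is not part of the Warp API; it only mirrors the decision order described in the text, using device names such as "cuda:0" and "cpu".

```python
from typing import Optional


def default_copy_stream_device(dest_device: str, src_device: str) -> Optional[str]:
    """Return the device whose current stream a copy would use by default.

    Hypothetical model of the documented rules, not a Warp function.
    """
    if dest_device.startswith("cuda"):
        # Rule 1: a CUDA destination wins.
        return dest_device
    if src_device.startswith("cuda"):
        # Rule 2: otherwise fall back to a CUDA source.
        return src_device
    # CPU-to-CPU copy: no CUDA stream is involved.
    return None


for dest, src in [("cuda:0", "cpu"), ("cpu", "cuda:1"), ("cuda:1", "cuda:0"), ("cpu", "cpu")]:
    print(f"copy {src} -> {dest}: stream of {default_copy_stream_device(dest, src)}")
```

For a peer-to-peer copy from cuda:0 to cuda:1, the model selects the destination device, which matches the earlier remark that the destination GPU performs the copy by default.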

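In the case of peer-to-peer copies, specifying the stream argument allows overriding these rules, and the copy can be performed on a stream from any device.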

stream0 = wp.get_stream("cuda:0")
stream1 = wp.get_stream("cuda:1")

a0 = wp.zeros(n, dtype=float, device="cuda:0")
a1 = wp.empty(n, dtype=float, device="cuda:1")

# wait for the destination array to be ready
stream0.wait_stream(stream1)

# use the source device stream to do the copy
wp.copy(a1, a0, stream=stream0)

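Note that we use event synchronization to make the source stream wait for the destination stream prior to the copy. This is due to the stream-ordered memory pool allocators introduced in Warp 0.14.0. The allocation of the empty array a1 is scheduled on stream stream1. To avoid use-before-alloc errors, we need to wait until the allocation completes before using that array on a different stream.

Stream Priorities#

Streams can be created with a specified numerical priority using the priority parameter when creating a new Stream. High-priority streams are created with a priority of -1, while low-priority streams have a priority of 0. By scheduling work on streams of different priorities, users can achieve finer-grained control over how the GPU schedules pending work. Priorities are only a hint to the GPU about how to process work and do not guarantee that pending work will execute in a particular order. Stream priorities currently do not affect host-to-device or device-to-host memory transfers.

Streams created with a priority outside the valid values of -1 and 0 will have their priority clamped. The priority of any stream can be queried using the Stream.priority attribute. If a CUDA device does not support stream priorities, all streams will have a priority of 0 regardless of the priority requested when the stream was created.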

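For more information on stream priorities, see the corresponding section of the CUDA C++ Programming Guide.

The following example illustrates the effect of stream priorities: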

import warp as wp

wp.config.verify_cuda = True

wp.init()

total_size = 256 * 1024 * 1024
each_size = 128 * 1024 * 1024

with wp.ScopedDevice("cuda:0"):
    array_lo = wp.zeros(total_size, dtype=wp.float32)
    array_hi = wp.zeros(total_size, dtype=wp.float32)

    stream_lo = wp.Stream(wp.get_device(), 0)  # Low priority
    stream_hi = wp.Stream(wp.get_device(), -1)  # High priority

    start_lo_event = wp.Event(enable_timing=True)
    start_hi_event = wp.Event(enable_timing=True)
    end_lo_event = wp.Event(enable_timing=True)
    end_hi_event = wp.Event(enable_timing=True)

    wp.synchronize_device(wp.get_device())

    stream_lo.record_event(start_lo_event)
    stream_hi.record_event(start_hi_event)

    for copy_offset in range(0, total_size, each_size):
        wp.copy(array_lo, array_lo, copy_offset, copy_offset, each_size, stream_lo)
        wp.copy(array_hi, array_hi, copy_offset, copy_offset, each_size, stream_hi)

    stream_lo.record_event(end_lo_event)
    stream_hi.record_event(end_hi_event)

    # get elapsed time between the two events
    elapsed_lo = wp.get_event_elapsed_time(start_lo_event, end_lo_event)
    elapsed_hi = wp.get_event_elapsed_time(start_hi_event, end_hi_event)

    print(f"elapsed_lo = {elapsed_lo:.6f}")
    print(f"elapsed_hi = {elapsed_hi:.6f}")

Example output on a test workstation looks like this:

elapsed_lo = 5.118944
elapsed_hi = 2.647040

If the example is modified so that both streams have the same priority, the output becomes:

elapsed_lo = 5.112832
elapsed_hi = 5.114880

Finally, if we reverse the stream priorities so that stream_lo has a priority of -1 and stream_hi has a priority of 0, we get:

elapsed_lo = 2.621440
elapsed_hi = 5.105664
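
To make the three sets of timings easier to compare, the short script below computes the ratio of the two elapsed times for each run, using only the numbers printed above. The dictionary keys are labels chosen for this sketch. The interpretation in the closing note is a reading of these particular numbers, not a guarantee about scheduling.

```python
# Elapsed times copied from the three example outputs above.
runs = {
    "stream_hi prioritized": {"lo": 5.118944, "hi": 2.647040},
    "equal priorities": {"lo": 5.112832, "hi": 5.114880},
    "stream_lo prioritized": {"lo": 2.621440, "hi": 5.105664},
}

# Ratio of low-stream time to high-stream time for each configuration.
ratios = {name: t["lo"] / t["hi"] for name, t in runs.items()}

for name, ratio in ratios.items():
    print(f"{name}: elapsed_lo / elapsed_hi = {ratio:.2f}")
```

In these measurements the prioritized stream finishes in roughly half the time of the other stream, while equal priorities give nearly identical times, which is consistent with the two streams sharing the device evenly.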

Stream Usage Guidance#

Stream synchronization can be a tricky business, even for experienced CUDA developers. Consider the following code:

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

wp.launch(kernel, dim=a.size, inputs=[a], stream=s)

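This snippet has a stream synchronization problem that is difficult to detect at first glance. It is quite possible that the code will work just fine, but it introduces undefined behavior, which may lead to incorrect results that show up only once in a while. The issue is that the kernel is launched on stream s, which is different from the stream used for creating array a. The array is allocated and initialized on the current stream of device cuda:0, which means that it might not be ready when stream s begins executing the kernel that consumes the array.

The solution is to synchronize the streams, which can be done like this: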

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

# wait for the current stream on cuda:0 to finish initializing the array
s.wait_stream(wp.get_stream("cuda:0"))

wp.launch(kernel, dim=a.size, inputs=[a], stream=s)

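The wp.ScopedStream manager is designed to alleviate this common problem. It synchronizes the new stream with the previous stream on the device. Its behavior is equivalent to inserting the wait_stream() call as shown above. With wp.ScopedStream, we don't need to explicitly synchronize the new stream with the previous stream: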

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

with wp.ScopedStream(s):
    wp.launch(kernel, dim=a.size, inputs=[a])

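This makes wp.ScopedStream the recommended way to get started with streams in Warp. Using explicit stream arguments might be slightly more performant, but it requires more attention to stream synchronization mechanics. If you are a stream novice, consider the following trajectory for integrating streams into your Warp programs:

Level 1: Don't. You don't need to use streams to use Warp. Avoiding streams is a perfectly valid and respectable way to live. Many interesting and sophisticated algorithms can be developed without fancy stream juggling. Often it's better to focus on solving a problem in a simple and elegant way, unencumbered by the vagaries of low-level stream management.
Level 2: Use wp.ScopedStream. It helps to avoid some common hard-to-catch issues. There is a little overhead, but it should be negligible if the GPU workloads are large enough. Consider adding streams to your program as a form of targeted optimization, especially if some areas like memory transfers ("feeding the beast") are a known bottleneck. Streams are great for overlapping memory transfers with compute workloads.
Level 3: Use explicit stream arguments for kernel launches, array copying, etc. This will be the most performant approach and can get you close to the speed of light. You will need to take care of all stream synchronization yourself, but the results can be rewarding in the benchmarks.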

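Synchronization Guidance#

The general rule with synchronization is to use as little of it as possible, but not less.

Excessive synchronization can severely limit the performance of programs. Synchronization means that a stream or thread is waiting for something else to complete. While it is waiting, it is not doing any useful work, and any outstanding work cannot start until the synchronization point is reached. This limits parallel execution, which is often important for squeezing the most juice out of the collection of hardware components.

On the other hand, insufficient synchronization can lead to errors or incorrect results if operations execute out of order. A fast program is no good if it can't guarantee correct results.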

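Host-side Synchronization#

Host-side synchronization blocks the host thread (Python) until GPU work completes. This is necessary when you are waiting for some GPU work to complete so that you can access the results on the CPU.

wp.synchronize() is the most heavy-handed synchronization function, since it synchronizes all the devices in the system. It is almost never the right function to call if performance is important. However, it can sometimes be useful when debugging synchronization-related issues.

wp.synchronize_device(device) synchronizes a single device, which is generally better and faster. This synchronizes all the streams on the specified device, including streams created by Warp and those created by any other framework.

wp.synchronize_stream(stream) synchronizes a single stream, which is better still. If the program uses multiple streams, you can wait for a specific one to finish without waiting for the others. This is handy if you have a readback stream that is copying data from the GPU to the CPU. You can wait for the transfer to complete and start processing it on the CPU while other streams are still chugging along on the GPU, in parallel with the host code.

wp.synchronize_event(event) is the most specific host synchronization function. It blocks the host until an event previously recorded on a CUDA stream completes. This can be used to wait for a specific stream synchronization point to be reached, while allowing subsequent operations on that stream to continue asynchronously.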

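Device-side Synchronization#

Device-side synchronization uses CUDA events to make one stream wait for a synchronization point recorded on another stream (wp.record_event(), wp.wait_event(), wp.wait_stream()).

These functions don't block the host thread, so the CPU can stay busy doing useful work, like preparing the next batch of data to feed the beast. Events can be used to synchronize streams on the same device or even on different CUDA devices, so you can choreograph very sophisticated multi-stream and multi-device workloads that execute entirely on the available GPUs. This allows keeping host-side synchronization to a minimum, perhaps only when reading back the final results.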

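Synchronization and Graph Capture#

A CUDA graph captures a sequence of operations on a CUDA stream that can be replayed multiple times with low overhead. During capture, certain CUDA functions are not allowed, including host-side synchronization functions. Using the synchronous CUDA default stream is also not allowed. The only form of synchronization allowed in CUDA graphs is event-based synchronization.

A CUDA graph capture must start and end on the same stream, but multiple streams can be used in the middle. This allows CUDA graphs to encompass multiple streams and even multiple GPUs. Events play a crucial role in multi-stream graph capture because, in addition to their regular synchronization duties, they are used to fork and join new streams to the main capture stream.

Here is an example of capturing a multi-GPU graph using a stream on each device: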

stream0 = wp.Stream("cuda:0")
stream1 = wp.Stream("cuda:1")

# use stream0 as the main capture stream
with wp.ScopedCapture(stream=stream0) as capture:

    # fork stream1, which adds it to the set of streams being captured
    stream1.wait_stream(stream0)

    # launch a kernel on stream0
    wp.launch(kernel, ..., stream=stream0)

    # launch a kernel on stream1
    wp.launch(kernel, ..., stream=stream1)

    # join stream1
    stream0.wait_stream(stream1)

# launch the multi-GPU graph, which can execute the captured kernels concurrently
wp.capture_launch(capture.graph)