Skip to content

Latest commit

 

History

History
439 lines (360 loc) · 17.4 KB

File metadata and controls

439 lines (360 loc) · 17.4 KB

Other Extension Metadata

Last Modified Date

2023-03-31

IP Status

No known IP claims.

Contributors
  • Ewan Crawford, Codeplay Software Ltd.

  • Gordon Brown, Codeplay Software Ltd.

  • Kenneth Benzie, Codeplay Software Ltd.

  • Alastair Murray, Codeplay Software Ltd.

  • Jack Frankland, Codeplay Software Ltd.

  • Balaji Calidas, Qualcomm Technologies Inc.

  • Joshua Kelly, Qualcomm Technologies, Inc.

  • Kevin Petit, Arm Ltd.

  • Aharon Abramson, Intel.

  • Ben Ashbaugh, Intel.

  • Boaz Ouriel, Intel.

  • Chris Gearing, Intel.

  • Pekka Jääskeläinen, Tampere University and Intel

  • Jan Solanti, Tampere University

  • Nikhil Joshi, NVIDIA

  • James Price, Google

  • Brice Videau, Argonne National Laboratory

Description

cl_khr_command_buffer adds the ability to record and replay buffers of OpenCL commands.

Command-buffers enable a reduction in overhead when enqueuing the same workload multiple times. By separating the command-queue setup from dispatch, the ability to replay a set of previously created commands is introduced.

Device-side cl_sync_point_khr synchronization-points can be used within command-buffers to define command dependencies. This allows the commands of a command-buffer to execute out-of-order on a single compatible command-queue. The command-buffer itself has no inherent in-order/out-of-order property, this ordering is inferred from the command-queue used on command recording. Out-of-order enqueues without event dependencies of both regular commands, such as {clEnqueueFillBuffer}, and command-buffers are allowed to execute concurrently, and it is up to the user to express any dependencies using events.

The command-queues a command-buffer will be executed on can be set on replay via parameters to {clEnqueueCommandBufferKHR}, provided they are compatible with the command-queues used on command-buffer recording.

Background

On embedded devices where building a command stream accounts for a significant expenditure of resources and where workloads are often required to be pipelined, a solution that minimizes driver overhead can significantly improve the utilization of accelerators by removing a bottleneck in repeated command stream generation.

An additional motivator is lowering task execution latency, as devices can be kept occupied with work by repeated submissions, without having to wait on the host to construct commands again for a similar workload.

Rationale

The command-buffer abstraction over the generation of command streams is a proven approach which facilitates a significant reduction in driver overhead in existing real-world applications with repetitive pipelined workloads which are built on top of Vulkan, DirectX 12, and Metal.

A primary goal is for a command-buffer to avoid any interaction with application code after being enqueued until all recorded commands have completed. As such, any command which maps or migrates memory objects; reads or writes memory objects; or enqueues a native kernel, is not available for command-buffer recording. Finally commands recorded into a command buffer do not wait for or return event objects, these are instead replaced with device-side synchronization-point identifiers which enable out-of-order execution when enqueued on compatible command-queues.

Adding new entry-points for individual commands, rather than recording existing command-queue APIs with begin/end markers was a design decision made for the following reasons:

  • Individually specified entry points makes it clearer to the user what’s supported, as opposed to adding a large number of error conditions throughout the specification with all the restrictions.

  • Prevents code forking in existing entry points for the implementer, as otherwise separate paths in each entry point need to be maintained for both the recording and normal cases.

  • Allows the definition of a new device-side synchronization primitive rather than overloading {cl_event_TYPE}. As use of {cl_event_TYPE} in individual commands allows host interaction from callback and user-events, as well as introducing complexities when a command-buffer is enqueued multiple times regarding profiling and execution status.

  • New entry points facilitate returning handles to individual commands, allowing those commands to be modified between enqueues of the command buffer. Not all command handles are used in this extension, but providing them facilitates other extensions layered on top to take advantage of them to provide additional mutable functionality.

Simultaneous Use

The optional simultaneous use capability was added to the extension so that vendors can support pipelined workflows, where command-buffers are repeatedly enqueued without blocking in user code. However, simultaneous use may result in command-buffers being more expensive to enqueue than in a sequential model, so the capability is optional to enable optimizations on command-buffer recording.

Interactions With Other Extensions

The introduction of the command-buffer abstraction enables functionality beyond what the cl_khr_command_buffer extension currently provides, i.e. the recording of immutable commands to a single queue which can then be executed without commands synchronizing outside the command-buffer. Extra functionality expanding on this is provided as layered extensions on top of cl_khr_command_buffer. The layered extensions that currently exist are:

Having cl_khr_command_buffer as a minimal base specification means that the API defines mechanisms for functionality that is not enabled by this extension, these are described in the following sub-sections. cl_khr_command_buffer will retain its provisional extension status until other layered extensions are released, as these may reveal modifications needed to the base specification to support their intended use cases.

ND-range Kernel Command Properties

The {clCommandNDRangeKernelKHR} entry-point defines a properties parameter of new type {cl_ndrange_kernel_command_properties_khr_TYPE}. No properties are defined in cl_khr_command_buffer, but the parameter enables layered extensions like [cl_khr_command_buffer_mutable_dispatch] to define properties that inform the characteristics of the kernel command.

Command Handles

All command recording entry-points define a {cl_mutable_command_khr_TYPE} output parameter which provides a handle to the specific command being recorded. Use of these output handles is not enabled by the cl_khr_command_buffer extension, but the handles allow individual commands in a command-buffer to be referenced by the user.

Use of these handles is enabled in [cl_khr_command_buffer_mutable_dispatch] to give the capability for an application to use the handles to modify commands between enqueues of a command-buffer.

List of Queues

Only a single command-queue can be associated with a command-buffer in the cl_khr_command_buffer extension, but the API is designed so that the layered [cl_khr_command_buffer_multi_device] extension can relax this constraint to allow commands to be recorded across multiple queues in the same command-buffer, providing replay of heterogeneous task graphs.

Using multiple queue functionality will result in an error without [cl_khr_command_buffer_multi_device] to relax usage of the following API features:

  • When a command-buffer is created the API enables passing a list of queues that the command-buffer will record commands to. Only a single queue is permitted in cl_khr_command_buffer.

  • Individual command recording entry-points define a {cl_command_queue_TYPE} parameter for which of the queues set on command-buffer creation that command should be record to. This must be passed as NULL in cl_khr_command_buffer.

  • {clEnqueueCommandBufferKHR} takes a list of queues for command-buffer execution, correspond to those set on creation. Only a single queue is permitted in cl_khr_command_buffer.

New Commands

  • {clCreateCommandBufferKHR}

  • {clRetainCommandBufferKHR}

  • {clReleaseCommandBufferKHR}

  • {clFinalizeCommandBufferKHR}

  • {clEnqueueCommandBufferKHR}

  • {clCommandBarrierWithWaitListKHR}

  • {clCommandCopyBufferKHR}

  • {clCommandCopyBufferRectKHR}

  • {clCommandCopyBufferToImageKHR}

  • {clCommandCopyImageKHR}

  • {clCommandCopyImageToBufferKHR}

  • {clCommandFillBufferKHR}

  • {clCommandFillImageKHR}

  • {clCommandNDRangeKernelKHR}

  • {clGetCommandBufferInfoKHR}

  • The following SVM entry points are supported only with at least OpenCL 2.0, and starting from version 0.9.4 of this extension

    • {clCommandSVMMemcpyKHR}

    • {clCommandSVMMemFillKHR}

New Types

  • {cl_device_command_buffer_capabilities_khr_TYPE}

  • {cl_command_buffer_khr_TYPE}

  • {cl_sync_point_khr_TYPE}

  • {cl_command_buffer_info_khr_TYPE}

  • {cl_command_buffer_state_khr_TYPE}

  • {cl_command_buffer_properties_khr_TYPE}

  • {cl_command_buffer_flags_khr_TYPE}

  • {cl_ndrange_kernel_command_properties_khr_TYPE}

  • {cl_mutable_command_khr_TYPE}

New Enums

  • {cl_device_info_TYPE}

    • {CL_DEVICE_COMMAND_BUFFER_CAPABILITIES_KHR}

    • {CL_DEVICE_COMMAND_BUFFER_REQUIRED_QUEUE_PROPERTIES_KHR}

  • {cl_device_command_buffer_capabilities_khr_TYPE}

    • {CL_COMMAND_BUFFER_CAPABILITY_KERNEL_PRINTF_KHR}

    • {CL_COMMAND_BUFFER_CAPABILITY_DEVICE_SIDE_ENQUEUE_KHR}

    • {CL_COMMAND_BUFFER_CAPABILITY_SIMULTANEOUS_USE_KHR}

    • {CL_COMMAND_BUFFER_CAPABILITY_OUT_OF_ORDER_KHR}

  • {cl_command_buffer_properties_khr_TYPE}

    • {CL_COMMAND_BUFFER_FLAGS_KHR}

  • {cl_command_buffer_flags_khr_TYPE}

    • {CL_COMMAND_BUFFER_SIMULTANEOUS_USE_KHR}

  • {cl_command_buffer_info_khr_TYPE}

    • {CL_COMMAND_BUFFER_QUEUES_KHR}

    • {CL_COMMAND_BUFFER_NUM_QUEUES_KHR}

    • {CL_COMMAND_BUFFER_REFERENCE_COUNT_KHR}

    • {CL_COMMAND_BUFFER_STATE_KHR}

    • {CL_COMMAND_BUFFER_PROPERTIES_ARRAY_KHR}

    • {CL_COMMAND_BUFFER_CONTEXT_KHR}

  • {cl_command_buffer_state_khr_TYPE}

    • {CL_COMMAND_BUFFER_STATE_RECORDING_KHR}

    • {CL_COMMAND_BUFFER_STATE_EXECUTABLE_KHR}

    • {CL_COMMAND_BUFFER_STATE_PENDING_KHR}

  • {cl_command_type_TYPE}

    • {CL_COMMAND_COMMAND_BUFFER_KHR}

  • New Error Codes

    • {CL_INVALID_COMMAND_BUFFER_KHR}

    • {CL_INVALID_SYNC_POINT_WAIT_LIST_KHR}

    • {CL_INCOMPATIBLE_COMMAND_QUEUE_KHR}

Sample Code

  #define CL_CHECK(ERROR)                             \
    if (ERROR) {                                      \
      std::cerr << "OpenCL error: " << ERROR << "\n"; \
      return ERROR;                                   \
    }

  int main() {
    cl_platform_id platform;
    CL_CHECK(clGetPlatformIDs(1, &platform, nullptr));
    cl_device_id device;
    CL_CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr));

    cl_int error;
    cl_context context =
        clCreateContext(nullptr, 1, &device, nullptr, nullptr, &error);
    CL_CHECK(error);

    const char* code = R"OpenCLC(
  kernel void vector_addition(global int* tile1, global int* tile2,
                              global int* res) {
    size_t index = get_global_id(0);
    res[index] = tile1[index] + tile2[index];
  }
  )OpenCLC";
    const size_t length = std::strlen(code);

    cl_program program =
        clCreateProgramWithSource(context, 1, &code, &length, &error);
    CL_CHECK(error);

    CL_CHECK(clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr));

    cl_kernel kernel = clCreateKernel(program, "vector_addition", &error);
    CL_CHECK(error);

    constexpr size_t frame_count = 60;
    constexpr size_t frame_elements = 1024;
    constexpr size_t frame_size = frame_elements * sizeof(cl_int);

    constexpr size_t tile_count = 16;
    constexpr size_t tile_elements = frame_elements / tile_count;
    constexpr size_t tile_size = tile_elements * sizeof(cl_int);

    cl_mem buffer_tile1 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_tile2 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_res =
        clCreateBuffer(context, CL_MEM_WRITE_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);

    CL_CHECK(clSetKernelArg(kernel, 0, sizeof(buffer_tile1), &buffer_tile1));
    CL_CHECK(clSetKernelArg(kernel, 1, sizeof(buffer_tile2), &buffer_tile2));
    CL_CHECK(clSetKernelArg(kernel, 2, sizeof(buffer_res), &buffer_res));

    cl_command_queue command_queue =
      clCreateCommandQueue(context, device,
                           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &error);
    CL_CHECK(error);

    cl_command_buffer_khr command_buffer =
        clCreateCommandBufferKHR(1, &command_queue, nullptr, &error);
    CL_CHECK(error);

    cl_mem buffer_src1 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_src2 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_dst =
        clCreateBuffer(context, CL_MEM_WRITE_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);

    cl_sync_point_khr tile_sync_point = 0;
    for (size_t tile_index = 0; tile_index < tile_count; tile_index++) {
      std::array<cl_sync_point_khr, 2> copy_sync_points;
      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_src1, buffer_tile1, tile_index * tile_size, 0,
          tile_size, tile_sync_point ? 1 : 0,
          tile_sync_point ? &tile_sync_point : nullptr, &copy_sync_points[0]),
          nullptr);
      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_src2, buffer_tile2, tile_index * tile_size, 0,
          tile_size, tile_sync_point ? 1 : 0,
          tile_sync_point ? &tile_sync_point : nullptr, &copy_sync_points[1]),
          nullptr);

      cl_sync_point_khr nd_sync_point;
      CL_CHECK(clCommandNDRangeKernelKHR(command_buffer,
          command_queue, nullptr, kernel, 1, nullptr, &tile_elements, nullptr,
          copy_sync_points.size(), copy_sync_points.data(), &nd_sync_point,
          nullptr));

      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_res, buffer_dst, 0, tile_index * tile_size,
          tile_size, 1, &nd_sync_point, &tile_sync_point, nullptr));
    }

    CL_CHECK(clFinalizeCommandBufferKHR(command_buffer));

    std::random_device random_device;
    std::mt19937 random_engine{random_device()};
    std::uniform_int_distribution<cl_int> random_distribution{
        0, std::numeric_limits<cl_int>::max() / 2};
    auto random_generator = [&]() { return random_distribution(random_engine); };

    for (size_t frame_index = 0; frame_index < frame_count; frame_index++) {
      std::array<cl_event, 2> write_src_events;
      std::vector<cl_int> src1(frame_elements);
      std::generate(src1.begin(), src1.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src1, CL_FALSE, 0,
                                    frame_size, src1.data(), 0, nullptr,
                                    &write_src_events[0]));
      std::vector<cl_int> src2(frame_elements);
      std::generate(src2.begin(), src2.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src2, CL_FALSE, 0,
                                    frame_size, src2.data(), 0, nullptr,
                                    &write_src_events[1]));

      CL_CHECK(clEnqueueCommandBufferKHR(0, NULL, command_buffer, 2,
                                         write_src_events.data(), nullptr));

      CL_CHECK(clFinish(command_queue));

      CL_CHECK(clReleaseEvent(write_src_event[0]));
      CL_CHECK(clReleaseEvent(write_src_event[1]));
    }

    CL_CHECK(clReleaseCommandBufferKHR(command_buffer));
    CL_CHECK(clReleaseCommandQueue(command_queue));

    CL_CHECK(clReleaseMemObject(buffer_src1));
    CL_CHECK(clReleaseMemObject(buffer_src2));
    CL_CHECK(clReleaseMemObject(buffer_dst));

    CL_CHECK(clReleaseMemObject(buffer_tile1));
    CL_CHECK(clReleaseMemObject(buffer_tile2));
    CL_CHECK(clReleaseMemObject(buffer_res));

    CL_CHECK(clReleaseKernel(kernel));
    CL_CHECK(clReleaseProgram(program));
    CL_CHECK(clReleaseContext(context));

    return 0;
  }

Issues

  1. Introduce a clCloneCommandBufferKHR entry-point for cloning a command-buffer.

    UNRESOLVED

  2. Enable detached command-buffer execution, where command-buffers are executed on their own internal queue to prevent locking user created queues for the duration of their execution.

    UNRESOLVED

Version History

  • Revision 0.9.0, 2021-11-10

    • First assigned version (provisional).

  • 0.9.1, 2022-08-24

    • Specify an error if a command-buffer is finalized multiple times (provisional).

  • 0.9.2, 2023-03-31

    • Introduce context query {CL_COMMAND_BUFFER_CONTEXT_KHR} (provisional).

  • 0.9.3, 2023-04-04

    • Remove Invalid command-buffer state (provisional).

  • 0.9.4, 2023-05-11

    • Add clCommandSVMMemcpyKHR and clCommandSVMMemFillKHR command entries (provisional).