- Application
-
The combination of the program running on the host and OpenCL devices.
- Acquire semantics
-
One of the memory order semantics defined for synchronization operations. Acquire semantics apply to atomic operations that load from memory. Given two units of execution, A and B, acting on a shared atomic object M, if A uses an atomic load of M with acquire semantics to synchronize-with an atomic store to M by B that used release semantics, then A's atomic load will occur before any subsequent operations by A. Note that the memory orders release, sequentially consistent, and acquire_release all include release semantics and effectively pair with a load using acquire semantics.
- Acquire release semantics
-
A memory order semantics for synchronization operations (such as atomic operations) that has the properties of both acquire and release memory orders. It is used with read-modify-write operations.
- Atomic operations
-
Operations that at any point, and from any perspective, have either occurred completely, or not at all. Memory orders associated with atomic operations may constrain the visibility of loads and stores with respect to the atomic operations (see relaxed semantics, acquire semantics, release semantics or acquire release semantics).
- Blocking and Non-Blocking Enqueue API calls
-
A non-blocking enqueue API call places a command on a command-queue and returns immediately to the host. The blocking-mode enqueue API calls do not return to the host until the command has completed.
- Barrier
-
There are three types of barriers: a command-queue barrier, a work-group barrier, and a sub-group barrier.
-
The OpenCL API provides a function to enqueue a command-queue barrier command. This barrier command ensures that all previously enqueued commands to a command-queue have finished execution before any following commands enqueued in the command-queue can begin execution.
-
The OpenCL kernel execution model provides built-in work-group barrier functionality. This barrier built-in function can be used by a kernel executing on a device to perform synchronization between work-items in a work-group executing the kernel. All the work-items of a work-group must execute the barrier construct before any are allowed to continue execution beyond the barrier.
-
The OpenCL kernel execution model provides built-in sub-group barrier functionality. This barrier built-in function can be used by a kernel executing on a device to perform synchronization between work-items in a sub-group executing the kernel. All the work-items of a sub-group must execute the barrier construct before any are allowed to continue execution beyond the barrier.
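As a sketch, a work-group barrier typically separates a phase that writes local memory from a phase that reads it. The reversal logic below is illustrative, not taken from the specification:

```c
// OpenCL C: reverse the elements within each work-group using local memory.
__kernel void wg_reverse(__global const float *in,
                         __global float *out,
                         __local float *tmp)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    size_t gid = get_global_id(0);

    tmp[lid] = in[gid];            // write phase: fill local memory

    barrier(CLK_LOCAL_MEM_FENCE);  // every work-item in the work-group must
                                   // reach this point before any continues

    out[gid] = tmp[lsz - 1 - lid]; // read phase: safe after the barrier
}
```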
-
- Buffer Object
-
A memory object that stores a linear collection of bytes. Buffer objects are accessible using a pointer in a kernel executing on a device. Buffer objects can be manipulated by the host using OpenCL API calls. A buffer object encapsulates the following information:
-
Size in bytes.
-
Properties that describe usage information and which region to allocate from.
-
Buffer data.
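The information listed above maps directly onto the clCreateBuffer host API. A minimal sketch, assuming `context` is a valid cl_context created earlier:

```c
// Host C: create a buffer object of 1024 floats initialized from host memory.
float host_data[1024] = {0};
cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, // usage properties
                            sizeof(host_data),                        // size in bytes
                            host_data,                                // initial buffer data
                            &err);
```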
-
- Built-in Kernel
-
A built-in kernel is a kernel that is provided by an OpenCL implementation. A built-in kernel is enqueued for execution like other kernels, but may execute on specialized hardware that is unavailable to non-built-in kernels. Applications can query the built-in kernels supported by a device.
- Child kernel
-
See Device-side enqueue.
- Command
-
The OpenCL operations that are submitted to a command-queue for execution. For example, OpenCL commands issue kernels for execution on a compute device, manipulate memory objects, etc.
- Command-queue
-
An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. Refer to In-order Execution and Out-of-order Execution.
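The two execution modes mentioned above are selected when the queue is created. A sketch, assuming `context` and `device` are valid:

```c
// Host C: create an in-order and an out-of-order command-queue on one device.
cl_int err;
cl_command_queue in_order =
    clCreateCommandQueueWithProperties(context, device, NULL, &err);

cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
cl_command_queue out_of_order =
    clCreateCommandQueueWithProperties(context, device, props, &err);
```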
- Command-queue Barrier
-
See Barrier.
- Command synchronization
-
Constraints on the order that commands are launched for execution on a device defined in terms of the synchronization points that occur between commands in host command-queues and between commands in device-side command-queues. See synchronization points.
- Complete
-
The final state in the six state model for the execution of a command. The transition into this state is signaled through event objects or callback functions associated with a command.
- Compute Device Memory
-
This refers to one or more memories attached to the compute device.
- Compute Unit
-
An OpenCL device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.
- Concurrency
-
A property of a system in which a set of tasks in a system can remain active and make progress at the same time. To utilize concurrent execution when running a program, a programmer must identify the concurrency in their problem, expose it within the source code, and then exploit it using a notation that supports concurrency.
- Constant Memory
-
A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.
- Context
-
The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.
- Control flow
-
The flow of instructions executed by a work-item. Multiple logically related work-items may or may not execute the same control flow. The control flow is said to be converged if all the work-items in the set execute the same stream of instructions. In a diverged control flow, the work-items in the set execute different instructions. At a later point, if a diverged control flow becomes converged, it is said to be a re-converged control flow.
- Converged control flow
-
See Control flow.
- Custom Device
-
A custom device is a specialized device that supports a subset of the OpenCL runtime APIs for directed tasks but is not OpenCL conformant. A custom device must implement all of the OpenCL runtime APIs, but may return implementation-defined error codes for unsupported functionality. Custom devices may support an online compiler. When an online compiler is not available, OpenCL programs may be created from binaries or for built-in kernels supported by the device. See also Device.
- Data Parallel Programming Model
-
Traditionally, this term refers to a programming model where concurrency is expressed as instructions from a single program applied to multiple elements within a set of data structures. The term has been generalized in OpenCL to refer to a model wherein a set of instructions from a single program are applied concurrently to each point within an abstract domain of indices.
- Data race
-
The execution of a program contains a data race if it contains two actions in different work-items or host threads where (1) one action modifies a memory location and the other action reads or modifies the same memory location, and (2) at least one of these actions is not atomic, or the corresponding memory scopes are not inclusive, and (3) the actions are global actions unordered by the global-happens-before relation or are local actions unordered by the local-happens-before relation.
- Deprecation
-
Existing features are marked as deprecated if their usage is not recommended as that feature is being de-emphasized, superseded and may be removed from a future version of the specification.
- Device
-
A device is a collection of compute units. A command-queue is used to queue commands to a device. Examples of commands include executing kernels, or reading and writing memory objects. OpenCL devices typically correspond to a GPU, a multi-core CPU, and other processors such as DSPs and the Cell/B.E. processor.
- Device-side enqueue
-
A mechanism whereby a kernel-instance is enqueued by a kernel-instance running on a device without direct involvement by the host program. This produces nested parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance. The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the child kernel) to a device-side command-queue. Child and parent kernels execute asynchronously, though a parent kernel does not complete until all of its child kernels have completed.
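A minimal OpenCL C 2.0 sketch of the parent/child relationship, using the default device-side queue and a Clang "block" as the child kernel body:

```c
// OpenCL C 2.0: a parent kernel enqueues a child kernel without host help.
__kernel void parent(__global int *data, int n)
{
    if (get_global_id(0) == 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,      // child runs after
                                                           // the parent ends
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] *= 2; }); // child kernel body
    }
}
```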
- Diverged control flow
-
See Control flow.
- Ended
-
The fifth state in the six state model for the execution of a command. The transition into this state occurs when execution of a command has ended. When a Kernel-enqueue command ends, all of the work-groups associated with that command have finished their execution.
- Event Object
-
An event object encapsulates the status of an operation such as a command. It can be used to synchronize operations in a context.
- Event Wait List
-
An event wait list is a list of event objects that can be used to control when a particular command begins execution.
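A sketch of the mechanism, assuming `queue`, `bufA`, `bufB`, `src`, and `size` are valid objects and values created earlier:

```c
// Host C: command B begins only after command A completes.
cl_event ev_a;
clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, size, src,
                     0, NULL, &ev_a);        // command A, non-blocking

cl_event wait_list[] = {ev_a};
clEnqueueCopyBuffer(queue, bufA, bufB, 0, 0, size,
                    1, wait_list, NULL);     // command B waits on A's event
clReleaseEvent(ev_a);
```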
- Fence
-
A memory ordering operation without an associated atomic object. A fence can use the acquire semantics, release semantics, or acquire release semantics.
- Framework
-
A software system that contains the set of components to support software development and execution. A framework typically includes libraries, APIs, runtime systems, compilers, etc.
- Generic address space
-
An address space that includes the private, local, and global address spaces available to a device. The generic address space supports conversion of pointers to and from private, local and global address spaces, and hence lets a programmer write a single function that at compile time can take arguments from any of the three named address spaces.
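In OpenCL C 2.0, an unqualified pointer is in the generic address space, which is what makes the single-function pattern above possible. A small illustrative sketch:

```c
// OpenCL C 2.0: `int *` with no qualifier is a generic pointer, so one
// helper works for private, local, and global data.
void scale(int *p, int s)
{
    *p *= s;
}

__kernel void use_scale(__global int *g, __local int *l)
{
    int priv = 3;
    scale(g, 2);      // global pointer converts to generic
    scale(l, 2);      // local pointer converts to generic
    scale(&priv, 2);  // private pointer converts to generic
}
```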
- Global-happens-before
-
See Happens-before.
- Global ID
-
A global ID is used to uniquely identify a work-item and is derived from the number of global work-items specified when executing a kernel. The global ID is an N-dimensional value that starts at (0, 0, … 0). See also Local ID.
- Global Memory
-
A memory region accessible to all work-items executing in a context. It is accessible to the host using commands such as read, write and map. Global memory is included within the generic address space that includes the private and local address spaces.
- GL share group
-
A GL share group object manages shared OpenGL or OpenGL ES resources such as textures, buffers, framebuffers, and renderbuffers and is associated with one or more GL context objects. The GL share group is typically an opaque object and not directly accessible.
- Handle
-
An opaque type that references an object allocated by OpenCL. Any operation on an object occurs by reference to that object’s handle. Each object must have a unique handle value during the course of its lifetime. Handle values may be, but are not required to be, re-used by an implementation.
- Happens-before
-
An ordering relationship between operations that execute on multiple units of execution. If an operation A happens-before operation B then A must occur before B; in particular, any value written by A will be visible to B. We define two separate happens-before relations: global-happens-before and local-happens-before. These are defined in Memory Ordering Rules.
- Host
-
The host interacts with the context using the OpenCL API.
- Host-thread
-
The unit of execution that executes the statements in the host program.
- Host pointer
-
A pointer to memory that is in the virtual address space on the host.
- Illegal
-
Behavior of a system that is explicitly not allowed and will be reported as an error when encountered by OpenCL.
- Image Object
-
A memory object that stores a one-, two- or three-dimensional structured array. Image data can only be accessed with read and write functions. The read functions use a sampler.
The image object encapsulates the following information:
-
Dimensions of the image.
-
Description of each element in the image.
-
Properties that describe usage information and which region to allocate from.
-
Image data.
The elements of an image are selected from a list of predefined image formats.
-
- Implementation-Defined
-
Behavior that is explicitly allowed to vary between conforming implementations of OpenCL. An OpenCL implementor is required to document the implementation-defined behavior.
- Independent Forward Progress
-
If an entity supports independent forward progress, then if it is otherwise not dependent on any actions due to be performed by any other entity (for example it does not wait on a lock held by, and thus that must be released by, any other entity), then its execution cannot be blocked by the execution of any other entity in the system (it will not be starved). Work-items in a sub-group, for example, typically do not support independent forward progress, so one work-item in a sub-group may be completely blocked (starved) if a different work-item in the same sub-group enters a spin loop.
- In-order Execution
-
A model of execution in OpenCL where the commands in a command-queue are executed in order of submission with each command running to completion before the next one begins. See Out-of-order Execution.
- Intermediate Language
-
A lower-level language that may be used to create programs. SPIR-V is a required intermediate language (IL) for OpenCL 2.1 and 2.2 devices. Other OpenCL devices may optionally support SPIR-V or other ILs.
- Kernel
-
A kernel is a function declared in a program and executed on an OpenCL device. A kernel is identified by the __kernel or kernel qualifier applied to any function defined in a program.
- Kernel-instance
-
The work carried out by an OpenCL program occurs through the execution of kernel-instances on devices. The kernel instance is the kernel object, the values associated with the arguments to the kernel, and the parameters that define the ND-range index space.
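A kernel-instance is assembled on the host by pairing a kernel object with argument values and an ND-range. A sketch, assuming `queue`, `kernel`, and `buf` are valid objects:

```c
// Host C: bind an argument value to a kernel object, then enqueue a
// kernel-instance over a one-dimensional, 1024-item ND-range.
cl_int err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

size_t global_size = 1024;   // ND-range index space
size_t local_size  = 64;     // work-group size
err = clEnqueueNDRangeKernel(queue, kernel,
                             1,            // work_dim
                             NULL,         // global offset (0, 0, ... 0)
                             &global_size, &local_size,
                             0, NULL, NULL);
```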
- Kernel Object
-
A kernel object encapsulates a specific kernel function declared in a program and the argument values to be used when executing this kernel function.
- Kernel Language
-
A language that is used to represent source code for kernels. Kernels may be directly created from OpenCL C kernel language source strings. Other kernel languages may be supported by compiling to SPIR-V, another supported Intermediate Language, or to a device-specific program binary format.
- Launch
-
The transition of a command from the submitted state to the ready state. See Ready.
- Local ID
-
A local ID specifies a unique work-item ID within a given work-group that is executing a kernel. The local ID is an N-dimensional value that starts at (0, 0, … 0). See also Global ID.
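For a one-dimensional range with no global offset, the global and local IDs are related by `global_id = group_id * local_size + local_id`, which can be checked in a kernel:

```c
// OpenCL C: verify the relationship between global and local IDs.
__kernel void ids(__global int *out)
{
    size_t gid = get_global_id(0);
    size_t computed = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[gid] = (gid == computed);   // always 1 when the global offset is zero
}
```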
- Local Memory
-
A memory region associated with a work-group and accessible only by work-items in that work-group. Local memory is included within the generic address space that includes the private and global address spaces.
- Marker
-
A command queued in a command-queue that can be used to tag all commands queued before the marker in the command-queue. The marker command returns an event which can be used by the application to queue a wait on the marker event, i.e. wait for all commands queued before the marker command to complete.
- Memory Consistency Model
-
Rules that define which values are observed when multiple units of execution load data from any shared memory plus the synchronization operations that constrain the order of memory operations and define synchronization relationships. The memory consistency model in OpenCL is based on the memory model from the ISO C11 programming language.
- Memory Objects
-
A memory object is a handle to a reference counted region of Global Memory. Also see Buffer Object and Image Object.
- Memory Regions (or Pools)
-
A distinct address space in OpenCL. Memory regions may overlap in physical memory though OpenCL will treat them as logically distinct. The memory regions are denoted as private, local, constant, and global.
- Memory Scopes
-
These memory scopes define a hierarchy of visibilities when analyzing the ordering constraints of memory operations. They are defined by the values of the memory_scope enumeration constant. Current values are memory_scope_work_item (memory constraints only apply to a single work-item and in practice apply only to image operations), memory_scope_sub_group (memory-ordering constraints only apply to work-items executing in a sub-group), memory_scope_work_group (memory-ordering constraints only apply to work-items executing in a work-group), memory_scope_device (memory-ordering constraints only apply to work-items executing on a single device) and memory_scope_all_svm_devices or equivalently memory_scope_all_devices (memory-ordering constraints only apply to work-items executing across multiple devices and when using shared virtual memory).
- Modification Order
-
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens-before B, then A shall precede B in the modification order of M. Note that the modification order of an atomic object M is independent of whether M is in local or global memory.
- Nested Parallelism
-
See device-side enqueue.
- Object
-
Objects are abstract representations of the resources that can be manipulated by the OpenCL API. Examples include program objects, kernel objects, and memory objects.
- Out-of-order Execution
-
A model of execution in which commands placed in the work queue may begin and complete execution in any order consistent with constraints imposed by event wait lists and command-queue barriers. See In-order Execution.
- Parent device
-
The OpenCL device which is partitioned to create sub-devices. Not all parent devices are root devices. A root device might be partitioned and the sub-devices partitioned again. In this case, the first set of sub-devices would be parent devices of the second set, but not the root devices. Also see Device and Root device.
- Parent kernel
-
See Device-side enqueue.
- Pipe
-
The pipe memory object is conceptually an ordered sequence of data items. A pipe has two endpoints: a write endpoint into which data items are inserted, and a read endpoint from which data items are removed. At any one time, only one kernel instance may write into a pipe, and only one kernel instance may read from a pipe. To support the producer-consumer design pattern, one kernel instance connects to the write endpoint (the producer) while another kernel instance connects to the read endpoint (the consumer).
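The producer-consumer pairing can be sketched as two OpenCL C 2.0 kernels sharing one pipe argument:

```c
// OpenCL C 2.0: one producer kernel writes into a pipe, one consumer reads.
__kernel void producer(__write_only pipe int p)
{
    int v = (int)get_global_id(0);
    write_pipe(p, &v);              // insert at the write endpoint
}

__kernel void consumer(__read_only pipe int p, __global int *out)
{
    int v;
    if (read_pipe(p, &v) == 0)      // returns 0 on success
        out[get_global_id(0)] = v;  // remove from the read endpoint
}
```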
- Platform
-
The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform.
- Private Memory
-
A region of memory private to a work-item. Variables defined in one work-item's private memory are not visible to another work-item.
- Processing Element
-
A virtual scalar processor. A work-item may execute on one or more processing elements.
- Program
-
An OpenCL program consists of a set of kernels. Programs may also contain auxiliary functions called by the kernel functions and constant data.
- Program Object
-
A program object encapsulates the following information:
-
A reference to an associated context.
-
A program source or binary.
-
The latest successfully built program executable, the list of devices for which the program executable is built, the build options used and a build log.
-
The number of kernel objects currently attached.
-
- Queued
-
The first state in the six state model for the execution of a command. The transition into this state occurs when the command is enqueued into a command-queue.
- Ready
-
The third state in the six state model for the execution of a command. The transition into this state occurs when pre-requisites constraining execution of a command have been met; i.e. the command has been launched. When a kernel-enqueue command is launched, work-groups associated with the command are placed in a device's work-pool from which they are scheduled for execution.
- Re-converged Control Flow
-
See Control flow.
- Reference Count
-
The life span of an OpenCL object is determined by its reference count, an internal count of the number of references to the object. When you create an object in OpenCL, its reference count is set to one. Subsequent calls to the appropriate retain API (such as clRetainContext, clRetainCommandQueue) increment the reference count. Calls to the appropriate release API (such as clReleaseContext, clReleaseCommandQueue) decrement the reference count. Implementations may also modify the reference count, e.g. to track attached objects or to ensure correct operation of in-progress or scheduled activities. The object becomes inaccessible to host code when the number of release operations performed matches the number of retain operations plus the allocation of the object. At this point the reference count may be zero but this is not guaranteed.
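A sketch of the counting rule above for a buffer object, assuming `buf` is a valid cl_mem whose count is one on creation:

```c
// Host C: reference counting on a buffer object.
clRetainMemObject(buf);    // count: 2 - e.g. a second module keeps a reference
clReleaseMemObject(buf);   // count: 1 - that module is done with it
clReleaseMemObject(buf);   // releases now match retains plus creation, so
                           // buf is inaccessible to host code
```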
- Relaxed Consistency
-
A memory consistency model in which the contents of memory visible to different work-items or commands may be different except at a barrier or other explicit synchronization points.
- Relaxed Semantics
-
A memory order semantics for atomic operations that implies no order constraints. The operation is atomic but it has no impact on the order of memory operations.
- Release Semantics
-
One of the memory order semantics defined for synchronization operations. Release semantics apply to atomic operations that store to memory. Given two units of execution, A and B, acting on a shared atomic object M, if A uses an atomic store of M with release semantics to synchronize-with an atomic load to M by B that used acquire semantics, then A's atomic store will occur after any prior operations by A. Note that the memory orders acquire, sequentially consistent, and acquire_release all include acquire semantics and effectively pair with a store using release semantics.
- Remainder work-groups
-
When the work-groups associated with a kernel-instance are defined, the sizes of a work-group in each dimension may not evenly divide the size of the ND-range in the corresponding dimensions. The result is a collection of work-groups on the boundaries of the ND-range that are smaller than the base work-group size. These are known as remainder work-groups.
- Running
-
The fourth state in the six state model for the execution of a command. The transition into this state occurs when the execution of the command starts. When a Kernel-enqueue command starts, one or more work-groups associated with the command start to execute.
- Root device
-
A root device is an OpenCL device that has not been partitioned. Also see Device and Parent device.
- Resource
-
A class of objects defined by OpenCL. An instance of a resource is an object. The most common resources are the context, command-queue, program objects, kernel objects, and memory objects. Computational resources are hardware elements that participate in the action of advancing a program counter. Examples include the host, devices, compute units and processing elements.
- Retain, Release
-
The action of incrementing (retain) and decrementing (release) the reference count of an OpenCL object. This is bookkeeping functionality to make sure the system doesn’t remove an object before all instances that use this object have finished. Refer to Reference Count.
- Sampler
-
An object that describes how to sample an image when the image is read in the kernel. The image read functions take a sampler as an argument. The sampler specifies the image addressing-mode, i.e. how out-of-range image coordinates are handled, the filter mode, and whether the input image coordinate is a normalized or unnormalized value.
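The three sampler properties can also be set directly in OpenCL C; a minimal sketch:

```c
// OpenCL C: a sampler with unnormalized coordinates, clamp-to-edge
// addressing, and nearest filtering, used by an image read function.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_pixel(__read_only image2d_t img, __global float4 *out)
{
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    out[coord.y * get_global_size(0) + coord.x] = read_imagef(img, smp, coord);
}
```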
- Scope inclusion
-
Two actions A and B are defined to have an inclusive scope if they have the same scope P such that: (1) if P is memory_scope_sub_group, and A and B are executed by work-items within the same sub-group, or (2) if P is memory_scope_work_group, and A and B are executed by work-items within the same work-group, or (3) if P is memory_scope_device, and A and B are executed by work-items on the same device, or (4) if P is memory_scope_all_svm_devices or memory_scope_all_devices, if A and B are executed by host threads or by work-items on one or more devices that can share SVM memory with each other and the host process.
- Sequenced before
-
A relation between evaluations executed by a single unit of execution. Sequenced-before is an asymmetric, transitive, pair-wise relation that induces a partial order between evaluations. Given any two evaluations A and B, if A is sequenced-before B, then the execution of A shall precede the execution of B.
- Sequential consistency
-
Sequential consistency interleaves the steps executed by each unit of execution. Each access to a memory location sees the last assignment to that location in that interleaving.
- Sequentially consistent semantics
-
One of the memory order semantics defined for synchronization operations. When using sequentially-consistent synchronization operations, the loads and stores within one unit of execution appear to execute in program order (i.e., the sequenced-before order), and loads and stores from different units of execution appear to be simply interleaved.
- Shared Virtual Memory (SVM)
-
An address space exposed to both the host and the devices within a context. SVM causes addresses to be meaningful between the host and all of the devices within a context and therefore supports the use of pointer based data structures in OpenCL kernels. It logically extends a portion of the global memory into the host address space therefore giving work-items access to the host address space. There are three types of SVM in OpenCL:
- Coarse-Grained buffer SVM
-
Sharing occurs at the granularity of regions of OpenCL buffer memory objects.
- Fine-Grained buffer SVM
-
Sharing occurs at the granularity of individual loads/stores into bytes within OpenCL buffer memory objects.
- Fine-Grained system SVM
-
Sharing occurs at the granularity of individual loads/stores into bytes occurring anywhere within the host memory.
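A host-side sketch of the first (coarse-grained buffer) variant, assuming `context`, `queue`, and `kernel` are valid objects created earlier:

```c
// Host C: coarse-grained buffer SVM. The same pointer value is meaningful
// on the host and on the devices in the context.
size_t bytes = 1024 * sizeof(float);
float *p = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0);

/* With coarse-grained SVM, the host maps the region before accessing it. */
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, p, bytes, 0, NULL, NULL);
p[0] = 1.0f;
clEnqueueSVMUnmap(queue, p, 0, NULL, NULL);

clSetKernelArgSVMPointer(kernel, 0, p);  /* kernel sees the same address */
clSVMFree(context, p);
```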
- SIMD
-
Single Instruction Multiple Data. A programming model where a kernel is executed concurrently on multiple processing elements each with its own data and a shared program counter. All processing elements execute a strictly identical set of instructions.
- Specialization constants
-
Specialization constants are special constant objects that do not have known constant values in an intermediate language (e.g. SPIR-V). Applications may provide updated values for the specialization constants before a program is built. Specialization constants that do not receive a value from an application shall use the default specialization constant value.
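A host-side sketch of providing a value before the build, assuming `program` was created from an intermediate language and the constant was declared with SPIR-V SpecId 7 (both hypothetical here):

```c
// Host C: set specialization constant SpecId 7 to 64, then build.
cl_uint wg_size = 64;
clSetProgramSpecializationConstant(program, 7, sizeof(wg_size), &wg_size);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
```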
- SPMD
-
Single Program Multiple Data. A programming model where a kernel is executed concurrently on multiple processing elements each with its own data and its own program counter. Hence, while all computational resources run the same kernel they maintain their own instruction counter and due to branches in a kernel, the actual sequence of instructions can be quite different across the set of processing elements.
- Sub-device
-
An OpenCL device can be partitioned into multiple sub-devices. The new sub-devices alias specific collections of compute units within the parent device, according to a partition scheme. The sub-devices may be used in any situation that their parent device may be used. Partitioning a device does not destroy the parent device, which may continue to be used alongside and intermingled with its child sub-devices. Also see Device, Parent device and Root device.
- Sub-group
-
Sub-groups are an implementation-dependent grouping of work-items within a work-group. The size and number of sub-groups is implementation-defined.
- Sub-group Barrier
-
See Barrier.
- Submitted
-
The second state in the six state model for the execution of a command. The transition into this state occurs when the command is flushed from the command-queue and submitted for execution on the device. Once submitted, a programmer can assume a command will execute once its prerequisites have been met.
- SVM Buffer
-
A memory allocation enabled to work with Shared Virtual Memory (SVM). Depending on how the SVM buffer is created, it can be a coarse-grained or fine-grained SVM buffer. Optionally it may be wrapped by a Buffer Object. See Shared Virtual Memory (SVM).
- Synchronization
-
Synchronization refers to mechanisms that constrain the order of execution and the visibility of memory operations between two or more units of execution.
- Synchronization operations
-
Operations that define memory order constraints in a program. They play a special role in controlling how memory operations in one unit of execution (such as work-items or, when using SVM, a host thread) are made visible to another. Synchronization operations in OpenCL include atomic operations and fences.
- Synchronization point
-
A synchronization point between a pair of commands (A and B) ensures that the results of command A happen-before command B is launched (i.e. enters the ready state).
- Synchronizes with
-
A relation between operations in two different units of execution that defines a memory order constraint in global memory (global-synchronizes-with) or local memory (local-synchronizes-with).
- Task Parallel Programming Model
-
A programming model in which computations are expressed in terms of multiple concurrent tasks executing in one or more command-queues. The concurrent tasks can be running different kernels.
- Thread-safe
-
An OpenCL API call is considered to be thread-safe if the internal state as managed by OpenCL remains consistent when called simultaneously by multiple host threads. OpenCL API calls that are thread-safe allow an application to call these functions in multiple host threads without having to implement mutual exclusion across these host threads.
- Undefined
-
The behavior of an OpenCL API call, built-in function used inside a kernel or execution of a kernel that is explicitly not defined by OpenCL. A conforming implementation is not required to specify what occurs when an undefined construct is encountered in OpenCL.
- Unit of execution
-
A generic term for a process, OS managed thread running on the host (a host-thread), kernel-instance, host program, work-item or any other executable agent that advances the work associated with a program.
- Valid Object
-
An OpenCL object is considered valid if it meets all of the following criteria:
-
The object was created by a successful call to an OpenCL API function.
-
The object has a strictly positive application-owned reference count.
-
The object has not had its backing memory changed outside of normal usage by the OpenCL implementation (e.g. corrupted by the application, a library it uses, the implementation itself, or any other agent that can access the object’s backing memory).
An object is only valid in the platform where it was created.
An OpenCL implementation must check for a NULL object to determine if an object is valid. The behavior for all other invalid objects is implementation-defined.
-
- Work-group
-
A collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel-instance and share local memory and work-group functions.
- Work-group Barrier
-
See Barrier.
- Work-group Function
-
A function that carries out collective operations across all the work-items in a work-group. Available collective operations are a barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A work-group function must occur within a converged control flow; i.e. all work-items in the work-group must encounter precisely the same work-group function.
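One of the collective operations above, a reduction, can be sketched as an OpenCL C 2.0 kernel; every work-item in the work-group reaches the call and receives the same sum:

```c
// OpenCL C 2.0: per-work-group sum via a work-group function.
__kernel void sum_groups(__global const float *in, __global float *out)
{
    float sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        out[get_group_id(0)] = sum;   // one result per work-group
}
```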
- Work-group Synchronization
-
Constraints on the order of execution for work-items in a single work-group.
- Work-pool
-
A logical pool associated with a device that holds commands and work-groups from kernel-instances that are ready to execute. OpenCL does not constrain the order that commands and work-groups are scheduled for execution from the work-pool; i.e. a programmer must assume that they could be interleaved. There is one work-pool per device used by all command-queues associated with that device. The work-pool may be implemented in any manner as long as it assures that work-groups placed in the pool will eventually execute.
- Work-item
-
One of a collection of parallel executions of a kernel invoked on a device by a command. A work-item is executed by one or more processing elements as part of a work-group executing on a compute unit. A work-item is distinguished from other work-items by its global ID or the combination of its work-group ID and its local ID within a work-group.