:data-uri:
:sectanchors:
:icons: font
:source-highlighter: coderay

ifdef::backend-html5[]
:cl_intel_subgroup_matrix_multiply_accumulate_tf32: cl_intel_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate_tf32
:cl_intel_subgroup_matrix_multiply_accumulate: cl_intel_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate
:SPV_INTEL_subgroup_matrix_multiply_accumulate: SPV_INTEL_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate
:SPV_INTEL_tensor_float32_conversion: SPV_INTEL_{zwsp}tensor_{zwsp}float32_{zwsp}conversion
endif::[]
ifndef::backend-html5[]
:cl_intel_subgroup_matrix_multiply_accumulate_tf32: cl_intel_​subgroup_​matrix_​multiply_​accumulate_tf32
:cl_intel_subgroup_matrix_multiply_accumulate: cl_intel_​subgroup_​matrix_​multiply_​accumulate
:SPV_INTEL_subgroup_matrix_multiply_accumulate: SPV_INTEL_​subgroup_​matrix_​multiply_​accumulate
:SPV_INTEL_tensor_float32_conversion: SPV_INTEL_​tensor_​float32_​conversion
endif::[]

= {cl_intel_subgroup_matrix_multiply_accumulate_tf32}

== Name Strings

`{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`

== Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

== Contributors

// spell-checker: disable
Ben Ashbaugh, Intel +
Junjie Gu, Intel +
Bartosz Koscielak, Intel +
Yury Plyakhin, Intel +
Dmitry Sidorov, Intel +
Lukasz Towarek, Intel +
// spell-checker: enable

== Notice

Copyright (c) 2025 Intel Corporation. All rights reserved.

== Status

Complete

== Version

Built On: 2025-10-22 +
Revision: 1.0.0

== Dependencies

This extension is written against the OpenCL 3.0 C Language specification, V3.0.19.

This extension builds on and hence requires support for the `{cl_intel_subgroup_matrix_multiply_accumulate}` extension.

== Overview

This extension extends the `{cl_intel_subgroup_matrix_multiply_accumulate}` extension by adding functions that operate on matrices of "TensorFloat-32" data, also known as `tf32` data.
The `tf32` format has a dynamic range similar to that of the `fp32` (`float`) format and precision similar to that of the `fp16` (`half`) format.

== New API Functions

None.

== New API Enums

None.

== New OpenCL C Functions

[source]
----
// These functions are available to devices where the minimum subgroup
// size is 16. For these devices, the subgroup size must be 16 (the
// minimum supported subgroup size). Calling these functions on other
// devices or from kernels with a different subgroup size is undefined
// behavior:

float intel_sub_group_tf32_tf32_matrix_mad_k8(float a, float8 b, float acc);
float2 intel_sub_group_tf32_tf32_matrix_mad_k8(float a, float8 b, float2 acc);
float4 intel_sub_group_tf32_tf32_matrix_mad_k8(float2 a, float8 b, float4 acc);
float8 intel_sub_group_tf32_tf32_matrix_mad_k8(float4 a, float8 b, float8 acc);

// Conversions:

float intel_convert_tfloat32_as_float(float source);
float2 intel_convert_tfloat322_as_float2(float2 source);
float3 intel_convert_tfloat323_as_float3(float3 source);
float4 intel_convert_tfloat324_as_float4(float4 source);
float8 intel_convert_tfloat328_as_float8(float8 source);
float16 intel_convert_tfloat3216_as_float16(float16 source);
----

== Modifications to the OpenCL C Specification

=== Add a new Section 6.3.1.X - The `tf32` Format

The `TensorFloat-32` or `tf32` format is a 32-bit floating-point format, similar to the single-precision `float` format.
It has one sign bit, eight exponent bits, and 23 mantissa bits.
Only 10 mantissa bits are used when performing operations on `tf32` data, similar to the half-precision 16-bit `half` format.
This means that the `tf32` format has a dynamic range similar to that of the `float` format and precision similar to that of the `half` format.

The `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}` extension does not add `tf32` as a supported data type for OpenCL kernels; however, the matrix multiplication functions added by the extension interpret their `float` operands as `tf32` data when performing the matrix multiplication operation.

A 32-bit `float` can be converted (rounded) to a `tf32` value using the following suite of functions.
For these functions, the only supported rounding mode is the default rounding mode, round-to-nearest-even ("rte"):

[source]
----
float intel_convert_tfloat32_as_float(float source);
float2 intel_convert_tfloat322_as_float2(float2 source);
float3 intel_convert_tfloat323_as_float3(float3 source);
float4 intel_convert_tfloat324_as_float4(float4 source);
float8 intel_convert_tfloat328_as_float8(float8 source);
float16 intel_convert_tfloat3216_as_float16(float16 source);
----
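The rounding these conversions perform can be sketched on the host.
The following is an illustrative emulation, not part of the extension: `tf32_round` is a hypothetical helper that rounds the 23-bit `float` mantissa to the 10 bits that `tf32` operations use, with round-to-nearest-even (NaN payloads and values that round past `FLT_MAX` are not handled here).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side emulation of the rounding performed by the
 * intel_convert_tfloat32* functions above: round the 23-bit float
 * mantissa to the 10 bits that tf32 operations use, with
 * round-to-nearest-even. */
static float tf32_round(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));     /* well-defined type pun */

    /* 23 - 10 = 13 low mantissa bits are discarded. Adding this bias
     * implements round-to-nearest-even: 0x0FFF alone rounds the exact
     * halfway case down, and the extra 1 when the lowest kept bit is
     * set rounds it up, so ties land on an even mantissa. */
    bits += 0x0FFFu + ((bits >> 13) & 1u);
    bits &= ~(uint32_t)0x1FFF;           /* clear the discarded bits */

    memcpy(&x, &bits, sizeof(x));
    return x;
}
```

For example, `1.0f + 0x1p-11f` rounds back to `1.0f` because the increment is below `tf32` precision, while `1.0f + 0x1p-10f` is preserved, since `2^-10` is one `tf32` ULP at 1.0; small magnitudes such as `1.0e-30f` survive because the exponent range is untouched.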

=== Add a new Section 6.13.X.Y - `tf32` Subgroup Matrix Multiply Accumulate Functions

This section describes a family of built-in functions that multiply two `tf32` matrix sources `a` and `b` and then add a 32-bit `float` matrix accumulation value to produce a 32-bit `float` matrix result value.
`a` is the first `tf32` matrix operand and has M rows and K columns.
`b` is the second `tf32` matrix operand and has K rows and N columns.
`acc` is the `float` matrix accumulation value and has M rows and N columns.
The result `float` matrix also has M rows and N columns.
All work-items in the subgroup cooperate to perform this operation.
These functions must be encountered by all work-items in the subgroup executing the kernel.

The full list of supported `tf32` functions is given in the _New OpenCL C Functions_ section, above.
For this list of functions:

* `M` may be equal to 1, 2, 4, or 8.
* `N` must be equal to 16.
In other words, the only supported subgroup size is 16.
* The supported floating-point matrix types for `a` and `b` are 32-bit `float` data that is interpreted as `tf32` data when performing the matrix multiplication operation.
For these `tf32` matrices, `K` must be equal to 8.
The accumulation value `acc` and result value are 32-bit `float` values.
* Because `N` must be equal to 16 and `K` must be equal to 8, each work-item contributes every other row of the matrix `a`.
For `M` equal to one, only the first `K` work-items contribute to the matrix `a`, and contributions from the remaining work-items are ignored.
For other values of `M`, the first `K` work-items contribute the even rows of the matrix `a`, and the remaining work-items contribute the odd rows of the matrix `a`.
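The arithmetic defined above can be sketched with a plain scalar reference.
This is a hypothetical host-side model, assuming the `a` and `b` elements are rounded to `tf32` before multiplication and the products are accumulated in `float`; the distribution of matrix elements across the 16 work-items described in the bullets above is not modeled, and the fixed `M = 2` is just one of the supported shapes.

```c
#include <stdint.h>
#include <string.h>

enum { M = 2, K = 8, N = 16 };  /* one supported shape: M in {1,2,4,8} */

/* Round a float to the 10 mantissa bits used by tf32 (RNE), per 6.3.1.X. */
static float tf32_round(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits += 0x0FFFu + ((bits >> 13) & 1u);
    bits &= ~(uint32_t)0x1FFF;
    memcpy(&x, &bits, sizeof(x));
    return x;
}

/* Hypothetical scalar reference for the built-ins' result:
 * result = a (M x K, tf32) * b (K x N, tf32) + acc (M x N, float).
 * The real built-ins compute this cooperatively across the subgroup;
 * this sketch only captures the arithmetic, not the data distribution. */
static void tf32_matrix_mad_ref(const float a[M][K], const float b[K][N],
                                const float acc[M][N], float result[M][N])
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float sum = acc[m][n];
            for (int k = 0; k < K; k++)
                sum += tf32_round(a[m][k]) * tf32_round(b[k][n]);
            result[m][n] = sum;
        }
}
```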

== Modifications to the OpenCL SPIR-V Environment Specification

=== Add a new section 5.2.X - `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`

If the OpenCL environment supports the extension `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`, then the environment must accept modules that declare use of the extension `{SPV_INTEL_subgroup_matrix_multiply_accumulate}` and that declare the SPIR-V capability *SubgroupMatrixMultiplyAccumulateINTEL*.

For devices where the minimum subgroup size is 16, the following matrix dimensions and types are supported.
For these devices, the subgroup size must be 16 (the minimum subgroup size).
Behavior is undefined if these functions are called on other devices or from kernels with a different subgroup size:

[cols="^1,^1,^1,^2,^2,^2,^2",width="100%"]
[options="header"]
|=====
| M Dimension | N Dimension | K Dimension | Result Type | Matrix A Type | Matrix B Type | Matrix C Type

// f32 = tf32 x tf32 + f32
7+<| *tf32 matrix sources, fp32 accumulator*:
| 1, 2, 4, 8 | 16 | 8 | `M x float32_t`
| `ceil(M/2) x float32_t` with *MatrixATF32INTEL*
| `8 x float32_t` with *MatrixBTF32INTEL*
| `M x float32_t`

|=====
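The operand types in the table fix how many `float` elements each of the 16 work-items holds for each matrix; a small sketch of those counts (the helper names are illustrative only, not part of the extension):

```c
/* Per-work-item element counts implied by the table above, for a
 * subgroup size of 16 with K = 8 and N = 16. Helper names are
 * hypothetical. */
static int a_floats_per_work_item(int M)   { return (M + 1) / 2; } /* ceil(M/2) */
static int b_floats_per_work_item(void)    { return 8; }
static int acc_floats_per_work_item(int M) { return M; } /* also the result */
```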

Additionally, if the OpenCL environment supports the extension `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`, then the environment must accept modules that declare use of the extension `{SPV_INTEL_tensor_float32_conversion}` and that declare the SPIR-V capability *TensorFloat32RoundingINTEL*.

== Issues

None.

// . Issue?
// +
// --
// *UNRESOLVED*: Description.
// --

== Revision History

[cols="5,15,15,70"]
[grid="rows"]
[options="header"]
|========================================
|Rev|Date|Author|Changes
|1.0.0|2025-10-23|Ben Ashbaugh|*Initial public revision*
|========================================

//************************************************************************
//Other formatting suggestions:
//
//* Use *bold* text for host APIs, or [source] syntax highlighting.
//* Use `mono` text for device APIs, or [source] syntax highlighting.
//* Use `mono` text for extension names, types, or enum values.
//* Use _italics_ for parameters.
//************************************************************************