Commit cf43f3c: initial version of cl_intel_subgroup_matrix_multiply_accumulate_tf32 (KhronosGroup#1487)
1 parent 6e02cd9; 1 file changed: 196 additions & 0 deletions

:data-uri:
:sectanchors:
:icons: font
:source-highlighter: coderay

ifdef::backend-html5[]
:cl_intel_subgroup_matrix_multiply_accumulate_tf32: cl_intel_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate_tf32
:cl_intel_subgroup_matrix_multiply_accumulate: cl_intel_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate
:SPV_INTEL_subgroup_matrix_multiply_accumulate: SPV_INTEL_{zwsp}subgroup_{zwsp}matrix_{zwsp}multiply_{zwsp}accumulate
:SPV_INTEL_tensor_float32_conversion: SPV_INTEL_{zwsp}tensor_{zwsp}float32_{zwsp}conversion
endif::[]
ifndef::backend-html5[]
:cl_intel_subgroup_matrix_multiply_accumulate_tf32: cl_intel_​subgroup_​matrix_​multiply_​accumulate_tf32
:cl_intel_subgroup_matrix_multiply_accumulate: cl_intel_​subgroup_​matrix_​multiply_​accumulate
:SPV_INTEL_subgroup_matrix_multiply_accumulate: SPV_INTEL_​subgroup_​matrix_​multiply_​accumulate
:SPV_INTEL_tensor_float32_conversion: SPV_INTEL_​tensor_​float32_​conversion
endif::[]

= {cl_intel_subgroup_matrix_multiply_accumulate_tf32}

== Name Strings

`{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`

== Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

== Contributors

// spell-checker: disable
Ben Ashbaugh, Intel +
Junjie Gu, Intel +
Bartosz Koscielak, Intel +
Yury Plyakhin, Intel +
Dmitry Sidorov, Intel +
Lukasz Towarek, Intel +
// spell-checker: enable

== Notice

Copyright (c) 2025 Intel Corporation. All rights reserved.

== Status

Complete

== Version

Built On: 2025-10-22 +
Revision: 1.0.0

== Dependencies

This extension is written against the OpenCL 3.0 C Language specification, V3.0.19.

This extension builds on and hence requires support for the `{cl_intel_subgroup_matrix_multiply_accumulate}` extension.

== Overview

This extension extends the `{cl_intel_subgroup_matrix_multiply_accumulate}` extension by adding functions that operate on matrices of "TensorFloat-32" data, also known as `tf32` data.
The `tf32` format has a dynamic range similar to that of the `fp32` or `float` format, and precision similar to that of the `fp16` or `half` format.

== New API Functions

None.

== New API Enums

None.

== New OpenCL C Functions

[source]
----
// These functions are available to devices where the minimum subgroup
// size is 16. For these devices, the subgroup size must be 16 (the
// minimum supported subgroup size). Calling these functions on other
// devices or from kernels with a different subgroup size is undefined
// behavior:

float intel_sub_group_tf32_tf32_matrix_mad_k8(float a, float8 b, float acc);
float2 intel_sub_group_tf32_tf32_matrix_mad_k8(float a, float8 b, float2 acc);
float4 intel_sub_group_tf32_tf32_matrix_mad_k8(float2 a, float8 b, float4 acc);
float8 intel_sub_group_tf32_tf32_matrix_mad_k8(float4 a, float8 b, float8 acc);

// Conversions:

float intel_convert_tfloat32_as_float(float source);
float2 intel_convert_tfloat322_as_float2(float2 source);
float3 intel_convert_tfloat323_as_float3(float3 source);
float4 intel_convert_tfloat324_as_float4(float4 source);
float8 intel_convert_tfloat328_as_float8(float8 source);
float16 intel_convert_tfloat3216_as_float16(float16 source);
----

== Modifications to the OpenCL C Specification

=== Add a new Section 6.3.1.X - The `tf32` Format

The `TensorFloat-32` or `tf32` format is a 32-bit floating-point format, similar to the single-precision `float` format.
It has one sign bit, eight exponent bits, and 23 mantissa bits.
Only 10 mantissa bits are used when performing operations on `tf32` data, similar to the half-precision 16-bit `half` format.
This means that the `tf32` format has a dynamic range similar to that of the `float` format, and precision similar to that of the `half` format.

The `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}` extension does not add `tf32` as a supported data type for OpenCL kernels; however, the matrix multiplication functions added by the extension interpret their `float` operands as `tf32` data when performing the matrix multiplication operation.

A 32-bit `float` can be converted (rounded) to a `tf32` value using the following suite of functions.
For these functions, the only supported rounding mode is the default rounding mode, which is round-to-nearest-even ("rte"):

[source]
----
float intel_convert_tfloat32_as_float(float source);
float2 intel_convert_tfloat322_as_float2(float2 source);
float3 intel_convert_tfloat323_as_float3(float3 source);
float4 intel_convert_tfloat324_as_float4(float4 source);
float8 intel_convert_tfloat328_as_float8(float8 source);
float16 intel_convert_tfloat3216_as_float16(float16 source);
----

=== Add a new Section 6.13.X.Y - `tf32` Subgroup Matrix Multiply Accumulate Functions

This section describes a family of built-in functions that multiply two `tf32` matrix sources `a` and `b` and then add a 32-bit `float` matrix accumulation value to produce a 32-bit `float` matrix result value.
`a` is the first `tf32` matrix operand and has M rows and K columns.
`b` is the second `tf32` matrix operand and has K rows and N columns.
`acc` is the `float` matrix accumulation value and has M rows and N columns.
The result `float` matrix also has M rows and N columns.
All work items in the subgroup cooperate to perform this operation.
These functions must be encountered by all work items in the subgroup executing the kernel.

The full list of supported `tf32` functions is described in the overview, above.
For this list of functions:

* `M` may be equal to 1, 2, 4, or 8.
* `N` must be equal to 16.
In other words, the only supported subgroup size is 16.
* The supported floating-point matrix types for `a` and `b` are 32-bit `float` data that is interpreted as `tf32` data when performing the matrix multiplication operation.
For these `tf32` matrices, `K` must be equal to 8.
The accumulation value `acc` and result value are 32-bit `float` values.
* Because `N` must be equal to 16 and `K` must be equal to 8, each work-item contributes every other row of the matrix `a`.
For `M` equal to one, only the first `K` work-items contribute to the matrix `a`, and contributions from the remaining work-items are ignored.
For other values of `M`, the first `K` work-items contribute the even rows of the matrix `a`, and the remaining work-items contribute the odd rows of the matrix `a`.

== Modifications to the OpenCL SPIR-V Environment Specification

=== Add a new section 5.2.X - `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`

If the OpenCL environment supports the extension `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`, then the environment must accept modules that declare use of the extension `{SPV_INTEL_subgroup_matrix_multiply_accumulate}` and that declare the SPIR-V capability *SubgroupMatrixMultiplyAccumulateINTEL*.

For devices where the minimum subgroup size is 16, the following matrix dimensions and types are supported.
For these devices, the subgroup size must be 16 (the minimum subgroup size).
Behavior is undefined if these functions are called on other devices or from kernels with a different subgroup size:

[cols="^1,^1,^1,^2,^2,^2,^2",width="100%"]
[options="header"]
|=====
| M Dimension | N Dimension | K Dimension | Result Type | Matrix A Type | Matrix B Type | Matrix C Type

// f32 = tf32 x tf32 + f32
7+<| *tf32 matrix sources, fp32 accumulator*:
| 1, 2, 4, 8 | 16 | 8 | `M x float32_t`
| `ceil(M/2) x float32_t` with *MatrixATF32INTEL*
| `8 x float32_t` with *MatrixBTF32INTEL*
| `M x float32_t`

|=====

Additionally, if the OpenCL environment supports the extension `{cl_intel_subgroup_matrix_multiply_accumulate_tf32}`, then the environment must accept modules that declare use of the extension `{SPV_INTEL_tensor_float32_conversion}` and that declare the SPIR-V capability *TensorFloat32RoundingINTEL*.

== Issues

None.

// . Issue?
// +
// --
// *UNRESOLVED*: Description.
// --

== Revision History

[cols="5,15,15,70"]
[grid="rows"]
[options="header"]
|========================================
|Rev|Date|Author|Changes
|1.0.0|2025-10-23|Ben Ashbaugh|*Initial public revision*
|========================================

//************************************************************************
//Other formatting suggestions:
//
//* Use *bold* text for host APIs, or [source] syntax highlighting.
//* Use `mono` text for device APIs, or [source] syntax highlighting.
//* Use `mono` text for extension names, types, or enum values.
//* Use _italics_ for parameters.
//************************************************************************
