The programming language used to write compute kernels is called the kernel language. OpenCL adopts C/C++-based languages to specify the kernel computations performed on the device, with some restrictions and additions that facilitate efficient mapping to the heterogeneous hardware resources of accelerators. Traditionally, OpenCL C was used to program accelerators in the OpenCL standard; later, the C++ for OpenCL kernel language was developed, which inherits all functionality from OpenCL C but allows C++ features to be used in kernel sources.
=== OpenCL C language ===
OpenCL C is a C99-based language dialect adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program.
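As a rough illustration of these annotations, the sketch below shows a small kernel with a `__kernel` entry point and pointer arguments qualified with `__global` and `__constant`. The kernel name `scale_add`, its parameters, the driver `run_scale_add`, and the host-side stand-in macros and `get_global_id` stub are assumptions made for this example; the stand-ins exist only so that the same body also compiles as plain C without an OpenCL toolchain.

```c
/* Host-side stand-ins (an assumption of this sketch): when the file is not
   compiled as OpenCL C, the qualifiers expand to plain C and a stub replaces
   the built-in work-item query. An OpenCL C compiler skips this block. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __constant const
static size_t emulated_global_id;   /* set by the host-side driver loop below */
static size_t get_global_id(unsigned dim) { (void)dim; return emulated_global_id; }
#endif

/* Entry point callable from the host: each work-item handles one element. */
__kernel void scale_add(__global const float *in,
                        __global float *out,
                        __constant float *coeff)
{
    size_t i = get_global_id(0);    /* index of this work-item */
    out[i] = in[i] * coeff[0] + coeff[1];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical host-side driver: runs the kernel body once per work-item. */
void run_scale_add(const float *in, float *out, const float *coeff, size_t n)
{
    for (emulated_global_id = 0; emulated_global_id < n; ++emulated_global_id)
        scale_add(in, out, coeff);
}
#endif
```

On a real device the loop in `run_scale_add` is replaced by the runtime launching one work-item per index, and the qualifiers tell the compiler in which memory region each pointer's data lives.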
Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden. The C standard library is replaced by a custom set of standard functions, geared toward math programming. OpenCL C is extended to facilitate use of
parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.

The following example runs a fast Fourier transform (FFT) kernel. The host code asks the OpenCL library for the first available graphics card, creates memory buffers for reading and writing (from the perspective of the graphics card), JIT-compiles the FFT kernel and then finally asynchronously runs the kernel. The result from the transform is not read in this example. This is illustrative example code, not intended for serious use, so error handling is entirely omitted.

```c
#include "CL/opencl.h"

#define NUM_ENTRIES 1024

int main() // (int argc, const char* argv[])
{
    // CONSTANTS
    // The source code of the kernel is represented as a string
    // located inside file: "fft1D_1024_kernel_src.cl". For the details see the next listing.
    // The file is wrapped in a raw string literal, so this host code must be compiled as C++11 or later.
    const char *KernelSource =
        #include "fft1D_1024_kernel_src.cl"
        ;

    // Looking up the first platform and its first available GPU
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id devices[1];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, devices, NULL);

    // create a compute context with the GPU device
    cl_context context = clCreateContext(NULL, 1, devices, NULL, NULL, NULL);

    // create a command queue on the same device
    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // host-side input: NUM_ENTRIES complex values (interleaved real and imaginary parts),
    // zero-filled placeholder data required by CL_MEM_COPY_HOST_PTR
    static float input[2 * NUM_ENTRIES];

    // allocate the buffer memory objects
    cl_mem memobjs[] = {
        clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * 2 * NUM_ENTRIES, input, NULL),
        clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * 2 * NUM_ENTRIES, NULL, NULL)
    };

    // create the compute program
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&KernelSource, NULL, NULL);

    // build the compute program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the compute kernel
    cl_kernel kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // set the args values; arguments 2 and 3 only reserve local memory
    size_t local_work_size[1] = { 256 };

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);

    // create N-D range object with work-item dimensions and execute kernel
    size_t global_work_size[1] = { 256 };

    global_work_size[0] = NUM_ENTRIES;
    local_work_size[0] = 64; // Nvidia: 192 or 256
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
}
```

The actual calculation inside file "fft1D_1024_kernel_src.cl" (based on "Fitting FFT onto the G80 Architecture"):

```c
R"(
  // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
  // calls to a radix 16 function, another radix 16 function and then a radix 4 function

  __kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                            __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads
    fftRadix16Pass(data);      // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
  }
)"
```

A full, open source implementation of an OpenCL FFT can be found on Apple's website.
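The work-group arithmetic behind the `global_work_size` and `local_work_size` arguments above can be sketched in plain C. The helper names `num_work_groups` and `global_id_of` are hypothetical; the sketch assumes a one-dimensional range with no global offset whose global size is a multiple of the local size.

```c
#include <stddef.h>

/* Number of work-groups in a 1-D range.
   Returns 0 if the local size is zero or does not divide the global size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Global ID that get_global_id(0) would report for the work-item with the
   given group ID and local ID (zero global offset assumed). */
size_t global_id_of(size_t group_id, size_t local_id, size_t local_size)
{
    return group_id * local_size + local_id;
}
```

With the values from the host code (a global size of 1024 and a local size of 64), `num_work_groups(1024, 64)` gives 16 work-groups, and the last work-item of the last group has global ID `global_id_of(15, 63, 64)`, which is 1023.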
=== OpenCL C++ language ===
OpenCL C++ is a short-lived specification for a language that combines OpenCL C and C++14. It was intended to be built in an online mode only, by passing the -cl-std=c++ parameter in clBuildProgram(). No extension for detecting support for this language is described, and it is unknown whether any driver actually supported it.

It was followed by the transition to the community-driven C++ for OpenCL programming language, which provides features from C++17 in combination with the traditional OpenCL C features. This language makes it possible to leverage a rich variety of language features from standard C++ while preserving backward compatibility with OpenCL C. This opens up a smooth transition path to C++ functionality for OpenCL kernel code developers, as they can continue using a familiar programming flow and even tools, as well as leverage existing extensions and libraries available for OpenCL C. The language semantics are described in the documentation published in the releases of the OpenCL-Docs repository hosted by the Khronos Group, but the language is currently not ratified by the Khronos Group. The C++ for OpenCL language is not documented in a stand-alone document; it is based on the specifications of C++ and OpenCL C.

C++ for OpenCL was originally developed as a Clang compiler extension, and the open source Clang compiler has supported it since release 9. As it was tightly coupled with OpenCL C and did not contain any Clang-specific functionality, its documentation has been re-hosted in the OpenCL-Docs repository. C++ for OpenCL 1.0 contains features from C++17 and is backward compatible with OpenCL C 2.0. In December 2021, a new provisional C++ for OpenCL version 2021 was released, which is fully compatible with the OpenCL 3.0 standard. A work-in-progress draft of the latest C++ for OpenCL documentation can be found on the Khronos website.
==== Features ====
C++ for OpenCL supports most of the features (syntactically and semantically) from OpenCL C except for nested parallelism and blocks. However, there are minor differences in some supported features, mainly related to differences in semantics between C++ and C. For example, C++ is stricter with implicit type conversions, and it does not support the restrict type qualifier.
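To make these two differences concrete, the plain C99 sketch below uses both constructs: the `restrict` qualifier, which is part of C99 but not of standard C++, and an implicit conversion from `void *` to `float *`, which C accepts and C++ rejects without a cast. It is ordinary host C rather than kernel code, and the function names `add_arrays` and `first_element` are hypothetical.

```c
#include <stddef.h>

/* `restrict` promises the compiler that the three arrays do not overlap.
   Valid C99; standard C++ has no `restrict` keyword. */
void add_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* C converts `void *` to `float *` implicitly; C++ would require an
   explicit cast such as static_cast<float *>(raw) on the next line. */
float first_element(void *raw)
{
    float *values = raw;
    return values[0];
}
```

Code like this compiles as C but would need the qualifier removed and a cast added before a C++ compiler accepts it, which is the kind of small adjustment to expect when moving such code from C to C++ semantics.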
OpenCL drivers that support the cl_ext_cxx_for_opencl extension can compile C++ for OpenCL kernel sources online; Arm announced support for this extension in December 2020. However, due to the increasing complexity of the algorithms accelerated on OpenCL devices, it is expected that more applications will compile C++ for OpenCL kernels offline, using stand-alone compilers such as Clang, into an executable binary format or a portable binary format such as SPIR-V. Such an executable can be loaded during the OpenCL application's execution using a dedicated OpenCL API. Binaries compiled from sources in C++ for OpenCL 1.0 can be executed on OpenCL 2.0 conformant devices. Depending on the language features used in such kernel sources, they can also be executed on devices supporting earlier OpenCL versions or OpenCL 3.0. Aside from OpenCL drivers, kernels written in C++ for OpenCL can be compiled for execution on Vulkan devices using the clspv compiler and the clvk runtime layer, in the same way as OpenCL C kernels.
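Before handing an offline-compiled module to the runtime, a host program may want to confirm that the file really is SPIR-V. The sketch below checks for the SPIR-V magic number (0x07230203) in the first four bytes, in either byte order. The function name `looks_like_spirv` is hypothetical, and actually loading the module (for example with `clCreateProgramWithIL` on OpenCL 2.1 or later) is left out so that the example does not depend on OpenCL headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SPIRV_MAGIC 0x07230203u

/* Returns 1 if the buffer starts with the SPIR-V magic number in
   little- or big-endian byte order, 0 otherwise. */
int looks_like_spirv(const unsigned char *bytes, size_t size)
{
    if (bytes == NULL || size < 4)
        return 0;

    /* Assemble the first word under both possible byte orders. */
    uint32_t le = (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
                  ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
    uint32_t be = (uint32_t)bytes[3] | ((uint32_t)bytes[2] << 8) |
                  ((uint32_t)bytes[1] << 16) | ((uint32_t)bytes[0] << 24);

    return le == SPIRV_MAGIC || be == SPIRV_MAGIC;
}
```

A check like this only identifies the container format; whether the module can run still depends on the OpenCL version and features the target device supports.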
==== Contributions ====
C++ for OpenCL is an open language developed by the community of contributors listed in its documentation. New contributions to the language semantic definition or open source tooling support are accepted from anyone interested, as long as they are aligned with the main design philosophy and are reviewed and approved by the experienced contributors.

== History ==