OpenCL

OpenCL C/C++
Paradigm	Imperative (procedural), structured, object-oriented (C++ only)
Family	C
Stable release	OpenCL C 2.0 revision 29 / August 4, 2015
Preview release	OpenCL C++ 1.0 provisional specification revision 8 / March 2, 2015
Typing discipline	Static, weak, manifest, nominal
Implementation language	Implementation specific
Filename extensions	.cl
Major implementations
	AMD, Apple, freeocl, Gallium Compute, IBM, Intel Beignet, Intel SDK, Nvidia, pocl
Influenced by
	C99, CUDA, C++14

OpenCL API
Original author(s)	Apple Inc.
Developer(s)	Khronos Group
Initial release	August 28, 2009
Stable release	2.1 revision 23 / November 11, 2015
Development status	Active
Written in	C/C++
Operating system	Android (vendor dependent), FreeBSD, Linux, OS X, Windows
Platform	ARMv7, ARMv8, Cell, PowerPC, x86, x86-64
Type	Heterogeneous computing API
License	OpenCL specification license
Website	www.khronos.org/opencl; www.khronos.org/webcl
As of	25 December 2015

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies a language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task-based and data-based parallelism. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. Conformant implementations are available from Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS.^[7]^[8]

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or "accelerators" such as graphics processing units (GPUs), attached to a host processor (a CPU). It defines a C-like language for writing programs. Functions executed on an OpenCL device are called kernels.^[1]^:17 A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel. How a compute device is subdivided into compute units and PEs is up to the vendor; a compute unit can be thought of as a "core", but the notion of core is hard to define across all the types of devices supported by OpenCL (or even within the category of "CPUs"),^[9]^:49–50 and the number of compute units may not correspond to the number of cores claimed in vendors' marketing literature (which may actually be counting SIMD lanes).^[10]

In addition to its C-like programming language, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.^[11] The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python,^[12] Java and .NET.^[9]^:15 An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute device(s) targeted.

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device:^[11]

global memory: shared by all processing elements, but has high access latency;
read-only (constant) memory: smaller, low latency, writable by the host CPU but not the compute devices;
local memory: shared by a group of processing elements;
per-element private memory (registers).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.
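The interplay of local memory and barriers can be illustrated without an OpenCL runtime. The following plain C sketch emulates a single work-group serially: each loop over lid stands for all work-items executing the same statement, and the end of each loop stands for a barrier. The names WG_SIZE and workgroup_sum are hypothetical, and the code is an illustration of the idea, not device code.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 8  /* hypothetical work-group size; must be a power of two */

/* Emulates one work-group summing WG_SIZE floats read from "global" memory. */
float workgroup_sum(const float *global_in)
{
    float local_mem[WG_SIZE];            /* stands in for local memory */

    assert((WG_SIZE & (WG_SIZE - 1)) == 0);

    /* Phase 1: every work-item copies one element from global to local. */
    for (size_t lid = 0; lid < WG_SIZE; lid++)
        local_mem[lid] = global_in[lid];
    /* barrier: all copies must be visible before any work-item reads them */

    /* Phase 2: tree reduction, halving the number of active work-items. */
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; lid++)
            local_mem[lid] += local_mem[lid + stride];
        /* barrier: this step's partial sums must be complete before the next */
    }
    return local_mem[0];
}
```

On a real device the work-items of a group run concurrently, so omitting either barrier would let a work-item read a local-memory slot before its neighbour has written it.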

Devices may or may not share memory with the host CPU.^[11] The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.

OpenCL C language

The programming language used to write computation kernels is called OpenCL C and is based on C99,^[13] but adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program. Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden.^[14] The C standard library is replaced by a custom set of standard functions, geared toward math programming.

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.^[15] In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types.^[13]^:§6.1.2 Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs.^[11] Other specialized types include 2-d and 3-d image types.^[13]^:10–11
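To make the meaning of component-wise vector arithmetic concrete, the following plain C sketch spells out what an expression such as a * b + c does when a, b and c are float4 values. The struct vec4f and the helper names are hypothetical stand-ins for the built-in type and operators; in OpenCL C the arithmetic operators apply to vector operands directly.

```c
/* Hypothetical stand-in for OpenCL C's built-in float4 type. */
typedef struct { float x, y, z, w; } vec4f;

/* Component-wise a*b + c; each of the four lanes is computed independently,
 * which is what lets an implementation map the operation onto SIMD hardware. */
vec4f vec4f_mad(vec4f a, vec4f b, vec4f c)
{
    vec4f r = { a.x * b.x + c.x, a.y * b.y + c.y,
                a.z * b.z + c.z, a.w * b.w + c.w };
    return r;
}

/* Dot product of two 4-vectors; OpenCL C offers a built-in dot() for this. */
float vec4f_dot(vec4f a, vec4f b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
```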

Example: matrix-vector multiplication

Each invocation (work-item) of the kernel takes a row of the matrix (A in the code), multiplies this row with the vector (x) and places the result in an entry of the output vector (y). The number of columns $n$ is passed to the kernel as ncols; the number of rows is implicit in the number of work-items produced by the host program.

The following is a matrix-vector multiplication algorithm in OpenCL C.

// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);              // Global id, used as the row index.
    __global float const *a = &A[i*ncols];    // Pointer to the i'th row.
    float sum = 0.f;                          // Accumulator for dot product.
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}

The kernel function matvec computes, in each invocation, the dot product of a single row of a matrix $A$ and a vector $x$:

$$y_i = a_{i,:} \cdot x = \sum_j a_{i,j} x_j.$$

To extend this into a full matrix-vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix. On the host side, the clEnqueueNDRangeKernel function does this; it takes, among other arguments, the kernel to execute (whose arguments have been set beforehand with clSetKernelArg) and a number of work-items, corresponding to the number of rows in the matrix $A$.
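The mapping of the kernel over the rows can be sketched in plain C, with an ordinary loop standing in for the runtime. Here matvec_work_item repeats the kernel body with the row index passed explicitly instead of obtained from get_global_id(0), and matvec_emulated is a hypothetical serial substitute for enqueueing a one-dimensional range of nrows work-items.

```c
#include <stddef.h>

/* Body of the matvec kernel for a single work-item with row index i. */
static void matvec_work_item(size_t i, const float *A, const float *x,
                             unsigned ncols, float *y)
{
    const float *a = &A[i * ncols];   /* pointer to the i'th row */
    float sum = 0.f;                  /* accumulator for the dot product */
    for (size_t j = 0; j < ncols; j++)
        sum += a[j] * x[j];
    y[i] = sum;
}

/* Serial emulation: one "work-item" per row of the row-major matrix A. */
void matvec_emulated(const float *A, const float *x,
                     unsigned nrows, unsigned ncols, float *y)
{
    for (size_t i = 0; i < nrows; i++)
        matvec_work_item(i, A, x, ncols, y);
}
```

Because no work-item reads an entry of y written by another, the iterations are independent and can run in any order or in parallel, which is the property the OpenCL runtime relies on.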

Example: computing the FFT

This example will load a fast Fourier transform (FFT) implementation and execute it. The implementation is shown below.^[16]

  // create a compute context with GPU device
  context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

  // create a command queue on a GPU device, matching the device type of the context
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
  queue = clCreateCommandQueue(context, device_id, 0, NULL);

  // allocate the buffer memory objects
  memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA, NULL);
  memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL, NULL);

  // create the compute program
  program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

  // build the compute program executable
  clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

  // create the compute kernel
  kernel = clCreateKernel(program, "fft1D_1024", NULL);

  // choose the work sizes first: the local-memory argument sizes below depend on them
  global_work_size[0] = num_entries;
  local_work_size[0] = 64; // Nvidia: 192 or 256

  // set the args values
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
  clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
  clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

  // create N-D range object with work-item dimensions and execute kernel
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The actual calculation (based on Fitting FFT onto the G80 Architecture):^[17]

  // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
  // calls to a radix 16 function, another radix 16 function and then a radix 4 function

  __kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;  out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads
    fftRadix16Pass(data);      // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
  }
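The index arithmetic at the top of the kernel can be checked in plain C. The helper globalLoads is not shown in the listing, so the sketch below assumes that its third argument is a stride, i.e. that a work-item reads its 16 values at offsets 0, 64, 128, ... from its base pointer. Under that assumption, the 64 work-items of a group together touch each of the group's 1024 elements exactly once. All names in the sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { POINTS_PER_GROUP = 1024, WORK_ITEMS_PER_GROUP = 64, POINTS_PER_ITEM = 16 };

/* Index of the k'th element touched by work-item tid of work-group group,
 * assuming element k lies k * 64 positions past the work-item's base. */
size_t fft_element_index(size_t group, size_t tid, size_t k)
{
    assert(tid < WORK_ITEMS_PER_GROUP && k < POINTS_PER_ITEM);
    size_t base = group * POINTS_PER_GROUP + tid;   /* blockIdx in the kernel */
    return base + k * WORK_ITEMS_PER_GROUP;
}

/* Counts how often each element is touched across ngroups work-groups.
 * hits must have room for ngroups * POINTS_PER_GROUP entries. */
void fft_count_hits(size_t ngroups, unsigned *hits)
{
    for (size_t n = 0; n < ngroups * POINTS_PER_GROUP; n++)
        hits[n] = 0;
    for (size_t g = 0; g < ngroups; g++)
        for (size_t tid = 0; tid < WORK_ITEMS_PER_GROUP; tid++)
            for (size_t k = 0; k < POINTS_PER_ITEM; k++)
                hits[fft_element_index(g, tid, k)]++;
}
```

With this layout, neighbouring work-items read neighbouring addresses at each step, which is consistent with the "coalesced" comments in the kernel.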

A full, open source implementation of an OpenCL FFT can be found on Apple's website.^[18]

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed^[19] with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008.^[20] This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.^[21]

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release:^[22]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework.^[23]^[24] RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface.^[25] On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit.^[26] On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers.^[27]

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010,^[28] and added functionality for more flexible and better-performing parallel programming, including:

New data types including 3-component vectors and additional image formats;
Handling commands from multiple host threads and processing buffers across multiple devices;
Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;
Enhanced use of events to drive and control command execution;
Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification,^[29] which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.
DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.
The ability to force IEEE 754 compliance for single precision floating point math: OpenCL by default allows the single precision versions of the division, reciprocal, and square root operation to be less accurate than the correctly rounded values that IEEE 754 requires.^[30] If the programmer passes the "-cl-fp32-correctly-rounded-divide-sqrt" command line argument to the compiler, these three operations will be computed to IEEE 754 requirements if the OpenCL implementation supports this, and will fail to compile if the OpenCL implementation does not support computing these operations to their correctly-rounded values as defined by the IEEE 754 specification.^[30] This ability is supplemented by the ability to query the OpenCL implementation to determine if it can perform these operations to IEEE 754 accuracy.^[30]
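How far a computed single-precision result lies from a reference value can be measured by counting the representable floats between them, a common proxy for error in units in the last place (ulp). The plain C sketch below assumes 32-bit IEEE 754 floats and finite inputs. The function div_via_reciprocal shows one way a division could trade accuracy for speed; that strategy is an assumption chosen for illustration and is not taken from the OpenCL specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Maps a float's bit pattern to an integer that increases monotonically
 * with the float's value, so adjacent floats map to adjacent integers. */
static int64_t ordered_bits(float f)
{
    int32_t i;
    assert(sizeof(float) == sizeof(int32_t));
    memcpy(&i, &f, sizeof i);                 /* well-defined type pun */
    return i < 0 ? (int64_t)INT32_MIN - i : i;
}

/* Distance between two finite floats, counted in representable values. */
int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Division replaced by a reciprocal and a multiplication (illustrative
 * assumption); the two roundings can make it differ from x / y. */
float div_via_reciprocal(float x, float y)
{
    return x * (1.0f / y);
}
```

Comparing ulp_distance(x / y, div_via_reciprocal(x, y)) over a range of inputs gives a feel for the kind of deviation that correctly rounded division rules out.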

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification.^[31] Updates and additions to OpenCL 2.0 include:

Shared virtual memory
Nested parallelism
Generic address space
Images
C11 atomics
Pipes
Android installable client driver extension

OpenCL 2.1

The ratification and release of the OpenCL 2.1 provisional specification was announced on March 3, 2015 at the Game Developers Conference in San Francisco. The final specification was released on November 16, 2015.^[32] OpenCL 2.1 replaces the OpenCL C kernel language with OpenCL C++, a subset of C++14. Vulkan and OpenCL 2.1 share SPIR-V as an intermediate language, allowing high-level language front-ends to share a common compilation target. Updates to the OpenCL API include:

Additional subgroup functionality
Copying of kernel objects and states
Low-latency device timer queries
Ingestion of SPIR-V code by runtime
Execution priority hints for queues
Zero-sized dispatches from host

AMD, ARM, Intel, HPC, and YetiWare have declared support for OpenCL 2.1.^[33]^[34]

Implementations

OpenCL consists of a set of headers and a shared object that is loaded at runtime. An installable client driver (ICD) must be installed on the platform for every vendor whose devices the runtime needs to support. For example, to support Nvidia devices on a Linux platform, the Nvidia ICD must be installed so that the OpenCL runtime can locate the ICD for the vendor and redirect calls appropriately. The standard OpenCL header is used by the consumer application; calls to each function are then proxied by the OpenCL runtime to the appropriate driver using the ICD. Each vendor must implement each OpenCL call in their driver.^[35]
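The proxying of calls to vendor drivers can be pictured as dispatch through a table of function pointers. The plain C sketch below is a heavily simplified, hypothetical model: the type icd_dispatch, the vendor names and the single entry point get_device_count are invented for illustration, and a real loader is considerably more involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a vendor driver: a name plus a table of entry
 * points. A real table would hold one pointer per OpenCL API function. */
typedef struct {
    const char *vendor;
    int (*get_device_count)(void);
} icd_dispatch;

static int vendor_a_device_count(void) { return 2; }
static int vendor_b_device_count(void) { return 1; }

/* Drivers known to the loader; a real loader discovers them at run time. */
static const icd_dispatch registered_icds[] = {
    { "VendorA", vendor_a_device_count },
    { "VendorB", vendor_b_device_count },
};

/* The loader's entry point: find the vendor's table and forward the call
 * through it. Returns -1 if no driver is registered for the vendor. */
int loader_get_device_count(const char *vendor)
{
    size_t n = sizeof registered_icds / sizeof registered_icds[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(registered_icds[i].vendor, vendor) == 0)
            return registered_icds[i].get_device_count();
    return -1;
}
```

The application links only against the loader; which vendor function eventually runs is decided at run time by the table that the call is routed through.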

A number of open source implementations of the OpenCL ICD exist, including freeocl^[36] and ocl-icd.^[37] An implementation of OpenCL for a number of platforms is maintained as part of the Gallium Compute Project,^[38] which builds on the work of the Mesa project to support multiple platforms. An implementation by Intel for its Ivy Bridge hardware was released in 2013.^[39] This software, called "Beignet", is not based on Mesa/Gallium, which has attracted criticism from developers at AMD and Red Hat,^[40] as well as Michael Larabel of Phoronix.^[41] A CPU-only version building on Clang and LLVM, called pocl, is intended to be a portable OpenCL implementation.^[42]

Timeline of vendor implementations

December 10, 2008: AMD and Nvidia held the first public OpenCL demonstration, a 75-minute presentation at Siggraph Asia 2008. AMD showed a CPU-accelerated OpenCL demo explaining the scalability of OpenCL on one or more cores while Nvidia showed a GPU-accelerated demo.^[43]^[44]
March 16, 2009: at the 4th Multicore Expo, Imagination Technologies announced the PowerVR SGX543MP, the first GPU of this company to feature OpenCL support.^[45]
March 26, 2009: at GDC 2009, AMD and Havok demonstrated the first working implementation for OpenCL accelerating Havok Cloth on AMD Radeon HD 4000 series GPU.^[46]
April 20, 2009: Nvidia announced the release of its OpenCL driver and SDK to developers participating in its OpenCL Early Access Program.^[47]
August 5, 2009: AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program.^[48]
August 28, 2009: Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL.^[49]

OpenCL in Snow Leopard is supported on the Nvidia GeForce 320M, GeForce GT 330M, GeForce 9400M, GeForce 9600M GT, GeForce 8600M GT, GeForce GT 120, GeForce GT 130, GeForce GTX 285, GeForce 8800 GT, GeForce 8800 GS, Quadro FX 4800, Quadro FX5600, ATI Radeon HD 4670, ATI Radeon HD 4850, Radeon HD 4870, ATI Radeon HD 5670, ATI Radeon HD 5750, ATI Radeon HD 5770 and ATI Radeon HD 5870.^[50]

September 28, 2009: Nvidia released its own OpenCL drivers and SDK implementation.
October 13, 2009: AMD released the fourth beta of the ATI Stream SDK 2.0, which provides a complete OpenCL implementation on both R700/R800 GPUs and SSE3 capable CPUs. The SDK is available for both Linux and Windows.^[51]
November 26, 2009: Nvidia released drivers for OpenCL 1.0 (rev 48).

The Apple,^[52] Nvidia,^[53] RapidMind^[54] and Gallium3D^[55] implementations of OpenCL are all based on LLVM compiler technology and use the Clang compiler as their frontend.

October 27, 2009: S3 released their first product supporting native OpenCL 1.0 – the Chrome 5400E embedded graphics processor.^[56]
December 10, 2009: VIA released their first product supporting OpenCL 1.0 – ChromotionHD 2.0 video processor included in VN1000 chipset.^[57]
December 21, 2009: AMD released the production version of the ATI Stream SDK 2.0,^[58] which provides OpenCL 1.0 support for R800 GPUs and beta support for R700 GPUs.
June 1, 2010: ZiiLABS released details of their first OpenCL implementation for the ZMS processor for handheld, embedded and digital home products.^[59]
June 30, 2010: IBM released a fully conformant version of OpenCL 1.0.^[60]
September 13, 2010: Intel released details of their first OpenCL implementation for the Sandy Bridge chip architecture. Sandy Bridge will integrate Intel's newest graphics chip technology directly onto the central processing unit.^[61]
November 15, 2010: Wolfram Research released Mathematica 8 with OpenCLLink package.
March 3, 2011: Khronos Group announced the formation of the WebCL working group to explore defining a JavaScript binding to OpenCL. This creates the potential to harness GPU and multi-core CPU parallel processing from a Web browser.^[62]^[63]
March 31, 2011: IBM released a fully conformant version of OpenCL 1.1.^[60]^[64]
April 25, 2011: IBM released OpenCL Common Runtime v0.1 for Linux on x86 Architecture.^[65]
May 4, 2011: Nokia Research released an open source WebCL extension for the Firefox web browser, providing a JavaScript binding to OpenCL.^[66]
July 1, 2011: Samsung Electronics released an open source prototype implementation of WebCL for WebKit, providing a JavaScript binding to OpenCL.^[67]
August 8, 2011: AMD released the OpenCL-driven AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) v2.5, replacing the ATI Stream SDK as technology and concept.^[68]
December 12, 2011: AMD released AMD APP SDK v2.6^[69] which contains a preview of OpenCL 1.2.
February 27, 2012: The Portland Group released the PGI OpenCL compiler for multi-core ARM CPUs.^[70]
April 17, 2012: Khronos released a WebCL working draft.^[71]
May 6, 2013: Altera released the Altera SDK for OpenCL, version 13.0.^[72] It is conformant to OpenCL 1.0.^[73]
November 18, 2013: Khronos announced that the specification for OpenCL 2.0 had been finalised.^[74]
March 19, 2014: Khronos released the WebCL 1.0 specification.^[75]^[76]
August 29, 2014: Intel released an HD Graphics 5300 driver that supports OpenCL 2.0.^[77]
September 25, 2014: AMD released Catalyst 14.41 RC1, which includes an OpenCL 2.0 driver.^[78]
April 13, 2015: Nvidia released WHQL driver v350.12, which includes OpenCL 1.2 support.^[79]
August 26, 2015: AMD released AMD APP SDK v3.0,^[80] which contains full support for OpenCL 2.0 and sample code.

Conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products.^[4]

Synopsis of OpenCL conformant products^[4]
AMD APP SDK (supports OpenCL CPU and accelerated processing unit Devices)	X86 + SSE2 (or higher) compatible CPUs 64-bit & 32-bit;^[81] Linux 2.6 PC, Windows Vista/7 PC	AMD Fusion E-350, E-240, C-50, C-30 with HD 6310/HD 6250	AMD Radeon/Mobility HD 6800, HD 5x00 series GPU, iGPU HD 6310/HD 6250	ATI FirePro Vx800 series GPU
Intel SDK for OpenCL Applications 2013^[82] (supports Intel Core processors and Intel HD Graphics 4000/2500)	Intel CPUs with SSE 4.1, SSE 4.2 or AVX support.^[83]^[84] Microsoft Windows, Linux	Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3, 3rd Generation Intel Core Processors with Intel HD Graphics 4000/2500	Intel Core 2 Solo, Duo Quad, Extreme	Intel Xeon 7x00,5x00,3x00 (Core based)
IBM Servers with OpenCL Development Kit for Linux on Power running on Power VSX^[85]^[86]	IBM Power 755 (PERCS), 750	IBM BladeCenter PS70x Express	IBM BladeCenter JS2x, JS43	IBM BladeCenter QS22
IBM OpenCL Common Runtime (OCR) ^[87]	X86 + SSE2 (or higher) compatible CPUs 64-bit & 32-bit;^[88] Linux 2.6 PC	AMD Fusion, Nvidia Ion and Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3	AMD Radeon, Nvidia GeForce and Intel Core 2 Solo, Duo, Quad, Extreme	ATI FirePro, Nvidia Quadro and Intel Xeon 7x00,5x00,3x00 (Core based)
Nvidia OpenCL Driver and Tools^[89]	Nvidia Tesla C/D/S	Nvidia GeForce GTS/GT/GTX	Nvidia Ion	Nvidia Quadro FX/NVX/Plex

Extensions

Some vendors provide extended functionality over the standard OpenCL specification by means of extensions. These are still specified by Khronos but provided by vendors within their SDKs. They often contain features that are to be implemented in the future; for example, device fission functionality was originally an extension but is now provided as part of the 1.2 specification.

Extensions provided in the 1.2 specification include:

Writing to 3D image memory objects
Half-precision floating-point format
Sharing memory objects with OpenGL
Creating event objects from GL sync objects
Sharing memory objects with Direct3D 10
DX9 media Surface Sharing
Sharing Memory Objects with Direct3D 11

Device fission

Device fission, introduced fully into the OpenCL standard with version 1.2, allows individual command queues to be used for specific areas of a device. For example, within the Intel SDK, a command queue can be created that maps directly to an individual core. AMD also provides functionality for device fission, also originally as an extension. Device fission can be used where compute resources must be reliably available, such as in a latency-sensitive environment. Fission effectively reserves areas of the device for computation.
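The arithmetic behind one simple partitioning scheme, an equal split of a device's compute units, can be sketched in plain C. The sketch assumes that compute units which do not fill a whole sub-device are left unused; the type and function names are hypothetical and no OpenCL call is made.

```c
#include <assert.h>

/* Result of splitting a device's compute units into equal sub-devices. */
typedef struct {
    unsigned sub_devices;   /* number of sub-devices created */
    unsigned unused_units;  /* compute units left over (assumed unused) */
} partition_result;

/* Splits compute_units into sub-devices of units_per_sub_device each. */
partition_result partition_equally(unsigned compute_units,
                                   unsigned units_per_sub_device)
{
    partition_result r;
    assert(units_per_sub_device > 0);
    r.sub_devices  = compute_units / units_per_sub_device;
    r.unused_units = compute_units % units_per_sub_device;
    return r;
}
```

For a device with 8 compute units, sub-devices of 3 units each would yield 2 sub-devices with 2 units left over, so a time-critical task could be given one sub-device to itself.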

Portability, performance and alternatives

A key feature of OpenCL is portability, achieved through its abstracted memory and execution model. The programmer cannot directly use hardware-specific technologies, such as inline Parallel Thread Execution (PTX) for Nvidia GPUs, without giving up direct portability to other platforms. It is possible to run any OpenCL kernel on any conformant implementation.

However, performance of the kernel is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem,^[90] yielding "acceptable levels of performance" in experimental linear algebra kernels.^[91] Portability of an entire application containing multiple kernels with differing behaviors was also studied, and was found to require only limited tradeoffs.^[92]

A study at Delft University that compared CUDA programs and their straightforward translation into OpenCL C found CUDA to outperform OpenCL by at most 30% on the Nvidia implementation. The researchers noted that their comparison could be made fairer by applying manual optimizations to the OpenCL programs, in which case there was "no reason for OpenCL to obtain worse performance than CUDA". The performance differences could mostly be attributed to differences in the programming model (especially the memory model) and to Nvidia's compiler optimizations for CUDA compared to those for OpenCL.^[90] Another, similar study found CUDA to perform faster data transfers to and from a GPU's memory.^[93]

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices.^[94] This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support vector machines trained on compile-time features of programs that can decide the device partitioning problem statically, without actually running the programs to measure their performance.^[95]
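The partitioning decision itself can be as simple as a proportional split once relative device speeds are known or predicted. The plain C sketch below shows such a heuristic; it is not the method of Grewe and O'Boyle, and the function name and the throughput parameters (work-items per second for each device) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Splits total work-items between two devices in proportion to their
 * throughput. Returns the number assigned to the first device; the
 * remainder goes to the second. */
size_t split_work(size_t total, double throughput_a, double throughput_b)
{
    double sum, share;
    assert(throughput_a >= 0.0 && throughput_b >= 0.0);
    sum = throughput_a + throughput_b;
    if (sum == 0.0)
        return total / 2;                 /* no information: split evenly */
    share = (double)total * (throughput_a / sum);
    /* truncate, and guard against rounding past the total */
    return (size_t)share > total ? total : (size_t)share;
}
```

A host program could then enqueue the first split_work(...) work-items on one device and the rest on the other, for example by using a global work offset for the second range.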

References

↑ ^1.0 ^1.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^4.0 ^4.1 ^4.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^9.0 ^9.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^11.0 ^11.1 ^11.2 ^11.3 Lua error in package.lua at line 80: module 'strict' not found.
↑ Template:Cite doi/10.1016.2Fj.parco.2011.09.001
↑ ^13.0 ^13.1 ^13.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ AMD. Introduction to OpenCL Programming 201005, page 89-90
↑ AMD. Introduction to OpenCL Programming 201005, page 89-90
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Khronos Drives Momentum of Parallel Computing Standard with Release of OpenCL 1.1 Specification
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^30.0 ^30.1 ^30.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^60.0 ^60.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Intel OpenCL 2.0 Driver
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ http://us.download.nvidia.com/Windows/350.12/350.12-win8-win7-winvista-desktop-release-notes.pdf
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^90.0 ^90.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ doi:10.1016/j.parco.2011.10.002
↑ Lua error in package.lua at line 80: module 'strict' not found.^{[dead link]}
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ A Survey of CPU-GPU Heterogeneous Computing Techniques, ACM Computing Surveys, 2015.
↑ Lua error in package.lua at line 80: module 'strict' not found.

External links

