Diving into OpenCL
27 Nov 2013

Introduction
In a previous article I put together a walkthrough on getting your development environment ready to write some OpenCL code. That article isn't of much use by itself unless you can already write some code, so today's post is a walkthrough of writing your first OpenCL program. This example, much like a lot of other entry-level OpenCL tutorials, will focus on adding two lists of floating point numbers together.
Lots to learn
Unfortunately, OpenCL is a topic with a very steep learning curve. To understand even the simplest of programs you need to read a fair bit of code and have some idea of what it's doing. Before we dive into any implementation, I'll take you on a brief tour of the terms, types and definitions that will help you understand the code as it's presented.
A cl_platform_id
is obtained using clGetPlatformIDs. A platform in OpenCL refers to the host execution environment and any attached devices. Platforms are what allow OpenCL to share resources and execute programs.
A cl_device_id
is obtained using clGetDeviceIDs. It’s how your program will refer to the devices that your code will run on. A device is how OpenCL refers to “something” that will execute code (CPU, GPU, etc).
A cl_context
is obtained using clCreateContext. A context is established across OpenCL devices. It’s what OpenCL will use to manage command-queues, memory, program and kernel objects. It provides the ability to execute a kernel across many devices.
A cl_program
is created from actual (string) source code at runtime using clCreateProgramWithSource. It's created in conjunction with your context so that program creation is aware of where it'll be expected to run. After the cl_program reference is successfully established, the host program will typically call clBuildProgram to take the program from its source code (string) state into an executable (binary) state.
A cl_command_queue
is established using clCreateCommandQueue. A command queue is how work is scheduled to a device for execution.
A cl_kernel
is created using clCreateKernel. A kernel is a function contained within a compiled cl_program object. It's identified within the source code by the __kernel qualifier. You set the argument list for a cl_kernel object using clSetKernelArg. To glue it all together, you use clEnqueueNDRangeKernel or clEnqueueTask to enqueue a task on the command queue to execute a kernel.
A side note here is that you can use clEnqueueNativeKernel to execute native C/C++ code that isn’t compiled by OpenCL.
At least now, if you can attach some meaning to these type names when you come across them, you won't be totally in the dark. Next up, we'll create a host program and an OpenCL routine, then compile, build and run!
The Host
The host application is responsible for engaging with the OpenCL API to set up all of the objects described above. It's also responsible for locating the OpenCL source code and making it available for compilation at run time. In this first snippet of code, we use the OpenCL API to find the management platform and the devices that are available to execute our code. The majority of the OpenCL API is standardised around returning error codes from its functions, so checking those return values is how you'll catch problems.
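Something along these lines will do the job. This is a minimal sketch rather than a full listing: all of the host snippets in this post are fragments of one main() function, assume #include <CL/cl.h> (OpenCL/opencl.h on OS X) plus <stdio.h> and <stdlib.h>, and keep error checking to a bare minimum.

```c
cl_platform_id platform;
cl_device_id device;
cl_int err;

/* grab the first available platform */
err = clGetPlatformIDs(1, &platform, NULL);

/* prefer a GPU device, but fall back to the CPU if no GPU is found */
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
if (err == CL_DEVICE_NOT_FOUND) {
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
}
```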
At this point, "platform" will (hopefully) contain a platform ID identifying our management platform, and "device" should refer to either the GPU or, failing to find a GPU, the CPU. The next step is to create a context and your OpenCL program from source.
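Here's a sketch of that step. I'm assuming source_str already holds the contents of your .cl file as a null-terminated string; reading the file in is left out for brevity.

```c
cl_context context;
cl_program program;

/* create a context covering the device we found earlier */
context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

/* hand the (string) source code over to OpenCL */
const char *sources[] = { source_str };
program = clCreateProgramWithSource(context, 1, sources, NULL, &err);
```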
We've done just that here, but the program isn't quite ready for execution yet. Before we can start using it, we need to build the program. The build process is very much a compile & link process, with its own set of log messages and so on. You can make this part of your program as elaborate as you'd like. Here's an example compilation process.
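A sketch of the build step, pulling the build log back out if compilation fails:

```c
/* compile and link the program for our device */
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

if (err != CL_SUCCESS) {
    /* on failure, retrieve the build log so we can see what went wrong */
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);

    char *log = malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    log[log_size] = '\0';

    fprintf(stderr, "%s\n", log);
    free(log);
    exit(1);
}
```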
We've got a platform, device, context and program that's been built. We now need to shift from the host program to the actual OpenCL code that we'll execute for the purposes of this example. We need to understand what the inputs, outputs, resources used, etc. of the OpenCL code are before we can continue writing the rest of our host.
The OpenCL Code
The main purpose of OpenCL code is really to operate arithmetically on arrays (or streams) of data. The example I'm suggesting for the purposes of this article takes in two source arrays and produces a third array holding the sum of the two sources at each index, i.e.
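In plain C terms, just to illustrate the idea:

```c
/* for every index i in the arrays */
c[i] = a[i] + b[i];
```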
As above, the source arrays are a and b. The result array (holding the sum of the source arrays at each index) is c. Here's the (rather simple) OpenCL code to achieve this.
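A kernel along these lines does the job. I've called it vector_add here; the name is my own choice for this example.

```c
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    /* each work item handles exactly one index of the arrays */
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```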
That's all there is to it. A few things to note: any function that is to be called in the OpenCL context is called a "kernel", and kernels must be decorated with the __kernel modifier. In this example, the parameters that are passed in are decorated with the __global modifier, which tells OpenCL that these are objects allocated from the global memory pool. You can read up more about these modifiers here. The final thing to note is the use of get_global_id. It's what gives us the particular index to process in our array. The parameter that's supplied allows you to work with 1, 2 or 3 dimensional arrays; anything beyond that and the arrays need to be broken down to use a smaller dimension count.
Back to the Host
Back in the context of the host, we'll create the command queue and kernel objects. The command queue allows us to send commands to OpenCL, like reading from & writing to buffers or executing kernel code. The following code shows the creation of the command queue and the kernel.
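A sketch of those two calls, using the vector_add name I gave the kernel above:

```c
cl_command_queue queue;
cl_kernel kernel;

/* the queue is tied to both the context and a specific device */
queue = clCreateCommandQueue(context, device, 0, &err);

/* look the kernel up by the name it was given in the OpenCL source */
kernel = clCreateKernel(program, "vector_add", &err);
```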
Notice that we referred to the kernel by name here: a kernel object refers to the function! Now that we have a function to execute (or kernel), we need to be able to pass data to it and to read the result once processing has finished. Our first job is allocating buffers that OpenCL is aware of to handle these arrays.
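Something like the following, with ARRAY_SIZE picked arbitrarily for the example:

```c
#define ARRAY_SIZE 64

float a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];

/* fill the source arrays with some test data */
for (int i = 0; i < ARRAY_SIZE; i++) {
    a[i] = (float)i;
    b[i] = (float)(i * 2);
}

/* device-side buffers: two read-only inputs and one write-only output */
cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              sizeof(float) * ARRAY_SIZE, NULL, &err);
cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              sizeof(float) * ARRAY_SIZE, NULL, &err);
cl_mem c_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                              sizeof(float) * ARRAY_SIZE, NULL, &err);
```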
In the above snippet, we've defined the source arrays and also created the buffers that will hold the data (for use in our OpenCL code). Now all we need to do is feed the source arrays into the buffers and supply all of the buffers as arguments to our kernel.
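A sketch of that wiring-up step:

```c
/* copy the host arrays into the device buffers (blocking writes) */
clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0,
                     sizeof(float) * ARRAY_SIZE, a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0,
                     sizeof(float) * ARRAY_SIZE, b, 0, NULL, NULL);

/* bind the buffers to the kernel's parameters, in declaration order */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);
```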
Now we invoke OpenCL to do the work. In doing this, we need to supply the invocation with a global size and a local size. The global size specifies the total number of work items being processed; in our case, this is ARRAY_SIZE. The local size is the number of work items in each local group, and it needs to be a divisor of the global size. For simplicity, I've set both to ARRAY_SIZE.
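A minimal sketch of that launch:

```c
size_t global_size = ARRAY_SIZE;
size_t local_size  = ARRAY_SIZE;

/* one-dimensional launch: ARRAY_SIZE work items, all in a single group */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
```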
After all of the work is completed, we really want to take a look at the result. We'll send another request to the command queue to read the result array back into host memory. From there, we'll be able to print the results to the screen.
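Something like this, with a blocking read so we know the kernel has finished before we print:

```c
/* blocking read: waits for the kernel to finish, then copies c back */
clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0,
                    sizeof(float) * ARRAY_SIZE, c, 0, NULL, NULL);

for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("%f + %f = %f\n", a[i], b[i], c[i]);
}
```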
Fantastic. Everything's on screen now and we can see the results. All that's left from here is to clean up our mess and get out.
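The cleanup is just a matter of releasing everything we created, roughly in the reverse order of creation:

```c
clReleaseMemObject(a_buf);
clReleaseMemObject(b_buf);
clReleaseMemObject(c_buf);
clReleaseKernel(kernel);
clReleaseCommandQueue(queue);
clReleaseProgram(program);
clReleaseContext(context);
```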
What a marathon! Hopefully you’ve learnt something from this post. It’s a lot to take in to get a “Hello, World” level application up and running.