5

The progressing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th generation CPUs, should allow copy-free sharing of data between CPU and GPU. I would like to know whether the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structures between code running on the CPU and GPU.

Paul Jurczak
  • 5,766
  • 36
  • 58
  • AMD APU and Intel Integrated Graphics both allow OpenCL kernels to access main memory without any copies. – Dithermaster Apr 30 '14 at 18:32
  • @Dithermaster Do you know for sure that a specific OpenCL implementation actually behaves this way on specific hardware? – Paul Jurczak Apr 30 '14 at 21:27
  • 1
    Fairly certain. I'm going by vendor documentation. Both vendors have documents that describe which flags to use when allocating a buffer and then use clEnqueueMapBuffer that doesn't incur a copy. Since both devices share the same memory as the CPU, it makes sense that they are able to do this. The difference between this and SVM is that with SVM the buffer can contain pointers which are valid on the GPU side. – Dithermaster Apr 30 '14 at 23:05

1 Answer

5

The ability to share data between host and device without any memory transfers has been available in OpenCL since version 1.0, via the CL_MEM_ALLOC_HOST_PTR flag. This flag allocates a buffer for the device, but ensures that it lies in memory that is also accessible by the host. The workflow for these 'zero-copy' transfers usually takes this form:

// Allocate a device buffer using host-accessible memory
d_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

// Get a host-pointer for the buffer
h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_WRITE,
                              0, size, 0, NULL, NULL, &err);

// Write data into h_buffer from the host
... 

// Unmap the memory buffer
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);

// Do stuff with the buffer on the device
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buffer);
clEnqueueNDRangeKernel(queue, kernel, ...);

This will create a device buffer, write some data into it from the host, and then run a kernel using this buffer on the device. Because of the way that the buffer was allocated, this should not result in a memory transfer if the device and host have a unified memory system.


The above approach is limited to simple, flat data structures (1D arrays). If you are interested in working with something a little more complex, such as linked lists, trees, or other pointer-based data structures, you'll need to take advantage of the Shared Virtual Memory (SVM) feature in OpenCL 2.0. At the time of writing, AMD and Intel have both released some preview support for OpenCL 2.0 functionality, but I cannot vouch for their implementations of SVM.

The workflow for the SVM approach will be somewhat similar to the code listed above. In short, you will allocate a buffer using clSVMAlloc, which will return a pointer that is valid on both the host and device. You will use clEnqueueSVMMap and clEnqueueSVMUnmap to synchronise the data when you wish to access the buffer from the host, and clSetKernelArgSVMPointer to pass it to the device. The crucial difference between SVM and CL_MEM_ALLOC_HOST_PTR is that an SVM pointer can also be included inside another buffer passed to the device (e.g. inside a struct or pointed to by another pointer). This is what allows you to build complex pointer-based data structures that can be shared between the host and device.
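A minimal sketch of that SVM workflow, using the coarse-grained buffer SVM calls named above. This assumes `context`, `queue` and `kernel` have already been created for an OpenCL 2.0 device; the linked-list node type and the sizes `N` and `global` are illustrative, not part of any real API:

```c
#include <stddef.h>
#include <CL/cl.h>

typedef struct node {
    int value;
    struct node *next;   // with SVM, this pointer is also valid on the device
} node;

void svm_example(cl_context context, cl_command_queue queue, cl_kernel kernel)
{
    const size_t N = 16;
    size_t global = N;

    // Allocate an SVM buffer; the returned pointer is valid on host and device
    node *list = (node *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                    N * sizeof(node), 0);

    // Map for host access, build the linked list (including pointers), unmap
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, list, N * sizeof(node),
                    0, NULL, NULL);
    for (size_t i = 0; i < N; i++) {
        list[i].value = (int)i;
        list[i].next  = (i + 1 < N) ? &list[i + 1] : NULL;
    }
    clEnqueueSVMUnmap(queue, list, 0, NULL, NULL);

    // Pass the SVM pointer to the kernel and run it
    clSetKernelArgSVMPointer(kernel, 0, list);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
    clFinish(queue);

    // ...later, free the SVM allocation
    clSVMFree(context, list);
}
```

Note that because `list[i].next` holds a raw pointer into the SVM region, the kernel can traverse the list directly, which is exactly what CL_MEM_ALLOC_HOST_PTR buffers cannot offer.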

jprice
  • 9,355
  • 1
  • 26
  • 30
  • Thank you for the informative post. `The ability to share data between host and device without any memory transfers has been available in OpenCL from version 1.0` theoretically yes, but there was no hardware at the time able to support it without sending data over the PCIe bus to the GPU. The second part of your post is closer to answering my question, but as you wrote, we don't know the details of unreleased products yet. My question is about the existence of an implementation with specific behavior (no memory transfer for shared data) rather than the theoretical capabilities of OpenCL. – Paul Jurczak Apr 30 '14 at 21:25
  • 1
    @PaulJurczak Just to clarify, the first method I described (`CL_MEM_ALLOC_HOST_PTR`) is known to work with a whole bunch of devices that have been on the market for a few years (AMD APUs, Intel IGPUs, all mobile GPUs). I myself used this with the first AMD Fusion devices back in 2011. So if you only need to share flat data structures, this is a tried and tested approach already supported by many implementations. – jprice May 01 '14 at 07:46
  • `is known to work with a whole bunch of devices`: that's what I was looking for, thanks. – Paul Jurczak May 01 '14 at 20:55