SIMD-8, SIMD-16 or SIMD-32 in OpenCL on GPGPU

Question

I have read a couple of questions on SO about this topic (SIMD mode), but I still need some clarification and confirmation of how things work.

Why use SIMD if we have GPGPU?

SIMD intrinsics - are they usable on gpus?

CPU SIMD vs GPU SIMD?

Are the following points correct if I compile the code in SIMD-8 mode?

1) It means that 8 instructions from different work items are executing in parallel.

2) Does it mean that all work items are executing the same instruction?

3) Suppose each work item's code contains only a vload16 load, then float16 operations, and then a vstore16 store. Will SIMD-8 mode still work? In other words, is it true that the GPU is still executing the same instruction (vload16, a float16 operation, or vstore16) for all 8 work items?

How should I understand this concept?

Interesting question. I've never heard anyone doing SIMD optimization on GPU. — user3528438, Jul 31 '15 at 19:11
GPUs use (almost) the same SIMD as CPUs - just the programming model is different, exposing scalar threads on GPU and vector threads on CPU. — void_ptr, Jul 31 '15 at 20:23

score 0 · Accepted Answer · answered Aug 02 '15 at 19:21


In the past, many OpenCL vendors required the use of vector types in order to use SIMD. Nowadays OpenCL vendors pack work items into SIMD lanes, so there is no need to use vector types. Whether vector types are preferred can be checked by querying clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_<CHAR, SHORT, INT, LONG, FLOAT, DOUBLE>.

On Intel, if vector types are used, the vectorizer first scalarizes them and then re-vectorizes the result to make use of the wide instruction set. This is probably similar on other platforms.
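To make the scalarize-then-re-vectorize idea concrete, here is a minimal sketch in plain C (not OpenCL, and not real compiler output). It assumes a simplified model in which one SIMD-8 instruction processes one scalar component from each of 8 work items, so a float16 multiply becomes 16 such instructions. All function names here (simd8_mul, float16_mul_simd8, simd8_demo) are hypothetical and exist only for illustration; an actual compiler may choose a different width or instruction mix.

```c
#include <stdio.h>

#define SIMD_WIDTH 8   /* lanes per hardware instruction (SIMD-8 mode) */
#define VEC_WIDTH  16  /* components in one float16 value */

/* Models one SIMD-8 multiply instruction: the same operation is applied
 * to one scalar component taken from each of 8 different work items. */
static void simd8_mul(const float *a, const float *b, float *out)
{
    for (int lane = 0; lane < SIMD_WIDTH; ++lane)
        out[lane] = a[lane] * b[lane];
}

/* Models "out = a * b" on float16 operands for 8 work items at once.
 * Operands are laid out as [component][work item], i.e. already
 * scalarized. Returns the number of SIMD-8 instructions issued. */
static int float16_mul_simd8(float a[VEC_WIDTH][SIMD_WIDTH],
                             float b[VEC_WIDTH][SIMD_WIDTH],
                             float out[VEC_WIDTH][SIMD_WIDTH])
{
    int issued = 0;
    for (int c = 0; c < VEC_WIDTH; ++c) {
        /* Every lane (work item) executes this same instruction. */
        simd8_mul(a[c], b[c], out[c]);
        ++issued;
    }
    return issued;
}

/* Fills sample operands, runs the model, and reports the instruction count. */
int simd8_demo(void)
{
    float a[VEC_WIDTH][SIMD_WIDTH];
    float b[VEC_WIDTH][SIMD_WIDTH];
    float out[VEC_WIDTH][SIMD_WIDTH];

    for (int c = 0; c < VEC_WIDTH; ++c) {
        for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
            a[c][lane] = (float)(c + 1);
            b[c][lane] = (float)(lane + 1);
        }
    }

    int issued = float16_mul_simd8(a, b, out);
    printf("SIMD-8 instructions issued for one float16 multiply: %d\n", issued);
    return issued;
}
```

Calling simd8_demo() from a main function runs the model once. Under this model, all 8 work items still execute the same instruction at every step; using float16 only means each work item contributes more independent scalar operations, which matches point 3 of the question.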

doqtor


So does that mean that if I use instructions like vload16 or float16 in kernel code, I am increasing the redundant work per work item? If so, doesn't that nullify the point of having vload16 or float16 instructions at all? – Manish Kumar Aug 03 '15 at 02:57
The problem is that this is really a tuning factor. It may do that, on the other hand you may find that your kernel benefits from packing more ALU operations into each work-item. Look at it more like a loop unrolling optimisation. You may not need it to get SIMD mapping, but you can benefit from the extra information about independent ALU ops. At that point you just have to experiment to find the best combination of all these factors, or rely on the heuristics that the compiler uses. – Lee Aug 03 '15 at 19:22
Guys, I am totally confused about the concept here. I don't think I even know how the SIMD engine maps to the ALUs. I am working on Intel architecture. Could you please explain that mapping first? Here is the link to its doc: https://software.intel.com/sites/default/files/managed/71/a2/Compute%20Architecture%20of%20Intel%20Processor%20Graphics%20Gen8.pdf – Manish Kumar Aug 04 '15 at 13:48
