Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and to occur in long streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
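As a rough illustration of the structure-of-arrays point, here is a minimal C sketch contrasting the two layouts. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) are hypothetical and chosen only for this example. Whether a given compiler actually vectorizes either loop depends on the target and the optimization flags.

```c
#include <stddef.h>

/* Array-of-structures (AoS): the x, y, z, w of one point sit next to
   each other, so consecutive x values are 16 bytes apart in memory. */
typedef struct {
    float x, y, z, w;
} PointAoS;

/* Structure-of-arrays (SoA): all x values are contiguous, and likewise
   for y, z and w. Names are hypothetical, for illustration only. */
typedef struct {
    float *x, *y, *z, *w;
    size_t n;
} PointsSoA;

/* AoS version: each iteration touches one float out of every four,
   so a vector load would pick up y, z, w values it does not need. */
void scale_x_aos(PointAoS *p, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) {
        p[i].x *= k;
    }
}

/* SoA version: unit-stride access over one contiguous array, which
   maps directly onto packed loads, multiplies and stores. */
void scale_x_soa(PointsSoA *p, float k) {
    for (size_t i = 0; i < p->n; ++i) {
        p->x[i] *= k;
    }
}
```

Both functions compute the same result. The difference is only in memory layout, which is what tends to decide whether the loop can use full-width vector loads and stores.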
Questions tagged [simd]
2077 questions
300
votes
12 answers
How to compile Tensorflow with SSE4.2 and AVX instructions?
This is the message received from running a script to check if Tensorflow is working:
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125]…
GabrielChu
- 5,418
- 8
- 23
- 35
141
votes
5 answers
Header files for x86 SIMD intrinsics
Which header files provide the intrinsics for the different x86 SIMD instruction set extensions (MMX, SSE, AVX, ...)? It seems impossible to find such a list online. Correct me if I'm wrong.
fredoverflow
- 237,063
- 85
- 359
- 638
79
votes
8 answers
Subtracting packed 8-bit integers in a 64-bit integer by 1 in parallel, SWAR without hardware SIMD
I have a 64-bit integer that I'm interpreting as an array of 8 packed 8-bit integers. I need to subtract the constant 1 from each packed integer while handling overflow without the result of one element affecting the result of…
cam-white
- 700
- 3
- 9
76
votes
3 answers
Why is vectorization, faster in general, than loops?
Why, at the lowest level of the hardware performing operations and the general underlying operations involved (i.e.: things general to all programming languages' actual implementations when running code), is vectorization typically so dramatically…
Ben Sandeen
- 1,063
- 1
- 13
- 15
61
votes
4 answers
Fastest way to do horizontal SSE vector sum (or other reduction)
Given a vector of three (or four) floats, what is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?
What's the cost of moving to the FPU, then faddp, faddp?…
FeepingCreature
- 3,365
- 2
- 21
- 24
60
votes
3 answers
Parallel for vs omp simd: when to use each?
OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice over the other?
EDIT:
Here is an interesting paper related to the SIMD directive.
zr.
- 6,870
- 8
- 45
- 80
52
votes
5 answers
SSE intrinsic functions reference
Does anyone know of a reference listing the operation of the SSE intrinsic functions for gcc, i.e. the functions in the <*mmintrin.h> header files?
Thanks.
NGaffney
- 1,504
- 1
- 17
- 16
51
votes
2 answers
How to choose AVX compare predicate variants
In the Advanced Vector Extensions (AVX) compare intrinsics such as _mm256_cmp_ps, the last argument is a compare predicate.
The choices for the predicate overwhelm me.
They seem to be a triple of type, ordering, signaling.
E.g. _CMP_LE_OS is…
Bram
- 5,692
- 1
- 40
- 68
48
votes
4 answers
Getting started with Intel x86 SSE SIMD instructions
I want to learn more about using SSE.
What ways are there to learn, besides the obvious one of reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals?
I'm mainly interested in working with the GCC X86 Built-in Functions.
Liran Orevi
- 4,507
- 7
- 44
- 63
44
votes
4 answers
ARM Cortex-A8: What's the difference between VFP and NEON
In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.
But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?
I read a few links such as…
HaggarTheHorrible
- 6,335
- 16
- 63
- 79
42
votes
8 answers
How to determine if memory is aligned?
I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this:
void sse_func(const float* const ptr, int len){
if( ptr is aligned )
{
…
user229898
- 2,237
- 3
- 17
- 9
39
votes
5 answers
AVX2 what is the most efficient way to pack left based on a mask?
If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like…
Froglegs
- 915
- 1
- 7
- 19
37
votes
8 answers
Why is strcmp not SIMD optimized?
I've tried to compile this program on an x64 computer:
#include <cstring>
int main(int argc, char* argv[])
{
return ::std::strcmp(argv[0],
"really really really really really really really really really"
"really really really really…
user1095108
- 12,675
- 6
- 43
- 96
37
votes
4 answers
Why vectorizing the loop does not have performance improvement
I am investigating the effect of vectorization on the performance of a program. To that end, I have written the following code:
#include
#include
#include
#define LEN 10000000
int main(){
struct timeval…
Pouya
- 1,643
- 2
- 18
- 25
35
votes
1 answer
Difference between MOVDQA and MOVAPS x86 instructions?
I'm looking at the Intel datasheet, the Intel® 64 and IA-32 Architectures Software Developer’s Manual, and I can't find the difference between
MOVDQA: Move Aligned Double Quadword
MOVAPS: Move Aligned Packed Single-Precision
In the Intel datasheet I can find…
GJ.
- 10,234
- 2
- 39
- 58