Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data usually needs to be laid out in structure-of-arrays form and processed in long contiguous streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
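As a rough illustration of the structure-of-arrays point, here is a minimal sketch in C. The names `ParticleAoS`, `scale_x_aos`, and `scale_x_soa` are hypothetical and do not come from any library, and whether either loop actually gets vectorized depends on the compiler, target, and flags.

```c
#include <stddef.h>

/* Array-of-structures layout (hypothetical type): each x is interleaved
   with y and z, so consecutive x values are typically 12 bytes apart. */
typedef struct {
    float x, y, z;
} ParticleAoS;

/* Strided access: to vectorize this loop a compiler generally needs
   gathers or shuffles, so it may stay scalar. */
void scale_x_aos(ParticleAoS *p, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x *= k;
}

/* Structure-of-arrays layout: all x values are contiguous, so a single
   vector load can fetch several of them at once. */
void scale_x_soa(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= k;
}
```

Both functions compute the same result; only the memory layout differs. Building with optimization enabled (for example `gcc -O3`) and comparing the generated assembly is one way to check what the compiler did with each loop.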
Questions tagged [simd]
2077 questions
34 votes, 2 answers
CPU SIMD vs GPU SIMD?
GPUs use the SIMD paradigm, that is, the same portion of code is executed in parallel and applied to various elements of a data set.
However, CPUs also use SIMD and provide instruction-level parallelism. For example, as far as I know,…
carmellose
33 votes, 4 answers
What's missing/sub-optimal in this memcpy implementation?
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation:
__forceinline // Since Size is usually known,
//…
einpoklum
31 votes, 5 answers
Why ARM NEON not faster than plain C++?
Here is some C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )
void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}
Here is a NEON version:
void…
Smalti
30 votes, 1 answer
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
Consider the following simple program:
#include
#include
#include
void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
…
BeeOnRope
29 votes, 3 answers
How to write portable simd code for complex multiplicative reduction
I want to write fast SIMD code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n ) {
    complex float p = 1.0;
    for (int i = 0; i < n; i++)
        p *=…
eleanora
29 votes, 2 answers
How to implement atoi using SIMD?
I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.
If it's a speed gain, multiple atoi results…
the_drow
29 votes, 1 answer
What are the best instruction sequences to generate vector constants on the fly?
"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count.
Constant-generation is by its very nature the start of a fresh dependency chain,…
Peter Cordes
29 votes, 3 answers
Intel AVX: 256-bits version of dot product for double precision floating point variables
The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM register) for double precision floating point variables. The "Why?" question has been very briefly treated in another forum (here) and on Stack Overflow…
gleeen.gould
27 votes, 5 answers
Good portable SIMD library
Can anyone recommend a portable SIMD library that provides a C/C++ API, works with Intel and AMD extensions, and is compatible with Visual Studio and GCC? I'm looking to speed up things like scaling a 512x512 array of doubles, vector dot products, matrix…
Budric
27 votes, 4 answers
print a __m128i variable
I'm trying to learn to code using intrinsics, and below is some code that does addition.
Compiler used: icc
#include
#include
int main()
{
    __m128i a = _mm_set_epi32(1,2,3,4);
    __m128i b = _mm_set_epi32(1,2,3,4);
…
arunmoezhi
26 votes, 2 answers
Implementation of __builtin_clz
What is the implementation of GCC's (4.6+) __builtin_clz? Does it correspond to some CPU instruction on Intel x86_64 (AVX)?
Cartesius00
26 votes, 5 answers
How to check if compiled code uses SSE and AVX instructions?
I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most…
BadProgrammer99
26 votes, 2 answers
Haskell math performance on multiply-add operation
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on measuring the performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f…
Steven Robertson
26 votes, 5 answers
Get member of __m128 by index?
I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
…
benwad
25 votes, 1 answer
GCC fails to optimize aligned std::array like C array
Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
static constexpr size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
typedef double Vec[my_elements] alignas(32);
#else
typedef…
John Zwinck