
I have two versions of a loop iterating over vectors of 500 elements. One version uses vectors of 32-bit integers; the other uses vectors of 64-bit doubles. I was expecting the 32-bit version to be quicker because more useful data can be packed into the cache.

Compiler: MSVC; CPU: Ivy Bridge; compiling in 64-bit mode.

This is code 1, using 32-bit ints (runs in 2600 CPU cycles):

#include <vector>
#include <iostream>
#include <intrin.h>   // __rdtscp

int main(){

    std::vector<unsigned int> x1;
    std::vector<unsigned int> x2;
    std::vector<unsigned int> x3;
    x1.resize(500);
    x2.resize(500);
    x3.resize(500);

    for(int i = 0; i < 500; i++){
        x1[i] = i;
        x2[i] = 2*i;
        x3[i] = 4*i;
    }


    int counter = 0;
    while(counter < 1000){
        unsigned int p;
        unsigned int q;
        unsigned long long start = 0;
        unsigned long long end = 0;

        double m = 0;
        double n = 0;

        start = __rdtscp(&p);

        for(int i=0; i < 500; i++){
            unsigned int a = x1[i];
            unsigned int b = x2[i];
            unsigned int g = x3[i];
            m = m + (a * g);
            n = n + (b * g);
        }

        end = __rdtscp(&q);

        std::cout << (end - start) << "\t\t" << m << "\t" << n << std::endl;
        counter++;
    }
}

producing this asm (-Os):

start = __rdtscp(&p);
 rdtscp  
 lea         r8,[rbp+6Fh]  
 mov         dword ptr [r8],ecx  
 shl         rdx,20h  
 or          rax,rdx  
 mov         r10,rax  
        unsigned int p;
        unsigned int q;
        unsigned long long start = 0;
        unsigned long long end = 0;

        double m = 0;
 mov         r8,rbx  
 mov         r9d,1F4h  
            unsigned int a = x1[i];
            unsigned int b = x2[i];
            unsigned int g = x3[i];
 mov         edx,dword ptr [r8+r15]  
            m = m + (a * g);
 mov         ecx,edx  
 imul        ecx,dword ptr [r8+r14]  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rcx  
 addsd       xmm7,xmm0  
            n = n + (b * g);
 imul        edx,dword ptr [r8]  
 mov         eax,edx  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 addsd       xmm8,xmm0  
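 ; note the two xorps/cvtsi2sd pairs above: each iteration converts both
 ; integer products to double before they can be accumulated with addsd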

        for(int i=0; i < 500; i++){
 add         r8,4  
 dec         r9  
 jne         main+0E5h (013F681261h)  
        }

        end = __rdtscp(&q);
 rdtscp  
        }

        end = __rdtscp(&q);
 lea         r8,[rbp+6Fh]  
 mov         dword ptr [r8],ecx  
 shl         rdx,20h  
 or          rdx,rax  

This is code 2, using 64-bit doubles and one vector of 64-bit unsigned integers (runs in 2000 CPU cycles):

#include <vector>
#include <iostream>
#include <intrin.h>   // __rdtscp

int main(){

    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<unsigned long long> x3;
    x1.resize(500);
    x2.resize(500);
    x3.resize(500);

    for(int i = 0; i < 500; i++){
        x1[i] = i;
        x2[i] = 2*i;
        x3[i] = 4*i;
    }

    int counter = 0;
    while(counter < 1000){
        unsigned int p;
        unsigned int q;
        unsigned long long start = 0;
        unsigned long long end = 0;

        double m = 0;
        double n = 0;

        start = __rdtscp(&p);

        for(int i=0; i < 500; i++){
            double a = x1[i];
            double b = x2[i];
            unsigned long long g = x3[i];
            m = m + (a * g);
            n = n + (b * g);
        }

        end = __rdtscp(&q);

        std::cout << (end - start) << "\t\t" << m << "\t" << n << std::endl;
        counter++;
    }
}

and here is the asm (-Os) produced:

start = __rdtscp(&p);
 rdtscp  
 lea         r8,[rbp+6Fh]  
 mov         dword ptr [r8],ecx  
 shl         rdx,20h  
 or          rax,rdx  
 mov         r9,rax  
        unsigned int p;
        unsigned int q;
        unsigned long long start = 0;
        unsigned long long end = 0;

        double m = 0;
 mov         rdx,rbx  
 mov         r8d,1F4h  
            double a = x1[i];
            double b = x2[i];
            unsigned long long g = x3[i];
 mov         rcx,qword ptr [rdx+r15]  
 xorps       xmm1,xmm1  
            m = m + (a * g);
 cvtsi2sd    xmm1,rcx  
 test        rcx,rcx  
 jns         main+120h (013F32129Ch)  
 addsd       xmm1,xmm9  
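 ; cvtsi2sd is a signed conversion; the test/jns/addsd sequence fixes up
 ; values with the top bit set (xmm9 presumably holds 2^64 as a double)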
 movaps      xmm0,xmm1  
 mulsd       xmm0,mmword ptr [rdx+r14]  
 addsd       xmm6,xmm0  
            n = n + (b * g);
 mulsd       xmm1,mmword ptr [rdx]  
 addsd       xmm7,xmm1  

        for(int i=0; i < 500; i++){
 add         rdx,8  
 dec         r8  
 jne         main+10Ah (013F321286h)  
        }

        end = __rdtscp(&q);
 rdtscp  
        }

        end = __rdtscp(&q);
 lea         r8,[rbp+6Fh]  
 mov         dword ptr [r8],ecx  
 shl         rdx,20h  
 or          rdx,rax
mezamorphic
  • It's a 64-bit CPU, right? – Mitch Wheat May 07 '14 at 02:22
  • Floating point operations can be performed in parallel with other CPU instructions, so that might account for the discrepancy. Although the disassembly shows both using xmm registers, so now I'm confused too. – Mark Ransom May 07 '14 at 02:23
  • A similar question from another forum with some comparisons: [floating point vs integer](http://stackoverflow.com/questions/2550281/floating-point-vs-integer-calculations-on-modern-hardware). From what I understand, extended precision math routines like APFLOAT perform their operations using floating point, but I don't know if SSE type instructions are used. – rcgldr May 07 '14 at 02:28
  • I'm going to change the code just to include full compilable examples. – mezamorphic May 07 '14 at 02:33
  • Unfortunately, the assembler code is quite unreadable due to the loop unrolling. Can you rerun the test with -Os optimization and post the resulting code? If the effect persists with -Os, we will be in a much better position to actually point out what's going on on the assembler level. – cmaster - reinstate monica May 07 '14 at 02:35
  • @cmaster ok used -Os and there is still a difference. Have changed question to include complete-examples and the respective ASM after using -Os on each. – mezamorphic May 07 '14 at 02:49
  • Ivy Bridge supports AVX, so it can pack 4 doubles into a 256-bit ymm register. If you use 32-bit ints it can only pack 4 of them into a 128-bit xmm register, so there's no difference in the number of packed values. If possible, try this on a Haswell machine, which supports AVX2 for integer operations. – phuclv May 07 '14 at 02:52
  • @LưuVĩnhPhúc I don't understand your answer: it doesn't explain why the 64-bit double is faster to iterate across if they're both packing 4 numbers into the registers? – mezamorphic May 07 '14 at 02:58
  • Ok sussed it (I think). In the int version the accumulating variables (m and n) are still declared as type double. If I change these to int the int version is faster. Sound correct? – mezamorphic May 07 '14 at 03:25
  • I think that yes, that is your case... conversions between fp and integral types usually slow things down (sometimes a lot)... – Spektre May 07 '14 at 06:32
  • What happens during the conversion? Is it something the CPU does which we cannot see in the ASM? – mezamorphic May 07 '14 at 15:45

1 Answer


The difference is the conversion of integers to doubles in the first code: the vectors contain unsigned int and the products are computed in integer arithmetic, but the accumulation uses double, so the compiler has to emit a cvtsi2sd instruction for each of the two products on every iteration.

In the second code almost everything is already a double; only g needs converting, so you pay for one cvtsi2sd per iteration instead of two, and the code runs faster.
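As you found in the comments, the conversions can be kept out of the timed loop entirely by accumulating in integers and converting once at the end. A minimal sketch against code 1 (the unsigned long long accumulators are my assumption; they are comfortably wide enough for these value ranges):

unsigned long long mi = 0;
unsigned long long ni = 0;

for(int i = 0; i < 500; i++){
    unsigned int a = x1[i];
    unsigned int b = x2[i];
    unsigned int g = x3[i];
    mi += (unsigned long long)a * g;   // product and sum stay in integer registers
    ni += (unsigned long long)b * g;   // no cvtsi2sd inside the loop
}

double m = (double)mi;   // two conversions per measurement instead of two per iteration
double n = (double)ni;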

This difference would have been much more pronounced on a CPU with a stricter separation between its fixed-point and floating-point units (the POWER platform is an example of this); x86 is very forgiving in that respect. Also note that the cache advantage you expected never materializes here: both working sets (roughly 6 KB of ints versus 12 KB of 8-byte elements) fit comfortably in the 32 KB L1 data cache of an Ivy Bridge core.

cmaster - reinstate monica