
I have read a few posts (e.g., C++ built-in types) saying that, on a modern Intel Xeon CPU, there is no performance difference between using int32_t and using double.

However, I have noticed that when I do element-wise vector multiplication,

std::vector<T> a, b, c;
// run some initialization
for (std::size_t i = 0; i < 1000000; ++i) {
    c[i] = a[i] * b[i];
}

if I set T to int32_t, this piece of code runs much faster than if I set T to double.
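
A minimal self-contained version of what I am timing looks roughly like this (simplified sketch; the real initialization fills a and b with actual data, and time_multiply is just the name I use here):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

template <typename T>
void time_multiply(const char* name) {
    const std::size_t n = 1000000;
    std::vector<T> a(n, T(3)), b(n, T(5)), c(n);

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
    auto stop = std::chrono::steady_clock::now();

    // Print one element so the compiler cannot discard the whole loop.
    std::printf("%-8s %.6f s (c[0] = %g)\n", name,
                std::chrono::duration<double>(stop - start).count(),
                static_cast<double>(c[0]));
}

int main() {
    time_multiply<std::int32_t>("int32_t");
    time_multiply<double>("double");
}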

I am running this on a Xeon E5620 with CentOS.

Can anyone clarify a bit here? Is using int32_t faster or not?

SovietFrontier

2 Answers


You're running a million multiplications, reading two million inputs and writing one million outputs, i.e. three million values in total. With 4-byte values that's 12 MB of data; with 8-byte values it's 24 MB. The E5620 has 12 MB of L3 cache, so the int32_t working set just fits in cache, while the double working set does not and has to stream from main memory.
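
To make the arithmetic explicit, here is a quick back-of-the-envelope check (the 12 MB figure is the E5620's L3 cache size; in practice the threshold is fuzzier because other data also occupies the cache):

// Back-of-the-envelope working-set check (illustrative arithmetic, not a benchmark).
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t n = 1000000;                  // elements per vector
    const std::size_t vectors = 3;                  // a, b and c
    const std::size_t cache = 12u * 1024u * 1024u;  // E5620 L3 cache: 12 MB

    const std::size_t ws32 = vectors * n * sizeof(std::int32_t);  // 12,000,000 bytes
    const std::size_t ws64 = vectors * n * sizeof(double);        // 24,000,000 bytes

    std::printf("int32_t working set: %zu bytes (%s the 12 MB L3)\n",
                ws32, ws32 <= cache ? "fits in" : "exceeds");
    std::printf("double  working set: %zu bytes (%s the 12 MB L3)\n",
                ws64, ws64 <= cache ? "fits in" : "exceeds");
}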

MSalters

These are the results on my CPU:

Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, gcc 7.3

pure gcc, no optimization

short add/sub: 1.586071 [0]
short mul/div: 5.601069 [1]
long add/sub: 1.659803 [0]
long mul/div: 8.145207 [0]
long long add/sub: 1.826622 [0]
long long mul/div: 8.161891 [0]
float add/sub: 2.685403 [0]
float mul/div: 3.758135 [0]
double add/sub: 2.662717 [0]
double mul/div: 4.189572 [0]

with gcc -O3

short add/sub: 0.000001 [0]
short mul/div: 4.491903 [1]
long add/sub: 0.000000 [0]
long mul/div: 6.535028 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 6.543064 [0]
float add/sub: 1.182737 [0]
float mul/div: 2.218142 [0]
double add/sub: 1.183991 [0]
double mul/div: 2.529001 [0]

The result really depends on your architecture and on the optimization level. I remember a SPARC workstation at my university, some 20 years ago, that had better floating-point performance than integer performance.
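
If you want to reproduce this kind of per-operation timing, a rough sketch along these lines works. This is not the exact benchmark that produced the numbers above; note that the 0.000000 add/sub timings at -O3 strongly suggest those loops were optimized away entirely, so the sketch forces every operation through a volatile sink:

// Rough per-type mul/div timing loop (illustrative sketch only).
#include <chrono>
#include <cstdio>

template <typename T>
void time_mul_div(const char* name) {
    const long iterations = 100000000;
    volatile T sink = T(1);      // volatile: the compiler must keep every access
    const T x = T(1.000001);     // truncates to 1 for the integer types

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        sink = sink * x;         // one multiply per iteration
        sink = sink / x;         // one divide per iteration
    }
    auto stop = std::chrono::steady_clock::now();

    std::printf("%-10s mul/div: %f s\n", name,
                std::chrono::duration<double>(stop - start).count());
}

int main() {
    time_mul_div<int>("int");
    time_mul_div<long long>("long long");
    time_mul_div<float>("float");
    time_mul_div<double>("double");
}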

Please also see this nice talk.

kelalaka