I am doing some performance critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.

Now, I remember reading that floating point calculations were very slow around the 386 days, when (IIRC) the FPU was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware, but what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?

The programs I used are as follows; they are not identical by any means:

Program 1:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
        accum += rand( ) % 365;
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    float accum = 0;
    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
        accum += (float)( rand( ) % 365 );
    std::cout << accum << std::endl;

    return 0;
}

Thanks in advance!

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2(pasted from a comment below): We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).

  • 8
    What you have as your test now is trivial. There's also probably very little difference in the assembly, (`addl` replaced with `fadd`, for example). The only way to really get a good measurement is get a core part of your real program and profile different versions of that. Unfortunately that can be pretty hard without using tons of effort. Perhaps telling us the target hardware and your compiler would help people at least give you pre-existing experience, etc. About your integer use, I suspect you could make a sort of `fixed_point` template class that would ease such work tremendously. – GManNickG Mar 31 '10 at 03:22
  • 1
    There are still a lot of architectures out there that don't have dedicated floating point hardware - some tags explaining the systems you care about will help you get better answers. – Carl Norum Mar 31 '10 at 03:24
  • That is a good point. At the moment we have a large code base, and I am trying to make the argument that it would be essentially the same "speed" in any case. Hoping to find some evidence to support my point of view - to justify the work of switching over. Anyway - thanks for the template class idea - I will try that. – maxpenguin Mar 31 '10 at 03:27
  • @Carl Norum good point, I care about regular x86 or x86-64 desktop machines running Linux and Windows – maxpenguin Mar 31 '10 at 03:30
  • 3
    I believe the hardware in my HTC Hero (android) doesn't have FPU, but the hardware in the Google NexusOne (android) does. what is your target? desktop/server pcs? netbooks (possible arm+linux)? phones? – SteelBytes Mar 31 '10 at 03:33
  • 5
    If you want fast FP on x86, try to compile with optimization and SSE code generation. SSE (whatever version) can do at least float add, subtract, and multiply in a single cycle. Divide, mod, and higher functions will *always* be slow. Also note that `float` gets the speed boost, but usually `double` doesn't. – Mike D. Mar 31 '10 at 04:18
  • 1
    Fixed-point integer approximates FP by using multiple integer operations to keep the results from overflowing. That's almost always slower than just using the extremely capable FPUs found in modern desktop CPUs. e.g. MAD, the fixed-point mp3 decoder, is slower than libmpg123, and even though it's good quality for a fixed point decoder, libmpg123 still has less rounding error. http://www.wezm.net/technical/2008/04/mp3-decoder-libraries-compared/ for benchmarks on a PPC G5. – Peter Cordes Jul 13 '15 at 01:51
  • See this: http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/ – Kiran K. Jun 07 '16 at 13:02
  • Are you saying the machine is only 8 bits wide and 10000 changes back or 10000 wide and 8 cycles back clarifying this could tell an engineering cmos student if he should be in engineering with nand gates and how many.......please don't make fun of me. update: Sorry I didn't mean to post question as answer – Austin Perdue Sep 19 '19 at 00:58
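To make the `fixed_point` idea from the comments concrete, here is a minimal sketch of what such a type might look like. The name `Fixed16` and the Q16.16 layout are assumptions chosen for illustration, not anything from the question's code base. It mainly shows why fixed-point multiply and divide cost more than one integer instruction each, which is the point Peter Cordes makes above.

```cpp
#include <cstdint>

// Hypothetical Q16.16 fixed-point type: 16 integer bits, 16 fractional bits.
// The name and layout are illustrative only.
struct Fixed16 {
    std::int32_t raw;  // real value multiplied by 2^16

    static Fixed16 from_double(double d) {
        return Fixed16{static_cast<std::int32_t>(d * 65536.0)};
    }
    double to_double() const { return raw / 65536.0; }
};

// Addition is a single integer add, the same cost as plain int.
inline Fixed16 operator+(Fixed16 a, Fixed16 b) {
    return Fixed16{a.raw + b.raw};
}

// Multiplication needs a widening multiply and a rescale by 2^16.
// It silently overflows if the result does not fit in 16 integer bits.
inline Fixed16 operator*(Fixed16 a, Fixed16 b) {
    std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
    return Fixed16{static_cast<std::int32_t>(wide / 65536)};
}

// Division needs the dividend widened and pre-scaled before the divide.
inline Fixed16 operator/(Fixed16 a, Fixed16 b) {
    std::int64_t wide = static_cast<std::int64_t>(a.raw) * 65536;
    return Fixed16{static_cast<std::int32_t>(wide / b.raw)};
}
```

With `float` each of these is one instruction and the range handling comes for free, so whether a type like this wins anything would have to be measured on the real workload.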

11 Answers


For example (lower numbers are faster):

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

As Dan pointed out, even once you normalize for clock frequency (which can be misleading in itself in pipelined designs), results will vary wildly based on CPU architecture (individual ALU/FPU performance, as well as actual number of ALUs/FPUs available per core in superscalar designs which influences how many independent operations can execute in parallel -- the latter factor is not exercised by the code below as all operations below are sequentially dependent.)

Poor man's FPU/ALU operation benchmark:

#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

// Wall-clock time in seconds, as a double
double mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("gettimeofday");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v  = 0;
  // Do not use constants or repeating values
  //  to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  //  impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);
  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
  • 8
    Why did you mix mult and div? Wouldn't it be interesting to see whether mult is maybe (or expectedly?) much faster than div? – Kyss Tao Mar 28 '12 at 15:06
  • 14
    Multiplication is much faster than division in both integer and floating point cases. Division performance also depends on the size of the numbers. I usually assume that division is ~15 times slower. – Sogartar Aug 08 '12 at 11:48
  • 5
    http://pastebin.com/Kx8WGUfg I took your benchmark and separated out each operation to its own loop and added `volatile` to make sure. On Win64, the FPU is unused and MSVC will not generate code for it, so it compiles using `mulss` and `divss` XMM instructions there, which are 25x faster than the FPU in Win32. Test machine is Core i5 M 520 @ 2.40GHz – James Dunne Jan 02 '13 at 18:28
  • 4
    @JamesDunne just be careful: for FP ops `v` will very quickly reach either 0 or +/-inf, which may or may not be (theoretically) treated as a special case/fast-pathed by certain FPU implementations. – vladr Jan 03 '13 at 18:18
  • In my experience on my CPU 32-bit integer multiplication is 1 cycle tops whereas division tends to be closer to 8 cycles, so it's a terrible idea to mix the two. Same for floats, huge difference in performance. Also some of your tests are clearly optimised out entirely (when you see 0 cycles...). – Michel Rouzic Sep 05 '13 at 13:29
  • Division is indeed slower, but my results (on a 22nm Core i7) show that integer division is 8 times slower than integer multiplication, that floating point division is only twice as slow as floating point multiplication, and that floating point division is somehow twice as fast as integer division, I guess due to MMX. – VoidStar Apr 28 '15 at 07:35
  • 3
    This "benchmark" has no data parallelism for out-of-order execution, because every operation is done with the same accumulator (`v`). On recent Intel designs, divide isn't pipelined at all (`divss`/`divps` has 10-14 cycle latency, and the same reciprocal throughput). `mulss` however is 5 cycle latency, but can issue one every cycle. (Or two per cycle on Haswell, since port 0 and port 1 both have an multiplier for FMA). – Peter Cordes Jul 13 '15 at 01:07
  • @JamesDunne: x87 FP math isn't THAT slow. The compiler can't auto-vectorize when targeting it, and its stack-based operation takes extra instructions, but `fmul` has the same 5 cycle latency and 1 cycle recip throughput that `mulss` has. Timings for `fadd` match `addss`, too. See http://agner.org/optimize/ for instruction tables. – Peter Cordes Jul 13 '15 at 01:10
  • I disassembled this, and it seems compilers are much better at optimizing for float than int. When doing a fair comparison of add and addss in a loop in assembly language, add outperforms addss by 25 times! – thang Feb 08 '16 at 01:13
  • Just a note that @JamesDunne pastebin can be compiled with `g++ -fpermissive -O3 -o benchmark-pc benchmark-pc.c` where `benchmark-pc.c` is what I saved the pastebin as – MrMesees Jan 10 '17 at 14:07
  • 1
    Isn't getting integer adds/subs in 0.000000 time suspicious in the above table? @James Dunne's version gets more reasonable values. – Yogurt Nov 05 '20 at 07:02
  • @Yogurt yes and no. The differences in timing here are due to @James Dunne's use of `volatile`, which is forcing a memory read/write around each individual operation (vs. performing operation sequences using registers alone). This is not necessarily realistic. FWIW, the "more reasonable values" are not actually measuring ALU performance, but rather the overhead of the memory subsystem (~1s for each integer loop, ~2s for each FP loop). Compile the same code w/ and w/o `volatile`, and compare the numbers you get as well as the disassembly. – vladr Mar 09 '21 at 16:49

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables to performance...especially between integer & floating point math. It varies strongly from processor to processor (even within the same family such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, and others (such as division) take much, much longer.

The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.

I assume that you are asking this question because you are working on a performance critical application. If you are developing for the x86 architecture, and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math).

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.

  • 8
    And on some processors FP math can be faster than integer math. The Alpha processor had a FP divide instruction but not an integer one, so integer division had to be done in software. – Gabe Mar 31 '10 at 04:49
  • Will SSEx also speed up double precision arithmetic? I'm sorry, I'm not too familiar with SSE – Johannes Schaub - litb May 06 '16 at 09:02
  • 1
    @JohannesSchaub-litb: SSE2 (baseline for x86-64) has packed `double`-precision FP. With only two 64-bit `double`s per register, the potential speedup is smaller than `float` for code that vectorizes well. Scalar `float` and `double` use XMM registers on x86-64, with legacy x87 only used for `long double`. (So @ Dan: no, MMX registers don't conflict with normal FPU registers, because normal FPU on x86-64 is the SSE unit. MMX would be pointless because if you can do integer SIMD, you want 16-byte `xmm0..15` instead of 8-byte `mm0..7`, and modern CPUs have worse MMX than SSE throughput.) – Peter Cordes May 28 '18 at 22:29
  • 1
    But MMX and SSE*/AVX2 integer instructions do compete for the same execution units, so using both at once is almost never useful. Just use the wider XMM / YMM versions to get more work done. Using SIMD integer and FP at the same time competes for the same registers, but x86-64 has 16 of them. But total throughput limits mean you can't get twice as much work done by using integer and FP execution units in parallel. – Peter Cordes May 28 '18 at 22:45

TIL this varies (a lot). Here are some results using the GNU compiler (BTW, I also checked by compiling on the machines themselves: GNU g++ 5.4 from Xenial is a hell of a lot faster than 4.6.3 from Linaro on Precise).

Intel i7 4700MQ xenial

short add: 0.822491
short sub: 0.832757
short mul: 1.007533
short div: 3.459642
long add: 0.824088
long sub: 0.867495
long mul: 1.017164
long div: 5.662498
long long add: 0.873705
long long sub: 0.873177
long long mul: 1.019648
long long div: 5.657374
float add: 1.137084
float sub: 1.140690
float mul: 1.410767
float div: 2.093982
double add: 1.139156
double sub: 1.146221
double mul: 1.405541
double div: 2.093173

Intel i3 2370M has similar results

short add: 1.369983
short sub: 1.235122
short mul: 1.345993
short div: 4.198790
long add: 1.224552
long sub: 1.223314
long mul: 1.346309
long div: 7.275912
long long add: 1.235526
long long sub: 1.223865
long long mul: 1.346409
long long div: 7.271491
float add: 1.507352
float sub: 1.506573
float mul: 2.006751
float div: 2.762262
double add: 1.507561
double sub: 1.506817
double mul: 1.843164
double div: 2.877484

Intel(R) Celeron(R) 2955U (Acer C720 Chromebook running xenial)

short add: 1.999639
short sub: 1.919501
short mul: 2.292759
short div: 7.801453
long add: 1.987842
long sub: 1.933746
long mul: 2.292715
long div: 12.797286
long long add: 1.920429
long long sub: 1.987339
long long mul: 2.292952
long long div: 12.795385
float add: 2.580141
float sub: 2.579344
float mul: 3.152459
float div: 4.716983
double add: 2.579279
double sub: 2.579290
double mul: 3.152649
double div: 4.691226

DigitalOcean 1GB Droplet Intel(R) Xeon(R) CPU E5-2630L v2 (running trusty)

short add: 1.094323
short sub: 1.095886
short mul: 1.356369
short div: 4.256722
long add: 1.111328
long sub: 1.079420
long mul: 1.356105
long div: 7.422517
long long add: 1.057854
long long sub: 1.099414
long long mul: 1.368913
long long div: 7.424180
float add: 1.516550
float sub: 1.544005
float mul: 1.879592
float div: 2.798318
double add: 1.534624
double sub: 1.533405
double mul: 1.866442
double div: 2.777649

AMD Opteron(tm) Processor 4122 (precise)

short add: 3.396932
short sub: 3.530665
short mul: 3.524118
short div: 15.226630
long add: 3.522978
long sub: 3.439746
long mul: 5.051004
long div: 15.125845
long long add: 4.008773
long long sub: 4.138124
long long mul: 5.090263
long long div: 14.769520
float add: 6.357209
float sub: 6.393084
float mul: 6.303037
float div: 17.541792
double add: 6.415921
double sub: 6.342832
double mul: 6.321899
double div: 15.362536

This uses code from http://pastebin.com/Kx8WGUfg as benchmark-pc.c

g++ -fpermissive -O3 -o benchmark-pc benchmark-pc.c

I've run multiple passes, and the general numbers seem to stay the same.

One notable exception seems to be ALU mul vs FPU mul. Addition and subtraction seem trivially different.

Here is the above in chart form (lower is faster and preferable):

[Chart of above data]

Update to accommodate @Peter Cordes:


i7 4700MQ Linux Ubuntu Xenial 64-bit (all patches to 2018-03-13 applied)
    short add: 0.773049
    short sub: 0.789793
    short mul: 0.960152
    short div: 3.273668
      int add: 0.837695
      int sub: 0.804066
      int mul: 0.960840
      int div: 3.281113
     long add: 0.829946
     long sub: 0.829168
     long mul: 0.960717
     long div: 5.363420
long long add: 0.828654
long long sub: 0.805897
long long mul: 0.964164
long long div: 5.359342
    float add: 1.081649
    float sub: 1.080351
    float mul: 1.323401
    float div: 1.984582
   double add: 1.081079
   double sub: 1.082572
   double mul: 1.323857
   double div: 1.968488

AMD Opteron(tm) Processor 4122 (precise, DreamHost shared-hosting)
    short add: 1.235603
    short sub: 1.235017
    short mul: 1.280661
    short div: 5.535520
      int add: 1.233110
      int sub: 1.232561
      int mul: 1.280593
      int div: 5.350998
     long add: 1.281022
     long sub: 1.251045
     long mul: 1.834241
     long div: 5.350325
long long add: 1.279738
long long sub: 1.249189
long long mul: 1.841852
long long div: 5.351960
    float add: 2.307852
    float sub: 2.305122
    float mul: 2.298346
    float div: 4.833562
   double add: 2.305454
   double sub: 2.307195
   double mul: 2.302797
   double div: 5.485736

Intel Xeon E5-2630L v2 @ 2.4GHz (Trusty 64-bit, DigitalOcean VPS)
    short add: 1.040745
    short sub: 0.998255
    short mul: 1.240751
    short div: 3.900671
      int add: 1.054430
      int sub: 1.000328
      int mul: 1.250496
      int div: 3.904415
     long add: 0.995786
     long sub: 1.021743
     long mul: 1.335557
     long div: 7.693886
long long add: 1.139643
long long sub: 1.103039
long long mul: 1.409939
long long div: 7.652080
    float add: 1.572640
    float sub: 1.532714
    float mul: 1.864489
    float div: 2.825330
   double add: 1.535827
   double sub: 1.535055
   double mul: 1.881584
   double div: 2.777245
  • gcc5 maybe auto-vectorizes something that gcc4.6 didn't? Is `benchmark-pc` measuring some combination of throughput and latency? On your Haswell (i7 4700MQ), integer multiply is 1 per clock throughput, 3 cycle latency, but integer add/sub is 4 per clock throughput, 1 cycle latency (http://agner.org/optimize/). So presumably there's a lot of loop overhead diluting those numbers for add and mul to come out so close (long add: 0.824088 vs. long mul: 1.017164). (gcc defaults to not unrolling loops, except for fully unrolling very low iteration counts). – Peter Cordes Mar 12 '18 at 22:51
  • And BTW, why does it not test `int`, only `short` and `long`? On Linux x86-64, `short` is 16 bits (and thus has partial-register slowdowns in some cases), while `long` and `long long` are both 64-bit types. (Maybe it's designed for Windows where x86-64 still uses 32-bit `long`? Or maybe it's designed for 32-bit mode.) On Linux, [the x32 ABI has 32-bit `long` in 64-bit mode](https://en.wikipedia.org/wiki/X32_ABI), so if you have the libraries installed, use `gcc -mx32` to compile for ILP32. Or just use `-m32` and look at the `long` numbers. – Peter Cordes Mar 12 '18 at 22:55
  • And you should really check if your compiler auto-vectorized anything. e.g. using `addps` on xmm registers instead of `addss`, to do 4 FP adds in parallel in one instruction that's as fast as scalar `addss`. (Use `-march=native` to allow using whatever instruction sets your CPU supports, not just the SSE2 baseline for x86-64). – Peter Cordes Mar 12 '18 at 22:56
  • @cincodenada please leave the charts showing the full 15 up the side as it's then illustrative of performance. – MrMesees Mar 12 '18 at 23:16
  • @PeterCordes I will try to look tomorrow, thank you for your diligence. – MrMesees Mar 12 '18 at 23:18
  • @MrMesees I scaled all the charts to a similar height because the question is concerned with floating point vs integer, not different processors vs each other. The main thing of concern in different processors is the relative differences, not absolute performance. Your call though. – cincodenada Mar 12 '18 at 23:41
  • @cincodenada I still feel like I understood less looking at the relatively scaled graphics, but maybe there is a place for both if it's labelled? I <3 your contribution. – MrMesees Mar 13 '18 at 17:40
  • I looked at the benchmark source. It does `v *= foo;` or `v += foo` in an unrolled (by 4) loop where `foo` is a runtime variable (but loop invariant). So it's measuring latency not throughput, but much of the difference between add and mul is hidden by using `volatile v` so the compiler has to store/reload inside the loop; instead of seeing 3x the latency for this dependency chain, you only see `5+1` vs. `5+3`. Using `volatile Type sink = v;` inside the loop would force it to store every result separately, but allow it to keep `v` in a register. You can also use inline asm to escape... – Peter Cordes Mar 14 '18 at 04:10
  • fancy forking the gist and linking so I have half an idea what you're talking about (it's someone else code, just compiled, now modified with results, you're flying planes over my head) – MrMesees Mar 14 '18 at 22:58
  • Beware that all the benchmark versions you timed use `volatile`, meaning that a memory read and a memory write are forced around each individual operation. As a result, some of your measurements are actually benchmarking the memory subsystem, not the ALUs. – vladr Mar 09 '21 at 17:00
  • @vladr did you have suggested edits that would prevent inlining and other side-effects producing results that can't be trusted? It's a pastebin. I'm always interested to see other approaches. – MrMesees Mar 13 '21 at 21:48

There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs FPU is completely irrelevant. Instead, the number of integer and floating-point registers (real registers, not register names) on your architecture which are not otherwise used by your computation (e.g. for loop control), the number of elements of each type which fit in a cache line, optimizations possible considering the different semantics for integer vs. floating point math -- these effects will dominate. The data dependencies of your algorithm play a significant role here, so that no general comparison will predict the performance gap on your problem.

For example, integer addition is associative, so if the compiler sees a loop like the one you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them when the loop terminates. Floating-point addition is not associative (rounding makes the result depend on the order of operations), so by default the compiler has to do the operations in the same order you requested to guarantee the same result. That disallows reordering, so there's a strong dependency of each addition on the result of the previous one.
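The reordering described above can be sketched by hand. The function names below are made up for illustration. `sum_partial` does manually what a compiler is allowed to do on its own for integers, and `regrouping_changes_result` shows one case where regrouping a float sum gives a different answer, which is why the compiler will not do it for you without being told that is acceptable.

```cpp
#include <cstddef>
#include <vector>

// Strictly ordered sum: every add depends on the previous one,
// so the loop is limited by the latency of a single add.
float sum_in_order(const std::vector<float>& data) {
    float acc = 0.0f;
    for (float x : data) acc += x;
    return acc;
}

// Four independent partial sums, combined once at the end.
// The four chains do not depend on each other inside the loop.
float sum_partial(const std::vector<float>& data) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= data.size(); i += 4) {
        acc[0] += data[i];
        acc[1] += data[i + 1];
        acc[2] += data[i + 2];
        acc[3] += data[i + 3];
    }
    for (; i < data.size(); ++i) acc[0] += data[i];  // leftover elements
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

// Same three values, grouped two ways. In the second grouping the 1.0f
// is absorbed by the huge value before the two huge values cancel.
bool regrouping_changes_result() {
    float big = 1.0e20f, one = 1.0f;
    float left = (big + -big) + one;
    float right = big + (-big + one);
    return left != right;
}
```

For small whole numbers both sums agree exactly. For general data they can differ in the last bits, so whether the partial-sum version is acceptable depends on the application.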

You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.

Ben Voigt
  • 4
    +1 for pointing out how naive benchmarks can yield 0-time loops because of unrolled constant integer operations. Moreover, the compiler can completely discard the loop (integer or FP) if the result is not actually used. – vladr Mar 31 '10 at 06:20
  • The conclusion to that is : one must call a function having the looping variable as argument. Since i think no compiler could be able to see that the function does nothing and that the call can be ignored. Since there's a call overhead, only the differences of time == ( float time - integer time ) will be significant. – GameAlchemist Nov 11 '13 at 06:21
  • 1
    @GameAlchemist: Many compilers do eliminate calls to empty functions, as a side effect of inlining. You have to make an effort to prevent that. – Ben Voigt Apr 03 '14 at 15:08
  • The OP sounded like he was talking about using integer for things where FP would be a more natural fit, so it would take more integer code to achieve the same result as the FP code. In this case, just use FP. For example, on hardware with an FPU (e.g. a desktop CPU), fixed-point integer MP3 decoders are slower (and slightly more rounding errors) than floating-point decoders. Fixed-point implementations of codecs mainly exist to run on stripped-down ARM CPUs with no FP hardware, only slow emulated FP. – Peter Cordes Jul 13 '15 at 01:42
  • one example for the first point: on x86-64 with AVX-512 there are only 16 GP registers but 32 zmm registers so scalar floating-point math *may* be faster – phuclv Jul 29 '19 at 06:48

Addition is much faster than rand, so your program is (especially) useless.

You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?

Generally, attempting FP jobs with integer arithmetic is a recipe for slow.

  • Yeah, as well as the conversion from a rand integer to a float in the floating point version. Any ideas on a better way to test this? – maxpenguin Mar 31 '10 at 03:32
  • 1
    If you're trying to profile speed, look at POSIX's `clock_gettime` with `struct timespec`, or something similar. Record the time at the start and end of the loop and take the difference. Then move the `rand` data generation out of the loop. Make sure that your algorithm gets all its data from arrays and puts all its data in arrays. That gets your actual algorithm by itself, and gets setup, malloc, result printing, everything but task switching and interrupts out of your profiling loop. – Mike D. Mar 31 '10 at 04:15
  • 3
    @maxpenguin: the question is what you are testing. Artem has assumed you are doing graphics, Carl considered whether you're on an embedded platform sans FP, I supposed you're coding science for a server. You can't generalize or "write" benchmarks. Benchmarks are sampled from the actual work your program does. One thing I can tell you is that it won't remain "essentially the same speed" if you touch the performance-critical element in your program, whatever that is. – Potatoswatter Mar 31 '10 at 04:39
  • good point and good answer. We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards. Anyway, thanks for your help. – maxpenguin Mar 31 '10 at 05:32
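Mike D.'s suggestion in the comments above could look roughly like the sketch below. It uses `std::chrono::steady_clock` instead of the POSIX timer so it also builds on Windows. The helper names `make_input` and `time_sum` and the `Timing` struct are made up for this example. Taking the best of several repeats is one simple way to reduce the noise the question mentions on a virtual server. It is only a sketch of the measuring pattern, and an optimizer is still free to vectorize or otherwise transform the loop.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Build the input before timing, so rand() and the modulo
// are not part of what gets measured.
template <typename T>
std::vector<T> make_input(std::size_t n, unsigned seed) {
    std::srand(seed);
    std::vector<T> data(n);
    for (auto& x : data) x = static_cast<T>(std::rand() % 365);
    return data;
}

struct Timing {
    double seconds;   // best (smallest) time over all repeats
    double checksum;  // returned so the sum cannot be discarded as unused
};

// Time only the summation loop, repeated a few times.
template <typename T>
Timing time_sum(const std::vector<T>& data, int repeats) {
    double best = 1e300;
    T acc{};
    for (int r = 0; r < repeats; ++r) {
        acc = T{};
        auto start = std::chrono::steady_clock::now();
        for (const T& x : data) acc += x;
        auto stop = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(stop - start).count();
        if (s < best) best = s;
    }
    return Timing{best, static_cast<double>(acc)};
}
```

Calling `time_sum(make_input<int>(n, seed), 5)` and `time_sum(make_input<float>(n, seed), 5)` with the same `n` and `seed` gives two timings over the same values, and comparing the checksums is a quick sanity check that both loops did the same work.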

Two points to consider -

Modern hardware can overlap instructions, execute them in parallel and reorder them to make best use of the hardware. Also, any significant floating point program is likely to have significant integer work too, even if it's only calculating indices into arrays, loop counters etc., so even if you have a slow floating point instruction it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point is that even if the floating point instructions are slower than the integer ones, your overall program may run faster because it can make use of more of the hardware.

As always, the only way to be sure is to profile your actual program.

The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating point values at the same time. For example, you can load 4 floats into a single SSE register and then perform 4 multiplications on them all in parallel. If you can rewrite parts of your code to use SSE instructions then it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this, see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.
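As a rough illustration of the "4 multiplications in parallel" case, here is a minimal sketch using the SSE intrinsics from `<xmmintrin.h>`, which GCC, Clang and Visual C++ all provide on x86 and x86-64. The function name `multiply_sse` is made up for the example, and it uses unaligned loads and stores so the caller's arrays need no special alignment.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, x86 / x86-64 only

// Computes out[i] = a[i] * b[i], four floats per multiply instruction.
void multiply_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // load 4 floats (unaligned is fine)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));  // 4 products at once
    }
    for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail for leftovers
}
```

An optimizing compiler will often generate similar code on its own from the plain scalar loop, so it is worth checking the disassembly before hand-writing intrinsics.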

  • One should note that on Win64, the FPU instructions are not generated by the MSVC compiler any more. Floating point is always using SIMD instructions there. This makes for a large speed discrepancy between Win32 and Win64 regarding flops. – James Dunne Jan 02 '13 at 18:29

Without the remainder operation, the floating point version would be much slower. Since all the adds are sequential, the CPU will not be able to parallelise the summation, so latency is critical. FPU add latency is typically 3 cycles, while integer add is 1 cycle. However, the divider used for the remainder operator will probably be the critical part, as it is not fully pipelined on modern CPUs. So, assuming the divide/remainder instruction consumes the bulk of the time, the difference due to add latency will be small.

Goran D

Today, integer operations are usually a little bit faster than floating point operations. So if you can do a calculation with the same operations in integer and floating point, use integer. HOWEVER you are saying "This causes a whole lot of annoying problems and adds a lot of annoying code". That sounds like you need more operations because you use integer arithmetic instead of floating point. In that case, floating point will run faster because

  • as soon as you need more integer operations, you probably need a lot more, so the slight speed advantage is more than eaten up by the additional operations

  • the floating-point code is simpler, which means it is faster to write the code, which means that if it is speed critical, you can spend more time optimising the code.

  • There is a lot of wild speculation here, not accounting for any of the secondary effects present in hardware, which often dominate computation time. Not a bad starting point, but it needs to be checked on each particular application via profiling, and not taught as gospel. – Ben Voigt Apr 03 '14 at 15:07

Unless you're writing code that will be called millions of times per second (such as, e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.

The usual first step for efficiency questions is to profile your code to see where the run-time is really spent. On Linux, gprof is one tool for this.


Though I suppose you can always implement the line drawing algorithm using integers and floating-point numbers, call it a large number of times and see if it makes a difference:
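One way such a comparison might be set up is sketched below: an integer-only rasteriser in the style of Bresenham's algorithm and a floating-point DDA version that produce the same kind of output. These are textbook formulations, and the names `line_int`, `line_float` and `Pixel` are made up for the example. Collecting pixels into a vector keeps the sketch checkable. In a timing run you would probably write into a preallocated buffer instead, so that allocation does not dominate.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <utility>
#include <vector>

using Pixel = std::pair<int, int>;

// Integer-only line rasterisation (Bresenham-style error accumulation).
std::vector<Pixel> line_int(int x0, int y0, int x1, int y1) {
    std::vector<Pixel> out;
    int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        out.emplace_back(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
    return out;
}

// Floating-point version (simple DDA): step along the longer axis
// and round the interpolated position to the nearest pixel.
std::vector<Pixel> line_float(int x0, int y0, int x1, int y1) {
    std::vector<Pixel> out;
    int steps = std::max(std::abs(x1 - x0), std::abs(y1 - y0));
    if (steps == 0) {
        out.emplace_back(x0, y0);
        return out;
    }
    float x = static_cast<float>(x0), y = static_cast<float>(y0);
    float x_inc = (x1 - x0) / static_cast<float>(steps);
    float y_inc = (y1 - y0) / static_cast<float>(steps);
    for (int i = 0; i <= steps; ++i) {
        out.emplace_back(static_cast<int>(std::lround(x)),
                         static_cast<int>(std::lround(y)));
        x += x_inc;
        y += y_inc;
    }
    return out;
}
```

The two versions can disagree on individual pixels where the ideal line passes exactly between two rows, so a fair test compares run time over many lines rather than expecting identical output.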


Artem Sokolov
  • 2
    Scientific applications use FP. The only advantage of FP is that precision is scale-invariant. It's like scientific notation. If you know the scale of the numbers already (eg, that the line length is a number of pixels), FP is obviated. But before you get to drawing the line, that's not true. – Potatoswatter Mar 31 '10 at 03:31

I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were:

  • short: 4.260s
  • int: 4.020s
  • long long: 3.350s
  • float: 7.330s
  • double: 7.210s
  • 1
    Source, compile options, and timing method? I'm a bit surprised by the results. – GManNickG Mar 31 '10 at 04:52
  • Same loop as OP with "rand( ) % 365" replaced by "1". No optimization. User time from "time" command. – dan04 Mar 31 '10 at 05:31
  • 13
    "No optimization" is the key. You never profile with optimization turned off, always profile in "release" mode. – Dean Harding Mar 31 '10 at 05:39
  • 2
    In this case, though, the optimization off forces the op to occur, and is done deliberately -- the loop is there to dilate time to a reasonable scale of measurement. Using the constant 1 removes the cost of rand(). A sufficiently smart optimizing compiler would see 1 added 100,000,000 times with no way out of the loop and simply add 100000000 in a single op. That sort of gets around the whole purpose, doesn't it? – Stan Rogers Oct 08 '10 at 15:01
  • 7
    @Stan, make the variable volatile. Even a smart optimizing compiler should honour the multiple ops then. – vladr Jun 26 '11 at 05:07

Based on that oh-so-reliable "something I've heard": back in the old days, integer calculations were about 20 to 50 times faster than floating point, and these days they're less than twice as fast.

James Curran
  • 1
    Please consider looking at this again offering more than opinion (especially given that the opinion seems to fly in the face of facts gathered) – MrMesees Jan 14 '17 at 18:45
  • 1
    @MrMesees While this answer is not terribly useful, I would say it is consistent with the tests you made. And the historical trivia is probably fine too. – Jonatan Öström Nov 04 '17 at 22:53
  • As someone who worked with 286s back in the day, I can confirm; "YES... they were!" – David H Parry Oct 14 '19 at 17:31