-2

I wrote a rudimentary C program (PCB, for Prime C Benchmark) that benchmarks system speed by timing a prime-number search over all natural numbers between 0 and a user-entered limit (the user enters a 'Load Value', which is multiplied by 10^5).

On my Intel i5 5350U & LPDDR3 (MacBook Air 2017, using Apple Clang 11), a workload of 1 (i.e. primes up to 100,000) run 5 times takes an average of 25 seconds (plugged in and charging; it goes up to 50 seconds at 5% battery).

On my Exynos 9611 & LPDDR4x (Samsung M21, using the 'Coding C' app/compiler), the exact same workload run 5 times takes an average of 8 seconds!

On Windows (i5 3340M, Win7_SP2, VS2019 latest, Release build, x86), the program craps out completely! When run 5 times for any 'Load Value', I get a time taken of 0.0000! What?! There's clearly something amiss here...

Linux (Ubuntu 20.04, GCC, same hardware as Win, i3) takes 21.5 seconds. It appears to me that Linux and MacOS (so Apple Clang & GCC) are probably doing it right...

The code:

#include <stdio.h>
#include <time.h>

long count = 0;

void bench(double x) {
    register unsigned long n, i, q;
    for (q = 0; q <= x; q++) {
        for (i = 2; i <= q / 2; ++i) {
            if (q % i == 0)
                count++;
        }
    }  
}

int main() {
    double x;
    int y;
    printf( "\nPCB v0.1\nOpen-source Tool for Benchmarking System Speed.\n\nRecommended Load Value 1 - 3\n");
    printf("\nEnter Load Value : ");
    scanf("%lf", &x);
    printf("\nEnter Frequency for Repetition : ");
    scanf("%d", &y);
    x = x * 100000;
    printf("\nPress Enter to Run ");
    getchar();
    getchar();
    printf("\n(...Running...)\n");
    int z;
    for (z = 1; z <= y; z++) {
        clock_t t; 
        t = clock(); 
        bench(x); 
        t = clock() - t; 
        double time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds 
        printf("\nTime Taken #%d = %.4f seconds\n", z, time_taken);
    }
    printf("\nPress Enter to Exit ");
    getchar();
    return count;
}
A P Jo
  • Haven't Yet Compiled in Linux, that should be interesting.... – A P Jo Jul 23 '20 at 09:32
  • Where do you calculate primes? Without any output or other side effect the function `bench` could completely be removed during optimization leading to very short execution times. – Gerhardh Jul 23 '20 at 09:36
  • Your double loop in `bench()` is equivalent to `; /* null expression */`. A smart compiler can use the replacement (independently, I believe, of any optimization flags) – pmg Jul 23 '20 at 09:38
  • @Gerhardh Is it so? Can this be deactivated? I don't intend to print the prime numbers, what can I do? – A P Jo Jul 23 '20 at 09:39
  • @pmg is there a way to mitigate this smartness through small changes in code ? – A P Jo Jul 23 '20 at 09:41
  • It appears that Apple Clang 11 is the only one doing what the code says ?! – A P Jo Jul 23 '20 at 09:41
  • Did you try to turn optimization off in all compilers? – Gerhardh Jul 23 '20 at 09:41
  • You could, for example, count and return the number of primes and print that, so the compiler can optimize the hell out of your code without removing the actual operations you want to bench. – Ackdari Jul 23 '20 at 09:42
  • No, all compilers do what the code says. The result cannot be distinguished as there is no side effect. All but 1 compiler recognize the function as a rather complicated way to write a NOP. – Gerhardh Jul 23 '20 at 09:42
  • Maybe you can fool the compiler with `volatile register unsigned long i;` – pmg Jul 23 '20 at 09:44
  • Also, your function `bench` does not _really_ find any primes; it just checks for all numbers whether they are divisible by the numbers `i` with `2 <= i <= q/2`, but doesn't do anything with this information. So, no actual prime detection. – Ackdari Jul 23 '20 at 09:44
  • @Gerhardh I don't want to do that since I want it to be compilable right away anywhere – A P Jo Jul 23 '20 at 09:45
  • @Ackdari This counting seems like something I can do, but what if I don't want to print it? And well, I don't care about getting the primes as much as I intend to have a simple speed measurement – A P Jo Jul 23 '20 at 09:46
  • Maybe I return the count rather than zero for int main ... – A P Jo Jul 23 '20 at 09:48
  • @JakeFry If you don't print any results from the actual calculations of `bench` and you want it to be compilable by any C compiler, then it will always be at risk of being removed by the compiler due to optimization – Ackdari Jul 23 '20 at 09:49
  • _"Maybe I return the count rather than zero for int main"_ this could also work – Ackdari Jul 23 '20 at 09:50
  • @Ackdari I did this and the Linux system reports 21 sec, while Mac reports 25, but the Samsung is reporting 8 seconds !! Could this have to do with genuine speed with the 8-core ddr4 hardware ? – A P Jo Jul 23 '20 at 10:01
  • @JakeFry The usual trick is to just print the final value of `count` after the benchmark finishes. If you print it, the compiler can't delete it. – user253751 Jul 23 '20 at 10:52
  • @user253751 doing the printing function does not change the inexplicable 8.47 second result of the Samsung Phone (I checked before and after) and Windows now takes 2 seconds optimised for speed and 6.8 sec unoptimised – A P Jo Jul 23 '20 at 11:14
  • @JakeFry: this benchmark only uses a single core. Type `unsigned long` is 64-bits in Linux and OS/X vs 32-bit in Windows which may explain the big difference. Try using `unsigned long long` to avoid artificial differences. – chqrlie Jul 23 '20 at 11:49
  • Why is 8 seconds inexplicable? Perhaps your Mac build has optimization turned off. – user253751 Jul 23 '20 at 12:13
  • Basically you found a missed-optimization bug in clang and GCC: they don't optimize away calculation of an unused result. The MSVC result is the compiler doing a better job and not doing useless work. (Which is surprising; usually MSVC misses more optimizations than GCC and clang.) If you assign the return value to a `volatile int sink`, that would help (see the sketch after these comments). – Peter Cordes Jul 23 '20 at 13:43
  • @PeterCordes I've moved a lot from the code in the question, will be posting an answer with changed code... But yes apparently these rare things happen to me XD Imagine GCC not being better than MSVC Lol – A P Jo Jul 23 '20 at 13:46
  • @JakeFry: You can accept one of the answers by clicking on the gray checkmark below its score and upvote those that helped you. – chqrlie Jul 23 '20 at 15:21
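
A minimal sketch of the volatile-sink idea from the comments above (not part of the original thread; plain standard C, nothing else assumed). Storing the result into a volatile object makes it observable, so in practice the compiler has to keep the computation even at high optimization levels:

#include <stdio.h>
#include <time.h>

/* Same divisor-counting loops as in the question, but returning the count. */
static unsigned long long bench(unsigned long long limit) {
    unsigned long long count = 0;
    for (unsigned long long q = 0; q <= limit; q++) {
        for (unsigned long long i = 2; i <= q / 2; ++i) {
            if (q % i == 0)
                count++;
        }
    }
    return count;
}

int main(void) {
    volatile unsigned long long sink;   /* the store to sink cannot be removed */
    clock_t t = clock();
    sink = bench(100000);               /* result is "used", so the loops stay */
    t = clock() - t;
    printf("count = %llu, time = %.4f s\n",
           (unsigned long long)sink, (double)t / CLOCKS_PER_SEC);
    return 0;
}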

3 Answers

2

Your prime number enumeration code is flawed:

  • in the code initially posted, the for loop in the benchmark function had no side effect, so efficient compilers were able to optimise it and generate essentially no code. This explains the great disparity from one system to another.

  • in the last update, your algorithm does not compute the count of prime numbers; it merely performs a huge number of divisions and counts the number of times you get a zero remainder. This is much more costly than an actual prime number test, which is itself much less efficient than performing a Sieve of Eratosthenes.

For the purpose of measuring and comparing system performance, this method focuses disproportionately on the speed of the division opcode, and it shows great variation between Linux, OS/X and Windows, probably because of the size of type unsigned long, which is 64-bit on Linux and OS/X vs 32-bit on Windows, making the modulo operation faster on Windows even for the same set of numbers. Furthermore, this type of benchmark uses a single core, so it does not measure total system performance by a long shot.
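
As a quick sanity check (not part of the original answer), printing the type sizes shows the difference directly on each platform:

#include <stdio.h>

int main(void) {
    /* Typically 4 on Windows (LLP64) and on 32-bit targets, 8 on 64-bit Linux and OS/X (LP64). */
    printf("sizeof(unsigned long)      = %zu\n", sizeof(unsigned long));
    printf("sizeof(unsigned long long) = %zu\n", sizeof(unsigned long long));
    return 0;
}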

Relative performance of the different systems should be measured using a more diversified set of operations, stressing the CPU, memory, storage and communications systems.
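
For illustration only (this sketch is not part of the original answer and the kernel sizes are arbitrary), a more diversified benchmark could time independent kernels separately, e.g. one bound by integer division and one bound by sequential memory access, and report each figure on its own:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Kernel 1: dominated by the integer divide/modulo unit. */
static unsigned long long div_kernel(unsigned long long n) {
    unsigned long long hits = 0;
    for (unsigned long long q = 2; q <= n; q++) {
        if (n % q == 0)
            hits++;
    }
    return hits;
}

/* Kernel 2: dominated by sequential memory writes and reads. */
static unsigned long long mem_kernel(size_t bytes) {
    unsigned char *buf = malloc(bytes);
    unsigned long long sum = 0;
    if (buf == NULL)
        return 0;
    for (size_t i = 0; i < bytes; i++)
        buf[i] = (unsigned char)i;
    for (size_t i = 0; i < bytes; i++)
        sum += buf[i];
    free(buf);
    return sum;
}

int main(void) {
    volatile unsigned long long sink;   /* keep both results observable */
    clock_t t;

    t = clock();
    sink = div_kernel(50000000ULL);     /* tune the sizes for your machines */
    printf("division kernel: %.4f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    sink = mem_kernel((size_t)64 * 1024 * 1024);
    printf("memory kernel  : %.4f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    (void)sink;
    return 0;
}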

Regarding the prime number enumeration, here is a modified version with a prime test:

#include <limits.h>
#include <stdio.h>
#include <time.h>

unsigned long long bench(double x) {
    if (x < 0 || x >= ULLONG_MAX) {
        printf("invalid benchmark range\n");
        return 0;
    }
    unsigned long long n = (unsigned long long)x;
    unsigned long long count = 0;
    if (n >= 2)
        count++;
    for (unsigned long long p = 3; p <= n; p += 2) {
        count++;
        for (unsigned long long i = 3; i * i <= p; i += 2) {
            if (p % i == 0) {
                count--;
                break;
            }
        }
    }
    return count;
}

int main() {
    double x;
    int y;
    clock_t total = 0;
    unsigned long long count;
    double time_taken;

    printf("\nPCB v0.1\nOpen-source Tool for Benchmarking System Speed.\n\nRecommended Load Value 1 - 3\n");
    printf("\nEnter load value: ");
    if (scanf("%lf", &x) != 1)
        return 1;
    printf("\nEnter repeat count: ");
    if (scanf("%d", &y) != 1)
        return 1;
    x = x * 100000;
    printf("\nPress Enter to Run ");
    getchar();
    getchar();
    printf("\n(...Running...)\n");
    for (int z = 0; z < y; z++) {
        clock_t t;
        t = clock();
        count = bench(x);
        t = clock() - t;
        total += t;
        time_taken = ((double)t) / CLOCKS_PER_SEC; // in seconds
        printf("\n%llu primes, time taken #%d = %.4f seconds\n", count, z, time_taken);
    }
    time_taken = ((double)total) / CLOCKS_PER_SEC; // in seconds
    printf("\nAverage time taken = %.4f seconds\n", time_taken / y);
    printf("\nPress Enter to Exit ");
    getchar();
    return 0;
}

Output:

PCB v0.1
Open-source Tool for Benchmarking System Speed.
Recommended Load Value 1 - 3

Enter load value: 1
Enter repeat count: 5
Press Enter to Run
(...Running...)

9592 primes, time taken #0 = 0.0126 seconds
9592 primes, time taken #1 = 0.0117 seconds
9592 primes, time taken #2 = 0.0133 seconds
9592 primes, time taken #3 = 0.0136 seconds
9592 primes, time taken #4 = 0.0137 seconds

Average time taken = 0.0130 seconds
Press Enter to Exit

This is almost 2000x faster than the initial code on my laptop.

Running a load of 100 gives this output:

PCB v0.1
Open-source Tool for Benchmarking System Speed.
Recommended Load Value 1 - 3

Enter load value: 100
Enter repeat count: 5
Press Enter to Run
(...Running...)

664579 primes, time taken #0 = 7.4249 seconds
664579 primes, time taken #1 = 7.3742 seconds
664579 primes, time taken #2 = 7.4119 seconds
664579 primes, time taken #3 = 7.3887 seconds
664579 primes, time taken #4 = 7.6725 seconds

Average time taken = 7.4544 seconds
Press Enter to Exit

Which is still much slower than a sieve:

$ chqrlie > time prime -c 1..10000000
664579

real    0m0.009s
user    0m0.006s
sys     0m0.001s

Here is a simplistic implementation using the Sieve approach that is not quite as fast as the optimised one used in my primes utility, but still achieves an average time of 0.0773 seconds for a load of 100, a 100x improvement over the prime test loop:

#include <stdint.h>   // for SIZE_MAX
#include <stdlib.h>   // for calloc() and free()

unsigned long long bench(double x) {
    /* simplistic Sieve of Eratosthenes version */
    if (x < 0 || x >= SIZE_MAX) {
        printf("invalid benchmark range\n");
        return 0;
    }
    size_t count = 0;
    size_t n = (size_t)x + 1;   // array size
    if (n > 1) {
        unsigned char *a = calloc(n, 1);
        if (a == NULL) {
            printf("cannot allocate memory\n");
            return 0;
        }
        // 0 and 1 are considered composite
        a[0] = a[1] = 1;
        // flag all multiples of 2 as composite
        for (size_t i = 4; i < n; i += 2) {
            a[i] = 1;
        }
        for (size_t p = 3; p * p < n; p += 2) {
            // for all potential prime numbers
            if (a[p] == 0) {
                // if p is prime, flag all odd multiples of p as composite
                for (size_t i = p * p; i < n; i += 2 * p) {
                    a[i] = 1;
                }
            }
        }
        count = n;
        // count the number of composite numbers
        for (size_t i = 0; i < n; i++) {
            count -= a[i];
        }
        free(a);
    }
    return count;
}
chqrlie
  • I agree Mr. Chqrlie, my benchmark is flawed and maybe there is a lot wrong with it... What can be done to make it less flawed? – A P Jo Jul 23 '20 at 11:00
  • I don't mind much how fast it runs; I'm not actually trying to find the number of primes! I'm more interested in a reasonably simple CLI benchmark for a system's speed... And to that effect, I'm still very much clueless... – A P Jo Jul 23 '20 at 11:26
  • OK, but be aware that you are benchmarking a combination of factors: the hardware capabilities, the OS configuration of this hardware together with the load factor from other running applications, the compiler optimisation capabilities, which depend on the compiler itself, the target choices (32-bit vs: 64-bit, vectorisation options, etc.) and the compiler options as defined in the development environment used for this target. In the original benchmark, the limiting factor is the division operation. You need a more diversified test to measure system efficiency more generally and reliably. – chqrlie Jul 23 '20 at 11:46
  • The fact that the benchmark does more work than necessary doesn't make it a flawed benchmark, considering that the whole point of it is to do a bunch of unnecessary work. – user253751 Jul 23 '20 at 12:16
  • @user253751 Precisely. – A P Jo Jul 23 '20 at 13:04
  • @user253751: I agree. I will rephrase my answer. Yet the initial code was flawed as the performance measurement was too dependent on the compiler optimisation capabilities. The amended version is less prone to this but focuses disproportionally on the speed of the division opcode, and it shows a great variation between Linux, OS/X and Windows probably because of the size of type `unsigned long` which is 64-bit on Linux and OS/X vs 32-bit on Windows. Relative performance of the different systems should be measured using a more diversified set of operations. – chqrlie Jul 23 '20 at 13:22
  • Chqrlie and @user253751 I tried Chqrlie's code, and amended it slightly. It appears to be good. Important changes: increasing x = x·10^5 to x = x·10^7, as well as making count a double. Does making count a double make it better, since floating-point arithmetic (664,579 increments to count every loop) is now a part of this otherwise predominantly integer benchmark? Link to my adaptation: [ https://drive.google.com/file/d/1lN_iMSqxurVDfUFGqfo6YWuaggPiQMPy/view?usp=sharing ] – A P Jo Jul 23 '20 at 13:34
  • @JakeFry: this latest version does not use division/modulo anymore. It performs more memory accesses, which may be now the dominant factor, although 10MB may fit in L3 cache on some of the systems and not the others. – chqrlie Jul 23 '20 at 14:04
  • @chqrlie Does that make it more optimal for measuring performance? I have not used the Sieve code, I have used the one posted before; do please go to the link and see it. I didn't use the Sieve method precisely because it wasn't doing any arithmetic. Is memory access speed relevant? – A P Jo Jul 23 '20 at 14:11
  • @JakeFry: OK, so you still benchmark the division opcode, but with a more consistent operand type across architectures. There is no magic recipe to measure performance, just like measuring the top speed of a car or its raw engine power does not give a complete picture of the car's performance in different circumstances. You might want to benchmark both methods (with a different load multiple) to compare whether the figures vary consistently across systems. – chqrlie Jul 23 '20 at 14:51
  • @chqrlie I ran a load of 1 (10^7) on an old Samsung S3 Neo I had lying around and the performance was tearful... 190.8 seconds... I tried running 5 loops but after 30 mins I got nothing, so I gave up. The system runs Android 4.4 and has no real lag or slowness even in YouTube and Spotify, so I can now clearly understand that division is exceptional! – A P Jo Jul 23 '20 at 15:08
0

I would suggest changing the bench function to

unsigned long bench(double x) {
    register unsigned long i, q;
    unsigned long count = 0;
    for (q = 2; q <= x; q++) {
        int is_prime = 1;
        for (i = 2; i <= q / 2; ++i) {
            if (q % i == 0) {
                is_prime = 0;
                break;
            }
        }
        if (is_prime)
            count++;
    }
    return count;
}

and the main function to:

int main() {
    ...
    unsigned long count = 0;
    for (int z = 1; z <= y; z++) {
        clock_t t;
        t = clock();
        unsigned long c = bench(x);
        t = clock() - t;
        count += c;
        double time_taken = ((double)t) / CLOCKS_PER_SEC; // in seconds
        printf("\nTime Taken #%d = %.4f seconds\n", z, time_taken);
    }
    printf("\nPress Enter to Exit ");
    getchar();
    return count > 0 ? 0 : 1; // ensures the return value of bench is used
}

Also, for accurate measurements you should make sure that all compilers compile with maximum optimization. This is important because it might give the compilers the opportunity to use SIMD instructions to speed up the bench operation.

But also be aware that benchmarking a system/CPU with only one operation is not a good basis for general comparisons between systems/CPUs. Your bench function, for example, only benchmarks how fast a CPU can divide large quantities of consecutive numbers.

Ackdari
  • What would be a better way to benchmark system speed? Using other arithmetic operations? Using logical operations? (Don't we already use logical operators when running an immense for loop from 0 to 100,000?) – A P Jo Jul 23 '20 at 10:18
  • @JakeFry First: I myself am **not** an expert in benchmarking. But the operation you should use heavily depends on what you want to compare between a set of CPUs. If you want to compare the performance of division operations, use something like your code. I only wanted to point out that one should keep in mind that CPUs can perform differently well or badly on different operations. For example, CPU 1 could be heavily optimized for doing integer division while CPU 2 is optimized to do floating-point arithmetic but is comparably slow with integer division. (1/2) – Ackdari Jul 23 '20 at 10:27
  • Your benchmark would suggest that CPU 1 is better than CPU 2. But in fact they just focus on different things. (2/2) – Ackdari Jul 23 '20 at 10:28
  • @JakeFry: yup, computer performance can't be reduced to a single number. This is why different benchmarks exist, obviously. https://en.wikipedia.org/wiki/Benchmark_(computing). Some try to be "representative" of common number-crunching workloads, but even SPECcpu breaks it down into SPECfp and SPECint. (https://en.wikipedia.org/wiki/SPECfp#Background) Even the performance of a single asm instruction has about 3 dimensions (different axes to measure on): latency, front-end throughput cost, and back-end execution-port cost, although for a large program that's not directly visible. – Peter Cordes Jul 23 '20 at 12:02
0

@chqrlie Was on the right trail...

He was right about:

  • The for loop in bench() being rendered useless by any compiler doing good optimisations (so basically the 'Coding C' app and VS2019, but not Apple Clang & GCC)
  • This testing only integer calculations, so perhaps not being very accurate in all cases (which is an acceptable limitation, but more on that later)
  • The unsigned long becoming 32-bit in VS, and maybe in 'Coding C' as well, making those builds much faster than the 64-bit Linux and macOS builds.
  • His code having the right bench() function; I have used it, as well as his smart addition of an average-time output.

Further, what I might add :

Apart from UI improvements, like doing away with the per-loop output and just sticking to an average, as well as some capitalisation here and there, I have:

  • Made x = x·10^5 into x = x·10^7 to make the Load Value less inflated (it takes an average of 8 seconds on my devices for Load Value 1, and you can still run decimal values).
  • There might be other small changes, idk, the code is posted below.

For anyone more interested, I'll be building PCB for some time with the source in this file; feel free to check it out, suggest changes, or critique the code at apjo@tuta.io

A P Jo
  • Performance numbers are now much more believable: ~7 sec on the Samsung, ~9 sec on the Mac, ~9 sec (optimised) and 10 sec (unoptimised) on Windows – A P Jo Jul 23 '20 at 14:06
  • *tests only integers calculations* - that's not the main objection! The main objection is that it tests only division, which is very slow and not usually a major factor in performance of most code. Although Intel did improve the integer divider in Broadwell, and again in Ice Lake (especially for 64-bit operand-size which is *very* slow on Intel CPUs before Ice Lake: [Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux](https://stackoverflow.com/a/52558274)). – Peter Cordes Jul 23 '20 at 14:19
  • Your benchmark might show something like a 4x speedup (normalized for clock speed) on Ice Lake, but the average IPC gain across most common workloads is supposed to be about 20% IIRC. (Obviously much bigger gains are possible in code that can take advantage of AVX-512 on Ice Lake but only AVX2 on Skylake, like some SIMD FP number crunching) – Peter Cordes Jul 23 '20 at 14:22
  • @PeterCordes I'm in 11th grade and self-learning C in the quarantine times from a book, please speak English XD – A P Jo Jul 23 '20 at 14:23
  • That is English, and the kind of stuff you need to understand if you want to write a benchmark that can actually tell anyone anything (i.e. that's better than nothing). :P IPC = instructions per clock, i.e. clock-for-clock performance improvement between different CPUs, i.e. normalized for clock speed differences. IIRC = If I Recall Correctly. SIMD = https://en.wikipedia.org/wiki/SIMD. FP = floating point. – Peter Cordes Jul 23 '20 at 14:27
  • If you want to write an integer division benchmark, that's fine, just make sure you realize that's what you're doing and describe it as such. And go read about why division is special compared to other kinds of operations that CPUs can do: [Why is division more expensive than multiplication?](https://stackoverflow.com/q/15745819) / [Floating point division vs floating point multiplication](https://stackoverflow.com/q/4125033) / [How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?](https://stackoverflow.com/q/54642663) – Peter Cordes Jul 23 '20 at 14:30
  • [Why does GCC use multiplication by a strange number for integer div?](//stackoverflow.com/q/41183935). [Why is `__int128_t` faster than long long on x86-64 GCC?](//stackoverflow.com/a/63034921) is another hacky benchmark dependent on division performance. Also [Idiomatic way of performance evaluation?](//stackoverflow.com/q/60291987) for general benchmarking. Also [Modern x86 cost model](https://stackoverflow.com/q/9957004) - CPUs run machine code, not C directly of course. You can get pretty low level with this e.g. see Agner Fog's https://agner.org/optimize/ optimizing C++ and asm guides – Peter Cordes Jul 23 '20 at 14:34
  • Especially [Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux](https://stackoverflow.com/a/52558274) (and the point about type size of `unsigned long` on Windows vs. Linux) is relevant to your benchmark in particular. – Peter Cordes Jul 23 '20 at 14:36
  • BTW, making `count` a double is not better. Just make everything pure integer; some minimal floating point won't be the bottleneck on any real CPUs because of out-of-order execution so it's just complicated for no reason. It's not "Better than nothing" - you need to have a reason for putting something into your benchmark. A poorly designed one is not worth the electricity it takes to run, let alone the time it takes to think about what the results might mean. – Peter Cordes Jul 23 '20 at 14:42
  • I'm not trying to crap on your beginner efforts, I'm trying to make it clear that **benchmarking is hard**, and *micro*-benchmarking a single artificial loop is much harder to make meaningful than e.g. running an existing full program on some data-set. It's 100% fine to try to write one, but you should expect to mostly learn why it's hard and what the complications are, not end up with something really useful that tells you anything really meaningful about different systems. Although measuring 32-bit integer division performance across different CPUs is something you could do. – Peter Cordes Jul 23 '20 at 14:42
  • @PeterCordes I ran a load of 1 (10^7) on an old Samsung S3 Neo I had lying around and the performance was tearful... 190.8 seconds... I tried running 5 loops but after 30 mins I got nothing, so I gave up. The system runs Android 4.4 and has no real lag or slowness even in YouTube and Spotify, so I can now clearly understand that division is exceptional! – A P Jo Jul 23 '20 at 15:08
  • @JakeFry: Keep learning on your own by experimenting and improving your code. Don't hesitate to post questions here when you have problems or surprising results. You might want to get a github account to keep track of your code as well as share it with others, it is more effective than Google docs. Programming is a demanding passion, but one that can last a lifetime... Enjoy the trip! – chqrlie Jul 23 '20 at 18:15