
What is the fastest way in terms of CPU to get the first prime numbers up to a limit?

Ilya Gazman

2 Answers


It takes 4.8 s on my machine to find all the primes up to 1,000,000,000. It's even faster than reading them from a file!

    public static void main(String[] args) {
        long startingTime = System.nanoTime();
        
        int limit = 1_000_000_000;
        BitSet bitSet = findPrimes(limit);

        System.out.println(2);
        System.out.println(3);
        System.out.println(5);
        System.out.println(7);
        limit = limit / 2 + 1;
        for (int i = 6; i < limit; i++) {
            if (!bitSet.get(i)) {
                int p = i * 2 - 1;
                //System.out.println(p);
            }
        }

        System.out.println("Done in " + (System.nanoTime() - startingTime) / 1_000_000 + " milliseconds");
    }

    public static BitSet findPrimes(int limit) {
        BitSet bitSet = new BitSet(limit / 2 + 1);
        int size = (int) Math.sqrt(limit);
        size += size % 2;
        for (int prime = 3; prime < size; prime += 2) {
            if (bitSet.get(prime / 2 + 1)) {
                continue;
            }
            // Start marking at prime * prime: every smaller odd multiple of
            // this prime has a smaller prime factor and was already marked.
            // The index of an odd number m is (m + 1) / 2.
            for (int i = (prime * prime) / 2 + 1; i < limit / 2 + 1; i += prime) {
                bitSet.set(i);
            }
        }

        return bitSet;
    }

The most efficient way of computing prime values is to use a Sieve of Eratosthenes, devised by the ancient Greeks millennia ago.

The problem, however, is that it requires a lot of memory. My Java app simply crashes when I declare a boolean array of 1,000,000,000 entries.

So I made two memory optimizations to make it work:

  • Since all the primes after 2 are odd, I halve the array size by mapping index i to the odd number int p = i * 2 - 1;
  • Then, instead of an array of booleans, I use a BitSet, which packs the flags with bit operations on an array of longs.

With those two optimizations, the Sieve of Eratosthenes computes the BitSet in 4.8 seconds. As for printing it out, it's another story ;)
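To make the index arithmetic concrete, here is a tiny sketch of the odd-only mapping described above (the helper names are mine, not part of the answer's code): bit i stands for the odd number 2i - 1, so the odd number p lives at bit (p + 1) / 2.

```java
public class OddIndexMapping {
    // Odd number represented by bit index i
    static int valueAt(int i) { return 2 * i - 1; }

    // Bit index holding the odd number p
    static int indexOf(int p) { return (p + 1) / 2; }

    public static void main(String[] args) {
        // Round-trip check over the first few odd numbers
        for (int p = 1; p < 100; p += 2) {
            if (valueAt(indexOf(p)) != p) throw new AssertionError("round trip failed for " + p);
        }
        System.out.println(indexOf(11)); // prints 6 -- which is why the main loop starts at i = 6
    }
}
```

This is also why 2 needs special handling: it is the only even prime, and the mapping simply has no bit for it.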

Ilya Gazman
  • I'm just curious, what does bitset do actually? I'm looking to implement this in python, because it's awesome, i just need to more info on what math operation bitset is doing. Thanks in advance – oppressionslayer Aug 26 '20 at 23:30
  • not much memory needed at all. use segmented sieve. each segment has the size of sqrt(N). – Will Ness Nov 04 '20 at 10:19
  • not 1B [prime] numbers, but primes up to 1B. that's under 51M primes. – Will Ness Nov 04 '20 at 11:46
  • @WillNess I would love to see you implementing it. I find it hard to believe it will have the same performance even if theoretically it should. – Ilya Gazman Nov 04 '20 at 14:19
  • theoretically it must be substantially faster because cache coherence, when the segment size fits into L1 (or L2 even, dunno). https://stackoverflow.com/a/10249801/849891 , https://stackoverflow.com/search?q=user%3A849891+offset+sieve , https://stackoverflow.com/search?tab=votes&q=user%3a849891%20sieve (btw, simple non-segmented sieve in C++: https://ideone.com/fapob) – Will Ness Nov 04 '20 at 14:37
  • @WillNess I believe it when I see the benchmarks. To even compare the two, one must restrict the problem size in a way both algorithms can execute without running out of memory. It leaves you with fairly fast executions. At those speeds, it's hard to predict all the moving parts. So I don't think that cache coherence is enough to sell your argument. – Ilya Gazman Nov 04 '20 at 14:44
  • it's not *my* argument, I'm giving you the accepted wisdom here. What I wrote I've read in all kinds of places all over. It also makes perfect sense to me. You don't want to believe it blindly, great!, you can (dis-)prove it to yourself if you feel like it. :) if anything, it will surely take O(sqrt(n)) *space*, which *is* its advantage over the non-segmented sieve. fair comparison is, compare sieving up to N, *same* N. *that's* fair. IOW that the segmented can reach far farther (w/out running out of memory) than the non-segmented one, is precisely its advantage (but I repeat myself) – Will Ness Nov 04 '20 at 15:20
  • but to compare the speeds, sure, sieving up to 1B ought to do it. it shouldn't change empirical orders of growth of course, but should provide a noticeable constant factor improvement. how much, I don't know, never bothered to do it myself. :) – Will Ness Nov 04 '20 at 15:21
  • err, "cache coherence" should have been "cache locality". goes to show you how big of an expert I am in these kind of things. :/ – Will Ness Nov 04 '20 at 15:33
  • [here's](https://stackoverflow.com/search?q=user%3A448810+segmented+sieve) posts on segmented sieve by an expert. – Will Ness Nov 04 '20 at 15:41
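Several comments above suggest a segmented sieve. A rough Java sketch of the idea (all names are mine; this illustrates the technique, it is not a tuned implementation): sieve the base primes up to sqrt(N) once, then sieve [low, high) one segment at a time, reusing a single buffer so memory stays O(sqrt(N)).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SegmentedSieve {
    public static long countPrimes(long n) {
        if (n < 2) return 0;
        int root = (int) Math.sqrt(n);

        // 1) ordinary sieve up to sqrt(n) to collect the base primes
        BitSet composite = new BitSet(root + 1);
        List<Integer> basePrimes = new ArrayList<>();
        for (int p = 2; p <= root; p++) {
            if (!composite.get(p)) {
                basePrimes.add(p);
                for (long m = (long) p * p; m <= root; m += p) composite.set((int) m);
            }
        }

        // 2) sieve [low, high) one segment at a time, reusing the buffer
        int segmentSize = Math.max(root, 1 << 15); // keep segments cache-friendly
        BitSet segment = new BitSet(segmentSize);
        long count = 0;
        for (long low = 2; low <= n; low += segmentSize) {
            segment.clear();
            long high = Math.min(low + segmentSize, n + 1);
            for (int p : basePrimes) {
                // first multiple of p inside the segment, but never below p*p
                long start = Math.max((long) p * p, (low + p - 1) / p * p);
                for (long m = start; m < high; m += p) segment.set((int) (m - low));
            }
            for (long i = low; i < high; i++) {
                if (!segment.get((int) (i - low))) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000)); // prints 78498
    }
}
```

Whether this beats the flat BitSet at N = 1B is exactly the benchmark question debated above; the structural point is that each segment fits in cache and the total memory is O(sqrt(N)) regardless of N.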

There are a number of ways to quickly calculate primes -- the 'fastest' ultimately will depend on many factors (combination of hardware, algorithms, data format, etc). This has already been answered here: Fastest way to list all primes below N. However, that is for Python, and the question becomes more interesting when we consider C (or other lower-level languages).


Asymptotically, Sieve of Atkin has a better computational complexity (O(N) or lower* -- see notes), but in practice the Sieve of Eratosthenes (O(N log log N)) is perfectly fine, and it has the advantage of being extremely easy to implement:

/* Calculate 'isprime[i]' to tell whether 'i' is prime, for all 0 <= i < N */
void primes(int N, bool* isprime) {
    int i, j;
    /* first, set odd numbers to prime, evens to not-prime */
    for (i = 0; i < N; ++i) isprime[i] = i % 2 == 1;
    
    /* special cases for i<3 */
    isprime[0] = isprime[1] = false;

    /* Compute square root via Newton's method
     * (this can be replaced by 'i * i <= N' in the for loop)
     */
    int sqrtN = (N >> 10) + 1024;
    for (i = 0; i < 32; ++i) sqrtN = (N / sqrtN + sqrtN) / 2;

    /* iterate through all odd numbers */
    isprime[2] = true;
    for (i = 3; i <= sqrtN /* or: i*i <= N */; i += 2) {
        /* check if this odd number is still prime (i.e. if it has not been marked off) */
        if (isprime[i]) {

            /* mark off the odd multiples of the prime, starting at 3*i
             * (even multiples are already non-prime) */
            for (j = 3 * i; j < N; j += 2 * i) {
                isprime[j] = false;
            }
        }
    }
}

That code is pretty clear and concise, but takes around ~6 s to compute primes < 1_000_000_000 (1 billion), and uses N * sizeof(bool) == N == 1000000000 bytes of memory. Not great.

We can use bit-fiddling and the uint64_t type (from #include <stdint.h>) to produce code with the same complexity, but hopefully much faster and using less memory. Right off the bat we can reduce the memory to N/8 bytes (1 bit per boolean instead of 1 byte). Additionally, we can track only odd numbers and shave off another factor of 2, taking the final usage to N/16 bytes.

Here's the code:


/* Calculate the ((i/2) % 64)'th bit of 'isprime[(i/2) / 64]' to tell whether 'i' is prime,
 *   for all 0 <= i < N with 'i % 2 == 1'
 * 
 * The array 'isprime' can be seen as one long bitstring:
 * 
 * isprime:
 * [0]       [1]          ...
 * +---------+---------+
 * |01234....|01234....|  ...
 * +---------+---------+
 * 
 * Where each block is a 64 bit integer. Therefore, we can map the odd natural numbers to this with the following formula:
 * i -> index=((i / 2) / 64), bit=((i / 2) % 64)
 * 
 * So, '1' is located at 'primes[0][0]' (where the second index represents the bit),
 *     '3' is at 'primes[0][1]', '5' at 'primes[0][2]', and finally, '129' is at 'primes[1][0]'
 * 
 * And so forth.
 * 
 * Then, we use that value as a boolean to indicate whether the 'i' that maps to it is prime
 * 
 */
void primes_bs(int N, uint64_t* isprime) {
    uint64_t i, j, b/* block */, c/* bit */, v, ii;
    /* length of the array is roundup(N / 128) */
    int len = (N + 127) / 128;
    /* set all bits: assume every odd number is prime to start */
    for (i = 0; i < len; ++i) isprime[i] = ~0;
    
    /* set i==1 to not prime */
    isprime[0] &= ~1ULL;

    /* Compute square root via newton's algorithm */
    int sqrtN = (N >> 10) + 1024;
    for (i = 0; i < 32; ++i) sqrtN = (N / sqrtN + sqrtN) / 2;

    /* Iterate through every word/chunk and handle its bits */
    uint64_t chunk;
    for (b = 0; b <= (sqrtN + 127) / 128; ++b) {
        chunk = isprime[b];
        c = 0;
        while (chunk && c < 64) {
            if (chunk & 1ULL) {
                /* hot bit, so is prime */
                i = 128 * b + 2 * c + 1;
                /* iterate 3i, 5i, 7i, etc... 
                 * BUT, j is the index, so is basically 3i/2, 5i/2, 7i/2,
                 *   that way we don't need as many divisions per iteration,
                 *   and we can use 'j += i' (since the index only needs 'i'
                 *   added to it, whereas the value needs '2i')
                 */
                for (j = 3 * i / 2; j < N / 2; j += i) {
                    isprime[j / 64] &= ~(1ULL << (j % 64));
                }
            }

            /* chunk will not modify itself */
            c++;
            chunk = isprime[b] >> c;
        }
    }

    /* set extra bits to 0 -- unused */
    if (N % 128 != 0) isprime[len - 1] &= (1ULL << ((N / 2) % 64)) - 1;
}

This code calculates primality for all numbers less than 1000000000 (1_000_000_000) in ~2.7 s on my machine, which is a great improvement! And it uses much less memory (1/16th of what the other method uses).

However, if you're using it in other code, the interface is trickier -- a helper function makes the lookup easier:

/* Calculate if 'x' is prime from a dense bitset */
bool calc_is_prime(uint64_t* isprime, uint64_t x) {
  /**/ if (x == 2) return true; /* must be included, since 'isprime' only encodes odd numbers */
  else if (x % 2 == 0) return false;
  return (isprime[(x / 2) / 64] >> ((x / 2) % 64)) & 1ULL;
}

You could go even further and generalize this method of index mapping beyond odd-only values (https://en.wikipedia.org/wiki/Wheel_factorization), but the resulting code may well be slower -- since the inner loops and operations are no longer powers of 2, you either need a table lookup or some very clever trickery. The main case where you would want wheel factorization is when you are very memory constrained: wheel factoring with primorial(3) == 30 would allow you to store 30 numbers in every 8 bits (one bit per residue coprime to 30), as opposed to 2 numbers for every 1 bit. Or, with primorial(5) == 2310, you could store 2310 numbers in every 480 bits.

However, large wheel factorizations will, again, change some of the instructions from bit shifts and power-of-two operations into table lookups, and for large wheels this could actually end up being much slower.
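The storage figures above follow from counting the residues coprime to the wheel size -- that is how many bits one wheel cycle has to store. A throwaway sketch to verify the counts (names are mine):

```java
public class WheelResidues {
    // Number of residues modulo m that are coprime to m,
    // i.e. the bits a wheel of size m must store per cycle.
    static int coprimeResidues(int m) {
        int count = 0;
        for (int r = 0; r < m; r++) {
            if (gcd(r, m) == 1) count++;
        }
        return count;
    }

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    public static void main(String[] args) {
        System.out.println(coprimeResidues(30));   // prints 8
        System.out.println(coprimeResidues(2310)); // prints 480
    }
}
```

(The odd-only scheme used in the answer is just the primorial(1) == 2 wheel: 1 residue per 2 numbers.)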

NOTES:

  • Technically, versions of the Sieve of Atkin exist that run in O(N / (log log N)), but they no longer use the same dense array (if they did, the complexity would have to be at least O(N))
Cade Brown