I'm trying to write a function that can find all the prime numbers below some very large number by using the Sieve of Eratosthenes. I have written the function:

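import math

def primes(limit):
    # Find all primes up to limit by striking multiples out of a set of odd numbers.
    # Candidates: every odd number from 1 up to limit.
    l = set(range(1, limit + 1, 2))
    s = int(math.sqrt(limit))
    # Recurse until the square root is small.
    if s <= 1000:
        ps = smallprimes(s)
    else:
        ps = primes(s)

    # Strike out every multiple of p except p itself.
    for p in ps:
        l -= set(multiples(p, limit + 1)[1:])
    # Sets are unordered, so sort before dropping 1 and prepending 2.
    return [2] + sorted(l)[1:]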

where `smallprimes` calculates the primes below a limit simply by counting each number's factors, and `multiples` returns all multiples of a number below a limit.
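The two helpers are not shown above. For reference, they could look something like the following. This is a hypothetical sketch reconstructed from the description, not the original code: it assumes `multiples` starts at the number itself (which is why `primes` drops the first element) and that `smallprimes` includes the limit when the limit is prime.

```python
def multiples(n, limit):
    """Return the multiples of n that are strictly below limit, starting at n.

    Hypothetical helper: the start value and exclusive bound are assumptions.
    """
    return list(range(n, limit, n))


def smallprimes(limit):
    """Return the primes up to and including limit by counting divisors.

    Hypothetical helper: a number is prime when it has exactly two
    divisors (1 and itself). Slow, but acceptable for small limits.
    """
    result = []
    for n in range(2, limit + 1):
        factors = sum(1 for d in range(1, n + 1) if n % d == 0)
        if factors == 2:
            result.append(n)
    return result
```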

With very large limits passed to `primes`, I create large sets to "strike out" the multiples of all the primes below the square root of `limit`.

Is there a more efficient way to "strike out" numbers from a sequence than by using sets? I am wondering because I really only need to subtract two arrays, I don't need prevention of duplicates, etc.

Luke Taylor
  • Side note: Python is not well suited to efficient arithmetic. I'd suggest using either an external library or a different language. – freakish Jan 12 '16 at 19:09
  • That doesn't look like the Sieve of Eratosthenes to me. Also take a look at this post, which may be what you are looking for: [how-to-implement-an-efficient-infinite-generator-of-prime-numbers-in-python](http://stackoverflow.com/questions/2211990/how-to-implement-an-efficient-infinite-generator-of-prime-numbers-in-python) – Copperfield Jan 12 '16 at 19:12
  • It is the Sieve of Eratosthenes: I calculate the square root, then subtract the set of all multiples of all primes below the limit. I'll look at that link, though. – Luke Taylor Jan 12 '16 at 19:14
  • Whether or not you calculate the square root is not the problem I see with your algorithm; the problem is that you are using recursion, which you don't need to implement this sieve. Anyway, what you need is the erat3 algorithm in the link I gave you. With a few modifications it gives you all primes up to N and uses a minimal amount of memory, proportional to the number of primes up to the square root of N. – Copperfield Jan 12 '16 at 19:27
  • Ok… I was trying to use recursion to increase efficiency but I'll try erat3 – Luke Taylor Jan 12 '16 at 19:28
  • @LukeTaylor Yeah, numpy is good. The underlying arithmetic is written in C so you are good to go. – freakish Jan 12 '16 at 19:29

1 Answer

Using a set will have problems for large data sets as the number of hash collisions will go up significantly, on top of which you incur unnecessary storage overhead.

An alternative solution is to use a NumPy boolean mask array. The index in the array is the number, and the value indicates whether or not it is prime. You can further optimize by letting index `i` represent the number `2 * i + 1`, so that you only store odd numbers.
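A minimal sketch of both variants, assuming NumPy is installed. The function names `primes_mask` and `primes_odd_mask` are made up for illustration, and both treat `limit` as inclusive, which is an assumption on my part.

```python
import math

import numpy as np


def primes_mask(limit):
    """Return all primes <= limit; index n of the mask stands for the number n."""
    if limit < 2:
        return []
    is_prime = np.ones(limit + 1, dtype=bool)
    is_prime[:2] = False  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Strike out p*p, p*p + p, ... with a single slice assignment.
            is_prime[p * p::p] = False
    return np.flatnonzero(is_prime).tolist()


def primes_odd_mask(limit):
    """Same idea, but index i stands for the odd number 2*i + 1."""
    if limit < 2:
        return []
    size = (limit + 1) // 2  # how many odd numbers are <= limit
    is_prime = np.ones(size, dtype=bool)
    is_prime[0] = False  # index 0 is the number 1
    for i in range(1, (math.isqrt(limit) - 1) // 2 + 1):
        if is_prime[i]:
            p = 2 * i + 1
            # p*p sits at index 2*i*(i + 1); consecutive odd multiples
            # of p are p indices apart.
            is_prime[2 * i * (i + 1)::p] = False
    return [2] + (2 * np.flatnonzero(is_prime) + 1).tolist()
```

Striking out multiples here is a slice assignment on a flat array rather than a set difference, so no intermediate set of multiples is built. The odd-only variant roughly halves the memory at the cost of slightly fiddlier index arithmetic.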

This is just an example. Using sets for a large sieve will be very inefficient.

Mad Physicist
  • "Using a set will have problems for large data sets as the number of hash collisions will go up significantly" - not really. Sure, there's the birthday paradox, but the number of collisions that causes isn't really a problem. Large sets run out of memory before the number of elements causes collision problems. – user2357112 supports Monica Jan 12 '16 at 19:33
  • @user2357112 That actually depends on the underlying hash function. I'm not sure what Python uses but I guess that your statement might be quite correct. – freakish Jan 12 '16 at 19:34
  • Fair enough. That does not invalidate my point that sets are probably not the best way to do this. This answer is just an example of one alternative. I actually voted to close the question as opinion based. – Mad Physicist Jan 12 '16 at 19:35
  • @freakish: That's a problem with the hash function, not a problem with large set sizes. Hash collision-based DOS attacks can be carried out with fairly small hash tables, and if your hash function causes high numbers of collisions in large sets when there isn't an attacker involved, then you just have a *really* crappy hash function. – user2357112 supports Monica Jan 12 '16 at 19:40