Explanation needed for sum of prime below n numbers

Question

Today I solved a problem given in Project Euler, it's problem number 10 and it took 7 hrs for my python program to show the result. But in that forum itself a person named lassevk posted solution for this and it took only 4 sec. And its not possible for me to post this question in that forum because its not a discussion forum. So, think about this if you want to mark this question as non-constructive.

marked = [0] * 2000000
value = 3
s = 2
while value < 2000000:
    if marked[value] == 0:
        s += value
        i = value
        while i < 2000000:
            marked[i] = 1
            i += value
    value += 2
print s

If anyone understands this code please explain it as simple as possible.

This is my code which took 7 hrs to compute (I think I also used the same logic of Sieve of Eratosthenes technique which was mentioned in answers below):

import time
start = time.clock()

total = 0
limit = 2000000
KnownPrime = set([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
                  53, 59, 61, 67, 71])
KnownPrime.update(set([73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 
                       131, 137, 139, 149, 151, 157, 163, 167, 173]))
suspected = set(range(2, limit+1)) # list of suspected prime numbers
for p in KnownPrime:
    if p <= limit:
        total += p
        suspected.difference_update(set(range(p, limit+1, p)))

for i in suspected:
    k = i/2
    if k % 2 == 0: k += 1
    PrimeCheck = set(range(k, 2, -2))
    PrimeCheck.difference_update(KnownPrime)
    for j in PrimeCheck:
        if i % j == 0:
            break
    if i % j:
        total += i

print time.clock() - start
print total

So, can anyone tell me why did it took that much time.

Finally I did it here's my refactored code. Now it can show result with in 2 sec.

import math
import __builtin__

sum = __builtin__.sum

def is_prime(num):
    if num < 2: return False
    if num == 2: return True
    if num % 2 == 0: return False
    for i in range(3, int(math.sqrt(num)) + 1, 2):
        if num % i == 0: return False
    return True

def sum_prime(num):
    if num < 2: return 0
    sum_p = 2
    core_primes = []
    suspected = set(range(3, num + 1, 2))
    for i in range(3, int(math.sqrt(num)) + 1, 2):
        if is_prime(i): core_primes.append(i)
    for p in core_primes:
        sum_p += p
        suspected.difference_update(set(range(p, num + 1, p)))
    return sum(suspected) + sum_p

print sum_prime(2000000)

And here is the visualization for that.

The Project Euler question is [here](http://projecteuler.net/problem=10): it asks for the sum of all primes under 2 million. — rmunn, Jun 25 '13 at 04:13
I don't 100% undestand this, but that is a very, very smart solution. All I can tell you is that `value` is 3 because that is the lowest odd prime number, and `s = 2` because we can't forget that 2 is a prime number — TerryA, Jun 25 '13 at 04:13
You can replace the inner `while` loop with `marked[i::i] = [1]*((2000000-1)/i)` for a nice speedup — John La Rooy, Jun 25 '13 at 04:53
You asked why your solution took so much time: it's because you ended up writing an O(N^2) algorithm. (See [here](http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/) if you're not familiar with Big O notation). The `difference_update()` method of sets runs in O(N) time and you're calling it O(N) times, thus an O(N^2) total runtime. When N gets as large as 2 million, that's a LOT. — rmunn, Jun 25 '13 at 07:06
You'll encounter this kind of thing often in Project Euler problems: they're designed to make O(N^2) algorithms painfully slow to run. Given the large numbers you're dealing with, you want to make sure your solution runs in O(N) time at worst, or (as you just discovered) it can take 7 hours to complete, while someone else's solution takes just four seconds. — rmunn, Jun 25 '13 at 07:07
Your solution is *not* the Sieve of Eratosthenes. Your use of the modulo operator gives it away; that's trial division. As @rmunn points out, trial division is an O(n^2) algorithm, but the Sieve of Eratosthenes is only O(n log log n), which is nearly linear. — user448810, Jun 25 '13 at 12:42
in general you should use `for i in range(3, int(math.sqrt(n)) + 1, 2):` inside your `def is_prime(n):`. except for that, you got it! :) btw you could have used the sieve method to find the core primes list as well, instead of the trial division; but that has no impact on the overall complexity. Even your suboptimal trial division has no impact on the overall time complexity - it contributes `O((sqrt(n))^2)` (and the optimal, `O((sqrt(n))^1.5)`) to the overall `O(n log log n)` (this, if sets remove each element in `O(1)` time, as indeed can be expected of a dictionary-based data structure). — Will Ness, Jun 25 '13 at 19:10
indeed, the test http://ideone.com/4D9qru supports the above. Cheers! :) — Will Ness, Jun 25 '13 at 19:11
@rmunn my above mentioned test empirically supports that `a.difference_update(b)` runs in `O(len(b))`. So no, it does not run in `O(N)` where `N=len(a)`, which is needed to make the overall time quadratic. Sets perform nicely here, and the OP's final code runs in `O(n log(log n))`, on sets. Sets weren't the problem there. — Will Ness, Jun 25 '13 at 19:20
more than 10 edits turn your question (and all answrs) into community-wiki; up-votes give no reputation then. — Will Ness, Jun 26 '13 at 12:42
@WillNess ohh.. but.. it was edited 12 times till now any way I will follow you advice and thanks again :) — bkmagnetron, Jun 26 '13 at 17:49
you're welcome. anyway, *why* your original code was so slow (since nobody answered this): `k = i/2` is the culprit. Changing it to `k = int( math.sqrt(i) + 1 )` should bring down the run time [into the minutes range](http://ideone.com/EukfQP). And yes, as you probably know by now, the initial removal of 40 primes and their multiples was done by the method of the sieve of Er's. — Will Ness, Jun 26 '13 at 20:24
@WillNess oh.. yes.. so I really made it on my own (without knowing about sieve) except that sqrt(). :D — bkmagnetron, Jun 27 '13 at 15:55
no, in the original code you start with 40 sieve steps, but then you stop, and switch onto very suboptimal trial division. changing `n/2` to `sqrt(n)` makes it less terribly inefficient TD, but still suboptimal - for one thing, the testing is done in wrong direction; and you trial divide by odds, not by primes. Your last version is a sieve of Er's, with initial TD up to a sqrt. BTW can you make this initial step with the sieve of Er's, too? Can you do it without (almost) writing new code? :) — Will Ness, Jun 27 '13 at 20:26
@WillNess I think I can only do that by using is_prime() function — bkmagnetron, Jun 28 '13 at 12:32
http://ideone.com/c566K7 : change your `sum_prime(num)` to collect primes into a list, instead of summing them up. Then, use it as `list_primes( int(math.sqrt(num))+1 )` to calculate the "core" primes for itself. :) — Will Ness, Jun 28 '13 at 16:29
@WillNess :O `print sum(prime_list3x( 2000000 )) + 2` Can I know you experience in programming? Is the experience alone can give this much of knowledge about program or did I have to do any extra stuffs for being a programmer like you? How can I learn to analyse the code like you did? — bkmagnetron, Jun 28 '13 at 18:49
More than a few years. :) I also worked with Lisp; tried out Scheme, Prolog... The SICP book is a fine read, The Art of Prolog... But I'm really a mediocre programmer, that's all pretty elementary stuff; recursion is a basic technique. Elementary primes algorithms like sieve or TD is pretty basic stuff too... Happy trails! :) — Will Ness, Jun 29 '13 at 07:34

score 4 · Accepted Answer · edited May 23 '17 at 12:08

Question:

Find the sum of all the primes below two million.

It's a simple sieve. You should read about it but the general idea is to iterate over each number - if the value at index number is 0, it's prime and you mark off multiples of that number (since all those multiples must not be prime). Ignore if it's 1 (composite). I'll provide some comments to explain what this code in particular is doing,

marked = [0] * 2000000     # <- just set up the list
value = 3                  # <- starting at 3 then checking only odds
s = 2                      # <- BUT include 2 since its the only even prime
while value < 2000000:
    if marked[value] == 0: # <- if number at index value is 0 it's prime
        s += value         #    so add value to s (the sum)
        i = value          # <- now mark off future numbers that are multiples of
        while i < 2000000: #    value up until 2mil
            marked[i] = 1  # <- number @ index i is a multiple of value so mark
            i += value     # <- increment value each time (looking for multiples)
    value += 2             # <- only check every odd number
print s

Two optimizations for this code:

The initial value of i could be set to value*value == value**2
Could easily change this to use a list of length 1 million since we already know no evens are primes

EDIT:

While I hope my answer helps explain operations of sieves for future visitors, if you are looking for a very fast sieve implementation please refer to this question. Great performance analysis by unutbu and some excellent algorithms posted by Robert William Hanks!

Can you please check my code and tell me whether its like sieve or not. — bkmagnetron, Jun 25 '13 at 04:38
Thanks, I re-factored my code and now it can show result within in 2 sec which was efficient than lassevk's program :D. — bkmagnetron, Jun 25 '13 at 13:47

score 1 · Answer 2 · answered Jun 25 '13 at 04:13

The code is basically using the Sieve of Eratosthenes to find primes, which might be clearer once you take out the code that keeps track of the sum:

marked = [0] * 2000000
value = 3
while value < 2000000:
    if marked[value] == 0:
        i = value
        while i < 2000000:
            marked[i] = 1
            i += value
    value += 2

value ticks up by 2 (since you know that all even numbers above 2 aren't prime, you can just skip over them) and any value that hasn't already been marked by the time you reach it is prime, since you've marked all the multiples of the values below it.

score 1 · Answer 3 · answered Jun 25 '13 at 04:17

This code basically sums up all the primes < 2000000 using the Sieve of Eratosthenes concept:

marked is a huge array full of zeroes.

Each time the value is less than 2000000, check if the value in the array marked is flagged. Flagging can be considered as marking the array position as 1. For eg, if you want to flag a value, you mark the position of that value in your marked array as 1 (rest are all zeroes).

Next, you set i to that value(i is the value you use for your while loop). While i is less than 2000000, flag the marked array for that particular value. Then increment i by that value. This is done so that: If you flag all the multiples of 2, you don't need to reiterate through all of them in the next iteration. For eg. if you flag all the multiples of 2, next step you can start for 3 from 3^2 = 9, since all the smaller multiples would already be flagged by then.

Refer Sieve of Eratosthenes and this video for further details.

score 0 · Answer 4 · answered Jun 25 '13 at 04:15

0

This answer is using the Sieve of Erastothenes approach to mark non-primes (that's what the marked list is for) and then going through and only adding values that haven't been marked. See the Wikipedia article for more details.

answered Jun 25 '13 at 04:15

rmunn

29,094
9
62
91

hey check my code and tell me is it like Sieve of Erastothenes or not. – bkmagnetron Jun 25 '13 at 04:43

Explanation needed for sum of prime below n numbers

4 Answers4