
I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.

I might be wrong, but I believe `list.remove(n)` iterates over the list, comparing each element with n, meaning that the following code runs in O(n^2) time.

# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)

Based on my assumption (which could be wrong, and please confirm whether or not this is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over each list once, for O(n) time. However, it is not at all pythonic or clean.

i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1

Is there perhaps a built in function or simpler way of doing this? And what is the fastest way?

Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.

For anyone interested, the full function is below:

import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + list(range(3, limit+1, 2))
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
David C
  • A single call to `primes.remove` runs in `O(n)` time, so your clean 2nd solution also runs in `O(n^2)` time, not any faster than the 1st one. It's possible to do it in `O(n)` time, similarly to your 2nd solution, by simultaneously iterating over both lists (with loop variables `i` and `j`, incrementing only one of them at a time), but building a separate output list. – pts Sep 23 '14 at 22:07
  • Sorry, I meant to change `primes.remove()` to `primes = primes[:i] + primes[i+1:]` – David C Sep 23 '14 at 22:11
  • Take a look at [Robert William Hank's solution](http://stackoverflow.com/a/3035188/190597). He uses a boolean list and sets the elements to False when it is determined that (roughly) the index of that element is not a prime. – unutbu Sep 23 '14 at 22:12
  • @sharkbyte: Replacing it with `primes = primes[:i] + primes[i+1:]` doesn't make it any faster, each of it also takes `O(n)` time, totaling `O(n^2)`. – pts Sep 23 '14 at 22:13
  • I didn't know splicing still takes O(n) time. I will have to look into that. – David C Sep 23 '14 at 22:17
  • 1
    @sharkbyte: Don't think of it as "splicing". That's an operation on linked lists, which takes constant time (if you have all the right pointers, directly or via double links); arrays (like Python's `list`) can't do that, so they can only simulate it by shifting everything out to create a gap an then copying in the "spliced" list. – abarnert Sep 23 '14 at 22:32

2 Answers


This is quite fast and clean: it does O(n) set membership checks, and it runs in O(n) amortized time (the first line is O(n) amortized; the second line is O(n * 1) amortized, because each membership check is O(1) amortized):

removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
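For instance, with small made-up sample lists (the values below are mine, not from the question):

```python
primes = [2, 3, 5, 7, 9, 11, 13, 15]   # sample data, sorted
removable = [9, 15]                     # sample data, sorted

# Convert once, then filter with O(1) average-case membership checks.
removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
print(primes)  # → [2, 3, 5, 7, 11, 13]
```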

Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):

tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp
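Traced on small made-up inputs (the sample values are mine), the merge-style walk looks like this:

```python
primes = [2, 3, 5, 7, 9, 11, 13, 15]   # sample data, sorted
removable = [9, 15]                     # sample data, sorted

tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])   # survivor: copy to the output prefix
        i += 1
    elif primes[i] == removable[j]:
        i += 1                  # composite: skip it
    else:
        j += 1                  # removable value not present; advance
primes[:i] = tmp                # splice the filtered prefix back in
print(primes)  # → [2, 3, 5, 7, 11, 13]
```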

Please note that constants also matter. The Python interpreter is quite slow (i.e. it has a large constant) at executing Python code. The 2nd solution contains lots of Python code, so for small practical values of n it can indeed be slower than the solution with sets, because the set operations are implemented in C and are therefore fast (i.e. they have a small constant).

If you have multiple working solutions, run them on typical input sizes and measure the time. You may be surprised by their relative speed; it is often not what you would predict.
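For example, a minimal timing harness along these lines (the sizes, seed, and variable names are my own illustration, not from the answer):

```python
import timeit

# Build sorted test lists once, inside timeit's setup string.
setup = (
    "import random\n"
    "random.seed(0)\n"
    "primes = sorted(random.sample(range(10**6), 10**4))\n"
    "removable = sorted(random.sample(primes, 10**3))\n"
)

# The set-based solution from above, as the statement to time.
stmt = (
    "removable_set = set(removable)\n"
    "result = [p for p in primes if p not in removable_set]\n"
)

elapsed = timeit.timeit(stmt, setup=setup, number=100)
print(elapsed)  # total seconds for 100 runs; varies by machine
```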

pts
  • Could you provide a little more explanation about the run time of changing a list to a set and set membership checks? – David C Sep 23 '14 at 22:15
  • 2
    @sharkbyte: Python sets are implemented using hashtables: operation are fast on average, but some unlucky operations get slow. Read the Wikipedia articles about hashtables to get a better idea about the time complexity. In the typical lucky case conversion is `O(n)`, and each membership check is `O(1)`. Worst case is slower. – pts Sep 23 '14 at 22:17
  • Thanks for clarifying that. I will take a look at the wiki for hashtables. – David C Sep 23 '14 at 22:21
  • @sharkbyte: I've extended and clarified my answer. – pts Sep 23 '14 at 22:27
  • @pts: You should include the average-case `O(N)` set conversion plus the amortized average-case `O(1)` per set membership check in your answer, because that's what explains why the whole thing is `O(N)` (it's `O(N + N*1)`, which is obviously `O(N)`). – abarnert Sep 23 '14 at 22:29

The most important thing here is to remove the quadratic behavior, which you have for two reasons.

First, calling remove searches the entire list for the value to remove. That takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).

Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).


How can you avoid these?

For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.

For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.
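The "second pass" idea can be sketched roughly like this (the sample data and names are mine; the set stands in for the constant-time lookups mentioned above):

```python
primes = [2, 3, 5, 7, 9, 11, 13, 15]   # sample data, mine
removable = [9, 15]

# Pass 1: mark which positions survive, using O(1) average-case set lookups.
removable_set = set(removable)
keep = [p not in removable_set for p in primes]

# Pass 2: compact in one sweep; each survivor is moved at most once.
primes = [p for p, k in zip(primes, keep) if k]
print(primes)  # → [2, 3, 5, 7, 11, 13]
```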

So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.

The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:

def sorted_subtract(seq1, seq2):
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2):
            # seq2 is exhausted; everything left in seq1 survives
            yield from seq1[i1:]
            return
        if seq1[i1] < seq2[i2]:
            yield seq1[i1]
            i1 += 1
        elif seq1[i1] == seq2[i2]:
            i1 += 1
        else:
            i2 += 1
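It can be used like this (the generator is restated so the snippet runs on its own; the sample lists are mine):

```python
def sorted_subtract(seq1, seq2):
    # Yield elements of sorted seq1 that do not appear in sorted seq2,
    # advancing one of the two indices on each step, as in a merge sort.
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2) or seq1[i1] < seq2[i2]:
            yield seq1[i1]
            i1 += 1
        elif seq1[i1] == seq2[i2]:
            i1 += 1
        else:
            i2 += 1

primes = [2, 3, 5, 7, 9, 11, 13, 15]   # sample data, sorted
removable = [9, 15]
primes = list(sorted_subtract(primes, removable))
print(primes)  # → [2, 3, 5, 7, 11, 13]
```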
abarnert