22

I have to check the presence of millions of elements (20-30 letter strings) in a list containing 10-100k of those elements. Is there a faster way of doing that in Python than set()?

import sys
#load ids
ids = set( x.strip() for x in open(idfile) )

for line in sys.stdin:
    id=line.strip()
    if id in ids:
        #print fastq
        print id
        #update ids
        ids.remove( id )
Leszek
    What sort of times are you actually getting? – urschrei Aug 18 '11 at 15:51
  • 60 sec, an algorithm in c++ (using tr1/unordered_set) does the same in 18 sec... – Leszek Aug 18 '11 at 16:19
    Do you have to check sequentially? It would probably be faster to create two sets, create an intersection set, then subtract the intersection set from the set which you're checking for membership. – urschrei Aug 18 '11 at 16:25
  • This is very vague. You need to give reproducible results that people can actually run. – Glenn Maynard Aug 18 '11 at 16:31
  • If set is too slow for you, you might need to use a more optimized data structure based on the characteristics of your data. What sort of data is it, exactly? – Assaf Lavie Aug 18 '11 at 16:31
  • These are identifiers ( 20-30 letters A-Z,0-9, and +@/-: ) – Leszek Aug 18 '11 at 16:37
    I just saw your comment as to the speed -- Python is roughly 3 times slower than C++. This is actually pretty good for Python for many purposes. Have you profiled the Python code? What percent of that time is spent checking for set membership, and what percent is spent doing other things? – agf Aug 18 '11 at 17:02

4 Answers

29

set is as fast as it gets.

However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.
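A minimal sketch of that suggestion. (The ids here come from an in-memory list for illustration; the original code read them from a file with `open(idfile)`.)

```python
# Build the lookup structure once, then never mutate it.
# Note: unlike the original code, a frozenset has no .remove(),
# so this only fits the read-only case.
raw_lines = ["AB12+CD34\n", "EF56@GH78\n", "IJ90/KL12\n"]
ids = frozenset(line.strip() for line in raw_lines)

# Membership testing works exactly as with a plain set:
print("AB12+CD34" in ids)   # True: a present id
print("ZZZZZZZZZ" in ids)   # False: an absent id
```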

If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of CPython.

agf
    How is PyPy faster than CPython? – BrainStorm Aug 18 '11 at 16:03
    http://speed.pypy.org/. Basically it implements a Just-in-time compiler, just like the JavaScript engine in your browser, that can drastically speed up many types of code. It is 2-100 times faster at most things. – agf Aug 18 '11 at 16:15
    Have you profiled your code? is it the `__contains__` step that is taking most of the time? As I said in my answer, "`set` is as fast as it gets." Unless your problem is elsewhere, there is no way to speed it up in Python. – agf Aug 18 '11 at 16:18
    Using frozenset() instead of set() increased speed by 15% in my code (unrelated project). – ChaimG Jul 18 '16 at 21:23
  • @ChaimG according to this https://stackoverflow.com/questions/36555214/set-vs-frozenset-performance answer, speed should be the exact same for set vs frozenset -- curious how you tested it and how it differs? – Danny Jul 31 '20 at 15:28
11

As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:

#!/usr/bin/env python

import random

# create two million-element sets of random numbers
a = set(random.sample(xrange(10000000),1000000))
b = set(random.sample(xrange(10000000),1000000))
# a intersection b
c = a & b
# a difference c
d = list(a - c) 
print "set d is all remaining elements in a not common to a intersection b"
print "length of d is %s" % len(d)

The above runs in ~6 wallclock seconds on my five-year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster than reading it in one item at a time and then checking for set membership.
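Applied to the original problem, the bulk-intersection idea looks roughly like this. (The in-memory lists stand in for the id file and stdin; reading all of stdin up front is an assumption, and it trades memory for speed.)

```python
# Hypothetical data standing in for the id file and stdin.
id_lines = ["AAA111\n", "BBB222\n", "CCC333\n"]
query_lines = ["BBB222\n", "DDD444\n", "AAA111\n"]

ids = set(line.strip() for line in id_lines)
queries = set(line.strip() for line in query_lines)

# One bulk intersection instead of one membership test per line.
matches = ids & queries
for identifier in sorted(matches):
    print(identifier)
```

Note that, unlike the original loop, this loses the input order and collapses duplicate query lines, which may or may not matter for the FASTQ use case.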

urschrei
0

You should try to split your data to make the search faster. A tree structure would let you determine very quickly whether the data is present or not.

For example, start with a simple map that links the first letter to all the keys starting with that letter, so you don't have to search all the keys, only a smaller subset of them.

This would look like:

ids = {}
for line in open(idfile):
    # strip the trailing newline, otherwise the stored keys
    # would never match the stripped ids read from stdin
    id = line.strip()
    ids.setdefault(id[0], set()).add(id)

for line in sys.stdin:
    id = line.strip()
    # look only in the bucket for the first character
    if id in ids.get(id[0], set()):
       #print fastq
       print id
       #update ids
       ids[id[0]].remove(id)

Creation will be a bit slower, but search should be much faster (I would expect around 20 times faster, if the first character of your keys is well distributed and not always the same).

This is a first step; you could do the same thing with the second character and so on, and search would then just be walking the tree letter by letter...
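Taken all the way, that idea becomes a character trie. A hedged sketch as a dict-of-dicts with a sentinel key marking where an id ends (whether this actually beats a flat set in CPython is doubtful, as the comments below note):

```python
END = object()  # sentinel key marking "an id ends at this node"

def trie_add(root, word):
    # Walk/create one nested dict per character.
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, word):
    # Follow the path; the word is present only if it
    # ends at a node carrying the END sentinel.
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

root = {}
for identifier in ["ABC", "ABD", "XYZ"]:
    trie_add(root, identifier)

print(trie_contains(root, "ABD"))  # True
print(trie_contains(root, "AB"))   # False: prefix only, no id ends here
```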

JC Plessis
    Set access is O(1), how would a tree make it faster? – agf Aug 18 '11 at 17:23
  • Well, you seem to be right. My mistake I really learned something big today, I thought a set was a mere list without twice the same value. Do you have any url where I could find more info about that ? I can't find anything about access speed on official documentation. – JC Plessis Aug 18 '11 at 17:53
  • Just take a look at https://secure.wikimedia.org/wikipedia/en/wiki/Hash_table, that's what sets and dictionaries are. – agf Aug 18 '11 at 18:02
    @JC Plessis : have a look there for detailed Python operation complexity: http://wiki.python.org/moin/TimeComplexity – Cédric Julien Aug 18 '11 at 21:11
-2

As mentioned by urschrei, you should "vectorize" the check. It is faster to check for the presence of a million elements once (as that is done in C) than to do the check for one element a million times.

Gecko