
I need to compare strings containing MAC addresses (e.g. "11:22:33:AA:BB:CC") using Python 2.7. At present, I have a preconfigured set containing the MAC addresses, and my script iterates through the set, comparing each new MAC address to those already stored. This works well, but as the set grows the script slows down dramatically. Even with only 100 or so entries, the difference is noticeable.

Does anybody have any advice on speeding up this process? Is storing them in a set the best way to compare or is it better to store them in a CSV / DB for example?

Sample of the code...

def Detect(p): 
    stamgmtstypes = (0,2,4)
    if p.haslayer(Dot11):
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients: 
                # This is the set
                with location_mutex:
                    detection = p.addr2 + "\t" + str(datetime.now())
                    print type(p.addr2)
                    print detection, last_location
                    observedclients.append(p.addr2)
thefragileomen
  • Apologies Avasal - I meant 'set' - original post amended. Sorry! – thefragileomen Oct 21 '11 at 10:56
  • You can convert the list to a set; this will eliminate the duplicates. (I deleted my original comment by mistake.) – avasal Oct 21 '11 at 10:58
  • Is `observedclients` the one that is supposed to be a set? Judging from what you show, it is a list, not a set. Sets don't have an `.append` method; you add to a set with `.add`. – Avaris Oct 21 '11 at 11:15
  • observedclients is indeed the set (or so I believed). I have it declared as "observedclients = []" at the beginning of my code. This may be where my problem is then? I am now assuming that I should have declared my set as "observed = []" and then "observedclients = set(observed)"? – thefragileomen Oct 21 '11 at 11:20
  • Well, `[]` creates a list. You need `set()` to create a set, and you add to it with `observedclients.add(p.addr2)`. It might improve performance. – Avaris Oct 21 '11 at 11:27
  • Have you checked the list/set `__contains__(element)` method? – avasal Oct 21 '11 at 11:42

3 Answers


First, you need to profile your code to understand where exactly the bottleneck is...
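As a rough illustration of profiling, here is a minimal sketch using the standard library's cProfile and pstats modules. The helper names (make_macs, count_known, profile_call) and the generated addresses are made up for this example and are not from the question's script; the lookup deliberately runs against a list so that the cost of the membership test shows up in the report.

```python
import cProfile
import pstats


def make_macs(count):
    """Build `count` distinct MAC-like strings (hypothetical test data)."""
    return ["00:00:00:00:%02X:%02X" % (i // 256, i % 256) for i in range(count)]


def count_known(candidates, known):
    """Count how many candidates already appear in `known`."""
    return sum(1 for mac in candidates if mac in known)


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, pstats.Stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    return result, pstats.Stats(profiler)


macs = make_macs(2000)

# `known` is a list here, so every `in` test is a linear scan.
result, stats = profile_call(count_known, macs, macs)

# Show the five most expensive entries by cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

Passing set(macs) as the second argument instead gives a comparable report for the set-based version, which makes it easier to see whether the lookup is really where the time goes.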

Also, as a generic recommendation, consider psyco, although there are a few times when psyco doesn't help

Once you find a bottleneck, cython may be useful, but you need to be sure that you declare all your variables in the cython source.

Mike Pennington
  • Thanks Mike. I assumed that the bottleneck was simply the size of the list of MAC addresses. As said, it works great with only a handful of MAC addresses in but as the list grows in size, this is when the script slows down. I suppose a rule for me to bear in mind - never make assumptions! – thefragileomen Oct 21 '11 at 10:55
  • @thefragileomen, how long is your script? – Mike Pennington Oct 21 '11 at 10:59
  • Mike, it's only around 100 lines long with 5 functions. The function that I think is causing the problems is only a few lines long but it does have a few nested if functions. I've edited my main question with it contained within – thefragileomen Oct 21 '11 at 11:02
  • @thefragileomen, it looks like you're using `scapy`, right? `scapy` is pure Python, and gets pretty slow as a result... some other packet-slinging libraries (such as [pcs](http://pcs.sourceforge.net/)) offload work with `cython` for you... so this *may* help in your situation... I can't say for sure – Mike Pennington Oct 21 '11 at 11:07

Try using a set. To create an empty set, use set(), not [] (the latter creates an empty list).

A membership test on a list has O(n) complexity: each lookup may scan the whole list, so it gets slower as the list grows. That is what is happening in your case.

A membership test on a set has O(1) complexity on average.

http://wiki.python.org/moin/TimeComplexity

Also, you will need to change part of your code. Sets have no append method, so you will need to use something like observedclients.add(address) instead.
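A minimal sketch of the set-based version, stripped of the scapy parts so it runs on its own. The function name record_client and the variable observed_clients are illustrative; in the question's script the address would come from p.addr2.

```python
from datetime import datetime

# set() creates an empty set; [] would create a list.
observed_clients = set()


def record_client(mac, seen=observed_clients):
    """Return a detection line the first time `mac` is seen, else None."""
    if mac in seen:      # hash lookup, O(1) on average
        return None
    seen.add(mac)        # sets use add(); they have no append()
    return mac + "\t" + str(datetime.now())


first = record_client("11:22:33:AA:BB:CC")   # new address: detection line
again = record_client("11:22:33:AA:BB:CC")   # already seen: None
```

Only the first call produces a detection line; the second finds the address already in the set and returns None.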

ovgolovin
  • ovgolovin, apologies - error in my original post. I meant 'set' as opposed to 'list'. I'm already using a set and NOT a list but still experiencing the same problem. – thefragileomen Oct 21 '11 at 10:57
  • @thefragileomen Very strange, because if the size of the set grows, it shouldn't increase the difficulty of the algorithm (since the complexity is `O(1)` and it's independent of the size of the `set`). – ovgolovin Oct 21 '11 at 11:01

The post mentions "the script iterates through the set comparing each new MAC address to those in the list."

To take full advantage of sets, don't loop over them doing one-by-one comparisons. Instead use set operations like union(), intersection(), and difference():

s = set(list_of_strings_containing_mac_addresses)
t = set(preconfigured_set_of_mac_addresses)
print s - t, 'addresses in the list but not preconfigured'
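For a concrete picture of those operations, here is a small self-contained sketch. The addresses and the variable names seen_now and preconfigured are invented for illustration.

```python
# Addresses observed in the current batch (made-up values).
seen_now = set([
    "11:22:33:AA:BB:CC",
    "11:22:33:AA:BB:DD",
    "11:22:33:AA:BB:EE",
])

# Addresses that were configured in advance (made-up values).
preconfigured = set([
    "11:22:33:AA:BB:CC",
    "11:22:33:AA:BB:FF",
])

new_addresses = seen_now - preconfigured   # difference: seen but not preconfigured
already_known = seen_now & preconfigured   # intersection: present in both
everything = seen_now | preconfigured      # union: every address from either set
```

Each operation handles the whole batch in one step, so no explicit loop over the preconfigured set is needed.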
Raymond Hettinger