
Quick question, mainly to satisfy my curiosity on the topic.

I am writing some large Python programs with an SQLite database backend and will be dealing with a large number of records in the future, so I need to optimize as much as I can.

For a few functions, I am searching through keys in a dictionary. I have been using the "in" keyword for prototyping and was planning on going back and optimizing those searches later, as I know the "in" keyword is generally O(n) (as this just translates to Python iterating over an entire list and comparing each element). But, as a Python dict is basically just a hash map, is the Python interpreter smart enough to interpret:

if(key in dict.keys()):
    ...code...

to:

if(dict[key] != None):
    ...code...

It is basically the same operation but the top would be O(n) and the bottom would be O(1).

It's easy for me to use the bottom version in my code, but then I was just curious and thought I would ask.

tknickman
  • I say do what is easiest, and profile later. – jh314 Jul 09 '13 at 03:28
  • Actually, the code on bottom wouldn't work. You have to do something akin to `try: dict[key]; except KeyError: pass; else: #...code...`. – Travis DePrato Jul 09 '13 at 03:29
  • @TravisGD This is a good point, I forgot about that – tknickman Jul 09 '13 at 03:33
  • @jh314 This is true, but it's always in the back of my mind – tknickman Jul 09 '13 at 03:34
  • As a side note: Don't wrap `if` conditions in unnecessary parentheses. People who read and write a lot of Python expect parentheses to mean something (a tuple, a genexp, or overriding operator precedence), and they'll have to stop and read each line twice to make sure your parens don't actually mean anything. – abarnert Jul 09 '13 at 03:41
  • Another side note: Don't name a dictionary `dict`; that hides the type and constructor of the same name, which you may well want to use later on. – abarnert Jul 09 '13 at 03:42
  • Out of curiosity: If you're using a sqlite backend, why do you have a large dict in the first place? If it's being used for caching database results, I've found that it's sometimes very hard to beat just using the database (or a second in-memory database) for that, unless you can tolerate the cache being out of date. More importantly, often the database is fast enough in the first place. Of course your mileage may vary, but definitely profile before wasting a lot of effort building something that will make your code harder to test and maintain… – abarnert Jul 09 '13 at 18:15

4 Answers


First, key in d.keys() is guaranteed to give you the same value as key in d for any dict d.

And the in operation on a dict, or on the dict_keys object you get back from calling keys() on it (in 3.x), is not O(N); it's O(1) on average (the worst case, with pathological hash collisions, is O(N)).

There's no real "optimization" going on; it's just that using the hash is the obvious way to implement __contains__ on a hash table, just as it's the obvious way to implement __getitem__.


You may ask where this is guaranteed.

Well, it's not. Mapping Types defines dict as, basically, a hash table implementation of collections.abc.Mapping. There's nothing stopping someone from creating a hash table implementation of a Mapping, but still providing O(N) searches. But it would be extra work to make such a bad implementation, so why would they?

If you really need to prove it to yourself, you can test every implementation you care about (with a profiler, or by using some type with a custom __hash__ and __eq__ that logs calls, or…), or read the source.
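
As a rough sketch of the "custom __hash__ and __eq__ that logs calls" idea, the snippet below counts calls instead of logging them. CountingKey and count_calls are hypothetical names made up for this illustration, and the exact number of comparisons is an implementation detail, so treat the counts as something to observe on your own interpreter rather than a guarantee.

```python
class CountingKey:
    """Hypothetical key type that counts how often it is hashed and compared."""

    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


def count_calls(container, probe):
    """Return (hash_calls, eq_calls) made while evaluating `probe in container`."""
    CountingKey.hash_calls = 0
    CountingKey.eq_calls = 0
    found = probe in container
    if not found:
        raise LookupError("probe was expected to be in the container")
    return CountingKey.hash_calls, CountingKey.eq_calls


d = {CountingKey(i): None for i in range(1000)}
probe = CountingKey(999)  # equal to a stored key, but a different object

dict_counts = count_calls(d, probe)              # membership test on the dict
keys_view_counts = count_calls(d.keys(), probe)  # membership test on the keys view
list_counts = count_calls(list(d), probe)        # linear scan, like 2.x keys()

print("dict:     ", dict_counts)
print("keys view:", keys_view_counts)
print("list:     ", list_counts)
```

If the dict really is a hash table, the first two lines should show one hash call and very few comparisons, while the list scan should show no hash calls and one comparison per element it has to walk past.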


In 2.x, you do not want to call keys, because that builds a list of the keys instead of a KeysView, and searching a list is O(N). You could use iterkeys, but that may return an iterator or something else whose membership test is not O(1). So, just use the dict itself.

Even in 3.x, you don't want to call keys, because there's no need to. Iterating a dict, checking its __contains__, and in general treating it like a collection is always equivalent to doing the same thing to its keys, so why bother? (And of course building the trivial KeysView, and accessing through it, are going to add a few nanoseconds to your running time and a few keystrokes to your program.)

(It's not quite clear that using sequence operations is equivalent for d.keys()/d.iterkeys() and d in 2.x. Other than performance issues, they are equivalent in every CPython, Jython, IronPython, and PyPy implementation, but it doesn't seem to be stated anywhere the way it is in 3.x. And it doesn't matter; just use key in d.)


While we're at it, note that this:

if(dict[key] != None):

… is not going to work. If the key is not in the dict, this will raise KeyError, not return None.

Also, you should never check None with == or !=; always use is.

You can do this with a try—or, more simply, do if dict.get(key, None) is not None. But again, there's no reason to do so. Also, that won't handle cases where None is a perfectly valid item. If that's the case, you need to do something like sentinel = object(); if dict.get(key, sentinel) is not sentinel:.
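
A small sketch of the differences described above. The names _MISSING and lookup_state are hypothetical, chosen only for this illustration; the dict methods themselves are standard.

```python
# Hypothetical sentinel: any unique object works, because `is` compares identity.
_MISSING = object()


def lookup_state(mapping, key):
    """Classify a key as 'missing', 'present with None', or 'present'."""
    value = mapping.get(key, _MISSING)
    if value is _MISSING:
        return "missing"
    if value is None:
        return "present with None"
    return "present"


d = {"a": 1, "b": None}

# The test from the question raises for an absent key instead of being False.
try:
    d["c"] != None
    raised = False
except KeyError:
    raised = True

print(raised)
print(lookup_state(d, "a"), "|", lookup_state(d, "b"), "|", lookup_state(d, "c"))
print("b" in d, d.get("b") is not None)
```

The last line shows why comparing against None is not a membership test: the key "b" is in the dict, yet the None check reports it as absent.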


So, the right thing to write is:

if key in d:

More generally, this is not true:

I know the "in" keyword is generally O(n) (as this just translates to python iterating over an entire list and comparing each element)

The in operator, like most other operators, is just a call to a __contains__ method (or the equivalent for a C/Java/.NET/RPython builtin). list implements it by iterating the list and comparing each element; dict implements it by hashing the key and looking up the hash; blist.blist implements it by walking a B+Tree; etc. So, it could be O(n), O(1), O(log n), or something completely different.
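
To make the dispatch concrete, here is a minimal sketch of a container whose in is O(log n). SortedBag is a hypothetical class written for this illustration (it is not how blist works internally); it uses the standard bisect module to binary-search a sorted list.

```python
import bisect


class SortedBag:
    """Hypothetical container: `in` does a binary search over a sorted list."""

    def __init__(self, items):
        self._items = sorted(items)

    def __contains__(self, item):
        # bisect_left returns the leftmost position where item could be inserted
        # while keeping the list sorted; item is present only if it is already there.
        i = bisect.bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item


bag = SortedBag([9, 3, 7, 1, 5])
print(5 in bag)  # calls SortedBag.__contains__(bag, 5)
print(6 in bag)
```

The syntax at the call site is identical to a list or dict lookup; only the __contains__ implementation decides the cost.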

abarnert
  • That's what I was thinking. Is this documented anywhere? I wasn't sure, though, because I thought dict.keys() might just be returning a list, making the "in" O(n). – tknickman Jul 09 '13 at 03:31
  • @tknickman: In general, Python doesn't document performance characteristics of its functions. (Partly this is because it's always possible for you to do something ridiculous like define a `hash` function that depends on the number of elements.) So, [this](http://docs.python.org/3/library/stdtypes.html#mapping-types-dict) is all you get. But the fact that it documents that dicts are hash tables implies pretty strongly that `key in d`, `d[key]`, and `d.get(key)` are all going to be O(1). – abarnert Jul 09 '13 at 03:34
  • *Average* case is O(1), worst case O(n). – Steven Rumbalski Jul 09 '13 at 03:36
  • @AshwiniChaudhary: I believe he meant semantically, not algorithmically. – Steven Rumbalski Jul 09 '13 at 03:38
  • @AshwiniChaudhary: They are guaranteed to be semantically equivalent. In Python 3.x, they're also equivalent as far as performance. In Python 2.x, `keys` will obviously be slower. I've edited the answer to give more details. But the real point is, there is never any reason to use `key in d.keys()`, so you don't have to remember the details. – abarnert Jul 09 '13 at 03:40
  • @abarnert I interpreted that line in wrong way.(oops!). +1 great answer as always. :) – Ashwini Chaudhary Jul 09 '13 at 03:46
  • @AshwiniChaudhary: You may have interpreted that line the way originally written, instead of the way I intended to write it (and fixed it to), in which case you're hardly to blame… – abarnert Jul 09 '13 at 03:53
  • The first fallback solution I would try, if *key in dict* is not available, would be *dict.has_key()* instead of self written exception branches. – guidot Aug 24 '15 at 13:02

In Python 2, dict.keys() builds the whole list of keys first, so key in dict.keys() is an O(N) operation, while key in dict is an O(1) operation.

if(dict[key] != None) will raise KeyError if key is not found in the dict, so it is not equivalent to the first code.

Python 2 results:

>>> dic = dict.fromkeys(range(10**5))
>>> %timeit 10000 in dic
1000000 loops, best of 3: 170 ns per loop
>>> %timeit 10000 in dic.keys()
100 loops, best of 3: 4.98 ms per loop
>>> %timeit 10000 in dic.iterkeys()
1000 loops, best of 3: 402 us per loop
>>> %timeit 10000 in dic.viewkeys()
1000000 loops, best of 3: 457 ns per loop

In Python 3, dict.keys() returns a view object, which is much faster than Python 2's keys() but still slower than a plain key in dict:

Python 3 results:

>>> dic = dict.fromkeys(range(10**5))
>>> %timeit 10000 in dic
1000000 loops, best of 3: 295 ns per loop
>>> %timeit 10000 in dic.keys()
1000000 loops, best of 3: 475 ns per loop
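
The sessions above use IPython's %timeit. As a rough stand-alone sketch for Python 3, the standard timeit module can run a similar comparison; the list(dic) case is added here as an assumption-labeled stand-in for Python 2's list-building keys(), and the loop count of 200 is arbitrary. Absolute numbers will vary by machine and interpreter.

```python
import timeit

dic = dict.fromkeys(range(10**5))

statements = [
    "10000 in dic",          # direct membership test on the dict
    "10000 in dic.keys()",   # membership test through the keys view
    "10000 in list(dic)",    # stand-in for 2.x keys(): build a list, then scan it
]

loops = 200  # arbitrary; raise it for steadier numbers
timings = {}
for stmt in statements:
    total = timeit.timeit(stmt, globals={"dic": dic}, number=loops)
    timings[stmt] = total / loops  # seconds per loop
    print(f"{stmt:<22} {timings[stmt] * 1e9:12.0f} ns per loop")
```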

Use just:

if key in dict:
   #code
Ashwini Chaudhary
  • This is 2.x-specific. (Also, note that in CPython 2.7.3 or PyPy 2.0b1, `iterkeys` may be much faster than `keys`: Python 2.x allows `iterkeys` to be something smarter than just `iter(d.keys())`, and they actually do take some advantage. But it's still nowhere near as fast as just using `d` directly. On my computer, it's 94ns vs. 338us vs. 2.03ms.) – abarnert Jul 09 '13 at 03:52

The proper way to do this would be

if key in dict:
    do stuff

The in operator is O(1) on average for dictionaries and sets in Python.

Matt Bryant

The in operator for dict has average case time-complexity of O(1). For detailed information about time complexity of other dict() methods, visit this link.

prafi
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Elias MP Sep 13 '17 at 16:45