
What would be the Big O of the following fragment of code?

with open(file_name) as f:
    for word in f:
        w = word.rstrip()
        k = ''.join(sorted(w)).lower()
        if k in words:
            words[k].append(w)
        else:
            words[k] = [w]

2 Answers


x in alist is O(n), but this code isn't performing membership testing on a list; words looks to be a dict, and membership testing against a dict's keys (or a set) is O(1) on average. Technically, the worst case is O(n), but CPython puts some effort into thwarting even intentional attempts to cause hash collisions.
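
A quick way to see the difference empirically (a rough sketch added here, not part of the original answer; the sizes and repeat counts are arbitrary):

    import timeit

    # Worst-case miss: the list must scan every element, the dict hashes once
    for n in (1000, 10000, 100000):
        keys = [str(i) for i in range(n)]
        as_list = keys
        as_dict = dict.fromkeys(keys)
        t_list = timeit.timeit(lambda: 'missing' in as_list, number=100)
        t_dict = timeit.timeit(lambda: 'missing' in as_dict, number=100)
        print('n=%d: list %.5fs, dict %.5fs' % (n, t_list, t_dict))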

This code could be simplified a bit using collections.defaultdict though, so creating lists is done implicitly when a non-existent key is looked up:

import collections

words = collections.defaultdict(list)
with open(file_name) as f:
    for word in f:
        w = word.rstrip()
        words[''.join(sorted(w)).lower()].append(w)
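
As a quick sanity check (a self-contained sketch using a hypothetical in-memory word list instead of a file), anagrams end up grouped under the same key:

    import collections

    # Hypothetical input standing in for the file's lines
    lines = ['pots\n', 'stop\n', 'tops\n']
    words = collections.defaultdict(list)
    for word in lines:
        w = word.rstrip()
        words[''.join(sorted(w)).lower()].append(w)
    print(dict(words))  # {'opst': ['pots', 'stop', 'tops']}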

If you want uniqueness (though it would lose ordering), just change to defaultdict(set) and change append to add (sketched after the code below). If you need both uniqueness and ordering, collections.OrderedDict can (mostly) work as an ordered set:

import collections

words = collections.defaultdict(collections.OrderedDict)
with open(file_name) as f:
    for word in f:
        w = word.rstrip()
        # True is a placeholder; any value will do, since lookups only use "in" tests
        words[''.join(sorted(w)).lower()][w] = True
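
For reference, the set-based variant mentioned above would look like this (a sketch, with the same file_name assumption as the question; duplicates collapse, insertion order is not kept):

    import collections

    # Same loop as before, but each value is a set instead of a list
    words = collections.defaultdict(set)
    with open(file_name) as f:
        for word in f:
            w = word.rstrip()
            words[''.join(sorted(w)).lower()].add(w)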
ShadowRanger
  • `words` is indexed by a string; it can't be a list. – Klaus D. May 18 '16 at 01:40
  • @JohnDoLittle: `words` is most definitely not a `list`. You're using a `str` to perform lookups in it; if it were a `list`, you'd be getting repeated `TypeError: list indices must be integers or slices, not str`. It's a `dict` of `list`s, or some esoteric (non-builtin) type. Since you never test membership in the values of the `dict` (which are `list`s), you're never paying any `O(n)` lookup costs. – ShadowRanger May 18 '16 at 01:40

The k in words check would have linear complexity, that is, O(len(words)), if words were a list.

It looks like words is a dict, though, since words[k] apparently indexes it with a string, something that a list won't accept.

For a dict, access time can be seen as constant, O(1), both for searching (in) and updating. (This is average-case time; inserts are amortized O(1) due to occasional resizing.)
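
A minimal sketch of that distinction (my example, not from the answer):

    words_as_dict = {'opst': ['stop']}
    words_as_dict['opst'].append('pots')  # str key: average O(1) dict lookup

    words_as_list = ['stop', 'pots']
    try:
        words_as_list['opst']             # a list rejects str indices outright
    except TypeError as e:
        print(e)  # list indices must be integers or slices, not str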

9000
  • What would be the overall complexity, taking into account the whole fragment? – James Kalima May 18 '16 at 01:51
  • The whole piece reads a file so the time is going to be dominated by I/O. It is obviously linear in the number of characters read, provided that you don't have overly long words. The algorithmically slowest piece is the call to `sorted` which is O(n * log(n)) where n = `len(w)`. If all your words are much shorter than the text as a whole, it can just be considered a constant, and the whole piece is O(length of file). If you happen to have a text consisting of 2 or 3 super-long words, the performance will be dominated by `sorted` (log-linear), but it's a marginal case. – 9000 May 18 '16 at 13:55
  • Let's break it down: the complexity of reading N lines from a file is O(N), and the complexity of sorting each word with sorted(w) is O(n log n), where n = len(w). Hence the dominant operation, in my opinion, would be sorted(w), which is O(n log n). – Ovais Reza May 21 '16 at 22:13
  • This depends on which _n_ we are talking about. The whole thing is _O(k * s)_, where _k_ is the number of characters (we scan each exactly once) and _s_ is the cost of analyzing each word (splitting into words is _O(1)_ per character). If _w_ = len(longest word) << _k_, we can assume _s_ ≤ _O(w log w)_, because no word is longer than _w_; thus _s_ is bounded by a constant, i.e. _s_ is O(1). If, on the other hand, _w_ is comparable to _k_ (that is, the text has only a few words for any value of _k_: words get longer as the text grows, but their number _c_ stays about the same), we have _O(k * log(k/c))_. – 9000 May 22 '16 at 00:56
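
To make the word-length point from the comments concrete, here is a rough timing sketch (mine, with arbitrary lengths): the per-word cost of building the key grows log-linearly in word length, so it only starts to matter once words get very long:

    import timeit

    # Per-word key-building cost is O(n log n) in the word length n
    for n in (10, 1000, 100000):
        w = 'ab' * (n // 2)
        t = timeit.timeit(lambda: ''.join(sorted(w)).lower(), number=100)
        print('len(w)=%d: %.5fs per 100 keys' % (n, t))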