I was tinkering around with Python's set and frozenset collection types.

Initially, I assumed that frozenset would provide better lookup performance than set, as it's immutable and thus could exploit the structure of the stored items.

However, this does not seem to be the case, judging by the following experiment:

import random
import time
import sys

def main(n):
    numbers = []
    for _ in xrange(n):
        numbers.append(random.randint(0, sys.maxint))
    set_ = set(numbers)
    frozenset_ = frozenset(set_)

    start = time.time()
    for number in numbers:
        number in set_
    set_duration = time.time() - start

    start = time.time()
    for number in numbers:
        number in frozenset_
    frozenset_duration = time.time() - start

    print "set      : %.3f" % set_duration
    print "frozenset: %.3f" % frozenset_duration


if __name__ == "__main__":
    n = int(sys.argv[1])
    main(n)

I executed this code using both CPython and PyPy, which gave the following results:

> pypy set.py 100000000
set      : 6.156
frozenset: 6.166

> python set.py 100000000
set      : 16.824
frozenset: 17.248

It seems that frozenset lookups are actually slightly slower than set lookups, both in CPython and in PyPy. Does anybody have an idea why this is the case? I did not look into the implementations.

jonrsharpe
Sven Hager
  • "as its immutable and thus could exploit the structure of the stored items" - what exactly did you expect it to do? Any structure it has access to, `set` has too. – user2357112 supports Monica Apr 11 '16 at 17:27
  • Well, that's what I'm asking. I thought that maybe frozenset could use some kind of precomputed hash function, which in turn could yield better lookup performance. – Sven Hager Apr 11 '16 at 17:30
  • You need to calculate the hash of any item you look up, period. You can't precompute hashes here as you can test an arbitrary item against the set. I'm not sure how you picture this optimisation? Items *in* the set don't need to have their hash calculated; they have already been slotted into the hash table. – Martijn Pieters Apr 11 '16 at 17:39
  • "You need to calculate the hash of any item you look up, period" I am aware of this fact, but still a fixed set of elements could offer optimization opportunities (e.g., a perfect hash function that could be generated at the time the frozenset is generated and which could be used for lookup) – Sven Hager Apr 11 '16 at 17:41
  • Have you eliminated garbage collection delays and other system timings? Use the `timeit` module for proper timing experiments. Try with numbers **not** in either set too. `frozenset` and `set` share the same implementation, so the timing differences you see are entirely local to your test. – Martijn Pieters Apr 11 '16 at 17:42
  • @SvenHager: I am not aware of any shortcuts there. All the calculation applies to the item you are testing against the set, to locate the slot into which there might be an equal object. – Martijn Pieters Apr 11 '16 at 17:56
  • I'm a little late to the party, but wouldn't Python store the frozenset on the function object for repeated calls, like a tuple vs. a list? – notbad.jpeg Dec 16 '16 at 20:44
  • I agree with Sven that a frozenset could theoretically perform better at lookup time by doing more computation at creation time. For example, with a hash table implementation, the hash function could be chosen so that there is minimal collision among the hashes of the elements of the set. – Bjarke Ebert May 24 '17 at 17:29

2 Answers

85

The frozenset and set implementations are largely shared; a set is simply a frozenset with mutating methods added, with the exact same hashtable implementation. See the Objects/setobject.c source file; the top-level PyFrozenSet_Type definition shares functions with the PySet_Type definition.
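The relationship between the two types can also be glimpsed from Python itself, without reading the C source. The following is only an illustrative sketch (the variable name `set_only_names` is made up for this example): it lists the attribute names that `set` exposes but `frozenset` does not.

```python
# Attribute names that exist on set but are missing from frozenset.
# This only inspects the public Python-level API; it says nothing
# definitive about how the C implementation is laid out.
set_only_names = sorted(set(dir(set)) - set(dir(frozenset)))

for name in set_only_names:
    print(name)
```

On CPython the printed names should essentially be the mutating methods (`add`, `remove`, `update`, and so on) and the in-place operators, which fits the description of a set as a frozenset plus mutation.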

There is no optimisation for a frozenset here, as there is no need to calculate the hashes for the items in the frozenset when you are testing for membership. The item that you use to test against the set still needs to have its hash calculated, in order to find the right slot in the set hashtable so you can do an equality test.
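To make this concrete, here is a small sketch that counts hash computations during a membership test. `CountingKey` is a hypothetical helper class written just for this illustration; it is not part of Python or of the original benchmark.

```python
class CountingKey:
    """Hypothetical helper: counts how often its hash is requested."""

    hash_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.value == other.value


members = [CountingKey(i) for i in range(5)]
results = {}

for container_type in (set, frozenset):
    container = container_type(members)   # hashes each member once while building
    CountingKey.hash_calls = 0            # reset: only count the lookup itself
    probe = CountingKey(3)
    found = probe in container
    results[container_type.__name__] = (found, CountingKey.hash_calls)

print(results)
```

In both cases the lookup should hash only the probe object, exactly once; the stored members are not rehashed. The work per lookup is therefore the same for both types.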

As such, your timing results are probably off due to other processes running on your system; you measured wall-clock time, did not disable Python's garbage collection, and did not repeat the measurement.

Try to run your test using the timeit module, with one value from numbers and one not in the set:

import random
import sys
import timeit

numbers = [random.randrange(sys.maxsize) for _ in range(10000)]
set_ = set(numbers)
fset = frozenset(numbers)
present = random.choice(numbers)
notpresent = -1
test = 'present in s; notpresent in s'

settime = timeit.timeit(
    test,
    'from __main__ import set_ as s, present, notpresent')
fsettime = timeit.timeit(
    test,
    'from __main__ import fset as s, present, notpresent')

print('set      : {:.3f} seconds'.format(settime))
print('frozenset: {:.3f} seconds'.format(fsettime))

This repeats each test 1 million times (the `timeit` default) and produces:

set      : 0.050 seconds
frozenset: 0.050 seconds
Martijn Pieters
  • Hash collisions do not require more memory, they require extra computation. The extra memory part is from the hash table pre-allocating extra memory to be available for insertions of new items. Both behaviors could have been avoided with frozenset. It could also be done by passing the current hash through an inexpensive modifying function; no need to modify each and every other object's hash individually. It's not that unreasonable to think it might be more performant. – Mr.WorshipMe Sep 23 '19 at 14:37
  • @Mr.WorshipMe: I highly doubt it would be. The memory is requested in one go, the computation is cheap, and a 'fitted' table would still require a hash computation, modulo operations and equality test. – Martijn Pieters Sep 23 '19 at 17:09
  • @Mr.WorshipMe: this is all getting rather academic and off-topic for an SO answer comment thread, however. Can I suggest this is taken to Python-Ideas, instead, where the Python core devs are present? They are way more qualified to assess such ideas, anyway. See the [Python-ideas mailinglist](https://mail.python.org/pipermail/python-ideas/) or the [Python Discourse site, ideas category](https://discuss.python.org/c/ideas). – Martijn Pieters Sep 23 '19 at 17:11
14

The reason for having two different datatypes is not performance; it is functional. Because frozensets are immutable, they are hashable and can be used as keys in dictionaries (or as elements of other sets). Sets are mutable and unhashable, so they cannot be used for this purpose.
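A short sketch of that difference follows. The dictionary contents and names such as `permissions` are invented purely for illustration.

```python
# A frozenset is hashable, so it works as a dictionary key.
permissions = {frozenset({"read", "write"}): "editor"}

# Element order does not matter: equal frozensets hash the same.
role = permissions[frozenset({"write", "read"})]
print(role)

# A plain set is unhashable, so using it as a key raises TypeError.
try:
    {set(["read"]): "viewer"}
    set_key_rejected = False
except TypeError as exc:
    set_key_rejected = True
    print("set as key failed:", exc)
```

The exact wording of the error message can vary between Python versions, so the sketch only relies on the exception type.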

T. Durbin