119

I've implemented a Bloom filter in Python 3.3 and got different results in every session. Tracking down this odd behavior led me to the built-in hash() function: it returns different hash values for the same string in every session.

Example:

>>> hash("235")
-310569535015251310

----- opening a new python console -----

>>> hash("235")
-1900164331622581997

Why is this happening? Why is this useful?

smci
redlus

3 Answers

153

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.
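One way to see the effect of the variable is to start fresh interpreters from a script and compare what they report. The following is a minimal sketch; the helper name `hash_in_new_interpreter` is made up for illustration and is not part of any library.

```python
import os
import subprocess
import sys


def hash_in_new_interpreter(text, seed):
    """Return hash(text) as computed by a freshly started Python process.

    `seed` is passed through PYTHONHASHSEED: "random", 0, or another integer.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # A fixed seed gives the same value in every new process.
    print(hash_in_new_interpreter("235", 12345))
    print(hash_in_new_interpreter("235", 12345))
    # With "random", the values will almost certainly differ between processes.
    print(hash_in_new_interpreter("235", "random"))
    print(hash_in_new_interpreter("235", "random"))
```

Setting the variable inside an already running interpreter has no effect on that interpreter, since the seed is chosen once at startup; that is why the sketch launches child processes.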

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.

If you were relying on the order of keys in a Python set, then don't. Python implements sets with a hash table, so their iteration order depends on the insertion and deletion history as well as on the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.
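If you need a reproducible order, impose one explicitly instead of relying on iteration order. A minimal sketch, with `stable_order` as a hypothetical helper name:

```python
def stable_order(items):
    """Return the elements of a set as a sorted list, independent of hash values."""
    return sorted(items)


keys = {"spam", "235", "eggs"}
# Iterating over `keys` directly may yield a different order in each session;
# the sorted list is the same every time.
print(stable_order(keys))
```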

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.
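For a Bloom filter, something along these lines gives hash values that are identical in every session. This is a sketch under my own assumptions, not how pybloom does it: the function names, the choice of SHA-256, and the salt-prefix scheme for deriving several hash functions are illustrative only.

```python
import hashlib


def stable_hash(text, salt=0):
    """Return a 64-bit integer hash of `text` that is the same in every session."""
    data = "{}:{}".format(salt, text).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big")


def bloom_indexes(text, num_hashes, num_bits):
    """Derive `num_hashes` bit positions for a Bloom filter with `num_bits` bits."""
    return [stable_hash(text, salt=i) % num_bits for i in range(num_hashes)]


if __name__ == "__main__":
    # Prints the same three positions no matter how often Python is restarted.
    print(bloom_indexes("235", num_hashes=3, num_bits=1024))
```

If a cryptographic hash turns out to be too slow for your data volume, the same structure works with any deterministic hash function.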

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.
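To illustrate why a single stored offset is not enough, here is a toy model of a hash salted with a prefix and a suffix. It is loosely modelled on the multiplicative string hash used before PEP 456 and written from memory; treat the constants and the function name `salted_hash` as assumptions, not as CPython's actual code.

```python
MASK = (1 << 64) - 1   # emulate 64-bit unsigned arithmetic
MULTIPLIER = 1000003   # assumed multiplier, for illustration only


def salted_hash(data, prefix, suffix):
    """Toy hash of a bytes object: `prefix` is the start value, `suffix` the final XOR."""
    if not data:
        return 0
    x = prefix & MASK
    x ^= data[0] << 7
    for byte in data:
        # The prefix is mixed into every multiplication from here on.
        x = ((MULTIPLIER * x) ^ byte) & MASK
    x ^= len(data)
    x ^= suffix
    return x & MASK


if __name__ == "__main__":
    print(salted_hash(b"235", prefix=0, suffix=0))
    print(salted_hash(b"235", prefix=1, suffix=0))
    print(salted_hash(b"235", prefix=0, suffix=5))
```

The suffix alone would be a simple XOR offset, but the prefix is fed through every multiplication step, so its effect on the result depends on the input string.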

Martijn Pieters
  • 16
    I'd expect this to show up in the hash() docs and not only in __hash__(). +1 for a great answer. p.s. Isn't hashlib an overkill for non-cryptographic uses of hash functions? – redlus Dec 18 '14 at 17:13
  • 1
    pybloom uses the hashlib functions. But if you want something faster, you could check out [pyhash](https://github.com/flier/pyfasthash). – Håken Lid Dec 21 '14 at 03:39
  • 3
    Why does the documentation call it `disable` when setting it to 0? I don't see the effective difference to setting it to any old stable seed number, unless I'm missing something. What I mean is when I use `PYTHONHASHSEED=12345` I get the same hash for equal strings even across sessions - the same happens when I use `PYTHONHASHSEED=0` - the hash for equal strings will be the same across sessions (albeit different to 12345, but that's obvious, that's how seeds work). – blubberdiblub Apr 13 '17 at 16:57
  • @blubberdiblub: with `0` there is no seed at all and hashes for objects are equal to those generated in an older Python version without any hashseed support. – Martijn Pieters Apr 13 '17 at 17:59
  • 1
    @MartijnPieters what does it mean for the affected hashes to have "no seed at all"? What's the semantic or qualitative difference to having a seed of, say, 12345, apart from the fact that it creates two distinct sets of sessions between which the hash values are different and apart from PYTHONHASHSEED=0 being equal to older versions? Can you link me to a particular piece of source code? I guess my point is that if there is no such difference, I'd call it a seed of 0 and older versions of Python only supporting a seed of 0. The documentation as it stands right now is quite confusing to me. – blubberdiblub Apr 14 '17 at 15:13
  • @blubberdiblub: note: this is all getting a little too far off topic for comments. Things got a little more complicated with [PEP 456](https://www.python.org/dev/peps/pep-0456/), but if we assume a Python using FNV hashing, then setting `PYTHONHASHSEED=0` should produce the same hash values as a Python 2.6 for the same string input. The option exists because production systems had to be able to transition from versions without randomisation to one with, but keep compatible during the transition. – Martijn Pieters Apr 14 '17 at 15:35
  • I wonder why hash collision is supposed to be a security issue only for string keys. What about integer keys, or tuples of integers? For example, `hash(2**16) == hash(2**(10**8))`. – Alexey May 01 '20 at 11:38
  • @Alexey: it isn't, but numeric keys are not nearly as common as string keys and an attacker needs to send a *lot* of numerical data vs string data to get the same amount of delay in processing. Add to that that altering how hashing works for numeric types has a lot of additional complications (int, float, complex, Decimal and Fraction all produce the same hash for equal values), and so number hashes were left out of scope. See the [original discussion on the randomisation implementation](https://bugs.python.org/issue13703). – Martijn Pieters May 02 '20 at 14:27
10

Hash randomisation is turned on by default in Python 3.3 and later. This is a security feature:

Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict construction

In earlier versions, starting with 2.6.8, you could switch it on at the command line with -R or by setting the PYTHONHASHSEED environment variable.

You can switch it off by setting PYTHONHASHSEED to zero.

Peter Wood
-13

hash() is a Python built-in function used to calculate a hash value for an object, not for a string or a number.

You can see the detail in this page: https://docs.python.org/3.3/library/functions.html#hash.

The value returned by hash() comes from the object's __hash__ method. The documentation says the following:

By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

That's why you get a different hash value for the same string in a different console.

Your approach is not a good one.

When you want to calculate a hash value for a string, just use hashlib.

hash() is meant to produce a hash value for an object, not for a string.

Adam Wen
    `hash()` is perfectly valid for string or numeric values. You are confusing this with the `__hash__` custom method, used **by `hash()`** to provide a custom implementation of the hash value. – Martijn Pieters Dec 17 '14 at 10:05