After reading the JDK source code, I am still surprised that the strings "AaAa", "AaBB" and "BBBB" have the same hash code.

The JDK source of String.hashCode() is as follows:

int h = hash;
if (h == 0 && value.length > 0) {
    char val[] = value;

    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;
}
return h;

Could anyone clarify this?

luk2302
Adam Lee
  • Why exactly does this surprise you? Hash codes are not unique, there will be different strings with the same hash code, and you happen to have found three. – Jesper Nov 05 '18 at 13:40
  • `"Aa"` and `"BB"` have the same hashcode. So sequences of `"Aa"` or `"BB"` of the same length will have the same hashcode. – khelwood Nov 05 '18 at 13:54
  • @Jesper It can be surprising to someone who has experience with cryptographic hash functions, which strive not to have collisions. – zaph Nov 05 '18 at 22:07

5 Answers

Because that's how the hash code is defined to be calculated for a String:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

So:

  • For AaAa: 65*31^3 + 97*31^2 + 65*31 + 97 = 2031744
  • For AaBB: 65*31^3 + 97*31^2 + 66*31 + 66 = 2031744
  • For BBBB: 66*31^3 + 66*31^2 + 66*31 + 66 = 2031744
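
The three sums above can be checked with a short program. This is only a sketch: the class name `PolynomialHashDemo` and the helper `polynomialHash` are made up for illustration, and the helper simply mirrors the documented formula (evaluated with Horner's rule) rather than being the JDK's own code.

```java
public class PolynomialHashDemo {
    // Hypothetical helper: evaluates s[0]*31^(n-1) + ... + s[n-1]
    // with Horner's rule and ordinary int arithmetic.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AaAa", "AaBB", "BBBB"}) {
            // Each line shows the hand-computed value next to String.hashCode();
            // all three strings yield 2031744.
            System.out.println(s + " -> " + polynomialHash(s) + " / " + s.hashCode());
        }
    }
}
```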
Andy Turner
  • It looks like it is very easy to get a clash. Why choose this hash function? – Adam Lee Nov 05 '18 at 13:43
  • @AdamLee because it is simple to calculate, works well on average. And it's part of the public API, so it's not possible to change it. – Andy Turner Nov 05 '18 at 13:44
  • @AdamLee because making the finding of hash collisions hard was and is not one of the goals of this hash function. – luk2302 Nov 05 '18 at 13:44
  • @AdamLee Collisions are expected; they are not meant to be avoided, nor are they avoidable. Equal hash codes do not imply that objects are equal, only that they could be equal. See the Java [hashcode/equals contract](https://stackoverflow.com/questions/27581/what-issues-should-be-considered-when-overriding-equals-and-hashcode-in-java) to get a better understanding of its purpose. – msg45f Nov 05 '18 at 23:09

Because probability.

There are ~4 billion possible hash codes (Integer.MIN_VALUE -> Integer.MAX_VALUE) and basically infinitely many possible Strings. There are bound to be collisions. In fact, the birthday problem shows us that only ~77,000 strings are required for a roughly 50% chance of at least one collision, and that would be if the hash function distributed its outputs uniformly, which it doesn't.
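
As a rough check of the ~77,000 figure, the usual birthday-problem approximation n ≈ sqrt(2 · N · ln(1 / (1 − p))) can be evaluated for N = 2^32 possible hash codes. The sketch below assumes an idealised, uniformly distributed 32-bit hash; the class and method names are made up for illustration.

```java
public class BirthdayBoundDemo {
    // Approximate number of uniformly random values that must be drawn from
    // `space` possibilities before a collision occurs with probability p.
    // Uses the standard approximation n ~ sqrt(2 * N * ln(1 / (1 - p))).
    static double itemsForCollisionProbability(double space, double p) {
        return Math.sqrt(2.0 * space * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        double space = Math.pow(2, 32); // number of distinct int hash codes
        // Prints a value a little above 77,000 for p = 0.5.
        System.out.printf("%.0f%n", itemsForCollisionProbability(space, 0.5));
    }
}
```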

Perhaps you are thinking of a cryptographic hash function, where

a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

If so, note that Object.hashCode is not designed for cryptographic purposes.

See also How secure is Java's hashCode()?

Michael
  • Perhaps worth pointing out that due to the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_paradox#Probability_table), you are more likely than not to have a hash collision with just 77,400 instances. – Andy Turner Nov 05 '18 at 13:49
  • @AndyTurner How did you calculate 77k? – Michael Nov 05 '18 at 13:51
  • It's in the linked table, in the `p = 0.50` column, first row. How do I know it's 77.4k? Not sure - it's one of those factoids I just... "know". I may be misremembering :) – Andy Turner Nov 05 '18 at 13:54
  • @AndyTurner Well remembered. Pretty much spot on... (added a link if you wanna pat yourself on the back) – Michael Nov 05 '18 at 13:58

Their hash codes are

AaAa: ((65 * 31 + 97) * 31 + 65) * 31 + 97 = 2031744
AaBB: ((65 * 31 + 97) * 31 + 66) * 31 + 66 = 2031744
BBBB: ((66 * 31 + 66) * 31 + 66) * 31 + 66 = 2031744

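That is just how the math works out; there is nothing mysterious about it.
Note that 'a' (97) and 'B' (66) differ by exactly 31, while 'A' (65) and 'B' (66) differ by 1. Replacing "Aa" with "BB" therefore adds 1 * 31 through the first character and subtracts 31 through the second, so the two changes cancel and the hash codes line up.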
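
Because "Aa" and "BB" have the same length and the same hash code, any string built from a fixed number of such two-character blocks collides with all the others. A small sketch of this (the class and method names are hypothetical, chosen only for this example):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CollisionFamilyDemo {
    // Builds every string made of `blocks` two-character blocks,
    // where each block is either "Aa" or "BB" (2^blocks strings in total).
    static List<String> family(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    // Collects the distinct hash codes of the given strings.
    static Set<Integer> distinctHashes(List<String> strings) {
        Set<Integer> hashes = new HashSet<>();
        for (String s : strings) {
            hashes.add(s.hashCode());
        }
        return hashes;
    }

    public static void main(String[] args) {
        // 65 * 31 + 97 and 66 * 31 + 66 are both 2112.
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());
        List<String> strings = family(3);
        System.out.println(strings.size() + " strings, "
                + distinctHashes(strings).size() + " distinct hash code(s)");
    }
}
```

With three blocks this produces 8 different strings that all share a single hash code; each extra block doubles the number of colliding strings.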

luk2302

Here is the description from Java documentation of Object#hashCode method:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

It is not required that if two objects are unequal according to the java.lang.Object#equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

The String class's implementation follows this contract as well, so unequal strings sharing a hash code is perfectly normal.
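
To illustrate the contract in practice, here is a minimal sketch (the class name and the stored values are made up for the example) showing that a HashMap keeps colliding keys apart, because it falls back on equals when hash codes match:

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeysDemo {
    // Stores three keys that share one hash code under different values.
    static Map<String, Integer> buildMap() {
        Map<String, Integer> map = new HashMap<>();
        map.put("AaAa", 1);
        map.put("AaBB", 2);
        map.put("BBBB", 3);
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = buildMap();
        // The keys collide on hashCode() but are unequal according to equals(),
        // so the map holds three separate entries.
        System.out.println(map.size());
        System.out.println(map.get("AaAa") + " " + map.get("AaBB") + " " + map.get("BBBB"));
    }
}
```

The collision only costs some lookup time, since the three entries land in the same bucket; it does not affect correctness.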

Cherokee

There are several types of hash functions with different design and performance criteria.

  1. Hash functions used for indexing, such as in associative arrays and similar uses, can have frequent collisions without any problem, because the hash table code handles them in some manner, such as putting the colliding entries in lists or re-hashing. Here it is all about speed. Java's hashCode() seems to be of this type.

  2. Another type of function, a cryptographic hash such as SHA*, strives to avoid collisions at the expense of hashing performance.

  3. A third type of hash function is the password verifier hash, which is designed to be very slow (~100 ms is common) and may require large amounts of memory; collisions that are not too frequent are not a concern. The point here is to make brute-force attacks take so long as to be infeasible.

One chooses the type and characteristics of a hash based on its usage.
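
As a small illustration of the first two types, the sketch below compares String.hashCode() with SHA-256 (via the standard `java.security.MessageDigest` API) on the colliding pair "Aa" and "BB". The class name and the `sha256Hex` helper are hypothetical, and the third type (password hashing) is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashKindsDemo {
    // Hypothetical helper: SHA-256 of the string's UTF-8 bytes as lowercase hex.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is expected to be available on standard Java platforms.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Indexing hash: 32 bits, cheap to compute, and the two inputs collide.
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());
        // Cryptographic hash: 256 bits, and the two digests differ.
        System.out.println(sha256Hex("Aa"));
        System.out.println(sha256Hex("BB"));
    }
}
```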

zaph