
I was going through Java's HashMap source code when I saw the following:

//The default initial capacity - MUST be a power of two.
static final int DEFAULT_INITIAL_CAPACITY = 16;

My question is: why does this requirement exist in the first place? I also see that the constructor which allows creating a HashMap with a custom capacity converts it into a power of two:

// Round the requested capacity up to the next power of two
int capacity = 1;
while (capacity < initialCapacity)
  capacity <<= 1;

Why does the capacity always have to be a power of two?

Also, when automatic rehashing is performed, what exactly happens? Is the hash function altered too?

T.J. Crowder
Sushant

2 Answers


The map has to work out which internal table index to use for any given key, mapping any int value (could be negative) to a value in the range [0, table.length). When table.length is a power of two, that can be done really cheaply - and is, in indexFor:

static int indexFor(int h, int length) {
    return h & (length-1);
}

With a different table length, you'd need to compute a remainder and make sure it's non-negative. This is definitely a micro-optimization, but probably a valid one :)
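As a rough sketch of that alternative (not actual HashMap code; indexForAnyLength is a made-up name, and Math.floorMod is a Java 8 convenience that older code would have to spell out by hand):

// Hypothetical index computation for a table whose length is NOT a power of two.
// Math.floorMod keeps the result in [0, length) even for negative hash codes.
static int indexForAnyLength(int h, int length) {
    return Math.floorMod(h, length);              // e.g. floorMod(-7, 10) == 3
    // equivalent without floorMod: ((h % length) + length) % length
}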

Also, when automatic rehashing is performed, what exactly happens? Is the hash function altered too?

It's not quite clear to me what you mean. The same hash codes are used (because they're just computed by calling hashCode on each key) but they'll be distributed differently within the table due to the table length changing. For example, when the table length is 16, hash codes of 5 and 21 both end up being stored in table entry 5. When the table length increases to 32, they will be in different entries.
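A tiny self-contained sketch to check that on paper (BucketDemo is just an illustrative name, not JDK code):

// Same index computation HashMap uses for power-of-two table lengths.
public class BucketDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(5, 16));   // 5
        System.out.println(indexFor(21, 16));  // 5  -> same bucket as 5
        System.out.println(indexFor(5, 32));   // 5
        System.out.println(indexFor(21, 32));  // 21 -> now a different bucket
    }
}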

Jon Skeet
  • Exactly what I was looking for, thank you. One more doubt, why is the Entry table transient, even when it keeps all the data? – Sushant Dec 02 '11 at 06:57
  • @Sushant: The data in the table is *explicitly* serialized within writeObject (so that all the empty entries aren't written out). Making the field transient stops the normal serialization code from *also* writing it out in the call to `defaultWriteObject`. – Jon Skeet Dec 02 '11 at 07:01
  • @JonSkeet how does h & (length-1) deal with negatives? Let's say length = 16 and h = -7 – Geek Aug 22 '12 at 11:41
  • @Jon I am trying to connect your answer with the [accepted answer here](http://stackoverflow.com/questions/7405438/why-if-n-n-n-then-n-is-a-power-of-2?lq=1) – Geek Aug 22 '12 at 12:00
  • @Geek: It's not clear what you're concerned about. The negative number will still be mapped to an appropriate non-negative number. Did you try your example on paper? – Jon Skeet Aug 22 '12 at 12:22
  • @Jon I wanted to understand why h & (length-1) is cheap if length is a power of 2 and the number is negative or positive. Subtracting 1 from a power of 2 is equivalent to what actually? – Geek Aug 22 '12 at 12:24
  • @Geek: Basically `length - 1` ends up as binary of something like 000011111 (or whatever - the right number of bits at the end). `h & (length - 1)` is *always* cheap - the important thing is that it's *useful* if `length` is a power of 2. – Jon Skeet Aug 22 '12 at 12:25
  • It's not important here but the hash of the key that is used by HashMap is not `key.hashCode()`. The hash is a supplemental hash function applied on top of `key.hashCode()`. This is done to guard against poor hashCode implementations that might lead to more than desirable collisions. – Puneet Jan 08 '13 at 08:11
  • About h & length-1 -> This is just a faster way of doing h modulo length which is afforded by the bit characteristics of length (and thus length-1) desribed by @JonSkeet. – Puneet Jan 08 '13 at 08:15

The ideal situation is actually using prime number sizes for the backing array of a HashMap. That way your keys will be more naturally distributed across the array. However, this works with mod division, and that operation became slower and slower with every release of Java. In a sense, the power of 2 approach is the worst table size you can imagine, because poor hashCode implementations are more likely to produce key collisions in the array.

Therefore you'll find another very important method in Java's HashMap implementation, hash(int), which compensates for poor hashCode implementations.
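For reference, this is roughly what that supplemental hash looked like in the Java 6/7 era (the exact shift constants have varied between releases, so treat it as illustrative):

static int hash(int h) {
    // Spreads influence of the higher bits downwards, since the
    // power-of-two indexing only ever looks at the low bits.
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}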

M Platvoet
  • Yes, that makes a lot of sense, but as an additional favor can you talk more about how the hash(int) function goes about improving the original hashCode? I see it's taking the xor of a few bits, but I have not fully understood it. – Sushant Dec 02 '11 at 08:29
  • Basically, using the power of two approach makes the lower bits of a hashCode the important ones. With poor hashCode implementations these will not differ much (e.g. 10110111 and 00000111). So with all the shifting of bits the higher ones get more importance. – M Platvoet Dec 02 '11 at 08:42
  • The statement that the "mod operation became slower and slower with every release of Java" is quite misleading. Rather, it is the bitmask operation which became faster at a greater pace, ultimately both of these starting to reflect the ground-level performance of the actual hardware. At that level, bitmask is certainly much more performant, enough so that the whole setup, including the additional hashcode scrambling steps, is still a lot faster. – Marko Topolnik Dec 02 '14 at 15:17