71

c++ unordered_map collision handling , resize and rehash

This is a previous question opened by me and I have seen that I am having a lot of confusion about how unordered_map is implemented. I am sure many other people shares that confusion with me. Based on the information I have know without reading the standard:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

I was hoping that someone might explain the implementation and how it fits with the C++ standard definition ( in terms of performance requirements ) and if it is really not the most efficient way to implement an hash map data structure how it can be improved ?

ralzaul
  • 3,730
  • 2
  • 27
  • 45
  • 8
    The standard doesn't dictate the implementation but rather performance requirements. Thus, implementation of STL containers might differ from vendor to vendor. – 101010 Jun 29 '15 at 10:06
  • 5
    I don't understand why this is too broad or irrelevant nor why it did get two negative votes. It is a perfectly valid question from my point of view... – ralzaul Jun 29 '15 at 10:15
  • I think it's because your point of view isn't what's used as a yardstick for which sort of questions are on-topic on this site. No-one can "explain the implementation" as there are many potential implementations. –  Jun 29 '15 at 10:17
  • Than someone might come and tell that standard implies this, this and this and two,three major implementations are using a linked list, allocating memory and resizing it when necessary etc,. No one is asking for the actual code of the implementation here... – ralzaul Jun 29 '15 at 10:18
  • 3
    I think that the complexity and iterator requirements essentially require an implementation as hash table with chaining. – Kerrek SB Jun 29 '15 at 10:26
  • I can understand the 2nd, 4th and I will modify it. Thanks, but I mean how many questions in this site is asked after reading the standard? I basically quoted something I have found on the web I don't say that this guys knows the standard better than the committee or anything.. – ralzaul Jun 29 '15 at 10:26
  • 2
    @Alex I am not interested about any particular implementation as well. What I was hoping to get is something like this: ex: std:map is mostly implemented using a red&black tree with pruning - unordered_map is implemented mostly using .... and ... . That is it. – ralzaul Jun 29 '15 at 10:30
  • 2
    @Walter Many similar questions should be closed as opinion-based: however, in this case, there are restrictions on the implementation of the std unordered meows that are strong enough that a good answer can be generated. – Yakk - Adam Nevraumont May 25 '16 at 14:59

1 Answers1

83

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count(), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor()](https://en.cppreference.com/w/cpp/container/unordered_map/load_factor) approaches 1.

References:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.

Regarding the text you quote:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

• It's necessary to distinguish between a vacant position and an occupied one.

• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

Tony Delroy
  • 94,554
  • 11
  • 158
  • 229
  • "... open hashing / chaining ... handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do." Actually, it *does* gradually degrade performance. What it avoids is the *precipitous* performance loss typical of most closed hashing. Either way, the performance degrades from O(1) to O(N), but with closed hashing, that happens from roughly 90% full to 100% full, whereas with open hashing it's typically from (say) 100% full to 1000% full (i.e., you've inserted ten times as many items as there are slots in the table). – Jerry Coffin May 22 '17 at 16:27
  • You can also do open hashing with a balanced tree instead of a list for each bucket. In this case, the degradation is from O(1) to O(log N). In this case, even with 10x as many items as slots in the table, there's still only minimal performance degradation (with the proviso that it's generally somewhat slower, even when minimally utilized). – Jerry Coffin May 22 '17 at 16:35
  • @JerryCoffin: true - I've heard that's what (one of?) the key Java implementations does, but it would adversely affect iteration over the full container compared to the singly linked list approach GCC uses (outlined in the answer). – Tony Delroy May 23 '17 at 07:26
  • @JerryCoffin: my reply above addressed your second comment, but just noticed I never addressed your first. Regarding *"Actually, it does gradually degrade...from (say) 100% full to 1000% full"* - that doesn't happen due to the max load factor - the table resizes at 100%. But, you're not talking about the performance issue with open hashing that I was alluding to either, which is that when you erase an element you either have to mark the bucket as having been in use and keep searching past it (small lasting cost), or "compact" a collision chain into the bucket (larger up-front cost). – Tony Delroy Feb 20 '18 at 12:54
  • 2
    @MohitShah Nonsense. It's an extremely good answer, which reasons clearly and - crucially - cites the original proposal. If you had trouble reading it, that's your problem, not the answer's. – underscore_d Jan 02 '19 at 22:16