
My understanding of hash tables is that they use a hash function to map keys to locations in memory, with a fixed number of "buckets" pre-allocated. The goal is to have enough buckets that I rarely have to fall back on chaining, which degrades the ideal O(1) access time to O(n/m) on average, where n is the number of unique keys stored and m is the number of buckets.

So if I have 1000 unique items to store, I'll want no less than 1000 buckets, and perhaps a lot more to minimize probability of having to use my chained linked list. If this weren't the case, we'd expect the average hash table to have many, many collisions. Now if we've got 1000 pre-allocated buckets, that means I have 1000 bytes of allocated memory, distributed around my memory. Thus every single unique key in my hash table results in a fragment of memory, fragmenting my RAM.

Does this mean that the use of hash tables is basically guaranteed to result in an amount of fragmentation proportional to the number of unique keys? Further, this seems to indicate that you can greatly minimize fragmentation using some statistics to pick the number of buckets, if you know how many unique keys there are going to be. Is this the case?

Peter Cordes
Tal

1 Answer


"1000 bytes of allocated memory, distributed around my memory"

No, you have one contiguous array of 1000 entries, each of some size that is almost certainly larger than 1 byte.

If each entry is big enough to handle the non-collision case in-place, no extra dynamic allocation is required until you have a collision. (e.g. maybe you use a union and a 1-bit flag to indicate whether this entry is a stand-alone bucket or whether it's a pointer to a linked list.)
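
As a rough illustration of that in-place layout, here is a minimal sketch in C. All names (`struct slot`, `table_put`, `table_get`) are hypothetical, keys and values are assumed to be plain 32-bit integers, and the "hash" is just `key % nslots` to keep it short. It uses a small state field rather than a literal 1-bit flag so that empty slots can be represented too.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Overflow node: only allocated once a slot sees a collision. */
struct node {
    uint32_t key, value;
    struct node *next;
};

enum slot_state { SLOT_EMPTY, SLOT_INLINE, SLOT_CHAIN };

/* One table entry: either holds a single item in place or points to a chain. */
struct slot {
    unsigned char state;                        /* which union member is live */
    union {
        struct { uint32_t key, value; } item;   /* SLOT_INLINE */
        struct node *chain;                     /* SLOT_CHAIN */
    } u;
};

struct table {
    size_t nslots;
    struct slot *slots;   /* one contiguous array, not nslots separate blocks */
};

int table_init(struct table *t, size_t nslots) {
    assert(nslots > 0);
    t->nslots = nslots;
    t->slots = calloc(nslots, sizeof *t->slots);   /* zeroed => all SLOT_EMPTY */
    return t->slots ? 0 : -1;
}

/* Returns 0 on success, -1 if a chain node could not be allocated. */
int table_put(struct table *t, uint32_t key, uint32_t value) {
    struct slot *s = &t->slots[key % t->nslots];   /* toy hash: an assumption */
    if (s->state == SLOT_EMPTY) {                  /* no allocation needed */
        s->state = SLOT_INLINE;
        s->u.item.key = key;
        s->u.item.value = value;
        return 0;
    }
    if (s->state == SLOT_INLINE) {
        if (s->u.item.key == key) { s->u.item.value = value; return 0; }
        /* First collision: move the inline item into a two-node chain. */
        struct node *a = malloc(sizeof *a);
        struct node *b = malloc(sizeof *b);
        if (!a || !b) { free(a); free(b); return -1; }
        a->key = s->u.item.key; a->value = s->u.item.value; a->next = NULL;
        b->key = key; b->value = value; b->next = a;
        s->state = SLOT_CHAIN;
        s->u.chain = b;
        return 0;
    }
    for (struct node *n = s->u.chain; n; n = n->next)
        if (n->key == key) { n->value = value; return 0; }
    struct node *n = malloc(sizeof *n);
    if (!n) return -1;
    n->key = key; n->value = value; n->next = s->u.chain;
    s->u.chain = n;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
const uint32_t *table_get(const struct table *t, uint32_t key) {
    const struct slot *s = &t->slots[key % t->nslots];
    if (s->state == SLOT_INLINE)
        return s->u.item.key == key ? &s->u.item.value : NULL;
    if (s->state == SLOT_CHAIN)
        for (const struct node *n = s->u.chain; n; n = n->next)
            if (n->key == key) return &n->value;
    return NULL;
}
```

With 1000 slots, the only dynamic allocations beyond the array itself are the chain nodes created when two keys land in the same slot.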

If not (e.g. a key-value hash table with small keys but large values), then writing an entry means allocating space for it and storing a pointer to it in the table array itself. An empty hash table can still be one array full of NULL pointers.

Even then, you might want the array to hold structs of a pointer plus the full hash value (for single-member buckets). That way a query for a key that is definitely not present can be rejected without another level of indirection whenever the stored hash doesn't match the query's hash; a 32- or 64-bit hash has many more bits than the 10 bits used to index a 1024-entry table.
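
A minimal sketch of that pointer-plus-hash idea, covering only the single-member-bucket case. The names (`struct slot`, `insert_single`, `lookup`) are hypothetical, keys are assumed to be C strings, and 32-bit FNV-1a is used simply as an example hash function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)   /* 1024 slots, indexed by 10 hash bits */

/* The (possibly large) entry lives outside the table array. */
struct record {
    const char *key;
    int value;
};

/* Table slot: full 32-bit hash stored next to the pointer. */
struct slot {
    uint32_t hash;
    struct record *rec;   /* NULL means the slot is empty */
};

/* 32-bit FNV-1a; chosen only as a simple example hash. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Stores rec if its slot is empty; returns -1 when the slot is already
   occupied (collision handling is out of scope for this sketch). */
int insert_single(struct slot *table, struct record *rec) {
    assert(rec != NULL);
    uint32_t h = fnv1a(rec->key);
    struct slot *s = &table[h & (TABLE_SIZE - 1)];
    if (s->rec != NULL) return -1;
    s->hash = h;
    s->rec = rec;
    return 0;
}

const struct record *lookup(const struct slot *table, const char *key) {
    uint32_t h = fnv1a(key);
    const struct slot *s = &table[h & (TABLE_SIZE - 1)];
    /* Only 10 bits chose the slot; comparing all 32 bits rejects most
       absent keys here, before *rec is ever dereferenced. */
    if (s->rec == NULL || s->hash != h) return NULL;
    return strcmp(s->rec->key, key) == 0 ? s->rec : NULL;
}
```

The final `strcmp` is still needed because two different keys can share the same full hash, but that path is only taken when all 32 bits match.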


To reduce overall fragmentation, you can use a slab allocator or other technique for carving nodes out of a contiguous block you get from a global allocator. Having the hash table maintain its own private free-list could help with spatial locality of the linked-list nodes, so they're at least not scattered across many different virtual pages (TLB misses) and hopefully not DRAM pages (even slower cache misses).
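
A private free-list of that kind might look roughly like the following sketch. The names (`struct node_pool`, `pool_alloc`, `pool_free`) are hypothetical, and it assumes a single fixed-size block; a real implementation would probably grab another block when this one runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Chain node for the hash table's collision lists. */
struct node {
    unsigned key, value;
    struct node *next;
};

/* Pool that carves nodes out of one contiguous block. */
struct node_pool {
    struct node *block;      /* single allocation from the global allocator */
    size_t capacity;         /* number of nodes the block can hold */
    size_t used;             /* nodes carved out of the block so far */
    struct node *free_list;  /* recycled nodes, linked through ->next */
};

int pool_init(struct node_pool *p, size_t capacity) {
    assert(capacity > 0);
    p->block = malloc(capacity * sizeof *p->block);
    p->capacity = capacity;
    p->used = 0;
    p->free_list = NULL;
    return p->block ? 0 : -1;
}

struct node *pool_alloc(struct node_pool *p) {
    if (p->free_list) {                 /* reuse a freed node first */
        struct node *n = p->free_list;
        p->free_list = n->next;
        return n;
    }
    if (p->used < p->capacity)          /* otherwise take the next unused one */
        return &p->block[p->used++];
    return NULL;                        /* block exhausted (sketch stops here) */
}

void pool_free(struct node_pool *p, struct node *n) {
    n->next = p->free_list;             /* push onto the private free-list */
    p->free_list = n;
}

void pool_destroy(struct node_pool *p) {
    free(p->block);                     /* releases every node at once */
    p->block = NULL;
    p->free_list = NULL;
}
```

Because every node comes from the same block, consecutive allocations are adjacent in memory, and the global allocator only ever sees one large request instead of one small request per collision.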

Peter Cordes