Is it possible to hash pointers in portable C++03 code?

Question

Is it possible to portably hash a pointer in C++03, which does not have std::hash defined?

It seems really weird for hashables containing pointers to be impossible in C++, but I can't think of any way of making them.

The closest way I can think of is doing reinterpret_cast<uintptr_t>(ptr), but uintptr_t is not required to be defined in C++03, and I'm not sure if the value could be legally manipulated even if it was defined... is this even possible?

I guess you're looking for something more than just taking sizeof(ptr) and treating the pointer as a sequence of chars? — Guy Sirton, Jan 05 '13 at 01:27
@GuySirton: You mean sequences of `unsigned char`? I'm not sure, is reading a pointer as an integer legal? — user541686, Jan 05 '13 at 01:34
@Mehrdad: Any object can be read as a sequence of `unsigned char`. However, some bits might not have defined values. For example, if you read bytes from `struct { int x : 3; }`. — Dietrich Epp, Jan 05 '13 at 01:40
If you're only reading it to hash the values, what do you expect would be illegal about it? Certainly if you read it as an int and then try to use it in a pointer-like way, it's bad, but I don't see what you would be concerned about here. — StilesCrisis, Jan 05 '13 at 01:40
Dietrich makes an interesting point. Some systems can ignore some bits of a pointer (e.g. 128K Mac CPUs ignored the top byte and so the OS stored allocation flags there). If you want your hashing algorithm to treat the pointers as identical despite having non-identical bits inside, you would need to do something special to handle it. — StilesCrisis, Jan 05 '13 at 01:42
@StilesCrisis: 32-bit x86 uses all the bits, and in 64-bit, although only 48 bits are defined in the architecture, the remaining 16 bits are defined too. AMD learned the lesson from the 68000-based Macs: when they got a "true" 32-bit processor (68020, in the Mac II), things went very badly because of that funny usage of the upper 8 bits of the registers [it actually didn't matter how much memory you had with a 68000 processor; it only has a 24-bit address bus, so 16MB is the most you could possibly address]. So in 64-bit, you have to have all zeros or all ones in the top bits, based on the value below. — Mats Petersson, Jan 05 '13 at 01:47
I know, but we're talking TRULY portable, right? Including, say, portable to a 68000-based architecture. Sure we know better now, but maybe you're writing code for a tiny little SOC that uses a 68000 CPU. — StilesCrisis, Jan 05 '13 at 02:12
@DietrichEpp: So that means I can't rely on two "equal" pointers returning the same value, right? (Because of unused bits, for example.) Does that imply it's impossible to do hashing without relying on the particular memory architecture of the implementation? — user541686, Jan 05 '13 at 02:33
Yes, on some architectures, some address bits are unused. Checking two addresses for equality requires masking off those bits. So you would need to know that much about your architecture to make a hash table. — StilesCrisis, Jan 05 '13 at 06:23

score 9 · Accepted Answer · edited May 23 '17 at 12:24

No, not in general. In fact, it's not even possible in general in C++11 without std::hash.

The reason why lies in the difference between values and value representations.

You may recall the very common example used to demonstrate the difference between a value and its representation: the null pointer value. Many people mistakenly assume that the representation for this value is all bits zero. This is not guaranteed in any fashion. You are only guaranteed how the value behaves, not how it is stored.

For another example, consider:

int i;
int* x = &i;
int* y = &i;

x == y;  // this is true; the two pointer values are equal

Underneath that, though, the value representation for x and y could be different!

Let's play compiler. We'll implement the value representation for pointers. Let's say we need (for hypothetical architecture reasons) the pointers to be at least two bytes, but only one is used for the value.

I'll just jump ahead and say it could be something like this:

struct __pointer_impl
{
    std::uint8_t byte1; // contains the address we're holding
    std::uint8_t byte2; // needed for architecture reasons, unused
    // (assume no padding; we are the compiler, after all)
};

Okay, this is our value representation; now let's implement the value semantics. First, equality:

bool operator==(const __pointer_impl& first, const __pointer_impl& second)
{
    return first.byte1 == second.byte1;
}

Because the pointer's value is really only contained in the first byte (even though its representation has two bytes), that's all we have to compare. The second bytes are irrelevant, even if they differ.

We need the address-of operator implementation, of course:

__pointer_impl __address_of(int& i)
{
    __pointer_impl result;

    result.byte1 = /* hypothetical architecture magic */;

    return result;
}

This particular implementation overload gets us a pointer value representation for a given int. Note that the second byte is left uninitialized! That's okay: it's not important for the value.

This is really all we need to drive the point home. Pretend the rest of the implementation is done. :)

So now consider our first example again, "compiler-ized":

int i;

/* int* x = &i; */
__pointer_impl x = __address_of(i);

/* int* y = &i; */
__pointer_impl y = __address_of(i);

x == y;  // this is true; the two pointer values are equal

For our tiny example on the hypothetical architecture, this sufficiently provides the guarantees required by the standard for pointer values. But note you are never guaranteed that x == y implies memcmp(&x, &y, sizeof(__pointer_impl)) == 0. There simply aren't requirements on the value representation to do so.
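To make the distinction concrete, here is a minimal sketch that compares two pointers both ways. The helper names same_value and same_representation are made up for illustration. On mainstream platforms both typically return true for two pointers to the same object, but only the first result is something the standard promises.

```cpp
#include <cstring>  // std::memcmp

// Hypothetical helper: compares the pointer *values*.
// The standard guarantees this is true for two pointers to the same object.
bool same_value(const int* a, const int* b)
{
    return a == b;
}

// Hypothetical helper: compares the pointer *object representations*
// byte for byte. The standard does not promise this is true even when
// same_value(a, b) is true, although it usually is on common platforms.
bool same_representation(const int* a, const int* b)
{
    return std::memcmp(&a, &b, sizeof a) == 0;
}
```

A hash built on the bytes of the pointer is only as reliable as same_representation is on the target implementation.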

Now consider your question: how do we hash pointers? That is, we want to implement:

template <typename T>
struct myhash;

template <typename T>
struct myhash<T*> :
    std::unary_function<T*, std::size_t>
{
    std::size_t operator()(T* const ptr) const
    {
        return /* ??? */;
    }
};

The most important requirement is that if x == y, then myhash<T*>()(x) == myhash<T*>()(y). We also already know how to hash integers. What can we do?

The only thing we can try is to somehow convert the pointer to an integer. Well, C++11 gives us std::uintptr_t, so we can do this, right?

return myhash<std::uintptr_t>()(reinterpret_cast<std::uintptr_t>(ptr));

Perhaps surprisingly, this is not correct. To understand why, imagine again we're implementing it:

// okay because we assumed no padding:
typedef std::uint16_t __uintptr_t; // will be used for std::uintptr_t implementation

__uintptr_t __to_integer(const __pointer_impl& ptr)
{
    __uintptr_t result;
    std::memcpy(&result, &ptr, sizeof(__uintptr_t));

    return result;
}

__pointer_impl __from_integer(const __uintptr_t& ptrint)
{
    __pointer_impl result;
    std::memcpy(&result, &ptrint, sizeof(__pointer_impl));

    return result;
}

So when we reinterpret_cast a pointer to integer, we'll use __to_integer, and going back we'll use __from_integer. Note that the resulting integer will have a value depending upon the bits in the value representation of pointers. That is, two equal pointer values could end up with different integer representations...and this is allowed!

This is allowed because the result of reinterpret_cast is totally implementation-defined; you're only guaranteed that the opposite reinterpret_cast gives you back the original pointer value.

So there's the first issue: on this implementation, our hash could end up different for equal pointer values.

This idea is out. Maybe we can reach into the representation itself and hash the bytes together. But this obviously ends up with the same issue, which is what the comments on your question are alluding to. Those pesky unused representation bits are always in the way, and there's no way to figure out where they are so we can ignore them.

We're stuck! It's just not possible. In general.

Remember, in practice we compile for certain implementations, and because the results of these operations are implementation-defined they are reliable if you take care to only use them properly. This is what Mats Petersson is saying: find out the guarantees of the implementation and you'll be fine.

In fact, most consumer platforms you use will handle the std::uintptr_t attempt just fine. If it's not available on your system, or if you want an alternative approach, just combine the hashes of the individual bytes in the pointer. All this requires to work is that the unused representation bits always take on the same value. In fact, this is the approach MSVC2012 uses!

Had our hypothetical pointer implementation simply always initialized byte2 to a constant, it would work there as well. But there just isn't any requirement for implementations to do so.
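The byte-combining approach can be sketched in plain C++03 as follows. This is only an illustration, not the code any particular standard library uses: the name hash_pointer_bytes is hypothetical, the mixing step borrows the 32-bit FNV-1a constants as one reasonable choice, and the whole thing assumes that equal pointers on your implementation always have identical bytes (that is, any unused representation bits hold a fixed value).

```cpp
#include <cstddef>  // std::size_t

// Hypothetical helper: hashes the object representation of a pointer
// one byte at a time. Assumption (not guaranteed by the standard):
// equal pointer values always have identical bytes on this platform.
template <typename T>
std::size_t hash_pointer_bytes(T* ptr)
{
    // Any object may be inspected as a sequence of unsigned char.
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(&ptr);

    std::size_t h = 2166136261u;          // FNV-1a 32-bit offset basis
    for (std::size_t i = 0; i < sizeof ptr; ++i)
    {
        h ^= bytes[i];                    // fold in the next byte
        h *= 16777619u;                   // FNV-1a 32-bit prime
    }
    return h;
}
```

Inside the myhash<T*> specialization above, the placeholder could then become return hash_pointer_bytes(ptr); on platforms where the stated assumption holds.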

Hope this clarifies a few things.

Awesome, crystal clear answer! Not much else to say other than thanks a bunch! :) — user541686, Jan 05 '13 at 08:20
Your analysis is totally correct in terms of what the standard guarantees, but consider the joy, or lack thereof, experienced by the poor schmuck who has to implement `malloc` on top of the freestanding implementation that behaves as you describe (assume for purpose of this hypothetical that they don't get any other help from the compiler). To put it another way, the compiler team who makes pointer-to-integer conversions behave like this will IMNSHO shortly find an angry mob of operating system programmers outside their building. — zwol, Jan 05 '13 at 08:26
There is "the compiler is allowed to do it", and there is "a compiler does it". Because the C++ standard has a lot of "holes" in it that permit quite ridiculous behavior. Can you provide an example of a compiler that produces integers whose `uintptr_t` representations differ for a validly derived pointer that compares equal? I wouldn't be surprised if either none, or some, have that property, but knowing is important. — Yakk - Adam Nevraumont, May 30 '16 at 04:16
@Yakk: No clue. The question explicitly wants pure C++ solutions, which technically means you have to assume that if some compiler could do something that screws it up, one will. In practice, dunno; the comments on the question seem to indicate there are/were platforms like this, but those compilers probably don't get much use anymore. :) — GManNickG, May 30 '16 at 06:34

score 5 · Answer 2 · answered Jan 05 '13 at 01:53

The answer to your question really depends on how portable you want it to be. Many architectures will have a uintptr_t, but if you want something that can compile on DSPs, Linux, Windows, AIX, old Cray machines, IBM 390 series machines, and so on, then you may need a config option where you define your own "uintptr_t" if it doesn't exist on that architecture.
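One way such a config option might look is sketched below. The macro names MYLIB_UINTPTR_TYPE and MYLIB_HAVE_STDINT_H and the typedef my_uintptr_t are invented for this example, and the std::size_t fallback is only an assumption that happens to hold on many flat-memory platforms. The last line is a C++03-style compile-time check that rejects a type too narrow to hold a pointer.

```cpp
#include <cstddef>  // std::size_t

// Hypothetical configuration macros; neither is defined by any standard.
#if defined(MYLIB_UINTPTR_TYPE)
// The user named a suitable unsigned integer type explicitly.
typedef MYLIB_UINTPTR_TYPE my_uintptr_t;
#elif defined(MYLIB_HAVE_STDINT_H)
// The platform ships a C99-style <stdint.h> that provides uintptr_t.
#include <stdint.h>
typedef uintptr_t my_uintptr_t;
#else
// Assumption: std::size_t is as wide as a pointer on this platform.
typedef std::size_t my_uintptr_t;
#endif

// Compile-time check without static_assert: the array size becomes -1
// (a compile error) if my_uintptr_t is narrower than void*.
typedef char my_uintptr_is_wide_enough[
    sizeof(my_uintptr_t) >= sizeof(void*) ? 1 : -1];
```

Note that a wide enough type only makes the cast lossless; it does not change what the standard guarantees about the resulting integer values.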

Casting a pointer to an integer type should be fine. If you were to cast it back, you may be in trouble. Of course, if you have MANY pointers, and you allocate fairly large sections of memory on a 64-bit machine while using a 32-bit integer, there is a chance you get lots of collisions. Note that on 64-bit Windows, "long" is still 32 bits.
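The collision risk from truncation is easy to see with plain integers standing in for addresses. In this small sketch the function name truncate_to_32 is made up, and unsigned long long is assumed to be available (it is a common extension in C++03 compilers and standard from C++11).

```cpp
// Hypothetical helper: keeps only the low 32 bits of a 64-bit address,
// which is what storing a 64-bit pointer in a 32-bit integer amounts to.
// Assumes the compiler supports unsigned long long.
unsigned long truncate_to_32(unsigned long long address)
{
    return static_cast<unsigned long>(address & 0xFFFFFFFFull);
}
```

For example, the addresses 0x100001000 and 0x200001000 lie 4 GiB apart, yet both truncate to 0x1000, so any hash computed from the truncated value cannot tell them apart.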
