C undefined behavior. Strict aliasing rule, or incorrect alignment?

Question

I can't explain the execution behavior of this program:

#include <string> 
#include <cstdlib> 
#include <stdio.h>

typedef char u8;
typedef unsigned short u16;

size_t f(u8 *keyc, size_t len)
{
    u16 *key2 = (u16 *) (keyc + 1);
    size_t hash = len;
    len = len / 2;

    for (size_t i = 0; i < len; ++i)
        hash += key2[i];
    return hash;
}

int main()
{
    srand(time(NULL));
    size_t len;
    scanf("%lu", &len);
    u8 x[len];
    for (size_t i = 0; i < len; i++)
        x[i] = rand();

    printf("out %lu\n", f(x, len));
}

So, when it is compiled with -O3 with gcc, and run with argument 25, it raises a segfault. Without optimizations it works fine. I've disassembled it: it is being vectorized, and the compiler assumes that the key2 array is aligned at 16 bytes, so it uses movdqa. Obviously it is UB, although I can't explain it. I know about the strict aliasing rule and it is not this case (I hope), because, as far as I know, the strict aliasing rule doesn't work with chars. Why does gcc assume that this pointer is aligned? Clang works fine too, even with optimizations.

EDIT

I changed unsigned char to char, and removed const, it still segfaults.

EDIT2

I know that this code is not good, but it should work ok, as far as I know about the strict aliasing rule. Where exactly is the violation?

`[unsigned] char*` has a specific exception with strict aliasing: you can read anything through it. It’s not a free strict aliasing bypass, and creating the unaligned `u16*` from it is invalid. — Ry-, Oct 17 '17 at 12:51
Why do you typecast using const? `(const u16 *) (keyc + 1);` — ryyker, Oct 17 '17 at 12:51
You use an `unsigned short*` in the program, but there are no `unsigned short`s anywhere. That sounds exactly like an alias violation. — Bo Persson, Oct 17 '17 at 12:54
Even without the aliasing `(const u16 *) (keyc + 1)` could easily lead to misaligned access. This is very bad code. — StoryTeller - Unslander Monica, Oct 17 '17 at 12:54
Please correct me but you code should not compile, why do i think that because your Array has length 'len' which is an runtime value( you set the value at runtime). As far as i learned it in C you should then call malloc or similar to allocate memory. Array types their sizes must be known at compile time, mustn't they? — ExOfDe, Oct 17 '17 at 12:55
@ExOfDe - Your aren't wrong, just not up to date. Lookup the changes in C99 to the langauge standard. — StoryTeller - Unslander Monica, Oct 17 '17 at 12:56
`x` is `u8 x[len];` and you're accessing its members (`char`) in the `f` function through a `const u16*` pointer. That's a clear strict aliasing violation. — PSkocik, Oct 17 '17 at 13:01
@ExOfDe how long have you been programming C? This has been the state of the standard for last 18 years. (though optional for the last *6 years) — Antti Haapala, Oct 17 '17 at 13:22
@AnttiHaapala it generates a warning, and warnings are errors that's why i was not aware of it besides I code mostly in ANSI C if C is needed. — ExOfDe, Oct 17 '17 at 13:30
Just to clearify your end goal, are you trying to calculate the hash of a C-string (null-terminated array of chars) or the hash of whatever object, given its binary representation in memory? — Bob__, Oct 17 '17 at 13:50
[This is the "ANSI" C](https://webstore.ansi.org/RecordDetail.aspx?sku=INCITS%2FISO%2FIEC+9899-2012) — Antti Haapala, Oct 17 '17 at 13:50
The behaviour of `size_t len; scanf("%lu", &len);` is platform-dependant because `size_t` doesn't generally have the same size as `long`, which is what the `l` format type modifier assumes. Use the `z` type modifier to refer to arguments of type `size_t`. — David Foerster, Oct 17 '17 at 15:40
BTW, `u8` and `u16` are highly misleading type names - they *look* a lot like fixed-width types, but aren't. — Toby Speight, Oct 19 '17 at 13:01

Antti Haapala · Accepted Answer · 2018-07-06T06:55:52.593

The code indeed breaks the strict aliasing rule. However, there is not only an aliasing violation, and the crash doesn't happen because of the aliasing violation. It happens because the unsigned short pointer is incorrectly aligned; even the pointer conversion itself is undefined if the result is not suitably aligned.

C11 (draft n1570) Appendix J.2:

1 The behavior is undefined in the following circumstances:

....

Conversion between two pointer types produces a result that is incorrectly aligned (6.3.2.3).

With 6.3.2.3p7 saying

[...] If the resulting pointer is not correctly aligned [68] for the referenced type, the behavior is undefined. [...]

unsigned short has alignment requirement of 2 on your implementation (x86-32 and x86-64), which you can test with

_Static_assert(_Alignof(unsigned short) == 2, "alignof(unsigned short) == 2");

However, you're forcing the u16 *key2 to point to an unaligned address:

u16 *key2 = (u16 *) (keyc + 1);  // we've already got undefined behaviour *here*!

There are countless programmers that insist that unaligned access is guaranteed to work in practice on x86-32 and x86-64 everywhere, and there wouldn't be any problems in practice - well, they're all wrong.

Basically what happens is that the compiler notices that

for (size_t i = 0; i < len; ++i)
     hash += key2[i];

can be executed more efficiently using the SIMD instructions if suitably aligned. The values are loaded into the SSE registers using MOVDQA, which requires that the argument is aligned to 16 bytes:

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.

For cases where the pointer is not suitably aligned at start, the compiler will generate code that will sum the first 1-7 unsigned shorts one by one, until the pointer is aligned to 16 bytes.

Of course if you start with a pointer that points to an odd address, not even adding 7 times 2 will land one to an address that is aligned to 16 bytes. Of course the compiler will not even generate code that will detect this case, as "the behaviour is undefined, if conversion between two pointer types produces a result that is incorrectly aligned" - and ignores the situation completely with unpredictable results, which here means that the operand to MOVDQA will not be properly aligned, which will then crash the program.

It can be easily proven that this can happen even without violating any strict aliasing rules. Consider the following program that consists of 2 translation units (if both f and its caller are placed into one translation unit, my GCC is smart enough to notice that we're using a packed structure here, and doesn't generate code with MOVDQA):

translation unit 1:

#include <stdlib.h>
#include <stdint.h>

size_t f(uint16_t *keyc, size_t len)
{
    size_t hash = len;
    len = len / 2;

    for (size_t i = 0; i < len; ++i)
        hash += keyc[i];
    return hash;
}

translation unit 2

#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <inttypes.h>

size_t f(uint16_t *keyc, size_t len);

struct mystruct {
    uint8_t padding;
    uint16_t contents[100];
} __attribute__ ((packed));

int main(void)
{
    struct mystruct s;
    size_t len;

    srand(time(NULL));
    scanf("%zu", &len);

    char *initializer = (char *)s.contents;
    for (size_t i = 0; i < len; i++)
       initializer[i] = rand();

    printf("out %zu\n", f(s.contents, len));
}

Now compile and link them together:

% gcc -O3 unit1.c unit2.c
% ./a.out
25
zsh: segmentation fault (core dumped)  ./a.out

Notice that there is no aliasing violation there. The only problem is the unaligned uint16_t *keyc.

With -fsanitize=undefined the following error is produced:

unit1.c:10:21: runtime error: load of misaligned address 0x7ffefc2d54f1 for type 'uint16_t', which requires 2 byte alignment
0x7ffefc2d54f1: note: pointer points here
 00 00 00  01 4e 02 c4 e9 dd b9 00  83 d9 1f 35 0e 46 0f 59  85 9b a4 d7 26 95 94 06  15 bb ca b3 c7
              ^

Typo: (unsigned short *)foo + 1 should read (unsigned short *)(foo + 1) — Joshua, Oct 17 '17 at 17:45
The alignment requirement of `unsigned short` is implementation-defined. You say "intrinsic alignment of 2", but that statement can only be made in the context of a particular implementation. The OP's compiler documentation must specify it; and also it can be inspected with `_Alignof(unsigned short)`. Perhaps you could add a `_Static_assert` to your program to confirm this — M.M, Oct 17 '17 at 20:46
I don't know about @Antti's environment, but I observe that for gcc 4.8.5 on Linux x86_64, the alignment requirement for `unsigned short` is indeed 2. — John Bollinger, Oct 17 '17 at 22:15
"_However, you're forcing the u16 *key2 to point to an unaligned address:_" perhaps; `keyc + 1` will be unaligned *iff* `keyc` is aligned! — curiousguy, Oct 21 '17 at 22:47
@Flamefire wow that's evil. Perhaps you should add an answer. The good thing about C is that it is easy to tell that some evil magic is happening. Not so with C++ and boost libraries — Antti Haapala, Aug 02 '19 at 08:11
Done: https://stackoverflow.com/a/57326681/1930508. It got longer than I thought but I hope it provides some more insights and examples where things can fail — Flamefire, Aug 02 '19 at 12:41
@Nubcake because it is a pointer to the `mystruct` member `contents` that was forced to be unaligned (by declaring `mystruct` with GCC `__attribute__((packed))`) in the other translation unit. The compiler did know how to generate code there and didn't produce any warnings, but by the time the program was linked together the information was lost. — Antti Haapala, Apr 09 '20 at 03:54
@AnttiHaapala So it is normally alligned if the packed specifier isn't used right? — Nubcake, Apr 09 '20 at 22:38
@Nubcake late answer to comment, but yes, the `packed` attribute asks the compiler to break alignment expectations deliberately. — Antti Haapala, Apr 04 '21 at 08:06

score 9 · Answer 2 · answered Oct 17 '17 at 13:01

It is legal to alias a pointer to an object to a pointer to a char, and then iterate all bytes from the original object.

When a pointer to char actually points to an object (has been obtained through previous operation), it is legal to convert is back to a pointer to the original type, and the standard requires that you get back the original value.

But converting an arbitrary pointer to a char to a pointer to object and dereferencing the obtained pointer violates the strict aliasing rule and invokes undefined behaviour.

So in your code, the following line is UB:

const u16 *key2 = (const u16 *) (keyc + 1); 
// keyc + 1 did not originally pointed to a u16: UB

@AnttiHaapala there can be more than one source of UB in a program — M.M, Oct 17 '17 at 20:42

score 5 · Answer 3 · answered Aug 02 '19 at 12:39

To provide some more info and common pitfalls to the excellent answer from @Antti Haapala:

TLDR: Access to unaligned data is undefined behavior (UB) in C/C++. Unaligned data is data at an address (aka pointer value) that is not evenly divisible by its alignment (which is usually its size). In (pseudo-)code: bool isAligned(T* ptr){ return (ptr % alignof(T)) == 0; }

This issue arises often when parsing file formats or data sent over network: You have a densely packed struct of different data types. Example would be a protocol like this: struct Packet{ uint16_t len; int32_t data[]; }; (Read as: A 16 bit length followed by len times a 32 bit int as a value). You could now do:

char* raw = receiveData();
int32_t sum = 0;
uint16_t len = *((uint16_t*)raw);
int32_t* data = (int32_t*)(raw2 + 2);
for(size_t i=0; i<len; ++i) sum += data[i];

This does not work! If you assume that raw is aligned (in your mind you could set raw = 0 which is aligned to any size as 0 % n == 0 for all n) then data cannot possibly be aligned (assuming alignment == type size): len is at address 0, so data is at address 2 and 2 % 4 != 0. But the cast tells the compiler "This data is properly aligned" ("... because otherwise it is UB and we never run into UB"). So during optimization the compiler will use SIMD/SSE instructions for faster calculation of the sum and those do crash when given unaligned data.
Sidenote: There are unaligned SSE instructions but they are slower and as the compiler assumes the alignment you promised they are not used here.

You can see this in the example from @Antti Haapala which I shortened and put at godbolt for you to play around with: https://godbolt.org/z/KOfi6V. Watch the "program returned: 255" aka "crashed".

This problem is also pretty common in deserialization routines which look like this:

char* raw = receiveData();
int32_t foo = readInt(raw); raw+=4;
bool foo = readBool(raw); raw+=1;
int16_t foo = readShort(raw); raw+=2;
...

The read* takes care of endianess and is often implemented like this:

int32_t readInt(char* ptr){
  int32_t result = *((int32_t*) ptr);
  #if BIG_ENDIAN
  result = byteswap(result);
  #endif
}

Note how this code dereferences a pointer which pointed to a smaller type which might have a different alignment and you run into the exact some problem.

This problem is so common that even Boost suffered from this through many versions. There is Boost.Endian which provides easy endian types. The C code from godbolt can be easily written likes this:

#include <cstdint>
#include <boost/endian/arithmetic.hpp>


__attribute__ ((noinline)) size_t f(boost::endian::little_uint16_t *keyc, size_t len)
{
    size_t hash = 0;
    for (size_t i = 0; i < len; ++i)
        hash += keyc[i];
    return hash;
}

struct mystruct {
    uint8_t padding;
    boost::endian::little_uint16_t contents[100];
};

int main(int argc, char** argv)
{
    mystruct s;
    size_t len = argc*25;

    for (size_t i = 0; i < len; i++)
       s.contents[i] = i * argc;

    return f(s.contents, len) != 300;
}

The type little_uint16_t is basically just some chars with an implicit conversion from/to uint16_t with a byteswap if the current machines endianess is BIG_ENDIAN. Under the hood the code used by Boost:endian was similar to this:

class little_uint16_t{
  char buffer[2];
  uint16_t value(){
    #if IS_x86
      uint16_t value = *reinterpret_cast<uint16_t*>(buffer);
    #else
    ...
    #endif
    #if BIG_ENDIAN
    swapbytes(value);
    #endif
    return value;
};

It used the knowledge that on x86 architectures unaligned access is possible. A load from an unaligned address was just a bit slower, but even on assembler level the same as the load from an aligned address.

However "possible" doesn't mean valid. If the compiler replaced the "standard" load by a SSE instruction then this fails as can be seen on godbolt. This went unnoticed for a long time because those SSE instructions are just used when processing large chunks of data with the same operation, e.g. adding an array of values which is what I did for this example. This was fixed in Boost 1.69 by using memcopy which can be translated to a "standard" load instruction in ASM which supports aligned and unaligned data on x86, so there is no slowdown compared to the cast version. But it cannot be translated into aligned SSE instructions without further checks.

Takeaway: Don't use shortcuts with casts. Be suspicious of every cast especially when casting from a smaller type and check that the alignment cannot be wrong or use the safe memcpy.

*This problem is also pretty common in deserialization routines...* And this therefore provides a nice example of the classic correctness/efficiency/readability tradeoffs. If you instead write your deserialization code to read one byte at a time using `getc`, then reassemble them into multibyte words "by hand" (see [here](http://c-faq.com/stdio/extconform.html) for some examples), you get code which (a) has no possibility of unaligned access and (b) automatically works regardless of host byte order (with no extra, explicit byteswapping), although it (c) probably isn't maximally efficient. — Steve Summit, Feb 23 '21 at 12:39
In that case how can we achieve the same result without copying the data? All the other solutions use copy of single byte/char/uint8_t at a time or `memcpy`... — Alexis, Apr 01 '21 at 13:19
Why do you say unaligned SSE instructions are slower? Looking at the Intel's intrinsics, `_mm_loadu_si128` and `_mm_load_si128` have the same latency and throughput for all their architectures. — Alexis, Apr 01 '21 at 13:23
There is no way without a copy to do this. Look at std::bit_cast which is basically a memcpy. The good news though is, that the compiler may eliminate the copy and use an unaligned load for example. I said "A load from an unaligned address **was** just a bit slower". So this might not be true anymore. Or it may. Measure to be sure. The reason for it being slower is, that it needs to load from 2 memory locations/do 2 load requests at the lowest (microcode) level instead of 1. This might be hidden by caches etc, but is not guaranteed and the first load might have some extra latency. Again: Might — Flamefire, Apr 03 '21 at 15:05
@Alexis: On Nehalem and newer, `movdqu` is the same speed as `movdqa` *for aligned loads* (or really for any loads that don't cross a cache line boundary). `movdqu` does have higher latency and worse throughput on a cache-line split, and much worse on a page split. ((instead of just faulting). Also, without AVX, only `_mm_load_si128` can fold into a memory source for an ALU instruction like `paddd xmm0, [rdi]`. With `loadu` the compiler would need `movdqu xmm1, [rdi]` / `paddd xmm0, xmm1`. (With AVX, memory operands don't require alignment by default, only for `vmovdqa`.) — Peter Cordes, Apr 23 '21 at 08:17
@Flamefire: In GNU C, you can use `typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1), may_alias))` pointers as an alternative to `memcpy`. But yes, this question is actually a duplicate of [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783), and Pascal Cuoq's blog [GCC always assumes aligned pointer accesses](https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/), and https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html — Peter Cordes, Apr 23 '21 at 08:23
Sure, but that is not standard C/C++ and non-portable, so doesn't really help unless you don't need to be portable (which I'd advise against as the memcpy likely generates the same code) — Flamefire, Apr 23 '21 at 11:32
@PeterCordes I see but in that case I think the cause of the latency is the memory/cache accesses, not the instruction itself. And is much higher than few more instructions. — Alexis, Apr 23 '21 at 12:17
@Flamefire: Right, in pure ISO C/C++ you need memcpy. With optimization enabled, modern compilers almost always do a good job with it, and normally debug-mode performance doesn't matter. — Peter Cordes, Apr 23 '21 at 12:32

supercat · Answer 4 · 2018-10-09T17:04:55.963

Unless code does something to ensure that an array of character type is aligned, it should not particularly expect that it will be.

If alignment is taken care of, code takes its address once, converts it to a pointer of another type, and never accesses the storage via any means not derived from the latter pointer, then an implementation designed for low-level programming should have no particular difficulty treating the storage as an abstract buffer. Since such treatment would not be difficult and would be necessary for some kinds of low-level programming (e.g. implementing memory pools in contexts where malloc() may be unavailable), an implementation which doesn't support such constructs should not claim to be suitable for low-level programming.

Consequently, on implementations which are designed for low-level programming, constructs such as you describe would allow suitably-aligned arrays to be treated as untyped storage. Unfortunately, there is no easy way to recognize such implementations, since implementations which are designed primarily for low-level programming often fail to list all of the cases where the authors would think it obvious that such implementations behave in a fashion characteristic of the environment (and where they consequently do precisely that), while those whose design is are focused on other purposes may claim to be suitable for low-level programming even if they behave inappropriately for that purpose.

The authors of the Standard recognize that C is a useful language for non-portable programs, and specifically stated they did not wish to preclude its use as a "high-level assembler". They expected, however, that implementations intended for various purposes would support popular extensions to facilitate those purposes without regard for whether the Standard requires them to do so, and thus there was no need to have the Standard address such things. Because such intention was relegated to the Rationale rather than the Standard, however, some compiler writers regard the Standard as a full description of everything that programmers should ever expect from an implementation, and thus may not support low-level concepts like the use of static- or automatic-duration objects as effectively-untyped buffers.

Fun fact: the x86-64 System V ABI guarantees 16-byte alignment for VLAs, and local/global arrays of 16 bytes or larger. (It's super weird for that standard to say anything about the internals of a function, since it's not like another function can know that it was passed a pointer to a local array, rather than to one element of it.) So in this case, compiling with x86-64 GCC *does* ensure 16-byte alignment of `x[len]`, and thus misalignment of `(u16 *) (keyc + 1)`. — Peter Cordes, Apr 23 '21 at 08:29
@PeterCordes: I wonder what retronym should be used to distinguish the dialects of C which were useful because they would seek to fill in parts of the language with whatever would best fit the target platform and application field, from those which interpret the Standard's failure to mandate behavior for construct as an invitation to process them nonsensically? On some platforms, it may be advantageous not to align character arrays, but there has never been any reason for a non-obtuse implementation to adopt gcc/clang's willful blindness about cross-type address derivation. — supercat, Apr 23 '21 at 11:19
If you want an aliasing-safe unaligned u16, you can typedef it with `__attribute__((aligned(1), may_alias))`. GNU C gives you the tools. Arguably it would be nice if it saved you from yourself like it sometimes does with `_mm256_store_si256` to a local array that didn't use `alignas` (it chooses to align the destination for performance, which happens to also avoid segfaults from misaligned `vmovdqa`. https://godbolt.org/z/osW5zEefc) — Peter Cordes, Apr 23 '21 at 11:29
But is it better to create a situation where moving the definition of `f` to another file (where it won't inline into main without LTO) will break the program? Or when you change the buffer to be allocated some way that hides alignment info from GCC? So there are downsides to being forgiving. (If you mean that GCC should never auto-vectorize in a way that relies on a `u16*` having `alignof(u16)`, on targets where you can get away with unaligned scalars, one counter argument is that endorses or makes it easier to write code that's hard to port to alignment-required ISAs.) — Peter Cordes, Apr 23 '21 at 11:33
@PeterCordes: I don't think one should rely upon arrays being aligned unless one takes action to make them so. With regard to aliasing, the reason the Standard doesn't explicitly say that the "strict aliasing rule" doesn't apply in cases where e.g. a `T*` that points to a `T` is converted to `U*` and used to access storage without any intervening operations involving `T`, is that everyone in 1989 recognized no compiler whose author wasn't being deliberately obtuse would have any trouble recognizing such constructs. — supercat, Apr 23 '21 at 11:40

C undefined behavior. Strict aliasing rule, or incorrect alignment?

4 Answers4

Linked

Related