
I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:

#include <inttypes.h>
#include <stdlib.h>

#include <sys/mman.h>

int main()
{
  uint32_t sum = 0;
  uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  uint16_t *p = (buffer + 1);
  int i;

  for (i=0;i<14;++i) {
    //printf("%d\n", i);
    sum += p[i];
  }

  return sum;
}

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.

If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.

Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?

kasperd
  • Can't reproduce this on Debian. Can you post the generated assembler code? (If you are using `gcc`, you can get the assembler output with the `-S` option; the assembler output will be written to a `.s` file.) – cmaster - reinstate monica Nov 27 '17 at 12:22
  • Are you sure that your `mmap()` succeeds? Maybe it returns an error... – Ctx Nov 27 '17 at 12:25
  • What happens if you call malloc and then add the line `volatile uint8_t dummy = buffer[0];` just after the malloc call? Same bug? What I'm fishing for is that the actual heap allocation may be delayed until the data is actually used. Since the contents of the buffer returned from malloc are guaranteed to hold unspecified values, the C compiler might think that it doesn't have to actually allocate anything. – Lundin Nov 27 '17 at 12:25
  • @Lundin - You're hypothesising that the `malloc` intrinsic may be omitted entirely? – Oliver Charlesworth Nov 27 '17 at 12:27
  • @Lundin but the OP claims that it _works_ with `malloc()` and _not_ with `mmap()` – Ctx Nov 27 '17 at 12:29
  • @Ctx Yes. I'm trying to narrow the problem down to either the system or the use of `mmap`. If the dummy read I suggested causes the same bug, then the problem is with the system. Otherwise with the use of `mmap`. – Lundin Nov 27 '17 at 12:31
  • @Lundin Thanks for your suggestion. The `volatile` line you suggested doesn't change anything. However in the meantime the question has been answered to my satisfaction. – kasperd Nov 27 '17 at 23:23
  • "`uint16_t *p = (buffer + 1)`" does that compile? – curiousguy Dec 20 '19 at 05:44

1 Answer


Related: Pascal Cuoq's blog post shows a case where GCC assumes aligned pointers (that two int* don't partially overlap): GCC always assumes aligned pointer accesses. He also links to a 2016 blog post (A bug story: data alignment on x86) that has the exact same bug as this question: auto-vectorization with a misaligned pointer -> segfault.


gcc4.8 makes a loop prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".
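A cheap way to catch this in a debug build is to assert the alignment directly (a minimal sketch of my own, not part of the OP's program; it fails immediately for buffer + 1):

#include <assert.h>
#include <stdint.h>

/* Sketch: check that a uint16_t* has the alignment ISO C requires. */
static void check_u16_align(const void *p)
{
    assert((uintptr_t)p % _Alignof(uint16_t) == 0);  /* fails for buffer + 1 */
}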


Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).


gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.

But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.

(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
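A safer variant would be a wrapper of your own that does the error check first and only then promises the alignment. This is a sketch with a hypothetical name, not a real glibc interface:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper (sketch): check for MAP_FAILED *before* telling GCC
 * about the page alignment, so the error check can't be optimized away. */
static inline void *mmap_anon(size_t len, int prot)
{
    void *p = mmap(NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)                       /* (void*)-1 is not 4096-aligned */
        return NULL;
    return __builtin_assume_aligned(p, 4096);  /* only reached on success */
}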


on a CPU that is able to access unaligned

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned so you have the unusual situation where no array element starts at a 16-byte boundary.

SSE2 is baseline for x86-64, so gcc uses it.


Ubuntu 14.04 LTS uses gcc4.8.2. (Off topic: it's old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)

14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.

I put your code on Godbolt, and this is the relevant part of main:

    call    mmap    #
    lea     rdi, [rax+1]      # p,
    mov     rdx, rax  # buffer,
    mov     rax, rdi  # D.2507, p
    and     eax, 15   # D.2507,
    shr     rax        ##### rax>>=1 discards the low bit, assuming it's zero
    neg     rax       # D.2507
    mov     esi, eax  # prolog_loop_niters.7, D.2507
    and     esi, 7    # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]

It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop. i.e. gcc doesn't have a code path to handle the case where p is odd.
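In C terms, the prologue computes something like this (my paraphrase of the asm above, with made-up names):

#include <stddef.h>
#include <stdint.h>

/* Paraphrase of the prologue math: how many scalar iterations until p would be
 * 16-byte aligned, assuming p is already 2-byte aligned. */
static size_t prolog_niters(const uint16_t *p)
{
    uintptr_t misalign_bytes = (uintptr_t)p & 15;    /* and  eax, 15 */
    uintptr_t misalign_elems = misalign_bytes >> 1;  /* shr  rax: drops the low bit */
    return (0 - misalign_elems) & 7;                 /* neg  rax ; and esi, 7 */
}
/* If p is odd, p + 2*prolog_niters(p) is still odd, so the movdqa faults. */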


But the code-gen for malloc looks like this:

    call    malloc  #
    movzx   edx, WORD PTR [rax+17]        # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]        # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]   # tmp91, MEM[(uint16_t *)buffer_5 + 1B]

Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed-optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.

(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
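For reference, a hand-written SSE2 version of what the vectorizer could do here might look like the sketch below (my own code, not gcc's output; it zero-extends words to dwords the same way gcc's vectorized loop does, instead of using pmaddwd):

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Sketch: sum 14 uint16_t starting at a possibly-odd address.  One movdqu
 * covers 8 elements; the words are zero-extended to dwords before adding. */
static uint32_t sum14_u16(const uint8_t *start)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)start);  /* movdqu: no alignment requirement */
    __m128i z  = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi16(v, z);   /* words 0..3 zero-extended to dwords */
    __m128i hi = _mm_unpackhi_epi16(v, z);   /* words 4..7 zero-extended to dwords */
    __m128i s  = _mm_add_epi32(lo, hi);

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, s);
    uint32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (int i = 8; i < 14; i++) {           /* remaining 6 elements, scalar */
        uint16_t t;
        memcpy(&t, start + 2 * i, sizeof t); /* unaligned-safe scalar load */
        sum += t;
    }
    return sum;
}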


Safe code with unaligned pointers:

If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer is aligned, and will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.

#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}

With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)

.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4

But with memcpy, we get auto-vectorization with an unaligned load (and no intro/outro to handle alignment), unlike gcc's normal preference for reaching an alignment boundary first:

.L2:   # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi      # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2           # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]   # this is how memcpy compiles for normal scalar code, too.

In the OP's case, simply arranging for the pointers to be aligned is a better choice, as long as that doesn't cost a lot of extra memory or space and the data layout in memory isn't fixed. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it).
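For example (a sketch of the idea, with a hypothetical header layout, not the OP's real one):

#include <stddef.h>
#include <stdint.h>

/* Sketch: round a header length up so the uint16_t array starts at an even
 * offset.  The mmap'ed buffer is page-aligned, so an even offset makes the
 * resulting pointer 2-byte aligned, and plain p[i] is legal. */
static uint16_t *u16_after_header(uint8_t *buffer, size_t header_len)
{
    size_t off = (header_len + 1) & ~(size_t)1;  /* round up to a multiple of 2 */
    return (uint16_t *)(buffer + off);
}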

But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. down to just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.

Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.

uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.
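That is, something like this sketch of the pattern (hypothetical helper name):

#include <stdint.h>
#include <string.h>

/* The pattern described above: gcc/clang tend to store tmp to the stack and
 * reload it rather than synthesizing a zero-extended 3-byte load. */
static uint64_t load_u24(const uint8_t *src)
{
    uint64_t tmp = 0;
    memcpy(&tmp, src, 3);   /* copy only the low 3 bytes (gives the value directly
                               only on a little-endian host) */
    return tmp;
}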


GNU C __attribute__((aligned(1))) and may_alias

Instead of memcpy (which won't inline on some ISAs when GCC doesn't know the pointer is aligned, i.e. exactly this use-case), you can also use a typedef with a GCC attribute to make an under-aligned version of a type.

typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;

typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;

Related: Why does glibc's strlen need to be so complicated to run quickly? shows how to make a word-at-a-time bithack C strlen safe with this.

Note that it seems ICC doesn't respect __attribute__((may_alias)), but gcc/clang do. I was recently playing around with that trying to write a portable and safe 4-byte SIMD load like _mm_loadu_si32 (which GCC is missing). https://godbolt.org/z/ydMLCK has various combinations of safe everywhere but inefficient code-gen on some compilers, or unsafe on ICC but good everywhere.
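A gcc/clang-only load along those lines might look like this (a sketch of my own, under the assumption that may_alias is respected, so not for ICC):

#include <emmintrin.h>
#include <stdint.h>

typedef uint32_t __attribute__((aligned(1), may_alias)) unaligned_aliasing_u32;

/* Sketch: load 4 bytes from an arbitrary address into the bottom of an XMM
 * register, in the spirit of _mm_loadu_si32.  gcc/clang only. */
static inline __m128i load_u32_xmm(const void *p)
{
    uint32_t v = *(const unaligned_aliasing_u32 *)p;  /* unaligned, aliasing-safe load */
    return _mm_cvtsi32_si128((int)v);                 /* movd: zero-extends to 128 bits */
}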

aligned(1) may be less bad than memcpy on ISAs like MIPS where unaligned loads can't be done in one instruction.

You use it like any other pointer.

unaligned_aliasing_int *p = something;
int tmp = *p++;
int tmp2 = *p++;

And of course you can index it as normal like p[i].
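Putting it together, the memcpy sum loop from earlier could be written with the typedef instead (a sketch, same assumptions as above):

typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;

/* Same loop as the memcpy version above, expressed with the typedef. */
int sum_unaligned(const void *buf)
{
    const unaligned_aliasing_int *p = buf;
    int sum = 0;
    for (int i = 0; i < 10001; i++)
        sum += p[i];        /* each access is an unaligned, may_alias int load */
    return sum;
}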

Peter Cordes
  • The flags you suggested for `gcc` does indeed make the segfault go away. Whether the segfault is a bug in `gcc` or in my own code is not an important question to me. Once I understood that unaligned access was likely part of the trigger I updated my code to pad data to a multiple of 4 bytes. The fact that my code often worked with unaligned data made it harder for me to realize that alignment was part of the issue, which is why I wanted to understand why it failed and only sometimes. – kasperd Nov 27 '17 at 23:17
  • @kasperd: you only need to pad to a multiple of 2 bytes to make `uint16_t` naturally aligned. But if it's cheap and low-overhead to pad to 4 or 16 bytes, go for it. (If you can pad to 16 bytes, tell the compiler about it with `p = __builtin_assume_aligned(p,16);`). And yes, the unaligned access to `uint16_t` is the only dangerous thing; all the other things are safe. (Even casting a `uint8_t*` to `uint16_t*` and expecting to see the changes when reading through the original `uint8_t*` would be fine, I *think*, because `char*` can alias anything.) – Peter Cordes Nov 28 '17 at 02:42
  • The buffer in the actual program also contained some `uint32_t` fields that I would copy to/from a local variable when I needed to access them. I was thinking that with 4 byte alignment I could get rid of a couple of `memcpy` calls from my code. – kasperd Nov 28 '17 at 10:43
  • @kasperd: Hmm, well if they're not contiguous with any other 4-byte elements, you probably won't get any aligned vector loads trying to load them. The `memcpy` will probably optimize away to just an unaligned load. But anyway, yes if you have any 4B fields, then 4B alignment sounds like a good idea. That avoids the possibility of page-splits or cache-line splits (except when auto-vectorizing), which would hurt performance. – Peter Cordes Nov 28 '17 at 10:47
  • I'm not sure where this idea that gcc ever supported misaligned pointers beyond what the C or C++ standards require came from. As far as I've seen, it always just compiles pointer code down to pretty much the simplest code, just like most other modern compilers. Of course since x86 supports misaligned access all over the place, that often happens to work just fine in many cases, but I don't think that's any evidence gcc "supports" it! What would the non-support look like? Actively checking for misaligned pointers and aborting? Of course no compiler does that (outside of sanitization). – BeeOnRope Nov 28 '17 at 18:59
  • In the few cases I know of that might be able to separate support from non-support, _gcc goes right ahead and generates instructions that expect the standard-required alignment_. For example, [it is happy](https://godbolt.org/g/yLtoJR) to use `movaps` to move 16-bit values around when the pointer type indicates 16-byte alignment. It also certainly doesn't do anything to support misaligned `atomic_t` or `std::atomic<>` values: it emits instructions that only have the appropriate atomicity guarantees if the pointer is correctly aligned. Gcc seems to handle alignment much as other compilers do. – BeeOnRope Nov 28 '17 at 19:02
  • @BeeOnRope: It doesn't warn or generate faulting code when it knows at compile time that `malloc()+1` is cast to `uint16_t` and dereferenced, but it apparently *knows* that `malloc` return values are 16B-aligned. That's what made me wonder if it was somehow supported. I think you're right, though, that it's just a "happens-to-work" thing when it does work. – Peter Cordes Nov 28 '17 at 20:03
  • Yeah you could kind of imagine compilers falling into three camps with respect to any particular type of UB: (1) going out of their way to allow it, (2) generating code with the assumption it doesn't occur but without any effort to detect or prevent it and (3) going out of their way to detect it and do something about it. IMO the vast majority of compilers have traditionally taken the track (2) for most types of UB. That's the easiest, after all, and generates the fastest code. Security issues have made (3) much more popular lately, even at runtime cost (e.g., buffer checks). – BeeOnRope Nov 28 '17 at 20:57
  • Now it can be hard to tell the difference between (1) and (2) when the (2) strategy just happens to work for UB code. That's kind of the case for unaligned on x86, especially before the alignment restrictions introduced by SSE. I think if one were using gcc on a must-be aligned platform (including those that have separate unaligned instructions), however, it would have already been obvious that unaligned access wasn't supported. The fuzziness between (1) and (2) ends up being a problem when one day compiler writers decide to take advantage of something in the standard... – BeeOnRope Nov 28 '17 at 20:58
  • ... (see: strict aliasing and signed overflow) to optimize, or some other change (see SSE movaps) occurs that suddenly makes (1) different from (2) since if enough people were using that "feature" (union type-punning) they can end up having to support it as (1) because it's too late to go back, and sometimes the feature even gets standardized (especially if there is no other viable alternative). At least the compilers are doing a reasonable job of making the compliant way of doing things fast even when it looks slow (e.g., fixed-length misaligned `memcpy` that translates to one instruction). – BeeOnRope Nov 28 '17 at 20:59
  • @BeeOnRope: Good points, updated my answer. BTW, union type-punning is guaranteed by ISO C99 (not very explicitly, but most people read it that way). The wording wasn't present in C89, so this might actually be an example of what you're talking about: the standard defining what a lot of code already relied on. (It's also a GNU C++ and C89 extension, of course.) I wonder if there's a way to use `memcpy` either per-element or on blocks here that still auto-vectorizes without any actual copying. (i.e. compiles to `movups`). – Peter Cordes Nov 28 '17 at 21:16
  • @PeterCordes - yes, it probably wasn't clear but I was using union type-punning as an example of something that was apparently standardized after perhaps being UB before because a lot of people used it (it's also an example of a C vs C++ divergence since this is not allowed in C++). The C committee didn't do a great job with type punning though because there is still a raging debate about exactly what form punning has to take to be legal (e.g., whether forms where the union is a bit removed from the punning operation are allowed). – BeeOnRope Nov 28 '17 at 21:21
  • @BeeOnRope: It's a good thing `memcpy` inlines away to nothing pretty reliably for `uint32_t` / `float` and `double` / `uint64_t`. – Peter Cordes Nov 28 '17 at 21:24
  • Interestingly, for the `malloc` version of the OP's code clang compiles the entire function to `xor eax, eax` since it presumably sees the access to uninitialized memory returned by `malloc` and takes a shortcut. @PeterCordes - the [`memcpy` version](https://godbolt.org/g/CpGdWh) seems to generate vectorized code fine without any copying (seems to be the same code as the original direct pointer access). My experience is that both `icc` and `msvc` are crap at handling this type of `memcpy` compared to `gcc` or `clang`. Direct (possibly UB) pointer access is much better for those compilers. – BeeOnRope Nov 28 '17 at 22:08
  • "I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap" - no, they don't, and shouldn't. mmap returns MAP_FAILED aka (void*)-1 on failure, which isn't 4096-aligned, so GCC would remove your error checks. https://gcc.godbolt.org/z/gVrLWT – Alcaro Nov 06 '18 at 02:25
  • @Alcaro: heh, I didn't think of that problem! Thanks for that and the edit. Updated my answer with your Godbolt link. – Peter Cordes Nov 06 '18 at 02:34
  • A conforming implementation for x86 could place declared objects and struct members of 16-bit and larger types at odd addresses, if its generated code would properly handle loads and stores of such types. If I recall, Turbo C for the IBM PC had an option to control precisely that [I know Turbo Pascal did]. On such an implementation, accessing larger objects with unaligned pointers would have fully defined behavior. The Standard's use of the term "Undefined Behavior" includes actions whose behavior would be fully defined on some implementations but not all. – supercat Jul 11 '20 at 15:38
  • @supercat: Indeed, an implementation could use `alignof(T) == 1` for all `T` on x86, with some negative performance consequences. But GCC doesn't do that, and optimizes accordingly. – Peter Cordes Jul 11 '20 at 19:37