
In my program I need to apply __attribute__(( aligned(32))) to an int * or float *. I tried it like this, but I'm not sure it will work:

int  *rarray __attribute__(( aligned(32)));

I saw this but didn't find the answer

ADMS
  • You cannot apply it to the pointer, only to the allocator, like the POSIX_MEMALIGN family http://man7.org/linux/man-pages/man3/posix_memalign.3.html . For C++, though, it's more convenient to give `std::vector` an aligned allocator, use the `std::vector` as the memory container, and extract a pointer from it, using a custom allocator like boost::aligned_allocator http://www.boost.org/doc/libs/1_58_0/doc/html/align/tutorial.html#align.tutorial.aligned_allocator – user3528438 Mar 24 '16 at 23:50
  • The allocator is in a different library, `pbmpak.h`. Is there another solution? I don't want to change `pbmpak.c`. I'm coding in C. – ADMS Mar 24 '16 at 23:52
  • Well, then you need to allocate the memory one alignment larger than needed and round the beginning up to the nearest aligned location. It's going to be difficult to manage the memory in C because, once the pointer is rounded to the alignment, you cannot free the memory using the aligned pointer; to free it you need the original pointer returned directly from the allocator function. If you are using C++, you can make a wrapper class that keeps a record of the pointer before rounding, and use the constructor to allocate and the destructor to deallocate, as a kind of RAII resource management. – user3528438 Mar 24 '16 at 23:56
  • If your work involves using a C API function in a C++ software project then I think it's OK to mark this question as both C and C++, but you need to make that clear in the question, because the answer would be quite different if that assumption is not true. – user3528438 Mar 24 '16 at 23:59
  • So is it necessary for this to be aligned? I'm working on vectorization with `AVX`, and I should load these elements as aligned so that it is faster than unaligned. I removed the C++ tag. – ADMS Mar 25 '16 at 00:00
  • Following your advice, I think it's better to clone this into an aligned array. Am I right? – ADMS Mar 25 '16 at 00:07
  • No, I don't think so. I would first try to find a way to use custom-allocated memory with the API you want to use. If that's not possible, I would try manually aligning the memory allocated by the API, which requires some code to be written, but not a lot. A deep copy, although it is indeed a solution, just feels like a cop-out. – user3528438 Mar 25 '16 at 00:13
  • Thank you, I will change the allocator, but I don't know how to do that. `*temp = ( int * ) malloc ( numbytes );` – ADMS Mar 25 '16 at 00:28
  • The pointer does not need to be aligned. The data being pointed to needs to be aligned. Big difference. – Mark Lakata Mar 25 '16 at 00:29
  • You can do it with standard `malloc` as described here: http://stackoverflow.com/questions/227897/how-to-allocate-aligned-memory-only-using-the-standard-library . But I strongly recommend using the C11 standard library function http://en.cppreference.com/w/c/memory/aligned_alloc if C11 is available. – user3528438 Mar 25 '16 at 00:34
  • There is a problem: I have to use exactly `__attribute__(( aligned(32))) `. I tried this: `*temp = ( __attribute__(( aligned(32))) int * )malloc ( numbytes ) ;` Is it OK? – ADMS Mar 25 '16 at 00:46
  • Probably not, and I believe it's highly undefined behavior. – user3528438 Mar 25 '16 at 00:50
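
The over-allocate-and-round approach from the comments above can be sketched roughly as follows. This is only an illustration, not code from `pbmpak.c`: the names `aligned_malloc_sketch` and `aligned_free_sketch` are hypothetical, and it assumes `alignment` is a power of two no smaller than `sizeof(void *)`. The original pointer from `malloc` is stashed just below the aligned block so it can be handed back to `free` later.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns `size` bytes whose address is a multiple of
   `alignment` (assumed: a power of two, >= sizeof(void *)), or NULL. */
void *aligned_malloc_sketch(size_t size, size_t alignment)
{
    /* Room for the payload, worst-case padding, and the saved raw pointer. */
    void *raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip past the slot for the saved pointer, then round up. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);

    /* Remember what malloc returned; free() needs exactly this value. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Must be used instead of free() for pointers from aligned_malloc_sketch. */
void aligned_free_sketch(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

With helpers like these, the `malloc ( numbytes )` call would become something like `aligned_malloc_sketch(numbytes, 32)`, and every matching `free` would have to become `aligned_free_sketch`. Overflow checking on the size arithmetic is left out for brevity.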

1 Answer


So you want to tell the compiler that your pointers are aligned? e.g. that all callers of this function will pass pointers that are guaranteed to be aligned. Either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign. (If those aren't available, _mm_malloc is one option, but free isn't guaranteed to be safe on _mm_malloc results: you need _mm_free). This allows the compiler to auto-vectorize without making a bunch of bloated code to handle unaligned inputs.
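
For the allocation side, a minimal sketch with C11 `aligned_alloc` might look like this. The helper name `alloc_ints_aligned32` is made up for illustration; the size is rounded up because C11 as originally published asks for the size to be a multiple of the alignment.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate room for `count` ints on a 32-byte boundary.
   The result is released with plain free(). */
int *alloc_ints_aligned32(size_t count)
{
    size_t bytes = count * sizeof(int);

    /* Round the request up to a multiple of 32 bytes. */
    size_t rounded = (bytes + 31) & ~(size_t)31;

    return aligned_alloc(32, rounded);
}
```

Unlike `_mm_malloc`, memory obtained this way can be passed to ordinary `free`.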

When you manually vectorize with intrinsics, you use _mm256_loadu_si256 or _mm256_load_si256 to inform the compiler whether memory is or isn't aligned. Communicating alignment information is the main point of load/store intrinsics, as opposed to simply dereferencing __m256i pointers.
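
As a rough sketch of that distinction, here are the 128-bit SSE2 analogues (`_mm_load_si128` vs. `_mm_loadu_si128`). SSE2 is used here only because it is baseline on x86-64 and so builds without extra flags; the 256-bit intrinsics follow the same pattern. The function names are hypothetical, and the scalar fallback exists only so the sketch compiles on non-x86 targets.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Add the four 32-bit lanes of v together. */
static int hsum_epi32(__m128i v)
{
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* The caller promises p is 16-byte aligned and n is a multiple of 4.
   The aligned load may fault at run time if that promise is broken. */
int sum_aligned(const int *p, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(p + i)));
    return hsum_epi32(acc);
}

/* No alignment promise: the unaligned load works for any address.
   n must still be a multiple of 4. */
int sum_unaligned(const int *p, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + i)));
    return hsum_epi32(acc);
}
#else
/* Scalar fallback so the sketch still builds on non-x86 targets. */
int sum_aligned(const int *p, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

int sum_unaligned(const int *p, size_t n)
{
    return sum_aligned(p, n);
}
#endif
```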


I don't think there's a portable way to inform the compiler that a pointer points to aligned memory. (C11 / C++11 alignas doesn't seem to be able to do that, see below).

With GNU C __attribute__ syntax, it seems to be necessary to use a typedef to get the attribute to apply to the pointed-to type, rather than to the pointer itself. It's definitely easier to type and easier to read if you declare an aligned_int type or something.

// Only helps GCC, not clang or ICC
typedef __attribute__(( aligned(32)))  int aligned_int;
int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
    int sum = 0;
    for (int i=0 ; i<1024 ; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}

This auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt):

    pxor    xmm0, xmm0
    xor     eax, eax
.L2:
    psubd   xmm0, XMMWORD PTR [rsi+rax]
    paddd   xmm0, XMMWORD PTR [rdi+rax]
    add     rax, 16
    cmp     rax, 4096
    jne     .L2          # end of vector loop

    ...   # horizontal sum with psrldq omitted, see the godbolt link if you're curious
    movd    eax, xmm0
    ret

Without the aligned attribute, you get a big block of scalar intro/outro code, which would be even worse with -march=haswell to make AVX2 code with a wider inner loop.


Clang's normal strategy for unaligned inputs is to use unaligned loads/stores, instead of fully-unrolled intro/outro loops. Without AVX, this means the loads couldn't be folded into memory operands for SSE ALU operations.

The aligned attribute doesn't help clang (tested as recently as clang 7.0): it still uses separate movdqu loads. Note that clang's loop is bigger because it defaults to unrolling by 4, whereas gcc doesn't unroll at all without -funroll-loops (which is enabled by -fprofile-use).

But note, this aligned_int typedef only works for GCC itself, not clang or ICC. gcc memory alignment pragma has another example.

__builtin_assume_aligned is noisier syntax, but does work across all compilers that support GNU C extensions.

See How to tell GCC that a pointer argument is always double-word-aligned?
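
A sketch of the same loop using `__builtin_assume_aligned` instead of the typedef could look like this. The function name `my_func_assume` is made up, and the 32-byte figure is carried over from the question. The builtin returns its argument, and the compiler may assume the returned pointer has the stated alignment, so the promise has to hold for every caller.

```c
/* Same reduction as my_func, but the alignment promise is made inside the
   function body instead of in the parameter types. */
int my_func_assume(const int *restrict a, const int *restrict b)
{
    /* Undefined behaviour if a or b is not really 32-byte aligned. */
    const int *pa = __builtin_assume_aligned(a, 32);
    const int *pb = __builtin_assume_aligned(b, 32);

    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += pa[i] - pb[i];
    }
    return sum;
}
```

The function signature stays a plain `const int *`, so callers and headers don't need the GNU-specific typedef.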


Note that you can't make an array of aligned_int. (see comments for discussion of sizeof(aligned_int), and the fact that it's still 4, not 32). GNU C refuses to treat it as an int-with-padding, so with gcc 5.3:

static aligned_int arr[1024];
// error: alignment of array elements is greater than element size
int tmp = sizeof(arr);

clang-3.8 compiles that, and initializes tmp to 4096. Presumably because it's just totally ignoring the aligned attribute in that context, not doing whatever magic gcc does to have a type that's narrower than its required alignment. (So only every fourth element actually has that alignment.)

The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use-cases. However, as @user3528438 pointed out in comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.


To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, rather than to every element.

In portable C11 and C++11, you can use things like alignas(32) int myarray[1024];. See also Struggling with alignas syntax: it seems to only be useful for aligning things themselves, not declaring that pointers point to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: forcibly aligning a pointer rather than telling the compiler it was already aligned.

// declaring aligned storage for arrays
#ifndef __cplusplus
#include <stdalign.h>   // for C11: defines alignas() using _Alignas()
#endif                  // C++11 defines alignas without any headers

// works for global/static or local  (aka automatic storage)
alignas(32) int foo[1000];      // portable ISO C++11 and ISO C11 syntax


// __attribute__((aligned(32))) int foo[1000];  // older GNU C
// __declspec(align(32)) int foo[1000];         // older MSVC

See the C11 alignas() documentation on cppreference.

CPP macros can be useful to choose between GNU C __attribute__ syntax and MSVC __declspec syntax for alignment if you want portability on older compilers that don't support C11.
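
One possible shape for such a macro is sketched below. The names `ALIGNED`, `lookup_table`, and `is_aligned32` are made up for illustration; the branches prefer C11 `_Alignas` and fall back to the GNU or MSVC spelling.

```c
#include <stdint.h>

/* Pick whichever alignment syntax this compiler understands. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define ALIGNED(n) _Alignas(n)                 /* ISO C11 */
#elif defined(__GNUC__)
#  define ALIGNED(n) __attribute__((aligned(n))) /* GNU C (gcc, clang, ICC) */
#elif defined(_MSC_VER)
#  define ALIGNED(n) __declspec(align(n))        /* MSVC */
#else
#  error "no known alignment syntax for this compiler"
#endif

/* The whole array starts on a 32-byte boundary; elements are still packed. */
ALIGNED(32) int lookup_table[1000];

/* Small check helper: non-zero if p is 32-byte aligned. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```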

e.g. with this code that declares a local array with more alignment than can be assumed for the stack pointer, the compiler has to make space and then AND the stack pointer to get an aligned pointer:

void foo(int *p);
void bar(void) {
  __attribute__((aligned(32))) int a[1000];
  foo (a);
}

compiles to (clang-3.8 -O3 -std=gnu11 for x86-64)

    push    rbp
    mov     rbp, rsp       # stack frame with base pointer since we're doing unpredictable things to rsp
    and     rsp, -32       # 32B-align the stack
    sub     rsp, 4032      # reserve up to 32B more space than needed
    lea     rdi, [rsp]     # this is weird:  mov rdi,rsp  is a shorter insn to set up foo's arg
    call    foo
    mov     rsp, rbp
    pop     rbp
    ret

gcc (later than 4.8.2) makes significantly larger code doing a bunch of extra work for no reason, the strangest being push QWORD PTR [r10-8] to copy some stack memory to another place on the stack. (check it out on the godbolt link: flip clang to gcc).

Peter Cordes
  • I kind of feel you are aligning each `int` of the array rather than aligning the array, like `sizeof(aligned_int)` would be 32? – user3528438 Mar 25 '16 at 00:29
  • @user3528438: hmm, interesting thought. I added a global initialized to `sizeof(aligned_int)` to the updated godbolt link: it's `4`. So pointer arithmetic on an `aligned_int` can easily produce a pointer that's not aligned. I'm not sure how best to explain how/why this works (i.e. at what point the compiler believes that a pointer to `aligned_int` actually has the specified alignment), but it's clearly the most useful way for it to work. – Peter Cordes Mar 25 '16 at 00:41
  • I guess that's something worth asking to the GCC developers or even filing a bug report. – user3528438 Mar 25 '16 at 00:43
  • @user3528438: Huh? It's not a bug: this is how it's supposed to work. I'm just not sure exactly how the rules-lawyering goes. Ask if you're curious (on the mailing list or in a new SO question), but it's not a valid subject for a bug report. – Peter Cordes Mar 25 '16 at 00:47
  • Thank you, but with `typedef` I would have to change everything in my program, and that's really hard to do – ADMS Mar 25 '16 at 00:50
  • @Amir: how is it harder than adding `__attribute__((aligned(32)))` in all the same places without `typedef`? I'm sure it's possible, since there's nothing magic about `typedef`. `aligned_int` is compatible with `int`, you only have to use it in places where it matters. – Peter Cordes Mar 25 '16 at 00:53
  • @user3528438: I checked: `static aligned_int arr[1024];` gives an error message: `error: alignment of array elements is greater than element size`. So this is how GNU C resolves the conflict between alignment and element size for non-composite types. [The examples in the gcc docs](https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html) are an aligned struct and a more-aligned `int`. They talk more about the struct: an array of that aligned-struct type would have padding to align the struct elements. – Peter Cordes Mar 25 '16 at 00:58
  • @PeterCordes: I have almost 50 programs in which I used `__attribute__((aligned(32)))` to make arrays aligned, but now I need a dynamic alignment, and since it's research everything should stay the same. I'm sure my professor will tell me that if I want to use this, I should change the previous work to match – ADMS Mar 25 '16 at 01:04
  • I tried it here https://ideone.com/oxGyiN . So it seems GCC has somehow decided not to fully implement `sizeof()` of aligned types to include extra padding, which basically forbids declaring arrays of such types. This error seems to have been added in 2005 https://gcc.gnu.org/ml/gcc-patches/2005-09/msg01853.html . Now, 11 years later, it's hard to argue whether this is a bug or a feature. – user3528438 Mar 25 '16 at 01:05
  • @user3528438: Thanks for the research. Updated my answer to include your findings. It's weird that the docs still conflict with actual behaviour. – Peter Cordes Mar 25 '16 at 01:12
  • @Amir: Declaring the alignment of (the first element of) an array is different from declaring the alignment of memory pointed to by a pointer. You can leave your arrays alone, and just use `aligned_int` when you need a function to know more about a pointer arg (in cases where it's not inlined into a context where it's known that it's being called on an aligned array). IDK what you mean by "dynamic alignment", but I'm *not* suggesting that you need to replace static arrays with dynamically allocated `aligned_alloc()` arrays. I made a minor edit to make sure this is clear. – Peter Cordes Mar 25 '16 at 01:16
  • I mean that in my previous program I used something like this: `int a[MAX1][MAX2] __attribute__(( aligned(32)))`, and now I need to do the same thing with a pointer. All my data should be aligned, and you have no idea what my professor is like. By the way, thank you all; I will show him this `aligned_int` – ADMS Mar 25 '16 at 01:23
  • @Amir: You should keep doing that for any static / global arrays you have. Are you depending on auto-vectorization? If not, you don't need to tell the compiler about function args pointing to aligned memory; just use the appropriate intrinsics to communicate whether an aligned load or store is ok. (Using unaligned loads/stores doesn't matter with AVX as long as the data is in fact aligned at runtime. There's literally zero performance difference between `vmovdqa` and `vmovdqu` when used on aligned data, on Intel and AMD CPUs that support AVX. See Agner Fog's guides, and the x86 tag wiki) – Peter Cordes Mar 25 '16 at 01:53
  • Note that `-funroll-loops` [does not really unroll the loops like Clang](http://stackoverflow.com/questions/33038542/unroll-loop-and-do-independent-sum-with-vectorization) i.e. it does not break the dependency chain. This is one major weakness of GCC for auto-vectorization. – Z boson Mar 25 '16 at 11:38