8

I'm developing a multicore, multithreaded software library in which I want to offer update-order preserving lock-free shared memory objects that might span multiple cache lines.

Specifically, suppose that I have some vector X of cache-line-sized objects: X[0], … X[K] each occupies exactly one cache line. I write to them in index order: X[0] first, then X[1], etc. If thread 2 reads X[K], will it also see a state for X[0] that is "at least as current" as what it sees for X[K]?

From that same thread, obviously I will see memory semantics that respect the update order. But now if some second thread reads X[K] the question arises: will the corresponding updates to X[0]...X[K-1] be observed?
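
(To make the pattern concrete, here is a minimal sketch of what I mean; the Line type, K, and the function names are just for illustration.)

#include <cstddef>

// Illustration only: each element occupies exactly one cache line.
struct alignas(64) Line { unsigned char bytes[64]; };

constexpr std::size_t K = 7;   // example size, nothing special about 7
Line X[K + 1];

void writer() {
    for (std::size_t i = 0; i <= K; ++i) {
        // ... fill X[i] ...   strictly in index order: X[0], then X[1], ...
    }
}

void reader() {
    // Reads X[K] first; the question is whether the values it then reads
    // from X[0] .. X[K-1] are guaranteed to be at least as current.
}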

With locking, we do get this guarantee. But with memcpy used to copy something into the vector, we lose this property: memcpy's POSIX semantics don't guarantee index-order updates, memory-order updates, or any other ordering at all. You are just guaranteed that after memcpy finishes, the entire update has been performed.

My question: is there already an order-preserving memcpy with similar speed but with the desired guarantee? And if not, can such a primitive be implemented without locking?

Assume my target platforms are x86 and ARM.

(Editor's note: originally said Intel, so the OP might not care about AMD.)

Peter Cordes
Ken Birman
  • Note: There is no guarantee that the *processor's* data cache will be used in memory copy. Many platforms have DMA controllers which can transfer data between memory locations without using the processor. – Thomas Matthews Aug 27 '18 at 16:24
  • @ThomasMatthews, completely agree. That is also a consideration. Hoping that someone thought all of this out and came up with a neatly packaged solution years ago... – Ken Birman Aug 27 '18 at 17:11
  • Also, be aware that a `memcpy` operation may be interrupted (by various things, including I/O). In that case, you are going to have a reload of the cache. – Thomas Matthews Aug 27 '18 at 17:18
  • It's obviously implementable with atomics, assuming at least one lock-free size exists. Note that this isn't a full cache line, it's a "word" of some size. Just write it with `seq_cst` and then see if someone understands if a weaker level is still legal. – o11c Aug 27 '18 at 17:20
  • There are no guarantees, except what the C++ standard states. The implementation of `memcpy` is compiler dependent, OS dependent and hardware dependent. For example, the ARM has a specialized instruction that can load up to 16 32-bit registers from memory (not interruptable) and likewise one that writes. However, the compiler may refuse to use the instruction and instead, loop (which is interruptable). Also, depends on how the copying utilizes the processor's register. The brute force is one byte at a time, more optimal is to use a word at a time. – Thomas Matthews Aug 27 '18 at 17:21
  • You'll also need to research your platform's cores and how they use data cache. For example, does your platform share data caches between cores? Many platforms share RAM between the cores, which becomes interesting when using the single data bus. – Thomas Matthews Aug 27 '18 at 17:23
  • Unless you want to block interrupts, I don't see how you are going to get any guarantees with memory copying (except those stated in the C++ standard). – Thomas Matthews Aug 27 '18 at 17:27
  • @o11c, thanks, we are checking to see if this would work and at what performance cost. Thomas makes good points. More thought needed on our part with respect to interrupts – Ken Birman Aug 27 '18 at 23:23
  • Keep in mind that `seq_cst` *is* relatively expensive. I don't have all this stuff memorized, but refreshing my memory, it looks like `rel` + `acq` can do it cleanly, which is cheap on sane arches like x86 (yes, I just said that) - what arch are you using? Also, keep in mind that you can't have *any* non-atomic accesses - but also, you shouldn't worry about cheap atomics. – o11c Aug 28 '18 at 03:09
  • @o11c: yes, the semantics the OP is asking for are *exactly* what release/acquire give you. http://preshing.com/20120913/acquire-and-release-semantics/. x86 does that for free in asm (but only with an atomicity chunk size of 8 bytes at most). You just have to ask the compiler nicely to use ordering. AArch64 only has relaxed or sequential-release, not cheaper plain release. :/ ARM32 only has memory barriers that are significantly stronger than release / acquire. (e.g. even a load-acquire needs a `dmb ish` (full memory barrier). https://godbolt.org/z/r08GzK). – Peter Cordes Aug 28 '18 at 03:52

2 Answers

7

The ordering requirements you describe are exactly what release/acquire semantics provide. (http://preshing.com/20120913/acquire-and-release-semantics/).
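
As a minimal sketch of that pairing (names here are made up; the payload stands in for the earlier elements):

#include <atomic>
#include <cstdint>

uint64_t payload;                      // plain data, written before the release store
std::atomic<uint64_t> last_elem{0};    // stands in for X[K]

void producer() {
    payload = 42;                                    // earlier store
    last_elem.store(1, std::memory_order_release);   // nothing before this can reorder after it
}

void consumer() {
    if (last_elem.load(std::memory_order_acquire) == 1) {   // pairs with the release store
        uint64_t p = payload;    // guaranteed to see 42 here
        (void)p;
    }
}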

The problem is that the unit of atomicity for efficient guaranteed-atomic loads/stores is at most 8 bytes on all x86 and some ARM; it's only 4 bytes on other ARMs. (Why is integer assignment on a naturally aligned variable atomic on x86?). Some Intel CPUs probably have atomic 32-byte or even 64-byte (AVX512) stores in practice, but neither Intel nor AMD have ever made any guarantees official.

We don't even know if SIMD vector stores have a guaranteed order when they potentially break up a wide aligned store into multiple 8-byte aligned chunks, or even whether those chunks are individually atomic (see Per-element atomicity of vector load/store and gather/scatter?). There's every reason to believe that they are per-element atomic, even if the documentation doesn't guarantee it.

If having large "objects" is performance critical, you could consider testing vector load/store atomicity on a specific server that you care about, but you're totally on your own as far as guarantees and getting the compiler to use it. (There are intrinsics.) Make sure you test between cores on different sockets, to catch cases like tearing at 8-byte boundaries because of HyperTransport between sockets on a K10 Opteron (see SSE instructions: which CPUs can do atomic 16B memory operations?). This is probably a really bad idea; you can't guess what, if any, microarchitectural conditions could make a wide vector store non-atomic in rare cases, even when it normally looks like it is atomic.
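
If you do go down that road, the kind of tearing test meant here looks something like the sketch below (names invented; compile with -mavx or -march=native; this is deliberately a data race by the C++ rules, it's a probe of one specific machine, and a clean run proves nothing about guarantees). Pin the two threads to cores on different sockets to cover the cross-socket case.

#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

alignas(64) static __m256i shared;            // 32-byte location, inside one cache line
static std::atomic<bool> stop{false};

static void writer() {
    const __m256i a = _mm256_set1_epi8(0x11);
    const __m256i b = _mm256_set1_epi8(0x22);
    while (!stop.load(std::memory_order_relaxed)) {
        _mm256_store_si256(&shared, a);
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier only
        _mm256_store_si256(&shared, b);
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }
}

static void reader() {
    for (long i = 0; i < 100000000; ++i) {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        __m256i v = _mm256_load_si256(&shared);
        alignas(32) uint64_t q[4];
        _mm256_store_si256(reinterpret_cast<__m256i *>(q), v);
        if (q[0] != q[3])                      // mixed halves => the 32-byte store tore
            std::printf("tear: %016llx vs %016llx\n",
                        (unsigned long long)q[0], (unsigned long long)q[3]);
    }
    stop.store(true, std::memory_order_relaxed);
}

int main() {
    std::thread t(writer);
    reader();
    t.join();
}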


You can easily have release/acquire ordering for the elements of an array like `alignas(64) atomic<uint64_t> arr[1024];`. You just have to ask the compiler nicely:

void copy_to_atomic(std::atomic<uint64_t> *__restrict dst_a,
                    const uint64_t *__restrict src, size_t len) {
    const uint64_t *endsrc = src + len;
    while (src < endsrc) {
        // release store: earlier stores (including previous elements)
        // can't be reordered after this one
        dst_a->store( *src, std::memory_order_release );
        dst_a++; src++;
    }
}

On x86-64 it doesn't auto-vectorize or anything, because compilers don't optimize atomics, and because there's no documentation that it's safe to use vectors to store consecutive elements of an array of atomic elements. :( So this basically sucks. See it on the Godbolt compiler explorer
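
For completeness, the matching reader side under the same assumptions would acquire-load the last element first and then read the earlier ones (the function name here is made up):

#include <atomic>
#include <cstdint>
#include <cstddef>

// Reads element K with acquire, then the earlier elements. If the value seen
// for arr[K] came from a release store, every earlier element is at least as
// current as it was when that store was done.
void read_snapshot(const std::atomic<uint64_t> *arr, size_t K, uint64_t *out) {
    out[K] = arr[K].load(std::memory_order_acquire);      // pairs with the release store
    for (size_t i = 0; i < K; ++i)
        out[i] = arr[i].load(std::memory_order_relaxed);  // ordered after the acquire
}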

I'd consider rolling your own with volatile __m256i* pointers (aligned load/store), and compiler barriers like atomic_thread_fence(std::memory_order_release) to prevent compile-time reordering. Per-element ordering/atomicity should be ok (but again not guaranteed). And definitely don't count on the whole 32 bytes being atomic, just that higher uint64_t elements are written after lower uint64_t elements (and those stores become visible to other cores in that order).
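
A sketch of that hand-rolled approach (GNU C, compiled with AVX enabled; the function name is invented, and as stated above the per-element atomicity is an assumption about the hardware, not something any vendor guarantees):

#include <immintrin.h>
#include <atomic>
#include <cstddef>

// Copy 32 bytes at a time through a volatile destination pointer so the
// compiler emits the stores in source order, then issue a release fence so
// later publication stores can't be reordered (at compile time) before the copy.
static void copy_volatile_avx(volatile __m256i *dst, const __m256i *src, size_t n_vecs) {
    for (size_t i = 0; i < n_vecs; ++i) {
        // volatile accesses aren't reordered with each other at compile time;
        // x86's hardware store ordering keeps them visible in this order
        dst[i] = _mm256_load_si256(src + i);   // aligned 32-byte load + store
    }
    std::atomic_thread_fence(std::memory_order_release);  // free on x86; compile-time barrier
}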


On ARM32: even an atomic store of a uint64_t is not great. gcc uses a ldrexd / strexd pair (LL/SC), because apparently there is no 8-byte atomic pure store. (I compiled with gcc7.2 -O3 -march=armv7-a. With armv8-a in AArch32 mode, store-pair is atomic. AArch64 also has atomic 8-byte load/store of course.)
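
For reference, the kind of compiler output being described comes from something as simple as this (compare the asm for -march=armv7-a vs. armv8-a / AArch64):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> x;

void store64_release(uint64_t v) {
    // ARMv7-A: gcc emits a ldrexd/strexd retry loop (no 8-byte plain atomic store).
    // AArch32 on ARMv8-A: an ordinary store-pair is atomic.
    // AArch64: a single stlr.
    x.store(v, std::memory_order_release);
}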


You must avoid using a normal C library memcpy implementation. On x86, it can use weakly-ordered stores for large copies, allowing reordering between its own stores (but not with later stores that weren't part of the memcpy, because that could break later release-stores.)

movnt cache-bypassing stores in a vector loop, or rep movsb on a CPU with the ERMSB feature, could both create this effect (see Does the Intel Memory Model make SFENCE and LFENCE redundant?).

Or a memcpy implementation could simply choose to do the last (partial) vector first, before entering its main loop.

Concurrent write+read or write+write on non-atomic types is UB in C and C++; that's why memcpy has so much freedom to do whatever it wants, including using weakly-ordered stores, as long as it uses sfence if necessary to make sure the memcpy as a whole respects the ordering the compiler expects when it emits code for later mo_release operations.

(i.e. current C++ implementations for x86 do std::atomic with the assumption that there are no weakly-ordered stores for them to worry about. Any code that wants its NT stores to respect the ordering of compiler-generated atomic<T> code must use _mm_sfence(). Or, if writing asm by hand, the sfence instruction directly. Or just use xchg if you want to do a sequential-release store and give your asm function the effect of an atomic_thread_fence(mo_seq_cst) as well.)
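
A sketch of how that looks if you roll your own NT-store copy (names made up; dst assumed 16-byte aligned; the point is only where the _mm_sfence() goes, between the NT stores and the release store that publishes them):

#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <cstddef>

std::atomic<uint64_t> seq{0};   // publication counter, single writer assumed

void publish_with_nt_stores(__m128i *dst, const __m128i *src, size_t n_vecs) {
    for (size_t i = 0; i < n_vecs; ++i)
        _mm_stream_si128(dst + i, _mm_load_si128(src + i));  // weakly-ordered NT stores
    _mm_sfence();   // drain the NT stores before the release store below
    // plain mov store on x86, so the sfence above is what makes this safe
    seq.store(seq.load(std::memory_order_relaxed) + 1, std::memory_order_release);
}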

Peter Cordes
  • I'm adding my own up-vote to Peter Cordes's fantastic and detailed reply. It seems to cover absolutely everything, and we are very grateful for the help! For me the topic is kind of closed by this. Thanks! – Ken Birman Aug 28 '18 at 20:15
  • @KenBirman: If it covers everything you wanted to know, you should click the checkbox to mark it accepted. Glad I could help. – Peter Cordes Aug 28 '18 at 21:00
  • Actual memcpy implementations like current glibc's don't do things in a linear order, even apart from the last element. For small sizes (but even on the order of 100s of bytes) glibc does a series of forward copies then a series of backwards copies that meet in the middle (and usually overlap). So you definitely can't use it for this, apart from all the other reasons. – BeeOnRope Aug 29 '18 at 00:46
  • This question raises an interesting point here regarding _ordering_ versus _atomicity_. In particular, the OP never asks for atomicity: he asks for guarantees that when a subsequent store (like `X[1]`) is observed, locations that were stored earlier (like `X[0]`) will be _at least as recent_. I believe the x86 memory ordering model guarantees this, even with wide SIMD loads and stores. That is, I think the ordering guarantees should not be (are not?) restricted to the accesses that are atomic. In particular, this would seem to guarantee that in a "write once" scenario, even wide stores will be ... – BeeOnRope Aug 29 '18 at 03:02
  • ... seen atomically by subsequent wide loads, if the release-acquire relationship is established (can this also be established by wide stores and loads?). The OP hasn't made clear how he is going to deal with the "at least" part of the "at least as current" requirement, but possibly he can do without atomicity? Doing without it entirely seems unlikely, though. – BeeOnRope Aug 29 '18 at 03:04
  • @BeeOnRope: What I'm not 100% confident about is that a wide vector store, if split up, will always logically do its lower address chunks before higher address chunks. Nothing gives us any guarantee that even an aligned vector store to `dst_a[0..1]` within a single cache line will store `dst_a[1]` last, although I think in practice we can assume this. (I forgot to put this in the answer.) But yes, if you know *another* rewrite of the array hasn't started, seeing a value in an element implies that all previous elements are "good", with rel/acq. – Peter Cordes Aug 29 '18 at 03:47
  • @PeterCordes - right I don't think there any guarantee about the ordering of such "sub stores" within a larger store, but I meant between distinct stores at the assembly level. – BeeOnRope Aug 29 '18 at 03:57
  • @BeeOnRope: Oh right, you're talking about the OP's 64-byte objects and having them written in order *within* each object, as well as between objects, with separate atomic release stores. Yes that would satisfy the requirement if that's all the OP needs. I think it's good to phrase the answer this way for other future readers, and kind of ignore that possible interpretation. If that *is* what you need, you can see from this answer how to do it and that it's safe, and if not you will learn that 64-byte atomic stores are unfortunately not a guaranteed thing. – Peter Cordes Aug 29 '18 at 04:02
  • @BeeOnRope: I really wonder if anyone is in-practice using 64-byte stores as atomic operations for custom low-latency stuff that only has to work on one machine. (Like algorithmic stock-trading stuff.) I'd definitely try it in those circumstances, because I think I understand enough to know how to test it carefully. You'd have to build some serious test harnesses to convince yourself that your whole algo was really working as part of your real code, though. – Peter Cordes Aug 29 '18 at 04:04
  • @PeterCordes if it would be useful for some low latency scenario, I have no doubt that people would use them somewhere. Some things could probably even tolerate the very rare tearing if it were to occur (maybe not trading!). – BeeOnRope Aug 29 '18 at 04:08
  • Clicked "accept". The remarks about DMA are dead on: this was really about RDMA in Mellanox, and the emulation of RDMA used by LibFabrics when running on TCP (which, it seems, is at a minimum "hard to use correctly" and indeed, may be buggy!) – Ken Birman Aug 30 '18 at 00:06
1

I found the answer by Peter Cordes to this question insightful, detailed, and very helpful. However, I didn't see his suggestions put into code, so for posterity and future people needing a quick solution to this issue of requiring ordered writes for DMA or lockless algorithms, I'm including the code I wrote based on that answer. I built it using gcc 4.9 on x64 and armv7-a, though I only ran and tested it on x64.

#include <atomic>
#include <stdint.h>  // uint8_t, uint64_t, uintptr_t
#include <stdlib.h>
#include <algorithm> // min

extern "C" {

static void * linear_memcpy_portable(void *__restrict dest, const void *__restrict src, size_t n)
{
   // Align dest to an 8-byte boundary if not already aligned
   if ((uintptr_t)dest & (sizeof(uint64_t) - 1)) {
      uint8_t *__restrict dst8 = reinterpret_cast<uint8_t *__restrict>(dest);
      const uint8_t *__restrict src8 = reinterpret_cast<const uint8_t *__restrict>(src);
      // copy just enough bytes to reach the next 8-byte boundary (or all of n)
      const size_t align_n = std::min(n, sizeof(uint64_t) - ((uintptr_t)dest & (sizeof(uint64_t) - 1)));
      const uint8_t * const endsrc8 = static_cast<const uint8_t * const>(src) + align_n;
      while (src8 < endsrc8) {
         *dst8 = *src8;
         atomic_thread_fence(std::memory_order_release);
         dst8++; src8++;
      }
      dest = dst8;
      src = src8;
      n = n - align_n;
   }
   typedef uint64_t __attribute__((may_alias,aligned(1))) aliasing_unaligned_uint64_t;
   uint64_t *__restrict dst64 = static_cast<uint64_t *__restrict>(dest);
   const aliasing_unaligned_uint64_t *__restrict src64 = static_cast<const aliasing_unaligned_uint64_t *__restrict>(src);
   const uint64_t * const endsrc64 = src64 + n / sizeof(uint64_t);
   const uint8_t * const endsrc8 = static_cast<const uint8_t * const>(src) + n;
   while (src64 < endsrc64) {
      *dst64 = *src64;
      atomic_thread_fence(std::memory_order_release);
      dst64++; src64++;
   }
   if (reinterpret_cast<const uint8_t * const>(endsrc64) != endsrc8) {
      uint8_t *__restrict dst8 = reinterpret_cast<uint8_t *__restrict>(dst64);
      const uint8_t *__restrict src8 = reinterpret_cast<const uint8_t *__restrict>(src64);
      while (src8 < endsrc8) {
         *dst8 = *src8;
         atomic_thread_fence(std::memory_order_release);
         dst8++; src8++;
      }
   }
   return dest;
}

#if (_M_AMD64 || __x86_64__)
#include <immintrin.h>
static void * linear_memcpy_avx2(void *dest, const void * src, size_t n) __attribute__((target("avx2")));
static void * linear_memcpy_avx2(void *dest, const void * src, size_t n)
{
   __m256i *__restrict dst256 = static_cast<__m256i *__restrict>(dest);
   const __m256i *__restrict src256 = static_cast<const __m256i *__restrict>(src);
   const __m256i * const endsrc256 = src256 + n / sizeof(__m256i);
   const uint8_t * const endsrc8 = static_cast<const uint8_t *>(src) + n;
   while (src256 < endsrc256) {
      // unaligned 32-byte load/store (no penalty when actually aligned); the
      // release fence compiles to nothing on x86 but blocks compile-time reordering
      _mm256_storeu_si256(dst256, _mm256_loadu_si256(src256));
      atomic_thread_fence(std::memory_order_release);
      dst256++; src256++;
   }
   if (reinterpret_cast<const uint8_t * const>(endsrc256) != endsrc8)
      linear_memcpy_portable(dst256, src256, endsrc8 - reinterpret_cast<const uint8_t * const>(endsrc256));
   return dest;
}

static void * linear_memcpy_sse2(void *dest, const void * src, size_t n) __attribute__((target("sse2")));
static void * linear_memcpy_sse2(void *dest, const void * src, size_t n)
{
   __m128i *__restrict dst128 = static_cast<__m128i *__restrict>(dest);
   const __m128i *__restrict src128 = static_cast<const __m128i *__restrict>(src);
   const __m128i * const endsrc128 = src128 + n / sizeof(__m128i);
   const uint8_t * const endsrc8 = static_cast<const uint8_t *>(src) + n;
   while (src128 < endsrc128) {
      _mm_storeu_si128(dst128, _mm_loadu_si128(src128));
      atomic_thread_fence(std::memory_order_release);
      dst128++; src128++;
   }
   if (reinterpret_cast<const uint8_t * const>(endsrc128) != endsrc8)
      linear_memcpy_portable(dst128, src128, endsrc8 - reinterpret_cast<const uint8_t * const>(endsrc128));
   return dest;
}

static void *(*resolve_linear_memcpy(void))(void *, const void *, size_t)
{
   __builtin_cpu_init();
   // All x64 targets support a minimum of SSE2
   return __builtin_cpu_supports("avx2") ? linear_memcpy_avx2 : linear_memcpy_sse2;
}
#ifdef __AVX2__
// IF AVX2 is specified to the compiler, alias to the avx2 impl so it can be inlined
void * linear_memcpy(void *, const void *, size_t) __attribute__((alias("linear_memcpy_avx2")));
#else
void * linear_memcpy(void *, const void *, size_t) __attribute__((ifunc("resolve_linear_memcpy")));
#endif
#else
void * linear_memcpy(void *, const void *, size_t) __attribute__((alias("linear_memcpy_portable")));
#endif

} // extern "C"
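
For what it's worth, a hypothetical usage sketch pairing this with a release-published counter (the reader's side, e.g. a seqlock-style recheck of the counter to detect a concurrent overwrite, is a separate problem):

#include <atomic>
#include <cstdint>
#include <cstddef>

extern "C" void * linear_memcpy(void *, const void *, size_t);   // from the code above

static std::atomic<uint64_t> seq{0};                  // publication counter, single writer
alignas(64) static unsigned char shared_buf[4096];    // hypothetical shared region

// Copy in ascending address order, then publish with a release store, so a
// reader that acquire-loads `seq` and sees the new value will also see the
// buffer contents at least as current as this copy.
void publish(const void *src, size_t n)
{
   linear_memcpy(shared_buf, src, n);
   seq.fetch_add(1, std::memory_order_release);
}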

I welcome any feedback on the implementation. :)

  • Can you please add reference to the answer you are mentioning? – Andrew Naguib Feb 05 '21 at 17:57
  • `linear_memcpy_portable` can break when inlining because it violates strict aliasing rules if you use it on memory you access with types other than `char*` or `uint64_t*`. And also possibly violating `alignof(uint64_t)` depending on pointer alignment ([Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](//stackoverflow.com/q/47510783)). You might want something like GNU C `typedef unsigned long __attribute__((may_alias,aligned(1))) aliasing_unaligned_ulong;` (see also [Why does glibc's strlen need to be so complicated to run quickly?](//stackoverflow.com/a/57671930)) – Peter Cordes Feb 05 '21 at 20:48
  • (`__m256i` is already defined as `may_alias` in GNU C; that's why it's safe to use it the way Intel documents; to load from arbitrary C objects that you also access with as other C types. [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605)) – Peter Cordes Feb 05 '21 at 20:49
  • See also [gcc, strict-aliasing, and horror stories](https://stackoverflow.com/q/2958633) for a Linux kernel bug caused by their old definition of memcpy as copying `long*` chunks, when GCC started doing type-based alias analysis. They "fixed" it by compiling with `-fno-strict-aliasing`, which is popular in general for embedded / kernels that often abuse C. – Peter Cordes Feb 05 '21 at 20:56
  • In your AVX2 and SSE versions, just always use `loadu` and `storeu`, not `load`/`store`, inside your loop. When the address happens to be aligned at runtime, `vmovdqu` is exactly as efficient as `vmovdqa` on real CPUs with AVX. (Same for SSE `movups` / SSE2 `movdqu` on Nehalem and later.) You *definitely* do not want the compiler to actually branch inside the copy loop; if you did want to cater to ancient CPUs like Core2Duo and AMD K10, you'd want 2 or 4 versions of the loop. (You might or might not get that from an optimizer). **For modern code, leave misalignment handling to HW.** – Peter Cordes Feb 05 '21 at 20:58
  • Hmm, if your runtime resolver stuff doesn't ever let this inline, you might be protected from strict-aliasing UB in practice by the function call boundary. But in the non-x86 fallback case, there's no ifunc indirection so it can inline. (You might want to check `#ifdef __AVX2__` (enabled at compile time, e.g. via `gcc -march=native`) and if so just alias `linear_memcpy` to `linear_memcpy_avx2`, skipping ifunc there, too.) – Peter Cordes Feb 05 '21 at 21:01
  • Even on x86 for large copies, but especially if you care about ISAs where unaligned loads / stores take multiple instructions (e.g. older MIPS, older ARM) or are just less efficient (many non-x86), doing an unaligned first chunk and then start with the first aligned chunk is good. Like `(byteptr+8) & -8`. That will partially overlap on unaligned, or not on aligned. An unaligned last chunk (ending at the last byte) works, again for buffers larger than 1 chunk. – Peter Cordes Feb 05 '21 at 21:10
  • IDK if that violates your "linear" requirement, but writing the same thing twice should be fine. It won't make later data available too soon. For cacheable write-back memory, the store buffer committing to L1d will absorb this just fine so it performs very well, better than doing more smaller stores on x86 HW with very efficient unaligned load/store. But if you need to avoid it, you might want to use smaller chunks to reach an alignment boundary, if your typical copy sizes are large enough to be worth it on your HW. – Peter Cordes Feb 05 '21 at 21:14
  • Thanks for the feedback, @PeterCordes. I updated the code according to your first 4 comments. The ifunc resolver should only run at load time, and I need the same binary to function on CPUs with and without AVX2. My use-case is to write to a DMA window on a device that functions as a FIFO (ignores lower address bits), so duplicate writes won't work. I considered your suggestion about a pre-alignment loop for the portable implementation, but I'm not sure if that helps unless the src and dst are misaligned by the same amount. Any further suggestion on that? Thanks! – Joe Hershberger Feb 09 '21 at 03:03
  • Yeah, I understand how ifunc works, but actually being able to *inline* (where the size may be a compile-time constant) is a significant difference. Future readers might be able to compile with `-mavx2` and not need runtime-dispatch, so it could inline. So it's good to fix that for an SO answer. Maybe you don't want to bother with extra `#ifdef __AVX2__` which won't be true for you, though. – Peter Cordes Feb 09 '21 at 03:06
  • Re: alignment: historically the recommendation has been to prefer aligning the destination if you could only pick one (because of possible relative misalignment). That would seem appropriate here, where it seems we care about another thread seeing the stores but aren't apparently worrying about the loads. Aligned stores give less chance for invalidation of a line we've partially written, resulting in needing another RFO (read for ownership) to get ownership of it. – Peter Cordes Feb 09 '21 at 03:10