33

I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise on what I did and didn't think about, but here's some guy's implementation:

__forceinline   // Since Size is usually known,
                // most useless code will be optimized out
                // if the function is inlined.

void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
        void* start = Dst;
        for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
        {
                __m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
                _mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
        }

#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst

    switch (Size) {
    case 0x00:                                                      break;
    case 0x01:      CPY_1B;                                         break;
    case 0x02:              CPY_2B;                                 break;
    case 0x03:      CPY_1B; CPY_2B;                                 break;
    case 0x04:                      CPY_4B;                         break;
    case 0x05:      CPY_1B;         CPY_4B;                         break;
    case 0x06:              CPY_2B; CPY_4B;                         break;
    case 0x07:      CPY_1B; CPY_2B; CPY_4B;                         break;
    case 0x08:                              CPY_8B;                 break;
    case 0x09:      CPY_1B;                 CPY_8B;                 break;
    case 0x0A:              CPY_2B;         CPY_8B;                 break;
    case 0x0B:      CPY_1B; CPY_2B;         CPY_8B;                 break;
    case 0x0C:                      CPY_4B; CPY_8B;                 break;
    case 0x0D:      CPY_1B;         CPY_4B; CPY_8B;                 break;
    case 0x0E:              CPY_2B; CPY_4B; CPY_8B;                 break;
    case 0x0F:      CPY_1B; CPY_2B; CPY_4B; CPY_8B;                 break;
    case 0x10:                                      CPY16B;         break;
    case 0x11:      CPY_1B;                         CPY16B;         break;
    case 0x12:              CPY_2B;                 CPY16B;         break;
    case 0x13:      CPY_1B; CPY_2B;                 CPY16B;         break;
    case 0x14:                      CPY_4B;         CPY16B;         break;
    case 0x15:      CPY_1B;         CPY_4B;         CPY16B;         break;
    case 0x16:              CPY_2B; CPY_4B;         CPY16B;         break;
    case 0x17:      CPY_1B; CPY_2B; CPY_4B;         CPY16B;         break;
    case 0x18:                              CPY_8B; CPY16B;         break;
    case 0x19:      CPY_1B;                 CPY_8B; CPY16B;         break;
    case 0x1A:              CPY_2B;         CPY_8B; CPY16B;         break;
    case 0x1B:      CPY_1B; CPY_2B;         CPY_8B; CPY16B;         break;
    case 0x1C:                      CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1D:      CPY_1B;         CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1E:              CPY_2B; CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1F:      CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B;         break;
    }
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
        return start;
}

The original comment translates to roughly what is shown in the code above: "Since Size is usually known, most useless code will be optimized out if the function is inlined."

I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see that it uses SSE/AVX for the larger chunks of memory, and then, instead of looping over the last < 32 bytes, it does the equivalent of manual unrolling, with some tweaking. So, here are my questions:

  • Why unroll the loop for the last several bytes, but not partially unroll the first (and now single) loop?
  • What about alignment issues? Aren't they important? Should I handle the first several bytes up to some alignment quantum differently, then perform the 256-bit ops on aligned sequences of bytes? And if so, how do I determine the appropriate alignment quantum?
  • What's the most important missing feature in this implementation (if any)?

Features/Principles mentioned in the answers so far

  • You should __restrict__ your parameters. (@chux)
  • The memory bandwidth is a limiting factor; measure your implementation against it. (@Zboson)
  • For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (@Zboson)
  • Multiple threads (may be | are) necessary to saturate the memory bandwidth. (@Zboson)
  • It is probably wise to optimize differently for large and small copy sizes. (@Zboson)
  • (Alignment is important? Not explicitly addressed!)
  • The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (@chux) See the sketch after this list.
  • There are arguments for unrolling your SSE/AVX calls (@BenJackson, here), and arguments against doing so (@PaulR)
  • Non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (@Zboson)
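A rough sketch of how the __restrict__ and Size < 32 hints might look when applied (myMemcpy2 is just an illustrative name, the 32 switch cases are elided, and __restrict__ is spelled __restrict on MSVC):

    __forceinline void* myMemcpy2(char* __restrict Dst, const char* __restrict Src, size_t Size)
    {
        void* start = Dst;
        for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
        {
            __m256i ymm = _mm256_loadu_si256((const __m256i*)Src);
            _mm256_storeu_si256((__m256i*)Dst, ymm);
            Src += sizeof(__m256i);
            Dst += sizeof(__m256i);
        }
        switch (Size & 31) {    // Size is provably < 32 here; the mask lets the
                                // compiler drop the bounds check on its jump table
        /* ... the 32 cases exactly as in the original ... */
        }
        return start;
    }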
einpoklum
  • 1
    @dirkk: Ok, I will, but bear in mind that it's long... – einpoklum Oct 07 '14 at 22:07
  • 1
    (almost) Duff's device in the flesh. – Michael Dorgan Oct 07 '14 at 22:14
  • 1
    @MichaelDorgan: it *looks* like Duff's Device but it's not really. – Paul R Oct 07 '14 at 22:15
  • 1
    You're right - no fall throughs and each switch implemented as a copy. Still, it reminded me of it :) – Michael Dorgan Oct 07 '14 at 22:15
  • 1
    Yes, that was my first thought too. – Paul R Oct 07 '14 at 22:15
  • 2
    @MichaelDorgan: I also thought s/he was doing something arcane and magical, but on closer inspection it's pretty straightforward. It looked like a pipe organ arrangement to me... – einpoklum Oct 07 '14 at 22:19
  • 3
    I really like the expressively arranged `switch` branches. Looks quite nice. 10/10 would commit :) – dom0 Oct 07 '14 at 22:25
  • 2
    "important missing feature in this implementation" is wrong signature. Expected a match to: `void *memcpy(void * restrict s1, const void * restrict s2, size_t n);` – chux - Reinstate Monica Oct 07 '14 at 22:31
  • 2
    Even an optimizing compiler may not discern that `switch (Size)` with its 32 cases matches the `Size` range `0<=Size<32`. Maybe `switch (Size&31)`? Avoid the internally generated `if size > 31`. – chux - Reinstate Monica Oct 07 '14 at 22:34
  • 1
    @tmyklebu: Not quite a code review, since it's not my code. Will edit to clarify a bit more. – einpoklum Oct 08 '14 at 06:18
  • 1
    @einpoklum: Makes a bit more sense now that you're asking more concrete questions. It could go either here or on codereview. – tmyklebu Oct 08 '14 at 13:46
  • 2
    Note that restrict only helps for the parts of your code without intrinsics. Restrict with intrinsics is useless. – Z boson Oct 08 '14 at 18:00
  • 1
    Optimizing an aligned power-of-two copy loop is “easy” (whereby I mean it takes time to experiment, but no special attention to detail or unusual techniques). Most of the fun of implementing `memcpy` is handling misalignment as efficiently as possible. This implementation is suboptimal in that it will issue cacheline- and page-crossing stores when the buffers are unaligned, and it does lots of excess work for cleanup. – Stephen Canon Oct 08 '14 at 19:43
  • 1
    @StephenCanon: Not so easy for us mere mortals... not everyone has a gold badge for the C tag and 50k reputation :-( Also, I usually memcpy buffers of more than 1MB, so fiddling with the edges is not that exciting for me (although I know it's of critical importance for other cases). – einpoklum Oct 08 '14 at 19:47
  • 2
    @einpoklum: I’m being deliberately glib. Even though it’s “easy”, it still takes a fair bit of work, especially if you haven’t done it before. At this point I have shipped at least 7 commercial `memcpy` implementations, so I will admit to having somewhat more experience than most people. =) – Stephen Canon Oct 08 '14 at 19:50
  • 1
    @einpoklum, I updated my answer based on comments by Stephen Canon and also based on general comments on moving memory by Agner Fog in his optimizing assembly manual. Agner discusses several cases of misaligned memory. I would read the section in his manual. – Z boson Oct 09 '14 at 12:34
  • 1
    @StephenCanon, it may be easy to implement a power-of-two copy but that does not mean necessarily that the standard library you are using does even that efficiently (well maybe _yours_ does but GCC builtin and EGLIBC still can be improved). – Z boson Oct 09 '14 at 13:20
  • 1
    @einpoklum, out of curiosity why have you not accepted my answer? What is my answer lacking to your question? I did not fill in every detail (e.g. how to adjust for misalignment) but do you really expect someone to do that for you? – Z boson Nov 17 '15 at 08:55
  • 1
    @Zboson: Essentially because I was thinking the bottom of the question was summarizing the answers, but I guess you earned your accept :-) – einpoklum Nov 17 '15 at 15:06
  • 1
    I've fixed AVX-512 move instructions, added more processors that support AVX-512 to my reply. Hope will be useful. – Maxim Masiutin Jul 01 '17 at 11:13
  • @L.f.: Thanks for the translation :-) – einpoklum Apr 03 '19 at 12:34
  • @einpoklum Pleasure! ;-) – L. F. Apr 03 '19 at 12:36

4 Answers

39

I have been studying how to measure memory bandwidth for Intel processors with various operations, and one of them is memcpy. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).

To write your own efficient memcpy function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.

For small arrays the maximum read and write bandwidth for each core is:

Core2-Ivy Bridge             32 bytes/cycle
Haswell                      64 bytes/cycle

This is the benchmark you should aim for with small arrays. For my tests I assume the arrays are aligned to 64 bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor. Here are my current memcpy results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):

                             GB/s     efficiency
    Core2 (p9600@2.66 GHz)  
        builtin               35.2    41.3%
        eglibc                39.2    46.0%
        asmlib:               76.0    89.3%
        copy_unroll1:         39.1    46.0%
        copy_unroll8:         73.6    86.5%
    Ivy Bridge (E5-1620@3.6 GHz)                        
        builtin              102.2    88.7%
        eglibc:              107.0    92.9%
        asmlib:              107.6    93.4%
        copy_unroll1:        106.9    92.8%
        copy_unroll8:        111.3    96.6%
    Haswell (i5-4250U@1.3 GHz)
        builtin:              68.4    82.2%     
        eglibc:               39.7    47.7%
        asmlib:               73.2    87.6%
        copy_unroll1:         39.6    47.6%
        copy_unroll8:         81.9    98.4%

The asmlib is Agner Fog's asmlib. The copy_unroll1 and copy_unroll8 functions are defined below.

From this table we can see that the GCC builtin memcpy does not work well on Core2 and that the memcpy in EGLIBC does not work well on Core2 or Haswell. I did check out a recent head version of GLIBC and the performance was much better on Haswell. In all cases unrolling gets the best result.

void copy_unroll1(const float *x, float *y, const int n) {
    for(int i=0; i<n/JUMP; i++) {
        VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
    }
}

void copy_unroll8(const float *x, float *y, const int n) {
    for(int i=0; i<n/JUMP; i+=8) {
        VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
        VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
        VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
        VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
        VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
        VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
        VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
        VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
    }
}

Where VECNF().LOAD is _mm_load_ps() for SSE or _mm256_load_ps() for AVX, VECNF().STORE is _mm_store_ps() for SSE or _mm256_store_ps() for AVX, and JUMP is 4 for SSE or 8 for AVX.

For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe, a single thread does NOT usually saturate the memory bandwidth.

void copy_stream(const float *x, float *y, const int n) {
    #pragma omp parallel for        
    for(int i=0; i<n/JUMP; i++) {
        VECNF v = VECNF().load_a(&x[JUMP*i]);
        stream(&y[JUMP*i], v);
    }
}

Where stream is _mm_stream_ps() for SSE or _mm256_stream_ps() for AVX.

Here are the memcpy results on my E5-1620@3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.

                         GB/s     efficiency
    eglibc:              23.6     46%
    asmlib:              36.7     72%
    copy_stream:         36.7     72%

Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.

I modified the eglibc and asmlib memcpy functions to run in parallel like this:

void COPY(const float * __restrict x, float * __restrict y, const int n) {
    #pragma omp parallel
    {
        size_t my_start, my_size;
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();
        my_start = (id*n)/num;
        my_size = ((id+1)*n)/num - my_start;
        memcpy(y+my_start, x+my_start, sizeof(float)*my_size);
    }
}

A general memcpy function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the size of the largest cache level (usually the L3). But these are "second order" details which I think should be dealt with after optimizing for the ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
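As a rough sketch of that rule of thumb (the 4 MB threshold is an assumed stand-in for half of an 8 MB L3 and would normally be determined at run time; copy_unroll8 and copy_stream are the functions above):

    // Sketch: pick a strategy based on the copy size (threshold is an assumption).
    void copy_dispatch(const float *x, float *y, const int n) {
        const size_t nt_threshold = 4*1024*1024;     // ~half of an assumed 8 MB L3
        if ((size_t)n*sizeof(float) < nt_threshold)
            copy_unroll8(x, y, n);   // small/medium: regular (cached) stores
        else
            copy_stream(x, y, n);    // large: non-temporal stores + threads
    }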

Update

Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb than a non-temporal store instruction such as movntdq. Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization Manual in section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB).

Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:

"There are several ways of moving large blocks of data. The most common methods are:

  1. REP MOVS instruction.
  2. If data are aligned: Read and write in a loop with the largest available register size.
  3. If size is constant: inline move instructions.
  4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
  5. If data are misaligned: Read aligned, shift to compensate for misalignment and write aligned.
  6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary."

A general memcpy should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpy function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy or the one in EGLIBC, so the assumption that you can't do better than the standard libraries is incorrect.

Z boson
  • A few notes/questions: 1. "sizes larger than half **a cache line** in the largest level", right? 2. Got your point about first- and second-order optimizations, but suppose I choose your unroll8 variant; is alignment important there? I assume your benchmark used aligned buffers. 3. Does the `omp_parallel` help because of the presence of 2 Load/Store units? Will it produce two threads? 4. Isn't using OpenMP here kind of like cheating? – einpoklum Oct 08 '14 at 14:02
  • @einpoklum, I mean half the size of the slowest cache. On a system with an 8 MB L3 cache half the size would be 4 MB. I can't say I know this rule of thumb from experience. It's something I read. But there is no question that non-temporal stores make a significant difference when the size is much larger than the slowest cache (e.g. for 1 GB). – Z boson Oct 08 '14 at 17:41
  • @einpoklum, for alignment you should try it and see. I only compared the aligned vs. unaligned instructions with aligned memory and I got better results with the aligned instructions. My buffers are aligned to 4096 bytes. Remember that I'm trying to get closest to the theoretical max. Once I achieve this I can optimize for less ideal cases but I doubt I'll do this because, like you, this is only for education purposes. – Z boson Oct 08 '14 at 17:43
  • @einpoklum, I set the number of threads to the number of physical cores and then bound the threads. To understand why read the question,answers, and comments at https://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays. But I don't think it's cheating to use multiple threads. This could really be used to improve the efficiency (speed) of a `memcpy` for large arrays (especially for a NUMA system). However, for small arrays the OpenMP overhead dominates and the result would actually be worse. – Z boson Oct 08 '14 at 17:47
  • @einpoklum, see this question/answer for more about memset (same logic for memcpy) on a single socket and multi-socket (NUMA) system https://stackoverflow.com/questions/11576670/in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987?noredirect=1#comment39737599_11579987. – Z boson Oct 08 '14 at 17:50
  • Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat `movntdqa` using `rep movsb`; `movntdqa` incurs a RFO into LLC, `rep movsb` does not. – Stephen Canon Oct 08 '14 at 19:04
  • @StephenCanon, MLC means mid level cache? The L2 cache? RFO = Read For Ownership? and LLC = last level cache? I guess that's the term I mean when I say half the slowest cache. I mean half the size of the LLC. – Z boson Oct 08 '14 at 19:18
  • @StephenCanon, you said buffers too large to fit in the MLC. I infer that that means larger than the LLC as well. So you mean I can do better than `movntdqa` for my 1 GB case? I got some research to do. Thanks! – Z boson Oct 08 '14 at 19:19
  • 7
    Yes, `rep movsb` is significantly faster than `movntdqa` when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!) – Stephen Canon Oct 08 '14 at 19:24
  • @StephenCanon, okay I see a section "17.9 Moving blocks of data (All processors)" in Agner Fog's optimizing assembly manual describing `rep movsb` and many other useful points. I somehow missed this very relevant section. – Z boson Oct 08 '14 at 19:51
  • @Zboson: There’s also some discussion in Intel’s optimization manual. – Stephen Canon Oct 08 '14 at 19:55
  • @StephenCanon, good point. I found a section "3.7.7 Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel Optmization manual and then a section "3.7.7.1 Memcpy Considerations". This is excellent information. – Z boson Oct 08 '14 at 20:02
  • @StephenCanon, I finally started to look into `enhanced rep movsb` http://stackoverflow.com/q/43343231/2542702. – Z boson Apr 11 '17 at 12:07
  • @StephenCanon - most tests that I've seen show that `rep movsb` is not faster than a properly written copy with NT stores. On IvB it seems to be at its most competitive (but still generally slower) while on Haswell and more recent chips it seems to be about 20% slower in general (depending on a lot of factors, including interaction with power-management heuristics). Generally it seems to fall in between the "non-NT solutions" that don't use NT stores at all, and the full-on NT stuff - but I certainly never saw a case where it was "significantly faster". – BeeOnRope May 08 '17 at 18:53
6

The question can't be answered precisely without some additional details such as:

  • What is the target platform (CPU architecture, mostly, but the memory configuration plays a role too)?
  • What is the distribution and predictability1 of the copy lengths (and to a lesser extent, the distribution and predictability of alignments)?
  • Will the copy size ever be statically known at compile-time?

Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.

32-case Switch Statement

The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.

Code Size

This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of memcpy on a full-sized CPU because everything still fits in the fastest cache level: but in the real world you execute other code too and there is contention for the uop cache and the L1 data and instruction caches.

That many instructions may take fully 20% of the effective size of your uop cache3, and uop cache misses (and the corresponding cache-to-legacy-decoder transition cycles) could easily wipe out the small benefit given by this elaborate switch.

On top of that, the switch requires a 32-entry, 256-byte lookup table for the jump targets4. If you ever get a miss to DRAM on that lookup, you are talking about a penalty of 150+ cycles: how many non-misses do you need, then, to make the switch worth it, given that it's probably saving a cycle or two at the most? Again, that won't show up in a microbenchmark.

For what it's worth, this memcpy isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general-purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.

Branch Prediction

The switch statement relies on a single indirect branch to choose among the alternatives. This is going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.

Because it is an indirect branch, there are more limits on the predictability of the branch than for a conditional branch, since there is a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to memcpy doesn't follow a simple repeating pattern with a short period (as short as 1 or 2 on older CPUs), there will be a branch mispredict on each call.

This issue is particularly insidious because it is likely to hurt you the most in the real world in exactly the situations where a microbenchmark shows the switch to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, the switch is all-important (indeed, for copies of 31 bytes or less it is all that executes)!

For these short lengths, a predictable series of lengths works very well for the switch since the indirect jump is basically free. In particular, a typical memcpy benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test to report the results for easy plotting of "time vs length" graphs. The switch does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.

In the real world, your lengths might be small but unpredictable. In that case, the indirect branch will frequently mispredict5, with a penalty of ~20 cycles on modern CPUs. Compared to the best case of a couple of cycles, it is an order of magnitude worse. So the glass jaw here can be very serious (i.e., the behavior of the switch in this typical case can be an order of magnitude worse than the best, whereas at long lengths you are usually looking at a difference of 50% at most between different strategies).

Solutions

So how can you do better than the above, at least under the conditions where the switch falls apart?

Use Duff's Device

One solution to the code size issue is to combine the switch cases together, Duff's device style.

For example, the assembled code for the length 1, 3 and 7 cases looks like:

Length 1

    movzx   edx, BYTE PTR [rsi]
    mov     BYTE PTR [rcx], dl
    ret

Length 3

    movzx   edx, BYTE PTR [rsi]
    mov     BYTE PTR [rcx], dl
    movzx   edx, WORD PTR [rsi+1]
    mov     WORD PTR [rcx+1], dx

Length 7

    movzx   edx, BYTE PTR [rsi]
    mov     BYTE PTR [rcx], dl
    movzx   edx, WORD PTR [rsi+1]
    mov     WORD PTR [rcx+1], dx
    mov     edx, DWORD PTR [rsi+3]
    mov     DWORD PTR [rcx+3], edx
    ret

This can be combined into a single case, with various jump-ins:

    len7:
    mov     edx, DWORD PTR [rsi-6]
    mov     DWORD PTR [rcx-6], edx
    len3:
    movzx   edx, WORD PTR [rsi-2]
    mov     WORD PTR [rcx-2], dx
    len1:
    movzx   edx, BYTE PTR [rsi]
    mov     BYTE PTR [rcx], dl
    ret

The labels don't cost anything, and they combine the cases together and remove two out of the three ret instructions. Note that the base for rsi and rcx has changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.

You can extend that for longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction alone from this approach, and much better if you combine it with something else to collapse the sizes from 16 - 31.

This approach only helps with the code size (and possibly the jump table size, if you shrink the table as described in 4 and get under 256 bytes, allowing a byte-sized lookup table). It does nothing for predictability.

Overlapping Stores

One trick that helps for both code size and predictability is to use overlapping stores. That is, memcpy of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative position 0 and 11 - 8 == 3. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.

The C code looks like:

  if (Size >= 8) {
    *((uint64_t*)Dst) = *((const uint64_t*)Src);
    size_t offset = Size & 0x7;
    *(uint64_t *)(Dst + offset) = *(const uint64_t *)(Src + offset);
  }

... and the corresponding assembly is not problematic:

    cmp     rdx, 7
    jbe     .L8
    mov     rcx, QWORD PTR [rsi]
    and     edx, 7
    mov     QWORD PTR [rdi], rcx
    mov     rcx, QWORD PTR [rsi+rdx]
    mov     QWORD PTR [rdi+rdx], rcx

In particular, note that you get exactly two loads, two stores and one and (in addition to the cmp and jmp whose existence depends on how you organize the surrounding code). That's already tied or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.

Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty6. This has two main advantages:

  1. The behavior is branch free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.

  2. At least 8 or 9 different cases from the switch are subsumed into a single case with a fraction of the total code size.

This approach can be combined with the switch approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.

What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
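For instance, a minimal sketch of handling 16 to 31 bytes the same way with two overlapping 16-byte vectors (assuming Size is already known to be in that range, and using SSE2 intrinsics from <emmintrin.h> rather than raw asm):

    // Branch-free copy of 16..31 bytes using two overlapping 16-byte moves.
    static inline void copy_16_to_31(char *Dst, const char *Src, size_t Size) {
        __m128i head = _mm_loadu_si128((const __m128i *)Src);
        __m128i tail = _mm_loadu_si128((const __m128i *)(Src + Size - 16));
        _mm_storeu_si128((__m128i *)Dst, head);
        _mm_storeu_si128((__m128i *)(Dst + Size - 16), tail);
    }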

Alignment

The existing code doesn't address alignment.

In fact, it isn't, in general, legal C or C++, since the char * pointers are simply cast to larger types and dereferenced, which is not legal - although in practice it generates code that works on today's x86 compilers (but in fact would fail on platforms with stricter alignment requirements).

Beyond that, it is often better to handle the alignment specifically. There are three main cases:

  1. The source and destination are already aligned. Even the original algorithm will work fine here.
  2. The source and destination are relatively aligned, but absolutely misaligned. That is, there is a value A that can be added to both the source and destination such that both are aligned.
  3. The source and destination are fully misaligned (i.e., they are not actually aligned and case (2) does not apply).

The existing algorithm will work OK in case (1). It is potentially missing a large optimization in case (2), since a small intro loop could turn an unaligned copy into an aligned one.

It is also likely performing poorly in case (3), since in general in the totally misaligned case you can choose to either align the destination or the source and then proceed "semi-aligned".

The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.

If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
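A common pattern for cases (2) and (3) is to peel off just enough bytes to make the destination aligned and then run the bulk loop with aligned stores and unaligned loads. A rough sketch (assuming Size >= 32 and <immintrin.h>/<stdint.h>, and leaning on the overlapping-store trick for the head):

    static void copy_align_dst(char *Dst, const char *Src, size_t Size) {
        // Peel: one unaligned 32-byte head copy, then advance so Dst is 32-byte aligned.
        size_t peel = (-(uintptr_t)Dst) & 31;
        if (peel) {
            _mm256_storeu_si256((__m256i *)Dst,
                                _mm256_loadu_si256((const __m256i *)Src));
            Dst += peel; Src += peel; Size -= peel;
        }
        // Bulk: destination is now aligned; the source may still be misaligned.
        for (; Size >= 32; Size -= 32, Dst += 32, Src += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)Src);
            _mm256_store_si256((__m256i *)Dst, v);   // aligned store
        }
        // ... handle the remaining < 32 bytes as before ...
    }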

No unrolling

The code is not unrolled, and compilers unroll by different amounts by default. Clearly this is suboptimal since among two compilers with different unroll strategies, at most one will be best.

The best approach (at least for known platform targets) is to determine which unroll factor is best, and then apply that in the code.

Furthermore, the unrolling can often be combined in a smart way with the "intro" or "outro" code, doing a better job than the compiler could.

Known sizes

The primary reason that it is tough to beat the "builtin" memcpy routine with modern compilers is that compilers don't just call a library memcpy whenever memcpy appears in the source. They know the contract of memcpy and are free to implement it with a single inlined instruction, or even less7, in the right scenario.

This is especially obvious with known lengths in memcpy. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in place. This not only avoids the overhead of the function call, but also all the checks on size and so on - and it also generates efficient code for the copy at compile time, much like the big switch in the implementation above - but without the costs of the switch.

Similarly, the compiler knows a lot about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.

If you just implement a memcpy2 as a library function, that is tough to replicate. You can get part of the way there by splitting the method into a small and a big part: the small part appears in the header file, does some size checks, and potentially just calls the existing memcpy if the size is small or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtin memcpy.

Finally, you can also try tricks with __builtin_constant_p or equivalents to handle the small, known case efficiently.
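A minimal sketch of those last two ideas combined (my_memcpy and my_memcpy_big are illustrative names; __builtin_constant_p and __builtin_memcpy are GCC/Clang builtins):

    void *my_memcpy_big(void *dst, const void *src, size_t n);  // out-of-line part

    // In the header: small, compile-time-constant sizes get expanded inline by
    // the compiler's own builtin; everything else goes to the out-of-line routine.
    static inline void *my_memcpy(void *dst, const void *src, size_t n) {
        if (__builtin_constant_p(n) && n <= 64)
            return __builtin_memcpy(dst, src, n);
        return my_memcpy_big(dst, src, n);
    }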


1 Note that I'm drawing a distinction here between the "distribution" of sizes - e.g., you might say uniformly distributed between 8 and 24 bytes - and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predictable pattern?). The question of predictability is somewhat subtle because it depends on the implementation, since as described above certain implementations are inherently more predictable.

2 In particular, ~750 bytes of instructions in clang and ~600 bytes in gcc for the body alone, on top of the 256-byte jump lookup table for the switch body which had 180 - 250 instructions (gcc and clang respectively). Godbolt link.

3 Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 CPUs have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.

4 The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table at the cost of not using memory-source for the jmp, cutting its size by 75%.

5 Unlike conditional branch prediction, which has a typical worst-case prediction rate of ~50% (for totally random branches), a hard-to-predict indirect branch can easily approach 100% misprediction since you aren't flipping a coin, you are choosing from an almost infinite set of branch targets. This happens in the real world: if memcpy is being used to copy small strings with lengths uniformly distributed between 0 and 30, the switch code will mispredict ~97% of the time.

6 Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.

7 For example, a memcpy to the stack, followed by some manipulation and a copy somewhere else may be totally eliminated, directly moving the original data to its final location. Even things like malloc followed by memcpy can be totally eliminated.

BeeOnRope
  • Overlapping stores is a very nice idea. For example, you need to copy 15 bytes. You just copy 2 blocks of 8 bytes with one byte overlap. mov rax, [rsi+0]; mov rbx, [rsi+7]; mov [rdi+0], rax; mov [rdi+7], rbx Here is how inefficiently Microsoft's memcpy copies 15 bytes: mov r8, qword ptr [rdx]; mov ecx, dword ptr 8[rdx]; movzx r9d, word ptr 12[rdx]; movzx r10d, byte ptr 14[rdx]; -- then copy these values back - so Microsoft uses 5 moves on load and 5 moves on store, while we are using just 2 moves on load and 2 moves on store by using overlapping moves. – Maxim Masiutin May 09 '17 at 03:26
  • I read at http://www.agner.org/optimize/ that dynamic jumps (by table/index) should be avoided at all costs at modern processors. So the code like lea r9, OFFSET __ImageBase; mov ecx, [(IMAGEREL MoveSmall) + r9 +r8*4]; add rcx, r9; jmp rcx --- becomes very slow on modern processors. Do you have any insight on this? – Maxim Masiutin May 09 '17 at 03:39
  • Maybe the following code should be faster at least for the cases of up to 16 bytes: cmp ecx, 0 jz exit mov al, [esi] mov [edi], al cmp ecx, 1 je exit mov al, [esi+1] mov [edi+1], al cmp ecx, 2 je exit mov al, [esi+2] mov [edi+2], al cmp ecx, 3 je exit mov al, [esi+3] mov [edi+3], al cmp ecx, 4 je exit ... and so on – Maxim Masiutin May 09 '17 at 03:44
  • @MaximMasiutin - yes, overlapping stores is nice, but it does have some limitations - e.g., you still need to branch for less than 8 bytes if you are doing 8-byte moves. In an application specific case, you might be able to get around this if you allow a few bytes of padding at the end of the region you are copying to, in which case you can copy "don't care" bytes past the end. – BeeOnRope May 09 '17 at 04:07
  • @MaximMasiutin - you might want to be specific about the quote, but based on what you said Agner is just saying that indirect jumps are slow. In fact, indirect jumps certainly have the _potential_ to be slow, but they aren't inherently slow if they are well-predicted. If they are well-predicted, they may be fast like other types of jumps. I explained why such jumps might be unpredictable in some detail above. – BeeOnRope May 09 '17 at 04:09
  • 2
    @MaximMasiutin - your "chain of jumps" it is probably worse than indirect jump approach. Basically you have to look at the _predictability_ of each sequence. In general, your sequence is going to be unpredictable when the sequence is unpredictable, and otherwise OK - just like the indirect jump. A mispredicted branch is approximately just as bad whether it is indirect or not, so you don't usually win prediction-wise by changing it to a series of conditional branches. You lose a bunch: more instructions, copying one byte a time, more branch prediction resources consumed, etc. – BeeOnRope May 09 '17 at 04:16
  • Another tip: if we cannot align both the source and the destination, align only the destination and use unaligned loads (vmovdqu) and aligned stores (vmovdqa). Since we have two load units but just one store units, the benefit of aligned store should be higher than of aligned load. ;-) – Maxim Masiutin May 09 '17 at 14:41
  • 1
    I'm just getting started reading this answer... (1) +1 already for mentioning the code size issue. However - are you sure the compiler won't do something about that? (2) What do you mean by "memory configuration? whether we have matching modules? Or do you mean the exact timing figures? How would that help? As for the architecture - are you asking only because of the availability of AVX, AVX-2, AVX-512 or for other reasons? – einpoklum May 09 '17 at 16:16
  • 1
    (3) About the branch prediction - actually, whenever you copy something of a fixed length - and short copies are most probably of fixed length - the compiler should (?) just drop the branch altogether when it inlines. For long, unknown-at-compile-time copies - while they can theoretically be of arbitrary length, it's not unreasonable to assume that the common case will be a length divisible by 32, i.e the switch case for 0x0. I know all this is speculative, but it's not farfetched speculation... – einpoklum May 09 '17 at 16:21
  • 1
    @einpoklum - the compiler doesn't do anything about it (other than compiling it reasonably well, but it's still 32 separate cases) and I cover it in my answer, including a link to the generated assembly on x86 for `gcc` and `clang` (see footnote 2). – BeeOnRope May 09 '17 at 16:22
  • 1
    @einpoklum - by "memory configuration", I mean a variety of things, but big ones are the memory latency and bandwidth compared to the CPU frequency. For example, in the past, many systems couldn't achieve their full memory bandwidth from a single core, since the maximum transfer size * hardware MLP buffers / latency < the bandwidth. Currently, Intel systems are a mix: some _can_ achieve the maximum BW with a single core (like my 6700HQ): those with relative low memory bandwidth and/or relatively high frequency. Which side they fall on is very important for "NT" or "not NT". – BeeOnRope May 09 '17 at 16:26
  • 1
    @einpoklum - well the compiler certainly isn't going to inline the whole `memcpy` above, and they don't do "partial inlining" as far as I know (i.e., inlining the initial part of the function and then calling the rest out of line) - which was my point that you may want to split the funciton up to give a chance of inlining. Of course, not-fixed-length copies are extremely common. I don't know which are "more common" but it's safe to say that most copies are very short and many copies are arbitrary length. Pretty much every C++ data structure is doing small variable length copies under the hood. – BeeOnRope May 09 '17 at 16:29
  • 1
    @BeeOnRope: Re compiler "taking care" of the large amount of code - I saw the link now, yes. Still, my comment about a call with a fixed length stands (for now). Re achieving full bandwidth with one core - I was under the mistaken impression that in all reasonable cases you need more than one core for it; thanks. – einpoklum May 09 '17 at 16:29
  • 3
    @einpoklum - recent Intel chips can drive about 30 GB/s from one core, and many chips have about that much BW. The bigger parts with quad channel memory you need more than one core for sure. Basically, you can hit your full BW from one core, you definitely want NT stores. If you can't, you may find that normal stores are faster (but only for one core, once you go to more cores, NT will eventually win since it saves bandwidth). – BeeOnRope May 09 '17 at 16:32
  • @BeeOnRope: On Intel CPUs with many cores, *per-core* bandwidth to L3 and/or to RAM is actually *lower* than on a quad-core desktop part. The latency over the ring bus is higher, but the number of buffers to track oustanding requests is fixed. So maximum concurrency is fixed, and can't keep the pipe as full on a big Xeon. – Peter Cordes Jul 03 '17 at 14:47
  • @PeterCordes - to be clear, you are talking about the maximum bandwidth from a single core on an otherwise idle system, right? I could also interpret _per-core_ to mean the total _per-core_ bandwidth with all cores active at once, which is also much lower on the big chips but only because you essentially have a DRAM bandwidth fixed by the channel/memory speed configuration, divided among all cores on a socket, you get much less per core (but you know that). – BeeOnRope Jul 03 '17 at 17:26
  • @BeeOnRope. Yes, on an otherwise-idle system. Good point on the ambiguity. – Peter Cordes Jul 03 '17 at 17:28
  • 1
    See [this Intel forum thread](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/480004) for some detailed discussion. – Peter Cordes Jul 03 '17 at 17:31
  • 1
    Things can be even worse on a multi-socket system, because requests that miss in local L3 have to snoop the other socket's L3 to maintain coherency. Before Haswell, this could really suck if all the cores on the other socket were in C1E state, so it enters "package C1E" state and the uncore clocks way down. See [John McCalpin's message near the end of this thread](https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/379378). Haswell allowed the uncore clock to stay high even while all cores were asleep. – Peter Cordes Jul 03 '17 at 17:35
  • @PeterCordes - right. The latency to DRAM on the server uncore parts is known to often be much worse than the client parts, so for concurrency limited scenarios (like 1 active core), I can imagine it reducing the DRAM bandwidth a lot. The L3 latency though is pretty similar between the parts, I think - the few extra ring stops might add a couple cycles on average, but it's a small impact, I think? So I would blame most of the extra latency in the DRAM case on things further downstream, like the server-part memory controller, not the ring bus. – BeeOnRope Jul 03 '17 at 17:43
  • @PeterCordes - good point about the multi-socket snooping. I was thinking in the context of a single socket only. – BeeOnRope Jul 03 '17 at 17:44
  • @BeeOnRope: I think I remember measuring lower per-clock L3 bandwidth on HSW and SKL Xeon than on my desktop. (On a Google Cloud VM, but taking the best-case numbers on the assumption that those are the no-contention cases. The SKL VM had low measurement noise since that was before they were general-availability. :) Those were almost certainly on dual-socket hardware, with a massive number of cores per chip. (like 28 for the SKL-X). I should go and check my notes... – Peter Cordes Jul 03 '17 at 17:46
  • Here are some [Haswell](http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_NewMemLat.txt) and [Haswell-EP](http://users.atw.hu/instlatx64/GenuineIntel00306F2_HaswellEP2_NewMemLat.txt) numbers, which seems to show a latency of 40ish cycles for the client part and 50ish cycles for the server part (no doubt there may be some TLB miss contribution here too) - but those are in cycles and the server part has a slower clock speed so measured in time the gap would probably be somewhat bigger. – BeeOnRope Jul 03 '17 at 17:56
  • ... but still that's something like a 5 ns difference max, while memory latencies are often in the 30-40 ns difference range between server and client (e.g., client parts at close to 50ns and server parts at closer to 85ns). – BeeOnRope Jul 03 '17 at 17:58
4

Firstly the main loop uses unaligned AVX vector loads/stores to copy 32 bytes at a time, until there are < 32 bytes left to copy:

    for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
    {
        __m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
        _mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
    }

Then the final switch statement handles the residual 0..31 bytes in as efficient a manner as possible, using a combination of 8/4/2/1 byte copies as appropriate. Note that this is not an unrolled loop - it's just 32 different optimised code paths which handle the residual bytes using the minimum number of loads and stores.

As for why the main 32 byte AVX loop is not manually unrolled - there are several possible reasons for this:

  • most compilers will unroll small loops automatically (depending on loop size and optimisation switches)
  • excessive unrolling can cause small loops to spill out of the LSD cache (typically only 28 decoded µops)
  • on current Core iX CPUs you can only issue two concurrent loads/stores before you stall [*]
  • typically even a non-unrolled AVX loop like this can saturate available DRAM bandwidth [*]

[*] note that the last two comments above apply to cases where source and/or destination are not in cache (i.e. writing/reading to/from DRAM), and therefore load/store latency is high.
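For reference, if you did want to experiment with manually unrolling that main loop (as discussed in the comments below and in the other answers), a 4x-unrolled variant might look like this sketch, intended to replace the 32-byte loop inside myMemcpy:

    // Sketch: same AVX copy, manually unrolled 4x; whether this helps depends
    // on whether the buffers are cache-resident, as noted above.
    for ( ; Size >= 4 * sizeof(__m256i); Size -= 4 * sizeof(__m256i) )
    {
        __m256i y0 = _mm256_loadu_si256((const __m256i*)Src + 0);
        __m256i y1 = _mm256_loadu_si256((const __m256i*)Src + 1);
        __m256i y2 = _mm256_loadu_si256((const __m256i*)Src + 2);
        __m256i y3 = _mm256_loadu_si256((const __m256i*)Src + 3);
        _mm256_storeu_si256((__m256i*)Dst + 0, y0);
        _mm256_storeu_si256((__m256i*)Dst + 1, y1);
        _mm256_storeu_si256((__m256i*)Dst + 2, y2);
        _mm256_storeu_si256((__m256i*)Dst + 3, y3);
        Src += 4 * sizeof(__m256i);
        Dst += 4 * sizeof(__m256i);
    }
    // then fall through to the original 32-byte loop and the tail switch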

Paul R
  • There's only one loop because the second one has been unrolled completely. I know what the code does, that's not what I asked. – einpoklum Oct 07 '14 at 22:11
  • 3
    The switch statement is **not** an unrolled loop - it's just 32 different code paths depending on how many bytes are left to copy. – Paul R Oct 07 '14 at 22:12
  • `while ( (size--) > 0) *(Dst++) = *(Src++);` is what it does, isn't it? :-) – einpoklum Oct 07 '14 at 22:17
  • 3
    Note the different copy sizes (1, 2, 4, 8 bytes) - this is not a scalar loop that has been unrolled, it's just 31 different small optimised copies to clean up the residual bytes. Call it what you will, but you're missing the point - in the general case the heavy lifting is done by the AVX loop. – Paul R Oct 07 '14 at 22:20
  • OK - there are several reasons why it might not be a good idea to manually unroll this first loop - I'll edit my answer shortly to expand on these. – Paul R Oct 07 '14 at 22:34
  • 1
    The loop is not unrolled because it's not. If it had been unrolled the results would be much different for small array sizes. For Core2-Haswell I get better results unrolling four or eight times with that loop. On Haswell not unrolling gets less than 50% of the peak (I get about 47%). Unrolling eight times on Haswell gets about 98%. – Z boson Oct 08 '14 at 11:54
  • @Zboson: are you also specifically disabling automatic loop unrolling by the compiler (e.g. using `-O2`, `-Os` or `-fno-unroll-loops`) ? – Paul R Oct 08 '14 at 11:55
  • @PaulR, no but I was just commenting on that. GCC only unrolls the loop if I specifics `-funroll-loops`. In that case the it unrolls eight times and the results are much better. But I prefer to unroll by hand anyway. – Z boson Oct 08 '14 at 11:57
  • I don't understand your comment "on current Core iX CPUs you can only issue two concurrent loads/stores before you stall". – Z boson Oct 08 '14 at 11:59
  • Sure - manual unrolling is often better than automatic unrolling - I just didn't know whether you were comparing like with like. – Paul R Oct 08 '14 at 11:59
  • @Zboson: well there are typically two load/store units, so if you you issue a load and a store concurrently then any further load/store instructions will stall until one of these is retired. – Paul R Oct 08 '14 at 12:00
  • Yes, but on Core2-Ivy Bridge for memcpy it can do one 16 byte read and one 16 byte store per clock cycle. On Haswell it's one 32 byte read and one 32 byte write per clock cycle. Note that the read,read,write (e.g. in the STREAM triad function) bandwidth is different. – Z boson Oct 08 '14 at 12:04
  • @Zboson: sure, that's fine when reading/writing L1/L2 cache, but for DRAM access as soon as you get a cache miss it's going to be many clock cycles before the load or store is retired. – Paul R Oct 08 '14 at 12:06
  • I agree that for DRAM access unrolling is useless. For DRAM access for large sizes instead of unrolling non-temporal stores should be used. – Z boson Oct 08 '14 at 12:09
  • Speaking of which, do you know why non-temporal stores are not used more? EGLIBC does not used them. Agner Fog's asmlib uses them. I don't understand why they are not used more. I have been meaning to ask a SO question about this. – Z boson Oct 08 '14 at 12:10
  • OK - that explains the confusion - I'll add a qualifying remark to my answer to clarify that some of the arguments apply to memcpy to/from DRAM. – Paul R Oct 08 '14 at 12:11
  • I've never had much luck with the non-temporal stores, but I don't tend to write things like memcpy replacements, so it's not something I've looked at a great deal. – Paul R Oct 08 '14 at 12:12
  • 1
    Yeah, I tried to make that clear at the start of my answer. A general `memcpy` function has to optimize for small and large differently. – Z boson Oct 08 '14 at 12:12
  • 1
    @Zboson: I made a comment on NT stores on your answer, but I’ll expand here: the semantics of x86 NT stores are flawed for use in `memcpy`; they are disastrously slow when they hit L1, and they require a read-for-ownership when they miss L3. Thus, `vmovaps` is much faster for small copies, and `rep movs` is much faster for large copies (on Ivybridge and later). Also, remember that the NT stores require a fence, which isn’t a huge hassle, but it’s one more detail to remember. – Stephen Canon Oct 08 '14 at 19:10
  • @StephenCanon, okay, I was not aware of `rep movs`. Thank you for the information. I need to learn about `rep movs`. – Z boson Oct 08 '14 at 19:24
  • @StephenCanon, does this apply to Sandy Bridge as well or just Ivy Bridge and Haswell? – Z boson Oct 08 '14 at 19:25
  • 1
    @Zboson: IVB and onward only. It’s one of the primary micro-architectural differences between IVB and SNB. Intel calls the feature “ERMSB” (enhanced rep movsb/stosb) – Stephen Canon Oct 08 '14 at 19:26
  • @PaulR: What would you reply to the arguments for unrolling made [here](http://stackoverflow.com/a/18320068/1593077)? – einpoklum Oct 08 '14 at 20:07
  • @einpoklum: I defer to Zboson on this, as he's studied memcpy implementation in far greater detail than I have, but note that my comments were mainly in regard to large copies, where DRAM bandwidth tends to be the limiting factor, whereas I think Zb's focus has been more on smaller copies, where bandwidth is much higher and loop unrolling is more likely to be of benefit. Note also that Zb's baseline is `-fno-unroll-loops`, so he's comparing manual unrolling with no automatic unrolling by the compiler. An interesting discussion all round though. – Paul R Oct 08 '14 at 21:10
  • @einpoklum: oops - my bad - I missed the fact that "here" was a link to another question - I thought you meant "here" as in the above discussion. I'll get back to you... – Paul R Oct 08 '14 at 21:12
  • @PaulR, to clarify I don't specify `-fno-unroll-loops`. I just use `-O3`. The way I read your answer it sounds like the compiler will unroll if it thinks it's wise using only e.g. `-O3`. However, I have never observed this with intrinsics. So the only way to get unrolling is to explicitly tell the compiler to do it with `-funroll-loops` (or to manually unroll as I have done). – Z boson Oct 14 '14 at 09:44
  • @Zboson: thanks for the clarification - I think older versions of gcc behaved differently in regard to loop unrolling and `-O3` - it seems to be off by default these days, at least for x86 targets. – Paul R Oct 14 '14 at 10:08
  • @Zboson and PaulR about : "... well there are typically two load/store units, so if you you issue a load and a store concurrently then any further load/store instructions will stall until one of these is retired" - that's definitely not the way modern (i.e., last 20 years) big OoO cores act. Sure there are only two load ports, but that just limits how many loads can be issued per cycle. The loads themselves go into a load queue and whether they hit in L1, L2, ..., or miss all the way to DRAM, the processor keeps on going and executing instructions that don't depend on the load. – BeeOnRope May 08 '17 at 21:47
  • In particular, recent CPUs have an out-of-order window (ROB size) of some 200 instructions, so you can do a lot of work even after a miss to DRAM. Most importantly, you can keep issuing more loads, which may also miss (on recent Intel, for example, up to about 10 loads can be "in flight" in this way at once). That why, for example, a pointer chasing load that misses randomly to memory will be nearly an order of magnitude slower than a load that randomly accesses the same locations, but whose addresses are stored in an array: the latter scenario has high MLP the CPU can take advantage of. – BeeOnRope May 08 '17 at 21:49
3

Taking Benefits of The ERMSB

Please also consider using REP MOVSB for larger blocks.

As you know, starting with the first Pentium CPU, produced in 1993, Intel began to make simple instructions faster and complex instructions (like REP MOVSB) slower. So REP MOVSB became very slow, and there was no longer any reason to use it. In 2013, Intel decided to revisit REP MOVSB. If the CPU has the CPUID ERMSB (Enhanced REP MOVSB) bit, REP MOVSB instructions are executed differently than on older processors, and are supposed to be fast. In practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:

  • both the source and destination addresses have to be aligned to a 16-Byte boundary;
  • the source region should not overlap with the destination region;
  • the length has to be a multiple of 64 to produce higher performance;
  • the direction has to be forward (CLD).

See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Intel recommends using AVX for blocks smaller than 2048 bytes. For larger blocks, Intel recommends using REP MOVSB. This is because of the high initial startup cost of REP MOVSB (about 35 cycles).

I have done speed tests, and for blocks of 2048 bytes and larger, the performance of REP MOVSB is unbeatable. However, for blocks smaller than 256 bytes, REP MOVSB is very slow, even slower than a plain MOV RAX back and forth in a loop.

Please note that ERMSB only affects MOVSB, not MOVSD (MOVSQ), so MOVSB is a little bit faster than MOVSD (MOVSQ).

So, you can use AVX for your memcpy() implementation, and if the block is larger than 2048 bytes and all the conditions are met, then call REP MOVSB - so your memcpy() implementation will be unbeatable.
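A rough sketch of that dispatch (the 2048-byte threshold comes from the discussion above; copy_avx_small stands in for your AVX path and is assumed; the rep movsb wrapper uses GCC/Clang inline assembly - on MSVC the __movsb() intrinsic from <intrin.h> is the equivalent):

    void copy_avx_small(void *dst, const void *src, size_t n);  // assumed: your AVX path

    static inline void rep_movsb(void *dst, const void *src, size_t n) {
        __asm__ __volatile__("rep movsb"
                             : "+D"(dst), "+S"(src), "+c"(n)
                             :
                             : "memory");
    }

    void *memcpy_ermsb_dispatch(void *dst, const void *src, size_t n) {
        if (n >= 2048)
            rep_movsb(dst, src, n);        // ERMSB path for large blocks
        else
            copy_avx_small(dst, src, n);   // AVX path for small blocks
        return dst;
    }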

Taking Advantage of the Out-of-Order Execution Engine

You can also read about the Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section 2.1.2, and take advantage of it.

For example, the Intel Skylake processor series (launched in 2015) has:

  • 4 execution units for the Arithmetic logic unit (ALU) (add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup),
  • 3 execution units for Vector ALU ( (v)pand, (v)por, (v)pxor, (v)movq, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd)

So we can occupy the above (3+4) units in parallel only if we use register-only operations; we cannot issue 3+4 instructions in parallel for a memory copy. At most we can use two 32-byte loads and one 32-byte store simultaneously, even when we are working entirely within the Level-1 cache.
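
At the intrinsics level, this limit suggests structuring the inner loop around two 32-byte loads followed by two 32-byte stores per iteration, so that both load ports are kept busy and the single store port becomes the bottleneck. A minimal sketch, assuming AVX support and a size that is a multiple of 64 bytes (copy_avx64 is an illustrative name):

#include <immintrin.h>
#include <cstddef>

// Sketch: copy 'size' bytes, 64 per iteration, issuing two 32-byte loads and
// two 32-byte stores. Assumes AVX support and size being a multiple of 64.
static void copy_avx64(char* dst, const char* src, std::size_t size)
{
    for (std::size_t i = 0; i < size; i += 64)
    {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i + 32));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), a);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i + 32), b);
    }
}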

Please see the Intel manual again to understand how to implement the fastest memcpy: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Section 2.2.2 (The Out-of-Order Engine of the Haswell microarchitecture): "The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provided execution resources for computational operations. The other 4 ports support memory operations of up to two 256-bit load and one 256-bit store operation in a cycle."

Section 2.2.4 (Cache and Memory Subsystem) has the following note: "First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32-bytes of data."

Section 2.2.4.1 (Load and Store Operation Enhancements) has the following information: The L1 data cache can handle two 256-bit (32 bytes) load and one 256-bit (32 bytes) store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.

The other sections (2.3 and so on, dedicated to Sandy Bridge and other microarchitectures) basically reiterate the above information.

The section 2.3.4 (The Execution Core) gives additional details.

The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.

  • Port 0: ALU, Shift, Mul, STTNI, Int-Div, 128b-Mov, Blend, 256b-Mov
  • Port 1: ALU, Fast LEA, Slow LEA, MUL, Shuf, Blend, 128bMov, Add, CVT
  • Port 2 & Port 3: Load_Addr, Store_addr
  • Port 4: Store_data
  • Port 5: ALU, Shift, Branch, Fast LEA, Shuf, Blend, 128b-Mov, 256b-Mov

The section 2.3.5.1 (Load and Store Operation Overview) may also be useful for understanding how to make a fast memory copy, as well as section 2.4.4.1 (Loads and Stores).

For the other processor microarchitectures, it is again two load units and one store unit. Table 2-4 (Cache Parameters of the Skylake Microarchitecture) has the following information:

Peak Bandwidth (bytes/cyc):

  • First Level Data Cache: 96 bytes (2 x 32B Load + 1 x 32B Store)
  • Second Level Cache: 64 bytes
  • Third Level Cache: 32 bytes.

I have also done speed tests on my Intel Core i5 6600 CPU (Skylake, 14 nm, released in September 2015) with DDR4 memory, and this has confirmed the theory. For example, my tests have shown that using generic 64-bit registers for memory copy, even many registers in parallel, degrades performance. Also, using just 2 XMM registers is enough - adding the 3rd doesn't add performance.
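
For reference, a minimal timing harness along these lines could look like the sketch below (illustrative only, not the exact program behind the numbers above; the compiler barrier assumes GCC/Clang):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Illustrative harness: repeatedly copies a block that fits in the 32 KB L1
// data cache and reports the achieved bandwidth.
int main()
{
    constexpr std::size_t kBlock = 16 * 1024;   // fits comfortably in a 32 KB L1D
    constexpr int kIters = 1000000;
    std::vector<char> src(kBlock, 1), dst(kBlock, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
    {
        std::memcpy(dst.data(), src.data(), kBlock);  // swap in the copy routine under test
        asm volatile("" ::: "memory");  // GCC/Clang barrier so the copies are not optimized away
    }
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.2f GB/s\n", kBlock * double(kIters) / seconds / 1e9);
    return 0;
}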

If your CPU has the AVX CPUID bit, you may take advantage of the large, 256-bit (32-byte) YMM registers to copy memory and occupy both load units. AVX support was first introduced by Intel with the Sandy Bridge processors, shipping in Q1 2011, and later by AMD with the Bulldozer processor, shipping in Q3 2011.

; first cycle
vmovdqa ymm0, ymmword ptr [rcx+0]      ; load 1st 32-byte part using the first load unit
vmovdqa ymm1, ymmword ptr [rcx+20h]    ; load 2nd 32-byte part using the second load unit

; second cycle
vmovdqa ymmword ptr [rdx+0], ymm0      ; store 1st 32-byte part using the single store unit

; third cycle
vmovdqa ymmword ptr [rdx+20h], ymm1    ; store 2nd 32-byte part - this requires a separate cycle, since there is only one store unit and we cannot do two stores in a single cycle

add rcx, 40h ; the pointer increments are handled by the ALU ports since they don't use the load or store units, so they don't require an extra cycle
add rdx, 40h

Also, there is a speed benefit if you unroll this loop at least 8 times. As I wrote before, adding more registers besides ymm0 and ymm1 doesn't increase performance, because there are just two load units and one store unit. Adding a loop counter like "dec r9 / jnz @@again" degrades the performance, but the simple "add rcx/rdx" pointer increments do not.
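
A sketch of such an 8-times unrolled copy at the intrinsics level (illustrative only; assumes AVX support and a size that is a multiple of 512 bytes, and keeps only two YMM values live at a time):

#include <immintrin.h>
#include <cstddef>

// Copy one 64-byte step: two 32-byte loads, then two 32-byte stores.
static inline void copy64(char* dst, const char* src)
{
    __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + 32));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), a);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + 32), b);
}

// Sketch: 8x-unrolled loop, 512 bytes per iteration (assumes size % 512 == 0).
static void copy_avx_unrolled8(char* dst, const char* src, std::size_t size)
{
    for (std::size_t i = 0; i < size; i += 512)
    {
        copy64(dst + i,       src + i);
        copy64(dst + i + 64,  src + i + 64);
        copy64(dst + i + 128, src + i + 128);
        copy64(dst + i + 192, src + i + 192);
        copy64(dst + i + 256, src + i + 256);
        copy64(dst + i + 320, src + i + 320);
        copy64(dst + i + 384, src + i + 384);
        copy64(dst + i + 448, src + i + 448);
    }
}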

Finally, if your CPU has the AVX-512 extension, you can use 512-bit (64-byte) registers to copy memory:

vmovdqu64   zmm0, [rcx+0]           ; load 1st 64-byte part
vmovdqu64   zmm1, [rcx+40h]         ; load 2nd 64-byte part 

vmovdqu64   [rdx+0], zmm0           ; store 1st 64-byte part
vmovdqu64   [rdx+40h], zmm1         ; store 2nd 64-byte part 

add     rcx, 80h
add     rdx, 80h    

AVX-512 is supported by the following processors: Xeon Phi x200, released in 2016; Skylake EP/EX Xeon "Purley" (Xeon E5-26xx V5) processors (H2 2017); Cannonlake processors (H2 2017); and Skylake-X processors - Core i9-7×××X, i7-7×××X, i5-7×××X - released in June 2017.

Please note that the memory has to be aligned to the size of the registers that you are using. If it is not, please use the "unaligned" instructions: vmovdqu and movups.
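
At the intrinsics level, an unaligned AVX-512 copy can be sketched as follows (assumes AVX-512F support and a size that is a multiple of 128 bytes; copy_avx512 is an illustrative name):

#include <immintrin.h>
#include <cstddef>

// Sketch: AVX-512 copy, 128 bytes per iteration, using the unaligned
// load/store intrinsics so the pointers need not be 64-byte aligned.
// Assumes AVX-512F support and size being a multiple of 128.
static void copy_avx512(char* dst, const char* src, std::size_t size)
{
    for (std::size_t i = 0; i < size; i += 128)
    {
        __m512i a = _mm512_loadu_si512(src + i);
        __m512i b = _mm512_loadu_si512(src + i + 64);
        _mm512_storeu_si512(dst + i, a);
        _mm512_storeu_si512(dst + i + 64, b);
    }
}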

Maxim Masiutin
  • 1
    Can I make that happen using some kind of C/C++'ish wrappers? Or must I write assembly code? – einpoklum May 08 '17 at 11:25
  • Microsoft and Intel compilers have C wrappers, but, in my opinion, assembly code, be it inline or in a separate .asm file, should be preferable. The question is, what is your goal - memcpy() speed, or portability/simplicity. – Maxim Masiutin May 08 '17 at 11:33
  • 2
    @MaximMasiutin - your attempt to mix SSE and 64-bit `mov` instructions doesn't work because ALUs don't execute loads. There are only two load units on even the most advanced x86 CPUs, so at most two loads can be issued per cycle. Loads of all sizes (8 bits, 16 bits, 32 bits, ..., 256) go to those units, so you usually just want to use the largest loads available for the bulk of a copy. – BeeOnRope May 08 '17 at 21:52
  • @BeeOnRope - I have already figured this out. As I've mentioned in my comment: "In practice, when I have done speed tests on my Intel Core i5 6600 CPU (Skylake, 14nm, released in September 2015) with DDR4 memory, using generic 64-bit registers for memory copy degrade performance. Also, using just 2 XMM registers is enough - adding the 3rd doesn't add performance. Probably, the memory bandwidth is limited between the CPU and its cache - I have tested with very small blocks that fit entirely in the L1 cache which is 32KB for data and 32KB for instructions on my CPU." -- so just 2 XMM is enough. – Maxim Masiutin May 08 '17 at 23:48
  • 1
    Right, but the form of your answer is "in theory, this should work, but in practice it doesn't". The truth, however, is "in theory and practice this doesn't work". Is that not useful information? Also, you conclude that your "mixed GP/SIMD" technique doesn't work due to bandwidth, but that's not really correct: it doesn't work because it is based on an incorrect machine model. Sure, if you test on big buffers, you'll end up bandwidth limited so even poor implementations created with mistaken theory can "tie" good ones, but test it on a small buffer and you'll see your theory is wrong. – BeeOnRope May 08 '17 at 23:54
  • 2
    @BeeOnRope, thank you very much for pointing that out. I have rewritten the relevant section. Thank you again. – Maxim Masiutin May 09 '17 at 00:25