The question can't be answered precisely without some additional details such as:
- What is the target platform (mostly the CPU architecture, but the memory configuration plays a role too)?
- What is the distribution and predictability [1] of the copy lengths (and, to a lesser extent, the distribution and predictability of alignments)?
- Will the copy size ever be statically known at compile-time?
Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.
32-case Switch Statement
The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.
Code Size
This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of memcpy on a full-sized CPU because everything still fits in the fastest cache level. In the real world, though, you execute other code too, and there is contention for the uop cache and the L1 data and instruction caches.
That many instructions may take fully 20% of the effective size of your uop cache [3], and uop cache misses (and the corresponding transition cycles between the uop cache and the legacy decoders) could easily wipe out the small benefit given by this elaborate switch.
On top of that, the switch requires a 32-entry, 256-byte lookup table for the jump targets [4]. If you ever get a miss to DRAM on that lookup, you are talking about a penalty of 150+ cycles: how many non-misses do you then need to make the switch worth it, given that it is probably saving a few cycles at most? Again, that won't show up in a microbenchmark.
For what it's worth, this memcpy isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I can conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general-purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.
Branch Prediction
The switch statement relies on a single indirect branch to choose among the alternatives. This is going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.
Because it is an indirect branch, there are more limits on its predictability than for a conditional branch, since there are a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to memcpy doesn't follow a simple repeating pattern with a short period (as short as 1 or 2 on older CPUs), there will be a branch mispredict on each call.
This issue is particularly insidious because it is likely to hurt you the most in the real world in exactly the situations where a microbenchmark shows the switch to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, the switch is all-important (indeed, for copies of 31 bytes or less it is all that executes)!
For these short lengths, a predictable series of lengths works very well for the switch since the indirect jump is basically free. In particular, a typical memcpy benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test so that the results are easy to plot as "time vs length" graphs. The switch does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.
In the real world, your lengths might be small but unpredictable. In that case, the indirect branch will frequently mispredict [5], with a penalty of ~20 cycles on modern CPUs. Compared to the best case of a couple of cycles, that is an order of magnitude worse. So the glass jaw here can be very serious: the behavior of the switch in this typical case can be an order of magnitude worse than the best case, whereas at long lengths you are usually looking at a difference of 50% at most between different strategies.
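To put rough numbers on this, here is a deliberately crude model, not a description of any real predictor: assume the indirect branch simply predicts the same switch case as on the previous call. The function names and the xorshift generator are my own illustrative choices.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32) so the experiment is repeatable. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Crude model (an assumption, real predictors also use branch history):
 * the predictor guesses "same switch case as the previous call".
 * Returns the fraction of calls whose length differs from the one before. */
static double last_target_mispredict_rate(const unsigned *lengths, size_t n) {
    size_t misses = 0;
    for (size_t i = 1; i < n; i++) {
        if (lengths[i] != lengths[i - 1]) misses++;
    }
    return n > 1 ? (double)misses / (double)(n - 1) : 0.0;
}

/* Fill `out` with lengths in [0, 30], roughly uniformly distributed. */
static void fill_random_lengths(unsigned *out, size_t n, uint32_t seed) {
    uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) out[i] = xorshift32(&state) % 31u;
}
```

Under this model a benchmark that repeats one length never mispredicts, while lengths drawn uniformly from 0 to 30 mispredict about 30 times out of 31, which is in line with the ~97% figure in footnote 5.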
Solutions
So how can you do better than the above, at least under the conditions where the switch falls apart?
Use Duff's Device
One solution to the code size issue is to combine the switch cases together, Duff's device-style.
For example, the assembled code for the length 1, 3 and 7 cases looks like:
Length 1
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
Length 3
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
ret
Length 7
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
mov edx, DWORD PTR [rsi+3]
mov DWORD PTR [rcx+3], edx
ret
This can be combined into a single case, with various jump-ins:
len7:
mov edx, DWORD PTR [rsi-6]
mov DWORD PTR [rcx-6], edx
len3:
movzx edx, WORD PTR [rsi-2]
mov WORD PTR [rcx-2], dx
len1:
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
The labels don't cost anything, and they combine the cases together and remove two of the three ret instructions. Note that the basis for rsi and rcx has changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.
You can extend that for longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction alone from this approach, and much better if you combine it with something else to collapse the sizes from 16 - 31.
This approach only helps with the code size (and possibly the jump table size, if you shrink the entries as described in footnote 4 and the code gets under 256 bytes, allowing a byte-sized lookup table). It does nothing for predictability.
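The same chain can be sketched in C with a fall-through switch. This is only an illustration of the idea for lengths 1, 3 and 7: the function name is made up, the up-front length check is there to keep the sketch safe on its own, and the fixed-size memcpy calls stand in for the 4-byte and 2-byte moves (compilers typically turn them into single loads and stores).

```c
#include <stddef.h>
#include <string.h>

/* Copies exactly `len` bytes when len is 1, 3 or 7, using one fall-through
 * chain instead of three separate cases. Returns 1 if the length was
 * handled, 0 otherwise (nothing is copied in that case). */
static int copy_chain_1_3_7(unsigned char *dst, const unsigned char *src,
                            size_t len) {
    if (len != 1 && len != 3 && len != 7) return 0;

    /* As in the assembly, both pointers refer to the LAST byte to copy. */
    unsigned char *d = dst + (len - 1);
    const unsigned char *s = src + (len - 1);

    switch (len) {
    case 7:
        memcpy(d - 6, s - 6, 4); /* bytes 0..3 of a 7-byte copy */
        /* fall through */
    case 3:
        memcpy(d - 2, s - 2, 2); /* the two bytes before the last one */
        /* fall through */
    case 1:
        *d = *s;                 /* the last byte */
    }
    return 1;
}
```

Extending the chain with 8-byte and 16-byte steps for lengths 15 and 31 follows the same pattern.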
Overlapping Stores
One trick that helps for both code size and predictability is to use overlapping stores. That is, a memcpy of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative positions 0 and 11 - 8 == 3. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.
The C code looks like:
if (Size >= 8) {
*((uint64_t*)Dst) = *((const uint64_t*)Src);
size_t offset = Size & 0x7;
*(uint64_t *)(Dst + offset) = *(const uint64_t *)(Src + offset);
}
... and the corresponding assembly is not problematic:
cmp rdx, 7
jbe .L8
mov rcx, QWORD PTR [rsi]
and edx, 7
mov QWORD PTR [rdi], rcx
mov rcx, QWORD PTR [rsi+rdx]
mov QWORD PTR [rdi+rdx], rcx
In particular, note that you get exactly two loads, two stores and one and instruction (in addition to the cmp and the conditional jump, whose existence depends on how you organize the surrounding code). That's already tied with or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.
Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty [6]. The overlapping approach has two main advantages:
- The behavior is branch-free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.
- At least 8 or 9 different cases from the switch are subsumed into a single case with a fraction of the total code size.
This approach can be combined with the switch approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.
What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
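Here is a minimal sketch of the overlapping idea as a standalone helper. It differs from the snippet above in two ways that are my own choices: the second move is placed at size - 8 rather than size & 7, so that 16 is covered too, and the accesses go through fixed-size memcpy calls instead of pointer casts. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies `size` bytes for 8 <= size <= 16 using two 8-byte moves that may
 * overlap in the middle. Returns 1 if the size was handled, 0 otherwise. */
static int copy_8_to_16(void *dst, const void *src, size_t size) {
    if (size < 8 || size > 16) return 0;

    uint64_t head, tail;
    /* First 8 bytes and last 8 bytes of the source. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + size - 8, 8);

    /* Store both; for size < 16 the two stores overlap by 16 - size bytes. */
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + size - 8, &tail, 8);
    return 1;
}
```

For the 11-byte example, the second move lands at offset 11 - 8 == 3, so bytes 3 to 7 are written twice with the same values.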
Alignment
The existing code doesn't address alignment.
In fact, it isn't, in general, legal C or C++, since the char * pointers are simply cast to larger types and dereferenced, which is not legal. In practice it generates code that works with today's x86 compilers, but it would fail on platforms with stricter alignment requirements.
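One common way to express the same 8-byte accesses without the casts is to go through a fixed-size memcpy into a local variable. The helper names below are hypothetical; with optimization enabled, mainstream x86 compilers typically reduce each helper to a single mov.

```c
#include <stdint.h>
#include <string.h>

/* Read 8 bytes from an arbitrarily aligned address. */
static inline uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write 8 bytes to an arbitrarily aligned address. */
static inline void store_u64(void *p, uint64_t v) {
    memcpy(p, &v, sizeof v);
}
```

With these, a line such as the first store in the overlapping example becomes store_u64(Dst, load_u64(Src)), which is well-defined regardless of how Dst and Src are aligned.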
Beyond that, it is often better to handle the alignment specifically. There are three main cases:
1. The source and destination are already aligned. Even the original algorithm will work fine here.
2. The source and destination are relatively aligned, but absolutely misaligned. That is, there is a value A that can be added to both the source and destination such that both are aligned.
3. The source and destination are fully misaligned (i.e., they are not actually aligned and case (2) does not apply).
The existing algorithm will work OK in case (1). It is potentially missing a large optimization in case (2), since a small intro loop could turn an unaligned copy into an aligned one.
It is also likely performing poorly in case (3), since in general in the totally misaligned case you can choose to either align the destination or the source and then proceed "semi-aligned".
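The three cases can be told apart with a couple of mask operations. The sketch below is illustrative only: the names are made up, the alignment is assumed to be a power of two, and it relies on pointer-to-integer conversion preserving the low address bits, which holds on common platforms but is implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

enum align_case {
    BOTH_ALIGNED = 1,       /* case (1) */
    RELATIVELY_ALIGNED = 2, /* case (2) */
    FULLY_MISALIGNED = 3    /* case (3) */
};

/* `align` must be a power of two, e.g. 16 or 32. */
static enum align_case classify_alignment(const void *dst, const void *src,
                                          size_t align) {
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t mask = (uintptr_t)align - 1;

    if (((d | s) & mask) == 0) return BOTH_ALIGNED;
    /* Same low bits: one shared offset aligns both pointers. */
    if (((d ^ s) & mask) == 0) return RELATIVELY_ALIGNED;
    return FULLY_MISALIGNED;
}

/* Number of bytes an intro loop has to copy before `p` becomes aligned. */
static size_t bytes_to_align(const void *p, size_t align) {
    return (size_t)(-(uintptr_t)p & ((uintptr_t)align - 1));
}
```

In case (2), copying bytes_to_align(dst, align) bytes up front leaves both pointers aligned; in case (3) the same call aligns only whichever pointer you pass.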
The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.
If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
No unrolling
The code is not unrolled, and compilers unroll by different amounts by default. Clearly this is suboptimal since, between two compilers with different unroll strategies, at most one will be best.
The best approach (at least for known platform targets) is to determine which unroll factor is best, and then apply that in the code.
Furthermore, the unrolling can often be combined in a smart way with the "intro" or "outro" code, doing a better job than the compiler could.
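As a rough sketch of what a fixed unroll factor looks like in source, here is a bulk loop unrolled by four 8-byte moves. The factor of 4 is a placeholder rather than a recommendation, the function name is made up, and the outro simply hands the remainder to memcpy. Note that an optimizing compiler may still re-unroll or vectorize this loop, so the generated code needs checking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bulk copy with a hand-picked unroll factor: 4 x 8 = 32 bytes per
 * iteration. All four loads are issued before the four stores. */
static void copy_unrolled4(unsigned char *dst, const unsigned char *src,
                           size_t size) {
    while (size >= 32) {
        uint64_t a, b, c, d;
        memcpy(&a, src, 8);
        memcpy(&b, src + 8, 8);
        memcpy(&c, src + 16, 8);
        memcpy(&d, src + 24, 8);
        memcpy(dst, &a, 8);
        memcpy(dst + 8, &b, 8);
        memcpy(dst + 16, &c, 8);
        memcpy(dst + 24, &d, 8);
        src += 32;
        dst += 32;
        size -= 32;
    }
    /* Outro: the remaining 0..31 bytes, left to memcpy in this sketch. */
    memcpy(dst, src, size);
}
```

The outro is where the overlapping-store trick would slot in, since the loop leaves exactly the 0 to 31 trailing bytes discussed earlier.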
Known sizes
The primary reason that it is tough to beat the "builtin" memcpy routine with modern compilers is that compilers don't just call a library memcpy whenever memcpy appears in the source. They know the contract of memcpy and are free to implement it with a single inlined instruction, or even less [7], in the right scenario.
This is especially obvious with known lengths in memcpy. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in place. This avoids not only the overhead of the function call but also all the checks about size and so on, and it generates efficient code for the copy at compile time, much like the big switch in the implementation above, but without the costs of the switch.
Similarly, the compiler knows a lot about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.
If you just implement a memcpy2 as a library function, that is tough to replicate. You can get part of the way there by splitting the method into a small and a big part: the small part appears in the header file, does some size checks, and potentially just calls the existing memcpy if the size is small, or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtin memcpy.
Finally, you can also try tricks with __builtin_constant_p or equivalents to handle the small, known case efficiently.
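A minimal sketch of that header/library split, assuming GCC or Clang (both provide __builtin_constant_p and __builtin_memcpy). The name memcpy2_library and the 64-byte threshold are placeholders of mine, and the library routine is stubbed with plain memcpy only so that the example is self-contained; in a real setup it would live in a separate translation unit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical out-of-line routine standing in for the custom bulk copy. */
static void *memcpy2_library(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n); /* placeholder body for this sketch */
}

/* Header-side wrapper: small, compile-time-known sizes are handed to the
 * compiler's builtin so it can expand them inline; everything else goes
 * to the library routine. */
static inline void *memcpy2(void *dst, const void *src, size_t n) {
#if defined(__GNUC__)
    if (__builtin_constant_p(n) && n <= 64) {
        return __builtin_memcpy(dst, src, n);
    }
#endif
    return memcpy2_library(dst, src, n);
}
```

A call such as memcpy2(&y, &x, sizeof x) on a small struct can then be expanded inline once the wrapper itself is inlined, while a call with a runtime length takes the library path. Without optimization, __builtin_constant_p generally evaluates to 0 here, so everything goes to the library routine.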
1 Note that I'm drawing a distinction here between the "distribution" of sizes (e.g., you might say uniformly distributed between 8 and 24 bytes) and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predictable pattern?). The question of predictability is somewhat subtle because it depends on the implementation, since as described above certain implementations are inherently more predictable.
2 In particular, ~750 bytes of instructions in clang and ~600 bytes in gcc for the body alone, on top of the 256-byte jump lookup table for the switch body, which had 180 - 250 instructions (gcc and clang respectively). Godbolt link.
3 Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.
4 The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table, cutting its size by 75%, at the cost of not using a memory-source operand for the jmp.
5 Unlike conditional branch prediction, which has a typical worst-case misprediction rate of ~50% (for totally random branches), the misprediction rate of a hard-to-predict indirect branch can easily approach 100%, since you aren't flipping a coin: you are choosing from an almost infinite set of branch targets. This happens in the real world: if memcpy is being used to copy small strings with lengths uniformly distributed between 0 and 30, the switch code will mispredict ~97% of the time.
6 Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.
7 For example, a memcpy to the stack, followed by some manipulation and a copy somewhere else, may be totally eliminated, directly moving the original data to its final location. Even things like malloc followed by memcpy can be totally eliminated.