I have seen a few very informative posts here and here discussing the merits of different approaches for copying / setting memory.

All of the posts go into detail about the pros and cons of REP MOVSD in a variety of settings, though I found they discussed the topic in too generic a setting to actually come to any definitive answers.

So for x86_64 skylake I am curious about the following scenarios:

Assume for all scenarios both src and dst are page aligned

  • Copying 16384 bytes of data

    • Comparing REP MOVSD, ymm blocks, zmm blocks
  • Copying 2 Gigabytes of data

    • Comparing REP MOVSD, ymm blocks w/ VMOVNTDQ, zmm blocks w/ VMOVNTDQ
  • Setting 16384 bytes of data

    • Comparing REP STOSD, ymm blocks, zmm blocks
  • Setting 2 Gigabytes of data

    • Comparing REP STOSD, ymm blocks w/ VMOVNTDQ, zmm blocks w/ VMOVNTDQ

Based on the glibc implementations of memcpy and memset, it seems glibc uses SIMD registers for both memcpy cases and REP STOSD for both of the memset cases.

I am curious if there are any hard recommendations for these scenarios (and to understand better why they are recommendations).

Thank you.

Edit: One of the reasons I am making this post is that the other posts didn't seem to discuss AVX512.

Noah
  • Note that the assumption on page alignment is a very dangerous one as buffers are most often not page aligned and a lot of time is spent getting to an aligned state. Also consider that most buffers copied are actually less than about 100 bytes long. – fuz Nov 16 '20 at 17:20
  • Have you benchmarked these strategies? – fuz Nov 16 '20 at 17:22
    For very large transfers it might be better to have a DMA controller perform the copying. – EOF Nov 16 '20 at 17:24
  • @fuz yeah, but I think the other posts were pretty clear about those cases. – Noah Nov 16 '20 at 17:26
  • @EOF Are you sure? DMA usually involves transfer away from the fast memory controller to some DMA engine on some bus and can be a lot slower than raw memory bandwidth. – fuz Nov 16 '20 at 18:46
  • @fuz I expect that depends heavily on the specific architecture. However: 1) while the DMA engine copies memory, the CPU can keep computing, especially if you make an asynchronous memory copy abstraction and 2) even if you're doing a synchronous copy, the DMA engine will be vastly simpler (even for equivalent bandwidth) than a general-purpose CPU, which implies much lower power consumption and consequently heat. If you're blocked on memory bandwidth you can use the time to reduce CPU temperature and open up thermal headroom for boosting the clock as soon as you're compute bound again. – EOF Nov 16 '20 at 22:24
  • @EOF on the other hand, this'll certainly mean that all your cache needs to be invalidated, slowing down any transfers back from memory. – fuz Nov 16 '20 at 23:02
  • As I said, "For very large transfers". For those your cache will not help *anyway*, and a DMA transfer can avoid polluting your cache (otherwise you would need explicitly non-caching instructions for this) so your working set isn't completely evicted. Wiping out your caches (and TLB!) down to physical memory, beyond LLC with a large `memcpy()` will make your performance suck for quite a while afterwards. – EOF Nov 16 '20 at 23:12
  • @fuz Whether you need to explicitly flush caches for DMA to work correctly depends on the architecture again. Some architectures have DMA engines that respect cache coherence, so for those you don't need to do anything special. Others, you only need to clean the cache, rather than flush it. Only on truly programmer-hostile architectures will the caches have to be completely flushed, and even then you probably don't have to flush physically tagged/physically indexed caches unless the architecture is actively malicious. Even then, see my previous comment. – EOF Nov 16 '20 at 23:23
  • @EOF that's not what I mean; after DMA is done, the data that was just DMA-copied is no longer in cache and has to be reloaded from memory. That can be expensive. I am of course assuming that cache coherency is somehow dealt with. CPU-based memory copy procedures can avoid this. – fuz Nov 16 '20 at 23:38
  • @fuz, check out Intel DSA (https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator) (unfortunately not available yet). It can write to cache or not under software control. – prl Nov 17 '20 at 00:20
  • @EOF, you may also be interested in my previous comment. – prl Nov 17 '20 at 00:21

0 Answers