
I ran a test with this code:

    for (i32 i = 0; i < 0x800000; ++i)
    {
        // Hopefully this can disable hardware prefetch
        i32 k = (i * 997 & 0x7FFFFF) * 0x40;

        _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);

        for (i32 j = 0; j < 0x40; j += 0x10)
        {
            //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
            __m128 v = _mm_load_ps((float *)(data + k + j));

            a_single_chain_computation

            //_mm_stream_ps((float *)(data2 + k + j), v);
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }

Results are weird.

  1. No matter how long `a_single_chain_computation` takes, the load latency is not hidden.
  2. What's more, the total time saved by prefetching grows as I add more computation. (With a single `v = _mm_mul_ps(v, v)`, prefetching saves about 0.60 - 0.57 = 0.03 s. With 16 of them, it saves about 1.1 - 0.75 = 0.35 s. WHY?)
  3. Non-temporal loads/stores degrade performance with or without prefetching. (I can understand the load part, but why the stores, too?)
BlueWanderer
  • Have you tried the normal prefetch? In my experience, I've never had a good use-case for non-temporal loads. But I've found streaming stores to be useful when doing completely random writes in perfectly aligned cacheline-sized blocks. – Mysticial Jun 26 '13 at 06:26
  • @Mysticial `_MM_HINT_NTA` is described as `minimizing cache pollution`, so I guess it's non-temporal. But `_MM_HINT_Tx` doesn't seem to degrade the performance. I guess that's because there is no other cache usage. – BlueWanderer Jun 26 '13 at 06:32
  • Ah, I wouldn't have expected `_MM_HINT_Tx` to degrade performance. In the case of non-temporal prefetch, it seems self-defeating to prefetch something and not pollute the cache, since the whole point of prefetching is to bring data into the cache. It's one of the things I've never really understood. :) – Mysticial Jun 26 '13 at 06:36
  • @Mysticial Just my guess: a temporal prefetch will pollute the L3 cache. That's not desirable if I don't want to read the data again while another thread is using the L3 cache heavily. – BlueWanderer Jun 26 '13 at 06:39
  • Hmm... It'd be an interesting puzzle to write something to test that. :) – Mysticial Jun 26 '13 at 06:40
  • What types do `data` and `data2` point to? I'm guessing you might not be writing all the bytes in the cache lines, possibly making the processor have to merge the write-combining buffer with what's already in memory. Also, you are probably not prefetching far enough ahead for it to be effective. – doug65536 Jun 26 '13 at 06:49
  • @doug65536 Both are `char *`, aligned to a 32-byte boundary. My problem is not that prefetch is ineffective; it's TOO illogically effective. – BlueWanderer Jun 26 '13 at 08:31
  • @doug65536 Another weird thing is that even with only one `mulps` per iteration, prefetching further ahead still degrades the performance... – BlueWanderer Jun 26 '13 at 08:45

2 Answers


You need to separate two different things here (which unfortunately have similar names):

  • Non-temporal prefetching - this would fetch the line, but mark it as least recently used when it fills the cache, so it would be first in line for eviction the next time the same set is used. That leaves you enough time to actually use it (unless you're very unlucky), but it won't waste more than a single way of that set, since the next prefetch to come along would just replace it. By the way, regarding your comments above: every prefetch pollutes the L3 cache, since it's inclusive, so you can't get away without that.

  • Non-temporal (streaming) loads/stores - these also won't pollute the caches, but they use a completely different mechanism: they make the accesses uncacheable (with write combining). This does carry a performance penalty even if you really never need those lines again, since a cacheable write has the luxury of staying buffered in the cache until evicted, so you don't have to write it out right away. With uncacheables you do, and in some scenarios that can interfere with your memory bandwidth. On the other hand, you get the benefit of write combining and weak ordering, which may give you some edge in several cases. The bottom line is that you should use them only when they help; don't assume they magically improve performance (nothing does that nowadays...).
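To make the streaming-store mechanism concrete, here is a minimal sketch (the `fill_streaming` helper is hypothetical, not from the question) that writes a buffer with `_mm_stream_si128` and fences before the data is consumed:

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Fill a 16-byte-aligned buffer (size a multiple of 16) with a byte pattern
   using non-temporal stores, bypassing the cache hierarchy. */
static void fill_streaming(void *dst, int byte, size_t n)
{
    __m128i v = _mm_set1_epi8((char)byte);
    char *p = (char *)dst;
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);
    _mm_sfence();  /* drain the write-combining buffers before anyone reads the data */
}
```

The `_mm_sfence()` matters: streaming stores are weakly ordered, so without it another thread (or a later read on some implementations) could observe stale data.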

Regarding your questions -

  1. Your prefetching should work, but it's not issued early enough to make an impact. Try replacing `i + 1` with a larger number. Actually, maybe even do a sweep; it would be interesting to see how many elements in advance you should prefetch.

  2. I'd guess this is the same as 1: with 16 muls, your iteration is long enough for the prefetch to complete.

  3. As I said, your stores won't have the benefit of buffering in the lower-level caches and will have to be flushed to memory. That's the downside of streaming stores. It's implementation-specific, of course, so it might improve over time, but at the moment it's not always effective.
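The sweep from point 1 could look something like this. `copy_pass` is a hypothetical harness adapted from the question's loop, with the block count parameterized (any power of two; the question uses `0x800000`) so the prefetch distance `dist` can be varied and timed externally:

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_load_ps, _mm_store_ps */

/* One pass over n_blocks 64-byte blocks (n_blocks must be a power of two),
   visiting them in the question's pseudo-random 997-stride order and
   prefetching `dist` iterations ahead. Both buffers must be 64-byte aligned.
   Since 997 is odd, i * 997 & (n_blocks - 1) is a bijection, so every block
   is visited exactly once. */
static void copy_pass(const char *data, char *data2, int n_blocks, int dist)
{
    int mask = n_blocks - 1;
    for (int i = 0; i < n_blocks; ++i) {
        int k = (i * 997 & mask) * 0x40;
        _mm_prefetch(data + ((i + dist) * 997 & mask) * 0x40, _MM_HINT_NTA);
        for (int j = 0; j < 0x40; j += 0x10) {
            __m128 v = _mm_load_ps((const float *)(data + k + j));
            /* a_single_chain_computation would go here */
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }
}
```

Timing `copy_pass` for `dist` values of, say, 1, 2, 4, 8, 16, 32 would show where the prefetch distance stops hiding latency on a given machine.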

Leeor

If your computation chain is very short and you're reading memory sequentially, the CPU will prefetch well on its own and may actually run faster, since its decoder has less work to do.

Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncacheable write-combining (WC) memory, which is typically encountered when dealing with graphics surfaces. Explicit prefetching may work well on one architecture (CPU model) and have a negative effect on other models, so use it as a last-resort option when optimizing.
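The first point can be sketched as a plain sequential copy with no explicit prefetch, relying on the hardware prefetcher to detect the linear stride (the helper name is an assumption, not from the answer):

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_storeu_ps */
#include <stddef.h>

/* Sequential SSE copy with no software prefetch; the hardware prefetcher
   recognizes the unit-stride access pattern and runs ahead on its own.
   n_floats should be a multiple of 4. */
static void copy_sequential(const float *src, float *dst, size_t n_floats)
{
    for (size_t i = 0; i + 4 <= n_floats; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, v);
    }
}
```

On a loop like this, adding `_mm_prefetch` usually buys nothing and just occupies a decode/issue slot, which is exactly the decoder overhead mentioned above.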

egur