
Let's assume an algorithm is repeatedly processing buffers of data. It may be accessing, say, 2 to 16 of these buffers, all of the same size. What would you expect the optimum size of these buffers to be, assuming the algorithm can process the full data in smaller blocks?

I expect cache misses to become a potential bottleneck if the blocks are too big, but of course the bigger the blocks, the better for vectorization.

Let's assume current i7/i9 CPUs (2018).

Any ideas?

  • How exactly are these buffers being accessed? Is the algorithm reading from and writing to these buffers in some pattern? Is it accessing multiple buffers concurrently? How much locality is there? How big or small can a buffer be? It could be that the size is less important than being prefetch-friendly. – Hadi Brais Aug 05 '18 at 20:45
  • Thanks. As I explained, it's a few buffers being accessed every time. So for instance you have 8 buffers, each say 1024 floats, and various algorithms are touching them, reading and writing, but in 99% of cases going forwards in the buffers, hence processing them from beginning to end. – Vojtěch Melda Meluzín Aug 05 '18 at 21:12
  • This still sounds vague to me. Are the various algorithms running concurrently in different threads, or sequentially? The question says there is one algorithm. Does each algorithm require access to all elements of the buffers in mostly sequential order? – Hadi Brais Aug 05 '18 at 21:17
  • Single thread. Let's assume each algorithm accesses all of them, for simplicity. I'm generalizing here a lot. – Vojtěch Melda Meluzín Aug 06 '18 at 13:46

1 Answer


Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
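
For example, a minimal sketch of keeping a buffer tied to its thread, assuming C++ (the `thread_scratch` name and the use of a `std::vector` are hypothetical, not from the question):

```cpp
#include <cstddef>
#include <vector>

// Give each thread its own reusable scratch buffer, so repeated work on
// the same thread keeps touching lines that are already hot in that
// core's private L1d/L2.
std::vector<float>& thread_scratch(std::size_t n)
{
    thread_local std::vector<float> buf;  // one per thread, reused across calls
    if (buf.size() < n)
        buf.resize(n);                    // grow once, then stay allocated
    return buf;
}
```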

Modern Intel CPUs have 32 KiB of L1d and 256 KiB of L2, private per core. (Or Skylake-AVX512 has 1 MiB private L2 caches, with less shared L3.) (See Which cache mapping technique is used in intel core i7 processor?)

Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.

Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and spending more time outside of the main loop.
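
As a back-of-the-envelope starting point (my numbers: the 32 KiB L1d / 256 KiB L2 above and the 8 buffers from the comments; check your actual CPU), you can derive a per-buffer size from whichever cache level you're targeting:

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    // Assumed cache sizes for a 2018-era i7/i9.
    constexpr std::size_t l1d = 32 * 1024;
    constexpr std::size_t l2  = 256 * 1024;
    constexpr std::size_t num_buffers = 8;  // question says 2 to 16

    // Target about half of the cache level, split across the live buffers.
    std::size_t per_buf_l2  = (l2  / 2) / num_buffers;  // 16384 B = 4096 floats
    std::size_t per_buf_l1d = (l1d / 2) / num_buffers;  //  2048 B =  512 floats

    std::printf("L2 target:  %zu bytes per buffer\n", per_buf_l2);
    std::printf("L1d target: %zu bytes per buffer\n", per_buf_l1d);
}
```

Under these assumptions the L2 target comes out to about 16 KiB (4096 floats) per buffer and the L1d target to about 2 KiB (512 floats); profiling tells you which regime your algorithm actually prefers.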


Also remember that with Hyperthreading, the two logical cores on a physical core compete for its caches. So if two threads end up on the same physical core but are touching totally different memory, your effective cache sizes are about halved.


Probably you should make the buffer size a tunable parameter, and profile with a few different sizes.

Use perf counters to check whether you're actually avoiding L1d or L2 misses at the different sizes; that will help you understand whether your code is sensitive to different amounts of memory latency.
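
A minimal sketch of such a tuning harness, assuming C++ and a stand-in workload (the processing loop and the size range are illustrative, not your algorithm):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-in for the real algorithm: touch all buffers front to back.
static void process(std::vector<std::vector<float>>& bufs)
{
    for (auto& b : bufs)
        for (auto& x : b)
            x = x * 1.000001f + 0.5f;
}

int main()
{
    constexpr std::size_t nbufs = 8;
    constexpr std::size_t total_floats = 1 << 24;  // same total work per size

    for (std::size_t n = 256; n <= 65536; n *= 2)  // floats per buffer
    {
        std::vector<std::vector<float>> bufs(nbufs,
                                             std::vector<float>(n, 1.0f));
        std::size_t iters = total_floats / (nbufs * n);

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < iters; ++i)
            process(bufs);
        auto t1 = std::chrono::steady_clock::now();

        std::printf("%6zu floats/buffer: %9.3f ms\n", n,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    // Re-run the interesting sizes under e.g.
    //   perf stat -e L1-dcache-load-misses,LLC-load-misses ./tune
    // to see whether the timing changes line up with cache misses.
}
```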

Peter Cordes
  • Thanks for the info! I was more after some average value to start with, like say 4096 B per buffer, but good pointers anyway. – Vojtěch Melda Meluzín Aug 06 '18 at 13:47
  • @VojtěchMeldaMeluzín: If you know how your application uses your buffers, you can calculate a buffer size from the target footprint. There isn't enough info in the question to say much useful. IDK if it's common to access 16 buffers, or if that only happens rarely. – Peter Cordes Aug 06 '18 at 20:58