I am working on a project where we have to implement an algorithm that is proven in theory to be cache friendly. In simple terms, if N is the input size and B is the number of elements transferred between the cache and RAM on every cache miss, the algorithm will require O(N/B) accesses to RAM.

I would like to show that this is indeed the behavior in practice. To better understand how one can measure various cache-related hardware counters, I decided to use different tools: one is perf and the other is the PAPI library. Unfortunately, the more I work with these tools, the less I understand what they do exactly.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM; the caches are L1 256 KB, L2 1 MB, L3 6 MB (the L1 and L2 figures are totals over the four cores). The cache line size is 64 bytes, which I assume is the size of the block B.

Let's look at the following example:

#include <iostream>

using namespace std;

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int n = 1000000;

    node* A = new node[n];

    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    return 0;
}

Each node requires 8 bytes, which means that a cache line fits 8 nodes, so I should expect approximately 1000000/8 = 125000 L3 cache misses.

Without optimization (no -O3), this is the output from perf:

 perf stat -B -e cache-references,cache-misses ./cachetests 

 Performance counter stats for './cachetests':

       162,813      cache-references                                            
       142,247      cache-misses              #   87.368 % of all cache refs    

   0.007163021 seconds time elapsed

It is pretty close to what we are expecting. Now suppose that we use the PAPI library.

#include <iostream>
#include <papi.h>

using namespace std;

struct node{
    int l, r;
};

void handle_error(int err){
    std::cerr << "PAPI error: " << err << std::endl;
}

int main(int argc, char* argv[]){

    int numEvents = 2;
    long long values[2];
    int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};

    if (PAPI_start_counters(events, numEvents) != PAPI_OK)
        handle_error(1);

    int n = 1000000;
    node* A = new node[n];
    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
        handle_error(1);

    cout<<"L3 accesses: "<<values[0]<<endl;
    cout<<"L3 misses: "<<values[1]<<endl;
    cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;

    return 0;
}

This is the output that I get:

L3 accesses: 3335
L3 misses: 848
L3 miss/access ratio: 0.254273

Why such a big difference between the two tools?

jsguy

1 Answer

You can go through the source files of both perf and PAPI to find out which performance counter they actually map these events to, and it turns out they are the same (assuming an Intel Core i here): event 2E with umask 4F for references and 41 for misses. In the Intel 64 and IA-32 Architectures Software Developer's Manual these events are described as:

2EH 4FH LONGEST_LAT_CACHE.REFERENCE This event counts requests originating from the core that reference a cache line in the last level cache.

2EH 41H LONGEST_LAT_CACHE.MISS This event counts each cache miss condition for references to the last level cache.

That seems to be ok. So the problem is somewhere else.

Here are my reproduced numbers, except that I increased the array length by a factor of 100 (I noticed large fluctuations in the timing results otherwise, and with a length of 1,000,000 the array still almost fits into your L3 cache). main1 here is your first code example without PAPI and main2 your second one with PAPI.

$ perf stat -e cache-references,cache-misses ./main1 

 Performance counter stats for './main1':

        27.148.932      cache-references                                            
        22.233.713      cache-misses              #   81,895 % of all cache refs 

       0,885166681 seconds time elapsed

$ ./main2 
L3 accesses: 7084911
L3 misses: 2750883
L3 miss/access ratio: 0.388273

These obviously don't match. Let's see where we actually count the LLC references. Here are the first few lines of perf report after perf record -e cache-references ./main1:

  31,22%  main1    [kernel]          [k] 0xffffffff813fdd87                                                                                                                                   ▒
  16,79%  main1    main1             [.] main                                                                                                                                                 ▒
   6,22%  main1    [kernel]          [k] 0xffffffff8182dd24                                                                                                                                   ▒
   5,72%  main1    [kernel]          [k] 0xffffffff811b541d                                                                                                                                   ▒
   3,11%  main1    [kernel]          [k] 0xffffffff811947e9                                                                                                                                   ▒
   1,53%  main1    [kernel]          [k] 0xffffffff811b5454                                                                                                                                   ▒
   1,28%  main1    [kernel]          [k] 0xffffffff811b638a                                              
   1,24%  main1    [kernel]          [k] 0xffffffff811b6381                                                                                                                                   ▒
   1,20%  main1    [kernel]          [k] 0xffffffff811b5417                                                                                                                                   ▒
   1,20%  main1    [kernel]          [k] 0xffffffff811947c9                                                                                                                                   ▒
   1,07%  main1    [kernel]          [k] 0xffffffff811947ab                                                                                                                                   ▒
   0,96%  main1    [kernel]          [k] 0xffffffff81194799                                                                                                                                   ▒
   0,87%  main1    [kernel]          [k] 0xffffffff811947dc   

So what you can see here is that only 16.79% of the cache references actually happen in user space; the rest are due to the kernel.

And here lies the problem: comparing this to the PAPI result is unfair, because PAPI by default counts only user-space events, while perf by default collects both user- and kernel-space events.

For perf we can easily restrict collection to user space only:

$ perf stat -e cache-references:u,cache-misses:u ./main1 

 Performance counter stats for './main1':

         7.170.190      cache-references:u                                          
         2.764.248      cache-misses:u            #   38,552 % of all cache refs    

       0,658690600 seconds time elapsed

These seem to match pretty well.

Edit:

Let's look a bit closer at what the kernel does, this time with debug symbols, and at cache misses instead of references:

  59,64%  main1    [kernel]       [k] clear_page_c_e
  23,25%  main1    main1          [.] main
   2,71%  main1    [kernel]       [k] compaction_alloc
   2,70%  main1    [kernel]       [k] pageblock_pfn_to_page
   2,38%  main1    [kernel]       [k] get_pfnblock_flags_mask
   1,57%  main1    [kernel]       [k] _raw_spin_lock
   1,23%  main1    [kernel]       [k] clear_huge_page
   1,00%  main1    [kernel]       [k] get_page_from_freelist
   0,89%  main1    [kernel]       [k] free_pages_prepare

As we can see, most cache misses actually happen in clear_page_c_e. This is called when a new page is accessed by our program. As explained in the comments, new pages are zeroed by the kernel before access is allowed, so the cache misses already happen there.

This messes with your analysis, because a good part of the cache misses you expect happen in kernel space. However, you cannot guarantee under which exact circumstances the kernel actually accesses memory, so there might be deviations from the behavior expected by your code.

To avoid this, build an additional loop around your array-filling one: only the first iteration of the inner loop incurs the kernel overhead, and as soon as every page in the array has been accessed once, there should be no kernel contribution left. Here is my result for 100 repetitions of the outer loop:

$ perf stat -e cache-references:u,cache-references:k,cache-misses:u,cache-misses:k ./main1

 Performance counter stats for './main1':

     1.327.599.357      cache-references:u                                          
        23.678.135      cache-references:k                                          
     1.242.836.730      cache-misses:u            #   93,615 % of all cache refs    
        22.572.764      cache-misses:k            #   95,332 % of all cache refs    

      38,286354681 seconds time elapsed

The array length was 100,000,000 with 100 iterations, so by your analysis you would expect 1,250,000,000 cache misses. That is pretty close now. The remaining deviation comes mostly from the first iteration, during which the array is loaded into the cache by the kernel while it clears the pages.

With PAPI, a few extra warm-up loops can be inserted before the counters start, so the result fits the expectation even better:

$ ./main2 
L3 accesses: 1318699729
L3 misses: 1250684880
L3 miss/access ratio: 0.948423
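For reference, the warm-up in the PAPI version can look like this (a sketch based on the second example from the question; the warm-up pass count is a choice, and what matters is only that the warm-up runs before PAPI_start_counters):

```cpp
#include <iostream>
#include <papi.h>

struct node {
    int l, r;
};

int main() {
    int events[2] = {PAPI_L3_TCA, PAPI_L3_TCM};
    long long values[2];

    const int n = 100000000;
    const int reps = 100;
    node* A = new node[n];

    // Warm-up: touch every page before counting starts so the kernel's
    // page-zeroing misses are not attributed to the measured loops.
    for (int w = 0; w < 2; w++)
        for (int i = 0; i < n; i++) {
            A[i].l = 1;
            A[i].r = 4;
        }

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        std::cerr << "PAPI error: start" << std::endl;

    // Measured loops: by now every page has been faulted in and zeroed.
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++) {
            A[i].l = 1;
            A[i].r = 4;
        }

    if (PAPI_stop_counters(values, 2) != PAPI_OK)
        std::cerr << "PAPI error: stop" << std::endl;

    std::cout << "L3 accesses: " << values[0] << std::endl;
    std::cout << "L3 misses: " << values[1] << std::endl;

    delete[] A;
    return 0;
}
```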
  • Hmm. I see the difference in numbers too, that's correct, but what in the kernel could cause this many cache misses? The program is all about juggling memory in userspace, on my system it uses the same 55 syscalls for n of 1000000 and n of 100000000, if we're not to count the program loading the only thing it does in the kernel is mapping a region of memory. Page faults maybe? But such a big number just for that? – Roman Khimov Oct 03 '16 at 08:27
  • 2
    @RomanKhimov The kernel symbol making up the largest part of these is `clear_page_c_e`. So I think it is because each page is zeroed by the kernel before passed to userspace. This probably doesn't happen at time of allocation though, but rather at first access. I might have been wrong there. I will update my answer later with some more detailed analysis. –  Oct 03 '16 at 10:03
  • I forgot about zeroing mmaped `MAP_ANONYMOUS` memory, true, and that actually explains everything. It might be interesting to compare numbers with manual `mmap()` using `MAP_UNINITIALIZED`, that should also show the difference between warmed-up-by-zeroing cache and cold uninitialized cache. – Roman Khimov Oct 03 '16 at 10:12
  • Why does the kernel zero out the elements when this would cause unnecessary cache misses? I always thought that the new operation simply reserves a contiguous chunk of memory and does nothing else with it unless you specifically ask for it via constructors for example. Is it possible to enable such a behavior? – jsguy Oct 03 '16 at 12:37
  • @jsguy That is a security measure, because otherwise you might be able to read left-over memory from a previous process. This can be circumvented with MAP_UNINITIALIZED, but that is usually deactivated in the kernel for exactly that security reason. I don't have time right now, but I will edit my post later to show the difference. –  Oct 03 '16 at 17:41
  • @RomanKhimov I tried to test MAP_UNINITIALIZED, but it requires to disable CONFIG_MMU in the kernel. I don't think I have time to experiment with that. Do you know of a simple consistent kernel configuration for x86 allowing MAP_UNINITIALIZED? –  Oct 03 '16 at 18:40
  • @Eichhörnchen: True, it only works for no-mmu, so it's not an option for x86. OK then. – Roman Khimov Oct 03 '16 at 19:01
  • @jsguy: that depends on current program state, in general of course that's some uninitialized chunk, but in this particular case you've just started a program, new calls malloc() and it doesn't have enough memory in its pools, so it in turn gets some memory from the kernel using mmap(), and that is zeroed by the kernel as specified in the man page. – Roman Khimov Oct 03 '16 at 19:06