
I'm on an IvyBridge and want to test the L1d cache organization. My understanding is as follows:

On IvyBridge, the L1d cache has 32K capacity, 64B cache lines, and is 8-way set associative. Therefore it has 32K/(64*8) = 64 sets. Given a main memory address, the set index can be computed as (addr/64) % 64.
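
For example, the set index is just address bits [11:6]; a minimal sketch (it assumes the address to check is in rdi, purely for illustration):

; set index = (addr / 64) % 64
    shr rdi,    6       ; drop the 6 offset-within-line bits (64B lines)
    and rdi,    63      ; keep the low 6 bits: set index 0..63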

So if I step through memory with a stride of 64*64 bytes (4K), I will always touch the same L1d set. A set only holds 8 cache lines, so if I loop over 16 such addresses I should get almost 100% L1d cache misses.

I wrote the following program to verify this:

section .bss
align   4096
buf:    resb    1<<26

%define gap 64 * 64 ; no L1 cache miss

; %define gap 64 * 64 * 256 ; 41% L1 cache miss

; %define gap 64 * 64 * 512 ; 95% L1 cache miss
; however, total cycle suggests this gap is already at L3 latency level with complete L2 cache miss.

section .text
global _start
_start:
    mov rcx,    10000000
    xor rax,    rax
loop:
    mov rax,    [buf+rax]
    mov rax,    [buf+rax+gap*1]
    mov rax,    [buf+rax+gap*2]
    mov rax,    [buf+rax+gap*3]
    mov rax,    [buf+rax+gap*4]
    mov rax,    [buf+rax+gap*5]
    mov rax,    [buf+rax+gap*6]
    mov rax,    [buf+rax+gap*7]

    mov rax,    [buf+rax+gap*8]
    mov rax,    [buf+rax+gap*9]
    mov rax,    [buf+rax+gap*10]
    mov rax,    [buf+rax+gap*11]
    mov rax,    [buf+rax+gap*12]
    mov rax,    [buf+rax+gap*13]
    mov rax,    [buf+rax+gap*14]
    mov rax,    [buf+rax+gap*15]

    dec rcx
    jne loop

    xor rdi,    rdi
    mov rax,    60
    syscall

To my surprise, perf shows there are no L1 cache misses at all:

  160,494,057      L1-dcache-loads
        4,290      L1-dcache-load-misses     #    0.00% of all L1-dcache hits

What is wrong in my understanding?

user10865622
  • All BSS pages are initially mapped copy-on-write to the same physical zero page. You'll get TLB misses (and maybe soft page faults) but no L1d misses. If you dirtied them first by writing them, you wouldn't see this. Or maybe if you allocated with `mmap(MAP_POPULATE)`, I think. That would pre-fault them at least, avoiding soft page faults but maybe still to the same physical zero page. – Peter Cordes Jan 07 '19 at 05:22
  • @PeterCordes But I run this loop over the same memory addresses; the page fault will only occur in the first iteration – user10865622 Jan 07 '19 at 05:25
  • Oh, you're not striding through that whole giant BSS, you're only accessing out to `gap*15` repeatedly, and RAX stays zero. So you'll only ever have one soft page fault and maybe a few TLB misses. Why are you using RAX at all to create a dependency chain, instead of just RIP-relative addressing of `[buf + gap*n]`? – Peter Cordes Jan 07 '19 at 05:29
  • Accessing more memory with a larger gap might lead your kernel to start using a 2M hugepage for your BSS, ironically hurting it by making them no longer alias to the same 4k physical page. L2 misses are expected because it's only 8-way associative as well. – Peter Cordes Jan 07 '19 at 05:31
  • @PeterCordes Ah... So it's an effect of the MMU, because the giant BSS is not contiguous in physical memory? – user10865622 Jan 07 '19 at 05:33
  • Yes, exactly. Copy-on-write and lazy mapping can be gotchas in microbenchmarking. Especially the effects from larger strides caused (I think) by transparent hugepages are totally non-obvious. – Peter Cordes Jan 07 '19 at 05:35
  • @PeterCordes Is there any workaround? My purpose is to test L2 latency, so I came up with this benchmark. – user10865622 Jan 07 '19 at 05:38
  • Yeah, I already suggested a few: dirty the memory first (by writing a byte in every page), allocate it dynamically with `mmap(MAP_POPULATE)`, or put it in the `.data` or `.rodata` section where it will actually be mapped with a file backing. (You'll have to make it much smaller, because the zeros will actually be in the executable). – Peter Cordes Jan 07 '19 at 05:40
  • BTW, you might want a gap that isn't a power of 2. Just a multiple of the L1 aliasing stride, but *not* of the L2 aliasing stride, so your data can distribute through many sets in L2. – Peter Cordes Jan 07 '19 at 05:41
  • I was looking for duplicates. [Is it true, that modern OS may skip copy when realloc is called](https://stackoverflow.com/q/16765389) isn't, but it has some good stuff about virtual memory / COW of zero pages for `mremap`. – Peter Cordes Jan 07 '19 at 05:43
  • @PeterCordes Thanks, I will accept it if you post these comments as answer. – user10865622 Jan 07 '19 at 05:43

1 Answer


All BSS pages are initially mapped copy-on-write to the same physical zero page. You'll get TLB misses (and maybe soft page faults) but no L1d misses.

To avoid this and get them mapped to different physical pages:

  • dirty them first by writing a byte to each page (see the sketch after this list)
  • maybe allocate with mmap(MAP_POPULATE) instead of using the BSS. That would pre-fault them at least, avoiding soft page faults but maybe still to the same physical zero page.
  • put buf in the .data or .rodata section, where it will actually be mapped with a file backing. (You'll have to make it much smaller, because the zeros will actually be in the executable).
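
For example, here's a rough sketch of the first two options in the same bare-metal NASM style as the question (syscall number and flag values are for x86-64 Linux; the label name is just illustrative):

; option 1: touch one byte per 4K page of buf so every page gets its own physical frame
    mov rdi,    buf
    mov rcx,    (1<<26) / 4096
dirty_page:
    mov byte [rdi], 1
    add rdi,    4096
    dec rcx
    jnz dirty_page

; option 2 (alternative): anonymous mmap with MAP_POPULATE instead of the BSS
    mov rax,    9               ; __NR_mmap
    xor rdi,    rdi             ; addr = NULL, let the kernel choose
    mov rsi,    1<<26           ; length
    mov rdx,    3               ; PROT_READ | PROT_WRITE
    mov r10,    0x8022          ; MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE
    mov r8,     -1              ; fd = -1 for an anonymous mapping
    xor r9,     r9              ; offset = 0
    syscall                     ; rax = pointer to the new buffer (or -errno on failure)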

The more interesting (to me) result is that you do start to get cache misses with a larger stride. You're accessing more total 4k pages then, and this might lead your kernel to start using a 2M hugepage for your BSS, ironically hurting it by making them no longer alias to the same 4k physical page. You could check /proc/PID/smaps to see if there's a non-zero AnonHugePages value for that mapping.
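
If you want to control that yourself rather than leaving it to khugepaged, one option is an explicit madvise on the buffer before the timed loop (a sketch; syscall number and advice values are for x86-64 Linux):

; opt the buffer out of (or into) transparent hugepages
    mov rax,    28              ; __NR_madvise
    mov rdi,    buf             ; start address (buf is already 4K-aligned)
    mov rsi,    1<<26           ; length of the region
    mov rdx,    15              ; MADV_NOHUGEPAGE; use 14 (MADV_HUGEPAGE) to request THP instead
    syscall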


L2 misses are expected because it's only 8-way associative as well, but L3 is more associative and uses a non-simple indexing function that distributes any simple power of 2 stride over multiple sets. (Which cache mapping technique is used in intel core i7 processor?)

BTW, you might want a gap that isn't a power of 2. Just a multiple of the L1 aliasing stride, but not of the L2 aliasing stride, so your data can distribute through many sets in L2.
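
For example, with IvyBridge's 256K 8-way L2 (256K / (8 ways * 64B lines) = 512 sets, i.e. a 32 KiB aliasing stride), something like this still aliases in L1d but spreads across L2 sets:

; 20 KiB: a multiple of the 4 KiB L1d aliasing stride, but not of the 32 KiB L2 aliasing stride
%define gap 64 * 64 * 5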

I was looking for duplicates but didn't find an exact one, although I'm pretty sure I've explained this before somewhere on SO >.<. Probably I'm thinking of How can I obtain consistently high throughput in this loop? where it was exactly this same issue with malloc, not the BSS.

Peter Cordes
  • @user10865622: Oh right, you specifically want contiguous physical memory so you can control aliasing for L2. My "solutions" were just aimed at getting them mapped to separate physical pages, not necessarily contiguous. But yes, `madvise(MADV_HUGEPAGE)` would be a good approach. – Peter Cordes Jan 08 '19 at 12:46
  • But note that L1d aliasing should only depend on offset *within* a page. It's a VIPT cache that has its size and associativity specifically chosen so that the index bits come only from the offset-within-4k-page part of the address, so it can index in parallel with the TLB lookup. My answer on [Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?](https://stackoverflow.com/a/38549736) explains some of the reasoning behind how this gives you a VIPT cache that behaves like PIPT except for speed: no homonym or synonym aliasing problems. – Peter Cordes Jan 08 '19 at 12:50
    @user10865622: possibly HW prefetching is able to keep up, when you're bottlenecking your loop on load latency, if the data is hot in L2 and only needs fetching into L1d. You could use perf counters to check L1d replacements, and see if that's why you're getting hits. (`perf stat -e your_usual_counters,l1d.replacement -r3 ./testprog`). This could happen if you have aliasing in L1d but not L2. – Peter Cordes Jan 08 '19 at 12:52