I'm on an Ivy Bridge CPU and want to test the L1d cache organization. My understanding is as follows:
On Ivy Bridge, the L1d cache has 32K capacity, 64B cache lines, and is 8-way set associative. Therefore it has 32K/(64*8) = 64 sets. Given a main memory address addr, the set index can be computed as (addr/64) % 64.
So if I step through memory with a stride of 64*64 bytes (4K), every access should map to the same L1d set. A set holds only 8 cache lines, so if I loop over 16 such steps, I expect an almost 100% L1d miss rate.
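To double-check the arithmetic, here is a minimal C sketch of that model. The names l1d_set_index, count_distinct_sets and check_model are hypothetical helpers of mine, and the constants assume the 32K / 64B / 8-way geometry above. It only applies (addr/64) % 64 to whatever number it is given, so it says nothing about which address the hardware actually indexes with.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed L1d geometry from the description above: 32K, 64B lines, 8 ways. */
enum {
    LINE_SIZE  = 64,
    WAYS       = 8,
    CACHE_SIZE = 32 * 1024,
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * WAYS)   /* 64 sets */
};

/* Hypothetical helper: set index under the simple (addr/64) % 64 model. */
unsigned l1d_set_index(uint64_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Count how many distinct sets are touched by `steps` loads that start
   at `base` and advance by `gap` bytes each time. */
unsigned count_distinct_sets(uint64_t base, uint64_t gap, unsigned steps) {
    unsigned char seen[NUM_SETS] = {0};
    unsigned distinct = 0;
    for (unsigned i = 0; i < steps; i++) {
        unsigned set = l1d_set_index(base + (uint64_t)i * gap);
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}

/* Self-check of the model; 0x600000 is an arbitrary 4K-aligned example address. */
void check_model(void) {
    assert(NUM_SETS == 64);
    assert(count_distinct_sets(0x600000, 64 * 64, 16) == 1);  /* 4K stride: one set */
    assert(count_distinct_sets(0x600000, 64, 16) == 16);      /* 64B stride: 16 sets */
}
```

Under this model, 16 loads at a 4K stride all land in one set, which is twice what 8 ways can hold. That is why I expected them to keep evicting each other.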
I wrote the following program to verify this:
section .bss
align 4096
buf: resb 1<<26
%define gap 64 * 64 ; no L1 cache miss
; %define gap 64 * 64 * 256 ; 41% L1 cache miss
; %define gap 64 * 64 * 512 ; 95% L1 cache miss
; however, the total cycle count suggests this gap is already at L3 latency, with every access missing in L2.
section .text
global _start
_start:
mov rcx, 10000000
xor rax, rax
loop:
mov rax, [buf+rax]
mov rax, [buf+rax+gap*1]
mov rax, [buf+rax+gap*2]
mov rax, [buf+rax+gap*3]
mov rax, [buf+rax+gap*4]
mov rax, [buf+rax+gap*5]
mov rax, [buf+rax+gap*6]
mov rax, [buf+rax+gap*7]
mov rax, [buf+rax+gap*8]
mov rax, [buf+rax+gap*9]
mov rax, [buf+rax+gap*10]
mov rax, [buf+rax+gap*11]
mov rax, [buf+rax+gap*12]
mov rax, [buf+rax+gap*13]
mov rax, [buf+rax+gap*14]
mov rax, [buf+rax+gap*15]
dec rcx
jne loop
xor rdi, rdi
mov rax, 60
syscall
To my surprise, perf shows almost no L1 cache misses at all:
160,494,057 L1-dcache-loads
4,290 L1-dcache-load-misses # 0.00% of all L1-dcache hits
What is wrong in my understanding?