Why do L1 and L2 Cache waste space saving the same data?

Question

I don't know why L1 Cache and L2 Cache save the same data.

For example, say we access Memory[x] for the first time. Memory[x] is first brought into the L2 cache, and then the same data is copied into the L1 cache, where the CPU can load it into a register.

But now the same data is stored in both the L1 and L2 caches. Isn't that a problem, or at least a waste of storage space?

It can be a problem when L2 is shared between multiple cores since you may have multiple different copies of the same cache line. [A cache coherence protocol](https://en.wikipedia.org/wiki/Cache_coherence) would be required to maintain coherence. Otherwise, if there is only one core, then no problem. — Hadi Brais, Apr 11 '18 at 23:39
@Hadi Brais, if there is only one core, duplicated data still exists in the L1 and L2 caches, which is not good, is it? — amjad, Apr 11 '18 at 23:57
How? That core can only change the data in L1. Then when it gets evicted from L1, the changes are propagated to L2. The core cannot directly access L2, it has to go through L1. So the copy in L2 may get only *temporarily* incoherent, and that is never observed by the core. — Hadi Brais, Apr 12 '18 at 00:04
@HadiBrais: I think the OP is wondering about the *performance* downside (cache capacity) of wasting space storing the same data twice with a Not-Inclusive / Not-Exclusive [cache inclusion policy](https://en.wikipedia.org/wiki/Cache_inclusion_policy). Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. [AMD K10 / Barcelona](https://www.realworldtech.com/barcelona/7/)), so an L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is a victim cache, and thus *mostly* exclusive of L2. — Peter Cordes, Apr 12 '18 at 03:14
@PeterCordes Yeah maybe. I thought the OP is confused about how the core interacts with two caches L1 and L2. I didn't mention the inclusive/exclusive/not-Inclusive terms to keep it as simple as possible. — Hadi Brais, Apr 12 '18 at 03:18

score 4 · Answer 1 · answered Apr 12 '18 at 09:50

I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking.

Not all caches are like that. The Cache Inclusion Policy for an outer cache can be Inclusive, Exclusive, or Not-Inclusive / Not-Exclusive (NINE).

NINE is the "normal" case, not maintaining either special property, but L2 does tend to have copies of most lines in L1 for the reason you describe in the question. If L2 is less associative than L1 (like in Skylake-client) and the access pattern creates a lot of conflict misses in L2 (unlikely), you could get a decent amount of data that's only in L1. And maybe in other ways, e.g. via hardware prefetch, or from L2 evictions of data due to code-fetch, because real CPUs use split L1i / L1d caches.

For the outer caches to be useful, you need some way for data to enter them so you can get an L2 hit sometime after the line was evicted from the smaller L1. Having inner caches like L1d fetch through outer caches gives you that for free, and has some advantages. You can put hardware prefetch logic in an outer or middle level of cache, which doesn't have to be as high-performance as L1. (e.g. Intel CPUs have most of their prefetch logic in the private per-core L2, but also some prefetch logic in L1d).
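To make the fetch-through behaviour concrete, here is a minimal sketch of a two-level hierarchy where every L1 miss is filled through L2. It is a toy model, not how any real CPU is built: both caches are assumed fully associative with true LRU, only clean lines are modelled, and the names `LRUCache` and `access_fetch_through` are made up for this example.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache of line addresses with LRU replacement (toy model)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first

    def hit(self, addr):
        """Return True and refresh recency if addr is present."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def fill(self, addr):
        """Insert addr as most recently used; return the evicted address or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

def access_fetch_through(l1, l2, addr):
    """L1 misses fetch through L2, so the line ends up in both levels."""
    if l1.hit(addr):
        return "L1 hit"
    result = "L2 hit" if l2.hit(addr) else "memory"
    if result == "memory":
        l2.fill(addr)  # an L2 victim is simply dropped; L1 may keep its copy (NINE)
    l1.fill(addr)      # an L1 victim is dropped too (clean lines only in this model)
    return result

l1, l2 = LRUCache(2), LRUCache(4)
trace = [access_fetch_through(l1, l2, a) for a in [0, 1, 2, 0]]
print(trace)  # ['memory', 'memory', 'memory', 'L2 hit']
```

Line 0 is pushed out of the 2-line L1 by line 2, but the copy that was left in L2 on the way in turns the next access into an L2 hit instead of a trip to memory. After that access, line 0 is present in both levels again, which is exactly the duplication the question asks about.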

The other main option is for the outer cache to be a victim cache, i.e. lines enter it only when they're evicted from L1. So you can loop over an array of L1 + L2 size and probably still get L2 hits. The extra logic to implement this is useful if you want a relatively large L1 compared to L2, so the total size is more than a little larger than L2 alone.

With an exclusive L2, an L1 miss / L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set.
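The capacity difference can be sketched with a toy simulation. The assumptions are mine, not from any real design: fully associative caches with true LRU, a 2-line L1 and a 4-line L2, clean lines only, and hypothetical helper names. It loops twice over 6 lines (L1 + L2 size) and looks at what the second pass costs under each policy.

```python
from collections import OrderedDict

def make_cache(capacity):
    return {"cap": capacity, "lines": OrderedDict()}

def insert(cache, addr):
    """Insert addr as most recently used; return the evicted LRU address or None."""
    cache["lines"][addr] = True
    cache["lines"].move_to_end(addr)
    if len(cache["lines"]) > cache["cap"]:
        return cache["lines"].popitem(last=False)[0]
    return None

def access_exclusive(l1, l2, addr):
    """L2 acts as a victim cache: lines enter it only when evicted from L1."""
    if addr in l1["lines"]:
        l1["lines"].move_to_end(addr)
        return "L1 hit"
    if addr in l2["lines"]:
        del l2["lines"][addr]   # the line moves up, so it is never duplicated
        result = "L2 hit"
    else:
        result = "memory"       # the fill from memory bypasses L2
    victim = insert(l1, addr)
    if victim is not None:
        insert(l2, victim)      # the L1 victim takes a slot in L2 (the exchange)
    return result

def access_fetch_through(l1, l2, addr):
    """Baseline: L1 misses are filled through L2, leaving a copy in both."""
    if addr in l1["lines"]:
        l1["lines"].move_to_end(addr)
        return "L1 hit"
    if addr in l2["lines"]:
        l2["lines"].move_to_end(addr)
        result = "L2 hit"
    else:
        insert(l2, addr)
        result = "memory"
    insert(l1, addr)
    return result

def second_pass(access, n_lines):
    l1, l2 = make_cache(2), make_cache(4)
    for a in range(n_lines):    # first pass warms the caches
        access(l1, l2, a)
    return [access(l1, l2, a) for a in range(n_lines)]

excl = second_pass(access_exclusive, 6)
nine = second_pass(access_fetch_through, 6)
print(excl.count("L2 hit"), nine.count("memory"))  # 6 6
```

In this model the exclusive hierarchy serves the whole second pass from L2, while the fetch-through hierarchy goes to memory for every line because the 6-line loop thrashes the 4-line LRU L2. Real caches are set-associative and rarely use true LRU, so take this as an illustration of the capacity argument rather than a performance prediction.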

Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. AMD K10 / Barcelona). Both of those caches are private per-core caches, not shared, so it's like the simple L1 / L2 situation for a single core CPU you're talking about.

Things get more complicated with multi-core CPUs and shared caches!

Barcelona's shared L3 cache is also mostly exclusive of the inner caches, but not strictly. David Kanter explains:

First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines.

AMD's successor to K10/Barcelona is Bulldozer. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is also a victim cache, and thus mostly exclusive of L2. It's probably like Barcelona's L3.

But Bulldozer's L1d is a small write-through cache with an even smaller (4 KiB) write-combining buffer, so L2 is mostly inclusive of L1d. Bulldozer's write-through L1d is generally considered a mistake in the CPU design world, and Ryzen went back to a normal 32 KiB write-back L1d like Intel has been using all along (with great results). A pair of weak integer cores forms a "cluster" that shares an FPU/SIMD unit and a big L2 that's "mostly inclusive" (i.e. probably a standard NINE). This cluster design is Bulldozer's alternative to SMT / Hyperthreading, which AMD also ditched for Ryzen in favour of normal SMT with a massively wide out-of-order core.

Ryzen also has some exclusivity between core clusters (CCX), apparently, but I haven't looked into the details.

I've been talking about AMD first because they have used exclusive caches in recent designs, and seem to have a preference for victim caches. Intel hasn't tried as many different things, because they hit on a good design with Nehalem and stuck with it until Skylake-AVX512.

Intel Nehalem and later use a large shared tag-inclusive L3 cache. For lines that are modified / exclusive (MESI) in a private per-core L1d or L2 (NINE) cache, the L3 tags still indicate which cores (might) have a copy of a line, so requests from one core for exclusive access to a line don't have to be broadcast to all cores, only to cores that might still have it cached. (i.e. it's a snoop filter for coherency traffic, which lets CPUs scale up to dozens of cores per chip without flooding each other with requests when they're not even sharing memory.)

i.e. L3 tags hold info about where a line is (or might be) cached in an L2 or L1 somewhere, so it knows where to send invalidation messages instead of broadcasting messages from every core to all other cores.
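The snoop-filter idea can be sketched as a per-line set of cores that might hold a copy. This is only an illustration under my own assumptions: the `SnoopFilter` class and its methods are hypothetical, and real hardware keeps this information in the L3 tags, tracks MESI states, and can be conservative (a core may have silently dropped a line that is still marked).

```python
class SnoopFilter:
    """Toy directory: for each line, the set of cores that might hold a copy."""
    def __init__(self):
        self.presence = {}  # line address -> set of core ids

    def read(self, core, addr):
        """A core obtains a shared copy; record that it may now cache the line."""
        self.presence.setdefault(addr, set()).add(core)

    def request_exclusive(self, core, addr):
        """Return the cores that need an invalidation, then record sole ownership."""
        targets = self.presence.get(addr, set()) - {core}
        self.presence[addr] = {core}
        return sorted(targets)

sf = SnoopFilter()
sf.read(0, 0x40)
sf.read(2, 0x40)
sf.read(1, 0x80)
to_invalidate = sf.request_exclusive(2, 0x40)
print(to_invalidate)  # [0]
```

Core 2 asking for exclusive access to line 0x40 only generates a message to core 0. Core 1, which never touched that line, sees no traffic at all, which is the scaling benefit compared to broadcasting every request to every core.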

With Skylake-X (Skylake-server / SKX / SKL-SP), Intel dropped that and made L3 NINE and only a bit bigger than the total per-core L2 size. But there's still a snoop filter, it just doesn't have data. I don't know what Intel's planning to do for future (dual?)/quad/hex-core laptop / desktop chips (e.g. Cannonlake / Icelake). That's small enough that their classic ring bus would still be great, so they could keep doing that in mobile/desktop parts and only use a mesh in high-end / server parts, like they are in Skylake.

Realworldtech forum discussions of inclusive vs. exclusive vs. non-inclusive:

CPU architecture experts spend time discussing what makes for a good design on that forum. While searching for stuff about exclusive caches, I found this thread, where some disadvantages of strictly inclusive last-level caches are presented. e.g. they force private per-core L2 caches to be small (otherwise you waste too much space with duplication between L3 and L2).

Also, L2 caches filter requests to L3, so L3 never sees the hits that are satisfied in L1 / L2. When L3's LRU algorithm needs to drop a line, the one it has seen least recently can easily be one that stays permanently hot in the L2 / L1 of a core. But when an inclusive L3 decides to drop a line, it has to evict it from all inner caches that have it, too!
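This back-invalidation problem shows up even in a very small model. The sketch below assumes a single core with a 2-line private L1 and a 4-line strictly inclusive L3, both fully associative with true LRU; the function names and sizes are made up for illustration, and L2 is left out to keep it short.

```python
from collections import OrderedDict

def lru_insert(cache, cap, addr):
    """Insert addr as most recently used; return the evicted LRU address or None."""
    cache[addr] = True
    cache.move_to_end(addr)
    return cache.popitem(last=False)[0] if len(cache) > cap else None

def run(trace, l1_cap=2, l3_cap=4):
    l1, l3 = OrderedDict(), OrderedDict()
    results = []
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)      # L1 hit: L3 never sees this access
            results.append("L1 hit")
            continue
        if addr in l3:
            l3.move_to_end(addr)
            results.append("L3 hit")
        else:
            results.append("memory")
            victim = lru_insert(l3, l3_cap, addr)
            if victim is not None:
                l1.pop(victim, None)  # inclusion: back-invalidate the inner copy
        lru_insert(l1, l1_cap, addr)
    return results

HOT = 100
trace = [HOT, 1, HOT, 2, HOT, 3, HOT, 4, HOT]
results = run(trace)
print(results[-1])  # memory
```

Line `HOT` is touched on every other access and always hits in L1, so from L3's point of view it was used only once, long ago. When line 4 needs a slot, L3 picks `HOT` as its LRU victim and has to remove it from L1 as well, so the final access goes all the way to memory even though the line was the hottest one in the trace.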

David Kanter replied with an interesting list of advantages for inclusive outer caches. I think he's comparing to exclusive caches, rather than to NINE. e.g. his point about data sharing being easier only applies vs. exclusive caches, where I think he's suggesting that a strictly exclusive cache hierarchy might cause evictions when multiple cores want the same line even in a shared/read-only manner.
