
I was wondering what the advantages and disadvantages of linked lists compared to contiguous arrays are in C. Therefore I read the Wikipedia article about linked lists: https://en.wikipedia.org/wiki/Linked_list#Disadvantages

According to this article, the disadvantages are the following:

  • They use more memory than arrays because of the storage used by their pointers.
  • Nodes in a linked list must be read in order from the beginning, as linked lists are inherently sequential access.
  • Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards, and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for the back pointers.
  • Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.

I understand the first three points, but I am having a hard time with the last one:

Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.

The Wikipedia article about CPU caches does not mention anything about non-contiguous arrays. As far as I know, CPU caches just cache frequently used addresses, with something like a 10^-6 cache-miss rate.

Therefore, I do not understand why the CPU cache should be less efficient when it comes to non-contiguous memory.

ouphi
    Possible duplicate of [Linked lists, arrays, and hardware memory caches](http://stackoverflow.com/questions/36371706/linked-lists-arrays-and-hardware-memory-caches) – Bo Persson Oct 16 '16 at 14:58

4 Answers

16

CPU caches actually do two things.

The one you mentioned is caching recently used memory.

The other, however, is predicting which memory is going to be used in the near future. The algorithm is usually quite simple: it assumes that the program processes a big array of data, and whenever it accesses some memory, it prefetches a few more bytes ahead.

This doesn't work for a linked list, as the nodes may be placed anywhere in memory.

Additionally, the CPU loads memory in bigger blocks (64 or 128 bytes). So for an int64 array, a single read gives it data for processing 8 or 16 elements. For a linked list it reads one block, and the rest of it may be wasted, as the next node can be in a completely different chunk of memory.

And last but not least, related to the previous point: a linked list takes more memory for its management; the simplest version will take at least an additional sizeof(pointer) bytes for the pointer to the next node. But at that point it's not so much about the CPU cache anymore.
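A quick way to see that overhead on your own machine is to compare element sizes directly (a minimal sketch; the exact numbers depend on your ABI and struct padding):

#include <stdio.h>

/* A minimal singly linked node holding one int. On a typical 64-bit
   platform the pointer is 8 bytes and the struct is padded to 16 bytes,
   so a 64-byte cache line holds only 4 nodes, versus 16 plain ints. */
struct node {
    struct node *next;
    int value;
};

int main(void) {
    printf("sizeof(int)         = %zu\n", sizeof(int));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    printf("ints per 64B line   = %zu\n", 64 / sizeof(int));
    printf("nodes per 64B line  = %zu\n", 64 / sizeof(struct node));
    return 0;
}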

Zbynek Vyskovsky - kvr000
  • The big difference isn't so much prediction as granularity. That is, the cache consists of individual cache lines, each storing a small contiguous block of memory. Thus for unpredictable or cold accesses, if each element is, say, 16 bytes and a cache line 128 bytes, then for a linear array scan only every eighth access is a miss, whereas a linked list might end up wasting 7/8ths of the memory bandwidth. – doynax Oct 16 '16 at 15:23
  • @doynax, it's quite a lot of both, these days. But you're right that the answer could be better discussing granularity (spatial locality) and prediction as separate concepts. – sh1 Nov 04 '16 at 09:45
10

The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.

One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc may be spread around in memory, especially if nodes are allocated at different times or from different threads. In practice, however, many nodes are often allocated on the same thread at the same time, and these will often end up quite contiguous in memory, because good malloc implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocate the fixed-sized nodes from one or more contiguous chunks of memory, which provides great spatial locality.
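For instance, a bare-bones per-object pool allocator might look something like this (a sketch only; the node_pool name and API are made up here, and a real pool would also grow in chunks, reuse freed nodes, and so on):

#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int item;
};

/* Hypothetical fixed-capacity pool: nodes are handed out sequentially
   from one contiguous chunk, so consecutively allocated nodes sit next
   to each other in memory and share cache lines. */
struct node_pool {
    struct node *chunk;
    size_t used;
    size_t capacity;
};

int node_pool_init(struct node_pool *pool, size_t capacity) {
    pool->chunk = malloc(capacity * sizeof(struct node));
    pool->used = 0;
    pool->capacity = capacity;
    return pool->chunk != NULL;
}

struct node *node_pool_alloc(struct node_pool *pool) {
    if (pool->used == pool->capacity)
        return NULL;   /* a real implementation would allocate another chunk */
    return &pool->chunk[pool->used++];
}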

So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on whether you are adding most or all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).

Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n) in a linked list and O(1) in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1) time in a linked list (once you have a pointer to the insertion point), and O(n) time in an array, so the former wins in that scenario.
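To make that comparison concrete, here is roughly what each insertion looks like (a sketch, assuming you already hold a pointer to the predecessor node, and that the array has spare capacity):

#include <string.h>

struct node {
    struct node *next;
    int item;
};

/* Linked list: O(1) given a pointer to the predecessor node. */
void list_insert_after(struct node *prev, struct node *newnode) {
    newnode->next = prev->next;
    prev->next = newnode;
}

/* Array: O(n), because everything after the insertion point must shift. */
void array_insert(int *array, unsigned int *size, unsigned int pos, int val) {
    memmove(&array[pos + 1], &array[pos], (*size - pos) * sizeof(int));
    array[pos] = val;
    (*size)++;
}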

What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...

Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n) operation regardless of whether you use a linked or array representation. So it's a tie, right?

Not so fast! The actual performance can vary a lot! Here is what typical find() implementations would look like when compiled at the -O2 optimization level by x86 gcc, thanks to godbolt, which makes this easy.

Array

C Code

int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i=0; i < size; i++) {
      if (array[i] == val)
        return i;
    }

    return -1;
}

Assembly (loop only)1

.L6:
        add     rsi, 4
        cmp     DWORD PTR [rsi-4], edi
        je      .done
        add     eax, 1
        cmp     edx, eax
        jne     .L6

Linked List

C Code

struct Node {
  struct Node *next;
  int item;
};

struct Node * find_list(int val, struct Node *listptr) {
    while (listptr) {
      if (listptr->item == val)
        return listptr;
      listptr = listptr->next;
    }
    return 0;
}

Assembly (loop only)

.L20:
        cmp     DWORD PTR [rax+8], edi
        je      .done
        mov     rax, QWORD PTR [rax]
        test    rax, rax
        jne     .L20

Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i, a couple of comparisons, and one memory access to read the value from the array. The linked list version is going to have a couple of (adjacent) memory accesses to read the item and next members, and a couple of comparisons.

The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.

If you test it though - with both lists fully resident in L1, you'll find that the array version executes at about 1.5 cycles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by its loop-carried dependency on listptr. The one line listptr = listptr->next boils down to one instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next before you can calculate listptr->next->next). Even though modern CPUs can execute something like 2 loads every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.

The array version also has loads, but the address doesn't depend on the prior load:

add     rsi, 4
cmp     DWORD PTR [rsi-4], edi

It depends only on rsi, which is simply calculated by adding 4 each iteration. An add has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.

This isn't unique to "find" - any operation that needs to iterate over many elements of a linked structure will have this pointer-chasing behavior, which is inherently slow on modern hardware.
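If you want to reproduce the measurement, a rough harness might look like this (a sketch only: it assumes a POSIX clock_gettime, keeps both structures small enough to stay L1-resident, and allocates the nodes contiguously on purpose so that it isolates the dependency chain rather than cache misses):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct Node {
    struct Node *next;
    int item;
};

int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] == val)
            return i;
    }
    return -1;
}

struct Node * find_list(int val, struct Node *listptr) {
    while (listptr) {
        if (listptr->item == val)
            return listptr;
        listptr = listptr->next;
    }
    return NULL;
}

int main(void) {
    enum { N = 1024, REPS = 100000 };   /* small enough to stay in L1 */
    int *array = malloc(N * sizeof *array);
    struct Node *nodes = malloc(N * sizeof *nodes);
    for (int i = 0; i < N; i++) {
        array[i] = i;
        nodes[i].item = i;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }

    volatile int sink = 0;   /* keep the loops from being optimized away */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        sink += find_array(-1, array, N);   /* -1 is never found: full scan */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns_array = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        sink += (find_list(-1, nodes) != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns_list = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

    printf("array: %.2f ns/element\n", ns_array / ((double)N * REPS));
    printf("list:  %.2f ns/element\n", ns_list / ((double)N * REPS));
    free(array);
    free(nodes);
    return 0;
}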


1I omitted the prologue and epilogue for each assembly function because they really aren't doing anything interesting. Both versions had no epilogue at all, and the prologue was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.

2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi as the pointer into the array and eax as the index i. This means two separate cmp instructions, and two increments. Better would have been to maintain only the pointer rsi in the loop, and to compare against (array + 4*size) as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp by having rsi run from -4*size up to zero, and indexing into array using [rdi + rsi], where rdi holds array + 4*size. It shows that even today optimizing compilers aren't getting everything right!

BeeOnRope
  • The implicit assumption is that linked lists are used predominantly where you have a real need for O(1) insertion, and therefore will almost always become mangled and non-contiguous over time. Otherwise you would have chosen some partially linked solution (a set of linked vectors, perhaps). Also - you missed the benefits of cache prefetching over arrays. – Leeor Oct 18 '16 at 05:28
  • Is that your assumption, the OP's assumption, or everyone's? Of course, for a single specific use of a list data structure, with a well-known and predictable access pattern, you might have the luxury of choosing exactly the right list type. In the real world you are often comparing arrays and linked lists and other types for a variety of scenarios (e.g., choosing the default list type for a language, or application, etc.) which may not have an obvious choice. – BeeOnRope Oct 18 '16 at 20:17
  • Also, if you are doing a ton of inserts, the O(1) behavior of a linked data structure is going to crush the array-based O(n) behavior, unless the list is short. So in that case you don't need all the subtle arguments about coefficients - you just use the structure that doesn't have terrible behavior for your common operation. I didn't cover prefetching or any of the other reasons in my answer since they are covered well by the existing answers. I wanted to add something new :) – BeeOnRope Oct 18 '16 at 20:20
  • @BeeOnRope if I implement a linked list as an array, will it lead to better utilization of cache lines, especially if the number of inserts is a multiple of 64? – gansub Oct 13 '18 at 10:16
  • It depends what you mean by "implement the linked list as an array". Basically anything you do so that the nodes are stored close together, and tightly packed is going to help. The number of inserts being a multiple of 64 is not important. @gansub – BeeOnRope Oct 13 '18 at 15:19
3

Memory is fetched in fixed-size blocks: a cache line is commonly 64 bytes, while virtual memory is managed in pages of, commonly, 4096 bytes or 4 KiB; the same reasoning applies at either granularity. Fetching a block takes a considerable amount of time, let's say 1000 cycles. If we have a contiguous 4096-byte array, we will fetch one 4096-byte page and most of the data will probably be there. If not, maybe we need to fetch one more page to get the rest of the data.

Example: We have two pages covering addresses 0-8191, and the array lies between addresses 2048 and 6244. Then we will fetch page #1 (0-4095) to get the first elements, and page #2 (4096-8191) to get the rest. This results in fetching 2 pages from memory to our cache to get our data.

What happens with a list, though? In a list the data is non-contiguous, which means that the elements are not in adjacent places in memory, so they are probably scattered across various pages. This means that the CPU has to fetch a lot of pages from memory to the cache to get the desired data.

Example: Node#1 mem_address = 1000, Node#2 mem_address = 5000, Node#3 mem_address = 18000. If the CPU fetches in 4 KiB blocks, it has to fetch 3 different pages from memory to find the data it wants.
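Plugging both examples into code makes the block counting explicit (a small sketch; 4096 bytes is the block size assumed above):

#include <stdio.h>

#define BLOCK 4096u   /* block (page) size assumed in the examples above */

int main(void) {
    /* Array example: addresses 2048 through 6244 span blocks 0 and 1. */
    printf("array touches %u blocks\n", 6244u / BLOCK - 2048u / BLOCK + 1);

    /* List example: each node lands in a different block (0, 1, and 4). */
    unsigned addr[] = { 1000u, 5000u, 18000u };
    for (int i = 0; i < 3; i++)
        printf("Node #%d lives in block %u\n", i + 1, addr[i] / BLOCK);
    return 0;
}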

Also, the hardware uses prefetch techniques to fetch blocks of memory before they are needed. For a small linked list, say A -> B -> C, the first traversal will be slow because the prefetcher has no pattern to work from. On later traversals the nodes may still be in cache, and if they happen to be laid out at a regular stride (for example, because they were allocated together), the prefetcher can pick up the pattern and fetch the right blocks on time; in general, though, a prefetcher cannot follow an arbitrary chain of pointers.

Summarizing: arrays are easily predictable by the hardware and live in one place, so they are easy to fetch, while linked-list nodes are unpredictable and scattered throughout memory, which makes life harder for the prefetcher and the CPU.

1

BeeOnRope's answer is good and highlights the cycle-count overheads of traversing a linked list vs iterating through an array, but as he explicitly says, that's assuming "both lists fully resident in L1". However, it's far more likely that an array will fit in L1 than a linked list, and the moment you start thrashing your cache the performance difference becomes huge. RAM can be more than 100x slower than L1, with L2 and L3 (if your CPU has any) being between 3x and 14x slower.

On a 64-bit architecture, each pointer takes 8 bytes, and a doubly linked list needs two of them, or 16 bytes of overhead. If you only want a single 4-byte uint32 per entry, that means you need 5x as much storage for the dlist as you need for an array. Arrays guarantee locality, and although malloc can do OK at locality if you allocate things together in the right order, you often can't. Let's approximate poor locality by saying it takes 2x the space, so a dlist uses 10x as much "locality space" as an array. That's enough to push you from fitting in L1 to overflowing into L3, or even worse, from L2 into RAM.
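To see those numbers on a real machine (a sketch; note that on most 64-bit ABIs the 4-byte payload gets padded, so the actual ratio is even worse than the 5x computed above):

#include <stdio.h>
#include <stdint.h>

/* Doubly linked node carrying a single uint32_t payload: two 8-byte
   pointers plus 4 bytes of data, typically padded to 24 bytes, versus
   4 bytes per element in a plain array. */
struct dnode {
    struct dnode *prev;
    struct dnode *next;
    uint32_t value;
};

int main(void) {
    printf("array bytes/element: %zu\n", sizeof(uint32_t));
    printf("dlist bytes/element: %zu\n", sizeof(struct dnode));
    printf("overhead ratio:      %zux\n",
           sizeof(struct dnode) / sizeof(uint32_t));
    return 0;
}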