The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.
One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc
may be spread around in memory, especially if nodes are allocated at different times or from different threads. However, in practice, many nodes are often allocated on the same thread, at the same time, and these will often end up quite contiguous in memory, because good malloc
implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocated the fixed-sized notes from one or more contiguous chunks of memory, which will provide great spatial locality.
So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on if you are adding most of all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).
Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n)
in a linked list and O(1)
in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1)
time in a linked list, and O(n)
time in an array, so the former wins in that scenario.
What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...
Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n)
operation regardless if you use a linked or array representation. So it's a tie, right?
Not so fast! The actual performance can vary a lot! Here is what typical find()
implementations would look like when compiled at -O2
optimization level in x86 gcc, thanks to godbolt which makes this easy.
Array
C Code
int find_array(int val, int *array, unsigned int size) {
for (unsigned int i=0; i < size; i++) {
if (array[i] == val)
return i;
}
return -1;
}
Assembly (loop only)1
.L6:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
je .done
add eax, 1
cmp edx, eax
jne .notfound
Linked List
C Code
struct Node {
struct Node *next;
int item;
};
Node * find_list(int val, Node *listptr) {
while (listptr) {
if (listptr->item == val)
return listptr;
listptr = listptr->next;
}
return 0;
}
Assembly (loop only)
.L20:
cmp DWORD PTR [rax+8], edi
je .done
mov rax, QWORD PTR [rax]
test rax, rax
jne .notfound
Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i
, a couple of comparisons, and one memory access to read the value from the array. The linked list version if going to have a couple of (adjacent) memory accesses to read the Node.val
and Node.next
members, and a couple of comparisons.
The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.
If you test it though - with both lists fully resident in L1, you'll find that the array version executes at about 1.5 cyles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by it's loop-carried dependency on listptr
. The one line listptr = listptr->next
boils down to on instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next
before you can calculate listptr->next->next
). Even though modern CPUs can execute something like 2 loads cycles every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.
The array version also has loads, but the address doesn't depend on the prior load:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
It depends only on rsi
, which is simply calculated by adding 4 each iteration. An add
has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.
This isn't unique to "find" - any operation linked that needs to iterate over many elements will have this pointer chasing behavior, which is inherently slow on modern hardware.
1I omitted the epilogue and prologue for each assembly function because it really isn't doing anything interesting. Both versions had no epilogue at all really, and the proloque was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.
2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi
as the pointer into the array, and eax
as the index i
. This means two separate cmp
instructions, and two increments. Better would have been to maintain only the pointer rsi
in the loop, and to compare against (array + 4*size)
as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp
by having rsi
run from -4*size
up to zero, and indexing into array using [rdi + rsi]
where rdi is array + 4*size
. Shows that even today optimizing compilers aren't getting everything right!