Answer
The number of iterations in MSD radix depends on the input size, whereas the number of iterations in LSD radix sort depends on the key length. This often leads to MSD radix sort requiring significantly fewer iterations than LSD radix sort and is therefore faster.
Memory allocations are not an issue, as MSD radix sort can easily be implemented in-place.
Rationale
I've made an implementation for both LSD and MSD radix sort, so I could see what properties they have that makes MSD radix sort faster than LSD radix sort.
I've compared their speeds with std::sort on an array of 100.000.000 random positive 63-bit integers (I used std::sort's result I also used for verifying the sorted arrays) and got the following results:
- Pure LSD sort : 10.5s
- std::sort : 9.5s
- Pure MSD sort : 9.3s
- MSD sort + insertion_sort : 7.6s
So, it is slightly faster than std::sort, and if the leaves are sorted with insertion_sort, it is quite a bit faster.
Why might MSD radix sort be faster than LSD radix sort?
- There is the cache-locality, though I doubt whether this is really important, as LSD radix sort also scans through the array, instead of performing random access.
- MSD radix sort can be implemented such that its space complexity is O(d k), and thus only depends on the radix d and the length of the items k. This can be allocated on the stack, which is almost free. Hence it is basically an in-place sorting algorithm.
- The bottom layers can be pruned. I.e. when a bucket contains only 1 element, it is already sorted and thus does not need to recurse on that bucket. Hence, MSD radix sort only needs to perform approximately log(n)/log(d) iterations. While LSD radix sort always must perform k iterations.
I believe that this last point is the reason why MSD radix sort is often faster than LSD radixsort. If the input data is uniformly random distributed, then the expected running time is O(n log(n)/log(d)), whereas LSD radix sort's running time is O(n k). And usually n is a lot smaller than k^d. Only if n = o(k^d), LSD radix sort would be faster. However, in that case counting sort (radix sort with k=1) can be used as well.
The implementations
inline void insertion_sort(int64_t * array, int n) {
for (int i=1; i<n; i++) {
int64_t val = array[i];
int j = i;
while (j>0 && array[j-1] > val) {
array[j] = array[j-1];
j--;
}
array[j] = val;
}
}
void msd_sort(int64_t * array, int n, int64_t bit=60) {
const int64_t mask = INT64_C(7);
// Count bucket sizes
int count[9]={};
for (int i=0; i<n; i++) {
count[((array[i]>>bit) & mask)+1]++;
}
// Create holes.
int loc[8];
int64_t unsorted[8];
int live = 0;
for (int i=0; i<8; i++) {
loc[i] = count[i];
count[i+1]+=count[i];
unsorted[live] = array[loc[i]];
if (loc[i] < count[i+1]) {
live++;
}
}
live--;
// Perform sort
for (int i=0; i<n; i++) {
int64_t val = unsorted[live];
int64_t d = (val>>bit) & mask;
array[loc[d]] = val;
loc[d]++;
unsorted[live] = array[loc[d]];
if (loc[d] == count[d+1]) {
live--;
}
}
if (bit>0) {
for (int i=0; i<8; i++) {
n = count[i+1] - count[i];
if (n > 20) { // If replaced by n > 1, insertion_sort is not needed.
msd_sort(array + count[i], n, bit-3);
} else {
insertion_sort(array + count[i], n);
}
}
}
}
void lsd_sort(int64_t * array, int n) {
const int64_t mask = INT64_C(7);
std::vector<int64_t> buffer(n);
for (int64_t bit=0; bit<63; bit+=3) {
// Copy and count
int count[9]={};
for (int i=0; i<n; i++) {
buffer[i] = array[i];
count[((array[i]>>bit) & mask) + 1]++;
}
// Init writer positions
for (int i=0; i<8; i++) {
count[i+1]+=count[i];
}
// Perform sort
for (int i=0; i<n; i++) {
int64_t val = buffer[i];
int64_t d = (val>>bit) & mask;
array[count[d]] = val;
count[d]++;
}
}
}