3

If I only need to sort strings composed of ASCII characters, what are the differences between most significant digit and least significant digit radix sorting? I think they should produce the same results, but I am confused by the following statement from the link below. If anyone could help clarify, that would be great.

https://en.wikipedia.org/wiki/Radix_sort

A most significant digit (MSD) radix sort can be used to sort keys in lexicographic order. Unlike a least significant digit (LSD) radix sort, a most significant digit radix sort does not necessarily preserve the original order of duplicate keys.

thanks in advance, Lin

Lin Ma

2 Answers

6

An LSD radix sort can logically concatenate the sorted bins after each pass (treating them as a single bin if using a counting / radix sort). An MSD radix sort has to recursively sort each bin independently after each pass. If sorting by bytes, that is 256 bins after the first pass, 65536 bins after the second pass, 16777216 (16 million) bins after the third pass, and so on.

This is why the old card sorters sort data LSD first. The video linked below shows one of these in action. The cards are fed in and drop into the chutes face down. In the video, the card sorter drops the cards into bins "0" to "9", then the operator takes the cards from the 0 bin, then takes the cards from the 1 bin and places them on top of (behind) the 0 bin cards, then the 2 bin cards go behind the deck, and so on, "concatenating" the cards from the bins. For large decks, there would be a set of shelves above the card sorter, one above each bin, to hold the cards when the decks were too large to hold by hand.

http://www.youtube.com/watch?v=jJH2alRcx4M

Example C++ LSD radix sort for 32 bit unsigned integers, where each "digit" is a byte. Most of the code generates a matrix of counts which are converted into indices that mark the boundaries between variable size bins. The actual radix sort is in the last nested loop.

#include <cstddef>                      // size_t
#include <cstdint>                      // uint32_t
#include <utility>                      // std::swap

//  a is input array, b is working array (same size as a)
//  returns a pointer to the sorted data; after 4 passes (an even
//  number of pointer swaps) this is the original a
uint32_t * RadixSort(uint32_t * a, uint32_t * b, size_t count)
{
    size_t mIndex[4][256] = {0};        // count / index matrix
    size_t i,j,m,n;
    uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;      //  stable: input order is kept within a bin
        }
        std::swap(a, b);                //  swap ptrs
    }
    return(a);
}
rcgldr
  • Thanks rcgldr, nice reply and vote up. :) I am still confused about your comment, "A MSD radix sort has to recursively sort each bin independently after each pass". Why can't MSD logically concatenate the sorted bins, as LSD does? An example would be appreciated. :) – Lin Ma Feb 22 '16 at 04:42
  • 1
    Hi rcgldr, I studied this further and I think I understand what you mean by "LSD radix sort can logically concatenate the sorted bins after each pass" while MSD cannot. For example, if we radix sort by byte, some elements whose MS-byte is 0xFE may have a 2nd-MS-byte smaller than that of elements whose MS-byte is smaller than 0xFE. If we then order purely by 2nd-MS-byte, ignoring the order of the MS-byte, the result will be wrong. Is that why we need 256 * 256 bins for the 2nd-MS-byte? Is that the correct understanding? Thanks. – Lin Ma Feb 22 '16 at 05:27
  • 1
    @LinMa - your description seems correct, if doing MSD by byte, you start with 256 bins, then each of those bins will need 256 bins for the 2nd pass, and so on. LSD doesn't have this issue, and it's also stable (the order of equal records is preserved). – rcgldr Feb 22 '16 at 09:08
  • 1
    I updated my answer to include a link to a video showing an old card sorter, a hardware version of radix sort. – rcgldr Feb 22 '16 at 09:15
  • Thanks rcgldr for the reply, and love the youtube video you shared. Are you the guy showing the demo in the video? :) – Lin Ma Feb 23 '16 at 04:07
  • BTW, rcgldr, if you have good radix standalone implementation in Java or Python to recommend me to learn, it will be great. – Lin Ma Feb 23 '16 at 04:08
  • 2
    @LinMa - I have an example in C for unsigned integers. A slight modification is needed for signed integers (effectively toggle the sign bit for ones' or two's complement numbers). Java doesn't have a native unsigned integer, so I'd have to create an example for signed integers. If interested, I could update my answer with a C byte-oriented radix sort for an array of unsigned integers. You can probably find examples of radix sort for strings in Java with a web search. – rcgldr Feb 23 '16 at 06:29
  • Thanks rcgldr, I love C and I think true programmer should only program for C. Thanks for sharing in advance. :) – Lin Ma Feb 23 '16 at 06:50
  • Thanks rcgldr for sharing the C version code, will study today and mark your reply as answer, I also take time today to rewrite a simple version of code for Java version, if we could discuss there for the new topic, it will be great, http://stackoverflow.com/questions/35591399/improvement-on-radix-sort-for-advice – Lin Ma Feb 24 '16 at 02:25
  • Hi rcgldr, spend time today to learn your code. One quick question, in the github version of code, there is a control variable needSort (https://github.com/zeyuanxy/hacker-rank/blob/master/algorithms/strings/string-similarity/Solution.Java#L58), wondering there is something we can do to improve algorithm efficiency? I see in your code, there is no such control. – Lin Ma Feb 25 '16 at 00:46
  • 1
    @LinMa - I didn't study the code that well, but my impression is needSort is used to skip empty buckets. I'd have to study that code more to understand why it does that. In my example code, a series of one or more empty buckets just result in the same index value being used for the matrix elements corresponding to those empty buckets and the next non-empty bucket during the conversion from counts to indices. After each radix pass, all of the buckets are effectively concatenated, without an actual concatenation step, and there are no gaps. – rcgldr Feb 25 '16 at 01:17
  • Hi rcgldr, I did more debugging today and want to share my thoughts. I think needSort could be used to skip sorting on a longer prefix when the strings are already ordered by a shorter prefix. For example, suppose we are ordering strings of length 4, "ac.." and "bc..". Since their first two characters already determine the order, there is no need to sort by all four characters. This is pretty smart, and if you could confirm it, I will be more confident in my findings. Looking forward to your reply. :) – Lin Ma Feb 25 '16 at 02:06
  • In my previous reply, for string "ac..", for sign '.' I mean any arbitrary characters. – Lin Ma Feb 25 '16 at 02:07
  • 1
    @LinMa - It appears that the code you're working on is a string similarity algorithm (I'm not familiar with these type of algorithms). If so, then the goal is to calculate some type of "distance" (numerical difference) between two strings, not to sort an array of strings, so the radix sort functions used for string similarity are different from a radix sort used to sort an array of strings. – rcgldr Feb 25 '16 at 04:03
  • agree with you. But for radixSort1_v2/radixSort0_v2, and radixSort1/radixSort0, I think they are more general radix sort (I post all code for the completion purpose). Your advice is highly appreciated whether my new code of radixSort1_v2/radixSort0_v2 are correct (especially removing the dependencies of globalPtr and pointTo). – Lin Ma Feb 25 '16 at 20:37
  • Thanks for all the help, rcgldr. Mark your reply as answered. :) – Lin Ma Feb 27 '16 at 23:24
  • Hello, I am studying radix sort but I have some doubts. Can you please help me understand why your LSD radix sort uses a byte (8 bits) for each "digit"? Would it be worse to make each "digit" 4 bits or 16 bits, or even a single bit? It seems to me that the performance would be the same... – fredcrs Jul 06 '16 at 11:25
  • 1
    @fredcrs - As in my example, assume that an array of 32 bit unsigned integers is being sorted. If a digit is a single bit, it takes 32 passes to do the sort. If a digit is 4 bits, it takes 8 passes; if a digit is 8 bits, it takes 4 passes; if a digit is 16 bits, it takes 2 passes. However, in the case of 16 bit digits, the size of each row in the matrix is 65536*4 = 262144 bytes (assuming 4-byte counts), exceeding the size of L1 cache on many processors. The end result is that the total time for 8 bit versus 16 bit digits is about the same, depending on the number of elements (the size) of the array. – rcgldr Jul 06 '16 at 17:04
  • 1
    @fredcrs - The size of the array has some effect on relative performance. This is because there's a fixed overhead for converting counts into indices. For 8 bit digits, it's 4 x 256 = 1024 loop iterations; for 16 bit digits, it's 2 x 65536 = 131072 loop iterations. If the size of the array is small, the fixed overhead becomes a greater part of the overall time to sort. For a large array, the fixed overhead is only a small part of the overall time. Another option would be to use digit sizes of 10, 11, and 11 bits, with 3 radix passes and one copy pass, but it's not much different from 8 bit or 16 bit digits. – rcgldr Jul 06 '16 at 17:19
  • Got it. Thank you very much – fredcrs Jul 06 '16 at 19:34
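The signed-integer modification mentioned in the comments (toggling the sign bit) can be sketched as a pair of key conversions. This is an illustration only: the names `toSortableKey` and `fromSortableKey` are hypothetical, and the sketch assumes 32-bit two's complement integers. After the conversion, unsigned comparison of the keys gives the same order as signed comparison of the original values, so an unsigned byte-wise radix sort can be applied to the keys and the result converted back.

```cpp
#include <cstdint>

// Map a signed value to an unsigned key whose unsigned order matches the
// signed order: flipping the sign bit moves negatives below non-negatives.
inline std::uint32_t toSortableKey(std::int32_t x) {
    return static_cast<std::uint32_t>(x) ^ 0x80000000u;
}

// Inverse mapping. The conversion back to int32_t is modular on two's
// complement platforms (and guaranteed to be so since C++20).
inline std::int32_t fromSortableKey(std::uint32_t k) {
    return static_cast<std::int32_t>(k ^ 0x80000000u);
}
```

An equivalent alternative is to leave the data unchanged and flip the top bit only when extracting the most significant digit in the final pass.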
2

The part that's confusing you is that pretty much ALL LSD radix sorts preserve the order of duplicate keys. That's because they rely on this property to work at all. For example, if you have 2 iterations like this, sorting by first the ones place and then the tens place:

22        21        11
21   ->   11   ->   21
11        22        22

When we sort by tens we need to preserve the tie-breaking order we got when we sorted by ones, so that 21 and 22 come out in the proper order even though they have the same digits in the 10s place. If you implement the first sort (by ones) the same way you have to do all the other ones (and why wouldn't you?), then the sort is stable.
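The two passes in the example above can be reproduced with a stable counting sort on one decimal digit at a time. This is a minimal sketch, not a tuned implementation: `stablePass` is a hypothetical helper name, and it assumes non-negative integers.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One stable counting-sort pass on a single decimal digit.
// divisor selects the digit: 1 = ones place, 10 = tens place, and so on.
// Assumes all values are non-negative.
std::vector<int> stablePass(const std::vector<int>& in, int divisor) {
    std::array<std::size_t, 11> start{};               // start[d] = first output slot for digit d
    for (int x : in) ++start[(x / divisor) % 10 + 1];  // histogram, shifted by one slot
    for (std::size_t d = 0; d < 10; ++d) start[d + 1] += start[d];  // prefix sum
    std::vector<int> out(in.size());
    for (int x : in)                                   // walking the input in order keeps ties in order
        out[start[(x / divisor) % 10]++] = x;
    return out;
}
```

Calling `stablePass({22, 21, 11}, 1)` gives 21, 11, 22, and passing that result back in with `divisor = 10` gives 11, 21, 22. The second pass gets 21 before 22 only because the first pass left them in that order.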

An MSD radix sort can be written using the same kinds of sorting steps as an LSD radix sort, in which case it will be stable, too. But there are other, often more efficient ways to implement an MSD radix sort that don't have this property.

MSD-first radix sorts that don't preserve the order of duplicates are usually in-place, i.e., they work without allocating a separate array to hold the sorted elements.

NOTE that none of this makes any difference if you're just sorting a list of strings by comparing their ASCII code points. "preserving the order of duplicate keys" only matters when they have extra information attached to them. For example if the keys have associated values, or if you are sorting in a case-independent manner and you want "Abe" and "abE" out in the same order they came in.
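The case-independent situation can be demonstrated without a radix sort at all. The sketch below uses `std::stable_sort` as a stand-in for any stable sort, with a hypothetical `lowerKey` helper producing the comparison key. "Abe" and "abE" have the same key, so a stable sort must keep them in input order, whereas an unstable sort is free to emit them in either order.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: lower-case copy of s, used only as the comparison key.
std::string lowerKey(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Case-independent sort that keeps equal-key strings in their input order.
void sortCaseInsensitive(std::vector<std::string>& v) {
    std::stable_sort(v.begin(), v.end(),
                     [](const std::string& a, const std::string& b) {
                         return lowerKey(a) < lowerKey(b);
                     });
}
```

With input "bob", "Abe", "abE" the result is "Abe", "abE", "bob"; with input "abE", "bob", "Abe" it is "abE", "Abe", "bob".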

Matt Timmermans
  • Thanks Matt, vote up. :) In your comments, "often more efficient ways to implement an MSD radix sort that don't have this property", in what kinds of implementation MSD does not preserve the order? The only way I can think is like how LSD did. Your advice is appreciated. :) – Lin Ma Feb 22 '16 at 04:40
  • BTW, Matt, I did some further analysis. In your example of 22, 21, 11, when using MSD-first sorting, we sort by tens first, and then when sorting by ones we could also preserve the "tie-breaking order" we got when we sorted by tens. What is the issue? Thanks. – Lin Ma Feb 22 '16 at 04:59
  • 1
    Yes, as I said, you can use the same kinds of sorting steps with MSD-first, and it will preserve the order. I'll add a comment about the ways that don't – Matt Timmermans Feb 22 '16 at 12:40
  • Thanks Matt, looking forward to your update. :) Have you read the reply from rcgldr? It seems MSD has the additional cost of using more bins. For example, if we radix sort by byte, some elements whose MS-byte is 0xFE may have a 2nd-MS-byte smaller than that of elements whose MS-byte is smaller than 0xFE. If we then order purely by 2nd-MS-byte, ignoring the order of the MS-byte, the result will be wrong. Is that why we need 256 * 256 bins for the 2nd-MS-byte? Is that the correct understanding? Thanks. – Lin Ma Feb 22 '16 at 20:27
  • BTW, Matt, if you have good radix standalone implementation in Java or Python to recommend me to learn, it will be great. – Lin Ma Feb 23 '16 at 04:30
  • Hi Matt, one quick question, in the github version of code, there is a control variable needSort (https://github.com/zeyuanxy/hacker-rank/blob/master/algorithms/strings/string-similarity/Solution.Java#L58), wondering there is something we can do to improve algorithm efficiency of radix sort? Thanks. – Lin Ma Feb 25 '16 at 00:47
  • 1
    You don't need more bins to do an MSB-first sort. After sorting by char0, you recurse to sort each bin separately. All the strings in each 2nd-level sort have the same first char, so you still only need 256 bins. For improving the speed of the sort, there are lots of little things that make a difference. – Matt Timmermans Feb 25 '16 at 13:45
  • Thanks Matt, smart idea and vote up! I recently re-write radix sort part of code to make it simple to remove dependencies of globalPtr and pointTo, I post my code here, and your advice is highly appreciated (http://codereview.stackexchange.com/questions/121026/radix-sorting-in-java). – Lin Ma Feb 25 '16 at 20:36