Choosing a good sorting algorithm

Question

The Java application spends most of its time sorting some keys and removing duplicates.

So choosing an adapted sorting algorithm is mandatory.

Keys are integers (around 256bits but not necessarily) and the array sizes are between 1000 and 100000 keys.

The input arrays are made of consecutive key groups. These groups are already sorted and small (around 10 keys).

An array example (3 groups, 32bits keys):

After sorting and removing duplicates :

Any tough ? Any idea ? Any link ?

Thanks

PS : after looking at sorting algorithms including a lot of variations of merge sort, radix sort, qui... I continue digging around hash maps.

PPS : finally I forked Java legacy merge sort, added filtering and the concept of sorted groups. It provides a great speedup.

Please share some thoughts that you have on this. Did you try anything? — Sergey Kalinichenko, Sep 08 '13 at 16:09
We don't know what you don't know. The question seems straight forward to me. What do you find tricky? — Peter Lawrey, Sep 08 '13 at 16:11
Sorting 100,000 integers should be pretty fast. But what is a "256 bit" integer? Are these Big Integers? — user949300, Sep 08 '13 at 16:27
I tested quick sort, tim sort (provided by Java libraries) and radix sort. Tim sort provided some interesting results but I wonder if i can do better. — , Sep 08 '13 at 18:25
I also tested merge sort provided by Java libraries. This is slower than tim sort too. — , Sep 08 '13 at 18:47
As per Peter's comment below, are you sure that it is the sorting, and _not_ the _reading / parsing_ of the values, that is taking the time? In my experience, sorting 100,000 floats takes well under a second. What is your issue? Could you post the actual times? — user949300, Sep 08 '13 at 20:48
Apparently, this operation happens frequently in you app. Are the lists completely different each time? If so, I think you are stuck with one of our answers. Look into some multi-threading so you can be doing this in parallel over the many times the sorting is needed. If not, i.e. the lists are nearly the same each time, then you might be able to do something clever. Like just merge in the changed keys. Here you are starting to go beyond the limits of my computer science. — user949300, Sep 08 '13 at 21:45
Keys are similar from an array to another but tracking updates require to compare entirely the last and the next array. — , Sep 09 '13 at 07:15

score 5 · Accepted Answer · edited May 23 '17 at 12:05

5

Merge Sort (http://en.wikipedia.org/wiki/Merge_sort)

Since your input data is presorted you have a head start. You can enter the 1st value from each list into a PriorityQueue, take out the least, and add the next value from that list into the queue. Repeat. With some checks for getting to the end. :-)

I'm sure there are SO answers with more complete details.

some more links:

http://www.cs.washington.edu/education/courses/cse373/06sp/handouts/lecture08.pdf

Algorithm for N-way merge

and, my own answer with pretty complete Java code:

Merging multiple sorted csv files with complex comparison

edited May 23 '17 at 12:05

Community

1
1

answered Sep 08 '13 at 16:11

user949300

14,622
6
30
61

Can Merge sort remove duplicates efficiently ? – Sep 08 '13 at 18:33
Good point. Normally it will include duplicates. But if you added some simple logic to check that the value you are about to add isn't the one you just did add, it should be o.k. – user949300 Sep 08 '13 at 20:45
If keys are sorted in-place, then keys will be moved a lot. Not sure this is effective. Do you have any links for optimized implementation ? – Sep 08 '13 at 21:06
Check out @aioobe answer and pseudocode in the 2nd link I provided, or my answer with complete Java code in the 3rd link. You are only "sorting" the top N keys from the N groups (in your example, 3), using a PriorityQueue or TreeMap. Very efficient. You just need to add a check to not output a key if it equals the last one you just outputted. – user949300 Sep 08 '13 at 21:40

score 1 · Answer 2 · answered Sep 08 '13 at 16:13

1

The simplest solution without any more details is

You should be able to read all the lines into a TreeSet and print them out at the end.

BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
TreeSet<String> sortedSet = new TreeSet<String>();
for(String line; (line = br.readLine()) != null;)
    sortedSet.add(line);
for (String s : sortedSet) 
    System.out.println(s);

answered Sep 08 '13 at 16:13

Peter Lawrey

498,481
72
700
1,075

Since the input data is already largely sorted, this will make an ugly tree set. I suspect a more "normal" algorithm of put them all into a Set, then make a List, then Collections.sort() might be faster. I'm not sure why OP's sorting is slow, 100,000 integers isn't much, so I definitely like your very quick and simple approach. – user949300 Sep 08 '13 at 16:25
1

@user949300 TreeSet should be a balanced tree. A merge sort is possibly more efficient, but much more complicated. I suspect the time spent is in parsing and comparing the keys, not the sort itself. – Peter Lawrey Sep 08 '13 at 16:28
1

You are probably right about where the time gets spent. I did the merge sort when it was processing tens of millions of complex Strings from dozens of files. – user949300 Sep 08 '13 at 16:32
I will check TreeSet performances. – Sep 08 '13 at 18:32
No TreeSet is the slowest amongst tested solutions. – Sep 08 '13 at 18:46
@Peter Lawrey, there is no parsing. Keys are computed. – Sep 08 '13 at 21:09

score 0 · Answer 3 · answered Sep 08 '13 at 16:11

0

I would suggest you to use Collections.sort here, as that would take care of duplicates (if you create a SET for the numbers) as, and the sorting time complexity is O(nlogn) which is as good as it gets.

If you have only a specific set of numbers, then you might want to take a look at Radix sort.

answered Sep 08 '13 at 16:11

Neeraj

7,443
8
36
67

1

Collections.sort() doesn't remove duplicates. – Peter Lawrey Sep 08 '13 at 16:12
Sorry about the ambiguity. When I mentioned collections, I meant that the under-lying datastructure would be a set. – Neeraj Sep 08 '13 at 16:14
You can only Collections.sort() a List. You could use a TreeSet but then you wouldn't need to sort(). – Peter Lawrey Sep 08 '13 at 16:15

score 0 · Answer 4 · answered Sep 08 '13 at 16:16

0

If you sort totally new array each time, you may benefit from Quick sort or maybe Bucket sort

If your array is update Fibonacci heap(most effective, though complex), Binomial heap, or simple Binary heap.

answered Sep 08 '13 at 16:16

sshilovsky

763
1
7
19

score 0 · Answer 5 · answered Sep 08 '13 at 16:16

0

Since your sort keys are integers in a limited range, you can use radix sort. The radix sort has linear time complexity, while more generic sort algorithms based on comparisons have minimum O(n log n) run time for sorting n items, making the radix sort and similar sorting algorithms superior for large data sets.

answered Sep 08 '13 at 16:16

Joni

101,441
12
123
178

I picked some representative arrays and tested Radix sort. Tim sort is faster than radix sort. – Sep 08 '13 at 18:30
As a sort based on comparing elements with O(n log n) time complexity, for large datasets Tim sort is guaranteed to be slower than radix sort. For small datasets the execution time is determined by implementation details; for example how effectively the CPU cache is used in your particular implementation of the algorithm. – Joni Sep 08 '13 at 18:51
The application typically runs on a relatively new workstation. Do you know any optimized radix sort implementation ? Maybe mine isn't fine. – Sep 08 '13 at 19:14

David says Reinstate Monica · Answer 6 · 2013-09-08T21:35:54.457

0

You could just iterate through all the elements and put them all in a Set. Specifically, put all the elements in a TreeSet to give you proper ordering. This will also automatically remove duplicates. Your code would actually be very simple -

Set<int> sortedUniqueKeys = new TreeSet<int>(keys);

Where keys is the unsorted array of duplicate integer keys. All the sorting/duplicate removal is done in the constructor and is (presumably) fast.

edited Sep 08 '13 at 21:35

answered Sep 08 '13 at 16:17

David says Reinstate Monica

16,634
19
69
108

I will check TreeSet performances. – Sep 08 '13 at 18:32
No TreeSet is the slowest amongst tested solutions. – Sep 08 '13 at 19:04
@sylvain Yeah, I forgot to mention that. This is exactly the trade off you would expect - super simple code, but since its so simple it loses a bunch of optimization. – David says Reinstate Monica Sep 08 '13 at 19:22
But you made a point. I should dig around set creation. – Sep 08 '13 at 21:07
@sylvain You should look into manually filling the set. This level of granularity should give you a good optimization boost. Also look into using a `hashset` instead of a `treeset` I believe they have faster insetions, but I forget. They do have a different ordering though, so play around with it before you decided to stick with it. – David says Reinstate Monica Sep 08 '13 at 21:35

Choosing a good sorting algorithm

6 Answers6