
Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:

np.repeat(np.arange(1+x.max()), np.bincount(x))

(see the accepted answer to the question *How can I vectorize this python count sort so it is absolutely as fast as it can be?*), the integers in my application can run from 0 to 2**32, so np.bincount would have to allocate a count array with billions of entries.
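For illustration, here is the one-liner on a small array (a quick sketch with made-up values):

import numpy as np

x = np.array([3, 1, 2, 1, 3, 3], dtype=np.uint32)
np.repeat(np.arange(1 + x.max()), np.bincount(x))
# -> array([1, 1, 2, 3, 3, 3])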

Am I stuck with quicksort?


This post was primarily motivated by the [Numpy grouping using itertools.groupby performance](https://stackoverflow.com/q/4651683/341970) question. Also note that [it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)
Ali
  • Very similar question: [vectorized radix sort with numpy - can it beat np.sort?](http://stackoverflow.com/q/34023841/) – morningsun Feb 15 '16 at 10:27
  • @morningsun I know, [I left a comment at that question on Feb 10](http://stackoverflow.com/questions/34023841/vectorized-radix-sort-with-numpy-can-it-beat-np-sort#comment58344130_34023841), that is, 5 days ago. That question is different though; that question wants a pure NumPy solution. – Ali Feb 15 '16 at 12:05

3 Answers


No, you are not stuck with quicksort. You could use, for example, integer_sort from Boost.Sort or u4_sort from usort. When sorting this array:

array(randint(0, high=1<<32, size=10**8), uint32)

I get the following results:

NumPy quicksort:         8.636 s  1.0  (baseline)
Boost.Sort integer_sort: 4.327 s  2.0x speedup
usort u4_sort:           2.065 s  4.2x speedup

I would not jump to conclusions based on this single experiment and use usort blindly. I would test with my actual data and measure what happens. Your mileage will vary depending on your data and on your machine. The integer_sort in Boost.Sort has a rich set of tuning options; see the documentation.

Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do.


Boost.Sort

Put these lines into the spreadsort.cpp file:

#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin,  std::size_t len) {
        integer_sort(begin, begin + len);
    }
}

It basically instantiates the templated integer_sort for 32 bit unsigned integers; the extern "C" part ensures C linkage by disabling name mangling. Assuming you are using gcc and that the necessary include files of boost are under the /tmp/boost_1_60_0 directory, you can compile it:

g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so  

The key flags are -fPIC to generate position-independent code and -shared to generate a shared object (.so) file. (Read the docs of gcc for further details.)

Then, you wrap the spreadsort() C++ function in Python using ctypes:

from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin,  std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)

Alternatively, you can use cffi:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin,  size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)

At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the path to the spreadsort.so file is ./spreadsort.so. Alternatively, you can write

lib = cdll.LoadLibrary('spreadsort.so')

or

C = ffi.dlopen('spreadsort.so')

if you add the directory containing spreadsort.so to the LD_LIBRARY_PATH environment variable. See also Shared Libraries.
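For example (a sketch; substitute the actual directory where you built spreadsort.so):

export LD_LIBRARY_PATH=/path/to/build/dir:$LD_LIBRARY_PATH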

Usage. In both cases you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.
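For instance, a minimal sketch (assuming the ctypes or cffi wrapper above is saved as a module named spreadsort_wrapper; that module name is my own):

import numpy as np
from spreadsort_wrapper import integer_sort  # hypothetical module name

arr = np.array(np.random.randint(0, high=1 << 32, size=10**6), dtype=np.uint32)
integer_sort(arr)                    # sorts arr in place
assert np.all(arr[:-1] <= arr[1:])   # the array is now nondecreasing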


usort

As for u4_sort, you can compile it as follows:

cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so

Issue this command in the directory where the u4_sort.c file is located. (Probably there is a less hackish way but I failed to figure that out. I just looked into the deps.mk file in the usort directory to find out the necessary compiler flags and include paths.)

Then, you can wrap the C function as follows:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)

In the above code, I assumed that the directory containing u4_sort.so has been added to the LD_LIBRARY_PATH environment variable.

Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.

Ali
  • Note the comment for u4_sort: "implements in place u4 radix sort" means that the result ends up in the original array, not that it doesn't allocate a second working buffer: `buf = (unsigned *) malloc(sz * sizeof(unsigned));`. It makes one read pass to create the count matrix (with rows b0, b1, b2), three read/write sort passes to sort the data, and one copy pass to put the data back into the original array. From MSF to LSF, the field sizes are 10 bits, 11 bits, 11 bits. My example's better results are probably due to system and/or compiler. – rcgldr Feb 10 '16 at 20:43

A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket sizes in 1 pass of the data, then takes 4 passes to do the sort. On my system, an Intel 2600K at 3.4 GHz, using a Visual Studio release build of a C++ program, it takes about 1.15 seconds to sort 10^8 pseudo-random unsigned 32 bit integers.

Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort the data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort the data. The process is probably memory-bandwidth limited due to the random-access writes, so the optimizations in u4_sort, such as macros for the shifts and one loop with a fixed shift per field, don't help much. My better results are probably due to system and/or compiler differences. (Note that u8_sort is for 64 bit integers.)

Example code:

#include <cstdint>    // uint32_t
#include <cstddef>    // size_t
#include <utility>    // std::swap

//  a is the input array, b is the working array;
//  after 4 passes (an even number of swaps) the sorted result is back in a
void RadixSort(uint32_t *a, uint32_t *b, size_t count)
{
    size_t mIndex[4][256] = {0};            // count / index matrix
    size_t i, j, m, n;
    uint32_t u;
    for (i = 0; i < count; i++) {           // generate histograms
        u = a[i];
        for (j = 0; j < 4; j++) {
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for (j = 0; j < 4; j++) {               // convert counts to indices
        m = 0;
        for (i = 0; i < 256; i++) {
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for (j = 0; j < 4; j++) {               // radix sort
        for (i = 0; i < count; i++) {       //  sort by current lsb
            u = a[i];
            m = (size_t)(u >> (j << 3)) & 0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                    //  swap ptrs
    }
}
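A minimal driver for the example above (my own sketch, not part of the measured benchmark; assumes a C++11 compiler and the RadixSort function from the listing):

#include <algorithm>  // std::is_sorted
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const size_t count = 10000000;
    std::vector<uint32_t> a(count), b(count);     // b is the working buffer
    std::mt19937 gen(42);                         // fixed seed for repeatability
    std::uniform_int_distribution<uint32_t> dist; // defaults to the full uint32_t range
    for (size_t i = 0; i < count; i++)
        a[i] = dist(gen);
    RadixSort(a.data(), b.data(), count);         // sorted result ends up in a
    std::printf("sorted: %s\n", std::is_sorted(a.begin(), a.end()) ? "yes" : "no");
    return 0;
}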
rcgldr
  • +1 Thanks for the feedback! One would have to measure each algorithm with different kinds of inputs (random, mostly sorted, sorted), with different ranges for the values (0-7, 0-255, 1024-16383, etc.), with varying array lengths, on different architectures with different cache sizes, etc. That would tell us more. We only see one point of a complex picture at the moment, the result of a single experiment. What's still puzzling is that the Boost.Sort algorithm is not doing so well; I would have expected more from it. – Ali Feb 10 '16 at 22:15
  • Interesting use of prefix-sums to put the radix buckets head-to-tail. `mIndex` uses 8k of L1 cache, which is probably just about right. Hopefully most of it stays hot in L1 (32k) most of the time, instead of being evicted by pressure from accessing the main array. I think you could cache-block the final loop, though. Start with `j=3`, so the scratch array has all the data grouped into 256 buckets. Go all the way to a final sort within each of those 256 buckets. That way, data movement from the 3rd and 4th radixes will happen on data that's prob. still at least in L3, maybe even L2. – Peter Cordes Feb 10 '16 at 23:41
  • mIndex uses 8K in 64 bit mode, 4K in 32 bit mode, although only 1/4 (2K) of it is being used per inner loop. The reads are sequential, but the writes are random if the data is random, and if the dataset >> L3 cache size, then the writes are the primary performance issue. For 64 bit integers in 64 bit mode, mIndex would be 16K ([8][256]), but only 1/8 (2K) of it is used per inner loop. – rcgldr Feb 11 '16 at 01:19
  • For signed integers, the most significant fields in mIndex are biased, and the data field sign extended, or no bias is used and the sign bit of the most significant data field is reversed during the final radix sort pass. Again since the random writes are usually the limiting factor, I don't think it matters which approach is used for signed integers. – rcgldr Feb 11 '16 at 01:20
  • @rcgldr: My point was that using the high byte for the first pass of random writes will let you do the next passes in much smaller groups, which *do* fit in L3 or even L2 cache. Scattered writes into hot cache is much cheaper than scattered writes into cold memory. You take advantage of this by doing the MSB first, so the later random write passes don't scatter anywhere near as widely. You could even use a different sorting algo once you distribute into multiple smaller buckets. – Peter Cordes Feb 14 '16 at 18:26
  • @PeterCordes - MSB first requires a recursive like approach where each bucket from one pass is sorted as an individual unit on the next pass. Using four 8 bit fields with 32 bit integers, after the first pass, 256 buckets have to separately sorted, then 65536 buckets separately sorted, then 16777216 buckets separately sorted. Each pass on a bucket would require a separate pass to generate a column of mIndex. With LSB first, the buckets can be considered concatenated after each pass. This is why sorting using the old card sorters was done LSB first. – rcgldr Feb 14 '16 at 20:17
  • Ok, I think I see why MSB-first doesn't work as simply with concatenated buckets. There's a tradeoff between simpler logic and worse memory access pattern, though. I'd guess you'd get the best results from MSB-first recursing for one or two leading bytes, then using an O(nlogn) sort on each bucket. Or if you are going to do two passes of radix, maybe doing 2nd-MSB then MSB so you can build both mIndex blocks in one pass. (I'm not sure the extra pass over the data to allow concatenated buckets is even worth it, vs. many separately-allocated buckets, unless virtual address space is limited.) – Peter Cordes Feb 14 '16 at 20:57
  • @PeterCordes - without the separate pass for concatenated buckets, then either dynamically expanding containers like vectors have to be used, or each bucket has to be the same size as the array, and concatenation if doing LSB first involves a lot of moves. – rcgldr Feb 14 '16 at 21:00
  • @rcgldr: yeah, exactly. Dynamically expanding buffers like vectors are cheap. `realloc(3)` doesn't usually need to copy. Doubling the size each time a realloc is needed gives you amortized constant time insertion, even if every realloc does trigger a copy. Memory pages you never touch at the end of a dynamic array hardly cost anything: just page table entries that are never accessed. The occasional copy from dynamic arrays is easily worth the huge improvement in cache-locality from re-processing smaller and smaller chunks while they're still hot in cache. – Peter Cordes Feb 14 '16 at 21:05
  • I modified my answer with a good improvement; my old timeit protocol was biased. – B. M. Feb 23 '16 at 15:15
  • @rcgldr, the Numba code is parallelized; could this be as well? I think it is better to replace the `std::swap()` with a macro to make the code `C` compatible. – Royi Jul 11 '20 at 14:00
  • @Royi - The array could be split up into k parts, the k parts sorted via radix sort in parallel (multi-threaded), then merged. The bottle neck in radix sort is random access writes. I also tried doing most significant byte first on a large array, to create 256 sub-arrays, each of which would fit in L3 cache, but it was only a 5 % gain. I don't recall comparing times using multi-threading with radix sort. I did compare timesfor a [multi-threaded merge sort](https://codereview.stackexchange.com/questions/148025). – rcgldr Jul 11 '20 at 18:29
  • @rcgldr, Do you recommend this for small to medium size (5-500 elements) arrays of Unsigned Integers? – Royi Jul 11 '20 at 18:34
  • @Royi - most libraries will switch to using insertion sort for less than 32 to 64 elements. For 500 elements, it probably doesn't matter, since it will take just a fraction of a second. – rcgldr Jul 11 '20 at 22:17
  • I thought using Sorting Network for up to ~10 elements. Above that use the lowest overhead between quick sort, merge sort and *something else sort* :-). Do you suggest using Insert Sort up to ~64 and from there either quick or merge? – Royi Jul 12 '20 at 04:19
  • @Royi - I'm not aware of a library that uses a sorting network. Generally sorting networks are used in hardware. If sorting numbers, radix sort is fastest, unless the number size is so large that merge sort ends up with fewer passes. There is also the issue of the random address writes of a radix sort versus the sequential reads and writes of a merge sort. I don't use quicksort much. A 4-way merge sort, without a heap, is about the same speed as quicksort's best case, and avoids quicksort's worst-case time complexity of O(n^2), or using median of medians, which makes quicksort slower. – rcgldr Jul 12 '20 at 08:14
  • Could you link to an implementation of "A 4 way merge sort, without heap"? I will merge them with something like https://github.com/eggpi/sorting-networks-test. What do you think? – Royi Jul 12 '20 at 17:24
  • @Royi - an example of 4 way merge sort without heap in C++ is shown in the third part of [this answer](https://stackoverflow.com/questions/34844613/34845789#34845789) . – rcgldr Jul 12 '20 at 19:02

A radix sort with Python/Numba (0.23), modeled on @rcgldr's C program, multithreaded on a 2-core processor.

First, the radix sort in Numba, with two global arrays for efficiency:

from threading import Thread
from pylab import *   # provides randint, array, frombuffer, empty_like, uint16, uint32
import numpy as np
from numba import jit

n = uint32(10**8)
m = n // 2
if 'x1' not in locals():   # avoid regenerating the data on every run
    x1 = array(randint(0, 1 << 16, 2 * n), uint16)
x2 = x1.copy()
x = frombuffer(x2, uint32)   # randint doesn't work with 32 bits here :(
y = empty_like(x)
nbits = 8
buffsize = 1 << nbits
mask = buffsize - 1

@jit(nopython=True, nogil=True)
def radix(x, y):
    xs = x.size
    dec = 0
    while dec < 32:
        u = np.zeros(buffsize, uint32)
        k = 0
        while k < xs:                 # histogram of the current byte
            u[(x[k] >> dec) & mask] += 1
            k += 1
        t = 0
        for j in range(buffsize):     # convert counts to start indices
            b = u[j]
            u[j] = t
            t += b
        k = 0
        while k < xs:                 # scatter into y by the current byte
            j = (x[k] >> dec) & mask
            y[u[j]] = x[k]
            u[j] += 1
            k += 1
        x, y = y, x                   # even number of passes: result ends up in x
        dec += nbits

Then the parallelization, made possible by the nogil option:

def para(nthreads=2):
    threads = [Thread(target=radix,
                      args=(x[i*n//nthreads:(i+1)*n//nthreads],
                            y[i*n//nthreads:(i+1)*n//nthreads]))
               for i in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()

@jit
def fuse(x, y):
    # merge the sorted left half, previously copied to y[:m],
    # with the sorted right half still sitting in x[m:], back into x
    kl = 0
    kr = n // 2
    k = 0
    while k < n:
        if kr == n or y[kl] < x[kr]:   # guard against an exhausted right half
            x[k] = y[kl]
            kl += 1
            if kl == m:                # rest of the right half is already in place
                break
        else:
            x[k] = x[kr]
            kr += 1
        k += 1

def sort():
    para(2)          # sort the two halves in place, in parallel
    y[:m] = x[:m]    # stash the sorted left half in the scratch array
    fuse(x, y)       # merge both halves back into x

Benchmarks:

In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop

In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop

In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop

In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop      

So this is a pure Python solution with a 10x gain (37 s -> 3.5 s after subtracting the copy offset) on my modest 1 GHz machine. It can be enhanced with more cores and multi-way fusion.

B. M.
  • +1 Thanks! I am wondering why numba failed to optimize this better. I don't think the next version of numba will solve it: I suspect that LLVM (the actual optimizer behind numba) failed. Well, maybe I will look into this during the weekend. – Ali Feb 12 '16 at 01:23
  • I've seen numba underperform before. – dstromberg Feb 13 '16 at 03:28
  • Be fair with numba and keep in mind that both sort in place, so you're basically comparing the worst case: the array is already sorted (or am I wrong?). Also, depending on your setup, you include the "compile" cost in the numba version. On my computer with 10**7 elements I get ``numba: 1.18 s per loop``, ``numpy quicksort: 1.84 s per loop`` (but both of these include the creation of the random arrays). – MSeifert Feb 13 '16 at 15:25