
I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.

def pack(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
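For reference, dropping the cdef lines gives a plain-Python model of the same loop (the helper name `pack_py` is mine, for illustration only); it packs the high nibble of each byte pair:

```python
def pack_py(it):
    """Plain-Python model of pack(): high nibble of each byte pair."""
    n = len(it) // 2
    return [(it[i*2] // 16) << 4 | it[i*2+1] // 16 for i in range(n)]

# 0xA0 and 0xB0 carry their data in the high nibble, so they pack to 0xAB:
print(pack_py([0xA0, 0xB0]))  # [171], i.e. [0xAB]
```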

While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.

cython3 -a pack.pyx generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.

From a C point of view the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could further speed this up proportionally.

The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?


Using array as suggested in How to convert python array to cython array? does not seem to make it faster (and the array needs to be created from the bytes object beforehand); nor does typing the parameter as list instead of object (same as no type), or using a for loop instead of a list comprehension:

def packx(list it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef list r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
    return r

I think my earlier test just specified an array.array as input, but following the comments I've now just tried

from cpython cimport array
import array

def packa(array.array a):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    cdef unsigned int b[256*64/2]
    for i in range(n):
        b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16;

    cdef array.array c = array.array("B", b)
    return c

which compiles but

ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist()

segfaults. I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for output data are appreciated.


Taking @ead's first approach, plus combining division and shifting (seems to save a microsecond):
#cython: boundscheck=False, wraparound=False

def packa(char[::1] a):
    """Cythonize python nibble packing loop, typed with array"""

    cdef unsigned int n = len(a)//2
    cdef unsigned int i

    # cdef unsigned int b[256*64/2]
    cdef array.array res = array.array('B', [])
    array.resize(res, n)

    for i in range(n):
        res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4);

    return res

takes much longer to compile, but runs much faster:

python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop

Amazing! But, with the additional bytes-to-array and array-to-list conversion

ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist() # bytes would probably also do

it now only takes about 1.7 ms - very cool!
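The mask-and-shift form used above is equivalent to the original divide-and-shift form for unsigned byte values; this can be checked exhaustively in plain Python (a sanity check I added, not part of the original code):

```python
# Exhaustive check over all byte pairs: (a & 0xF0) | (b >> 4)
# equals ((a // 16) << 4) | (b // 16) for unsigned byte values.
assert all(
    ((a & 0xF0) | (b >> 4)) == (((a // 16) << 4) | (b // 16))
    for a in range(256) for b in range(256)
)
print("mask/shift form matches divide/shift form for all bytes")
```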


Down to 150 us timed or approx. 0.4 ms actual:

from cython cimport boundscheck, wraparound
from cpython cimport array
import array

@boundscheck(False)
@wraparound(False)
def pack(const unsigned char[::1] di):
    cdef:
        unsigned int i, n = len(di)
        unsigned char h, l, r
        array.array do = array.array('B')

    array.resize(do, n>>1)

    for i in range(0, n, 2):
        h = di[i] & 0xF0
        l = di[i+1] >> 4
        r = h | l
        do.data.as_uchars[i>>1] = r

    return do

I'm not converting the result array to a list anymore; this is done automatically by py-spidev when writing, and the total time is about the same: 10 ms (@ 10 MHz).
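A plain-Python model of this final step-2 loop (helper name `pack_model` is mine, to sanity-check the index arithmetic only; the real code is the Cython function above):

```python
def pack_model(di):
    """Plain-Python model of the final Cython pack(): step-2 loop,
    destination index via right shift instead of division."""
    do = bytearray(len(di) >> 1)
    for i in range(0, len(di) & ~1, 2):
        # keep the high nibble of di[i], take the high nibble of di[i+1]
        do[i >> 1] = (di[i] & 0xF0) | (di[i + 1] >> 4)
    return bytes(do)

print(pack_model(bytes([0x12, 0x34, 0x56, 0x78])).hex())  # '1357'
```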

  • Have you tried the plain numpy approach: `np.array(it,dtype=np.uint8).view(dtype=np.uint16)`? – DavidW Nov 19 '17 at 13:46
  • @DavidW Would this not just result in `0x0A0B`? Then it's not what I am looking for. – handle Nov 19 '17 at 14:04
  • @DavidW It results in decimal 2826 (`0x0B0A`, depending on endianess). But thanks for the input! – handle Nov 19 '17 at 14:14
  • Do you really know what is slowing you down? Measure it first, and you will be surprised! Your approach is `python-ints->c-ints->calc new c-ints->create-python-ints`. I didn't measure, but my guess would be that the last step `c-ints->create-python-ints` is the bottleneck (but as I didn't measure it, I may be mistaken). There is no gain in speeding up the calculation with c-ints. The only way I see is to use `array.array` (or a numpy array) instead of a list in your python-program, so the last step isn't needed. – ead Nov 19 '17 at 14:16
  • You're right - this can't be done by reinterpreting the memory I think – DavidW Nov 19 '17 at 14:18
  • @ead No, I don't know whether there is a bottleneck left that could be worked around. I don't know how to partially measure either. I've added a small C equivalent to my answer at https://stackoverflow.com/a/47375829/1619432 that I think is about a 100 times faster. Using array actually was slower (also, it requires conversion from bytes). – handle Nov 19 '17 at 14:25
  • Then please post your `array.array` version, i.e. a function which takes an `array.array('c', ...)` object (or `char[::1]` for that matter) and returns an `array.array('c', ...)` object. I would really like to see why it is not faster than your current version. – ead Nov 19 '17 at 14:33
  • @ead I tried and failed ('c' does not seem to be valid). Please have a look at my updated question. I'll try to post an [mcve] in a couple of hours. – handle Nov 19 '17 at 15:33
  • For those following these comments: it's pretty much worked out - see the updated question and ead's answer. – handle Nov 22 '17 at 07:06

1 Answer


If you want to be as fast as C, you should not use a list with Python integers inside but an array.array. It is possible to get a speed-up of around 140 over your Python+list code by using Cython + array.array.

Here are some ideas for how to make your code faster with Cython. As a benchmark, I chose a list with 1000 elements (big enough, and cache misses have no effect yet):

import random
l=[random.randint(0,15) for _ in range(1000)]

As a baseline, your Python implementation with a list:

def packx(it):
    n = len(it)//2
    r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return r

%timeit packx(l)
143 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

By the way, I use % instead of //, which is probably what you want; otherwise you would get only 0s as a result (only the lower bits carry data in your description).
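The difference is easy to see in plain Python with nibble data sitting in the low bits:

```python
vals = [0x0A, 0x0B]                          # nibble data in the low 4 bits
assert [v // 16 for v in vals] == [0, 0]     # // 16 discards the data
assert [v % 16 for v in vals] == [10, 11]    # % 16 keeps the low nibble
# With %, the two nibbles pack as intended:
assert (vals[0] % 16) << 4 | (vals[1] % 16) == 0xAB
print("ok")
```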

After cythonizing the same function (with %%cython magic) we get a speed-up of around 2:

%timeit packx(l)
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Let's look at the HTML produced by option -a; for the line corresponding to the for-loop we see the following:

..... 
__pyx_t_2 = PyNumber_Multiply(__pyx_v_i, __pyx_int_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
 __Pyx_GOTREF(__pyx_t_2);
 __pyx_t_5 = PyObject_GetItem(__pyx_v_it, __pyx_t_2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
 __Pyx_GOTREF(__pyx_t_5);
 __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
 __pyx_t_2 = __Pyx_PyInt_RemainderObjC(__pyx_t_5, __pyx_int_16, 16, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
...

PyNumber_Multiply means that we use slow Python multiplication, and __Pyx_DECREF shows that all the temporaries are slow Python objects. We need to change that!

Let's pass not a list but an array.array of bytes to our function, and return an array.array of bytes back. Lists hold full-fledged Python objects inside; array.array holds lowly raw C data, which is faster:

%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('b', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return res

import array
a=array.array('B', l)
%timeit cy_apackx(a)
19.2 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Better, but let's take a look at the generated HTML; there is still some slow Python code:

 __pyx_t_2 = __Pyx_PyInt_From_long(((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
 if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_res), __pyx_v_i, __pyx_t_2, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) __PYX_ERR(0, 9, __pyx_L1_error)
 __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;

We still use a Python setter for the array (__Pyx_SetItemInt), and for this a Python object __pyx_t_2 is needed. To avoid this, we use array.data.as_chars:

%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16  ##HERE!
    return res

%timeit cy_apackx(a)
1.86 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Much better, but let's take a look at the HTML again: we see some calls to __Pyx_RaiseBufferIndexError. This safety costs some time, so let's switch it off:

%%cython
from cpython cimport array    
import cython
@cython.boundscheck(False) # switch off safety checks
@cython.wraparound(False)  # switch off safety checks
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16  ##HERE!
    return res

%timeit cy_apackx(a)
1.53 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

When we look at the generated HTML, we see:

__pyx_t_7 = (__pyx_v_i * 2);
__pyx_t_8 = ((__pyx_v_i * 2) + 1);
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16));

No Python stuff! Good so far. However, I'm not sure about __Pyx_mod_long; its definition is:

static CYTHON_INLINE long __Pyx_mod_long(long a, long b) {
   long r = a % b;
   r += ((r != 0) & ((r ^ b) < 0)) * b;
   return r;
}
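In Python terms, the helper can be modeled like this (a sketch of mine: `c_mod` emulates C's truncating %, which Python itself does not use):

```python
def c_mod(a, b):
    """C-style %: remainder truncates toward zero (unlike Python's %)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r

def pyx_mod_long(a, b):
    """Model of __Pyx_mod_long: adjust the C remainder to Python semantics."""
    r = c_mod(a, b)
    r += ((r != 0) & ((r ^ b) < 0)) * b
    return r

# For negative operands C and Python disagree; the helper restores Python's result:
assert c_mod(-1, 16) == -1          # what C's % gives
assert pyx_mod_long(-1, 16) == 15   # what Python's % gives
# For non-negative a, a % 16 is just a & 15 - the form we want the C compiler to emit:
assert all(pyx_mod_long(a, 16) == (a & 15) for a in range(256))
```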

So C and Python differ in the behavior of mod for negative numbers, and that must be taken into account. This function definition, albeit inlined, will prevent the C compiler from optimizing a%16 into a&15. Our numbers are all positive, so there is no need to care about negative operands; thus we apply the a&15 trick ourselves:

%%cython
from cpython cimport array
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]&15)<<4 | (it[i*2+1]&15)
    return res

%timeit cy_apackx(a)
1.02 µs ± 8.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I'm also satisfied with the resulting C code/HTML (only one line):

(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))) & 15) << 4) | ((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))) & 15));

Conclusion: in sum, that means a speed-up of around 140 (143 µs vs. 1.02 µs), not bad! Another interesting point: the calculation itself takes about 2 µs (and that includes the less-than-optimal bounds checking and division); the other ~138 µs are spent creating, registering and deleting temporary Python objects.


If you need the upper bits and can assume that the lower bits are clean (otherwise &240 can help), you can use:

from cpython cimport array
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = it[i*2] | (it[i*2+1]>>4)
    return res

%timeit cy_apackx(a)
819 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
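The assumption about clean lower bits matters; a quick plain-Python check (values chosen by me for illustration):

```python
a, b = 0xA5, 0xB0       # a has "dirt" (0x5) in its low nibble
assert (a | (b >> 4)) == 0xAF          # dirt leaks into the result
assert ((a & 240) | (b >> 4)) == 0xAB  # masking with & 240 (0xF0) cleans it
print("ok")
```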

Another interesting question is what the operations cost when a list is used. If we start with the "improved" version:

%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    res=[0]*n
    for i in range(n):
        res[i] = it[i*2] | (it[i*2+1]>>4)
    return res

%timeit cy_packx(l)
20.7 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

we see that reducing the number of integer operations leads to a big speed-up. That is because Python integers are immutable, and every operation creates a new temporary object, which is costly. Eliminating operations also means eliminating costly temporaries.

However, it[i*2] | (it[i*2+1]>>4) is still done with Python integers; as a next step we turn these into C operations on cdef-typed variables:

%%cython   
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    res=[0]*n
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1]  # ensures next operations are fast
        res[i]= a | (b>>4)
    return res   

%timeit cy_packx(l)
7.3 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I don't know how it can be improved further; thus we have 7.3 µs for list vs. 1 µs for array.array.


One last question: what is the cost breakdown of the list version? In order to avoid the loop being optimized away by the C compiler, we use a slightly different baseline function:

%%cython
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        res=[0]*n
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
            res[i]= s
        return res

In [79]: %timeit cy_packx(l)
7.67 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The use of the s variable ensures that the loop does not get optimized away in the second version:

%%cython   
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        res=[0]*n
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
        res[0]=s
        return res 

In [81]: %timeit cy_packx(l)
5.46 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

About 2 µs, or about 30%, is the cost of creating new integer objects. What is the cost of the memory allocation?

%%cython   
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
        return s 

In [84]: %timeit cy_packx(l)
3.84 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That leads to the following performance break down of the list-version:

                    Time(in µs)      Percentage(in %)
     all                7.7                 100
     calculation          1                  12
     alloc memory       1.6                  21
     create ints        2.2                  29
     access data/cast   2.6                  38

I must confess, I expected create ints to play a bigger role, and I didn't think that accessing the data in the list and casting it to chars would cost that much.

  • Thank you! I'll only be able to test in about 20 hrs. Source data will be 8 bits but needs to be reduced to 4 bits, thus the division by 16 (equivalent to >> 4, so further optimization possible). `0x0A` was meant to show the nibbles clearly. I will update accordingly. In the meantime, could you point me to documentation on the `char[::1] it` slicing syntax? – handle Nov 19 '17 at 20:46
  • @handle you can take a look at typed memory views: http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html – ead Nov 19 '17 at 22:15
  • @handle I also improved the list-version somewhat, it might still be an option for you. – ead Nov 19 '17 at 22:16
  • Thanks again, down to 1.7 ms for the whole conversion via array (see updated question). Will look into your further improvements in about 8 hrs. – handle Nov 20 '17 at 08:02
  • The array (conversion) should not be necessary, as Cython is supposed to support `bytes` as either char* or memoryview; unfortunately I also get this error: https://github.com/cython/cython/issues/1682. It works with `bytearray`, though I have yet to figure out how to return the computed data. – handle Nov 20 '17 at 21:26
  • @handle Yeah, I'm sometimes overusing `array.array`; however, the difference is not that big compared to `bytearray`. I don't know how you use the data in your program, but usually one tries to use `bytearray` or `array.array` + cython for calculations and avoids `list`s in the whole program if possible. – ead Nov 21 '17 at 06:38
  • They seem to have a [fix for this almost ready](https://github.com/cython/cython/pull/1869); in the meantime I'll see if I can get the same performance with `bytearray` as with `array.array`. I get `bytes` from Pillow (PIL) and write them to SPI with [py-spidev](https://github.com/doceme/py-spidev/blob/master/spidev_module.c#L121), which [optionally converts the sequence data to a list](https://docs.python.org/3/c-api/sequence.html#c.PySequence_Fast) and copies the data to a C-type buffer which is then written to the SPI device (so there's room for improvement there as well). – handle Nov 21 '17 at 09:28
  • Further speedup by eliminating the multiplication with a step-2 loop and an rshift instead of division to get the target index; now at 150 us timed and 420 us actual. bytearray and array.array indeed do not make much of a difference; if anything, array is slightly faster. Thanks again, I might revisit this once Cython is fixed. – handle Nov 21 '17 at 17:08
  • Cython seems to have been fixed ([#1682](https://github.com/cython/cython/issues/1682)) but I'll wait for a release to play with this again... – handle Feb 16 '18 at 13:29