
As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp function. This is significantly less efficient than adding unlogged values together.

Furthermore, I need to add values together using the numpy.ufunc.reduceat functionality.
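
To make the goal concrete: for each segment defined by consecutive indices, I want log(sum(exp(...))) over the logged values, which should agree with the log of a plain sum over the unlogged values. A minimal illustration (not part of the package):

import numpy as np

# illustration only: per-segment logsumexp over logged values should
# match the log of a plain per-segment sum over the unlogged values
arr = np.random.uniform(0, 0.1, 1000)
log_arr = np.log(arr)
indices = np.array([0, 300, 700])

plain_sums = np.add.reduceat(arr, indices)           # sums in normal space
log_sums = np.logaddexp.reduceat(log_arr, indices)   # sums in log space
print(np.allclose(np.log(plain_sums), log_sums))     # expected: True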

There are various options I've considered, with code below:

  1. (for comparison in non-log-space) use numpy.add.reduceat
  2. Numpy's ufunc for adding logged values together: np.logaddexp.reduceat
  3. Handwritten reduceat function with the following logsumexp functions:
import numpy as np
import numba
import scipy.special

def logsumexp_reduceat(arr, indices, logsum_exp_func):
    # apply logsum_exp_func to each segment [indices[k], indices[k+1])
    res = list()
    i_start = indices[0]
    for i in indices[1:]:
        res.append(logsum_exp_func(arr[i_start:i]))
        i_start = i

    # the final segment runs from the last index to the end of the array
    res.append(logsum_exp_func(arr[i_start:]))
    return res

@numba.jit(nopython=True)
def logsumexp(X):
    # naive logsumexp: exponentiate, accumulate, then take the log
    r = 0.0
    for x in X:
        r += np.exp(x)
    return np.log(r)

@numba.jit(nopython=True)
def logsumexp_stream(X):
    # streaming logsumexp: alpha tracks the running maximum,
    # so the accumulator r stays in a safe range
    alpha = -np.Inf
    r = 0.0
    for x in X:
        if x != -np.Inf:
            if x <= alpha:
                r += np.exp(x - alpha)
            else:
                r *= np.exp(alpha - x)
                r += 1.0
                alpha = x
    return np.log(r) + alpha
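
(A brief aside on why the streaming variant exists: shifting by the running maximum alpha keeps the accumulator finite where the naive version overflows. Illustration only, using the functions defined above:)

# illustration only: behaviour on large inputs
big = np.array([1000.0, 1000.0])
# logsumexp(big)        -> inf, because np.exp(1000.0) overflows in float64
# logsumexp_stream(big) -> ~1000.6931, i.e. log(2) + 1000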

arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))

# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# approach 3, scipy function
%timeit logsumexp_reduceat(log_arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The timeit results show that the handwritten logsumexp functions compiled with numba are the fastest of the log-space options, but they are still roughly 10x slower than numpy.add.reduceat.

A few questions:

  1. Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
  2. Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?
  • For the Numba solution, use Intel SVML and make sure that you don't run into the SVML bug: https://stackoverflow.com/a/56939240/4045774. Also use the fastmath flag to make SIMD vectorization possible. – max9111 Jan 22 '20 at 08:50
  • Thanks max9111, I added `from llvmlite import binding; binding.set_option('SVML', '-vector-library=SVML')` and used `fastmath=True`, but I see minimal improvement in the benchmarks. Perhaps I'm missing something? – Wilder Wohns Jan 22 '20 at 10:24
  • I find it very odd that the built-in `np.logaddexp.reduceat` function is slower than the hand-crafted numba version. – user2667066 Jan 22 '20 at 10:44
  • I think the streaming logsumexp solution can give a speedup when the arrays are too large to fit in memory, but otherwise may be slower, as it involves slightly more computational cost – user2667066 Jan 22 '20 at 12:44

1 Answer


There is some room for improvement.

But never expect logsumexp to be as fast as a standard summation, because exp is quite an expensive operation.

Example

import numpy as np

#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb

@nb.njit(fastmath=True, parallel=False)
def logsum_exp_reduceat(arr, indices):
    res = np.empty(indices.shape[0], dtype=arr.dtype)

    # each segment runs from indices[i] to indices[i+1]
    for i in nb.prange(indices.shape[0]-1):
        r = 0.
        for j in range(indices[i], indices[i+1]):
            r += np.exp(arr[j])
        res[i] = np.log(r)

    # the last segment runs from the final index to the end of the array
    r = 0.
    for j in range(indices[-1], arr.shape[0]):
        r += np.exp(arr[j])
    res[-1] = np.log(r)
    return res

Timings

#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))

%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization   84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#large example where parallelization makes sense
arr = np.random.uniform(0, 0.1, 1_000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1_000_000, 100))

%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization   409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
max9111
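
The "with parallelization" timings above presumably come from compiling the same function with the prange loop enabled; a sketch of that variant (an assumption, not code from the original answer) would be:

# sketch (assumption, not from the original answer): identical body,
# but parallel=True lets nb.prange distribute the segments over threads
@nb.njit(fastmath=True, parallel=True)
def logsum_exp_reduceat_parallel(arr, indices):
    res = np.empty(indices.shape[0], dtype=arr.dtype)

    for i in nb.prange(indices.shape[0]-1):
        r = 0.
        for j in range(indices[i], indices[i+1]):
            r += np.exp(arr[j])
        res[i] = np.log(r)

    r = 0.
    for j in range(indices[-1], arr.shape[0]):
        r += np.exp(arr[j])
    res[-1] = np.log(r)
    return res

As the timings show, the threading overhead only pays off once the segments are large enough (the 1_000_000-element example); the small example is faster without it.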