
As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp function. This is significantly less efficient than adding unlogged values together.

Furthermore, I need to add values together using the numpy.ufunc.reduceat functionality.
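
To make the goal concrete: for each segment defined by consecutive indices, I want log(sum(exp(...))) over the logged values, which should agree with the log of a plain sum over the unlogged values. A minimal illustration (not part of the package):

import numpy as np

# illustration only: per-segment logsumexp over logged values should
# match the log of a plain per-segment sum over the unlogged values
arr = np.random.uniform(0, 0.1, 1000)
log_arr = np.log(arr)
indices = np.array([0, 300, 700])

plain_sums = np.add.reduceat(arr, indices)           # sums in normal space
log_sums = np.logaddexp.reduceat(log_arr, indices)   # sums in log space
print(np.allclose(np.log(plain_sums), log_sums))     # expected: True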

There are various options I've considered, with code below:

  1. (for comparison in non-log-space) use numpy.add.reduceat
  2. Numpy's ufunc for adding logged values together: np.logaddexp.reduceat
  3. Handwritten reduceat function with the following logsumexp functions:
import numpy as np
import numba
import scipy.special

def logsumexp_reduceat(arr, indices, logsum_exp_func):
    # apply logsum_exp_func to each segment [indices[k], indices[k+1])
    res = list()
    i_start = indices[0]
    for i in indices[1:]:
        res.append(logsum_exp_func(arr[i_start:i]))
        i_start = i

    # the final segment runs from the last index to the end of the array
    res.append(logsum_exp_func(arr[i_start:]))
    return res

@numba.jit(nopython=True)
def logsumexp(X):
    # naive logsumexp: exponentiate, accumulate, then take the log
    r = 0.0
    for x in X:
        r += np.exp(x)
    return np.log(r)

@numba.jit(nopython=True)
def logsumexp_stream(X):
    # streaming logsumexp: alpha tracks the running maximum,
    # so the accumulator r stays in a safe range
    alpha = -np.Inf
    r = 0.0
    for x in X:
        if x != -np.Inf:
            if x <= alpha:
                r += np.exp(x - alpha)
            else:
                r *= np.exp(alpha - x)
                r += 1.0
                alpha = x
    return np.log(r) + alpha
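
(A brief aside on why the streaming variant exists: shifting by the running maximum alpha keeps the accumulator finite where the naive version overflows. Illustration only, using the functions defined above:)

# illustration only: behaviour on large inputs
big = np.array([1000.0, 1000.0])
# logsumexp(big)        -> inf, because np.exp(1000.0) overflows in float64
# logsumexp_stream(big) -> ~1000.6931, i.e. log(2) + 1000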

arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))

# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# approach 3, scipy function
%timeit logsumexp_reduceat(log_arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The timeit results show that the handwritten logsumexp functions compiled with numba are the fastest of the log-space options, but they are still roughly 10x slower than numpy.add.reduceat.

A few questions:

  1. Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
  2. Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?
  • For the Numba solution, use Intel SVML and make sure that you don't run into the SVML bug: https://stackoverflow.com/a/56939240/4045774. Also use the fastmath flag to make SIMD vectorization possible. – max9111 Jan 22 '20 at 08:50
  • Thanks max9111, I added `from llvmlite import binding; binding.set_option('SVML', '-vector-library=SVML')` and used `fastmath=True`, but I see minimal improvement in the benchmarks. Perhaps I'm missing something? – Wilder Wohns Jan 22 '20 at 10:24
  • I find it very odd that the built-in `np.logaddexp.reduceat` function is slower than the hand-crafted numba version. – user2667066 Jan 22 '20 at 10:44
  • I think the streaming logsumexp solution can give a speedup when the arrays are too large to fit in memory, but otherwise may be slower, as it involves slightly more computational cost – user2667066 Jan 22 '20 at 12:44

1 Answer


There is some room for improvement.

But never expect logsumexp to be as fast as a standard summation, because exp is quite an expensive operation.

Example

import numpy as np

#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb

@nb.njit(fastmath=True, parallel=False)
def logsum_exp_reduceat(arr, indices):
    res = np.empty(indices.shape[0], dtype=arr.dtype)

    # each segment runs from indices[i] to indices[i+1]
    for i in nb.prange(indices.shape[0]-1):
        r = 0.
        for j in range(indices[i], indices[i+1]):
            r += np.exp(arr[j])
        res[i] = np.log(r)

    # the last segment runs from the final index to the end of the array
    r = 0.
    for j in range(indices[-1], arr.shape[0]):
        r += np.exp(arr[j])
    res[-1] = np.log(r)
    return res

Timings

#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))

%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization   84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#large example where parallelization makes sense
arr = np.random.uniform(0, 0.1, 1_000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1_000_000, 100))

%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization   409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
max9111
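
The "with parallelization" timings above presumably come from compiling the same function with the prange loop enabled; a sketch of that variant (an assumption, not code from the original answer) would be:

# sketch (assumption, not from the original answer): identical body,
# but parallel=True lets nb.prange distribute the segments over threads
@nb.njit(fastmath=True, parallel=True)
def logsum_exp_reduceat_parallel(arr, indices):
    res = np.empty(indices.shape[0], dtype=arr.dtype)

    for i in nb.prange(indices.shape[0]-1):
        r = 0.
        for j in range(indices[i], indices[i+1]):
            r += np.exp(arr[j])
        res[i] = np.log(r)

    r = 0.
    for j in range(indices[-1], arr.shape[0]):
        r += np.exp(arr[j])
    res[-1] = np.log(r)
    return res

As the timings show, the threading overhead only pays off once the segments are large enough (the 1_000_000-element example); the small example is faster without it.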