
I am trying to improve a function that calculates, for each pixel of an image, the standard deviation of the pixels located in its neighborhood. My function uses two nested loops to run across the matrix, and it is the bottleneck of my program. I guess there is likely a way to improve it by getting rid of the loops with numpy, but I don't know how to proceed. Any advice is welcome!

regards

import numpy as np

def sliding_std_dev(image_original, radius=5):
    height, width = image_original.shape
    result = np.zeros_like(image_original)  # initialize the output matrix
    hgt = range(radius, height - radius)
    wdt = range(radius, width - radius)
    for i in hgt:
        for j in wdt:
            # std dev of the (2*radius) x (2*radius) window around pixel (i, j)
            result[i, j] = np.std(image_original[i-radius:i+radius, j-radius:j+radius])
    return result

4 Answers


Cool trick: you can compute the standard deviation given just the sum of squared values and the sum of values in the window.

Therefore, you can compute the standard deviation very fast using a uniform filter on the data:

from scipy.ndimage import uniform_filter

def window_stdev(arr, radius):
    # windowed mean of the values, and windowed mean of the squared values
    c1 = uniform_filter(arr, radius*2, mode='constant', origin=-radius)
    c2 = uniform_filter(arr*arr, radius*2, mode='constant', origin=-radius)
    # std = sqrt(E[X^2] - E[X]^2); crop the zero-padded border introduced by the
    # shifted origin (note: floating-point rounding can make c2 - c1*c1 very
    # slightly negative in constant regions)
    return ((c2 - c1*c1)**.5)[:-radius*2+1, :-radius*2+1]

This is ridiculously faster than the original function. For a 1024x1024 array and a radius of 20, the old function takes 34.11 seconds and the new function takes 0.11 seconds, a roughly 300-fold speed-up.
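For reference, a minimal timing harness along these lines (a sketch; absolute numbers depend entirely on hardware):

import numpy as np
from timeit import timeit

x = np.random.random((1024, 1024))
# average seconds per call over 10 runs
print(timeit(lambda: window_stdev(x, 20), number=10) / 10)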


How does this work mathematically? It computes the quantity sqrt(mean(x^2) - mean(x)^2) for each window. We can derive this quantity from the standard deviation sqrt(mean((x - mean(x))^2)) as follows:

Let E be the expectation operator (basically mean()), and X be the random variable of data. Then:

E[(X - E[X])^2]
= E[X^2 - 2X*E[X] + E[X]^2]
= E[X^2] - E[2X*E[X]] + E[E[X]^2] (by the linearity of the expectation operator)
= E[X^2] - 2E[X]*E[X] + E[X]^2 (again by linearity, and the fact that E[X] is a constant)
= E[X^2] - E[X]^2

which shows that mean(x^2) - mean(x)^2 is exactly the variance, so the quantity computed by this technique (its square root) is mathematically equivalent to the standard deviation.
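The identity is also easy to check numerically; a quick sketch (np.std computes the population standard deviation, which is what the derivation uses):

import numpy as np

x = np.random.random(1000)
# sqrt(mean(x^2) - mean(x)^2) should match sqrt(mean((x - mean(x))^2))
print(np.allclose(np.sqrt((x*x).mean() - x.mean()**2), x.std()))  # True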

nneonneo
  • This solution seems very smart but I don't feel comfortable with it: it seems it calculates, on each neighborhood, the square root of the difference between the mean of the squares and the square of the mean. To me this is different from the square root of the mean of the squared differences between each value and the mean. In other words, mean(x^2) - mean(x)^2 is not equal to mean((x - mean(x))^2). What do you think? – baptiste pagneux Aug 25 '13 at 14:30
  • 3
    @baptistepagneux: You can prove it's true. Quick proof: let `x` be the mean, `X` be the random variable, and `E` the expectation operator (same as "mean()"). Then, `E[(X-x)^2] = E[X^2 - 2Xx + x^2] = E[X^2] - 2E[X]x + x^2` (last one by linearity of `E` and the fact that `x` is a constant) `= E[X^2] - x^2` (because `E[X] = x` by definition) = `E[X^2] - E[X]^2`. QED. – nneonneo Aug 25 '13 at 15:32
  • Your proof is very convincing, thank you. Your algo is really the best for me. Thanks – baptiste pagneux Aug 26 '13 at 13:34
  • I would not use this answer - it produces incorrect results. Please see my answer below for an accurate solution. – Max Jaderberg Feb 25 '14 at 23:18
  • 1
    @MaxJaderberg: Mind explaining why or how it produces incorrect results? I included a proof to show that the math is correct. – nneonneo Feb 26 '14 at 01:26
  • @nneonneo the math is definitely correct! I'm not super familiar with uniform_filter, but I'm pretty sure you need a normalising constant in front of it i.e. 1/((2*radius)**2) – Max Jaderberg Feb 26 '14 at 10:56
  • 1
    @MaxJaderberg: No, you do not; check for yourself. `uniform_filter` already performs normalization. On a randomly-generated image (`np.random.random((100,100))`), my solution produces a result that has an error of at most 1.4e-15 across all pixels, compared to the result given by the OP's code. – nneonneo Feb 26 '14 at 22:41
  • I found an OpenCV `boxFilter`-based solution for the almost identical "variance over window" problem **even faster** than a `uniform_filter`-based solution; for my test example, the speedup was a factor of 4.1, see [this answer](http://stackoverflow.com/a/36266187/1628638). – Ulrich Stern Mar 31 '16 at 18:17

The method most often used for this kind of thing in image processing relies on summed area tables, an idea introduced by Frank Crow in 1984. The idea is that, when you compute a quantity by summing over a window and then move the window, e.g., one pixel to the right, you don't need to add up all the items in the new window; you only need to subtract the leftmost column from the total and add the new rightmost column. So if you build an array of sums accumulated over both dimensions, you can get the sum over any window with a couple of lookups and a subtraction. If you keep summed area tables for your array and for its square, it's very easy to get the variance from those two. Here's an implementation:

def windowed_sum(a, win):
    # 2D cumulative sum: the summed area table
    table = np.cumsum(np.cumsum(a, axis=0), axis=1)
    win_sum = np.empty(tuple(np.subtract(a.shape, win - 1)))
    # four cases: the corner, the first row, the first column, and the
    # general inclusion-exclusion over the table
    win_sum[0, 0] = table[win-1, win-1]
    win_sum[0, 1:] = table[win-1, win:] - table[win-1, :-win]
    win_sum[1:, 0] = table[win:, win-1] - table[:-win, win-1]
    win_sum[1:, 1:] = (table[win:, win:] + table[:-win, :-win] -
                       table[win:, :-win] - table[:-win, win:])
    return win_sum

def windowed_var(a, win):
    # var = E[X^2] - E[X]^2, with win*win samples per window
    win_a = windowed_sum(a, win)
    win_a2 = windowed_sum(a*a, win)
    return (win_a2 - win_a * win_a / win / win) / win / win

To see that this works:

>>> a = np.arange(25).reshape(5,5)
>>> windowed_var(a, 3)
array([[ 17.33333333,  17.33333333,  17.33333333],
       [ 17.33333333,  17.33333333,  17.33333333],
       [ 17.33333333,  17.33333333,  17.33333333]])
>>> np.var(a[:3, :3])
17.333333333333332
>>> np.var(a[-3:, -3:])
17.333333333333332
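To check more than the two corners, here is a brute-force comparison over every window (a quick sketch; it assumes `windowed_var` as defined above):

import numpy as np

a = np.random.rand(40, 30)
win = 7
# variance of every win x win window, computed the slow, obvious way
brute = np.array([[np.var(a[i:i+win, j:j+win])
                   for j in range(a.shape[1] - win + 1)]
                  for i in range(a.shape[0] - win + 1)])
print(np.allclose(windowed_var(a, win), brute))  # True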

This should run a couple of notches faster than convolution based methods.

Jaime
  • This seems like it'd be pretty fast. You should time it. – nneonneo Aug 25 '13 at 15:46
  • @nneonneo I tried to time it, and your solution with `uniform_filter` was beating my code above consistently, and performing in constant time no matter what the window size... The reason is that `uniform_filter` is already using the add-next-item-subtract-first trick, here's the [line](https://github.com/scipy/scipy/blob/785491ce81da53fcbf42661811758906f8ff2563/scipy/ndimage/src/ni_filters.c#L323) where it all happens. – Jaime Aug 26 '13 at 04:38
  • I know that `uniform_filter` does that trick, but I am curious how the more explicit version here does (also, this one should be independent of window size, no?) – nneonneo Aug 26 '13 at 04:40
  • 2
    @nneonneo If you strip all unnecessary intermediate arrays from the code above, using `np.add` and `np.subtract` with `out` keywords, it runs about 20-30% slower than `uniform_filter`, with similar performance for things like independence of window size. So not that bad of an option if you want to skip the `scipy` dependency. – Jaime Aug 26 '13 at 04:58
  • I suggest you add `a = np.int32(a)` as the first line of `windowed_var(a, win)`, to avoid problems. – phyrox Jan 22 '14 at 10:58
  • @phyrox How would that help in any way? – Jaime Jan 22 '14 at 14:27
  • @Jaime I was having trouble with an array that used a different data type (int32, I think), and during the last division it returned strange numbers. I solved it using the line I wrote. – phyrox Jan 22 '14 at 16:15

First off, there's more than one way to do this.

It's not the most efficient speed-wise, but using scipy.ndimage.generic_filter allows you to easily apply an arbitrary Python function over a moving window.

As a quick example:

result = scipy.ndimage.generic_filter(data, np.std, size=2*radius)

Note that the boundary conditions can be controlled by the `mode` kwarg.
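For instance, a self-contained version of the one-liner (a sketch; `mode='reflect'` is just one of the available boundary modes):

import numpy as np
import scipy.ndimage

data = np.random.random((100, 100))
radius = 5
# reflect the image at the borders instead of padding with a constant
result = scipy.ndimage.generic_filter(data, np.std, size=2*radius, mode='reflect')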


Another way to do this is to use striding tricks to make a view of the array that's effectively a moving window, and then apply np.std along the last axis. (Note: this is taken from one of my previous answers here: https://stackoverflow.com/a/4947453/325565)

def strided_sliding_std_dev(data, radius=5):
    windowed = rolling_window(data, (2*radius, 2*radius))
    shape = windowed.shape
    # flatten each window so std can be taken along a single axis
    windowed = windowed.reshape(shape[0], shape[1], -1)
    return windowed.std(axis=-1)

def rolling_window(a, window):
    """Takes a numpy array *a* and a sequence of (or single) *window* lengths
    and returns a view of *a* that represents a moving window."""
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a

def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/numpy-discussion@scipy.org/msg29450.html>"""
    if window < 1:
        raise ValueError("`window` must be at least 1.")
    if window > a.shape[-1]:
        raise ValueError("`window` is too long.")
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

It's a bit hard to understand what's going on here at first glance. Not to plug one of my own answers, but I don't want to re-type the explanation, so have a look here: https://stackoverflow.com/a/4924433/325565 if you haven't seen these sorts of "striding" tricks before.

If we compare timings on a 100x100 array of random floats with a radius of 5, this is ~10x faster than the original or the generic_filter version. However, you have no flexibility in the boundary conditions with this version. (It's identical to what you're currently doing, while the generic_filter version gives you lots of flexibility at the expense of speed.)

# Your original function with nested loops
In [21]: %timeit sliding_std_dev(data)
1 loops, best of 3: 237 ms per loop

# Using scipy.ndimage.generic_filter
In [22]: %timeit ndimage_std_dev(data)
1 loops, best of 3: 244 ms per loop

# The "stride-tricks" version above
In [23]: %timeit strided_sliding_std_dev(data)
100 loops, best of 3: 15.4 ms per loop

# Ophion's version that uses `np.take`
In [24]: %timeit new_std_dev(data)
100 loops, best of 3: 19.3 ms per loop

The downside to the "stride-tricks" version is that unlike "normal" strided rolling window tricks, this version does make a copy, and it's much larger than the original array. You will run into memory problems if you use this on a large array! (On a side note, it's basically equivalent to @Ophion's answer in terms of memory use and speed. It's just a different approach to doing the same thing.)
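The copy is easy to demonstrate (a small sketch using np.shares_memory; it assumes the `rolling_window` helper above):

import numpy as np

data = np.random.random((100, 100))
view = rolling_window(data, (10, 10))
print(np.shares_memory(data, view))   # True: the 4D windowed array is still a view
flat = view.reshape(view.shape[0], view.shape[1], -1)
print(np.shares_memory(data, flat))   # False: the reshape forced a (much larger) copy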

Joe Kington
  • These methods are not really equivalent; the `radius` variable is doing two separate things. – Daniel Aug 24 '13 at 17:29
  • @Ophion - Woops! Good catch. My `radius` is actually a diameter. – Joe Kington Aug 24 '13 at 17:29
  • Also, obviously the boundary conditions are handled differently, but they're equivalent away from the boundaries. I mainly posted this just to show different techniques, not to give a 100% match to the way the OP handled the boundary conditions. – Joe Kington Aug 24 '13 at 17:38
  • 1
    It is a great use of strided arrays, but I was mainly pointing it out because this is only ~5% faster when using the proper radius *and* removing the `hstack` and `vstack` parts. – Daniel Aug 24 '13 at 17:43
  • 2
    It's not properly documented, but in numpy 1.7 you can give `np.std` a tuple of axes instead of a single axis. So you can get a 4D view of the array, with e.g. shape `(rows-win+1, cols-win+1, win, win)`, then call on that view `.std(axis=(-1, -2))`, and get the windowed standard deviation without making a copy. That would make it equivalent to the uniform filter that @nneonneo proposed. – Jaime Aug 25 '13 at 04:45
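A minimal sketch of the tuple-of-axes variant @Jaime describes in that comment (the function name is illustrative; it assumes numpy >= 1.7 and the `rolling_window` helper from this answer):

import numpy as np

def strided_sliding_std_dev_noreshape(data, radius=5):
    # 4D view of shape (rows-win+1, cols-win+1, win, win); taking std over
    # the last two axes avoids the reshape, and therefore avoids the copy
    windowed = rolling_window(data, (2*radius, 2*radius))
    return windowed.std(axis=(-1, -2))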

You can first obtain the indices and then use np.take to form the new array:

def new_std_dev(image_original, radius=5):
    cols, rows = image_original.shape

    # First obtain the indices for the top-left window position
    diameter = np.arange(radius*2)
    x, y = np.meshgrid(diameter, diameter)
    index = np.ravel_multi_index((y, x), (cols, rows)).ravel()

    # Broadcast this across two dimensions and take the stdev
    index = (index + np.arange(rows - radius*2)[:, None]
                   + np.arange(cols - radius*2)[:, None, None] * rows)
    data = np.std(np.take(image_original, index), -1)

    # Add the zero borders back to the output array
    top = np.zeros((radius, rows - radius*2))
    sides = np.zeros((cols, radius))

    data = np.vstack((top, data, top))
    data = np.hstack((sides, data, sides))
    return data
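To make the index broadcasting concrete, here is a quick shape walkthrough (an illustration under the same naming as above, where `cols, rows = image.shape`):

import numpy as np

image = np.random.rand(50, 20)            # cols=50, rows=20
radius = 5
diameter = np.arange(radius*2)
x, y = np.meshgrid(diameter, diameter)
index = np.ravel_multi_index((y, x), image.shape).ravel()
print(index.shape)                        # (100,): one flattened 10x10 window
index = (index + np.arange(20 - radius*2)[:, None]
               + np.arange(50 - radius*2)[:, None, None] * 20)
print(index.shape)                        # (40, 10, 100): a window per interior pixel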

First generate some random data and check timings:

a = np.random.rand(50, 20)

print(np.allclose(new_std_dev(a), sliding_std_dev(a)))
True

%timeit sliding_std_dev(a)
100 loops, best of 3: 18 ms per loop

%timeit new_std_dev(a)
1000 loops, best of 3: 472 us per loop

For larger arrays it's always faster, as long as you have enough memory:

a = np.random.rand(200, 200)

print(np.allclose(new_std_dev(a), sliding_std_dev(a)))
True

%timeit sliding_std_dev(a)
1 loops, best of 3: 1.58 s per loop

%timeit new_std_dev(a)
10 loops, best of 3: 52.3 ms per loop

The original function is faster for very small arrays; the break-even point looks to be around hgt*wdt > 50. One thing to note: your function takes square frames and places the std dev at the bottom-right index, rather than sampling around the index. Is this intentional?

Daniel
  • Thank you @Ophion for your solution and for pinpointing the mistake in my original function. I in fact meant result[i,j] = np.std(image_original[i-radius:i+radius+1, j-radius:j+radius+1]), to have the window centered around the pixel. To get the same result with your algo, I modified diameter = np.arange(radius*2+1), and it seems to give the same result as my modified original function. It took me a while to understand your solution, but it is really brilliant and very efficient. Thank you again – baptiste pagneux Aug 25 '13 at 12:20
  • I will go with @nneonneo's solution. Your solution is fast, but I get a memory error when running it on a large image (which is strange to me, but I don't know what is going on under the hood). Thank you again anyway – baptiste pagneux Aug 26 '13 at 13:39
  • @baptistepagneux When using `np.take` it will create a three-dimensional array that looks like `(k,N,N)`, where `k` is the `k`th window and `NxN` is the windowed component. As you can see, this duplicates a lot of indices, drastically increasing your memory footprint, likely to `arr_size*N*N`. @nneonneo's solution is an optimized routine that does not duplicate the windows and so only requires an array of size `constant*arr_size`, `constant` being fairly small (looks like 2-3 from a quick glance at the code). Overall a much better way of doing things. – Daniel Aug 26 '13 at 13:49
  • @baptistepagneux When using `np.take` it will create a the three dimensional array that looks like this `(k,N,N)` where `k` is the `kth` window and the `NxN` is the windowed component. As you can see this duplicates a lot of indices drastically increasing your memory footprint likely `arr_size*N*N`. @nneonneo solution is an optimized routine that does not duplicate the windows and so only requires an array of size `constant*arr_size`. `constant` being fairly small looks like 2-3 from a quick glance at the code. Overall a much better way of doing things. – Daniel Aug 26 '13 at 13:49