Python iterate through array while finding the mean of the top k elements

Question

Suppose I have a Python array a=[3, 5, 2, 7, 5, 3, 6, 8, 4]. My goal is to iterate through this array 3 elements at a time returning the mean of the top 2 of the three elements.

Using the above array, during my iteration step, the first three elements are [3, 5, 2] and the mean of the top 2 elements is 4. The next three elements are [5, 2, 7] and the mean of the top 2 elements is 6. The next three elements are [2, 7, 5] and the mean of the top 2 elements is again 6. ...

Hence, the result for the above array would be [4, 6, 6, 6, 5.5, 7, 7].

What is the nicest way to write such a function?

The original question I had in mind was that for an input array of length m, we iterate through n elements at a time while finding the mean of the top k elements such that m >= n >= k. The question was phrased in the above manner for simplicity. I was hoping to generalize a good solution to the general case. — Student, Feb 27 '18 at 18:41
I'm voting to close this question as off-topic because it's just asking for code. Probably a homework question. — Izkata, Feb 27 '18 at 19:09
My solution seems to be about 4-5 times faster than (foslock's) accepted answer :) https://repl.it/repls/AliveTechnoGraph — גלעד ברקן, Feb 28 '18 at 16:47

foslock · Accepted Answer · 2018-03-07T18:36:33.170

14

Solution

You can use list slicing to work with consecutive sublists. Grab each three-element sublist, subtract its minimum from its sum to get the sum of the top two elements, then halve that to get the mean and append it to a result list.

Code

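def get_means(input_list):
    means = []
    # range works in both Python 2 and 3 (xrange is Python 2 only)
    for i in range(len(input_list)-2):
        three_elements = input_list[i:i+3]
        # Dropping the minimum leaves the sum of the two largest values
        sum_top_two = sum(three_elements) - min(three_elements)
        means.append(sum_top_two/2.0)
    return means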

Example

You can see your example input (and desired result) like so:

print(get_means([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

And more...

Some of the other answers focus more on performance, including one that uses a generator to avoid building large in-memory lists: https://stackoverflow.com/a/49001728/416500

edited Mar 07 '18 at 18:36

answered Feb 27 '18 at 04:00

foslock


You could avoid sorting using `sum_top_two = sum(three_elements) - min(three_elements)` – gyre Feb 27 '18 at 13:51
An improvement might be to turn the function into a generator, by replacing the `means.append` line with `yield`. This would be more memory efficient in case `input_list` is very large – thegreatemu Feb 27 '18 at 17:33
@gyre Good suggestion! Personally I like to mirror how I operate on the list myself (get the top two by sorting the three elements in my head) – foslock Feb 27 '18 at 23:56
@thegreatemu Also a good suggestion, and the written function is very easily converted to return generator. I return a list here since that's what the question asked. – foslock Feb 27 '18 at 23:56
My solution seems to be about 4-5 times faster :) https://repl.it/repls/AliveTechnoGraph – גלעד ברקן Feb 28 '18 at 16:47
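
Following the generator suggestion in the comments above, here is a minimal sketch of what that variant might look like in Python 3. The name iter_means is made up for this example and is not part of the original answer.

```python
def iter_means(input_list):
    """Lazily yield the mean of the top two values in each window of three."""
    for i in range(len(input_list) - 2):
        three_elements = input_list[i:i + 3]
        # Dropping the minimum leaves the sum of the two largest values.
        yield (sum(three_elements) - min(three_elements)) / 2.0


print(list(iter_means([3, 5, 2, 7, 5, 3, 6, 8, 4])))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
```

Wrapping the call in list() recovers the list the question asked for, while a plain for loop over iter_means(...) never holds more than one window in memory.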

score 12 · Answer 2 · answered Feb 27 '18 at 10:04

I believe in splitting the code into separate parts. Here that would be getting the sliding window, getting the top 2 elements, and calculating the mean. The cleanest way to do this is with generators.

Sliding window

Slight variation on evamicur's answer using tee, islice and zip to create the window:

import itertools

def windowed_iterator(iterable, n=2):
    # One independent iterator per window position, each offset by its index
    iterators = itertools.tee(iterable, n)
    iterators = (itertools.islice(it, i, None) for i, it in enumerate(iterators))
    yield from zip(*iterators)

windows = windowed_iterator(iterable=a, n=3)

[(3, 5, 2), (5, 2, 7), (2, 7, 5), (7, 5, 3), (5, 3, 6), (3, 6, 8), (6, 8, 4)]

top 2 elements

To calculate the mean of the 2 highest you can use any of the methods from the other answers; I think the heapq one is the clearest:

from heapq import nlargest
top_n = map(lambda x: nlargest(2, x), windows)

or equivalently

top_n = (nlargest(2, i) for i in windows)

[[5, 3], [7, 5], [7, 5], [7, 5], [6, 5], [8, 6], [8, 6]]

mean

from statistics import mean
means = map(mean, top_n)

[4, 6, 6, 6, 5.5, 7, 7]
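
The three stages can also be chained into a single lazy pipeline. The sketch below assumes Python 3; the helper name top_k_means and its n and k parameters are hypothetical and only serve to show how the pieces fit together.

```python
import itertools
from heapq import nlargest
from statistics import mean


def windowed_iterator(iterable, n=2):
    # Same tee/islice/zip construction as in the answer.
    iterators = itertools.tee(iterable, n)
    iterators = (itertools.islice(it, i, None) for i, it in enumerate(iterators))
    yield from zip(*iterators)


def top_k_means(iterable, n=3, k=2):
    """Hypothetical helper: lazily yield the mean of the k largest items per window of n."""
    windows = windowed_iterator(iterable, n)
    top_k = (nlargest(k, window) for window in windows)
    return map(mean, top_k)


a = [3, 5, 2, 7, 5, 3, 6, 8, 4]
print(list(top_k_means(a, n=3, k=2)))
# [4, 6, 6, 6, 5.5, 7, 7]
```

Nothing is computed until the result is consumed, so each stage only ever holds one window at a time.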

I like your windowed iterator function, I just timed them both on my machine and got equivalent performance for range(10**5), window size of 3 and top 2. When I make the window size and # largest bigger the deque seems to scale much better — evamicur, Feb 27 '18 at 18:46
oops I mistimed, the deque is very slightly faster for large windows, it seems the nlargest dominates in this case anyway — evamicur, Feb 27 '18 at 18:56
My solution seems to be about 60-80 times faster :) https://repl.it/repls/TragicNextCleantech — גלעד ברקן, Feb 28 '18 at 16:45

score 8 · Answer 3 · answered Feb 27 '18 at 04:02

The following code does what you need:

[sum(sorted(a[i:i + 3])[-2:]) / 2 for i in range(len(a) - 2)]

Given your a=[3, 5, 2, 7, 5, 3, 6, 8, 4], returns:

[4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

damores

301_Moved_Permanently · Answer 4 · 2018-02-27T10:29:05.557

itertools has a neat recipe to extract pairs of items from any iterable, not only indexable. You can adapt it slightly to extract triplets instead:

def tripletwise(iterable):
    a, b, c = itertools.tee(iterable, 3)
    next(b, None)
    next(itertools.islice(c, 2, 2), None)
    return zip(a, b, c)

Using that, you can simplify iterating over all triplets:

def windowed_means(iterable):
    return [
        (sum(window) - min(window)) / 2.0
        for window in tripletwise(iterable)
    ]
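
As a usage sketch (assuming Python 3), the point of the tee-based approach is that the input does not need to support len() or slicing. The generator named stream below is just an illustrative input.

```python
import itertools


def tripletwise(iterable):
    a, b, c = itertools.tee(iterable, 3)
    next(b, None)                          # advance b by one item
    next(itertools.islice(c, 2, 2), None)  # advance c by two items
    return zip(a, b, c)


def windowed_means(iterable):
    return [(sum(window) - min(window)) / 2.0 for window in tripletwise(iterable)]


# A generator cannot be sliced, but the tee-based windows still work.
stream = (x for x in [3, 5, 2, 7, 5, 3, 6, 8, 4])
print(windowed_means(stream))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
```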

evamicur · Answer 5 · 2018-02-27T19:11:28.760

Iterator-only solution

foslock's solution is definitely fine, but I wanted to play around and make a version of this with generators. It only stores a deque of length window_size as it iterates through the original list, then finds the n largest values in each window and calculates their mean.

import itertools as it
from collections import deque
from heapq import nlargest
from statistics import mean

def windowed(iterable, n):
    _iter = iter(iterable)
    d = deque((it.islice(_iter, n)), maxlen=n)
    yield tuple(d)
    for i in _iter:
        d.append(i)
        yield tuple(d)

a = [3, 5, 2, 7, 5, 3, 6, 8, 4]
means = [mean(nlargest(2, w)) for w in windowed(a, 3)]
print(means)

result:

[4, 6, 6, 6, 5.5, 7, 7]

To change the number of elements (window size) or the number of largest elements, just change the arguments to the respective functions. This approach also avoids slicing, so it can be more easily applied to iterables that you can't or don't want to slice.
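
For the general case from the question (window of n items, mean of the k largest), one possible way to package the same idea is sketched below. The function name sliding_top_k_means and the argument check are assumptions made for this example. Unlike windowed above, this sketch yields nothing when the input has fewer than n items, which is a design choice rather than a requirement.

```python
from collections import deque
from heapq import nlargest
from itertools import islice
from statistics import mean


def sliding_top_k_means(iterable, n, k):
    """Hypothetical helper: yield the mean of the k largest values in each window of n."""
    if not 1 <= k <= n:
        raise ValueError("expected 1 <= k <= n")
    iterator = iter(iterable)
    window = deque(islice(iterator, n), maxlen=n)
    if len(window) < n:
        return  # fewer than n items: no complete window
    yield mean(nlargest(k, window))
    for item in iterator:
        window.append(item)  # maxlen drops the oldest item automatically
        yield mean(nlargest(k, window))


a = [3, 5, 2, 7, 5, 3, 6, 8, 4]
print(list(sliding_top_k_means(a, n=3, k=2)))
# [4, 6, 6, 6, 5.5, 7, 7]
```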

Timings

def deque_version(iterable, n, k):
    means = (mean(nlargest(n, w)) for w in windowed(iterable, k))
    for m in means:
        pass

def tee_version(iterable, n, k):
    means = (mean(nlargest(n, w)) for w in windowed_iterator(iterable, k))
    for m in means:
        pass

a = list(range(10**5))


n = 3 
k = 2
print("n={} k={}".format(n, k))
print("Deque")
%timeit deque_version(a, n, k)
print("Tee")
%timeit tee_version(a, n, k)

n = 1000 
k = 2
print("n={} k={}".format(n, k))
print("Deque")
%timeit deque_version(a, n, k)
print("Tee")
%timeit tee_version(a, n, k)

n = 50
k = 25
print("n={} k={}".format(n, k))
print("Deque")
%timeit deque_version(a, n, k)
print("Tee")
%timeit tee_version(a, n, k)


result:

n=3 k=2
Deque
1.28 s ± 3.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tee
1.28 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n=1000 k=2
Deque
1.28 s ± 8.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tee
1.27 s ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n=50 k=25
Deque
2.46 s ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tee
2.47 s ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So apparently itertools tee vs. deque doesn't matter much.

My solution seems to be about 50 times faster :) https://repl.it/repls/TragicNextCleantech — גלעד ברקן, Feb 28 '18 at 16:46
I didn't bother timing it but yours very well may be faster, it's less general though. — evamicur, Feb 28 '18 at 22:10
I'm pretty sure tee is often implemented using deques so the negligible performance difference makes sense — gyre, Mar 03 '18 at 05:21

score 3 · Answer 6 · answered Feb 27 '18 at 09:15

As a vectorized approach using NumPy (imported as np), you can do the following:

np.sort(np.column_stack((a[:-2], a[1:-1], a[2:])))[:,-2:].mean(axis=1)

Demo:

In [13]: a=np.array([3, 5, 2, 7, 5, 3, 6, 8, 4])

In [14]: np.sort(np.column_stack((a[:-2], a[1:-1], a[2:])))[:,-2:].mean(axis=1)
Out[14]: array([4. , 6. , 6. , 6. , 5.5, 7. , 7. ])

John R · Answer 7 · 2018-04-04T19:27:52.777

1

Use list comprehension

from statistics import mean

yourList=[3, 5, 2, 7, 5, 3, 6, 8, 4]

k = 3

listYouWant = [mean(x) for x in [y[1:k] for y in [sorted(yourList[z:z+k]) for z in range(len(yourList)) if z < len(yourList) -(k-1)]]]

yields [4, 6, 6, 6, 5.5, 7, 7]

edited Apr 04 '18 at 19:27

answered Feb 27 '18 at 03:53

John R

score 1 · Answer 8 · answered Feb 27 '18 at 04:01

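You can try this (Python 2). Note that it splits the list into consecutive, non-overlapping chunks of n elements rather than sliding a window: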

>>> a
[3, 5, 2, 7, 5, 3, 6, 8, 4]
>>> n
3
>>> m
2
>>> [sum(sorted(a[i*n:i*n+n])[1:])/m for i in range(len(a)/n)]
[4, 6, 7]

That is,

>>> a
[3, 5, 2, 7, 5, 3, 6, 8, 4]
>>> n
3
>>> [i for i in range(len(a)/n)]
[0, 1, 2]
>>> m=2
>>> [a[i*n:i*n+n] for i in range(len(a)/n)]
[[3, 5, 2], [7, 5, 3], [6, 8, 4]]
>>> [sorted(a[i*n:i*n+n]) for i in range(len(a)/n)]
[[2, 3, 5], [3, 5, 7], [4, 6, 8]]
>>> [sorted(a[i*n:i*n+n])[1:] for i in range(len(a)/n)]
[[3, 5], [5, 7], [6, 8]]
>>> [sum(sorted(a[i*n:i*n+n])[1:]) for i in range(len(a)/n)]
[8, 12, 14]
>>> [sum(sorted(a[i*n:i*n+n])[1:])/m for i in range(len(a)/n)]
[4, 6, 7]

score 1 · Answer 9 · answered Feb 27 '18 at 04:03

from statistics import mean

a=[3, 5, 2, 7, 5, 3, 6, 8, 4]
mean_list = [
    mean(x)
    for x in [
        y[1:3]  # drop the smallest of the three sorted values
        for y in [
            sorted(a[z:z+3])
            for z in range(len(a))
            if z < len(a) - 2
        ]
    ]
]

score 1 · Answer 10 · answered Feb 27 '18 at 04:29

You can look at it from a generator's perspective too:

a=[3, 5, 2, 7, 5, 3, 6, 8, 4]

def gen_list():
    # A list of length len(a) has len(a) - 2 windows of three elements
    for i in range(len(a) - 2):
        yield sorted(a[i:i + 3], reverse=True)

apply_division = map(lambda x: sum(x[:2]) / len(x[:2]), gen_list())


if __name__=="__main__":
    result = list(apply_division)
    print(result)
    # [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

score 1 · Answer 11 · answered Feb 27 '18 at 05:09

You need a sliding window iterator together with the mean of the two largest elements. Here is a more generic solution that works with a sliding window of size n, where n is any integer of at least 2; it averages the n - 1 largest elements of each window.

from itertools import islice

def calculate_means(items, window_length=3):
    stop_seq = window_length - 1
    sliding_window = [sorted(islice(items[x:], window_length), reverse=True) for x in range(len(items) - stop_seq)]
    return [sum(a[:stop_seq]) / stop_seq for a in sliding_window]

>>> calculate_means([3, 5, 2, 7, 5, 3, 6, 8, 4])
[4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

score 1 · Answer 12 · answered Feb 27 '18 at 17:13

For the record, here is a functional version:

>>> f=lambda values:[] if len(values)<=2 else [(sum(values[:3])-min(values[:3]))/2]+f(values[1:])
>>> f([3, 5, 2, 7, 5, 3, 6, 8, 4])
[4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
>>> f([3, 5, 2])
[4.0]
>>> f([3, 5])
[]
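
The recursive lambda copies the tail of the list on every call and is bounded by Python's recursion limit (1000 frames by default in CPython), so it raises RecursionError on inputs much longer than that. As a sketch of a functional alternative without recursion, assuming Python 3 and an indexable input (the name g is arbitrary):

```python
def g(values):
    # Zipping three staggered slices of the list yields the sliding windows.
    windows = zip(values, values[1:], values[2:])
    return list(map(lambda w: (sum(w) - min(w)) / 2, windows))


print(g([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
print(g([3, 5]))
# []
print(len(g(list(range(5000)))))
# 4998
```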

score 1 · Answer 13 · answered Feb 27 '18 at 17:24

Using a sliding window algorithm and the third-party more_itertools.windowed tool:

import statistics as stats

import more_itertools as mit


lst = [3, 5, 2, 7, 5, 3, 6, 8, 4]

[stats.mean(sorted(w)[1:]) for w in mit.windowed(lst, 3)]
# [4, 6, 6, 6, 5.5, 7, 7]

See also @Maarten Fabré's related post.

FatihAkici · Answer 14 · 2018-02-27T16:49:47.543

Don't sort your sub-lists; that operation is O(n log n)! Instead, find the largest two numbers with an O(n) algorithm. This makes the solution more efficient, and the gain becomes more visible on the larger problem of "find the sum of the top m out of a moving window of k items" for large m and k.

def largestTwoMeans(myList):
    means = []
    for i in range(len(myList)-2):
        subList = myList[i:i+3]
        # Track the largest and second-largest values in a single pass
        first, second = -float("inf"), -float("inf")
        for f in subList:
            if f >= first:
                first, second = f, first
            elif first > f > second:
                second = f
        means.append((first+second)/2.0)
    return means

print(largestTwoMeans([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

Here is the generator version:

def largestTwoMeans(myList):
    for i in range(len(myList)-2):
        subList = myList[i:i+3]
        # Track the largest and second-largest values in a single pass
        first, second = -float("inf"), -float("inf")
        for f in subList:
            if f >= first:
                first, second = f, first
            elif first > f > second:
                second = f
        yield (first+second)/2.0

print(list(largestTwoMeans([3, 5, 2, 7, 5, 3, 6, 8, 4])))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

Well `n` is 3 in this case so sorting makes sense as it will be faster than manual iteration. Besides you’re comparing numbers to `None` which is only valid in Python 2. — 301_Moved_Permanently, Feb 27 '18 at 09:24
That's right but the problem mentions `k`, for which my answer aims to generalize. The inefficiency of sort will be emphasized as `k` grows. And yes I coded in Python 2, but I'll address that in my edit. Thank you. — FatihAkici, Feb 27 '18 at 16:04
@FatihAkici Your "Don't sort your sub-lists, that operation is nlog(n)" comment is misleading. Sorting the list of constant size 3 (as the question defines without question) is constant time, and only in the severely generalized case does the time complexity increase. You are not wrong, but be careful of https://en.wikipedia.org/wiki/Program_optimization — foslock, Feb 28 '18 at 00:01

גלעד ברקן · Answer 15 · 2018-02-28T14:22:34.487

To sort three numbers, we need a maximum of three comparisons. To find the lowest of three numbers we only need two by quickselect. We also don't need to make any sublist copies:

a,b,c

a < b
? (a < c ? a : c)
: (b < c ? b : c)

def f(A):
  means = [None] * (len(A) - 2)

  for i in range(len(A) - 2):
    if A[i] < A[i+1]:
      # A[i+1] is not the lowest, so drop the smaller of A[i] and A[i+2]
      means[i] = (A[i+1] + A[i+2]) / 2.0 if A[i] < A[i+2] else (A[i] + A[i+1]) / 2.0
    else:
      # A[i] is not the lowest, so drop the smaller of A[i+1] and A[i+2]
      means[i] = (A[i] + A[i+2]) / 2.0 if A[i+1] < A[i+2] else (A[i] + A[i+1]) / 2.0

  return means

print(f([3, 5, 2, 7, 5, 3, 6, 8, 4]))
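
As a sanity check on the branch logic, the sketch below restates f for Python 3 and compares it against the simpler sum-minus-min formulation on random data. The reference function and the random test data are assumptions added for this check, not part of the answer.

```python
import random


def f(A):
    means = [None] * (len(A) - 2)
    for i in range(len(A) - 2):
        if A[i] < A[i+1]:
            # A[i+1] is not the lowest: drop the smaller of A[i] and A[i+2].
            means[i] = (A[i+1] + A[i+2]) / 2.0 if A[i] < A[i+2] else (A[i] + A[i+1]) / 2.0
        else:
            # A[i] is not the lowest: drop the smaller of A[i+1] and A[i+2].
            means[i] = (A[i] + A[i+2]) / 2.0 if A[i+1] < A[i+2] else (A[i] + A[i+1]) / 2.0
    return means


def reference(A):
    # Straightforward version: slice each window and drop its minimum.
    return [(sum(A[i:i+3]) - min(A[i:i+3])) / 2.0 for i in range(len(A) - 2)]


print(f([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]

random.seed(0)
data = [random.randint(0, 9) for _ in range(1000)]
assert f(data) == reference(data)
```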