121

I am currently working on an algorithm to implement a rolling median filter (analogous to a rolling mean filter) in C. From my search of the literature, there appear to be two reasonably efficient ways to do it. The first is to sort the initial window of values, then perform a binary search to insert the new value and remove the existing one at each iteration.
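
For concreteness, a minimal C++ sketch of that first approach (not part of the original question; it assumes an odd window length and keeps the current window in a sorted std::vector):

#include <algorithm>
#include <vector>

// Advance the window by one sample: binary-search to erase the outgoing value,
// binary-search to insert the incoming one, then read the middle element.
// Each step costs O(log k) for the searches plus O(k) for the element shifts.
double slide_and_median(std::vector<double>& win, double incoming, double outgoing)
{
    win.erase(std::lower_bound(win.begin(), win.end(), outgoing));
    win.insert(std::lower_bound(win.begin(), win.end(), incoming), incoming);
    return win[win.size() / 2];   // median of an odd-length window
}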

The second (from Härdle and Steiger, 1995, JRSS-C, Algorithm 296) builds a double-ended heap structure, with a maxheap on one end, a minheap on the other, and the median in the middle. This yields a linear-time algorithm instead of one that is O(n log n).

Here is my problem: implementing the former is doable, but I need to run this on millions of time series, so efficiency matters a lot. The latter is proving very difficult to implement. I found code in the Trunmed.c file of the code for the stats package of R, but it is rather indecipherable.

Does anyone know of a well-written C implementation for the linear time rolling median algorithm?

Edit: Link to Trunmed.c code http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#mYw3h_Lb_e0/R-2.2.0/src/library/stats/src/Trunmed.c

pixelistik
AWB
  • Just implemented a moving mean... moving median is somewhat more tricky. Try googling moving median. – hookenz Aug 20 '09 at 23:11
  • Tried google and google code search. It turned up the Trunmed.c code and an implementation in another language for a SGI port of the Trunmed code (from what I could tell). Also, the JRSS algorithm I cited is apparently the only one in the journal's series for which the original code was not archived. – AWB Aug 20 '09 at 23:23
  • How many numbers do you have in each time series? Even with a million of them, if you only have a few thousand numbers, it might not take longer than a minute or two to run (if your code is written efficiently). – Dana the Sane Aug 21 '09 at 00:37
  • That code reference is ancient! R 2.2.0 is over three years old, we are currently at R 2.9.1 with 2.9.2 slated for September 24 and R 2.10.0 in October. – Dirk Eddelbuettel Aug 21 '09 at 01:03
  • 16
    how is the two heaps solution linear? it's O(n log k) where k is the window size because the heap's delete is O(log k). – yairchu Aug 30 '09 at 09:51
  • 4
    Some implementations and comparisons: https://github.com/suomela/median-filter – Jukka Suomela Apr 21 '14 at 11:24
  • Related: [Find running median from a stream of integers](https://stackoverflow.com/questions/10657503/find-running-median-from-a-stream-of-integers) – Bernhard Barker Jun 21 '19 at 14:00

13 Answers

30

I have looked at R's src/library/stats/src/Trunmed.c a few times, as I too wanted something similar in a standalone C++ class / C subroutine. Note that these are actually two implementations in one; see src/library/stats/man/runmed.Rd (the source of the help file), which says

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.

Edit 1: Besides the link to the older version of Trunmed.c above, here are current SVN copies of

Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning which may be a suitable starting point for a windowed approach.

Josh O'Brien
Dirk Eddelbuettel
  • Thanks Dirk. Once I get a clean solution, I am planning on releasing it under GPL. I would be interested in setting up a R and Python interfaces as well. – AWB Aug 21 '09 at 15:45
  • 9
    @AWB What ended up happening with this idea? Did you incorporate your solution into a package? – Xu Wang Oct 31 '11 at 02:50
25

I couldn't find a modern implementation of a C++ data structure with order statistics, so I ended up implementing both ideas from the TopCoder link suggested by MAK (Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets, etc.) with O(ln N) per insert/delete, but it does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75th percentile, but not both at the same time.
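
A minimal sketch of that first idea using two std::multisets (my own illustration, not the TopCoder code; it tracks the lower median and assumes the value being erased is actually present in the window):

#include <iterator>
#include <set>

// `low` holds the smaller half (its largest element is the lower median),
// `high` holds the larger half. Insert and erase are both O(log N); erasing
// the value that leaves the window works by value via multiset::find.
struct TwoMultisetMedian {
    std::multiset<double> low, high;

    void rebalance() {
        while (low.size() > high.size() + 1) {
            high.insert(*low.rbegin());
            low.erase(std::prev(low.end()));
        }
        while (high.size() > low.size()) {
            low.insert(*high.begin());
            high.erase(high.begin());
        }
    }
    void insert(double x) {
        if (low.empty() || x <= *low.rbegin()) low.insert(x); else high.insert(x);
        rebalance();
    }
    void erase(double x) {   // x must currently be in the window
        if (!low.empty() && x <= *low.rbegin()) low.erase(low.find(x));
        else                                    high.erase(high.find(x));
        rebalance();
    }
    double median() const { return *low.rbegin(); }   // lower median; window must be non-empty
};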

Segment tree

The second idea uses a segment tree, which is O(ln N) for inserts/deletions/queries but is more flexible. Best of all, the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!!

The C++ code is similar to what Denis posted (see "Here's a simple algorithm for quantized data" below).
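
For illustration only (my own sketch, not the answer's code): the same value-indexed idea with a Fenwick (binary indexed) tree in place of a full segment tree, assuming integer data in the range [0, N):

#include <vector>

// Counts are indexed by value. add() updates a count in O(log N); kth() finds
// the k-th smallest value currently in the window, also in O(log N).
struct ValueFenwick {
    std::vector<int> t;                  // t.size() == N + 1, 1-based tree
    explicit ValueFenwick(int N) : t(N + 1, 0) {}

    void add(int value, int delta) {     // delta = +1 when a value enters, -1 when it leaves
        for (int i = value + 1; i < (int)t.size(); i += i & -i) t[i] += delta;
    }
    int kth(int k) const {               // 1-based rank; k = (window + 1) / 2 gives the median
        int pos = 0, rem = k;
        for (int step = 1 << 20; step > 0; step >>= 1)   // 2^20 covers N up to ~1e6
            if (pos + step < (int)t.size() && t[pos + step] < rem) {
                pos += step;
                rem -= t[pos];
            }
        return pos;                      // 0-based value whose cumulative count reaches k
    }
};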

GNU Order Statistic Trees

Just before giving up, I found that libstdc++ contains order statistic trees!!!

These have two critical operations:

iter  = tree.find_by_order(k)      // iterator to the k-th smallest element (0-based)
order = tree.order_of_key(value)   // number of elements strictly less than value

See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree for use in a convenience header for compilers supporting C++0x/C++11-style template aliases ("partial typedefs"):

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>   // std::less

// A red-black tree storing keys of type T and their order
// statistics. Because the tree uses
// tree_order_statistics_node_update as its update policy, it
// provides the methods find_by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
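
A small usage sketch (my own, GCC-specific since pb_ds is a libstdc++ extension; note, as the comments below explain, that duplicate keys are collapsed, so this behaves as a set rather than a multiset):

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>
#include <iostream>

using ost = __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                             __gnu_pbds::rb_tree_tag,
                             __gnu_pbds::tree_order_statistics_node_update>;

int main() {
    ost s;
    for (int v : {5, 1, 9, 3, 7}) s.insert(v);
    std::cout << *s.find_by_order(s.size() / 2) << '\n';   // 3rd smallest of 5 distinct keys -> 5
    std::cout << s.order_of_key(7) << '\n';                // keys strictly below 7 -> 3
}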
Leo Goodstadt
  • Actually, the libstdc++ extension containers do *not* allow for multiple values !by design! As suggested by my name above (t_order_statistic_set), multiple values are merged. So, they need a bit more work for our purposes :-( – Leo Goodstadt Jun 27 '12 at 18:26
  • We need to 1) make a map of values to count (instead of sets) 2) the branch sizes should reflect the count of the keys (libstdc++-v3/include/ext/pb_ds/detail/tree_policy/order_statistics_imp.hpp) inherit from the tree, and 3) overload insert() to increase the count / call update_to_top() if the value is already present 4) overload erase() to decrease the count / call update_to_top() if the value is not unique (See libstdc++-v3/include/ext/pb_ds/detail/rb_tree_map_/rb_tree_.hpp) Any volunteers?? – Leo Goodstadt Jun 27 '12 at 18:37
16

I've done a C implementation here. A few more details are in this question: Rolling median in C - Turlach implementation.

Sample usage:

#include <stdio.h>
#include <stdlib.h>

/* Mediator, MediatorNew, MediatorInsert, MediatorMedian and ShowTree are
   declared in the linked implementation. */
int main(int argc, char* argv[])
{
   int i, v;
   Mediator* m = MediatorNew(15);
 
   for (i=0; i<30; i++) {
      v = rand() & 127;
      printf("Inserting %3d \n", v);
      MediatorInsert(m, v);
      v = MediatorMedian(m);
      printf("Median = %3d.\n\n", v);
      ShowTree(m);
   }
}
pevik
AShelly
12

I use this incremental median estimator:

median += eta * sgn(sample - median)

which has the same form as the more common mean estimator:

mean += eta * (sample - mean)

Here eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources use something like eta = 1 / n to converge, where n is the number of samples seen so far.)

Also, I modified the median estimator to make it work for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0, 1]. This essentially shifts the sgn() function's symmetrical output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1 - p of the data are less than/greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.
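
A minimal sketch of these update rules as plain functions (following the formulas above; eta, p and sgn are as described):

// Sign function returning -1, 0 or +1.
static int sgn(double x) { return (x > 0.0) - (x < 0.0); }

// One incremental update of the median estimate (eta is the small learning rate).
double update_median(double median, double sample, double eta) {
    return median + eta * sgn(sample - median);
}

// One incremental update of the p-quantile estimate; p = 0.5 reduces to the median.
double update_quantile(double quantile, double sample, double eta, double p) {
    return quantile + eta * (sgn(sample - quantile) + 2.0 * p - 1.0);
}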

Regexident
Tyler Streeter
  • 2
    Cool, Here's a modification that adjusts 'eta' based on the running mean...(mean is used as a rough estimate of the median so it converges on big values at the same rate it converges on tiny values). i.e. eta is tuned automatically. http://stackoverflow.com/questions/11482529/median-filter-super-efficient-implementation/15150968#15150968 – Jeff McClintock Mar 01 '13 at 05:25
  • 3
    For a similar technique, see this paper on frugal streaming: http://arxiv.org/pdf/1407.1121v1.pdf It can estimate any quartile and adapts to changes in the mean. It requires that you only store two values: last estimate and direction of last adjustment (+1 or -1). The algorithm is simple to implement. I find that the error is within 5% about 97% of the time. – Paul Chernoch Aug 24 '15 at 16:31
9

Here's a simple algorithm for quantized data (months later):

""" median1.py: moving median 1d for quantized, e.g. 8-bit data

Method: cache the median, so that wider windows are faster.
    The code is simple -- no heaps, no trees.

Keywords: median filter, moving median, running median, numpy, scipy

See Perreault + Hebert, Median Filtering in Constant Time, 2007,
    http://nomis80.org/ctmf.html: nice 6-page paper and C code,
    mainly for 2d images

Example:
    y = medians( x, window=window, nlevel=nlevel )
    uses:
    med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
    med.addsub( +, - )  -- see the picture in Perreault
    m = med.median()  -- using cached m, summ

How it works:
    picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
        counts: . 1 . . 1 . 1 .
        sums:   0 1 1 1 2 2 3 3
                        ^ sums[3] < 2 <= sums[4] <=> median 4
        addsub( 0, 1 )  m, summ stay the same
        addsub( 5, 1 )  slide right
        addsub( 5, 6 )  slide left

Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)

See also:
    scipy.signal.medfilt -- runtime roughly ~ window size
    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

"""

from __future__ import division
import numpy as np  # bincount, pad0

__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"


#...............................................................................
class Median1:
    """ moving median 1d for quantized, e.g. 8-bit data """

    def __init__( s, nlevel, window, counts ):
        s.nlevel = nlevel  # >= len(counts)
        s.window = window  # == sum(counts)
        s.half = (window // 2) + 1  # odd or even
        s.setcounts( counts )

    def median( s ):
        """ step up or down until sum cnt to m-1 < half <= sum to m """
        if s.summ - s.cnt[s.m] < s.half <= s.summ:
            return s.m
        j, sumj = s.m, s.summ
        if sumj <= s.half:
            while j < s.nlevel - 1:
                j += 1
                sumj += s.cnt[j]
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        else:
            while j > 0:
                sumj -= s.cnt[j]
                j -= 1
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        s.m, s.summ = j, sumj
        return s.m

    def addsub( s, add, sub ):
        s.cnt[add] += 1
        s.cnt[sub] -= 1
        assert s.cnt[sub] >= 0, (add, sub)
        if add <= s.m:
            s.summ += 1
        if sub <= s.m:
            s.summ -= 1

    def setcounts( s, counts ):
        assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
        if len(counts) < s.nlevel:
            counts = pad0__( counts, s.nlevel )  # numpy array / list
        sumcounts = sum(counts)
        assert sumcounts == s.window, (sumcounts, s.window)
        s.cnt = counts
        s.slowmedian()

    def slowmedian( s ):
        j, sumj = -1, 0
        while sumj < s.half:
            j += 1
            sumj += s.cnt[j]
        s.m, s.summ = j, sumj

    def __str__( s ):
        return ("median %d: " % s.m) + \
            "".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])

#...............................................................................
def medianfilter( x, window, nlevel=256 ):
    """ moving medians, y[j] = median( x[j:j+window] )
        -> a shorter list, len(y) = len(x) - window + 1
    """
    assert len(x) >= window, (len(x), window)
    # np.clip( x, 0, nlevel-1, out=x )
        # cf http://scipy.org/Cookbook/Rebinning
    cnt = np.bincount( x[0:window] )
    med = Median1( nlevel=nlevel, window=window, counts=cnt )
    y = (len(x) - window + 1) * [0]
    y[0] = med.median()
    for j in xrange( len(x) - window ):
        med.addsub( x[j+window], x[j] )
        y[j+1] = med.median()
    return y  # list
    # return np.array( y )

def pad0__( x, tolen ):
    """ pad x with 0 s, numpy array or list """
    n = tolen - len(x)
    if n > 0:
        try:
            x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
        except NameError:
            x += n * [0]
    return x

#...............................................................................
if __name__ == "__main__":
    Len = 10000
    window = 3
    nlevel = 256
    period = 100

    np.set_printoptions( 2, threshold=100, edgeitems=10 )
    # print medians( np.arange(3), 3 )

    sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
        + 1) * (nlevel-1) / 2
    x = np.asarray( sinwave, int )
    print "x:", x
    for window in ( 3, 31, 63, 127, 255 ):
        if window > Len:  continue
        print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
        y = medianfilter( x, window=window, nlevel=nlevel )
        print np.array( y )

# end median1.py
denis
4

A rolling median can be found by maintaining two partitions of the numbers.

For maintaining the partitions, use a min-heap and a max-heap.

The max-heap will contain numbers smaller than or equal to the median.

The min-heap will contain numbers greater than or equal to the median.

Balancing constraint: if the total number of elements is even, then both heaps should have equal numbers of elements.

If the total number of elements is odd, then the max-heap will have one more element than the min-heap.

Median element: if both partitions hold equal numbers of elements, the median is half the sum of the max element of the first partition and the min element of the second partition.

Otherwise the median is the max element of the first partition.

Algorithm -
1- Take two heaps (1 min-heap and 1 max-heap)
   Max heap will contain the first half of the elements
   Min heap will contain the second half of the elements

2- Compare the new number from the stream with the top of the max heap;
   if it is smaller or equal, add that number to the max heap.
   Otherwise add the number to the min heap.

3- If the min heap has more elements than the max heap,
   then remove the top element of the min heap and add it to the max heap.
   If the max heap has more than one element more than the min heap,
   then remove the top element of the max heap and add it to the min heap.

4- If both heaps have an equal number of elements, then the median
   is half the sum of the max element of the max heap and the min element of the min heap.
   Otherwise the median is the max element of the first partition.
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Scanner;

public class Solution {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        RunningMedianHeaps s = new RunningMedianHeaps();
        int n = in.nextInt();
        for(int a_i=0; a_i < n; a_i++){
            printMedian(s,in.nextInt());
        }
        in.close();       
    }

    public static void printMedian(RunningMedianHeaps s, int nextNum){
            s.addNumberInHeap(nextNum);
            System.out.printf("%.1f\n",s.getMedian());
    }
}

class RunningMedianHeaps{
    PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();
    PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Comparator.reverseOrder());

    public double getMedian() {

        int size = minHeap.size() + maxHeap.size();     
        if(size % 2 == 0)
            return (maxHeap.peek()+minHeap.peek())/2.0;
        return maxHeap.peek()*1.0;
    }

    private void balanceHeaps() {
        if(maxHeap.size() < minHeap.size())
        {
            maxHeap.add(minHeap.poll());
        }   
        else if(maxHeap.size() > 1+minHeap.size())
        {
            minHeap.add(maxHeap.poll());
        }
    }

    public void addNumberInHeap(int num) {
        if(maxHeap.size()==0 || num <= maxHeap.peek())
        {
            maxHeap.add(num);
        }
        else
        {
            minHeap.add(num);
        }
        balanceHeaps();
    }
}
Harshit
  • 1
    Its not clear to me how much benefit a third Java answer provides for a C question. You should ask a new question, and then supply your Java answer in that question. – jww Feb 19 '17 at 05:06
  • logic died after reading this 'then remove top element of Min Heap and add in Min Heap.' .At least have the courtesy to read the algo before posting – Cyclotron3x3 May 21 '17 at 07:17
  • 5
    This algorithm is not for a rolling median but for the median of a growing number of elements. For the rolling median, one must also remove an element from the heaps, which needs to be found first. – Walter Mar 24 '18 at 17:55
2

It is maybe worth pointing out that there is a special case which has a simple exact solution: when all the values in the stream are integers within a (relatively) small defined range. For instance, assume they must all lie between 0 and 1023. In this case just define an array of 1024 elements and a count, and clear all of these values. For each value in the stream increment the corresponding bin and the count. After the stream ends, find the bin which holds the (count/2)-th value - easily accomplished by summing successive bins starting from 0. Using the same method, the value of an arbitrary rank order may be found. (There is a minor complication if bin saturation must be detected and the storage bins "upgraded" to a larger type during a run.)

This special case may seem artificial, but in practice it is very common. It can also be applied as an approximation for real numbers if they lie within a range and a "good enough" level of precision is known. This would hold for pretty much any set of measurements on a group of "real world" objects. For instance, the heights or weights of a group of people. Not a big enough set? It would work just as well for the lengths or weights of all the (individual) bacteria on the planet - assuming somebody could supply the data!

It looks like I misread the original - it seems to want a sliding window median rather than just the median of a very long stream. This approach still works for that. Load the first N stream values for the initial window, then for the N+1th stream value increment the corresponding bin while decrementing the bin corresponding to the 0th stream value. It is necessary in this case to retain the last N values to allow the decrement, which can be done efficiently by cyclically addressing an array of size N. Since the position of the median can only change by -2, -1, 0, 1, 2 on each step of the sliding window, it isn't necessary to sum all the bins up to the median on each step; just adjust the "median pointer" depending upon which side(s) the bins were modified on. For instance, if both the new value and the one being removed fall below the current median then it doesn't change (offset = 0). The method breaks down when N becomes too large to hold conveniently in memory.
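
A simplified sketch of this sliding-window variant (my own illustration; values are assumed to be integers in [0, 1023], and the median is re-found by a cumulative scan rather than the median-pointer adjustment described above):

#include <vector>

// The histogram is updated incrementally as the window slides; the median bin
// is then located by scanning cumulative counts from 0.
std::vector<int> sliding_median_hist(const std::vector<int>& x, int window)
{
    std::vector<int> counts(1024, 0), medians;
    for (int i = 0; i < (int)x.size(); ++i) {
        ++counts[x[i]];
        if (i >= window) --counts[x[i - window]];   // value leaving the window
        if (i >= window - 1) {                      // window full: find the median bin
            int need = (window + 1) / 2, cum = 0, m = 0;
            while (cum < need) cum += counts[m++];
            medians.push_back(m - 1);
        }
    }
    return medians;
}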

mathog
1

If you have the ability to reference values as a function of points in time, you could sample values with replacement, applying bootstrapping to generate a bootstrapped median value within confidence intervals. This may let you calculate an approximate median with greater efficiency than constantly sorting incoming values into a data structure.
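
A rough sketch of that idea (my own illustration; the number of resamples and the use of the middle resample median as the point estimate are arbitrary choices):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Resample the current window with replacement a few times, take the median of
// each resample, and return the middle resample median as the estimate; the
// spread of the resample medians gives a confidence interval.
double bootstrap_median(const std::vector<double>& window, int nboot = 20)
{
    static std::mt19937 rng(12345);
    std::uniform_int_distribution<std::size_t> pick(0, window.size() - 1);
    std::vector<double> meds;
    for (int b = 0; b < nboot; ++b) {
        std::vector<double> resample(window.size());
        for (double& v : resample) v = window[pick(rng)];
        std::nth_element(resample.begin(),
                         resample.begin() + resample.size() / 2, resample.end());
        meds.push_back(resample[resample.size() / 2]);
    }
    std::nth_element(meds.begin(), meds.begin() + meds.size() / 2, meds.end());
    return meds[meds.size() / 2];
}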

Alex Reynolds
0

Here is one that can be used when exact output is not important (for display purposes, etc.). You need totalcount and lastmedian, plus the newvalue.

totalcount++;
newmedian = lastmedian
          + (newvalue > lastmedian ? 1 : -1)
          * (lastmedian == 0 ? newvalue : lastmedian / totalcount * 2);

Produces quite accurate results for things like page_display_time.

Rules: the input stream needs to be smooth (on the order of page display times), large in count (>30 or so), and have a non-zero median.

Example: page load times, 800 items, 10 ms...3000 ms, average 90 ms, real median: 11 ms.

After 30 inputs, the median's error is generally <=20% (9 ms..12 ms), and it keeps shrinking. After 800 inputs, the error is +-2%.

Another thinker with a similar solution is here: Median Filter Super efficient implementation

Johan
0

For those who need a running median in Java...PriorityQueue is your friend. O(log N) insert, O(1) current median, and O(N) remove. If you know the distribution of your data you can do a lot better than this.

import java.util.Comparator;
import java.util.PriorityQueue;

public class RunningMedian {
  // Two priority queues, one of reversed order.
  PriorityQueue<Integer> lower = new PriorityQueue<Integer>(10,
          new Comparator<Integer>() {
              public int compare(Integer arg0, Integer arg1) {
                  // Reverse the natural order; compareTo avoids the reference
                  // comparison bug of "arg0 == arg1" on boxed Integers.
                  return arg1.compareTo(arg0);
              }
          }), higher = new PriorityQueue<Integer>();

  public void insert(Integer n) {
      if (lower.isEmpty() && higher.isEmpty())
          lower.add(n);
      else {
          if (n <= lower.peek())
              lower.add(n);
          else
              higher.add(n);
          rebalance();
      }
  }

  void rebalance() {
      if (lower.size() < higher.size() - 1)
          lower.add(higher.remove());
      else if (higher.size() < lower.size() - 1)
          higher.add(lower.remove());
  }

  public Integer getMedian() {
      if (lower.isEmpty() && higher.isEmpty())
          return null;
      else if (lower.size() == higher.size())
          return (lower.peek() + higher.peek()) / 2;
      else
          return (lower.size() < higher.size()) ? higher.peek() : lower
                  .peek();
  }

  public void remove(Integer n) {
      if (lower.remove(n) || higher.remove(n))
          rebalance();
  }
}
Ross Judson
  • c++ has order statistic trees from gnu in an extension to the standard library. See my post below. – Leo Goodstadt Jun 27 '12 at 14:33
  • I think your code is not put here correctly. There are some incomplete parts there like: `}), higher = new PriorityQueue();` or `new PriorityQueue(10,`. I couldn't run the code. – Hengameh Jul 14 '15 at 10:03
  • @Hengameh Java ends statements with semicolons -- line breaks don't matter at all. You must have copied it incorrectly. – Matthew Read Feb 21 '16 at 06:20
  • 1
    You should ask a new question, and then supply your Java answer in that question. – jww Feb 19 '17 at 05:05
-1

Here is the Java implementation:

package MedianOfIntegerStream;

import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;


public class MedianOfIntegerStream {

    public Set<Integer> rightMinSet;
    public Set<Integer> leftMaxSet;
    public int numOfElements;

    public MedianOfIntegerStream() {
        rightMinSet = new TreeSet<Integer>();
        leftMaxSet = new TreeSet<Integer>(new DescendingComparator());
        numOfElements = 0;
    }

    public void addNumberToStream(Integer num) {
        leftMaxSet.add(num);

        Iterator<Integer> iterMax = leftMaxSet.iterator();
        Iterator<Integer> iterMin = rightMinSet.iterator();
        int maxEl = iterMax.next();
        int minEl = 0;
        if (iterMin.hasNext()) {
            minEl = iterMin.next();
        }

        if (numOfElements % 2 == 0) {
            if (numOfElements == 0) {
                numOfElements++;
                return;
            } else if (maxEl > minEl) {
                iterMax.remove();

                if (minEl != 0) {
                    iterMin.remove();
                }
                leftMaxSet.add(minEl);
                rightMinSet.add(maxEl);
            }
        } else {

            if (maxEl != 0) {
                iterMax.remove();
            }

            rightMinSet.add(maxEl);
        }
        numOfElements++;
    }

    public Double getMedian() {
        if (numOfElements % 2 != 0)
            return new Double(leftMaxSet.iterator().next());
        else
            return (leftMaxSet.iterator().next() + rightMinSet.iterator().next()) / 2.0;
    }

    private class DescendingComparator implements Comparator<Integer> {
        @Override
        public int compare(Integer o1, Integer o2) {
            return o2 - o1;
        }
    }

    public static void main(String[] args) {
        MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();

        streamMedian.addNumberToStream(1);
        System.out.println(streamMedian.getMedian()); // should be 1

        streamMedian.addNumberToStream(5);
        streamMedian.addNumberToStream(10);
        streamMedian.addNumberToStream(12);
        streamMedian.addNumberToStream(2);
        System.out.println(streamMedian.getMedian()); // should be 5

        streamMedian.addNumberToStream(3);
        streamMedian.addNumberToStream(8);
        streamMedian.addNumberToStream(9);
        System.out.println(streamMedian.getMedian()); // should be 6.5
    }
}
M Sach
  • 1
    You should ask a new question, and then supply your Java answer in that question. – jww Feb 19 '17 at 05:05
-1

Based on @mathog's thoughts, this is a C# implementation of a running median over an array of bytes with a known range of values. It can be extended to other integer types.

  using System.Runtime.CompilerServices;   // for [MethodImpl(MethodImplOptions.AggressiveInlining)]

  /// <summary>
  /// Median estimation by histogram, avoids multiple sorting operations for a running median
  /// </summary>
  public class MedianEstimator
  {
    private readonly int m_size2;
    private readonly byte[] m_counts;

    /// <summary>
    /// Estimated median, available right after calling <see cref="Init"/> or <see cref="Update"/>.
    /// </summary>
    public byte Median { get; private set; }

    /// <summary>
    /// Ctor
    /// </summary>
    /// <param name="size">Median size in samples</param>
    /// <param name="maxValue">Maximum expected value in input data</param>
    public MedianEstimator(
      int size,
      byte maxValue)
    {
      m_size2 = size / 2;
      m_counts = new byte[maxValue + 1];
    }

    /// <summary>
    /// Initializes the internal histogram with the passed sample values
    /// </summary>
    /// <param name="values">Array of values, usually the start of the array for a running median</param>
    public void Init(byte[] values)
    {
      for (var i = 0; i < values.Length; i++)
        m_counts[values[i]]++;

      UpdateMedian();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void UpdateMedian()
    {
      // The median is the first value up to which counts add to size / 2
      var sum = 0;
      Median = 0;
      for (var i = 0; i < m_counts.Length; i++)
      {
        sum += m_counts[i];
        Median = (byte) i;
        if (sum > m_size2) break;
      }
    }

    /// <summary>
    /// Updates the median estimation by removing <paramref name="last"/> and adding <paramref name="next"/>. These
    /// values must be updated as the running median is applied. If the median length is <i>N</i>, at the sample
    /// <i>i</i>, <paramref name="last"/> is sample at index <i>i</i>-<i>N</i>/2 and <paramref name="next"/> is sample
    /// at index <i>i</i>+<i>N</i>/2+1.
    /// </summary>
    /// <param name="last">Sample at the start of the moving window that is to be removed</param>
    /// <param name="next">Sample at the end of the moving window + 1 that is to be added</param>
    public void Update(byte last, byte next)
    {
      m_counts[last]--;
      m_counts[next]++;

      // The conditions below do not change median value so there is no need to update it
      if (last == next ||
          last < Median && next < Median || // both below median
          last > Median && next > Median) // both above median
        return;

      UpdateMedian();
    }
  }

Testing against a running median, with timing:

private void TestMedianEstimator()
{
  var r = new Random();
  const int SIZE = 15;
  const byte MAX_VAL = 80;
  var values = new byte[100000];
  for (var i = 0; i < values.Length; i++)
    values[i] = (byte) (MAX_VAL * r.NextDouble());

  var timer = Stopwatch.StartNew();
  // Running median
  var window = new byte[2 * SIZE + 1];
  var medians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    for (int j = i - SIZE, k = 0; j <= i + SIZE; j++, k++)
      window[k] = values[j];
    Array.Sort(window);
    medians[i] = window[SIZE];
  }
  timer.Stop();
  var elapsed1 = timer.Elapsed;

  timer.Restart();
  var me = new MedianEstimator(2 * SIZE + 1, MAX_VAL);
  me.Init(values.Slice(0, 2 * SIZE + 1));
  var meMedians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    meMedians[i] = me.Median;
    me.Update(values[i - SIZE], values[i + SIZE + 1]);
  }
  timer.Stop();
  var elapsed2 = timer.Elapsed;

  WriteLineToLog($"{elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds:0.00}");

  var diff = 0;
  for (var i = 0; i < meMedians.Length; i++)
    diff += Math.Abs(meMedians[i] - medians[i]);
  WriteLineToLog($"Diff: {diff}");
}
qnaninf
-5

If you just require a smoothed average, a quick and easy way is to multiply the latest value by x and the average value by (1 - x), then add them. This then becomes the new average.

edit: Not what the user asked for and not as statistically valid but good enough for a lot of uses.
I'll leave it in here (in spite of the downvotes) for search!
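
In code, the smoothing update described above is just (a trivial sketch; x is the smoothing weight in (0, 1]):

// Exponential smoothing: a weighted blend of the latest sample and the old average.
double update_average(double average, double latest, double x)
{
    return x * latest + (1.0 - x) * average;
}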

Martin Beckett
  • 2
    This calculates the mean. He wants the median. Also, he is calculating the median of a sliding window of values, not of the entire set. – A. Levy Aug 21 '09 at 12:51
  • 1
    This calculates a running average of a window of values with a decay constant depending on X - it's very useful where performance matters and you can't be bothered doing a kalman filter. I put it in so search can find it. – Martin Beckett Aug 21 '09 at 14:48
  • This is what I also immediately thought of, having implemented such a filter as a very basic and cheap lowpass-filter for an audio app. – James Morris Oct 26 '09 at 15:30