
Problem: Integers are read from a data stream. Find the median of the elements read so far in an efficient way.

I found a solution here

My question is: why do we need to use heaps instead of simply adding the numbers to a vector?

For example, assuming we are using a vector to store the incoming data, then we call the method to calculate the median as follows:

if vector size is even
   return (element at size/2 + element at size/2-1);
else
   return (element at size/2);

Would the above solution work?

firefly

3 Answers

2

Your solution cannot work if the elements are not in order in your vector. And if you add elements at the end of the vector, they will not be in order.

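On the other hand, a heap keeps its elements partially ordered as they are inserted, so its top element (the minimum or the maximum) is always available without re-sorting.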

Also, there is a missing division by two in the first return statement.
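
To make both corrections concrete, here is a minimal sketch of the vector approach with the sort and the division by two added. The function name `median_by_sorting` is made up for illustration, and it assumes the vector is non-empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of an unsorted vector. The vector is taken by value, so the
// caller's data is untouched; the copy is sorted on every call, O(n log n).
double median_by_sorting(std::vector<int> values) {
    assert(!values.empty());  // the median of an empty set is undefined
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    if (n % 2 == 0) {
        // Even count: average the two middle elements (note the division by two).
        return (static_cast<double>(values[n / 2 - 1]) + values[n / 2]) / 2.0;
    }
    // Odd count: the single middle element.
    return values[n / 2];
}
```

Without the `std::sort` call, the indexing only returns the median if the elements happened to arrive in order.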

ChronoTrigger
  • Thanks for the clarification! Not sure if it's right, but from my understanding, assuming you have n integers coming from the data stream: since it takes O(log n) to insert an element into a heap, the total time complexity would be O(n log n). On the other hand, we can first insert the data into a vector, which takes linear time, then call a sorting algorithm, which is also O(n log n). Therefore, I don't really see the advantage of using a more complex data structure for this problem. – firefly Oct 20 '15 at 00:30
  • The difference is that if you sort the vector every time you want to compute the median, you are doing extra work because you are not taking advantage of the already sorted elements. Consider this case: get n items from stream, compute the median, get one item more, compute the median again. With the heap you have O(nlogn) + O(logn) + O(logn) + O(logn). With the vector you have O(n) + O(nlogn) + O(1) + O(nlogn). So, memory issues apart, it depends on how often you want to compute the median. – ChronoTrigger Oct 20 '15 at 00:54
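
The cost comparison in the comments above can be illustrated with the usual two-heap formulation. This is a sketch under assumptions: the linked solution is not shown in this thread, so this may differ from it, and the class name `RunningMedian` and its members are hypothetical. A max-heap holds the smaller half of the values and a min-heap the larger half, so each insertion costs O(log n) and reading the median only looks at the heap tops.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

class RunningMedian {
public:
    // O(log n): push onto one heap, then move at most one element across.
    void add(int x) {
        if (lower_.empty() || x <= lower_.top()) {
            lower_.push(x);
        } else {
            upper_.push(x);
        }
        // Invariant: the heaps have equal size, or lower_ has one extra element.
        if (lower_.size() > upper_.size() + 1) {
            upper_.push(lower_.top());
            lower_.pop();
        } else if (upper_.size() > lower_.size()) {
            lower_.push(upper_.top());
            upper_.pop();
        }
    }

    // O(1): only the two heap tops are inspected.
    double median() const {
        assert(!lower_.empty());  // at least one value must have been added
        if (lower_.size() > upper_.size()) {
            return lower_.top();
        }
        return (static_cast<double>(lower_.top()) + upper_.top()) / 2.0;
    }

private:
    std::priority_queue<int> lower_;  // max-heap holding the smaller half
    std::priority_queue<int, std::vector<int>, std::greater<int>> upper_;  // min-heap holding the larger half
};
```

Note that both heaps together still store every value seen so far, so this saves time per query, not memory.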
1

There are at least two reasons the solution you propose isn't generally used:

  1. Generally, it is assumed that if you're processing a stream of data, that stream is huge or even infinite, so storing all the values isn't practical.
  2. As @ChronoTrigger says, you'd have to sort your vector to use it. The problem generally assumes you want to be able to ask for the median over and over as new data streams in. To do that with your solution, you'd have to sort your vector over and over, which would be slow.

Overall, maintaining an accurate median over a streaming data set is hard to do efficiently. There are a number of algorithms that can do this, but they all make trade-offs, such as lower accuracy in exchange for lower memory usage.
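
As one illustration of that kind of trade-off (my own choice of example, not an algorithm named in this answer), the sketch below keeps a fixed-size uniform random sample of the stream using reservoir sampling and reports the median of the sample. Memory is bounded by the sample capacity, but the result is only an estimate once the stream is longer than the capacity. The class name `SampledMedian`, the default seed, and the member names are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

class SampledMedian {
public:
    explicit SampledMedian(std::size_t capacity, unsigned seed = 42)
        : capacity_(capacity), rng_(seed) {
        assert(capacity > 0);
    }

    // Reservoir sampling: every value seen so far has the same chance
    // of being in the sample, and at most capacity_ values are stored.
    void add(int x) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(x);
            return;
        }
        // Pick a position among all values seen; keep x only if that
        // position falls inside the reservoir.
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        const std::size_t slot = pick(rng_);
        if (slot < capacity_) {
            sample_[slot] = x;
        }
    }

    // Median of the sample: exact while the stream fits in the reservoir,
    // an estimate afterwards.
    double median() const {
        assert(!sample_.empty());
        std::vector<int> sorted(sample_);
        std::sort(sorted.begin(), sorted.end());
        const std::size_t n = sorted.size();
        if (n % 2 == 0) {
            return (static_cast<double>(sorted[n / 2 - 1]) + sorted[n / 2]) / 2.0;
        }
        return sorted[n / 2];
    }

    std::size_t stored() const { return sample_.size(); }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;
    std::mt19937 rng_;
    std::vector<int> sample_;
};
```

A larger capacity gives a better estimate at the cost of more memory, which is exactly the accuracy-for-memory exchange described above.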

Oliver Dain
  • Thanks Oliver! I see your point of constantly sorting the vector, however, for the heap approach, don't we still need to store the whole data stream? – firefly Oct 20 '15 at 00:34
  • Yes, for the heap approach you'd still need to store the whole stream. Note that the first response on the SO post you linked talks about the memory issues with that approach. – Oliver Dain Oct 20 '15 at 00:48
0

A vector would only work if you insert each new element at its proper position (according to the sort order).

For example: stream: 8 3 4 1 10 12

Median at every step if you just keep adding the element at the end of vector:

step 1: vector: 8 median: 8
step 2: vector: 8, 3 median: (8+3)/2
step 3: vector: 8, 3, 4 median: 3 (when actually it should be 4)

Hope you get the idea
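
For comparison, a minimal sketch of the "insert at the proper position" variant might look like this. The class name `SortedVectorMedian` is made up for illustration. The position is found by binary search in O(log n), but `insert` may shift up to n elements, so each insertion is O(n) in the worst case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class SortedVectorMedian {
public:
    // Binary search for the insertion point, then insert there so that
    // data_ stays sorted after every call.
    void add(int x) {
        auto pos = std::upper_bound(data_.begin(), data_.end(), x);
        data_.insert(pos, x);
    }

    // O(1): the vector is already sorted, so plain indexing is enough.
    double median() const {
        assert(!data_.empty());  // at least one value must have been added
        const std::size_t n = data_.size();
        if (n % 2 == 0) {
            return (static_cast<double>(data_[n / 2 - 1]) + data_[n / 2]) / 2.0;
        }
        return data_[n / 2];
    }

private:
    std::vector<int> data_;  // always kept in ascending order
};
```

With the stream 8 3 4 1 10 12 from the example, the vector after step 3 is 3, 4, 8, and the median is 4 as expected.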

pgiitu