Maintain the median of the stream of numbers

There is a stream of numbers, that has unknown size. We should process the stream in such a way, that at every moment we should be able to return the median of currently processed numbers.

For understanding what median is, please refer to the following wiki article. The simplest solution would be to add incoming numbers to a data structure, like c++ vector, and sort it every time, we are asked for the median. It wouldn’t be that bad, as with quicksort we could get O(N log N) complexity, where N is the size of currently read data.

However there is a better solution, that provides the median instantly, but requires some extra step while processing incoming numbers. Let’s imagine that we maintain the split into two parts in such a way that the difference between sizes of splits is not greater than 2. The left part contains smaller numbers, and the right larger. In the other words, every element from the left part is smaller than any element from the right part. How would we calculate the median now? If the left and right parts have equal size, we can just take the maximum element from the left MAXL part and minimum from the right MINR. Then the median = (MAXL + MINR)/2. If the array size differs by one, we take MINL if the left part is bigger, or MINR otherwise.

But then the important question appears: how to maintain the same size of both parts?. It’s quite simple, if the left part is bigger than the right one, by more than one element, we take the MAXL and move it to the right part. It becomes the new MINR. For the right part we proceed in the same way. I mean we move MINR to the left, that becomes the new MAXL. But how do we reculcate the new MINL, when we moved it to the right?

Here is where the data structures comes in. Let’s use a max heap to keep the left part and min heap to keep the right part. Then we get exactly what was described in the previous paragraph, without to much implementation and all the requirements are met. An example c++ code could look like:

priority_queue<int, vector<int>, less<int>> left; // left heap
priority_queue<int, vector<int>, greater<int>> right; // right heap

while(true){
 int number;
 cin >> number; // process the stream

 // push to the proper heap
 if(right.empty() || number >= right.top())
   right.push(number);
  else left.push(number);

 // remain the size difference
 if((int)left.size() - (int)right.size() > 1)
  right.push(left.top()); left.pop();
 else if((int)right.size() - (int)left.size() > 1){
  left.push(right.top()); right.pop();
 
 //  get the median
 int median = 0;
 if(left.size() == right.size())
  median = (left.top() + right.top())/2;
 else if (left.size() > right.size())
  median = left.top();
 else median = right.top();

 // do with the median... something
 {...}
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s