45

Two sorted arrays of length n are given and the question is to find, in O(n) time, the median of their sum array, which contains all the possible pairwise sums between every element of array A and every element of array B.

For instance: Let A = [2,4,6] and B = [1,3,5] be the two given arrays. The sum array is [2+1,2+3,2+5,4+1,4+3,4+5,6+1,6+3,6+5]. Find the median of this array in O(n).

Solving the question in O(n^2) is pretty straight-forward but is there any O(n) solution to this problem?
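For reference, the straightforward O(n^2) approach can be sketched as below. This is only a baseline under my own assumptions: the function name `brute_force_median` is hypothetical, both arrays are non-empty, and for an even number of sums the median is taken as the average of the two middle values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Brute-force baseline: build all n*n pairwise sums, then select the middle one(s).
// Assumes both inputs are non-empty. O(n^2) time and space.
double brute_force_median(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<long long> sums;
    sums.reserve(a.size() * b.size());
    for (int x : a)
        for (int y : b)
            sums.push_back(static_cast<long long>(x) + y);

    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>(sums.size() / 2);
    // Partition so that sums[mid] holds the value it would have in sorted order.
    std::nth_element(sums.begin(), sums.begin() + mid, sums.end());
    const long long upper = sums[mid];
    if (sums.size() % 2 == 1) return static_cast<double>(upper);

    // Even count: the lower middle is the largest value left of position mid.
    const long long lower = *std::max_element(sums.begin(), sums.begin() + mid);
    return (lower + upper) / 2.0;
}
```

For A = [2,4,6] and B = [1,3,5] the sorted sums are 3,5,5,7,7,7,9,9,11, so this returns 7.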

Note: This is an interview question asked to one of my friends and the interviewer was quite sure that it can be solved in O(n) time.

Aditya
  • Are arrays sorted in the beginning? – Mikhail Jun 26 '13 at 09:54
  • 1
    Well, it can be done in `O(n + n log n)` –  Jun 26 '13 at 09:55
  • 2
    Do you know if the median of the sum is the sum of the medians ? – GameAlchemist Jun 26 '13 at 09:55
  • There is a description on wikipedia http://en.wikipedia.org/wiki/Median – raam86 Jun 26 '13 at 09:57
  • possible duplicate of [Finding the median of an unsorted array](http://stackoverflow.com/questions/10662013/finding-the-median-of-an-unsorted-array) – Jon Jun 26 '13 at 09:58
  • 5
    Hey, OP states the sum of arrays more like Cartesian product, the result array contains `N*N` elements. Be aware. – Mikhail Jun 26 '13 at 10:02
  • 1
    @Mikhail The arrays are sorted to begin with. Missed this info in the question. Thanks – Aditya Jun 26 '13 at 11:17
  • @Jon There are O(n*n) elements in the Cartesian product sum thingy, in which case generating it and just finding the median of it will not be O(n). – Bernhard Barker Jun 26 '13 at 11:40
  • 18
    Ugh. It's definitely possible (Mirzaian–Arjomandi 1985), but expecting the O(n) algorithm in an interview is lunacy. – David Eisenstat Jun 26 '13 at 13:01
  • We can generate the sum array in O(n) and use quickselect to get the median of an unsorted array in O(n). Figured with help of google – dchhetri Jun 26 '13 at 22:10
  • 2
    @user814628 that's O(n^2) not O(n) – aaronman Jun 26 '13 at 22:13
  • @aaronman can you explain which part? – dchhetri Jun 26 '13 at 22:14
  • 1
    @user814628 both because the size of the new list is O(n^2) – aaronman Jun 26 '13 at 22:15
  • 10
    Here is a link to Mirzaian–Arjomandi 1985, as mentioned by David: http://www.cse.yorku.ca/~andy/pubs/X+Y.pdf – simonzack Jun 26 '13 at 22:34
  • @David Eisenstat: expecting the correct algorithm is lunacy.. but maybe they weren't looking for that. You would be shocked how many candidates can come up with wrong answers and be convinced that they work without testing (and proving) them. – Karoly Horvath Jun 26 '13 at 22:40
  • @KarolyHorvath then what are you supposed to do in an interview like that – aaronman Jun 26 '13 at 23:40
  • @H2CO3 `O(n + nlogn)` is the same as `O(nlogn)` – Khanh Nguyen Jun 27 '13 at 00:10
  • Starting with your sorted arrays you could compose the final array (at least the first n/2 + 1 elements) in such a way that it will be already sorted. – Eugen Constantin Dinca Jun 27 '13 at 00:36
  • Knowing that one corner contains the minimum value, and the opposite corner contains the maximum value, my intuition says to start with pointers in the two remaining corners -- from those positions you can either get a lower or higher value by moving horizontally or vertically as required. Walk these two pointers together until they meet? Unfortunately I haven't the time to confirm that this is on the right track. I guess I'll just read others answers, now. Handwavey "proof" centres around knowledge of state of quadrants and extra knowledge from other pointer blah blah blah... – sh1 Jun 27 '13 at 01:21
  • If I were out of work with even a nickel in the bank, that is not a company I would want to work for. They should ask a question based on an actual current problem. – paparazzo Jun 27 '13 at 01:30
  • Agreed -- it's a dumb question to ask in a job interview. Asking what thought process you'd use to attack it might be reasonable, but that requires a pretty insightful interviewer. – Hot Licks Jun 27 '13 at 01:54
  • @KarolyHorvath I agree, but it's cruel and unusual to toss out a running time without a disclaimer that it was publishable stuff thirty years ago. That goes double if, as could be the case, it's slower for non-galactic n than a simple O(n log n)-time algorithm. – David Eisenstat Jun 27 '13 at 02:11
  • @DavidEisenstat: Never mind that asking questions this hard is generally bad practice in interviews. "Are you familiar with this Stupid Theory Trick?" isn't something any programmer will ever have to care about. Unless he has a hard problem and needs to feel confident that there's no point looking in the literature. – tmyklebu Jun 27 '13 at 02:39
  • Funny part is interviewer was quite adamant on finding an O(n) solution and continued on this question for more than 45 minutes. – Aditya Jun 27 '13 at 08:46
  • While I don't necessarily credit any interviewer with insight matching the toughness of the question they ask, it's worth noting that a question where the candidate stands to answer successfully puts an upper limit on the discussion that can be had in the process. A string of easy questions to be knocked down one at a time really _would_ be a test of experience with clever tricks. And an actual coding test would mean admitting that programming work tends to be a lot more mundane than anybody wants to admit. – sh1 Jun 27 '13 at 15:10

4 Answers

14

The correct O(n) solution is quite complicated, and takes a significant amount of text, code and skill to explain and prove. More precisely, it takes 3 pages to do so convincingly, as can be seen in detail here http://www.cse.yorku.ca/~andy/pubs/X+Y.pdf (found by simonzack in the comments).

It is basically a clever divide-and-conquer algorithm that, among other things, takes advantage of the fact that in a sorted n-by-n matrix, one can find in O(n) the number of elements that are smaller/greater than a given number k. It recursively breaks down the matrix into smaller submatrices (by taking only the odd rows and columns, resulting in a submatrix that has n/2 columns and n/2 rows) which, combined with the step above, results in a complexity of O(n) + O(n/2) + O(n/4)... = O(2*n) = O(n). It is crazy!
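To spell out why that sum stays linear: assume each recursion level costs at most c times the current dimension, for some constant c (my notation, not the paper's). The total is then a geometric series:

```latex
T(n) \le cn + c\,\frac{n}{2} + c\,\frac{n}{4} + \dots
     = cn \sum_{i=0}^{\lfloor \log_2 n \rfloor} 2^{-i}
     < cn \sum_{i=0}^{\infty} 2^{-i}
     = 2cn = O(n)
```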

I can't explain it better than the paper, which is why I'll explain a simpler, O(n logn) solution instead :).


O(n * logn) solution:

It's an interview! You can't get that O(n) solution in time. So hey, why not provide a solution that, although not optimal, shows you can do better than the other obvious O(n²) candidates?

I'll make use of the O(n) algorithm mentioned above to count the numbers that are smaller/greater than a given number k in a sorted n-by-n matrix. Keep in mind that we don't need an actual matrix! The Cartesian sum of two arrays of size n, as described by the OP, results in a sorted n-by-n matrix, which we can simulate by considering the elements of the array as follows:

a[3] = {1, 5, 9};
b[3] = {4, 6, 8};
//a + b:
{1+4, 1+6, 1+8,
 5+4, 5+6, 5+8,
 9+4, 9+6, 9+8}

Thus each row contains non-decreasing numbers, and so does each column. Now, pretend you're given a number k. We want to find in O(n) how many of the numbers in this matrix are smaller than k, and how many are greater. Clearly, if both values are less than (n²+1)/2, that means k is our median!

The algorithm is pretty simple:

int smaller_than_k(int k){
    int x = 0, j = n-1;
    for(int i = 0; i < n; ++i){
        while(j >= 0 && k <= a[i]+b[j]){
            --j;
        }
        x += j+1;
    }
    return x;
}

This basically counts how many elements fit the condition at each row. Since the rows and columns are already sorted as seen above, this will provide the correct result. And as both i and j iterate at most n times each, the algorithm is O(n) [Note that j does not get reset within the for loop]. The greater_than_k algorithm is similar.
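Here is a self-contained sketch of both counting passes. The names `count_smaller` and `count_greater` and the `std::vector` parameters are my own choices for illustration; the logic is the same single right-to-left walk of `j` described above, and both inputs are assumed to be sorted in ascending order.

```cpp
#include <cstddef>
#include <vector>

// Number of pairs (i, j) with a[i] + b[j] < k. Inputs must be sorted ascending.
long long count_smaller(const std::vector<int>& a, const std::vector<int>& b, long long k) {
    long long count = 0;
    std::ptrdiff_t j = static_cast<std::ptrdiff_t>(b.size()) - 1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // a[i] only grows with i, so j only ever moves left (never reset).
        while (j >= 0 && static_cast<long long>(a[i]) + b[j] >= k) --j;
        count += j + 1;  // b[0..j] all give sums below k in this row
    }
    return count;
}

// Number of pairs (i, j) with a[i] + b[j] > k. Inputs must be sorted ascending.
long long count_greater(const std::vector<int>& a, const std::vector<int>& b, long long k) {
    long long count = 0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(b.size());
    std::ptrdiff_t j = n - 1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        while (j >= 0 && static_cast<long long>(a[i]) + b[j] > k) --j;
        count += n - 1 - j;  // b[j+1..n-1] all give sums above k in this row
    }
    return count;
}
```

With a = {1, 5, 9} and b = {4, 6, 8} from the example above and k = 11, both functions return 4, which is less than (9+1)/2 = 5, so 11 is the median.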

Now, how do we choose k? That is the logn part. Binary Search! As has been mentioned in other answers/comments, the median must be a value contained within this array:

candidates[n] = {a[0]+b[n-1], a[1]+b[n-2],... a[n-1]+b[0]};

Simply sort this array [also O(n*logn)], and run the binary search on it. Since the array is now in non-decreasing order, it is straightforward to see that the count of numbers smaller than each candidates[i] is also non-decreasing (a monotonic function), which makes it suitable for the binary search. The largest number k = candidates[i] for which smaller_than_k(k) returns less than (n²+1)/2 is the answer, and is obtained in log(n) iterations:

int b_search(){
    // n2 is the 1-based rank of the median among the n*n sums
    int lo = 0, hi = n, mid, n2 = (n*n+1)/2;
    while(hi-lo > 1){
        mid = (hi+lo)/2;
        if(smaller_than_k(candidates[mid]) < n2)
            lo = mid;
        else
            hi = mid;
    }
    return candidates[lo]; // the median
}
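As a cross-check, here is a sketch of a related variant that does not rely on the candidates array: it binary-searches over the range of possible sum values instead, using the same O(n) counting walk. To be clear, this is a swapped-in technique and not the search above. It assumes integer inputs and non-empty sorted arrays, runs in O(n log(range)) rather than O(n log n), and all names (`count_at_most`, `kth_smallest_sum`, `median_of_sums`) are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Pairs (i, j) with a[i] + b[j] <= v; a and b sorted ascending. O(n) walk.
static long long count_at_most(const std::vector<int>& a, const std::vector<int>& b, long long v) {
    long long count = 0;
    std::ptrdiff_t j = static_cast<std::ptrdiff_t>(b.size()) - 1;
    for (int x : a) {
        while (j >= 0 && static_cast<long long>(x) + b[j] > v) --j;
        count += j + 1;
    }
    return count;
}

// k-th smallest pairwise sum (1-indexed): the smallest v with count_at_most(v) >= k.
long long kth_smallest_sum(const std::vector<int>& a, const std::vector<int>& b, long long k) {
    long long lo = static_cast<long long>(a.front()) + b.front();  // smallest sum
    long long hi = static_cast<long long>(a.back()) + b.back();    // largest sum
    while (lo < hi) {
        const long long mid = lo + (hi - lo) / 2;
        if (count_at_most(a, b, mid) >= k) hi = mid;
        else lo = mid + 1;
    }
    return lo;
}

// Median of all pairwise sums; averages the two middle values when the count is even.
double median_of_sums(const std::vector<int>& a, const std::vector<int>& b) {
    const long long total = static_cast<long long>(a.size()) * static_cast<long long>(b.size());
    const long long lower = kth_smallest_sum(a, b, (total + 1) / 2);
    if (total % 2 == 1) return static_cast<double>(lower);
    const long long upper = kth_smallest_sum(a, b, total / 2 + 1);
    return (lower + upper) / 2.0;
}
```

The value returned by `kth_smallest_sum` is always an actual pairwise sum, because the count can only jump at values that occur in the matrix.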
i Code 4 Food
  • 1
    "And as both i and j iterate at most n times each, the algorithm is O(n)" => Shouldn't it be O(n^2)? – Khanh Nguyen Jun 27 '13 at 02:18
  • @KhanhNguyen `j` does not depend on `i`. It starts at `n-1` and gets subtracted at most `n` times in total (it does not get reset to `n-1`). So there are at most `2*n` iterations combined. – i Code 4 Food Jun 27 '13 at 02:21
  • 1
    But there's another problem: if I am right, after getting the candidates sorted, you run `smaller_than_k(k)` on *each* candidate, until you find the one. Wouldn't that make it `O(n^2)` in the worst case? – Khanh Nguyen Jun 27 '13 at 02:53
  • @KhanhNguyen I do not run on *each* candidate, that is exactly where Binary Search kicks in. After sorting the array, I need only run `smaller_than_k()` on `log(n)` elements of it to obtain the one which represents the median, as described in the last step. – i Code 4 Food Jun 27 '13 at 03:02
  • 1
    Could you explain in detail why the answer is amongst `candidates`? Other answers gives only an idea, but I cannot come out with a thorough proof. – Mikhail Jun 27 '13 at 04:52
  • 2
    The median doesn't necessarily lie on the diagonal of the matrix (the given `candidates` matrix), as @Mikhail wonders. Consider `[1,2,3,4]` and `[10,20,30,40]`. `candidates` is `[14,23,32,41]` but the median is the average of 24 and 31. – xan Jul 02 '13 at 20:10
  • can you elaborate on candidate ? I don't think it is correct . – Aseem Goyal Feb 23 '14 at 07:32
  • if the number of elements in the array is even, then of course the median is never in any of the sums - but that can be done with an extra check before returning `candidate[lo]`, as basically that means both `[lo]` and `[lo+1]` will fail the greater/smaller checks respectively, and in such case, return `(candidate[lo]+candidate[lo+1])/2`. You don't even need to do the checks as you know this beforehand when `n` is even. – i Code 4 Food Feb 25 '14 at 07:41
1

Let's say the arrays are A = {A[1] ... A[n]}, and B = {B[1] ... B[n]}, and the pairwise sum array is C = {A[i] + B[j], where 1 <= i <= n, 1 <= j <= n} which has n^2 elements and we need to find its median.

The median of C must be an element of the array D = {A[1] + B[n], A[2] + B[n - 1], ... A[n] + B[1]}: if you fix A[i] and consider all the sums A[i] + B[j], you would see that only A[i] + B[n + 1 - i] (which is an element of D) could be the median. That is, it may not be the median, but if it is not, then none of the other A[i] + B[j] is the median either.

This can be proved by considering all B[j] and counting the number of values that are lower and the number of values that are greater than A[i] + B[j] (we can do this quite accurately because the two arrays are sorted -- the calculation is a bit messy though). You'd see that for A[i] + B[n + 1 - i] these two counts are most "balanced".

The problem then reduces to finding median of D, which has only n elements. An algorithm such as Hoare's will work.

UPDATE: this answer is wrong. The real conclusion here is that the median is one of D's elements, but D's median is not the same as C's median.
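A small check makes the gap concrete. The sketch below compares the median of D with the median of C by brute force, using the counterexample [0 1 1 1 2] and [0 0 0 1 2] given in the comments on this answer. The function names are hypothetical, and it assumes equal-length arrays with odd n so that both medians are single elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Middle element of a list with an odd number of entries (sorts a local copy).
static int middle_of(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

// Median of D = {a[i] + b[n-1-i]}, the anti-diagonal of the sum matrix (0-indexed).
int diagonal_median(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> d;
    for (std::size_t i = 0; i < a.size(); ++i)
        d.push_back(a[i] + b[b.size() - 1 - i]);
    return middle_of(d);
}

// Median of C = all n*n pairwise sums, computed by brute force.
int full_median(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c;
    for (int x : a)
        for (int y : b)
            c.push_back(x + y);
    return middle_of(c);
}
```

For those two arrays D is [2 2 1 1 2] with median 2, while C holds three 0s and ten 1s among its 25 sums, so its 13th smallest element, the median, is 1.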

Khanh Nguyen
  • this is what aaronman said, isn't it? i thought there was a counter-example? – andrew cooke Jun 27 '13 at 00:58
  • 3
    if you can't read deleted posts, consider [0 1 1 1 2] and [0 0 0 1 2]. if i've understood you correctly, your "diagonal" is [2 2 1 1 2] and the median of that is 2. but the correct result is 1. – andrew cooke Jun 27 '13 at 01:00
  • Someone found the solution in the paper, but it would be nice if it could be delivered in code in C++ or Java, or at least explained in less mathematical terms than in the paper – aaronman Jun 27 '13 at 01:01
  • @andrewcooke the two answers not deleted right now are oversimplifying the problem; the solution is known and it is a pretty complex algo if you read the paper – aaronman Jun 27 '13 at 01:22
  • @aaronman when i skimmed the paper it seemed much more general than the question here. i haven't worked out whether the median is a simpler special case and i didn't see it mentioned in the paper (but i really did only skim it). what i am posting here helps other people understand that they are wrong. a counter example is much more convincing than "it's hard and you're dumb". – andrew cooke Jun 27 '13 at 01:27
  • @andrewcooke The paper is a solution for finding the nth largest element out of an ordered matrix or one that would be constructed from summing the cross product of two vectors X+Y. Median is just the nth largest element where n = size/2. So, the paper is a solution for this problem. – Patashu Jun 27 '13 at 01:30
  • @Patashu i am not saying the paper doesn't solve the problem. when i said "i haven't worked out whether the median is a simpler special case" what i meant is, perhaps median is a special case that has a simpler solution than the general approach given in the paper. i don't know, and i didn't give it much thought. but i am not questioning that the paper has a solution. – andrew cooke Jun 27 '13 at 01:37
  • 1
    @aaronman You (or I) *don't* have to delete your answer when it is wrong. There are no rules on SO saying that you can't post a wrong answer, as long as you invest enough time and effort into it. Just downvote it, leave a note for later viewers. All we are trying to do is contribute a good answer. My answer was wrong, but it is an idea. By leaving it here, future viewers won't make the same mistake (and hopefully derive an answer by improving it). And, if you hadn't deleted your post, I wouldn't have wasted my time on trying the same idea! – Khanh Nguyen Jun 27 '13 at 01:37
  • @andrewcooke a counter example is good, but it's even better if you can point out which step is wrong. No offense though, good job with the counter example :) – Khanh Nguyen Jun 27 '13 at 01:37
  • @KhanhNguyen - after some karma level you can see deleted posts. it's easy to forget that not everyone can (sorry). the wrong bit is in the counting, i would guess. the bit you didn't explain... – andrew cooke Jun 27 '13 at 01:38
  • I don't think I asked you to delete it at any point, but I deleted mine because I felt that there was no reason to keep it and it was on the wrong track – aaronman Jun 27 '13 at 01:40
  • @andrewcooke Maybe, but I saw a(nother) hole. I added an update, actually I should have seen it from the beginning, my answer never gave a fractional answer (ie average of two values), which can happen when both arrays have an even number of elements. – Khanh Nguyen Jun 27 '13 at 01:52
  • 1
    If you know the answer is wrong, you should probably delete it. – David Heffernan Jun 27 '13 at 06:06
  • @DavidHeffernan the only reason he isn't is because it has 2 up-votes and only 2 down-votes(net gain) – aaronman Jun 28 '13 at 00:51
0

Doesn't this work?:

You can compute the rank of a number in linear time as long as A and B are sorted. The technique you use for computing the rank can also be used to find all things in A+B that are between some lower bound and some upper bound in time linear in the size of the output plus |A|+|B|.

Randomly sample n things from A+B. Take the median, say foo. Compute the rank of foo. With constant probability, foo's rank is within n of the median's rank. Keep doing this (an expected constant number of times) until you have lower and upper bounds on the median that are within 2n of each other. (This whole process takes expected linear time, but it's obviously slow.)

All you have to do now is enumerate everything between the bounds and do a linear-time selection on a linear-sized list.
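The enumeration step can be sketched as follows, assuming sorted integer inputs. The function name `sums_in_range` and the half-open convention (lo, hi] are my own choices for illustration. Two boundary indices walk leftward through B as the row index advances, so the cost is |A| + |B| plus the number of sums emitted.

```cpp
#include <cstddef>
#include <vector>

// All pairwise sums s with lo < s <= hi. a and b must be sorted ascending.
std::vector<long long> sums_in_range(const std::vector<int>& a, const std::vector<int>& b,
                                     long long lo, long long hi) {
    std::vector<long long> out;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(b.size());
    std::ptrdiff_t first = n;  // number of j with x + b[j] <= lo for the current row
    std::ptrdiff_t last = n;   // number of j with x + b[j] <= hi for the current row
    for (int x : a) {
        // Rows only get larger, so both boundaries only ever move left.
        while (first > 0 && static_cast<long long>(x) + b[first - 1] > lo) --first;
        while (last > 0 && static_cast<long long>(x) + b[last - 1] > hi) --last;
        for (std::ptrdiff_t j = first; j < last; ++j)
            out.push_back(static_cast<long long>(x) + b[j]);
    }
    return out;
}
```

The resulting list is unsorted in general, which is fine because a linear-time selection does not need sorted input.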

(Unrelatedly, I wouldn't excuse the interviewer for asking such an obviously crappy interview question. Stuff like this in no way indicates your ability to code.)

EDIT: You can compute the rank of a number x by doing something like this:

Set rank = 0, i = 0, j = |B| - 1.
While i < |A| {
  While j >= 0 and A[i] + B[j] > x, j--.
  If j < 0, break.
  rank += j+1.
  i++.
}

FURTHER EDIT: Actually, the above trick only narrows down the candidate space to about n log(n) members of A+B. Then you have a general selection problem within a universe of size n log(n); you can do basically the same trick one more time and find a range of size proportional to sqrt(n) log(n) where you do selection.

Here's why: If you sample k things from an n-set and take the median, then the sample median's order is between the (1/2 - sqrt(log(n) / k))th and the (1/2 + sqrt(log(n) / k))th elements with at least constant probability. When n = |A+B|, we'll want to take k = sqrt(n) and we get a range of about sqrt(n log n) elements --- that's about |A| log |A|. But then you do it again and you get a range on the order of sqrt(n) polylog(n).

tmyklebu
  • So computing the rank takes more than linear time (nested loops), which means the solution is not linear – aaronman Jun 27 '13 at 01:33
  • Anything that says "randomly" usually has worst case complexity infinity. – aschepler Jun 27 '13 at 01:43
  • No, the rank computation is obviously linear. And this is called a "Las Vegas" algorithm; it always returns the correct answer and its expected runtime is nice. – tmyklebu Jun 27 '13 at 02:26
  • `All you have to do now is enumerate everything between the bounds and do a linear-time selection on a linear-sized list.` How exactly do you plan on computing this list? Keep in mind the numbers do not need to be small, your list of 2n numbers could have a lower bound of 10^7 and higher bound of 10^9 and you need to figure out what are those 2n numbers in it. Other than that, your solution is kind of similar to mine, except I use a binary search instead of a random algorithm. – i Code 4 Food Jun 27 '13 at 04:15
  • @Arthur: You compute that list just like you compute the ranks. Find lower and upper bounds on `j` for each `i` so that everything within the range lies between the bounds. Then you can enumerate those few elements of `A+B` that matter. Random sampling tricks like this are usually the key to defeating binary search. (As a bonus, it often runs faster in practice. I wasn't convinced of its practical use either until I saw someone actually use a trick like this.) – tmyklebu Jun 27 '13 at 05:11
0

You should use a selection algorithm to find the median of an unsorted list in O(n). Look at this: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

Mattia Larentis