4

The background

According to Wikipedia and other sources I've found, building a binary heap of n elements by starting with an empty binary heap and inserting the n elements into it is O(n log n), since binary heap insertion is O(log n) and you're doing it n times. Let's call this the insertion algorithm.

It also presents an alternative approach in which you sink/trickle down/percolate down/cascade down/heapify down/bubble down the first/top half of the elements, starting with the middle element and ending with the first element, and it states that this is O(n), a much better complexity. The proof of this complexity rests on the insight that the sink cost for each element depends on its height in the binary heap: if it's near the bottom, it will be small, maybe zero; if it's near the top, it can be large, maybe log n. The point is that the cost isn't log n for every element sunk in this process, so the overall complexity is much less than O(n log n), and is in fact O(n). Let's call this the sink algorithm.
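For concreteness, here is a minimal Python sketch of the sink algorithm as described above (the names `sink` and `build_heap_by_sinking` are mine, not taken from Wikipedia):

```python
def sink(a, i, n):
    """Sink a[i] down until neither child is larger (max-heap order)."""
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest

def build_heap_by_sinking(a):
    """Build a max-heap in place by sinking every non-leaf, last to first: O(n)."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):
        sink(a, i, n)
```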

The question

Why isn't the complexity for the insertion algorithm the same as that of the sink algorithm, for the same reasons?

Consider the actual work done for the first few elements in the insertion algorithm. The cost of the first insertion isn't log n, it's zero, because the binary heap is empty! The cost of the second insertion is at worst one swap, and the cost of the fourth is at worst two swaps, and so on. The actual complexity of inserting an element depends on the current depth of the binary heap, so the complexity for most insertions is less than O(log n). The insertion cost doesn't even technically reach O(log n) until after all n elements have been inserted [it's O(log (n - 1)) for the last element]!
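For comparison, here is the corresponding sketch of the insertion algorithm (again, the names are my own). Each insertion appends at the next free leaf position and bubbles up, so its cost is bounded by the heap's current height, which is the observation above:

```python
def bubble_up(heap, i):
    """Swap heap[i] with its parent until the parent is at least as large."""
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] >= heap[i]:
            return
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent

def build_heap_by_insertion(items):
    """Build a max-heap by inserting one element at a time: O(n log n) worst case."""
    heap = []
    for x in items:
        heap.append(x)
        bubble_up(heap, len(heap) - 1)
    return heap
```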

These savings sound just like the savings gotten by the sink algorithm, so why aren't they counted the same for both algorithms?

templatetypedef
indil
  • Your logic sounds quite reasonable to me. – Jerry Coffin Jan 20 '13 at 05:46
  • @indil, since the big-O notation is used for _worst-case_ asymptotic time complexity, you have to consider the most expensive insertion scenario, which would be building a max-heap from a list of n elements given in ascending order. There will be n/2 of these added in the "leaf" positions, and each of these will be "bubbled up" the entire height of the heap, i.e. O(log n). So in the worst case, this gives O(n log n). – 808sound Jan 20 '13 at 06:05
  • It is as @808sound wrote: the Wikipedia page implicitly assumes a worst-case analysis. – Michael Foukarakis Jan 20 '13 at 06:37
  • I don't follow the Wikipedia article's math for the complexity. I come up with Sum[d=0 to floor(log2 n)] of 2^d * (floor(log2 n) - d) for the worst-case number of swaps in a perfect binary tree, where d is the zero-based depth of a level, and I don't see how to reduce that to O(n). But I'm not the greatest at this kind of math. If I could understand that, I'd understand how to get O(n) on paper. However, I also wrote code for the above summation and observed the O(n) complexity myself in numbers (see the sketch after these comments), so there it is. – indil Feb 03 '13 at 23:52
  • Thanks to all for their comments and answers! – indil Feb 03 '13 at 23:57
  • Similar: [build-heap-complexity](http://stackoverflow.com/questions/9755721/build-heap-complexity) – nawfal Jun 05 '14 at 15:04
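Regarding the summation in the comment above, here is a small sketch (mine, not indil's original code) that evaluates Sum[d=0 to L] 2^d * (L - d) with L = floor(log2 n) and compares it to its closed form 2^(L+1) - (L + 2). Since 2^L <= n, the sum stays below 2n, which is the O(n) bound in question:

```python
from math import floor, log2

def worst_case_sink_swaps(n):
    """Evaluate Sum[d = 0..L] 2^d * (L - d) for L = floor(log2 n) directly."""
    L = floor(log2(n))
    return sum(2**d * (L - d) for d in range(L + 1))

for n in (15, 1023, 1_048_575):                # perfect trees: n = 2^(L+1) - 1
    L = floor(log2(n))
    # columns: n, direct sum, closed form 2^(L+1) - (L + 2), bound 2n
    print(n, worst_case_sink_swaps(n), 2**(L + 1) - (L + 2), 2 * n)
```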

4 Answers

5

Actually, when n = 2^x - 1 (the lowest level is full), the n/2 elements inserted last, into the leaf positions, may each require log(n) swaps to bubble up in the insertion algorithm. So you need (n/2) * log(n) swaps for those insertions alone, which already makes the build O(n log n).

In the other algorithm, only one element needs log(n) swaps, 2 need log(n) - 1 swaps, 4 need log(n) - 2 swaps, and so on. Wikipedia shows a proof that the resulting series sums to a constant times n rather than n times a logarithm, so the total is O(n).
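Not part of the original answer: a rough instrumented sketch, assuming max-heaps and the two builds sketched in the question, that counts swaps for both methods on an ascending sequence, the worst case for insertion. The sink build's swap count stays linear in n, while the insertion build's grows like n log n:

```python
def sink_build_swaps(items):
    """Swaps performed when building a max-heap bottom-up (the sink algorithm)."""
    a, n, swaps = list(items), len(items), 0
    for start in range(n // 2 - 1, -1, -1):
        i = start
        while True:
            left, right, largest = 2 * i + 1, 2 * i + 2, i
            if left < n and a[left] > a[largest]:
                largest = left
            if right < n and a[right] > a[largest]:
                largest = right
            if largest == i:
                break
            a[i], a[largest] = a[largest], a[i]
            swaps += 1
            i = largest
    return swaps

def insert_build_swaps(items):
    """Swaps performed when building a max-heap by repeated insertion."""
    heap, swaps = [], 0
    for x in items:
        heap.append(x)
        i = len(heap) - 1
        while i > 0 and heap[(i - 1) // 2] < heap[i]:
            heap[(i - 1) // 2], heap[i] = heap[i], heap[(i - 1) // 2]
            swaps += 1
            i = (i - 1) // 2
    return swaps

for k in (10, 14, 18):
    n = 2**k - 1                        # full lowest level, as in the answer
    data = list(range(n))               # ascending order: worst case for insertion
    print(n, sink_build_swaps(data), insert_build_swaps(data))
```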

Maciej Stachowski
1

While it's true that log(n-1) is less than log(n), it's not smaller by enough to make a difference.

Mathematically: the worst-case cost of inserting the i-th element is ceil(log i). Therefore the worst-case cost of inserting elements 1 through n is sum(i = 1..n, ceil(log i)) >= sum(i = 1..n, log i) = log 1 + log 2 + ... + log n = log(1 × 2 × ... × n) = log n!, which is Θ(n log n).
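A quick numeric check of this bound (my sketch, not the answer's): the ratio of the worst-case insertion cost to n * log2(n) climbs toward 1, confirming Θ(n log n) growth.

```python
from math import ceil, log2

# Worst-case cost of the insertion build: sum of ceil(log2 i) over all insertions.
for n in (1_000, 100_000, 1_000_000):
    cost = sum(ceil(log2(i)) for i in range(2, n + 1))    # i = 1 costs 0 swaps
    print(n, cost, round(cost / (n * log2(n)), 3))         # ratio climbs toward 1
```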

Raymond Chen
1

The intuition is that the sink algorithm moves only a few things (those in the small layers at the top of the heap/tree) distance log(n), while the insertion algorithm moves many things (those in the big layers at the bottom of the heap) distance log(n).

The intuition for why the sink algorithm can get away with this is that the insertion algorithm also meets an additional (nice) requirement: if we stop the insertion at any point, the partially formed heap has to be (and is) a valid heap. For the sink algorithm, all we get is a weird, malformed bottom portion of a heap. Sort of like a pine tree with the top cut off.

Also, summations and blah blah. It's best to think asymptotically about what happens when inserting, say, the last half of the elements of an arbitrarily large set of size n.

Andrew W.
0

Ran into the same problem yesterday. I tried coming up with some form of proof to satisfy myself. Does this make any sense?

If you start inserting from the bottom, the leaves will have constant-time insertion: you just copy them into the array.

The worst-case running time for a level above the leaves is:

k * (n/2^h) * h

where h is the height of the level (leaves being 0, the top being log(n)) and k is a constant (just for good measure). Here n/2^h is the number of nodes at that level and h is the MAXIMUM number of 'sinking' operations per insert at that level.

There are log(n) levels, hence the total running time will be

Sum for h from 1 to log(n) of: k * n * (h/2^h)

which is k * n * SUM h=[1, log(n)]: (h/2^h)

The sum is a simple arithmetico-geometric progression that is bounded by 2. So you get a running time of at most k * n * 2, which is O(n).

The running time per level isn't strictly what I said it was, but it is strictly less than that. Any pitfalls?
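A tiny check of that series (my sketch, using the answer's own approximation of n/2^h nodes per level and k = 1): the per-level total stays below 2n.

```python
from math import floor, log2

def level_sum(n):
    """Sum over levels of (n / 2^h) * h, the bound used in the answer (k = 1)."""
    return sum((n / 2**h) * h for h in range(1, floor(log2(n)) + 1))

for n in (2**10, 2**15, 2**20):
    print(n, round(level_sum(n)), 2 * n)    # stays below 2 * n, hence O(n)
```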

2bigpigs