Start by implementing a simple binary tree (that is, without balancing), along with the corresponding program to count words in a file. Get that working, so you'll have something to test with. Don't worry about balancing just yet; that's the really hard part.
Here's an insertion function (untested) for a simple binary tree:
/*
 * Insert a new key into a binary tree, or find an existing one.
 *
 * Return the new or existing node whose key is equal to the one requested.
 * Update *node with the new (sub)tree root.
 */
Node *insert(Node **node, String key)
{
    if (*node == NULL) {
        *node = new Node(key);
        return *node;
    }
    if (key < (*node)->key)
        return insert(&(*node)->left, key);
    if (key > (*node)->key)
        return insert(&(*node)->right, key);
    return *node;
}
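To show how the insertion function fits into the word-counting program, here's a minimal sketch. The Node layout, the count member, and the countWords helper are my own assumptions (nothing above defines them), and std::string stands in for String. Treat it as a starting point rather than a finished program.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // std::istringstream, handy for testing without a file
#include <string>

// Assumed node layout: insert() only needs key, left and right.
// The count member is an addition for the word-counting program.
struct Node {
    std::string key;
    int count = 0;
    Node *left = nullptr;
    Node *right = nullptr;
    explicit Node(const std::string &k) : key(k) {}
};

// Same logic as the insertion function above, with std::string keys.
Node *insert(Node **node, const std::string &key)
{
    if (*node == nullptr) {
        *node = new Node(key);
        return *node;
    }
    if (key < (*node)->key)
        return insert(&(*node)->left, key);
    if (key > (*node)->key)
        return insert(&(*node)->right, key);
    return *node;
}

// Read whitespace-separated words and bump the counter of each word's node.
void countWords(std::istream &in, Node **root)
{
    assert(root != nullptr);
    std::string word;
    while (in >> word)
        insert(root, word)->count++;
}
```

You can pass a std::ifstream for a real file, or a std::istringstream for a quick test. The sketch never frees its nodes and does no punctuation or case handling; add those once the basics work.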
Once you have a simple binary tree working and tested, re-implement the insertion function to perform balancing, which is the heart of the AVL algorithm.
Start by knowing the invariants of an AVL tree:
- The balance factor of every node (the height of its right subtree minus the height of its left subtree) is -1, 0, or +1.
- An in-order traversal visits the keys in sorted (dictionary) order.
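Both invariants can be checked mechanically, which is handy in tests after every insertion. Here's a minimal sketch of such a checker. The names isAvl and checkSubtree and the Node layout are my own assumptions, and the sentinel value -2 is just a convenient way to report "invalid" alongside the height.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>

// Assumed node layout (hypothetical, for illustration only).
struct Node {
    std::string key;
    Node *left = nullptr;
    Node *right = nullptr;
    explicit Node(const std::string &k) : key(k) {}
};

// Return the height of the subtree (-1 for an empty tree), or -2 if the
// subtree breaks either invariant. lo and hi are exclusive key bounds;
// nullptr means "no bound on this side".
int checkSubtree(const Node *node, const std::string *lo, const std::string *hi)
{
    if (node == nullptr)
        return -1;
    if ((lo && !(*lo < node->key)) || (hi && !(node->key < *hi)))
        return -2;                        // ordering invariant broken
    int l = checkSubtree(node->left, lo, &node->key);
    int r = checkSubtree(node->right, &node->key, hi);
    if (l == -2 || r == -2 || std::abs(r - l) > 1)
        return -2;                        // balance invariant broken
    return std::max(l, r) + 1;
}

bool isAvl(const Node *root)
{
    return checkSubtree(root, nullptr, nullptr) != -2;
}
```

Calling assert(isAvl(root)) after each insertion in a test catches a broken rotation right where it happens.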
I recommend referring to the AVL tree insertion diagram on Wikipedia. It illustrates the four rotations you will need to implement and where they are needed.
A rotation is necessary when the balance factor of a node goes out of range, that is, when the heights of its left and right subtrees differ by more than 1.
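To make "rotation" concrete, here's a sketch of the two single rotations, written in the same Node ** style as the insertion function. The names rotateLeft and rotateRight and the Node layout are my own assumptions. The two double rotations can be built from these: rotate the child one way, then the node the other way.

```cpp
#include <cassert>
#include <string>

// Assumed node layout (hypothetical, for illustration only).
struct Node {
    std::string key;
    Node *left = nullptr;
    Node *right = nullptr;
    explicit Node(const std::string &k) : key(k) {}
};

// Rotate left: the right child becomes the new subtree root.
//
//     x                y
//    / \              / \
//   A   y     =>     x   C
//      / \          / \
//     B   C        A   B
void rotateLeft(Node **node)
{
    Node *x = *node;
    Node *y = x->right;
    assert(y != nullptr);   // a left rotation needs a right child
    x->right = y->left;     // subtree B moves under x
    y->left = x;
    *node = y;              // update the parent's link
}

// Rotate right: the mirror image of rotateLeft.
void rotateRight(Node **node)
{
    Node *y = *node;
    Node *x = y->left;
    assert(x != nullptr);   // a right rotation needs a left child
    y->left = x->right;     // subtree B moves under y
    x->right = y;
    *node = x;              // update the parent's link
}
```

Note that both rotations preserve the in-order sequence A, x, B, y, C, so only the balance invariant changes. If you store heights or balance factors in the nodes, they also need updating here; this sketch stores neither.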
How do you determine the balance factor of a node? Well, any of the following will work:
- Add a height member to the Node structure, and determine the balance factor of any given node by subtracting the left child's height from the right child's height.
- Add a balance member to the Node structure. This might be a little harder to figure out, but it yields simpler code (I think).
- Compute tree heights and balances by traversal. This is inefficient, so much so that it defeats the purpose of AVL. However, it's easier to understand and less bug-prone.
I recommend starting with the third approach, so you can test your balancing code sooner.
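For reference, here's roughly what the traversal-based approach could look like once everything is put together. This is a sketch under my own assumptions (the Node layout, std::string keys, and the helper names rotateLeft, rotateRight and rebalance are not from any particular library), and it recomputes heights from scratch on every check, so it is slow by design.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Assumed node layout (hypothetical, for illustration only).
struct Node {
    std::string key;
    Node *left = nullptr;
    Node *right = nullptr;
    explicit Node(const std::string &k) : key(k) {}
};

// Height by full traversal: -1 for an empty tree, 0 for a single node.
int height(const Node *node)
{
    if (node == nullptr)
        return -1;
    return std::max(height(node->left), height(node->right)) + 1;
}

// Right height minus left height.
int balanceFactor(const Node *node)
{
    assert(node != nullptr);
    return height(node->right) - height(node->left);
}

void rotateLeft(Node **node)
{
    Node *x = *node, *y = x->right;
    assert(y != nullptr);
    x->right = y->left;
    y->left = x;
    *node = y;
}

void rotateRight(Node **node)
{
    Node *y = *node, *x = y->left;
    assert(x != nullptr);
    y->left = x->right;
    x->right = y;
    *node = x;
}

// Restore the balance invariant at *node, assuming both subtrees are
// already balanced and the factor is off by at most one step (+2 or -2).
void rebalance(Node **node)
{
    int bf = balanceFactor(*node);
    if (bf > 1) {                                // right-heavy
        if (balanceFactor((*node)->right) < 0)   // right-left case
            rotateRight(&(*node)->right);
        rotateLeft(node);
    } else if (bf < -1) {                        // left-heavy
        if (balanceFactor((*node)->left) > 0)    // left-right case
            rotateLeft(&(*node)->left);
        rotateRight(node);
    }
}

// Insert or find a key, rebalancing on the way back out of the recursion.
Node *insert(Node **node, const std::string &key)
{
    if (*node == nullptr) {
        *node = new Node(key);
        return *node;
    }
    Node *found = *node;
    if (key < (*node)->key)
        found = insert(&(*node)->left, key);
    else if (key > (*node)->key)
        found = insert(&(*node)->right, key);
    rebalance(node);
    return found;
}
```

Inserting keys in sorted order is a good first test: a plain binary tree degenerates into a linked list, while this version should stay shallow. Once it behaves, switch to a stored height or balance member to get the speed back.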
To clarify what is meant by "height" and "balance factor", here are functions to compute them:
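#include <algorithm>  // std::max
#include <cassert>    // assert

/* Height of a subtree: -1 for an empty tree, 0 for a single node. */
int height(Node *node)
{
    if (node == NULL)
        return -1;
    return std::max(height(node->left), height(node->right)) + 1;
}

/* Right height minus left height; an AVL tree keeps this within [-1, +1]. */
int balanceFactor(Node *node)
{
    assert(node != NULL);
    return height(node->right) - height(node->left);
}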
Figuring out how to update heights or balance factors incrementally is going to involve paper, pencil, simple algebra, and common sense.
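As a nudge in that direction, here's a sketch of the stored-height approach. The height member, the helpers storedHeight and updateHeight, and the Node layout are my own assumptions. The key observation is that a node's height depends only on its children's stored heights, so after a rotation you can repair the two nodes that moved in constant time, lower node first.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Assumed node layout with a cached height (hypothetical).
struct Node {
    std::string key;
    int height = 0;          // a freshly created leaf has height 0
    Node *left = nullptr;
    Node *right = nullptr;
    explicit Node(const std::string &k) : key(k) {}
};

// Cached height of a possibly empty subtree (-1 for empty).
int storedHeight(const Node *node)
{
    return node ? node->height : -1;
}

// Recompute one node's height from its children's cached heights: O(1).
void updateHeight(Node *node)
{
    node->height =
        std::max(storedHeight(node->left), storedHeight(node->right)) + 1;
}

// Right height minus left height, now without any traversal.
int balanceFactor(const Node *node)
{
    assert(node != nullptr);
    return storedHeight(node->right) - storedHeight(node->left);
}

void rotateLeft(Node **node)
{
    Node *x = *node;
    Node *y = x->right;
    assert(y != nullptr);
    x->right = y->left;
    y->left = x;
    *node = y;
    updateHeight(x);   // x is now the lower node, so fix it first
    updateHeight(y);   // y's height depends on x's new height
}
```

In an insertion routine you would also call updateHeight on each node as the recursion unwinds, before checking its balance factor.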
I hope this helps!