I'm taking an online machine learning class, and in lecture 9, which covers gradient descent, I can't quite follow how he derives the direction vector of the descent (around the 57:15 mark). He explains that in gradient descent we move from a point on the surface $\mathbf w(0)$ to a new position $\mathbf w(0) + \eta \hat v$; that is, we take a step of size $\eta$ in the direction of the unit vector $\hat v$. For this purpose, he defines $\eta$ to be a fixed constant. So what we're left with is figuring out in which direction to move.

This is the relevant part of the derivation that I'm stuck on:

$$ \begin{align} \Delta E_{in} &= E_{in}(\mathbf w(0) + \eta \hat v) - E_{in}(\mathbf w(0)) \\ &= \eta \nabla E_{in}(\mathbf w(0))^{\mathrm T} \hat v + O(\eta^2) \end{align} $$
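To convince myself that the second line is really just the first line with the higher-order terms dropped, I ran a quick numerical check on a toy error surface of my own (the function, point, and direction below are all made up by me, not from the lecture):

```python
import numpy as np

# Toy error surface (my own example, not from the lecture): E(w) = w1^2 + 3*w2^2
def E(w):
    return w[0]**2 + 3 * w[1]**2

def grad_E(w):
    return np.array([2 * w[0], 6 * w[1]])

w0 = np.array([1.0, 2.0])                  # starting point w(0)
v_hat = np.array([0.6, -0.8])              # an arbitrary unit vector
eta = 1e-3                                 # a small, fixed step size

delta_E = E(w0 + eta * v_hat) - E(w0)      # exact change: first line of the derivation
first_order = eta * grad_E(w0).dot(v_hat)  # eta * grad^T v_hat: second line, minus O(eta^2)

print(delta_E, first_order)
print(abs(delta_E - first_order))          # leftover is the O(eta^2) part
```

If I've understood the $O(\eta^2)$ notation correctly, shrinking $\eta$ by a factor of 10 should shrink that leftover by about 100, and that's what I see.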

A transcript of what he says to derive the second line is:

Now if I can write this down using the Taylor series expansion with one term, OK, this is $E_{in}$ of the original point plus a move, minus the original point. That would also be the derivative times the difference, right?

So the derivative times the difference here would be the gradient transpose times the vector times $\eta$, and I just took $\eta$ outside, OK, so this would be the move according to the first-order approximation of the surface. If the surface were linear, this would be exact, but the surface is not linear, and therefore I have other terms which are of the order $\eta^2$ and up (that's what the $O(\eta^2)$ represents), and the assumption for gradient descent is that I'm going to ignore those terms as if they didn't exist.

I didn't quite understand what gradients were; I thought a gradient was just the same thing as a derivative, but after looking it up and watching some videos I think I understand its relevance here. If my rudimentary, non-formal understanding of gradients is correct, then it makes sense that $\hat v$ would be chosen to be the negation of the gradient vector at $\mathbf w(0)$, normalized. The intuition being (?) that the gradient points in the direction of steepest ascent, so we want to move in the opposite direction, since we want to descend to the minimum.
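Assuming that intuition is right, I also tried a little experiment (again my own, not from the lecture): compute the first-order change $\eta \nabla E_{in}^{\mathrm T} \hat v$ for many random unit directions and see whether the negated, normalized gradient really gives the most negative one:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([2.0, 12.0])            # gradient at some point w(0) (made up)
eta = 0.1                               # fixed step size

# First-order change in E for a unit direction v_hat
def first_order_change(v_hat):
    return eta * grad.dot(v_hat)

# Candidate: the negated, normalized gradient
v_star = -grad / np.linalg.norm(grad)

# Compare against many random unit vectors
samples = rng.normal(size=(1000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
changes = samples @ grad * eta

print(first_order_change(v_star))       # equals -eta * ||grad||
print(changes.min())                    # no random direction beats it
```

In every run I tried, no random unit direction produced a more negative first-order change than $-\nabla E_{in} / \lVert \nabla E_{in} \rVert$, which matches the intuition.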

If that intuition is correct, then I think I understand the rest of the derivation, which I've included at the end of this post. What confuses me, though, is how he derived the second line (without the $O(\eta^2)$ component, which I simply ignored like he says to do). I think it's just some sort of simple property or re-arrangement of the first line, but I can't quite figure it out. On the page for the directional derivative (and I'm not even sure whether the directional derivative applies here; I'm not too familiar with it) I found that:

$$ \lim_{h \to 0} \frac {f(\mathbf x + h \mathbf v) - f(\mathbf x)} h = \nabla f(\mathbf x) \cdot \mathbf v $$

This of course looks awfully similar to the first line, so maybe he used this to get the gradient into the equation? I have no idea.
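To get a feel for this limit, I checked it numerically on a toy function of my own ($f(x) = \sin(x_1) + x_2^2$, nothing to do with the lecture), shrinking $h$ and watching the difference quotient approach $\nabla f(\mathbf x) \cdot \mathbf v$:

```python
import numpy as np

def f(x):                               # made-up function, not from the lecture
    return np.sin(x[0]) + x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]), 2 * x[1]])

x = np.array([0.5, 1.5])
v = np.array([0.8, 0.6])                # a unit vector

exact = grad_f(x).dot(v)                # the right-hand side of the formula
for h in [1e-1, 1e-3, 1e-5]:
    quotient = (f(x + h * v) - f(x)) / h
    print(h, quotient, abs(quotient - exact))  # gap shrinks as h -> 0
```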

I also looked up the Taylor series to refresh my understanding, since he said something about the "Taylor series expansion with one term" and "this is the first-order approximation of the surface", and the closest thing I could find was:

$$ f(x) \approx f(a) + f'(a)(x - a) $$

But I'm not quite sure how that fits into this.
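For what it's worth, my guess (and it is only a guess) at how that single-variable formula might generalize is to substitute $a = \mathbf w(0)$, $x = \mathbf w(0) + \eta \hat v$, and the gradient for $f'$:

$$ E_{in}(\mathbf w(0) + \eta \hat v) \approx E_{in}(\mathbf w(0)) + \nabla E_{in}(\mathbf w(0))^{\mathrm T} (\eta \hat v) $$

Subtracting $E_{in}(\mathbf w(0))$ from both sides would give something that looks like the second line, but I don't know whether this substitution is actually legitimate for a function of a vector.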

I would appreciate any possible explanation or guidance. I don't have an understanding of higher-level math (multivariate calculus, etc.), but I think I do have a good grasp of basic calculus and linear algebra.

The full derivation for $\hat v$ is:

$$ \begin{align} \Delta E_{in} &= E_{in}(\mathbf w(0) + \eta \hat v) - E_{in}(\mathbf w(0)) \\ &= \eta \nabla E_{in}(\mathbf w(0))^{\mathrm T} \hat v + O(\eta^2) \\ &\geq -\eta \lVert \nabla E_{in}(\mathbf w(0)) \rVert \\ \hat v &= - \frac {\nabla E_{in}(\mathbf w(0))} {\lVert \nabla E_{in}(\mathbf w(0)) \rVert} \end{align} $$
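In case it helps anyone else, here is how I currently read the final result as an update rule (a sketch under my own interpretation, with a made-up quadratic error surface, so please correct me if this is wrong):

```python
import numpy as np

def E(w):                               # made-up error surface, not from the lecture
    return w[0]**2 + 3 * w[1]**2

def grad_E(w):
    return np.array([2 * w[0], 6 * w[1]])

eta = 0.05                              # fixed step size, as the lecture assumes
w = np.array([1.0, 2.0])                # starting point w(0)

for _ in range(200):
    g = grad_E(w)
    if np.linalg.norm(g) < 1e-12:       # avoid dividing by zero at a flat point
        break
    v_hat = -g / np.linalg.norm(g)      # the derived direction: negated, normalized gradient
    w = w + eta * v_hat                 # move a fixed distance eta in that direction

print(E(w))                             # ends up near the minimum at (0, 0)
```

With the fixed $\eta$ it hovers around the minimum rather than landing on it exactly, which I assume is a consequence of always stepping a full $\eta$.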