
In his free online book, "Neural Networks and Deep Learning", Michael Nielsen asks the reader to prove the following result:

If $C$ is a cost function that depends on $v_{1}, v_{2}, \ldots, v_{n}$, he states that we should move in the direction $\Delta v$ that decreases $C$ as much as possible, which amounts to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. If $\|\Delta v\| = \epsilon$ for some small $\epsilon > 0$, then it can be shown that the choice of $\Delta v$ that minimizes $\Delta C \approx \nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$. He suggests using the Cauchy-Schwarz inequality to prove this.

OK, so what I've done is to minimize with respect to $\Delta v$ an equivalent quantity, $0 = \min_{\Delta v} \lvert \nabla C \cdot \Delta v \rvert^{2} \leq \min_{\Delta v} \|\nabla C\|^{2}\,\|\Delta v\|^{2}$ (using the Cauchy-Schwarz inequality). I would say this is the correct path to prove the result, but I'm stuck and can't arrive at the same result.

Thanks.

Chill2Macht
David
  • The RHS of the Cauchy-Schwarz inequality is $\|\nabla C\|^2\epsilon^2$. We have equality in Cauchy-Schwarz iff $\Delta v$ and $\nabla C$ are linearly dependent, that is, $\Delta v = \eta\,\nabla C$ for some $\eta$. We choose $\eta < 0$ so that $\Delta C < 0$. The absolute value of $\eta$ must be $\|\Delta v\|/\|\nabla C\|$. – DanielWainfleet Mar 09 '16 at 11:42
  • .....(continued).... in order that $\|\Delta v\| = |\eta|\cdot\|\nabla C\|$. – DanielWainfleet Mar 09 '16 at 11:50
  • Thank you, I think this answers the question. I didn't know that linearly dependent vectors give equality in the Cauchy-Schwarz inequality. That was my problem. Thanks again. – David Mar 09 '16 at 14:29
  • You're welcome. When $x=(x_1,\ldots,x_n)$ and $y=(y_1,\ldots,y_n)$, then $\|x\|^2\cdot\|y\|^2-(x\cdot y)^2=\sum_{1\leq i<j\leq n}(x_i y_j - x_j y_i)^2$. – DanielWainfleet Mar 09 '16 at 14:59
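
Putting the two comments together, the argument can be written out as a short sketch (same notation as the question; the proportionality constant is written as $t$ here to avoid clashing with the $\eta$ of the claim):

$$|\nabla C \cdot \Delta v| \leq \|\nabla C\|\,\|\Delta v\| = \epsilon\,\|\nabla C\|, \qquad \text{with equality iff } \Delta v = t\,\nabla C \text{ for some scalar } t.$$

Choosing $t < 0$ makes $\Delta C \approx \nabla C \cdot \Delta v = t\,\|\nabla C\|^{2} < 0$, and the constraint $\|\Delta v\| = |t|\,\|\nabla C\| = \epsilon$ forces $|t| = \epsilon/\|\nabla C\|$, hence

$$\Delta v = -\frac{\epsilon}{\|\nabla C\|}\,\nabla C = -\eta\,\nabla C, \qquad \eta = \frac{\epsilon}{\|\nabla C\|}.$$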

2 Answers


Goal: We want to minimize ΔC ≈ ∇C ⋅ Δv by finding some value for Δv that does the trick.

Given: ||Δv|| = ϵ for some small fixed real ϵ > 0 (this is our fixed “step size” by which we’ll move down the error surface of C).
How should we move v (what should Δv be?) to decrease C as much as possible?

Claim: The optimal value is Δv = -η∇C where η = ϵ / ||∇C||, or equivalently, Δv = -ϵ∇C / ||∇C||.

Proof:
1) What is the minimum of ∇C ⋅ Δv? By the Cauchy-Schwarz inequality we know that:
|∇C ⋅ Δv| ≤ ||∇C|| * ||Δv||
∴ the smallest value ∇C ⋅ Δv can take is -||∇C|| * ||Δv|| (recall: if |x| ≤ 2, then x ∈ [-2, 2])

2) By substitution, we want some value for Δv such that:
∇C ⋅ Δv = -||∇C|| * ||Δv||
i.e. ∇C ⋅ Δv = -||∇C||ϵ (using ||Δv|| = ϵ)

3) Consider the following:
∇C ⋅ ∇C = ||∇C||² (because ||∇C|| = sqrt(∇C ⋅ ∇C))
∴ (∇C ⋅ ∇C) / ||∇C|| = ||∇C||

4) Now multiply both sides by -ϵ: (-ϵ∇C ⋅ ∇C) / ||∇C|| = -ϵ||∇C||
Notice that the right hand side of this equality is the same as in (2).

5) Rewrite the left-hand side of (4) to separate out one of the ∇C’s. The remaining factor will be our Δv such that ∇C ⋅ Δv = -||∇C||ϵ.

((-ϵ∇C) / ||∇C||) ⋅ ∇C = -ϵ||∇C||

∴ because the Cauchy-Schwarz inequality told us what the minimum of ∇C ⋅ Δv could be (by considering |∇C ⋅ Δv|), we know Δv must be (-ϵ∇C) / ||∇C|| ∎
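
As a quick numerical sanity check of this conclusion (not part of the proof), here is a small NumPy sketch; the gradient vector and step size below are arbitrary made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

grad_C = np.array([3.0, -1.0, 2.0])  # arbitrary example gradient, standing in for ∇C
eps = 0.1                            # fixed step size, ||Δv|| = ϵ

# Claimed optimal step: Δv = -ϵ ∇C / ||∇C||
dv_opt = -eps * grad_C / np.linalg.norm(grad_C)

# Compare ∇C · Δv for the claimed step against many random steps of the same norm ϵ.
random_dirs = rng.normal(size=(100_000, grad_C.size))
random_steps = eps * random_dirs / np.linalg.norm(random_dirs, axis=1, keepdims=True)

claimed_min = grad_C @ dv_opt                  # equals -ϵ ||∇C|| up to floating point
best_random = (random_steps @ grad_C).min()

print(claimed_min, -eps * np.linalg.norm(grad_C))  # both ≈ -0.374
print(best_random >= claimed_min)                  # True: no random step beats the claimed one
```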

phoenixdown

Suppose, for the sake of contradiction, that

∇C ⋅ Δv < ∇C ⋅ ( -ϵ∇C / ||∇C|| )

Note that the right side is ≤ 0, and so the left side is too.

Changing signs, we get

-∇C ⋅ Δv > ∇C ⋅ ( ϵ∇C / ||∇C|| )

Note that now both the left and right sides are ≥ 0.

Taking absolute values, we get

| ∇C ⋅ Δv | > ∇C ⋅ ( ϵ∇C / ||∇C|| )

Given that ||Δv|| = ϵ (by hypothesis) and ||a|| = (a ⋅ a) / ||a||, we get

| ∇C ⋅ Δv | > || ∇C || * || Δv ||

This is absurd, because by the Cauchy–Schwarz inequality | ∇C ⋅ Δv | ≤ || ∇C || * || Δv ||. Hence no Δv with ||Δv|| = ϵ can make ∇C ⋅ Δv smaller than the choice Δv = -ϵ∇C / ||∇C|| does.
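
Chaining the displayed steps into a single line (same notation as above), the contradiction reads:

∇C ⋅ Δv < ∇C ⋅ ( -ϵ∇C / ||∇C|| ) = -ϵ||∇C||  ⟹  | ∇C ⋅ Δv | > ϵ||∇C|| = ||∇C|| * ||Δv||

which is impossible by Cauchy–Schwarz.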

frhack