
I am solving an optimisation problem with Scipy: I take a flat network of vertices and bonds of size NNxNN, connect two of its sides (i.e., make it periodic), and minimise an energy function so that the sheet curls up to form a cylinder. (See the links below.)

Since I have the function energy(xyz-position) and its gradient, I decided to use the three methods recommended in the Scipy manual -- Newton-CG, BFGS, L-BFGS-B -- and compare how they performed.

I call the optimisation function as follows, merely replacing 'Newton-CG' with 'BFGS' or 'L-BFGS-B' as appropriate:

    from scipy.optimize import minimize
    res = minimize(energy, xyzInit, method='Newton-CG', jac=energy_der, options={'disp': True})
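
For completeness, a minimal sketch of how the three methods can be compared side by side, assuming energy, energy_der and xyzInit are defined as above (the timing loop itself is illustrative, not my exact script):

    import time
    from scipy.optimize import minimize

    # Illustrative comparison loop; energy, energy_der and xyzInit are the
    # problem-specific objects described above.
    for method in ['Newton-CG', 'BFGS', 'L-BFGS-B']:
        t0 = time.time()
        res = minimize(energy, xyzInit, method=method, jac=energy_der,
                       options={'disp': True})
        print(method, res.fun, res.nit, time.time() - t0)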

I found the following general behaviour (the output data below is for NN=9, corresponding to a 3*9^2 = 243-dimensional parameter space):

  1. BFGS systematically failed to find the correct minimum (for low NN), and failed to converge at all for large NN. See https://plot.ly/~apal90/162/ for the end result.

     NN=9
     Method: BFGS
     Warning: Desired error not necessarily achieved due to precision loss.
     Current function value: 204.465912
     Iterations: 1239
     Function evaluations: 1520
     Gradient evaluations: 1508
     Time taken for minimisation: 340.728140116
    
  2. Newton-CG found the correct minimum for small NN (<=8), but starting from NN=9 it returned an incorrect minimum (viz., a cylinder squashed at one end), and for higher values it stopped converging altogether. Note: this behaviour was, for some reason, aggravated for odd NN. See https://plot.ly/~apal90/164/

     NN=9
     Method: Newton-CG
     Optimization terminated successfully.
     Current function value: 7.954412
     Iterations: 49
     Function evaluations: 58
     Gradient evaluations: 1654
     Hessian evaluations: 0
     Time taken for minimisation: 294.203114033
    
  3. L-BFGS-B found the correct minimum, and that too blazingly fast, for all NN's that I tested (up to NN=14). See https://plot.ly/~apal90/160/

     NN=9
     Method: L-BFGS-B
     Time taken for minimisation: 36.3749790192
    

Question: Why is L-BFGS-B superior in this case to the other two methods? In particular, why is it so much superior to BFGS, when both are supposed to be quasi-Newton methods that work, to my understanding, in exactly the same manner?

My thoughts on the situation: All three methods make quadratic approximations at every point x. For this, they need a gradient and a Hessian. If the Hessian is not given, it must be estimated by the algorithm. In our case, where only the gradient is explicitly given, this is done numerically at every step. More specifically, what we require is the inverse of the Hessian, and this is a very expensive step, especially in higher dimensions. Now, Newton-CG computes this inverse Hessian explicitly, hence its longer run times. Quasi-Newton methods like BFGS and L-BFGS build an approximation to the Hessian (i.e., the curvature) from the gradients, which is cheaper in time, and which is also supposedly a better estimate of the curvature about a point. Thus, for quadratic functions Newton-CG converges faster, whereas for non-quadratic functions the quasi-Newton methods converge better. L-BFGS is a limited-memory version of BFGS that stores far less than the full NxN matrix at every step, hence it is faster than BFGS.

This explanation shows a divergence between Newton-CG and the quasi-Newton methods. What it does not explain is the inability of the algorithms to find the true minimum, and especially the disparity between BFGS and L-BFGS, which are both supposed to function in the same manner.

My general hunch on the long convergence times is that the system is non-quadratic (i.e. flat) about the minimum, and thus the algorithm oscillates about without converging. Beyond that, if BFGS and L-BFGS truly work in the same manner, I believe there must be some difference between the convergence tolerance levels of the Scipy algorithms. Otherwise, BFGS and L-BFGS don't really work in the same manner, and the latter probably calculates the Hessian far more accurately.
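
If the tolerances are indeed the culprit, they can be set explicitly rather than left at their defaults; a minimal sketch (the option names come from the Scipy documentation for the two methods, the specific values are illustrative):

    from scipy.optimize import minimize

    # Make the stopping criteria explicit instead of relying on the defaults.
    res_bfgs = minimize(energy, xyzInit, method='BFGS', jac=energy_der,
                        options={'gtol': 1e-8, 'disp': True})
    res_lbfgsb = minimize(energy, xyzInit, method='L-BFGS-B', jac=energy_der,
                          options={'gtol': 1e-8, 'ftol': 1e-12, 'maxcor': 10,
                                   'disp': True})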

References --

http://www.scipy-lectures.org/advanced/mathematical_optimization/#newton-and-quasi-newton-methods

https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

https://en.wikipedia.org/wiki/Quasi-Newton_method

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-bfgs.html#optimize-minimize-bfgs

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb

ap21
  • Can you show how you invoked the algorithms? From an interface point of view the main difference between L-BFGS-B to the other two algorithms is that it supports bound constraints. Did you specify any? These may have had a positive effect on convergence. – kazemakase Feb 24 '17 at 08:28
  • @kazemakase No, I supplied no constraints at all to L-BFGS-B. I have updated the post to show my code line. It's the exact same for all the algorithms. – ap21 Feb 24 '17 at 16:50
  • Cross-posted: http://stackoverflow.com/q/42424444/781723, http://cs.stackexchange.com/q/70780/755. Please [do not post the same question on multiple sites](http://meta.stackexchange.com/q/64068). Each community should have an honest shot at answering without anybody's time being wasted. – D.W. Feb 24 '17 at 19:33
  • Is `energy()` small enough / self-contained enough to post, say on gist.github ? (Scalable test cases are useful for testing other optimizers too.) – denis Aug 17 '18 at 09:20
  • @denis No, unfortunately, it's not self-contained enough. Too many complicated input variables involved. – ap21 Aug 28 '18 at 20:50
  • +1, and I get similar speed improvements. But I'll also point out that I generally find that when BFGS fails with complaints about precision, it's either because the `energy` function is actually poorly behaved numerically or because the Jacobian function isn't actually correct. It's good to check the latter with `scipy.optimize.check_grad` or `scipy.optimize.approx_fprime`. In particular, I do convergence plots by varying the values of `epsilon` to those functions. – Mike Nov 23 '18 at 03:04
  • @Mike, how exactly does changing the value of `epsilon` help with the final optimisation? It won't give you a better Jacobian function. – ap21 Nov 25 '18 at 13:48
  • @ap21 I mentioned `epsilon` as one of the arguments to `check_grad` or `approx_fprime`; it's not an argument to `minimize` itself. What I'm suggesting is that you need to check that your Jacobian function is actually correct and numerically well behaved for the range of values that `minimize` will be testing, and you do that test with those first two functions. If you have the wrong Jacobian or if it's not well behaved numerically, feeding it in to the `minimize` function can result in this "precision loss" warning. – Mike Nov 25 '18 at 14:52
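
To illustrate Mike's suggestion in the comments above, a minimal sketch of such a gradient check (the epsilon values are purely illustrative):

    import numpy as np
    from scipy.optimize import check_grad, approx_fprime

    # Compare the analytic gradient against a finite-difference estimate
    # at the starting point, for a range of step sizes.
    for eps in [1e-4, 1e-6, 1e-8]:
        print(eps, check_grad(energy, energy_der, xyzInit, epsilon=eps))

    # Or inspect the finite-difference gradient directly:
    fd_grad = approx_fprime(xyzInit, energy, 1e-6)
    print(np.max(np.abs(fd_grad - energy_der(xyzInit))))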

1 Answer


Your question is missing two important pieces of information: the energy function and the initial guess. The energy function can be convex or non-convex, smooth, piecewise-smooth, or discontinuous, which makes it hard to fully answer your question in your specific context. However, I can explain some key differences between BFGS and L-BFGS-B.

Both are iterative methods for solving nonlinear optimization problems. They both approximate Newton's method by using an approximation of the Hessian of the function at every iteration. The key difference from Newton's method is that instead of computing the full Hessian at a specific point, they accumulate the gradients from previous points and combine them with the BFGS formula into an approximation of the Hessian. Newton and BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum.
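
For concreteness, the standard BFGS update of the inverse-Hessian approximation from one step and one gradient difference looks roughly like this (a NumPy sketch of the textbook formula, not SciPy's actual implementation):

    import numpy as np

    def bfgs_update(H, s, y):
        """One BFGS update of the inverse-Hessian approximation H.

        s = x_new - x_old          (step taken)
        y = grad_new - grad_old    (change in gradient)
        """
        rho = 1.0 / np.dot(y, s)
        I = np.eye(len(s))
        V = I - rho * np.outer(s, y)
        return V.dot(H).dot(V.T) + rho * np.outer(s, s)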

The original BFGS method accumulates all gradients since the given initial guess. There are two problems with this approach. First, the memory requirement can grow indefinitely. Second, for nonlinear problems, the Hessian at the initial guess is often not representative of the Hessian at the solution. The approximated Hessian will thus be biased until enough gradients are accumulated close to the solution. This can slow down convergence, but should, in my experience, still converge with a good line search algorithm for energy functions that have a single local minimum.

L-BFGS is the same as BFGS but with limited memory: after some time, old gradients are discarded to leave room for freshly computed ones. This solves the memory problem and avoids the bias of the initial gradients. However, depending on the number of gradients kept in memory, the Hessian might never be precisely estimated, which can be another source of bias. This can also slow down convergence, but again, it should still converge with a good line search algorithm for energy functions that have a single local minimum.
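
In contrast to BFGS, L-BFGS never stores the full matrix; it reconstructs the product of the approximate inverse Hessian with the current gradient from only the m most recent (s, y) pairs, using the so-called two-loop recursion. A sketch (again, not SciPy's actual implementation):

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        """Search direction -H*grad built from the stored (s, y) pairs only
        (two-loop recursion); len(s_list) is the memory size m."""
        q = grad.copy()
        rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
        alphas = []
        # First loop: newest pair to oldest.
        for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
            alpha = rho * np.dot(s, q)
            alphas.append(alpha)
            q -= alpha * y
        # Scale by an initial Hessian guess from the most recent curvature pair.
        if s_list:
            gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        # Second loop: oldest pair to newest.
        for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
            beta = rho * np.dot(y, r)
            r += (alpha - beta) * s
        return -r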

L-BFGS-B is the same as L-BFGS but with bound constraints on the input variables. L-BFGS-B will stop optimizing variables that are on the boundary of the domain. Since you did not specify any constraints, this aspect of the algorithm does not apply to your problem.
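
For reference, bounds are passed as one (lower, upper) pair per coordinate; the values below are purely illustrative, and without a bounds argument L-BFGS-B behaves like plain L-BFGS:

    from scipy.optimize import minimize

    # Hypothetical box constraints on every coordinate; None means unbounded.
    bounds = [(-10.0, 10.0)] * len(xyzInit)
    res = minimize(energy, xyzInit, method='L-BFGS-B', jac=energy_der,
                   bounds=bounds, options={'disp': True})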

My hypothesis is that you are trying to solve a smooth but non-convex problem using an initial guess that is far from the solution, and that you end up in a local minimum. Since you mention that you start from a flat configuration, I would not be surprised if you start at a singularity that leads to a degenerate Hessian, which can cause trouble for the rest of the optimization. The only practical difference between BFGS and L-BFGS in your case is that every iteration computes a slightly different step, and the L-BFGS method happens to follow a path that leads to the global minimum.

  • Thanks for the long answer. At the end, are you suggesting that L-BFGS performs better than BFGS due to sheer luck? Or because it forgets the initial flat state faster? Also, do you have a robust algorithm for finding the global minimum? – ap21 Apr 03 '19 at 14:53
  • It is hard to confirm without the energy function itself, but I think this is sheer luck. SciPy has a global optimization method called "basin hopping" that performs successive local optimizations from random starting points (no guarantee of convergence). I never used it personally, so I cannot confirm its efficiency. However, global optimization tends to be very slow. I suggest finding a simple algorithm that gives you a good approximation of the solution, and using local minimization methods to refine it. Your solution seems to be very structured, so I think it should not be too hard to do. – Gilles-Philippe Paillé Apr 04 '19 at 15:29
  • Got it. But what exactly do you mean by "using local minimization methods to refine it"? – ap21 Apr 04 '19 at 17:17
  • 1
    I mean that you use the method that you are already using (L-BFGS for example), but instead of starting from the flat configuration, you start from the result of the approximated solution. For example, you can wrap you vertices like a cylinder in a first pass, and feed these vertex positions to L-BFGS. – Gilles-Philippe Paillé Apr 04 '19 at 18:44
  • In my tests I ran into the opposite case: only Newton-CG converges to the global optimum, while L-BFGS, SAGA, etc. do not. I searched for why this is and found this discussion, but my confusion remains. – Frank Nov 28 '20 at 04:08
  • @Frank This is possible. As discussed at the beginning of my answer, these are optimization methods for finding *local* minima of smooth functions. If your function doesn't meet these criteria, then there is no guarantee that either will converge. You might find that Newton-CG has a better chance of converging for your specific problem, but it might just be luck. In my experience, finding a good initial guess is the best way to increase the chance of convergence when using these methods on functions that have multiple local minima. – Gilles-Philippe Paillé Nov 28 '20 at 05:20
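
To make the suggestions in these comments concrete, here is a sketch of (a) a hypothetical pre-wrapped cylindrical initial guess, assuming the vertices form an NN x NN grid with unit spacing (which may not match the actual mesh), and (b) scipy.optimize.basinhopping as a slow global alternative:

    import numpy as np
    from scipy.optimize import minimize, basinhopping

    def cylinder_guess(NN):
        # Map a flat NN x NN grid of unit spacing onto a cylinder whose
        # circumference equals the sheet width; used only as a starting point.
        radius = NN / (2.0 * np.pi)
        i, j = np.meshgrid(np.arange(NN), np.arange(NN), indexing='ij')
        theta = 2.0 * np.pi * j / NN
        xyz = np.stack([radius * np.cos(theta),
                        radius * np.sin(theta),
                        i.astype(float)], axis=-1)
        return xyz.ravel()

    x0 = cylinder_guess(9)
    res = minimize(energy, x0, method='L-BFGS-B', jac=energy_der)

    # Global optimization (slow, and still no guarantee of convergence):
    res_global = basinhopping(energy, xyzInit,
                              minimizer_kwargs={'method': 'L-BFGS-B',
                                                'jac': energy_der})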