Why can't the Direct Linear Transformation (DLT) give the optimal camera extrinsics?

Question

I'm reading the source code of `solvePnP()` in OpenCV. When the `flags` parameter takes its default value `SOLVEPNP_ITERATIVE`, it calls `cvFindExtrinsicCameraParams2`, which FIRST uses the DLT algorithm (if we have a non-planar set of 3D points) to initialize the 6DOF camera pose, and SECOND uses the `CvLevMarq` solver to minimize the reprojection error.

My question is: the DLT formulates the problem as a linear least-squares problem and solves it with SVD, so it seems to give an optimal solution. Why do we still use the iterative Levenberg-Marquardt method afterwards?

Or, put differently: what limitation makes the DLT algorithm inferior? Why does the closed-form solution end up at only a LOCAL minimum of the cost function?

I think it is common to add an extra step of non-linear refinement to the extrinsics estimation, which is done iteratively. See here: http://www.epixea.com/research/multi-view-coding-thesisse9.html - Dan, May 07 '17 at 05:25
@Dan Thanks for the link. I know it's common, just like what the `CvLevMarq` solver does in OpenCV. What I mean is: DLT seems to be a kind of **closed-form minimization** of the cost function, so why is it still inferior (a local minimum)? - zhangxaochen, May 07 '17 at 07:17

score 18 · Answer 1 · answered May 13 '17 at 14:11

When you want to find the solution to a problem, the first step is to express this problem in mathematical terms, and you can then use existing mathematical tools to find a solution to your equations. However, interesting problems can usually be expressed in many different mathematical ways, each of which may lead to a slightly different solution. It then takes work to analyze the different methods to understand which one provides the most stable/accurate/efficient/etc solution.

In the case of the PnP problem, we want to find the camera pose given associations between 3D points and their projections onto the image plane.

A first way to express this problem mathematically is to cast it as a linear least-squares problem. This approach is known as the DLT approach, and it is interesting because linear least-squares problems have a closed-form solution which can be found robustly using the Singular Value Decomposition. However, this approach treats the 12 entries of the 3x4 camera pose matrix P as independent unknowns (12 degrees of freedom, or 11 once the arbitrary overall scale is removed), when really the pose has only 6 (3 for the 3D rotation plus 3 for the 3D translation). To obtain a 6DOF camera pose from the result of this approach, an approximation is needed (which is not covered by the linear cost function of the DLT), leading to an inaccurate solution.
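As a rough illustration of the linear formulation, here is a minimal NumPy sketch (not OpenCV's implementation). It assumes normalized image coordinates (intrinsics already removed) and a synthetic scene; the helper names `rodrigues`, `project` and `dlt_pose` and all numeric values are made up for the example. With noisy observations, the left 3x3 block of the DLT result is generally not orthogonal, so it is not yet a valid rotation.

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project(R, t, X):
    """Project (N, 3) world points to (N, 2) normalized image points."""
    Xc = X @ R.T + t
    return Xc[:, :2] / Xc[:, 2:3]

def dlt_pose(X, x):
    """Linear estimate of a 3x4 matrix M such that x ~ M [X; 1] (needs N >= 6)."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh                    # m1.Xh - u * m3.Xh = 0
    A[0::2, 8:12] = -x[:, [0]] * Xh
    A[1::2, 4:8] = Xh                    # m2.Xh - v * m3.Xh = 0
    A[1::2, 8:12] = -x[:, [1]] * Xh
    # Minimizer of ||A m|| subject to ||m|| = 1: last right singular vector.
    M = np.linalg.svd(A)[2][-1].reshape(3, 4)
    # Remove the arbitrary scale and sign so the left 3x3 block has det = +1.
    return M / np.cbrt(np.linalg.det(M[:, :3]))

# Hypothetical synthetic scene.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (20, 3))
R_true = rodrigues(np.array([0.2, -0.3, 0.1]))
t_true = np.array([0.1, -0.2, 5.0])
x_clean = project(R_true, t_true, X)
x_noisy = x_clean + rng.normal(0, 1e-3, x_clean.shape)

M_clean = dlt_pose(X, x_clean)   # recovers [R | t] when there is no noise
M_noisy = dlt_pose(X, x_noisy)
B = M_noisy[:, :3]
ortho_err = np.linalg.norm(B.T @ B - np.eye(3))
print("orthogonality error of the noisy DLT block:", ortho_err)
```

The linear solve only knows about 12 numbers; nothing in it forces the 3x3 block to be a rotation, which is why a separate correction step is needed afterwards.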

A second way to express the PnP problem mathematically is to use the geometric (reprojection) error as a cost function, and to find the camera pose that minimizes it. Since the geometric error is non-linear in the pose parameters, this approach estimates the solution using iterative solvers, such as the Levenberg-Marquardt algorithm. Such algorithms can take into account the 6 degrees of freedom of the camera pose, leading to accurate solutions. However, since they are iterative approaches, they need to be provided with an initial estimate of the solution, which in practice is often obtained using the DLT approach.
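The iterative approach can be sketched in the same spirit. The block below is a simplified damped Gauss-Newton loop in the Levenberg-Marquardt style with a finite-difference Jacobian, not OpenCV's `CvLevMarq`. The pose is parameterized by 6 numbers (axis-angle plus translation), the function names `residuals` and `refine_pose` are hypothetical, and the starting pose is a deliberately perturbed copy of the ground truth standing in for a DLT estimate.

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def residuals(p, X, x):
    """Reprojection residuals for pose p = (axis-angle, translation)."""
    Xc = X @ rodrigues(p[:3]).T + p[3:]
    return (Xc[:, :2] / Xc[:, 2:3] - x).ravel()

def refine_pose(p0, X, x, iters=50, lam=1e-3, eps=1e-6):
    """Minimize the squared reprojection error over the 6 pose parameters."""
    p = np.asarray(p0, dtype=float).copy()
    cost = np.sum(residuals(p, X, x) ** 2)
    for _ in range(iters):
        r = residuals(p, X, x)
        # Forward-difference Jacobian, one column per pose parameter.
        J = np.empty((r.size, 6))
        for j in range(6):
            dp = np.zeros(6)
            dp[j] = eps
            J[:, j] = (residuals(p + dp, X, x) - r) / eps
        H = J.T @ J
        step = np.linalg.solve(H + lam * np.diag(np.diag(H)), -J.T @ r)
        new_cost = np.sum(residuals(p + step, X, x) ** 2)
        if new_cost < cost:      # accept the step and trust the model more
            p, cost, lam = p + step, new_cost, lam * 0.1
        else:                    # reject the step and increase the damping
            lam *= 10.0
    return p, cost

# Hypothetical synthetic scene with noisy normalized observations.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (20, 3))
p_true = np.array([0.2, -0.3, 0.1, 0.1, -0.2, 5.0])
x_obs = residuals(p_true, X, np.zeros((20, 2))).reshape(-1, 2)
x_obs = x_obs + rng.normal(0, 1e-3, x_obs.shape)

p_init = p_true + np.array([0.05, -0.05, 0.05, 0.1, 0.1, -0.3])
cost_init = np.sum(residuals(p_init, X, x_obs) ** 2)
p_hat, cost_final = refine_pose(p_init, X, x_obs)
print("squared reprojection error:", cost_init, "->", cost_final)
```

Because the unknowns are exactly the 6 pose parameters, every iterate is a valid rigid transformation, and the quantity being reduced is the reprojection error itself rather than an algebraic surrogate.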

Now to answer the title of your question: sure, the DLT algorithm gives the optimal camera extrinsics, but it is optimal only in the sense of the linear cost function solved by the DLT algorithm. Over the years, scientists have found more complex cost functions that lead to more accurate solutions, but that are also more difficult to minimize.

Thanks first. You mean DLT is inferior because it produces a 12DOF matrix, not exactly a rigid transformation matrix? But as far as I know, the operation `svd(H)=UWV^T` followed by `result=VU^T` does give an exact rotation matrix (an orthogonal matrix with det=+1) with 3DOF, not 9DOF, doesn't it? - zhangxaochen, May 16 '17 at 07:41
Yes, there are ways to extract a 6DOF camera pose from a 12DOF one (e.g. using SVD, like you said). However, this transformation is not covered by the linear least-squares problem of the DLT algorithm, hence you end up with a 6DOF camera pose that is no longer optimal in the sense of your linear least-squares cost function. To embed the 6DOF pose estimation in a least-squares problem, you need a non-linear function, and the best way we know to minimize a non-linear function is via iterative solvers (e.g. the Levenberg-Marquardt algorithm). - BConic, May 16 '17 at 18:47
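
To make the last comment concrete, the following NumPy sketch (synthetic data, all values hypothetical) projects the DLT result onto a valid rotation with an SVD and then evaluates the DLT's own algebraic cost for both solutions. The projection used here is `R = U V^T` for the 3x3 block `U S V^T`; whether this or the `V U^T` form applies depends on how the decomposed matrix is defined. The translation scale is taken as the mean singular value, which is one simple choice among several.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground-truth pose and noisy normalized observations.
X = rng.uniform(-1, 1, (20, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about the y axis
t_true = np.array([0.1, -0.2, 5.0])
Xc = X @ R_true.T + t_true
x = Xc[:, :2] / Xc[:, 2:3] + rng.normal(0, 1e-3, (20, 2))

# DLT design matrix: two linear equations per 3D-2D correspondence.
Xh = np.hstack([X, np.ones((20, 1))])
A = np.zeros((40, 12))
A[0::2, 0:4] = Xh
A[0::2, 8:12] = -x[:, [0]] * Xh
A[1::2, 4:8] = Xh
A[1::2, 8:12] = -x[:, [1]] * Xh

# 12-parameter optimum: unit vector minimizing ||A m||.
m_dlt = np.linalg.svd(A)[2][-1]
M = m_dlt.reshape(3, 4)
if np.linalg.det(M[:, :3]) < 0:   # the sign of m is arbitrary; pick det > 0
    M = -M

# Project the 3x3 block onto the nearest rotation, rescale the translation.
U, S, Vt = np.linalg.svd(M[:, :3])
R = U @ Vt                        # det(R) = +1 because det(M[:, :3]) > 0
t = M[:, 3] / S.mean()

# Evaluate the linear (algebraic) cost of the projected 6DOF pose.
m_6dof = np.hstack([R, t[:, None]]).ravel()
m_6dof /= np.linalg.norm(m_6dof)

cost_dlt = np.linalg.norm(A @ m_dlt)
cost_6dof = np.linalg.norm(A @ m_6dof)
print("algebraic cost, 12-parameter DLT optimum:", cost_dlt)
print("algebraic cost, after projection to 6DOF:", cost_6dof)
```

The first cost can never exceed the second, since `m_dlt` minimizes `||A m||` over all unit vectors. The projected pose is a valid rigid transformation, but it is no longer the minimizer of the linear cost, and it was never the minimizer of the reprojection error.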
