I am reading Gaussian Distribution from a machine learning book. It states that -

We shall determine values for the unknown parameters $\mu$ and $\sigma^2$ in the Gaussian by maximizing the likelihood function. In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.

Can anyone give me some intuition behind this, with an example of where the log-likelihood is more convenient than the likelihood? Please give me a practical example.

Thanks in advance!

Kaidul Islam
    Related question: https://stats.stackexchange.com/questions/174481/why-to-optimize-max-log-probability-instead-of-probability – Mohit Pandey Dec 05 '18 at 21:34

3 Answers

  1. It is extremely useful, for example, when you want to calculate the joint likelihood for a set of independent and identically distributed points. Assuming that you have your points: $$X=\{x_1,x_2,\ldots,x_N\} $$ The total likelihood is the product of the likelihoods of the individual points, i.e.: $$p(X\mid\Theta)=\prod_{i=1}^Np(x_i\mid\Theta) $$ where $\Theta$ are the model parameters: the vector of means $\mu$ and the covariance matrix $\Sigma$. If you use the log-likelihood you end up with a sum instead of a product: $$\ln p(X\mid\Theta)=\sum_{i=1}^N\ln p(x_i\mid\Theta) $$
  2. Also in the case of Gaussian, it allows you to avoid computation of the exponential:

    $$p(x\mid\Theta) = \dfrac{1}{(\sqrt{2\pi})^d\sqrt{\det\Sigma}}e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$$ which becomes:

    $$\ln p(x\mid\Theta) = -\frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln(\det \Sigma)-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$$

  3. As you mentioned, $\ln x$ is a monotonically increasing function, so log-likelihoods preserve the same order relations as the likelihoods:

    $$p(x\mid\Theta_1)>p(x\mid\Theta_2) \Leftrightarrow \ln p(x\mid\Theta_1)>\ln p(x\mid\Theta_2)$$

  4. From the standpoint of computational complexity, summing is less expensive than multiplying (although nowadays the two cost almost the same). But what is even more important: the likelihoods become very small, and you will run out of floating-point precision very quickly, yielding an underflow. That is why it is far more convenient to use the logarithm of the likelihood. Simply try to calculate the likelihood by hand using a pocket calculator: it is almost impossible.

    Additionally, in the classification framework you can simplify the calculations even further. The relations of order remain valid if you drop the factor $\frac{1}{2}$ and the $d\ln(2\pi)$ term, because these are class-independent. Also, if the covariance matrices of both classes are the same ($\Sigma_1=\Sigma_2 $), then you can drop the $\ln(\det \Sigma) $ term as well.
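The log-density formula in point 2 is easy to sanity-check numerically. Here is a small sketch (my own illustration, not from the answer; it assumes NumPy and SciPy are available, and uses `scipy.stats.multivariate_normal.logpdf` as the reference):

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=d)              # mean vector
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)      # a positive-definite covariance matrix
x = rng.normal(size=d)               # a query point

# Log-density computed term by term from the formula in point 2.
diff = x - mu
log_p = (-0.5 * d * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(Sigma))
         - 0.5 * diff @ np.linalg.solve(Sigma, diff))

# SciPy's reference value agrees.
ref = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)
assert np.isclose(log_p, ref)
```

In serious code one would work with a Cholesky factor of $\Sigma$ instead of `det`/`solve`, but the direct translation makes the correspondence with the formula explicit.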

Michael Hardy
    In your second reasoning, you avoid computing the exponential but now you have to compute the $\ln$ instead. Just wondering: is $\ln$ always computationally less expensive than the exponential? – jlcv Jun 08 '16 at 08:09
  • From a machine learning perspective one of your objectives is to calculate stable gradients. In my opinion the answer above is not well structured. Point 4 describes the problem, the reason why you want to look for a smarter solution than just calculating the likelihood. Point 3 describes why you are even allowed to use the natural logarithm as a function because it keeps intact one important property of the original function. Point 1 and 2 describe the mathematics of how point 4 is achieved. – Joop Jan 31 '20 at 13:13
    @JustinLiang Point 2 is about computational stability. The exponential can cause overflow, whereas taking the log makes this way less likely. – Joop Jan 31 '20 at 13:19

First of all, as stated, the log is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood. Furthermore, one can make use of $\ln(ab) = \ln(a) + \ln(b)$. Many equations simplify significantly because one gets sums where one had products before, and one can then maximize simply by taking derivatives and setting them equal to $0$.
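To see that "take derivatives of the sum and set them equal to $0$" really works, here is a sketch (the data and the use of `scipy.optimize.minimize` are my own illustration): numerically maximizing the Gaussian log-likelihood recovers the same estimates as the closed-form derivative conditions, namely the sample mean and the biased sample variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=1000)

# Negative log-likelihood of a univariate Gaussian, written as a sum of log
# densities; log_sigma is optimized so that sigma stays positive.
def nll(params):
    mu, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)
    return (0.5 * len(x) * np.log(2.0 * np.pi * sigma2)
            + np.sum((x - mu) ** 2) / (2.0 * sigma2))

res = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(2.0 * res.x[1])

# Setting the derivatives of the log-likelihood to zero gives the sample mean
# and the biased sample variance; the numerical optimum matches them.
assert np.isclose(mu_hat, x.mean(), atol=1e-4)
assert np.isclose(sigma2_hat, x.var(), atol=1e-3)
```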

  • Look at the example on this page: http://www.unc.edu/courses/2007spring/enst/562/001/docs/lectures/lecture15.htm#loglikelihood . Maximizing the product would be a horrible task, maximizing the sum however is quite doable. – hickslebummbumm Aug 10 '14 at 11:21
  • Could you please provide an alternative working link? This link is no longer working. – Abhinav Ravi Apr 15 '19 at 06:07

Maybe a practical example for computer scientists. As @jojek mentioned:

$$p(X|\Theta) = \prod_{i=1}^N p(x_i|\Theta)$$


$$\ln p(X| \Theta) = \sum_{i=1}^N \ln p(x_i| \Theta)$$

So if you execute a little Python script you can see it directly:

>>> import numpy as np

# Create some small probability values.
>>> r = np.random.random_sample((100,)) * 0.000001

# Take the log of each value (the base of the logarithm does not matter here).
>>> l = np.log(r)

# The product of the probability values underflows to exactly 0.0 ...
>>> np.prod(r)
0.0

# ... while the sum of the log probability values is a perfectly
# representable number (around -1480 here, depending on the random draw).
>>> np.sum(l)

Here you can see that it is more practical to work with logarithms: the product underflows to zero, while the sum of logs remains an ordinary floating-point number. This matters especially in machine learning, where you routinely compute with many very small numbers (e.g. when training a neural network), and working in log space leads to much more stable results.
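A closely related consequence of point 3 in the first answer: even when the raw likelihoods of two candidate parameter settings both underflow to exactly zero and can no longer be compared, their log-likelihoods remain finite and still rank the models correctly. A small sketch (my own illustration, assuming SciPy's `norm` is available):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=2000)   # data truly from N(0, 1)

# Raw likelihoods: both products underflow to exactly 0.0, so the two
# models look identical even though one fits far better.
like_good = np.prod(norm.pdf(x, loc=0.0, scale=1.0))
like_bad = np.prod(norm.pdf(x, loc=5.0, scale=1.0))
assert like_good == 0.0 and like_bad == 0.0

# Log-likelihoods: finite, and they still rank the models correctly.
ll_good = np.sum(norm.logpdf(x, loc=0.0, scale=1.0))
ll_bad = np.sum(norm.logpdf(x, loc=5.0, scale=1.0))
assert ll_good > ll_bad
```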
