[Interactive animation]
This illustration shows a sample of n independent observations, and two continuous distributions f1(x) and f2(x), with f2(x) being just f1(x) translated by a certain amount.
Of these two distributions, which one is more likely to have generated the sample? Clearly, the answer is f1(x), and we would like to formalize this intuition.
Although this is not strictly impossible, we don't believe that f2(x) generated the sample, because all the observations lie in regions where the values of f2(x) are small: the probability for an observation to appear in such a region is small, and it is even more unlikely that all the observations in the sample would appear in low-density regions.
On the other hand, the values taken by f1(x) are substantial for all the observations, which are therefore where one would expect them to be if the sample had actually been generated by f1(x).
Of the many ways to quantify this intuitive judgement, one turns out to be remarkably effective. For any probability distribution f(x), just multiply the values of f(x) at each of the observations of the sample, denote the result L, and call it the likelihood of the distribution f(x) for this particular sample:
Likelihood = L = Πi f(xi),   i = 1, 2, ..., n
Clearly, the likelihood can have a large value only if all the observations are in regions where f(x) is not very small.
This definition has the additional advantage that L receives a natural interpretation. The sample {xi} may be regarded as a single observation generated by the n-variate probability distribution
f(x1, x2, ..., xn) = Πi f(xi)
because of the independence of the individual observations. So the likelihood of the distribution is just the value of the n-variate probability density f(x1, x2, ..., xn) for the set of observations in the sample considered as a single n-variate observation.
These considerations make us believe that "likelihood" might be a helpful concept for identifying the distribution that generated a given sample.
First note, though, that, as such, this approach is moot if we don't restrict our search a priori: the probability distribution leading to the largest possible value of the likelihood is obtained by assigning probability 1/n to each of the points where there is an observation, and the value 0 to f(x) everywhere else on the x axis. This result is both trivial and useless.
But consider the example given in the above illustration: f1(x) and f2(x) are assumed to belong to a family of distributions, all identical in shape and differing only by their position along the x axis (a location family). It now makes sense to ask for which position of the generic distribution f(x) the likelihood is largest. If we denote by θ the parameter adjusting the horizontal position of the distribution, one may consider the value of θ yielding the largest likelihood as probably being fairly close to the true (and unknown) value θ0 of the parameter of the distribution that actually generated the sample.
It then appears that the concept of likelihood may lead to a method of parameter estimation. The method consists in retaining, as an estimate of θ0, the value of θ that yields the largest possible value of the sample likelihood. This method is therefore called Maximum Likelihood estimation; it is, in fact, the most powerful and widely used method of parameter estimation these days.
An estimator θ* obtained by maximizing the likelihood of a probability distribution defined up to the value of a parameter θ is called a Maximum Likelihood estimator and is usually denoted "MLE".
When we need to emphasize the fact that the likelihood depends on both the sample x = {xi} and the parameter θ, we'll denote it L(x, θ).
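To make this concrete, here is a minimal sketch in Python (assuming, purely for illustration, that the family is a normal location family with unit variance, and a simulated sample): the likelihood L(x, θ) is computed as the product of the density values at the observations, and a simple grid search returns the θ that makes it largest.

```python
# A minimal sketch, assuming (for illustration only) a normal location family
# f(x) = N(theta, 1) and a simulated sample: L(x, theta) is the product of the
# density values at the observations, maximized by a simple grid search.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=30)          # true theta_0 = 2

def density(x, theta):
    # N(theta, 1) density, standing in for the generic location family
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)

def likelihood(theta):
    return np.prod(density(sample, theta))                # L(x, theta) = prod_i f(x_i)

grid = np.linspace(-5.0, 5.0, 2001)
theta_hat = grid[np.argmax([likelihood(t) for t in grid])]
print("theta with the largest likelihood:", theta_hat)    # close to the true value 2
```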
-----
We used a continuous probability distribution for illustrating the concept of Maximum Likelihood, but the principle is valid for any distribution, either discrete or continuous.
The likelihood is defined as a product, and maximizing a product is usually more difficult than maximizing a sum. But if a function L(θ) is changed into a new function L'(θ) by a strictly increasing transformation, then L(θ) and L'(θ) clearly reach their maximum values for the same value of θ.
In particular, if the increasing transformation is the logarithm, the maximization of a product is turned into the easier maximization of a sum.
The logarithm of the likelihood is called the log-likelihood, and will be denoted log-L. So, by definition:
Log-likelihood = log-L = Σi log f(xi),   i = 1, 2, ..., n
and the likelihood and log-likelihood reach their extrema for the same values of θ.
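The practical benefit of the logarithm is easy to see numerically. In the sketch below (same illustrative normal location family as above, but a much larger simulated sample), the product defining L underflows to 0.0 while the sum of logs remains perfectly usable, and both are maximized by the same θ.

```python
# A minimal sketch (same illustrative normal location family, larger simulated
# sample): the product defining L underflows to 0.0, while the sum of logs is
# perfectly usable, and both are maximized by the same theta.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=2000)

def log_density(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

grid = np.linspace(0.0, 4.0, 4001)
log_L = np.array([np.sum(log_density(sample, t)) for t in grid])
L = np.exp(log_L)                                  # every entry underflows to 0.0 here

print("largest L on the grid :", L.max())          # 0.0, so the product is unusable
print("theta maximizing log-L:", grid[np.argmax(log_L)])   # well defined, close to 2
```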
The principle of Maximum Likelihood estimation is straightforward enough, but its practice is fraught with difficulties, as is the case of any optimization problem.
From your high school days, you may remember that the extrema of a differentiable function L(θ) verify the condition dL(θ)/dθ = 0. As we just mentioned, this condition may equally well be applied to the log-likelihood log-L instead of the likelihood L.
So the most natural approach to maximizing a differentiable likelihood is to first solve this equation. Yet, this alone is far from solving the problem for a number of reasons.
* Even though most classical likelihoods are differentiable (with the important exception of the uniform distribution), there is no reason why the solutions of this equation should have simple analytical forms. As a matter of fact, more often than not, they don't, and it may then be necessary to resort to numerical techniques to identify the extrema of the likelihood function (as is typically the case, for example, with Logistic Regression; a sketch of such a numerical search is given after this list).
* The above equation identifies the extrema of L(θ), but says nothing about which of these extrema are maxima (which we are interested in) and which are minima (which we are not). To make things worse, some inflection points may also satisfy the equation. So, after the solutions of the equation have been identified, one must go through them and retain only those corresponding to maxima.
Computer optimization techniques can be adjusted so as to identify only maxima.
Recall that a genuine maximum also verifies d²L(θ)/dθ² < 0, a condition that has to be checked for every solution of the first equation.
* There is no reason why the likelihood would have a single maximum. So once the maxima have been found, only the largest among them is retained, provided it is within the allowed range of the parameter θ. It can be shown, though, that under certain regularity conditions, the probability for the likelihood function to have a unique maximum tends to 1 as the sample size grows without limit.
* The equation dL(θ)/dθ = 0 identifies extrema only within the interior of the range of θ. It is therefore ineffective in identifying extrema:
- That are on the boundary of the range of θ when this range is limited.
- Or that are "at infinity" (lower image of the above illustration). The likelihood then has no maximum.
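As announced above, here is a sketch of such a numerical search (assuming SciPy is available; the Cauchy location model is chosen purely as an illustration, because its MLE has no closed form and its likelihood may have several local maxima). Several starting points are tried and the solution with the largest log-likelihood is retained.

```python
# A minimal sketch, assuming SciPy (illustrative Cauchy location model, whose
# MLE has no closed form and whose likelihood may have several local maxima):
# the negative log-likelihood is minimized numerically from several starting
# points, and the solution with the largest log-likelihood is retained.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
sample = stats.cauchy.rvs(loc=3.0, size=50, random_state=rng)   # true location = 3

def neg_log_likelihood(theta):
    # Minimizing -log-L is equivalent to maximizing log-L
    return -np.sum(stats.cauchy.logpdf(sample, loc=theta))

candidates = []
for start in np.percentile(sample, [10, 25, 50, 75, 90]):       # several starting points
    result = optimize.minimize(neg_log_likelihood, x0=[start], method="Nelder-Mead")
    candidates.append((result.fun, result.x[0]))

best_neg_log_L, theta_mle = min(candidates)        # keep the largest log-likelihood
print("MLE of the location parameter:", theta_mle)
print("log-likelihood at the MLE    :", -best_neg_log_L)
```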
When the maximum of the likelihood is identified by numerical techniques, the issue of the validity of the solution thus found is crucial. The value resulting from intensive numerical computation may be quite sensitive to round-off errors, thus leading to estimated values of the parameter θ that may be substantially different from the value that would be obtained in the absence of computation errors. This is particularly true when the true value of the parameter is in a region where the likelihood varies very little with θ, or when the maximum is at infinity.
Along the same line of reasoning, it may happen that the value of the likelihood is extremely sensitive to small changes in the values of the observations. Because real-world observations are always somewhat uncertain, it is a good idea to check how the estimated value of the parameter changes when the values of the observations are slightly modified. If these small modifications lead to large variations in the estimated value of the parameter, the original estimate should be regarded with some suspicion.
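A possible way of carrying out this check is sketched below (an illustrative exponential model is assumed, whose MLE of the rate is simply 1 over the sample mean, and the size of the perturbations is arbitrary): the data are jittered repeatedly and the spread of the resulting estimates is examined.

```python
# A minimal sketch of the suggested check, assuming an illustrative exponential
# model (MLE of the rate = 1 / sample mean) and an arbitrary perturbation size:
# the observations are jittered repeatedly and the estimate is recomputed.
import numpy as np

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=40)                    # true rate = 0.5

def rate_mle(x):
    return 1.0 / np.mean(x)                                     # exponential rate MLE

baseline = rate_mle(sample)
jitter_scale = 0.01 * sample.std()                              # arbitrary, small
perturbed = [rate_mle(sample + rng.normal(scale=jitter_scale, size=sample.size))
             for _ in range(200)]

print("estimate on the original data         :", baseline)
print("spread of the estimate under jittering:", np.std(perturbed))
```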
Maximization of the likelihood may also be used for estimating several parameters simultaneously. This will be the case :
1) When two (or more) parameters of a univariate distribution are estimated simultaneously (for example, simultaneous estimation of the mean and of the variance of a normal distribution, see animation below).
2) When a (vector) parameter of a multivariate distribution is estimated. For example :
* Estimating the mean of a p-variate distribution is equivalent to the simultaneous estimation of p univariate parameters (the coordinates of the distribution mean).
* Estimating a covariance matrix involves the simultaneous estimation of its p(p + 1)/2 distinct coefficients (owing to the symmetry of the matrix).
-----
The situation is now a bit more complex than in the univariate case.
As in the univariate case (and with the same restrictions concerning the range), local extrema can be identified by setting to 0 all the partial derivatives of the likelihood function with respect to the components of the parameter. For example, if the vector parameter has two components θ1 and θ2, the extrema of the likelihood must verify ∂L/∂θ1 = 0 and ∂L/∂θ2 = 0.
The second order conditions certifying that a point of a twice continuously differentiable function L verifying the above first order conditions is indeed a maximum are somewhat more complicated than in the univariate case. They form, in fact, a set of two conditions:
1) At least one of the second partial derivatives of L with respect to the components of the parameter must be negative (not just non-positive): ∂²L/∂θi² < 0 for at least one i.
2) The determinant of the matrix of the second order partial derivatives of L must be positive (not just non-negative): (∂²L/∂θ1²)(∂²L/∂θ2²) - (∂²L/∂θ1∂θ2)² > 0.
This last condition is in practice fairly annoying as it usually leads to cumbersome calculations even in simple cases.
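When the analytical calculations are too cumbersome, the two conditions can at least be checked numerically. The sketch below (illustrative only, for the normal model with parameters μ and σ²) approximates the Hessian of the log-likelihood by central finite differences at the closed-form MLE; checking the conditions on log-L is equivalent to checking them on L at an interior critical point, because there Hess(L) = L · Hess(log-L) and L > 0.

```python
# A minimal sketch, for the illustrative normal model with parameters (mu,
# sigma^2): the Hessian of the log-likelihood is approximated by central
# finite differences at the closed-form MLE, and the two conditions are
# checked.  At an interior critical point this is equivalent to checking them
# on L itself, since Hess(L) = L * Hess(log-L) there and L > 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)

def log_lik(theta):
    mu, var = theta
    return -0.5 * x.size * np.log(2.0 * np.pi * var) - np.sum((x - mu) ** 2) / (2.0 * var)

# Closed-form MLE of (mu, sigma^2): sample mean and mean squared deviation
theta_hat = np.array([x.mean(), np.mean((x - x.mean()) ** 2)])

def hessian(f, t, h=1e-4):
    # Central finite-difference approximation of the matrix of second derivatives
    p = t.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            e_i, e_j = np.eye(p)[i] * h, np.eye(p)[j] * h
            H[i, j] = (f(t + e_i + e_j) - f(t + e_i - e_j)
                       - f(t - e_i + e_j) + f(t - e_i - e_j)) / (4.0 * h * h)
    return H

H = hessian(log_lik, theta_hat)
print("condition 1, some diagonal term negative:", np.any(np.diag(H) < 0))
print("condition 2, determinant positive       :", np.linalg.det(H) > 0)
```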
This animation illustrates the idea of maximizing the likelihood of a normal distribution when its two parameters (mean and variance) have to be estimated simultaneously.
The likelihood of the candidate distribution is the product of the heights of all the green connections from the sample points to the Gaussian curve. The posted value is the ratio of the current likelihood to the largest possible likelihood.
To fit the candidate normal distribution to the sample:
* Translate it by dragging the top of the curve with your mouse,
* Change its width (standard deviation) by dragging either side of the curve with your mouse.
Fine-tune the position and width of the curve by clicking and keeping your mouse button down:
* Above the top of the curve to make it taller (and therefore narrower),
* In the area below the curve to make it shorter (and therefore wider),
* On either side of the curve to translate it.
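For readers who prefer code to mouse work, here is a minimal sketch of the same fit (assuming SciPy and a simulated sample): the two parameters are adjusted simultaneously by a numerical optimizer, and the result is compared with the closed-form MLEs, the sample mean and the root mean squared deviation.

```python
# A minimal sketch of the same fit done numerically, assuming SciPy and a
# simulated sample: both parameters are adjusted at once by maximizing the
# log-likelihood, then compared with the closed-form MLEs.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=1.5, size=100)

def neg_log_lik(params):
    mu, log_sigma = params           # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return (0.5 * sample.size * np.log(2.0 * np.pi * sigma ** 2)
            + np.sum((sample - mu) ** 2) / (2.0 * sigma ** 2))

result = optimize.minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print("numerical MLE  :", mu_hat, sigma_hat)
print("closed-form MLE:", sample.mean(), np.sqrt(np.mean((sample - sample.mean()) ** 2)))
```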
Because of the important properties of MLEs (see below), Maximum Likelihood estimation is the premier choice for estimating the values of the parameters of a model ("fitting" the model to the data).
Yet, the most popular modeling technique, namely Linear Regression (Simple or Multiple), does not use Maximum Likelihood estimation, but rather Least Squares estimation. Why is that so?
In fact, it can be shown that, under the standard assumptions of linear regression (uncorrelated normal errors with identical variances), Least Squares estimation and Maximum Likelihood estimation always lead to identical estimates of the regression coefficients. Least Squares estimation is then preferred, mostly because of its useful geometric interpretation in terms of orthogonal projection.
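This equivalence is easy to verify numerically. The sketch below (simulated data, SciPy assumed) fits a simple linear regression both by Least Squares and by maximizing the normal log-likelihood; the slope and intercept agree to within the optimizer's tolerance.

```python
# A minimal sketch on simulated data, assuming SciPy: a simple linear
# regression is fitted by Least Squares and by maximizing the normal
# log-likelihood; the slope and intercept agree.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)          # true intercept 1, slope 2

slope_ls, intercept_ls = np.polyfit(x, y, deg=1)                # Least Squares fit

def neg_log_lik(params):                                        # y_i ~ N(a + b x_i, sigma^2)
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)
    residuals = y - (a + b * x)
    return (0.5 * y.size * np.log(2.0 * np.pi * sigma ** 2)
            + np.sum(residuals ** 2) / (2.0 * sigma ** 2))

result = optimize.minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")

print("Least Squares     :", intercept_ls, slope_ls)
print("Maximum Likelihood:", result.x[0], result.x[1])     # agree within optimizer tolerance
```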
This equivalence no longer holds for techniques for which the standard assumptions of Linear Regression are meaningless, as is the case for Logistic Regression or classification with Neural Networks. There, Maximum Likelihood estimation is just about the only operational technique for estimating the parameters of the model.
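As an illustration of what Maximum Likelihood estimation looks like in that setting, here is a minimal, NumPy-only sketch of a one-predictor logistic regression (simulated data; the step size and number of iterations are arbitrary illustrative choices): the log-likelihood is maximized by plain gradient ascent.

```python
# A minimal, NumPy-only sketch of Maximum Likelihood estimation for a
# one-predictor logistic regression (simulated data; step size and number of
# iterations are arbitrary illustrative choices): the log-likelihood is
# maximized by plain gradient ascent.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])                  # intercept + predictor
true_beta = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))  # simulated 0/1 responses

def log_likelihood(beta):
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))             # sum_i [y_i z_i - log(1 + e^z_i)]

beta = np.zeros(2)
for _ in range(2000):                                      # gradient ascent on log-L
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - p) / y.size                   # gradient of log-L is X'(y - p)

print("estimated coefficients   :", beta)                  # roughly recovers (-0.5, 2.0)
print("log-likelihood at optimum:", log_likelihood(beta))
```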
As the likelihood measures the quality of the fit between a distribution and a sample, it should be expected to play an important role in tests bearing on which of several candidate distributions generated the sample.
The simplest example of the use of the likelihood in tests is found in the Neyman-Pearson theorem, which states that the Best Critical Region for a test that has to decide between two candidate distributions is entirely determined by the likelihoods of these two distributions for the sample at hand.
So far, we have only convinced ourselves that maximizing the likelihood of a sample seems to be a reasonable way of estimating the value of the parameter of a distribution (or of a model), and we have also anticipated some technical difficulties in doing so. So why insist on Maximum Likelihood estimation?
It turns out that MLEs have very interesting properties, which we now state.
Suppose we identified θ*, the Maximum Likelihood estimator of the parameter θ. Suppose also that what we are really interested in is not θ, but rather a function of θ, say τ(θ). How can we find an estimator of τ(θ)? For example, is the MLE of a variance σ² of any help in identifying an estimator of σ?
It is. We'll show that for any function τ(.), if θ* is the Maximum Likelihood estimator of θ, then τ(θ*) is the Maximum Likelihood estimator of τ(θ).
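For the σ²/σ example just mentioned, the property boils down to the short sketch below (a simulated normal sample is assumed).

```python
# A minimal sketch on a simulated normal sample: the MLE of sigma^2 is the
# mean squared deviation, and by the invariance property the MLE of sigma is
# simply its square root.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=3.0, size=500)

var_mle = np.mean((x - x.mean()) ** 2)          # MLE of sigma^2
sigma_mle = np.sqrt(var_mle)                    # MLE of sigma = tau(sigma^2) with tau = sqrt

print("MLE of sigma^2:", var_mle)
print("MLE of sigma  :", sigma_mle)             # close to the true value 3
```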
For a given sample size, it can be shown that:
The strongest justification for Maximum Likelihood estimation may be found in the asymptotic (that is, for large samples) properties of MLEs.
1) Consistency
The least that can be expected from a statistic as a candidate estimator is to be consistent. We'll show that, under certain regularity conditions, a MLE is indeed consistent: for larger and larger samples, its variance tends to 0 and its expectation tends to the true value θ0 of the parameter.
2) Asymptotic normality
As the sample size grows without limit, we'll show that the distribution of a MLE converges to a normal distribution. Even for moderately large samples, the distribution of a MLE is approximately normal.
3) Asymptotic efficiency
Last but certainly not least, we'll show that, as the sample size grows without limit, the ratio of the variance of a MLE to the Cramér-Rao lower bound tends to 1. As a MLE is also asymptotically unbiased, it is therefore asymptotically efficient.
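These asymptotic properties can be watched at work in a small simulation. The sketch below (illustrative exponential model, for which the MLE of the rate is 1 over the sample mean and the Cramér-Rao bound is λ²/n) shows the ratio of the variance of the MLE to the bound drifting toward 1 as the sample size grows.

```python
# A minimal simulation sketch for the illustrative exponential model: the MLE
# of the rate is 1 / sample mean, and the Cramér-Rao lower bound for the rate
# is lambda^2 / n.  The ratio of the variance of the MLE to the bound drifts
# toward 1 as n grows, and a histogram of the estimates would look more and
# more normal.
import numpy as np

rng = np.random.default_rng(6)
true_rate = 0.5

for n in (10, 100, 1000):
    estimates = np.array([1.0 / rng.exponential(scale=1.0 / true_rate, size=n).mean()
                          for _ in range(20000)])
    cramer_rao = true_rate ** 2 / n
    print(f"n = {n:4d}   variance / Cramér-Rao bound = {estimates.var() / cramer_rao:.3f}")
```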
-----
Remember, though, that the asymptotic properties of an estimator, good as they may be, say nothing about the properties of this estimator for small samples, and there is no reason to believe that MLEs are particularly good estimators for small samples. In particular :
* Consistency implies asymptotic unbiasedness, but MLEs have no reason to be unbiased estimators and, more often than not, they are biased (for example, the MLE of the variance of a normal distribution is biased; see the small simulation after this list).
* Asymptotic efficiency implies the smallest possible variance for very large samples, but says nothing about the variance of a MLE for moderate size samples.
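As announced in the first point above, here is a small simulation of that bias (normal model, the MLE of the variance dividing by n): for very small samples its expectation falls visibly below the true variance, by a factor of about (n - 1)/n.

```python
# A minimal simulation sketch (normal model): for very small samples, the
# average of the variance MLEs falls short of the true variance by a factor
# of about (n - 1)/n.
import numpy as np

rng = np.random.default_rng(7)
n, true_var = 5, 4.0

var_mles = np.array([np.mean((s - s.mean()) ** 2)
                     for s in rng.normal(scale=2.0, size=(50000, n))])

print("true variance      :", true_var)
print("average of the MLEs:", var_mles.mean())   # about (n - 1)/n * 4 = 3.2
```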
Maximum Likelihood estimation is attractive because it is conceptually simple and receives an intuitive interpretation. Yet, a mathematically rigorous approach to the properties of MLEs is difficult, and invariably involves regularity conditions on the likelihood function that are difficult to establish, difficult to interpret, and difficult to check in real-life applications.
These regularity conditions cannot be casually ignored, and the already long history of Maximum Likelihood estimation is dotted with examples of severely pathological behaviors of MLEs, even for the most basic properties (e.g. consistency). So MLEs should certainly not be considered a magic solution to be selected without regard for other types of estimators.
_______________________________________________________________________
Tutorial 1
In this Tutorial, we show that a Maximum Likelihood estimator (MLE) is consistent.
More precisely, if we denote θ0 the (unknown) value of the estimated parameter, we'll show that no matter how small the positive number δ, the probability for the likelihood function to have a maximum in the interval ]θ0 - δ, θ0 + δ[ tends to 1 as the sample size grows without limit.
This result will be reached by the following line of reasoning: the likelihood function certainly has a maximum in this interval if its derivative is positive at θ0 - δ and negative at θ0 + δ (assuming that this derivative is continuous). We'll show that this is indeed the case with a probability larger than 1 - ε, however small ε, when the sample size grows without limit.
The demonstration will call on some results established when studying the Cramér-Rao lower bound.
A MAXIMUM LIKELIHOOD ESTIMATOR IS CONSISTENT
Outline:
* The Taylor expansion of the score
* Limits of the coefficients of the Taylor expansion
* The Weak Law of Large Numbers
* Limit of the zeroth order term
* Limit of the first order term
* Limit of the remainder
* Deterministic solution
* Probabilistic solution
* Conclusion
__________________________________________________
Tutorial 2
In this Tutorial, we show that a Maximum Likelihood estimator is asymptotically normally distributed: as the sample size grows without limit, the distribution of this MLE (more precisely, of the MLE centered on the true value θ0 and multiplied by n^1/2) gets closer and closer to a normal distribution whose variance we'll calculate (its mean is 0, in agreement with the consistency of the MLE established in the previous Tutorial).
The proof will use the same Taylor expansion of the score that was developed in the previous Tutorial. From this expansion, we'll derive an expression for the MLE whose limit distribution we'll calculate by resorting successively to several versions of Slutsky's theorem and, of course, to the Central Limit Theorem, whose role in the demonstration should certainly be anticipated.
-----
We finally address the issue of the efficiency of a MLE.
* As a MLE is consistent, it is asymptotically unbiased,
* And we'll show that the ratio of its variance to the Cramér-Rao lower bound tends to 1.
A MLE is therefore asymptotically efficient.
This last result will be a direct consequence of the demonstration of the asymptotic normality.
A MLE IS ASYMPTOTICALLY NORMAL AND EFFICIENT
Outline:
* Outline of the proof
* Limit distribution of the denominator
* First term
* Second term
* Slutsky and limit distribution of the denominator
* Limit distribution of the numerator
* Expectation
* Variance
* Central Limit Theorem
* Slutsky and asymptotic normality
* A MLE is asymptotically efficient
______________________________________________
Tutorial 3
In this Tutorial, we prove the so-called "invariance property" of Maximum Likelihood estimators. This property states that if θ* is the Maximum Likelihood estimator of the parameter θ, then, for any function τ(.), the MLE of τ(θ) is τ(θ*).
The case where the function τ(.) is one-to-one is pretty straightforward. The demonstration when τ(.) is not one-to-one is a bit more intricate.
It will appear that this result is in fact not really statistical in nature, but is rather a general statement about the maximization of a function.
INVARIANCE PROPERTIES OF MLEs
Outline:
* Function is one-to-one
* Function is not one-to-one
* Induced likelihood
* Maximum value of the induced likelihood
* Maximizing the induced likelihood
* General mathematical result
______________________________________________________