[Interactive animation]
This illustration shows a sample of n independent observations, and two continuous distributions f1(x) and f2(x), with f2(x) being just f1(x) translated by a certain amount.
Of these two distributions, which one is more likely to have generated the sample? Clearly, the answer is f1(x), and we would like to formalize this intuition.
Although this is not strictly impossible, we don't believe that f2(x) generated the sample, because all the observations lie in regions where the values of f2(x) are small: the probability for an observation to appear in such a region is small, and it is even more unlikely that all the observations in the sample would appear in low-density regions.
On the other hand, the values taken by f1(x) are substantial for all the observations, which are therefore where one would expect them to be if the sample had actually been generated by f1(x).
Of the many ways to quantify this intuitive judgement, one turns out to be remarkably effective. For any probability distribution f(x), just multiply the values of f(x) at each of the observations of the sample, denote the result L, and call it the likelihood of the distribution f(x) for this particular sample:
Likelihood = L = Πi f(xi),   i = 1, 2, ..., n
Clearly, the likelihood can have a large value only if all the observations are in regions where f(x) is not very small.
This definition has the additional advantage that L receives a natural interpretation. The sample {xi} may be regarded as a single observation generated by the n-variate probability distribution
f(x1, x2, ..., xn) = Πi f(xi)
because of the independence of the individual observations. So the likelihood of the distribution is just the value of the n-variate probability density f(x1, x2, ..., xn) for the set of observations in the sample considered as a single n-variate observation.
These considerations make us believe that "likelihood" might be a helpful concept for identifying the distribution that generated a given sample.
First note, though, that, as such, this approach is moot if we don't restrict our search a priori: the probability distribution leading to the largest possible value of the likelihood is obtained by assigning probability 1/n to each of the points where there is an observation, and the value 0 to f(x) everywhere else on the x axis. This result is both trivial and useless.
But consider the example given in the above illustration: f1(x) and f2(x) are assumed to belong to a family of distributions, all identical in shape and differing only by their position along the x axis (a location family). It now makes sense to ask for which position of the generic distribution f(x) the likelihood is largest. If we denote by θ the parameter adjusting the horizontal position of the distribution, one may consider the value of θ yielding the largest likelihood as probably being fairly close to the true (and unknown) value θ0 of the parameter of the distribution that actually generated the sample.
It then appears that the concept of likelihood may lead to a method of parameter estimation. The method consists in retaining, as an estimate of θ0, the value of θ that yields the largest possible value of the sample likelihood. This method is therefore called Maximum Likelihood estimation; it is, in fact, the most powerful and widely used method of parameter estimation these days.
An estimator θ* obtained by maximizing the likelihood of a probability distribution defined up to the value of a parameter θ is called a Maximum Likelihood estimator and is usually denoted "MLE".
When we need to emphasize the fact that the likelihood depends on both the sample x = {xi} and the parameter θ, we'll denote it L(x, θ).
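To make this concrete, here is a minimal sketch in Python (assuming, purely for illustration, that the family is a normal location family with unit variance, and a simulated sample): the likelihood L(x, θ) is computed as the product of the density values at the observations, and a simple grid search returns the θ that makes it largest.

```python
# A minimal sketch, assuming (for illustration only) a normal location family
# f(x) = N(theta, 1) and a simulated sample: L(x, theta) is the product of the
# density values at the observations, maximized by a simple grid search.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=30)          # true theta_0 = 2

def density(x, theta):
    # N(theta, 1) density, standing in for the generic location family
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)

def likelihood(theta):
    return np.prod(density(sample, theta))                # L(x, theta) = prod_i f(x_i)

grid = np.linspace(-5.0, 5.0, 2001)
theta_hat = grid[np.argmax([likelihood(t) for t in grid])]
print("theta with the largest likelihood:", theta_hat)    # close to the true value 2
```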
-----
We used a continuous probability distribution for illustrating the concept of Maximum Likelihood, but the principle is valid for any distribution, either discrete or continuous.
The likelihood is defined as a product, and maximizing a product is usually more difficult than maximizing a sum. But if a function L(θ) is changed into a new function L'(θ) by a strictly increasing transformation, then L(θ) and L'(θ) clearly reach their maximum values for the same value of θ.
In particular, if the increasing transformation is the logarithm, the maximization of a product is turned into the easier maximization of a sum.
The logarithm of the likelihood is called the log-likelihood, and will be denoted log-L. So, by definition:
Log-likelihood = log-L = Σi log f(xi),   i = 1, 2, ..., n
and the likelihood and log-likelihood reach their extrema for the same values of θ.
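The practical benefit of the logarithm is easy to see numerically. In the sketch below (same illustrative normal location family as above, but a much larger simulated sample), the product defining L underflows to 0.0 while the sum of logs remains perfectly usable, and both are maximized by the same θ.

```python
# A minimal sketch (same illustrative normal location family, larger simulated
# sample): the product defining L underflows to 0.0, while the sum of logs is
# perfectly usable, and both are maximized by the same theta.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=2000)

def log_density(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

grid = np.linspace(0.0, 4.0, 4001)
log_L = np.array([np.sum(log_density(sample, t)) for t in grid])
L = np.exp(log_L)                                  # every entry underflows to 0.0 here

print("largest L on the grid :", L.max())          # 0.0, so the product is unusable
print("theta maximizing log-L:", grid[np.argmax(log_L)])   # well defined, close to 2
```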
The principle of Maximum Likelihood estimation is straightforward enough, but its practice is fraught with difficulties, as is the case of any optimization problem.
From your high school days, you may remember that the extrema of a differentiable function L(θ) verify the condition dL(θ)/dθ = 0. As we just mentioned, this condition may equally well be applied to the log-likelihood log-L instead of the likelihood L.
So the most natural approach to maximizing a differentiable likelihood is to first solve this equation. Yet, this alone is far from solving the problem for a number of reasons.
* Even though most classical likelihoods are differentiable (with the important exception of the uniform distribution), there is no reason why the solutions of this equation should have simple analytical forms. As a matter of fact, more often than not, they don't, and it may then be necessary to resort to numerical techniques to identify the extrema of the likelihood function (as is typically the case, for example, with Logistic Regression; a sketch of such a numerical search is given after this list).
* The above equation identifies the extrema of L(θ), but says nothing about which of these extrema are maxima (which we are interested in) and which are minima (which we are not). To make things worse, some inflection points may also satisfy the equation. So, after the solutions of the equation have been identified, one must go through them and retain only those corresponding to maxima.
Computer optimization techniques can be adjusted so as to identify only maxima.
Recall that a genuine maximum also verifies d²L(θ)/dθ² < 0, a condition that has to be checked for every solution of the first equation.
* There is no reason why the likelihood would have a single maximum. So once the maxima have been found, only the largest among them is retained, provided it is within the allowed range of the parameter θ. It can be shown, though, that under certain regularity conditions, the probability for the likelihood function to have a unique maximum tends to 1 as the sample size grows without limit.
* The equation dL(θ)/dθ = 0 identifies extrema only within the interior of the range of θ. It is therefore ineffective in identifying extrema:
- That are on the boundary of the range of θ when this range is limited.
- Or that are "at infinity" (lower image of the above illustration). The likelihood then has no maximum.
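As announced above, here is a sketch of such a numerical search (assuming SciPy is available; the Cauchy location model is chosen purely as an illustration, because its MLE has no closed form and its likelihood may have several local maxima). Several starting points are tried and the solution with the largest log-likelihood is retained.

```python
# A minimal sketch, assuming SciPy (illustrative Cauchy location model, whose
# MLE has no closed form and whose likelihood may have several local maxima):
# the negative log-likelihood is minimized numerically from several starting
# points, and the solution with the largest log-likelihood is retained.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
sample = stats.cauchy.rvs(loc=3.0, size=50, random_state=rng)   # true location = 3

def neg_log_likelihood(theta):
    # Minimizing -log-L is equivalent to maximizing log-L
    return -np.sum(stats.cauchy.logpdf(sample, loc=theta))

candidates = []
for start in np.percentile(sample, [10, 25, 50, 75, 90]):       # several starting points
    result = optimize.minimize(neg_log_likelihood, x0=[start], method="Nelder-Mead")
    candidates.append((result.fun, result.x[0]))

best_neg_log_L, theta_mle = min(candidates)        # keep the largest log-likelihood
print("MLE of the location parameter:", theta_mle)
print("log-likelihood at the MLE    :", -best_neg_log_L)
```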
When the maximum of the likelihood is identified by numerical techniques, the issue of the validity of the solution thus found is crucial. The value resulting from intensive numerical computation may be quite sensitive to round-off errors, thus leading to estimated values of the parameter θ that may be substantially different from the value that would be obtained in the absence of computation errors. This is particularly true when the true value of the parameter is in a region where the likelihood varies very little with θ, or when the maximum is at infinity.
Along the same line of reasoning, it may happen that the value of the likelihood is extremely sensitive to small changes in the values of the observations. Because real-world observations are always somewhat uncertain, it is a good idea to check how the estimated value of the parameter changes when the values of the observations are slightly modified. If these small modifications lead to large variations in the estimated value of the parameter, the original estimate should be regarded with some suspicion.
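A possible way of carrying out this check is sketched below (an illustrative exponential model is assumed, whose MLE of the rate is simply 1 over the sample mean, and the size of the perturbations is arbitrary): the data are jittered repeatedly and the spread of the resulting estimates is examined.

```python
# A minimal sketch of the suggested check, assuming an illustrative exponential
# model (MLE of the rate = 1 / sample mean) and an arbitrary perturbation size:
# the observations are jittered repeatedly and the estimate is recomputed.
import numpy as np

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=40)                    # true rate = 0.5

def rate_mle(x):
    return 1.0 / np.mean(x)                                     # exponential rate MLE

baseline = rate_mle(sample)
jitter_scale = 0.01 * sample.std()                              # arbitrary, small
perturbed = [rate_mle(sample + rng.normal(scale=jitter_scale, size=sample.size))
             for _ in range(200)]

print("estimate on the original data         :", baseline)
print("spread of the estimate under jittering:", np.std(perturbed))
```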
Maximization of the likelihood may also be used for estimating several parameters simultaneously. This will be the case :
1) When two (or more) parameters of a univariate distribution are estimated simultaneously (for example, simultaneous estimation of the mean and of the variance of a normal distribution, see animation below).
2) When a (vector) parameter of a multivariate distribution is estimated. For example :
* Estimating the mean of a p-variate distribution is equivalent to the simultaneous estimation of p univariate parameters (the coordinates of the distribution mean).
* Estimating a covariance matrix involves the simultaneous estimation of its p(p + 1)/2 distinct coefficients (owing to the symmetry of the matrix).
-----
The situation is now a bit more complex than in the univariate case.
As in the univariate case (and with the same restrictions concerning the range), local extrema can be identified by setting to 0 all the partial derivatives of the likelihood function with respect to the components of the parameter. For example, if the vector parameter has two components θ1 and θ2, the extrema of the likelihood must verify ∂L/∂θ1 = 0 and ∂L/∂θ2 = 0.
The second order conditions certifying that a point of a twice continuously differentiable function L verifying the above first order conditions is indeed a maximum are somewhat more complicated than in the univariate case. They form, in fact, a set of two conditions:
1) At least one of the second partial derivatives of L with respect to the components of the parameter must be negative (not just non-positive): ∂²L/∂θi² < 0 for at least one i.
2) The determinant of the matrix of the second order partial derivatives of L must be positive (not just non-negative): (∂²L/∂θ1²)(∂²L/∂θ2²) - (∂²L/∂θ1∂θ2)² > 0.
This last condition is in practice fairly annoying as it usually leads to cumbersome calculations even in simple cases.
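When the analytical calculations are too cumbersome, the two conditions can at least be checked numerically. The sketch below (illustrative only, for the normal model with parameters μ and σ²) approximates the Hessian of the log-likelihood by central finite differences at the closed-form MLE; checking the conditions on log-L is equivalent to checking them on L at an interior critical point, because there Hess(L) = L · Hess(log-L) and L > 0.

```python
# A minimal sketch, for the illustrative normal model with parameters (mu,
# sigma^2): the Hessian of the log-likelihood is approximated by central
# finite differences at the closed-form MLE, and the two conditions are
# checked.  At an interior critical point this is equivalent to checking them
# on L itself, since Hess(L) = L * Hess(log-L) there and L > 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)

def log_lik(theta):
    mu, var = theta
    return -0.5 * x.size * np.log(2.0 * np.pi * var) - np.sum((x - mu) ** 2) / (2.0 * var)

# Closed-form MLE of (mu, sigma^2): sample mean and mean squared deviation
theta_hat = np.array([x.mean(), np.mean((x - x.mean()) ** 2)])

def hessian(f, t, h=1e-4):
    # Central finite-difference approximation of the matrix of second derivatives
    p = t.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            e_i, e_j = np.eye(p)[i] * h, np.eye(p)[j] * h
            H[i, j] = (f(t + e_i + e_j) - f(t + e_i - e_j)
                       - f(t - e_i + e_j) + f(t - e_i - e_j)) / (4.0 * h * h)
    return H

H = hessian(log_lik, theta_hat)
print("condition 1, some diagonal term negative:", np.any(np.diag(H) < 0))
print("condition 2, determinant positive       :", np.linalg.det(H) > 0)
```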
This animation illustrates the idea of maximizing the likelihood of a normal distribution when its two parameters (mean and variance) have to be estimated simultaneously.
The likelihood of the candidate distribution is the product of the heights of all the green connections from the sample points to the Gaussian curve. The posted value is the ratio of the current likelihood to the largest possible likelihood.
To fit the candidate normal distribution to the sample:
* Translate it by dragging the top of the curve with your mouse,
* Change its width (standard deviation) by dragging either side of the curve with your mouse.
Fine-tune the position and width of the curve by clicking and keeping your mouse button down:
* Above the top of the curve to make it taller (and therefore narrower),
* In the area below the curve to make it shorter (and therefore wider),
* On either side of the curve to translate it.
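For readers who prefer code to mouse work, here is a minimal sketch of the same fit (assuming SciPy and a simulated sample): the two parameters are adjusted simultaneously by a numerical optimizer, and the result is compared with the closed-form MLEs, the sample mean and the root mean squared deviation.

```python
# A minimal sketch of the same fit done numerically, assuming SciPy and a
# simulated sample: both parameters are adjusted at once by maximizing the
# log-likelihood, then compared with the closed-form MLEs.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=1.5, size=100)

def neg_log_lik(params):
    mu, log_sigma = params           # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return (0.5 * sample.size * np.log(2.0 * np.pi * sigma ** 2)
            + np.sum((sample - mu) ** 2) / (2.0 * sigma ** 2))

result = optimize.minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print("numerical MLE  :", mu_hat, sigma_hat)
print("closed-form MLE:", sample.mean(), np.sqrt(np.mean((sample - sample.mean()) ** 2)))
```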
Because of the important properties of MLEs (see below), Maximum Likelihood estimation is the premier choice for estimating the values of the parameters of a model ("fitting" the model to the data).
Yet, the most popular modeling technique, namely Linear Regression (Simple or Multiple), does not use Maximum Likelihood estimation, but rather Least Squares estimation. Why is that so?
In fact, it can be shown that, under the standard assumptions of linear regression (uncorrelated normal errors with identical variances), Least Squares estimation and Maximum Likelihood estimation always lead to identical estimates of the regression coefficients. Least Squares estimation is then preferred, mostly because of its useful geometric interpretation in terms of orthogonal projection.
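This equivalence is easy to verify numerically. The sketch below (simulated data, SciPy assumed) fits a simple linear regression both by Least Squares and by maximizing the normal log-likelihood; the slope and intercept agree to within the optimizer's tolerance.

```python
# A minimal sketch on simulated data, assuming SciPy: a simple linear
# regression is fitted by Least Squares and by maximizing the normal
# log-likelihood; the slope and intercept agree.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)          # true intercept 1, slope 2

slope_ls, intercept_ls = np.polyfit(x, y, deg=1)                # Least Squares fit

def neg_log_lik(params):                                        # y_i ~ N(a + b x_i, sigma^2)
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)
    residuals = y - (a + b * x)
    return (0.5 * y.size * np.log(2.0 * np.pi * sigma ** 2)
            + np.sum(residuals ** 2) / (2.0 * sigma ** 2))

result = optimize.minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")

print("Least Squares     :", intercept_ls, slope_ls)
print("Maximum Likelihood:", result.x[0], result.x[1])     # agree within optimizer tolerance
```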
This equivalence no longer holds for techniques for which the standard assumptions of Linear Regression are meaningless, as is the case for Logistic Regression or classification with Neural Networks. There, Maximum Likelihood estimation is just about the only operational technique for estimating the parameters of the model.
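As an illustration of what Maximum Likelihood estimation looks like in that setting, here is a minimal, NumPy-only sketch of a one-predictor logistic regression (simulated data; the step size and number of iterations are arbitrary illustrative choices): the log-likelihood is maximized by plain gradient ascent.

```python
# A minimal, NumPy-only sketch of Maximum Likelihood estimation for a
# one-predictor logistic regression (simulated data; step size and number of
# iterations are arbitrary illustrative choices): the log-likelihood is
# maximized by plain gradient ascent.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])                  # intercept + predictor
true_beta = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))  # simulated 0/1 responses

def log_likelihood(beta):
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))             # sum_i [y_i z_i - log(1 + e^z_i)]

beta = np.zeros(2)
for _ in range(2000):                                      # gradient ascent on log-L
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - p) / y.size                   # gradient of log-L is X'(y - p)

print("estimated coefficients   :", beta)                  # roughly recovers (-0.5, 2.0)
print("log-likelihood at optimum:", log_likelihood(beta))
```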
As the likelihood measures the quality of the fit between a distribution and a sample, it should be expected to play an important role in tests bearing on which of several candidate distributions generated the sample.
The simplest example of the use of the likelihood in tests is found in the Neyman-Pearson theorem, which states that the Best Critical Region for a test that has to decide between two candidate distributions is entirely determined by the likelihoods of these two distributions for the sample at hand.
So far, we have only convinced ourselves that maximizing the likelihood of a sample seems to be a reasonable way of estimating the value of the parameter of a distribution (or of a model), and we have also anticipated some technical difficulties in doing so. So why insist on Maximum Likelihood estimation?
It turns out that MLEs have very interesting properties, which we now state.
Suppose we identified θ*, the Maximum Likelihood estimator of the parameter θ. Suppose also that what we are really interested in is not θ, but rather a function of θ, say τ(θ). How can we find an estimator of τ(θ)? For example, is the MLE of a variance σ² of any help in identifying an estimator of σ?
It is. We'll show that for any function τ(.), if θ* is the Maximum Likelihood estimator of θ, then τ(θ*) is the Maximum Likelihood estimator of τ(θ).
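For the σ²/σ example just mentioned, the property boils down to the short sketch below (a simulated normal sample is assumed).

```python
# A minimal sketch on a simulated normal sample: the MLE of sigma^2 is the
# mean squared deviation, and by the invariance property the MLE of sigma is
# simply its square root.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=3.0, size=500)

var_mle = np.mean((x - x.mean()) ** 2)          # MLE of sigma^2
sigma_mle = np.sqrt(var_mle)                    # MLE of sigma = tau(sigma^2) with tau = sqrt

print("MLE of sigma^2:", var_mle)
print("MLE of sigma  :", sigma_mle)             # close to the true value 3
```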
For a given sample size, it can be shown that:
The strongest justification for Maximum Likelihood estimation may be found in the asymptotic (that is, for large samples) properties of MLEs.
1) Consistency
The least that can be expected from a statistic as a candidate estimator is to be consistent. We'll show that, under certain regularity conditions, a MLE is indeed consistent: for larger and larger samples, its variance tends to 0 and its expectation tends to the true value θ0 of the parameter.
2) Asymptotic normality
As the sample size grows without limit, we'll show that the distribution of a MLE converges to a normal distribution. Even for moderately large samples, the distribution of a MLE is approximately normal.
3) Asymptotic efficiency
Last but certainly not least, we'll show that, as the sample size grows without limit, the ratio of the variance of a MLE to the Cramér-Rao lower bound tends to 1. As a MLE is also asymptotically unbiased, it is therefore asymptotically efficient.
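These asymptotic properties can be watched at work in a small simulation. The sketch below (illustrative exponential model, for which the MLE of the rate is 1 over the sample mean and the Cramér-Rao bound is λ²/n) shows the ratio of the variance of the MLE to the bound drifting toward 1 as the sample size grows.

```python
# A minimal simulation sketch for the illustrative exponential model: the MLE
# of the rate is 1 / sample mean, and the Cramér-Rao lower bound for the rate
# is lambda^2 / n.  The ratio of the variance of the MLE to the bound drifts
# toward 1 as n grows, and a histogram of the estimates would look more and
# more normal.
import numpy as np

rng = np.random.default_rng(6)
true_rate = 0.5

for n in (10, 100, 1000):
    estimates = np.array([1.0 / rng.exponential(scale=1.0 / true_rate, size=n).mean()
                          for _ in range(20000)])
    cramer_rao = true_rate ** 2 / n
    print(f"n = {n:4d}   variance / Cramér-Rao bound = {estimates.var() / cramer_rao:.3f}")
```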
-----
Remember, though, that the asymptotic properties of an estimator, good as they may be, say nothing about the properties of this estimator for small samples, and there is no reason to believe that MLEs are particularly good estimators for small samples. In particular :
* Consistency implies asymptotic unbiasedness, but MLEs have no reason to be unbiased estimators and, more often than not, they are biased (for example, the MLE of the variance of a normal distribution is biased; see the small simulation after this list).
* Asymptotic efficiency implies the smallest possible variance for very large samples, but says nothing about the variance of a MLE for moderate size samples.
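As announced in the first point above, here is a small simulation of that bias (normal model, the MLE of the variance dividing by n): for very small samples its expectation falls visibly below the true variance, by a factor of about (n - 1)/n.

```python
# A minimal simulation sketch (normal model): for very small samples, the
# average of the variance MLEs falls short of the true variance by a factor
# of about (n - 1)/n.
import numpy as np

rng = np.random.default_rng(7)
n, true_var = 5, 4.0

var_mles = np.array([np.mean((s - s.mean()) ** 2)
                     for s in rng.normal(scale=2.0, size=(50000, n))])

print("true variance      :", true_var)
print("average of the MLEs:", var_mles.mean())   # about (n - 1)/n * 4 = 3.2
```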
Maximum Likelihood estimation is attractive because it is conceptually simple and receives an intuitive interpretation. Yet, a mathematically rigorous approach to the properties of MLEs is difficult, and invariably involves regularity conditions on the likelihood function that are difficult to establish, difficult to interpret, and difficult to check in real-life applications.
These regularity conditions cannot be casually ignored, and the already long history of Maximum Likelihood estimation is dotted with examples of severely pathological behaviors of MLEs, even for the most basic properties (e.g. consistency). So MLEs should certainly not be considered a magic solution to be selected without regard for other types of estimators.
_______________________________________________________________________
Tutorial 1
In this Tutorial, we show that a Maximum Likelihood estimator (MLE) is consistent.
More precisely, if we denote θ0 the (unknown) value of the estimated parameter, we'll show that no matter how small the positive number δ, the probability for the likelihood function to have a maximum in the interval ]θ0 - δ, θ0 + δ[ tends to 1 as the sample size grows without limit.
This result will be reached by the following line of reasoning: the likelihood function certainly has a maximum in this interval if its derivative is positive at θ0 - δ and negative at θ0 + δ (assuming that this derivative is continuous). We'll show that this is indeed the case with a probability larger than 1 - ε, however small ε, when the sample size grows without limit.
The demonstration will call on some results established when studying the Cramér-Rao lower bound.
A MAXIMUM LIKELIHOOD ESTIMATOR IS CONSISTENT
Outline:
* The Taylor expansion of the score
* Limits of the coefficients of the Taylor expansion
* The Weak Law of Large Numbers
* Limit of the zeroth order term
* Limit of the first order term
* Limit of the remainder
* Deterministic solution
* Probabilistic solution
* Conclusion
__________________________________________________
Tutorial 2
In this Tutorial, we show that a Maximum Likelihood estimator is asymptotically normally distributed: as the sample size grows without limit, the distribution of this MLE (more precisely, of the MLE centered on the true value θ0 and multiplied by n^1/2) gets closer and closer to a normal distribution whose variance we'll calculate (its mean is 0, in agreement with the consistency of the MLE established in the previous Tutorial).
The proof will use the same Taylor expansion of the score that was developed in the previous Tutorial. From this expansion, we'll derive an expression for the MLE whose limit distribution we'll calculate by resorting successively to several versions of Slutsky's theorem and, of course, to the Central Limit Theorem, whose role in the demonstration should certainly be anticipated.
-----
We finally address the issue of the efficiency of a MLE.
* As a MLE is consistent, it is asymptotically unbiased,
* And we'll show that the ratio of its variance to the Cramér-Rao lower bound tends to 1.
A MLE is therefore asymptotically efficient.
This last result will be a direct consequence of the demonstration of the asymptotic normality.
A MLE IS ASYMPTOTICALLY NORMAL AND EFFICIENT
Outline:
* Outline of the proof
* Limit distribution of the denominator
* First term
* Second term
* Slutsky and limit distribution of the denominator
* Limit distribution of the numerator
* Expectation
* Variance
* Central Limit Theorem
* Slutsky and asymptotic normality
* A MLE is asymptotically efficient
______________________________________________
Tutorial 3
In this Tutorial, we prove the so-called "invariance property" of Maximum Likelihood estimators. This property states that if θ* is the Maximum Likelihood estimator of the parameter θ, then, for any function τ(.), the MLE of τ(θ) is τ(θ*).
The case where the function τ(.) is one-to-one is pretty straightforward. The demonstration when τ(.) is not one-to-one is a bit more intricate.
It will appear that this result is in fact not really statistical in nature, but is rather a general statement about the maximization of a function.
INVARIANCE PROPERTIES OF MLEs
Outline:
* Function is one-to-one
* Function is not one-to-one
* Induced likelihood
* Maximum value of the induced likelihood
* Maximizing the induced likelihood
* General mathematical result
______________________________________________________