17 Estimation Methods
17.1 Objectives
Apply the method of moments to estimate parameters or sets of parameters from given data.
Derive the likelihood function for a given random sample from a distribution.
Derive a maximum likelihood estimate of a parameter or set of parameters.
Calculate and interpret the bias of an estimator by analyzing its expected value relative to the true parameter.
17.2 Transitioning from probability models to statistical models
We started this book with descriptive models of data and then moved on to probability models. In these probability models, we characterized experiments and random processes using both theory and simulation. These models describe a population and are used to make statements about samples and data drawn from that population. For example, suppose we flip a fair coin 10 times and record the number of heads. The population is the collection of all possible outcomes of this experiment; in this case, the population is infinite, as we could repeat the experiment without limit. If we assume, that is model, the number of heads follows a binomial distribution, we know the exact distribution of the outcomes. For example, we know that exactly 24.61% of the time we will obtain 5 heads out of 10 flips of a fair coin. We can also use the model to characterize the variability: how often the number of heads will differ from 5, and by how much. However, these probability models depend heavily on their assumptions and on the values of their parameters.
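As a quick check of the 24.61% figure, here is a minimal sketch; Python is our illustrative choice of language here, not necessarily what the rest of the book uses:

```python
from math import comb

# P(X = 5) when X ~ Binomial(n = 10, p = 0.5)
prob = comb(10, 5) * 0.5**5 * 0.5**5
print(prob)  # 0.24609375, i.e., about 24.61%
```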
From this point on in the book, we will focus on statistical models. Statistical models describe one or more variables and their relationships. We use these models to make decisions about the population, to predict future outcomes, or both. Often we don’t know the true underlying process; all we have is a sample of observations and perhaps some context. Using inferential statistics, we can draw conclusions about the underlying process. For example, suppose we are given a coin and we don’t know whether it is fair. So, we flip it a number of times to obtain a sample of outcomes. We can use that sample to decide whether the coin could be fair.
In some sense, we've already explored some of these concepts. In our simulation examples, we have drawn observations from a population of interest and used those observations to estimate characteristics of that population or of a related random process. For example, we used simulation to explore the behavior of random variables when exact calculations were difficult.
Statistical models and probability models are not separate. In statistical models we find relationships, the explained portion of variation, and use probability models to describe the remaining random variation. Figure 17.1 illustrates this relationship between the two types of models. In the first part of our studies, we will use univariate data in statistical models to estimate the parameters of a probability model. From there we will develop more sophisticated models, including multivariate models.
17.3 Estimation
Recall that in probability models, we have complete information about the population and we use that to describe the expected behavior of samples from that population. In statistics we are given a sample from a population about which we know little or nothing.
In this chapter, we will discuss estimation. Given a sample, we would like to estimate population parameters. There are several ways to do that. We will discuss two methods: method of moments and maximum likelihood.
17.4 Method of Moments
Recall that earlier we discussed moments. We can refer to both population moments and sample moments.

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a population. The $k$th population moment is

$$\mu_k = \mathrm{E}(X^k), \quad k = 1, 2, \ldots$$

The value of the $k$th sample moment is

$$m_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k, \quad k = 1, 2, \ldots$$

We can use the sample moments to estimate the population moments; since the population moments are usually functions of a distribution's parameters, setting the sample moments equal to the population moments and solving for the parameters yields method of moments estimates.
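To make the sample moments concrete, here is a minimal sketch (Python, with made-up numbers purely for illustration):

```python
import numpy as np

# A small illustrative sample (made-up values)
x = np.array([2.1, 3.4, 1.7, 4.0, 2.9])

m1 = np.mean(x)       # first sample moment, the sample mean
m2 = np.mean(x**2)    # second sample moment
print(m1, m2)
```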
This is all technical, so let’s look at an example.
Example:
Suppose $x_1, x_2, \ldots, x_n$ is an i.i.d. (independent and identically distributed) sample from a uniform distribution $\mathrm{Unif}(0, \theta)$, and we don't know $\theta$. That is, our data consists of positive random numbers, but we don't know the upper bound. Find the method of moments estimator for $\theta$, the upper bound.
We know that if $X \sim \mathrm{Unif}(0, \theta)$, then the first population moment is

$$\mathrm{E}(X) = \frac{\theta}{2}.$$

Our best guess for the first population moment ($\mathrm{E}(X)$) is the first sample moment, $\bar{x}$. Setting the two equal,

$$\frac{\hat{\theta}}{2} = \bar{x},$$

and solving for the parameter yields the method of moments estimate $\hat{\theta} = 2\bar{x}$.
Note that we could have used the second moment about the mean (the variance) as well. This is less intuitive but still applicable. In this case we know that if $X \sim \mathrm{Unif}(0, \theta)$, then

$$\mathrm{Var}(X) = \frac{\theta^2}{12}.$$

Setting this equal to the second sample moment about the mean, $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, and solving yields a different estimate, $\hat{\theta} = \sqrt{\frac{12}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
To decide which estimator is better, we need a criterion for comparison. This is beyond the scope of this book, but two common criteria are unbiasedness and minimum variance.
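A minimal simulation sketch comparing the two estimators on data with a known upper bound (Python; the seed, sample size, and true $\theta = 10$ are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(20)
theta = 10.0                          # true upper bound, known only to the simulation
x = rng.uniform(0, theta, size=1000)

# Estimator from the first moment: E(X) = theta / 2
theta_hat_mean = 2 * x.mean()

# Estimator from the variance: Var(X) = theta^2 / 12
theta_hat_var = np.sqrt(12 * x.var())   # np.var divides by n, matching the method of moments

print(theta_hat_mean, theta_hat_var)    # both should be near 10
```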
The method of moments can be used to estimate more than one parameter as well. We simply would have to incorporate higher order moments.
Example:
Suppose we take an i.i.d. sample from the normal distribution with parameters $\mu$ and $\sigma$. Find method of moments estimates of $\mu$ and $\sigma^2$.
First, we remember that we know two population moments for the normal distribution:

$$\mathrm{E}(X) = \mu \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2,$$

where the variance is the second moment about the mean. Setting these equal to the corresponding sample moments yields:

$$\hat{\mu} = \bar{x} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Again, we notice that the estimate for $\sigma^2$ uses a divisor of $n$ rather than $n - 1$, so it is not quite the familiar sample variance $s^2$; the two estimators differ in their bias.
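A short sketch of these estimates on simulated data (Python; the true values $\mu = 5$ and $\sigma = 2$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=5, scale=2, size=500)   # assumed true values: mu = 5, sigma = 2

mu_hat = x.mean()                          # first sample moment
sigma2_hat = np.mean((x - mu_hat)**2)      # divides by n, not n - 1
print(mu_hat, sigma2_hat)                  # near 5 and 4
```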
Exercise:
You shoot 25 free throws and make 21. Assuming a binomial model fits, find a method of moments estimate of the probability of making a free throw.
There are two ways to approach this problem, depending on how we define the random variable. In the first case we will use a binomial random variable, $X$, the number of made free throws out of $n = 25$ attempts. The first population moment is $\mathrm{E}(X) = np = 25p$. Setting this equal to the observed count, $25\hat{p} = 21$, and solving yields $\hat{p} = \frac{21}{25} = 0.84$.

A second approach is to let $X_i$ be a Bernoulli random variable for the $i$th attempt, equal to 1 if the free throw is made and 0 otherwise. Then $\mathrm{E}(X_i) = p$, and the first sample moment is $\bar{x} = \frac{21}{25}$. Setting the two equal yields the same estimate, $\hat{p} = 0.84$.
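Both approaches reduce to the same arithmetic, as this minimal sketch shows (Python, illustrative only):

```python
# Approach 1: one binomial count, 21 makes in n = 25 attempts
n, made = 25, 21
p_hat = made / n                      # from E(X) = n * p

# Approach 2: 25 Bernoulli outcomes (1 = make, 0 = miss)
shots = [1] * 21 + [0] * 4
p_hat_bern = sum(shots) / len(shots)  # from E(X_i) = p

print(p_hat, p_hat_bern)              # both 0.84
```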
17.5 Maximum likelihood
Recall that using method of moments involves finding values of the parameters that cause the population moments to be equal to the sample moments. Solving for the parameters yields method of moments estimates.
Next we will discuss one more estimation method, maximum likelihood estimation. In this method, we are finding values of parameters that would make the observed data most “likely”. In order to do this, we first need to introduce the likelihood function.
17.5.1 Likelihood Function
Suppose $x_1, x_2, \ldots, x_n$ is an i.i.d. sample from a distribution with pmf or pdf $f(x; \theta)$, where $\theta$ is an unknown parameter (or vector of parameters).

The likelihood function is denoted as $L(\theta; x_1, x_2, \ldots, x_n)$ and, because the observations are independent, is given by

$$L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta).$$
The likelihood function is really the pmf/pdf, except that instead of the variables being random and the parameter(s) fixed, the values of the variables are known and the parameter(s) are unknown. A note on notation: we are using the semicolon in the pdf and the likelihood function to denote what is known or given. In the pmf/pdf the parameters are known and thus follow the semicolon; the opposite is the case in the likelihood function.
Let’s do an example to help understand these ideas.
Example:
Suppose we are presented with a coin and are unsure of its fairness. We toss the coin 50 times and obtain 18 heads and 32 tails. Let $\pi$ be the probability that a coin flip results in heads (we could use $p$, but we are getting you used to the two different common ways to represent a binomial parameter). What is the likelihood function of $\pi$?
This is a binomial process, but each individual coin flip can be thought of as a Bernoulli experiment. That is, each flip $X_i$ has pmf

$$f(x_i; \pi) = \pi^{x_i}(1 - \pi)^{1 - x_i}, \quad x_i \in \{0, 1\},$$

where $x_i = 1$ corresponds to heads. Notice this makes sense:

$$f(1; \pi) = \pi$$

and

$$f(0; \pi) = 1 - \pi.$$

Generalizing for any sample size $n$, the likelihood function is

$$L(\pi; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \pi^{x_i}(1 - \pi)^{1 - x_i} = \pi^{\sum x_i}(1 - \pi)^{n - \sum x_i}.$$

For our example,

$$L(\pi; x_1, x_2, \ldots, x_{50}) = \pi^{18}(1 - \pi)^{32},$$

which makes sense because we had 18 successes (heads) and 32 failures (tails). The likelihood function is a function of the unknown parameter $\pi$.
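As a sketch, we can evaluate this likelihood on a grid of candidate values and locate its maximum numerically (Python is our illustrative choice; the grid resolution is an assumption):

```python
import numpy as np

pi = np.linspace(0.001, 0.999, 999)      # grid of candidate parameter values
likelihood = pi**18 * (1 - pi)**32       # L(pi; x) = pi^18 * (1 - pi)^32

print(pi[np.argmax(likelihood)])         # approximately 0.36
```

Plotting `likelihood` against `pi` would reproduce a curve like the one in Figure 17.2.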
17.5.2 Maximum Likelihood Estimation
Once we have a likelihood function $L(\theta; x_1, \ldots, x_n)$, we need to find the value of $\theta$ that maximizes it. That value, denoted $\hat{\theta}$, is called the maximum likelihood estimate (MLE) of $\theta$.

Most of the time (but not always), this will involve simple optimization through calculus (i.e., take the derivative with respect to the parameter, set it equal to 0, and solve for the parameter). When maximizing the likelihood function through calculus, it is often easier to maximize the log of the likelihood function, denoted as

$$\ell(\theta; x_1, \ldots, x_n) = \log L(\theta; x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f(x_i; \theta),$$

because now we can take the derivative of a sum instead of a product, thus making it much easier. Since the logarithm is an increasing function, the value of $\theta$ that maximizes $\ell$ also maximizes $L$.
Example:
Continuing our example, find the maximum likelihood estimator for $\pi$.

Recall that our likelihood function is

$$L(\pi; x) = \pi^{18}(1 - \pi)^{32}.$$

Figure 17.2 is a plot of the likelihood function as a function of the unknown parameter $\pi$.
By visual inspection, the value of $\pi$ that maximizes the likelihood function appears to be a little less than 0.4.
To maximize by mathematical methods, we need to take the derivative of the likelihood function with respect to $\pi$, set it equal to 0, and solve.

We can find the derivative of the likelihood function by applying the product rule:

$$\frac{dL}{d\pi} = 18\pi^{17}(1 - \pi)^{32} - 32\pi^{18}(1 - \pi)^{31}.$$

We could simplify this, set it to 0, and solve for $\pi$. However, it is easier to work with the log-likelihood function,

$$\ell(\pi; x) = 18\log(\pi) + 32\log(1 - \pi).$$

Now, taking the derivative does not require the product rule:

$$\frac{d\ell}{d\pi} = \frac{18}{\pi} - \frac{32}{1 - \pi}.$$

Setting equal to 0 yields:

$$\frac{18}{\pi} = \frac{32}{1 - \pi}.$$

Solving for $\pi$ gives $18(1 - \pi) = 32\pi$, so $50\pi = 18$ and thus $\hat{\pi} = \frac{18}{50}$.
Note that technically, we should confirm that the function is concave down at our critical value, ensuring that $\hat{\pi}$ is indeed a maximum. We can do this by checking the second derivative of the log-likelihood:

$$\frac{d^2\ell}{d\pi^2} = -\frac{18}{\pi^2} - \frac{32}{(1 - \pi)^2}.$$

This value is negative for all relevant values of $\pi$ (that is, $0 < \pi < 1$), so our critical value is indeed a maximum.

In the case of our example (18 heads out of 50 trials), the maximum likelihood estimate is $\hat{\pi} = \frac{18}{50} = 0.36$.
This seems to make sense. Our best guess for the probability of heads is the number of observed heads divided by the number of trials. That was a great deal of algebra and calculus for what appears to be an obvious answer. However, in more difficult problems, it is not as obvious what to use for an MLE.
17.5.3 Numerical Methods - OPTIONAL
When obtaining MLEs, there are times when analytical methods (calculus) are not feasible or do not yield a closed-form solution. In the Pruim book (Pruim 2011), there is a good example using data from Old Faithful at Yellowstone National Park. We leave that example to the interested learner.
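To give a flavor of the numerical approach, here is a minimal sketch that maximizes our coin log-likelihood with an off-the-shelf optimizer; Python and scipy are illustrative choices, and in practice this technique matters for likelihoods without closed-form maxima:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Negative log-likelihood for the coin example (18 heads, 32 tails).
# Optimizers conventionally minimize, so we negate the log-likelihood.
def neg_log_lik(p):
    return -(18 * np.log(p) + 32 * np.log(1 - p))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # approximately 0.36, matching the calculus answer
```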