26  Analysis of Variance

26.1 Objectives

  1. Conduct and interpret a hypothesis test for equality of two or more means using both permutation and the \(F\) distribution. Evaluate the assumptions of ANOVA.

26.2 Introduction

In the last chapter, we learned about the chi-squared distribution, and used both mathematically derived tests (Pearson’s chi-squared test) and randomization tests to determine whether two categorical variables are independent. We also examined settings with one categorical and one numerical variable, testing for equality of means and equality of variances in two samples.

In this chapter, we will learn how to compare more than two means simultaneously.

26.3 Comparing more than two means

In contrast to last chapter, we now want to compare means across more than two groups. Again, we have two variables, where one is continuous (numerical) and the other is categorical. We might initially think to do pairwise comparisons, such as two-sample t-tests, as a solution. Suppose we have three groups for which we want to compare means. We might be tempted to compare the first mean with the second, then the first mean with the third, and finally compare the second and third means, for a total of three comparisons. However, this strategy can be treacherous. If we have many groups and do many comparisons, the Type 1 error rate is inflated, and it is likely that we will eventually find a difference just by chance, even if there is no difference in the populations.
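To see the inflation concretely, suppose each of the three comparisons is conducted at \(\alpha = 0.05\) and, for simplicity, that the comparisons are independent (they are not quite, but this illustrates the point). The chance of at least one false positive is then easy to compute:

# probability of at least one Type 1 error across m independent tests
m <- 3
1 - (1 - 0.05)^m
[1] 0.142625

That is already about 14%, nearly triple the nominal 5%, and the problem grows quickly with the number of groups, since the number of pairwise comparisons grows quadratically.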

In this chapter, we will learn a new method called analysis of variance (ANOVA) and a new test statistic called the \(F\) statistic. ANOVA uses a single hypothesis test to determine whether the means across many groups are equal. The hypotheses are:

\(H_0\): The mean outcome is the same across all groups. In statistical notation, \(\mu_1 = \mu_2 = \cdots = \mu_k\) where \(\mu_i\) represents the mean of the outcome for observations in category \(i\).
\(H_A\): At least one mean is different.

Generally we must check three conditions on the data before performing ANOVA with the \(F\) distribution:

  1. the observations are independent within and across groups,
  2. the data within each group are nearly normal, and
  3. the variability across the groups is about equal.

When these three conditions are met, we may perform an ANOVA, using the \(F\) distribution, to determine whether the data provide strong evidence against the null hypothesis that all the group means, \(\mu_i\), are equal.

26.3.1 MLB batting performance

Let’s revisit the MLB batting performance example. We would like to discern whether there are real differences between the batting performance of baseball players according to their position. We will now consider all four positions from the dataset: outfielder (OF), infielder (IF), designated hitter (DH), and catcher (C). The data is available in the mlb_obp.csv file. As a reminder, batting performance is measured by on-base percentage.

Read the data into R.

# assumes the tidyverse (for read_csv) and the mosaic package are loaded
mlb_obp <- read_csv("data/mlb_obp.csv")

Let’s review our data:

inspect(mlb_obp)

categorical variables:  
      name     class levels   n missing
1 position character      4 327       0
                                   distribution
1 IF (47.1%), OF (36.7%), C (11.9%) ...        

quantitative variables:  
  name   class   min    Q1 median     Q3   max     mean         sd   n missing
1  obp numeric 0.174 0.309  0.331 0.3545 0.437 0.332159 0.03570249 327       0

Next, change the variable position to a factor to give us greater control.

mlb_obp <- mlb_obp %>%
  mutate(position = as.factor(position))

Let’s look at summary statistics of the on-base percentage by position, this time considering all four positions.

favstats(obp ~ position, data = mlb_obp)
  position   min      Q1 median      Q3   max      mean         sd   n missing
1        C 0.219 0.30000 0.3180 0.35700 0.405 0.3226154 0.04513175  39       0
2       DH 0.287 0.31625 0.3525 0.36950 0.412 0.3477857 0.03603669  14       0
3       IF 0.174 0.30800 0.3270 0.35275 0.437 0.3315260 0.03709504 154       0
4       OF 0.265 0.31475 0.3345 0.35300 0.411 0.3342500 0.02944394 120       0

The means for each group are pretty similar to each other.

Exercise: The null hypothesis under consideration is the following: \(\mu_{OF} = \mu_{IF} = \mu_{DH} = \mu_{C}\). Write the null and corresponding alternative hypotheses in plain language.1

Exercise: Construct side-by-side boxplots.

Figure 26.1 shows the side-by-side boxplots for all four positions.

Figure 26.1: Boxplots of on-base percentage by position played.
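For reference, one way to construct Figure 26.1 is sketched below, in the same ggformula style used throughout this book; the theme and axis labels are our own choices.

mlb_obp %>%
  gf_boxplot(obp ~ position) %>%
  gf_theme(theme_classic()) %>%
  gf_labs(x = "Position", y = "On-base percentage")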

The largest difference between the sample means appears to be between the designated hitter and the catcher positions. Consider again the original hypotheses:

\(H_0\): \(\mu_{OF} = \mu_{IF} = \mu_{DH} = \mu_{C}\)
\(H_A\): The average on-base percentage (\(\mu_i\)) varies across some (or all) groups.

Thought question: Why might it be inappropriate to run the test by simply estimating whether the difference of \(\mu_{DH}\) and \(\mu_{C}\) is statistically significant at an \(\alpha = 0.05\) significance level?

The primary issue here is that we are inspecting the data before picking the groups that will be compared. It is inappropriate to examine all data by eye (informal testing) and only afterwards decide which parts to formally test. This is called data snooping or data fishing. Naturally, we would pick the groups with the largest differences for the formal test, leading to an inflation in the Type 1 error rate. To understand this better, let’s consider a slightly different problem.

Suppose we measure the aptitude of students in 20 classes of a large elementary school at the beginning of the year. In this school, all students are randomly assigned to classrooms, so any differences we observe between the classes at the start of the year are completely due to chance. However, with so many groups, we will probably observe a few groups that look rather different from each other. If we select only the classes that look really different, we will probably draw the wrong conclusion that the assignment wasn't random. While we might only formally test differences for a few pairs of classes, we informally evaluated the other classes by eye before choosing the most extreme cases for comparison.

In the next section, we will learn how to use the \(F\) statistic and ANOVA to test whether observed differences in means could have happened just by chance, even if there was no true difference in the respective population means.

26.3.2 Analysis of variance (ANOVA) and the F test

The method of analysis of variance (ANOVA) in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone? This question is different from earlier testing procedures because we will simultaneously consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation. We call this variability the mean square between groups (\(MSG\)), and it has an associated degrees of freedom, \(df_{G} = k - 1\), when there are \(k\) groups. The \(MSG\) can be thought of as a scaled variance formula for means. If the null hypothesis is true, any variation in the sample means is due to chance and shouldn’t be too large. We typically use software to find the \(MSG\); however, the mathematical derivation follows. Let \(\bar{x}_i\) represent the mean outcome for observations in group \(i\), and let \(\bar{x}\) represent the mean outcome across all groups. Then, the mean square between groups is computed as

\[ MSG = \frac{1}{df_{G}}SSG = \frac{1}{k - 1}\sum_{i = 1}^{k} n_{i}\left(\bar{x}_{i} - \bar{x}\right)^2, \]

where \(SSG\) is called the sum of squares between groups and \(n_{i}\) is the sample size of group \(i\).

The mean square between the groups is, on its own, quite useless in a hypothesis test. We need a benchmark value for how much variability is expected among the sample means, if the null hypothesis is true. To this end, we compute a pooled variance estimate, often abbreviated as the mean squared error (\(MSE\)), which has an associated degrees of freedom of \(df_E = n - k\). It is helpful to think of the \(MSE\) as a measure of the variability within the groups. To find the \(MSE\), the sum of squares total (\(SST\)) is computed as

\[SST = \sum_{i = 1}^{n} \left(x_{i} - \bar{x}\right)^2,\]

where the sum is over all observations in the data set. Then we compute the sum of squared errors (\(SSE\)) in one of two equivalent ways:

\[ SSE = SST - SSG = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2, \]

where \(s_i^2\) is the sample variance (square of the standard deviation) of the observations in group \(i\). Then the \(MSE\) is the standardized form of \(SSE\):

\[MSE = \frac{1}{df_{E}}SSE\]

When the null hypothesis is true, any differences among the sample means are only due to chance, and the \(MSG\) and \(MSE\) should be about equal. For ANOVA, we examine the ratio of \(MSG\) and \(MSE\) in the \(F\) test statistic:

\[F = \frac{MSG}{MSE}\]

The \(MSG\) represents a measure of the between-group variability, and the \(MSE\) measures the variability within each of the groups. Using a randomization test, we could also use the difference of the mean squares, \(MSG - MSE\), as a test statistic instead of the ratio.
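To make these formulas concrete, we can compute \(SSG\), \(SSE\), and the \(F\) statistic by hand from the group summary statistics. The sketch below assumes the mosaic package is loaded; it should reproduce the \(F\) statistic that software reports later in this section.

# group-level summaries: n, mean, and sd for each position
stats <- favstats(obp ~ position, data = mlb_obp)
k <- nrow(stats)                     # number of groups
n <- sum(stats$n)                    # total sample size
xbar <- mean(~ obp, data = mlb_obp)  # grand mean

SSG <- sum(stats$n * (stats$mean - xbar)^2)  # sum of squares between groups
SSE <- sum((stats$n - 1) * stats$sd^2)       # sum of squared errors (within)

MSG <- SSG / (k - 1)  # mean square between groups
MSE <- SSE / (n - k)  # mean squared error
MSG / MSE             # the F statistic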

We can use the \(F\) statistic to evaluate the hypotheses in what is called an F test. A \(p\)-value can be computed from the \(F\) statistic using an \(F\) distribution, which has two associated parameters: \(df_{1}\) and \(df_{2}\). For the \(F\) statistic in ANOVA, \(df_{1} = df_{G}\) and \(df_{2} = df_{E}\). The \(F\) statistic is really a ratio of two independent chi-squared random variables, each divided by its degrees of freedom.

The \(F\) distribution, shown in Figure 26.2, takes on positive values and is right skewed, like the chi-squared distribution. The larger the observed variability in the sample means (\(MSG\)) relative to the within-group observations (\(MSE\)), the larger \(F\) will be and the stronger the evidence against the null hypothesis. Because larger values of \(F\) represent stronger evidence against the null hypothesis, we use the upper tail of the distribution to compute a \(p\)-value.

Figure 26.2: The F distribution.
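A plot like Figure 26.2 can be generated with gf_dist() from the mosaic family of packages. The sketch below uses the degrees of freedom from the MLB example purely for illustration.

# density curve of the F distribution with df1 = 3 and df2 = 323
gf_dist("f", df1 = 3, df2 = 323)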

The \(F\) statistic and the \(F\) test
Analysis of variance (ANOVA) is used to test whether the mean outcome differs across two or more groups. ANOVA uses a test statistic \(F\), which represents a standardized ratio of variability in the sample means (across the groups), relative to the variability within the groups. If \(H_0\) is true and the model assumptions are satisfied, the statistic \(F\) follows an \(F\) distribution with parameters \(df_{1} = k - 1\) and \(df_{2} = n - k\). The upper tail of the \(F\) distribution is used to represent the \(p\)-value.

26.3.2.1 ANOVA

We will use R to perform the calculations for the ANOVA. But let’s check our assumptions first.

There are three conditions we must check before conducting an ANOVA: 1) all observations must be independent, 2) the data in each group must be nearly normal, and 3) the variance within each group must be approximately equal.

Independence
All observations must be independent. More specifically, observations must be independent within and across groups. If the data are a simple random sample from less than 10% of the population, this assumption is reasonable. For processes and experiments, we must carefully consider whether the observations may be dependent (e.g., paired data). In our MLB data, the data were not sampled; they consist of all players from the 2010 season with at least 200 at bats. However, there are no obvious reasons why independence would not hold for most or all observations, given our intended population is all MLB seasons (or something similar). This requires a bit of hand waving, but remember that independence is often difficult to assess.

Approximately normal
As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is quite small. When we have larger data sets (and larger groups) and there are no extreme outliers, the normality condition is not usually a concern. The normal probability plots (quantile-quantile plots) for each group of the MLB data are shown below; there is some deviation from normality for infielders, but this isn't a substantial concern since there are over 150 observations in that group and the outliers are not extreme. Sometimes in ANOVA, there are so many groups or so few observations per group that checking normality for each group isn't reasonable. One solution is to combine the groups into one set of data. First, calculate the residuals by subtracting the corresponding group mean from each observed value. For example, an outfielder with an OBP of 0.435 would have a residual of \(0.435 - \bar{x}_{OF} = 0.101\). Then, to check the normality condition, create a single normal probability plot using all the residuals simultaneously; a code sketch follows Figure 26.3.

Figure 26.3 is the quantile-quantile plot to assess the normality assumption.

Figure 26.3: Quantile-quantile plots of on-base percentage by position.
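The residual approach described above might be implemented as follows; this is a sketch, and the column name resid is our own.

# subtract each group's mean to get residuals, then assess normality of
# all residuals in a single quantile-quantile plot
mlb_obp %>%
  group_by(position) %>%
  mutate(resid = obp - mean(obp)) %>%
  ungroup() %>%
  gf_qq(~resid) %>%
  gf_qqline(~resid)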

Constant variance
The final assumption is that the variance within the groups is about equal from one group to the next. This assumption is important to check, especially when the sample sizes vary widely across the groups. The constant variance assumption can be checked by examining a side-by-side boxplot of the outcomes across the groups, which we did previously in Figure 26.1. In this case, the variability is similar across the four groups but not identical. We also see in the output of favstats that the standard deviation varies a bit from one group to the next. Whether these differences are from natural variation is unclear, so we should report the uncertainty of meeting this assumption when the final results are reported. The permutation test does not have this assumption and can be used as a check on the results from the ANOVA.

In summary, independence is always important to an ANOVA analysis, but is often difficult to assess. The normality condition is very important when the sample sizes for each group are relatively small. The constant variance condition is especially important when the sample sizes differ between groups.

Let’s write the hypotheses for the MLB example again:

\(H_0\): The average on-base percentage is equal across the four positions.
\(H_A\): The average on-base percentage varies across some (or all) groups.

The test statistic is the ratio of the between-groups variance (mean square between groups, \(MSG\)) and the pooled within-group variance (mean squared error, \(MSE\)). We perform ANOVA in R using the aov() function, and use the summary() function to extract the most important information from the output.

summary(aov(obp ~ position, data = mlb_obp))
             Df Sum Sq  Mean Sq F value Pr(>F)
position      3 0.0076 0.002519   1.994  0.115
Residuals   323 0.4080 0.001263               

This table contains all the information we need. It has the degrees of freedom, sums of squares, mean squares, \(F\) test statistic, and \(p\)-value. The test statistic is \(F = \frac{MSG}{MSE} = \frac{0.002519}{0.001263} = 1.994\). The \(p\)-value is larger than 0.05, indicating the evidence is not strong enough to reject the null hypothesis at a significance level of 0.05. That is, the data do not provide strong evidence that the average on-base percentage varies by player's primary field position.

The calculation of the \(p\)-value can also be done by finding the probability associated with the upper tail of the \(F\) distribution:

pf(1.994, 3, 323, lower.tail = FALSE)
[1] 0.1147443

Figure 26.4 is a plot of the \(F\) distribution with the observed \(F\) test statistic shown as a red line.

Figure 26.4: The F distribution.

26.3.2.2 Permutation test

We can repeat the same analysis using a permutation test. We will first run it using the ratio of mean squares, as in the \(F\) statistic, and then, for interest, using the difference of mean squares.

We need a way to extract the mean squares from the output. The broom package contains a function called tidy() that cleans up the output from functions like aov() and turns it into a data frame.

library(broom)
aov(obp ~ position, data = mlb_obp) %>%
  tidy()
# A tibble: 2 × 6
  term         df   sumsq  meansq statistic p.value
  <chr>     <dbl>   <dbl>   <dbl>     <dbl>   <dbl>
1 position      3 0.00756 0.00252      1.99   0.115
2 Residuals   323 0.408   0.00126     NA     NA    

Let’s summarize the values in the meansq column and develop our test statistic, the ratio of mean squares. We could just pull the statistic (using the pull() function) but we want to be able to generate a different test statistic, the difference of mean squares, as well.

aov(obp ~ position, data = mlb_obp) %>%
  tidy() %>%
  summarize(stat = meansq[1] / meansq[2]) 
# A tibble: 1 × 1
   stat
  <dbl>
1  1.99

Now we are ready. First, get our observed test statistic using pull().

obs <- aov(obp ~ position, data = mlb_obp) %>%
  tidy() %>%
  summarize(stat = meansq[1] / meansq[2]) %>%
  pull()
obs
[1] 1.994349

Let’s put our test statistic into a function that includes shuffling the position variable.

f_stat <- function(x){
  # shuffle() randomly permutes the position labels, simulating the null hypothesis
  aov(obp ~ shuffle(position), data = x) %>%
  tidy() %>%
  # ratio of mean squares: MSG (row 1) divided by MSE (row 2)
  summarize(stat = meansq[1] / meansq[2]) %>%
  pull()
}

Now, we run our function.

set.seed(5321)
f_stat(mlb_obp)
[1] 0.1185079

Next, we run the randomization test using the do() function. There is an easier way to do all of this work with the purrr package but we will continue with the work we have started.

set.seed(5321)
results <- do(1000)*(f_stat(mlb_obp))

The code above is slow to execute because the tidyverse functions inside our f_stat() function are slow.
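For the curious, the purrr alternative mentioned above might look like the following sketch; results_purrr is a hypothetical name, and any speed difference is not guaranteed.

library(purrr)
set.seed(5321)
# repeat the permutation 1,000 times, collecting the F ratios in a numeric vector
results_purrr <- map_dbl(1:1000, ~ f_stat(mlb_obp))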

Figure 26.5 is a plot of the sampling distribution from the randomization test. The \(F\) distribution is overlaid as a dark blue curve.

Figure 26.5: The sampling distribution of the ratio of variances randomization test statistic.

The \(p\)-value is

prop1(~(result >= obs), results)
prop_TRUE 
0.0959041 

This is a similar \(p\)-value to the ANOVA output.

Now, let’s repeat the analysis but use the difference in variance (mean squares) as our test statistic. We’ll define a new function that calculates the test statistic and then run the randomization test.

f_stat2 <- function(x){
  aov(obp ~ shuffle(position), data = x) %>%
  tidy() %>%
  summarize(stat = meansq[1] - meansq[2]) %>%
  pull(stat)
}
set.seed(5321)
results <- do(1000)*(f_stat2(mlb_obp))

Figure 26.6 is the plot of the sampling distribution of the difference in variances.

Figure 26.6: The sampling distribution of the difference in variances randomization test statistic.

We need the observed test statistic in order to calculate a \(p\)-value.

obs <- aov(obp ~ position, data = mlb_obp) %>%
  tidy() %>%
  summarize(stat = meansq[1] - meansq[2]) %>%
  pull(stat)
obs
[1] 0.001255972

The \(p\)-value is

prop1(~(result >= obs), results)
prop_TRUE 
0.0959041 

Again, we find the same \(p\)-value. This is no coincidence: the total sum of squares is fixed under permutation, so both the ratio and the difference of mean squares are increasing functions of \(SSG\), and the two statistics rank the permuted data sets identically.

Exercise: If we reject the null hypothesis in the ANOVA test, we know that at least one mean is different but we don’t know which one(s). How would you approach answering the question of which means are different?
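One common approach, sketched below rather than prescribed, is to follow a significant ANOVA with all pairwise comparisons while adjusting the \(p\)-values for multiple testing; a Bonferroni adjustment is shown, though many other adjustments exist.

# all six pairwise t-tests between positions, with p-values adjusted
# to control the overall Type 1 error rate
pairwise.t.test(mlb_obp$obp, mlb_obp$position, p.adjust.method = "bonferroni")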

26.4 Culminating example

To bring together all the ideas we have learned so far in this block, we will work an example of testing for a difference of two means. In our opinion, the easiest method to understand is the permutation test and the most difficult is the one based on the mathematical derivation, because of the assumptions necessary to get a mathematical solution for the sampling distribution. We will also introduce how to use the bootstrap to get a confidence interval.

26.4.1 HELP example

Let’s return to the Health Evaluation and Linkage to Primary Care data set, HELPrct in the mosaicData package. Previously, we looked at whether there was a difference in substance of abuse between males and females.

We are now interested in whether there is a difference between male and female ages. Let’s first take a look at the data to refresh our memory.

data("HELPrct")
HELP_sub <- HELPrct %>%
  select(age, sex)
favstats(age ~ sex, data = HELP_sub)
     sex min Q1 median   Q3 max     mean       sd   n missing
1 female  21 31     35 40.5  58 36.25234 7.584858 107       0
2   male  19 30     35 40.0  60 35.46821 7.750110 346       0
HELP_sub %>%
  gf_boxplot(age ~ sex) %>%
  gf_theme(theme_classic()) %>%
  gf_labs(x = "Gender", y = "Age (years)")

The distribution of age in the HELP study by gender.
HELP_sub %>%
  gf_dhistogram(~age|sex, fill = "cyan", color = "black") %>%
  gf_theme(theme_classic()) %>%
  gf_labs(x = "Age", y = "")

The distribution of age in the HELP study by gender.

The boxplots and histograms above indicate there might be a slight difference in the means, but is it statistically significant?

26.4.2 Randomization test

The randomization test (approximation of a permutation test) is ideally suited for a hypothesis test. So, we will conduct this first and then see if we can generate a confidence interval.

The hypotheses are:

\(H_0\): There is no difference in average age for men and women in the detoxification unit. In statistical notation: \(\mu_{male} - \mu_{female} = 0\), where \(\mu_{female}\) represents the mean age for female inpatients and \(\mu_{male}\) represents the mean age for male inpatients.

\(H_A\): There is some difference in average age for men and women in the detoxification unit (\(\mu_{male} - \mu_{female} \neq 0\)).

Let’s perform a randomization test. The mean age by sex is again shown below.

favstats(age ~ sex, data = HELP_sub)
     sex min Q1 median   Q3 max     mean       sd   n missing
1 female  21 31     35 40.5  58 36.25234 7.584858 107       0
2   male  19 30     35 40.0  60 35.46821 7.750110 346       0

We calculate the observed difference in the mean ages. Notice we are subtracting the mean female age from the mean male age.

obs_stat <- diffmean(age ~ sex, data = HELP_sub)
obs_stat
  diffmean 
-0.7841284 

Under the null hypothesis, there is no difference in the average age for men and women. So, we shuffle the sex labels around to generate a sampling distribution under the null.

set.seed(345)
results <- do(10000)*diffmean(age ~ shuffle(sex), data = HELP_sub)
favstats(~diffmean, data = results)
       min         Q1     median     Q3      max        mean        sd     n
 -3.378154 -0.5638809 0.01120955 0.5863 3.486224 0.009350908 0.8492454 10000
 missing
       0

The sampling distribution is centered on the null hypothesized value of 0, more or less, and the standard deviation is 0.849. This is an estimate of the variability of the difference in mean ages. The observed difference in means is shown as a red line on the sampling distribution in the figure below.

results %>%
  gf_histogram(~diffmean, color = "black", fill = "cyan") %>%
  gf_vline(xintercept = obs_stat, color = "red") %>%
  gf_theme(theme_classic()) %>%
  gf_labs(x = "Difference of means", 
          title = "Sampling distribution of difference of two means", 
          subtitle = "Null assumes equal means")

The approximate sampling distribution of the difference of means from the randomization process.

Our test statistic does not appear to be too extreme. Now, we calculate a two-sided \(p\)-value.

2*prop1(~(diffmean <= obs_stat), data = results)
prop_TRUE 
0.3523648 

Based on this \(p\)-value, we fail to reject the null hypothesis. There is not enough evidence to say that the mean age for men and women is different.

Now, to construct a confidence interval based on the randomization, we have to be careful and think about this. The object results has the distribution of the difference in means, assuming there is no difference. To get a confidence interval, we want to center this distribution on the observed difference in means and not on zero. We will do this by adding the observed difference in means to the randomization results, and then find the percentile confidence interval.

cdata(~(diffmean + obs_stat), data = results)
         lower     upper central.p
2.5% -2.449246 0.8789368      0.95

We are 95% confident that the true difference in mean ages between male and female participants in the study is between -2.45 and 0.88. Because 0 is contained in the confidence interval, we fail to reject the null hypothesis. There is not enough evidence to say that the mean age for men and women is different.

Note that we are assuming the test statistic can be transformed in this way. It turns out that the percentile method for a confidence interval is transformation invariant, so shifting the null sampling distribution by the observed value is a legitimate transformation.

26.4.3 Traditional mathematical methods

Using the CLT or the \(t\) distribution becomes difficult, because we have to find a way to calculate the standard error. There have been many proposed methods, and you are welcome to research them, but we will only present a couple of ideas in this section. Let's summarize the process for both hypothesis testing and confidence intervals in the case of the difference of two means using the \(t\) distribution. You may recall some of the ideas for a two-sample \(t\) test (hypothesis test) from an earlier chapter.

26.4.4 Hypothesis tests

When applying the \(t\) distribution for a hypothesis test, we proceed as follows:

  1. State the appropriate hypotheses.

  2. Verify the conditions for using the \(t\) distribution.

    For a difference of means, when the data are not paired: each sample mean must separately satisfy the one-sample conditions for the \(t\) distribution, and the two groups must be independent of each other. Just like in the one-sample case, slight skewness will not be a problem for larger sample sizes. We can have moderate skewness and be fine if each sample has 30 or more observations, and even extreme skewness if each has 60 or more.

  3. Compute the point estimate of interest, the standard error, and the degrees of freedom.

  4. Compute the \(T\) test statistic and \(p\)-value.

  5. Make a conclusion based on the \(p\)-value, and write a conclusion in context of the problem and in plain language so that anyone can understand the results.

We’ve added the extra step of checking the assumptions here.

26.4.5 Confidence intervals

Similarly, the following is how we generally compute a confidence interval using the \(t\) distribution:

  1. Verify conditions for using the \(t\) distribution. (See above.)

  2. Compute the point estimate of interest, the standard error, the degrees of freedom, and \(t^{\star}_{df}\) (the critical value or number of standard deviations needed for the confidence interval).

  3. Calculate the confidence interval using the general formula, \(\text{point estimate} \pm\ t_{df}^{\star} SE\).

  4. Put the conclusions in context of the problem and in plain language, so even non-statisticians can understand the results.

If the assumptions above are met, each sample mean can itself be modeled using a \(t\) distribution. If the samples are independent, then the sample difference of two means, \(\bar{x}_1 - \bar{x}_2\), can be modeled using the \(t\) distribution and the standard error is

\[SE_{\bar{x}_{1} - \bar{x}_{2}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

To calculate the degrees of freedom, we can use statistical software or, conservatively, use the smaller of \(n_1 - 1\) and \(n_2 - 1\).

26.4.5.1 Results

Back to the HELP data, the men and women were independent of each other. Additionally, the distributions in each population don't show any clear deviations from normality. There is some slight skewness, but the sample size reduces this concern; see the quantile-quantile plots below. Finally, within each group we also need independence. If each group represents less than 10% of the population, we are good to go on this. This condition might be difficult to verify.

HELP_sub %>%
  gf_qq(~age|sex) %>%
  gf_qqline(~age|sex) %>%
  gf_theme(theme_bw())

The quantile-quantile plots to check the normality assumption.

The distribution of males tends to have longer tails than a normal distribution, and the female distribution is skewed to the right. The sample sizes are large enough that this does not worry us. Below are the summary statistics by sex once more.

favstats(age ~ sex, data = HELP_sub)
     sex min Q1 median   Q3 max     mean       sd   n missing
1 female  21 31     35 40.5  58 36.25234 7.584858 107       0
2   male  19 30     35 40.0  60 35.46821 7.750110 346       0

Let’s find the confidence interval first. We use the smaller of \(n_1 - 1\) and \(n_2 - 1\) as the degrees of freedom: \(df = 106\).

(35.47 - 36.25) + c(-1,1)*qt(0.975, 106)*sqrt(7.58^2 / 107 + 7.75^2 / 346)
[1] -2.4512328  0.8912328

This result is very close to the interval we got with the permutation test.

Now, let’s find the \(p\)-value for the hypothesis test.

The test statistic is:

\[T \ =\ \frac{\text{point estimate} - \text{null value}}{SE}\]

\[ = \frac{(35.47 - 36.25) - 0}{\sqrt{\left( \frac{7.58^2}{107}+ \frac{7.75^2}{346}\right)}}\ =\ -0.92976 \]

We again use \(df = 106\). The \(p\)-value is:

2*pt(-0.92976, 106)
[1] 0.3546079

The results in R are slightly different due to the calculation of the degrees of freedom.

t_test(age ~ sex, data = HELP_sub)

    Welch Two Sample t-test

data:  age by sex
t = 0.92976, df = 179.74, p-value = 0.3537
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.8800365  2.4482932
sample estimates:
mean in group female   mean in group male 
            36.25234             35.46821 

Notice that the degrees of freedom is not an integer. This is because it is a weighted average of the two different sample sizes and standard deviations. This method is called the Satterthwaite approximation. Additionally, the confidence interval is slightly different because t_test() subtracted the mean age for men from the mean age for women. Still, the confidence interval and \(p\)-value lead us to the same results as before – we fail to reject the null hypothesis.

26.4.5.2 Pooled standard deviation

Occasionally, two populations will have standard deviations that are so similar they can be treated as identical. This is an assumption of equal variance in each group. For example, historical data or a well-understood biological mechanism may justify this strong assumption. In such cases, we can make the \(t\) distribution approach slightly more precise by using a pooled standard deviation.

The pooled standard deviation of two groups is a way to use data from both samples to better estimate the standard deviation and standard error. If \(s_1\) and \(s_2\) are the standard deviations of groups 1 and 2, respectively, and there are good reasons to believe that the population standard deviations are equal, then we can obtain an improved estimate of the group variances by pooling their data:

\[ s_{pooled}^2 = \frac{s_1^2\times (n_1 - 1) + s_2^2\times (n_2 - 1)}{n_1 + n_2 - 2}\]

where \(n_1\) and \(n_2\) are the sample sizes of groups 1 and 2, as before. To use this new estimate, we substitute \(s_{pooled}^2\) in place of \(s_1^2\) and \(s_2^2\) in the standard error formula, and we use an updated formula for the degrees of freedom: \[df = n_1 + n_2 - 2\]
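To make the pooled formula concrete, the sketch below computes \(s_{pooled}^2\), the standard error, and the test statistic by hand from the favstats values shown earlier; it should match the var.equal output at the end of this section, up to the sign of the difference.

# summary statistics from favstats(age ~ sex, data = HELP_sub)
n1 <- 107; s1 <- 7.584858   # female
n2 <- 346; s2 <- 7.750110   # male

s_pooled_sq <- (s1^2 * (n1 - 1) + s2^2 * (n2 - 1)) / (n1 + n2 - 2)
SE <- sqrt(s_pooled_sq / n1 + s_pooled_sq / n2)
(35.46821 - 36.25234) / SE   # t statistic (male minus female)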

The benefits of pooling the standard deviation are realized through obtaining a better estimate of the standard deviation for each group and using a larger degrees of freedom parameter for the \(t\) distribution. Both of these changes may permit a more accurate model of the sampling distribution of \(\bar{x}_1 - \bar{x}_2\).

Caution
Pooling standard deviations should be done only after careful research.

A pooled standard deviation is only appropriate when background research indicates the population standard deviations are nearly equal. When the sample sizes are large and the condition can be adequately checked with the data, the benefits of pooling the standard deviations greatly diminish.

In R we can test the difference of two means with equal variance using the var.equal argument.

t_test(age ~ sex, data = HELP_sub, var.equal = TRUE)

    Two Sample t-test

data:  age by sex
t = 0.91923, df = 451, p-value = 0.3585
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.8922735  2.4605303
sample estimates:
mean in group female   mean in group male 
            36.25234             35.46821 

Since our sample sizes are so large, this did not have a big impact on the results.

26.4.6 Bootstrap

Finally, we will construct a confidence interval through the use of the bootstrap. In this problem, we have to be careful and sample within each group. Compare the following two bootstrap samples.

favstats(age ~ sex, data = resample(HELP_sub))
     sex min Q1 median    Q3 max     mean       sd   n missing
1 female  21 30     33 38.75  50 34.64706 6.267387 102       0
2   male  19 29     33 39.00  59 34.74359 7.833066 351       0

and

favstats(age ~ sex, data = resample(HELP_sub, groups = sex))
     sex min Q1 median   Q3 max     mean       sd   n missing
1 female  22 31     34 39.5  57 35.60748 6.901951 107       0
2   male  20 31     35 41.0  60 35.94798 8.039227 346       0

Notice that in the second call to resample(), we keep the group sizes fixed by resampling within the sex variable.

Now, let’s get our bootstrap distribution.

set.seed(2527)
results <- do(1000)*diffmean(age ~ sex, data = resample(HELP_sub, groups = sex))

The figure below shows our sampling distribution from the bootstrap.

results %>%
  gf_histogram(~diffmean, fill = "cyan", color = "black") %>%
  gf_theme(theme_classic()) %>%
  gf_labs(x = "Difference in means", y = "")

Bootstrap distribution of the difference in means.
cdata(~diffmean, p = 0.95, data = results)
         lower     upper central.p
2.5% -2.394406 0.8563786      0.95

Again, we find similar results with the percentile confidence interval.


  1. \(H_0\): The average on-base percentage is equal across the four positions. \(H_A\): The average on-base percentage varies across some (or all) groups. That is, the average on-base percentage for at least one position is different.