9.1 Case study: Comparing work travel times
The following is a worked example using the county complete dataset from Section 8 that demonstrates how to use the infer
package to test if the difference in mean values of two distributions is statistically significant.
To follow along, load the following libraries and dataset into your R session:
# Packages
library(dplyr)
library(ggplot2)
library(readr)
library(infer)
# Dataset
<- read_rds(
county_complete url("http://data.cds101.com/county_complete.rds")
)
9.1.1 Work travel times in Iowa and Nebraska
In Section 8, we looked at the distribution of average work travel times in Iowa and Nebraska measured on the county level.
In our visual inspection of the probability mass function, we concluded that the distribution of work travel times is shifted slightly towards shorter commutes in Nebraska versus Iowa. The summary statistics for the two distributions seem to support this conclusion:
# Filter dataset to only include counties in Iowa and Nebraska
<- county_complete %>%
ia_ne_county filter(state == "Iowa" | state == "Nebraska")
# Compute summary statistics of mean_work_travel column for Iowa and Nebraska
<- ia_ne_county %>%
ia_ne_summary_stats group_by(state) %>%
summarize(
mean = mean(mean_work_travel),
median = median(mean_work_travel),
sd = sd(mean_work_travel),
iqr = IQR(mean_work_travel)
)
state | mean | median | sd | iqr |
---|---|---|---|---|
Iowa | 19.11 | 18.5 | 3.425 | 4.4 |
Nebraska | 17.97 | 17.6 | 3.823 | 5.0 |
Let’s compute the difference in means directly using ia_ne_summary_stats
.
First, we need to grab the mean
column:
<- ia_ne_summary_stats %>%
ia_ne_means pull(mean)
ia_ne_means
is a simple vector that stores two numbers.
ia_ne_means
## [1] 19.11 17.97
The first number of the vector, which we can access with ia_ne_means[1]
, is the mean work travel times in Iowa.
The second number of the vector, which we can access with ia_ne_means[2]
, is the mean work travel times in Nebraska.
Knowing this, we easily compute the difference in means,
<- ia_ne_means[1] - ia_ne_means[2] ia_ne_diff_in_means
obtaining the following value of ia_ne_diff_in_means
:
# (mean work travel time in Iowa) - (mean work travel time in Nebraska)
ia_ne_diff_in_means = 1.15 minutes
However, only checking the numerical difference between the mean values is not enough if we want to claim there is a statistically significant difference in commute times between Iowa and Nebraska. There is significant overlap between the two distributions, with the standard deviation (sd) of each distribution being 3–4 times larger than the difference in means. Before we can conclude that the difference in means is statistically significant, we need to estimate how likely it is that this difference could have arisen from chance alone.
9.1.2 Defining the hypothesis test
To determine whether or not there is a meaningful difference in means between the two distributions, we will conduct a two-sided hypothesis test using the following null and alternative hypotheses:
Null hypothesis: There is no difference between the mean work travel time in Iowa and the mean work travel time in Nebraska.
\[\text{H}_{0}:\mu_{\text{IA}}-\mu_{\text{NE}}=0\]
Alternative hypothesis: There is a difference between the mean work travel time in Iowa and the mean work travel time in Nebraska.
\[\text{H}_{\text{A}}:\mu_{\text{IA}}-\mu_{\text{NE}}\neq{}0\]
We set our significance level to \(\alpha=0.05\), which we will use as a cutoff for determining whether or not a result is statistically significant.
9.1.3 Building the null distribution
To conduct our hypothesis test, we first must generate our null distribution using the infer
package:
<- ia_ne_county %>%
ia_ne_mean_work_travel_null specify(formula = mean_work_travel ~ state) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = combine("Iowa", "Nebraska"))
Let’s explain what each of these lines is doing.
We start by piping our filtered dataset stored in ia_ne_county
into specify
:
specify(formula = mean_work_travel ~ state)
Here we specify the response variable and the explanatory variable in the formula =
input keyword.
The formula syntax is given as follows:
response ~ explanatory
Therefore, in our formula mean_work_travel ~ state
, mean_work_travel
is the response variable and state
is the explanatory variable.
How do we know to use mean_work_travel
as the response variable?
It’s because of how we formulated our null and alternative hypotheses, such that we are testing whether a person’s mean work travel time is affected by the state he or she lives in, not whether the state that a person lives in is affected by his or her mean work travel time.
Finally, because the response variable is numerical, we don’t use the success =
keyword in specify
.
The next line that we pipe into is:
hypothesize(null = "independence")
Here we indicate that we are conducting a hypothesis test to check whether or not the variables provided in specify
are independent of one another.
A basic rule of thumb is, if you specify both an explanatory and a response variable in the formula of specify
, then you will use null = "independence"
.
Next we pipe into this line:
generate(reps = 10000, type = "permute")
Here we indicate that we want to permute the mean_work_travel
and state
columns 10,000 times to build up our null distribution.
In effect, we are randomly shuffling the rows in mean_work_travel
into a different order, and then randomly shuffling the rows in state
into yet another order, which should remove any connections between the mean_work_travel
and state
columns (this is what we mean by null distribution).
We know to use type = "permute"
because we have specified both a response and an explanatory variable and we want to see what happens when we assume the two columns are independent.
As for reps = 10000
, this controls the overall accuracy of our hypothesis tests.
There is no hard and fast rule for the value to set, but if you want more accurate results, increase reps
.
However, keep in mind that the larger reps
is, the longer it will take to complete the calculation.
We finish up by piping into the last line:
calculate(stat = "diff in means", order = combine("Iowa", "Nebraska"))
Here we indicate that we are comparing the difference in means between our two distributions (the mean work travel times in Iowa and Nebraska).
The input order = combine("Iowa", "Nebraska")
defines the subtraction order, so we will find the difference in means by subtracting the mean time in Iowa from the mean time in Nebraska, which is the same order we used for ia_ne_diff_in_means <- ia_ne_means[1] - ia_ne_means[2]
.
As for the stat
input, we know to use stat = "diff in means"
because of how we formulated our null and alternative hypotheses, i.e. we are comparing the difference in mean work travel times between Iowa and Nebraska.
The way we know that we can’t use something like stat = "diff in props"
is because at least one of our variables in specify
is numerical (mean_work_travel
).
The input stat = "diff in props"
only applies if both the explanatory and response variables are categorical.
The null distribution that we generated looks as follows:
%>%
ia_ne_mean_work_travel_null visualize() +
labs(
x = "difference in means",
title = "Difference in mean work travel times null distribution"
)
9.1.4 Computing the two-sided p-value
After we’ve generated our null distribution, we can compute the p-value using get_p_value
.
To compute a p-value, we need an observed statistic, which is simply the difference in means we computed using the dataset itself.
This is what we computed using ia_ne_diff_in_means <- ia_ne_means[1] - ia_ne_means[2]
.
However, infer
also provides a shortcut for computing the observed statistic so that you don’t have to use summarize()
,
<- ia_ne_county %>%
ia_ne_diff_in_means specify(formula = mean_work_travel ~ state) %>%
calculate(stat = "diff in means", order = combine("Iowa", "Nebraska"))
Note that all we’ve done is taken the code we used to generate the null distribution and removed the hypothesize()
and generate()
lines.
Now that we have our observed statistic, we can compute the p-value for our two-sided hypothesis test,
<- ia_ne_mean_work_travel_null %>%
ia_ne_p_value_two_sided get_p_value(obs_stat = ia_ne_diff_in_means, direction = "both")
p_value |
---|
0.0344 |
Our two-sided p-value is 0.0344, which is below our significance level of \(\alpha=0.05\). We therefore reject the null hypothesis in favor of the alternative hypothesis. The difference between the mean work travel times in Iowa and Nebraska is statistically significant.
We visualize the meaning of the p-value in the following way:
%>%
ia_ne_mean_work_travel_null visualize() +
shade_p_value(obs_stat = ia_ne_diff_in_means, direction = "both") +
labs(
x = "difference in means",
title = "Difference in mean work travel times null distribution"
)
The portions of the null distribution that are in the shaded red area correspond to results that are more extreme than the observed value 1.15 in a two-sided hypothesis test. The more that a null distribution lies within the red portions of the visualization, the more likely it becomes that we will fail to reject the null distribution and the difference in means that we observed in the dataset arose due to random chance alone.
9.1.5 Computing the 95% confidence interval
In addition to the two-sided hypothesis test, it is also good practice to compute the 95% confidence interval for the difference in mean work travel times between Iowa and Nebraska. To compute this, we need to generate the bootstrap distribution for the difference in means:
<- ia_ne_county %>%
ia_ne_mean_work_travel_bootstrap specify(formula = mean_work_travel ~ state) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = combine("Iowa", "Nebraska"))
You’ll notice that the code to generate the bootstrap distribution is pretty similar to the code used to generate the null distribution.
All we had to do was remove the hypothesize(null = "independence")
line and change the type =
keyword in generate
from "permute"
to "bootstrap"
, and that’s it!
You might be wondering, how are the methods for generating the null distribution and bootstrap distribution different? For this example, they differ in two important ways:
Unlike when we generated the null distribution, the
mean_work_travel
andstate
columns are not randomly shuffledGenerating a bootstrap distribution requires sampling with replacement, meaning we grab different rows in
ia_ne_county
at random (we might even grab the same row more than), which we do not do if we permute the columns
The bootstrap distribution we just generated looks as follows:
%>%
ia_ne_mean_work_travel_bootstrap visualize() +
labs(
x = "difference in means",
title = "Difference in mean work travel times bootstrap distribution"
)
This shows what could happen if we went out and collected another set of samples of the data in this dataset. Due to random variations in the people we survey, sometimes we might end up with a dataset where the difference between the mean work travel times in Iowa and Nebraska is closer to 2 minutes instead of 1 minute, or it might be closer to zero. Most of the time, it will be close to 1 minute.
The 95% confidence interval is defined as the fraction of the data in the bootstrap distribution that lies between the 2.5th and 97.5th percentiles:
<- ia_ne_mean_work_travel_bootstrap %>%
ia_ne_ci95 get_confidence_interval()
lower_ci | upper_ci |
---|---|
0.1084 | 2.162 |
This means that, if we keep collecting new samples of mean work travel times in Iowa and Nebraska, we expect that 95% of our samples will have a difference in mean work travel times that lies between 0.1084 and 2.1622. Also note that the lower bound of the confidence interval does not intersect with a difference of means equal to zero, lending further support to the conclusion we reached using our two-sided hypothesis test.
We close this example by visualizing the 95% confidence interval by shading the middle 95% of the data in the bootstrap distribution:
%>%
ia_ne_mean_work_travel_bootstrap visualize() +
shade_confidence_interval(endpoints = ia_ne_ci95) +
labs(
x = "difference in means",
title = "Difference in mean work travel times 95% confidence interval"
)