P values | Hard


(Power to Simulate)

This practice problem is drawn from Example 9.19 in Devore and Berk.

The fuel efficiency (mpg) of any particular new vehicle under specified driving conditions may not be identical to the EPA figure that appears on the vehicle’s sticker. Suppose that 10 different vehicles of a particular type are to be selected and driven over a certain course, after which the fuel efficiency of each one is to be determined. Let $\mu$ denote the true average fuel efficiency under these conditions.

Consider testing $H_{0}: \mu = 20$ versus $H_{a}: \mu > 20$ using the one-sample t-test based on the resulting sample.

(a) Use R to produce a random sample of size 10 from a normal distribution with mean 20 and standard deviation 2. Then, conduct a t-test of this sample against the null hypothesis that the true mean is 20. Repeat this process a large number of times (e.g., 10,000), saving the resulting p-value each time. What proportion of these simulations would reject the null hypothesis at an $\alpha = 0.05$ significance level? If you produce a plot of the distribution of these p-values, what do you observe?

(b) Use R to produce a random sample of size 10 from a normal distribution with mean 21 and standard deviation 2. Then, conduct a t-test of this sample against the null hypothesis that the true mean is 20. Repeat this process a large number of times (e.g., 10,000), saving the resulting p-value each time. What proportion of these simulations would reject the null hypothesis at an $\alpha = 0.05$ significance level? If you produce a plot of the distribution of these p-values, what do you observe?

(c) Use R to produce a random sample of size 10 from a normal distribution with mean 22 and standard deviation 2. Then, conduct a t-test of this sample against the null hypothesis that the true mean is 20. Repeat this process a large number of times (e.g., 10,000), saving the resulting p-value each time. What proportion of these simulations would reject the null hypothesis at an $\alpha = 0.05$ significance level? If you produce a plot of the distribution of these p-values, what do you observe?

Solution:

This is another way to visualize the power of a t-test!

Null Hypothesis Is True

```r
library(ggplot2)

## simulate 10,000 samples under the null (the true mean really is 20)
null_p <- rep(NA, 10000)

for(i in 1:10000) {
  sample_data <- rnorm(n=10, mean=20, sd=2)
  null_p[i]   <- t.test(sample_data, mu=20)$p.value
}

## proportion of false rejections at alpha = 0.05
mean(null_p < 0.05)

## plot the distribution of the p-values
ggplot() +
  aes(x=null_p) +
  geom_density()
```

When the null hypothesis is true, this looks like a uniform distribution with (roughly) equal probability across the entire support of $[0, 1]$. This support is the set of continuous values between 0 and 1 (inclusive), because each p-value from each of the samples is defined on this set. Because this distribution is uniform, it is not entirely surprising to see that about 0.05 of the p-values are smaller than 0.05 -- these are the false rejections that we knew our testing process would generate.
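As a quick sanity check on this uniformity claim (a minimal sketch, reusing the `null_p` vector simulated above), we can compare the simulated p-values to a standard uniform distribution and confirm that roughly $\alpha$ of them fall below any threshold $\alpha$ we pick:

```r
## compare the simulated p-values to a Uniform(0, 1) distribution
ks.test(null_p, "punif")

## under a true null, about alpha of the p-values should fall below
## any threshold alpha, not just 0.05
mean(null_p < 0.10)
mean(null_p < 0.50)
```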

Null Hypothesis Is Not True, But the Effect Is Small

Notice that as soon as we generate data from a population distribution whose mean is any value other than 20, the null hypothesis is, by construction, not true. But will the tests that we conduct on data drawn from these distributions inform us of this reality? Recall that these tests do not know the population truth.

```r
## notice that the mean of the data samples is now 21,
## so the null hypothesis (mu = 20) is false by construction
null_p <- rep(NA, 10000)

for(i in 1:10000) {
  sample_data <- rnorm(n=10, mean=21, sd=2)
  null_p[i]   <- t.test(sample_data, mu=20)$p.value
}

## proportion of correct rejections at alpha = 0.05
mean(null_p < 0.05)

ggplot() +
  aes(x=null_p) +
  geom_density()
```

Now, more of the p-values -- about 30% of them -- fall inside the 0.05 rejection region. This means that the power of our test -- its ability to correctly reject the null hypothesis when there is a 1-unit difference -- is about 0.30. Notice that if we increase the size of the sample, our power will go up!
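As a cross-check on this simulated estimate (a sketch that is not part of the original solution), the built-in `power.t.test()` function computes the analytic power of a one-sample t-test; with n = 10, a 1-unit effect, and a standard deviation of 2, it should agree closely with the simulation:

```r
## analytic power for the one-sample t-test; note that t.test() is
## two-sided by default, so we match that here
power.t.test(n = 10, delta = 1, sd = 2, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")
```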

```r
alt_p <- rep(NA, 10000)

for(i in 1:10000) {
  sample_data <- rnorm(n=20, mean=21, sd=2)
  alt_p[i]    <- t.test(sample_data, mu=20)$p.value
}

mean(alt_p < 0.05)

ggplot() +
  aes(x=alt_p) +
  geom_density()
```

Now, the power of the test is roughly 0.55.

```r
alt_p <- rep(NA, 10000)

for(i in 1:10000) {
  sample_data <- rnorm(n=30, mean=21, sd=2)
  alt_p[i]    <- t.test(sample_data, mu=20)$p.value
}

mean(alt_p < 0.05)

ggplot() +
  aes(x=alt_p) +
  geom_density()
```

And now, roughly 0.75. Notice that we're getting diminishing returns from adding more data: the first 10 observations we added generated a power gain of about 0.25; the second 10 generated a power gain of about 0.20. We could actually work through a simulation of power across a larger swath of sample sizes, and then plot the estimated power at each of these sample sizes.
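The same diminishing-returns pattern shows up analytically. As a hedged cross-check (again using `power.t.test()`, which is not part of the simulation), we can compute the power at each of the three sample sizes we just simulated:

```r
## analytic power at n = 10, 20, 30 for a 1-unit effect with sd = 2
sapply(c(10, 20, 30), function(n) {
  power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05,
               type = "one.sample")$power
})
```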

To do so, we will decrease the number of simulations that we run at each sample size (this only decreases the precision of our power estimate at any particular sample size) so that we don't have to wait all day for the simulation to finish.

Notice that in the simulation below, we have the same internal for-loop as we've been using before -- or at least, it is very similar. The only difference is that now the size of the sample is defined as `sample_size[size]`.

What's happening there? `size` is the iterator for the outer loop, iterating along 1, 2, 3, ..., `length(sample_size)`. And so, when we pull `sample_size[1]` we are pulling the value 3. When we pull `sample_size[2]` we are pulling the value 4, and so on.
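To make this indexing concrete, here is a small illustrative snippet (it assumes the same `sample_size` vector that is defined in the simulation below):

```r
sample_size <- seq(from=3, to=100, by=1)

sample_size[1]       ## returns 3
sample_size[2]       ## returns 4
length(sample_size)  ## 98 sample sizes in total
```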

This means that, for each size, we are computing a vector of 1,000 p-values from a t-test, calculating the proportion of these values that are at or below 0.05, and storing that proportion in the object `power`.

```r
sample_size <- seq(from=3, to=100, by=1)
alt_p       <- rep(NA, 1000)
power       <- rep(NA, length(sample_size))

for(size in 1:length(sample_size)) {
  ## inner loop: simulate 1,000 p-values at this particular sample size
  for(i in 1:1000) {
    sample_data <- rnorm(n=sample_size[size], mean=21, sd=2)
    alt_p[i]    <- t.test(sample_data, mu=20)$p.value
  }
  ## outer loop: store the estimated power at this sample size
  power[size] <- mean(alt_p <= 0.05)
}
```

At the end of this simulation, we can plot what this power curve looks like:

```r
## plot the estimated power against the actual sample size (3 to 100),
## rather than against the loop index
ggplot() +
  aes(x=sample_size, y=power) +
  geom_point() +
  stat_smooth()
```
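If we want to know where this curve crosses a conventional target like 0.80, `power.t.test()` can also solve for the required sample size directly (a cross-check sketch, not part of the simulation above); leaving `n` unspecified tells the function to solve for it:

```r
## solve for the sample size that gives 80% power
## for a 1-unit effect with sd = 2
power.t.test(power = 0.80, delta = 1, sd = 2, sig.level = 0.05,
             type = "one.sample")
```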

Null Hypothesis Is Not True, But the Effect Is Larger

What happens when the effect is larger? What do you think will happen to the power of the test at each sample size? Modify the simulation above (I know, burying a new question in an answer!) to set the mean of the sample data to 22, and re-run the power calculations.
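As a starting point, here is a minimal sketch of that modification -- the only change from the simulation above is the `mean=22` argument to `rnorm()`:

```r
sample_size <- seq(from=3, to=100, by=1)
alt_p       <- rep(NA, 1000)
power       <- rep(NA, length(sample_size))

for(size in 1:length(sample_size)) {
  for(i in 1:1000) {
    ## the only change: data are now generated with mean = 22
    sample_data <- rnorm(n=sample_size[size], mean=22, sd=2)
    alt_p[i]    <- t.test(sample_data, mu=20)$p.value
  }
  power[size] <- mean(alt_p <= 0.05)
}

ggplot() +
  aes(x=sample_size, y=power) +
  geom_point() +
  stat_smooth()
```

Because the effect is now twice as large relative to the same standard deviation, the power curve should climb toward 1 much more quickly at every sample size.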