9  OLS Regression Inference


9.1 Learning Objectives

After this week’s learning, students will be able to

  1. Describe how sampling-based uncertainty is reflected in OLS regression parameter estimates.
  2. Report standard errors and conduct null hypothesis significance tests (NHST) of regression coefficients against zero.
  3. Conduct a regression-based analysis on real data, in ways that begin to explore regression as a modeling tool.

9.2 Class Announcements

  1. Congratulations on finishing your first lab!
  2. The next (and the last) lab is coming up in two weeks.
  3. Homework 09 has been assigned today, and it is due in a week.

9.3 Roadmap

Rear-View Mirror

  • Statisticians create a population model to represent the world.
  • Sometimes, the model includes an “outcome” random variable \(Y\) and “input” random variables \(X_1, X_2,...,X_k\).
  • The joint distribution of \(Y\) and \(X_1, X_2,...,X_k\) is complicated.
  • The best linear predictor (BLP) is the canonical way to summarize the relationship.
  • OLS provides a point estimate of the BLP.

Today

  • Robust standard errors: quantifying the uncertainty of OLS coefficients
  • Hypothesis testing with OLS coefficients
  • Bootstrapping

Looking Ahead

  • Regression is a foundational tool that can be applied to different contexts
  • The process of building a regression model looks different, depending on whether the goal is prediction, description, or explanation.

9.4 Uncertainty in OLS

9.4.1 Discussion Questions

  • List as many differences between the BLP and the OLS line as you can.
  • In the following regression table, explain in your own words what the standard error in parentheses means.
outcome: sleep hours
--------------------
mg. melatonin    0.52
                (0.31)

9.5 Understanding Uncertainty (Part I)

Let’s build some intuition about the distribution of regression coefficients. But, first, what are the formal guarantees that we know exist as a result of:

  1. The Weak Law of Large Numbers; and,
  2. The Central Limit Theorem
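Stated informally (and suppressing the regularity conditions), these two results are what license the rest of this week’s machinery: the WLLN makes the OLS coefficients consistent for the BLP coefficients, and the CLT makes their sampling distribution approximately normal in large samples, which is what lets us attach standard errors and conduct tests.

\[ \hat{\beta} \overset{p}{\longrightarrow} \beta \qquad \text{and} \qquad \sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\longrightarrow} \mathcal{N}(0, \Sigma), \]

for some finite covariance matrix \(\Sigma\) that our variance estimators target (up to the \(1/n\) scaling).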
library(dplyr)
library(ggplot2)

data_bivariate <- function(n=100) { 
  # simulate n draws of (x, y): y is linear in x with uniform (non-normal) noise
  d <- data.frame(
    x = runif(n=n, min=0, max=10)) |>  
    mutate(y = 1 + 2*x + runif(n=n, min=-10, max=10))
  return(d)
}

data_sample <- data_bivariate(n=100)

data_sample |> 
  ggplot() + 
  aes(x=x, y=y) + 
  geom_point()

model_sample <- lm(y ~ x, data = data_sample) 
summary(model_sample)

Call:
lm(formula = y ~ x, data = data_sample)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.8762  -4.5664   0.4383   5.5137  10.7654 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.0549     1.1493   3.528 0.000639 ***
x             1.4783     0.2069   7.144 1.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.995 on 98 degrees of freedom
Multiple R-squared:  0.3424,    Adjusted R-squared:  0.3357 
F-statistic: 51.03 on 1 and 98 DF,  p-value: 1.618e-10

But the data that we sampled and estimated our model against is just one of the many samples that we could have taken!

Suppose that (since this is simulation) we could go back and take a bunch more samples, and estimate a model for each of them. What would the distribution of the sample estimates look like?

Recall that nothing in this data-generating process is normally distributed: both \(x\) and the noise term are uniform.

coefficient_ <- rep(NA, 10000)  # preallocate storage for the slope estimates

for(i in 1:10000) { 
  data_sample <- data_bivariate(n=100)  
  model_sample <- lm(y ~ x, data = data_sample) 
  coefficient_[i] <- coef(model_sample)['x']
  }

ggplot() + 
  aes(x=coefficient_) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(sample(coefficient_, 5000))

    Shapiro-Wilk normality test

data:  sample(coefficient_, 5000)
W = 0.99969, p-value = 0.6681

What if we have two explanatory variables that have no covariance between them?

data_two_x <- function(n=100) { 
  d <- data.frame(
    x1 = runif(n=n, min=0, max=10), 
    x2 = rpois(n=n, lambda=2)) |>  
    mutate(
      y = 1 + 2*x1 + 3*x2 + runif(n=n, min=-10, max=10)
    )
  
  return(d) 
}

data_sample <- data_two_x(n=100) 

pairs(data_sample)

model_sample <- lm(y ~ x1 + x2, data = data_sample)
summary(model_sample)

Call:
lm(formula = y ~ x1 + x2, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6082 -5.7431 -0.5155  5.9939 10.6157 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.5475     1.5954  -0.970    0.334    
x1            2.3503     0.2180  10.782  < 2e-16 ***
x2            3.0306     0.4681   6.474 3.92e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.999 on 97 degrees of freedom
Multiple R-squared:  0.6037,    Adjusted R-squared:  0.5956 
F-statistic:  73.9 on 2 and 97 DF,  p-value: < 2.2e-16

But, once again, that is just one of the samples that are possible from the population. What does the distribution look like if we were to repeatedly resample from the population?

coefficient_x1_ <- rep(NA, 10000) 
coefficient_x2_ <- rep(NA, 10000) 

for(i in 1:10000) { 
  data_sample <- data_two_x(n=100) 
  
  model <- lm(y ~ x1 + x2, data = data_sample) 
  coefficient_x1_[i] <- coef(model)['x1']
  coefficient_x2_[i] <- coef(model)['x2']
}
library(patchwork)  # the `|` operator below places ggplots side by side

x1_plot <- 
  ggplot() + 
  aes(x=coefficient_x1_) + 
  geom_histogram()

x2_plot <- 
  ggplot() + 
  aes(x=coefficient_x2_) + 
  geom_histogram()

x1_x2_plot <- 
  ggplot() + 
  aes(x=coefficient_x1_, y=coefficient_x2_) + 
  geom_hex() + 
  scale_fill_continuous(type = "viridis")

x1_plot | x2_plot | x1_x2_plot
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What if these variables are correlated with one another? What does the distribution of the regression coefficients look like then? Why?

(This is going to take a little more code to write, so hang tight. It would take you only a moment to write this yourself (or to ask GPT for it).)

correlated_data <- function(n=100, unif_min, unif_max, lambda, rho) { 

  # Step 1: Generate an independent Uniform(unif_min, unif_max) variable
  uniform_var <- runif(n, min = unif_min, max = unif_max)
  
  # Step 2: Generate an independent Normal variable; only its spread is used below
  normal_noise <- rnorm(n)
  
  # Step 3: Create a Poisson variable whose mean shifts with the uniform variable,
  #         which induces correlation between x1 and x2
  poisson_mean <- lambda + 
    rho * (uniform_var - mean(uniform_var)) / sd(uniform_var) * sd(normal_noise)
  poisson_var <- rpois(n, lambda = pmax(poisson_mean, 0))  # Ensure lambda >= 0
  
  d <- data.frame(
    x1 = uniform_var, 
    x2 = poisson_var
    )
  
  d <- d |>  
    mutate(
      y = 1 + 2*x1 + 3*x2 + runif(n=n, min=-10, max=10)
    )
  
  return(d)
}

data_sample <- correlated_data(n=100, unif_min=0, unif_max=10, lambda=2, rho=.7)

model <- lm(y ~ x1 + x2, data = data_sample)
summary(model)

Call:
lm(formula = y ~ x1 + x2, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7855 -4.4056 -0.2029  4.3853 10.1673 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7356     1.1696   0.629    0.531    
x1            1.9463     0.1863  10.447  < 2e-16 ***
x2            3.1970     0.3562   8.975 2.18e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.442 on 97 degrees of freedom
Multiple R-squared:  0.7237,    Adjusted R-squared:  0.718 
F-statistic:   127 on 2 and 97 DF,  p-value: < 2.2e-16

But, as each time before, we have taken only a single sample out of the very many that are possible. Let’s see what happens across many repeated samples.

coefficient_x1_ <- rep(NA, 10000)
coefficient_x2_ <- rep(NA, 10000) 

for(i in 1:10000) { 
  data <- correlated_data(n=100, unif_min=0, unif_max=10, lambda=2, rho=.9)
  model <- lm(y ~ x1 + x2, data = data) 
  coefficient_x1_[i] <- coef(model)['x1']
  coefficient_x2_[i] <- coef(model)['x2']
}
plot_x1 <- 
  ggplot() + 
  aes(x=coefficient_x1_) + 
  geom_histogram()

plot_x2 <- 
  ggplot() + 
  aes(x=coefficient_x2_) + 
  geom_histogram() 

plot_x1_x2 <- 
  ggplot() + 
  aes(x=coefficient_x1_, y=coefficient_x2_) + 
  geom_density_2d()

plot_x1 | plot_x2 | plot_x1_x2
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

9.6 Understanding Uncertainty (Part II)

Imagine three different regression models, each of the following form:

\[ Y = 0 + \beta X + \epsilon \]

The only difference is in the error term. The conditional distribution is given by:

Model   Distribution of \(\epsilon\) conditional on \(X\)
-----   -------------------------------------------------
A       Uniform on \([-0.5, +0.5]\)
B       Uniform on \([-|X|, +|X|]\)
C       Uniform on \([-1 + |X|, 1 - |X|]\)

A is what we call a homoskedastic distribution: the spread of \(\epsilon\) does not depend on \(X\). B and C are what we call heteroskedastic: the spread of \(\epsilon\) changes with \(X\). Below, we define R functions that simulate draws from these three distributions.

rA <- function(n, slope=0){
  x       = runif(n, min = -1,  max = 1)
  epsilon = runif(n, min = -.5, max = .5)
  y       = 0 + slope*x + epsilon
  
  return(
    data.frame(x=x, y=y)
    )
}

rB <- function(n, slope=0){
  x       = runif(n, min = -1, max = 1)
  epsilon = runif(n, min = -abs(x), max = abs(x))
  y       = 0 + slope*x + epsilon
  
  return(
    data.frame(x=x,y=y)
    )
}

rC <- function(n, slope=0){
  x       = runif(n, min = -1, max = 1)
  epsilon = runif(n, min = -1 + abs(x), max = 1 - abs(x))
  y       = 0 + slope*x + epsilon

  return(
    data.frame(x=x,y=y)
    )
}
data <- rbind( 
  data.frame( rA(200), label = 'A'),
  data.frame( rB(200), label = 'B'),
  data.frame( rC(200), label = 'C'))
data %>% 
  ggplot(aes(x=x, y=y)) + 
  geom_point() + 
  lims(
    x = c(-2,2), 
    y = c(-1,1)) + 
  labs(title = 'Samples Drawn from Three Distributions') + 
  facet_grid(rows=vars(label))

9.6.1 Question 1

The following code draws a sample from distribution A, fits a regression line, and plots it. Run it a few times to see what happens. Now explain how you would visually estimate the standard error of the slope coefficient. Why is this standard error important?

data <-  rA(10, slope=0)

data %>% 
  ggplot() + 
  aes(x=x, y=y) + 
  geom_point() + 
  geom_smooth(method='lm', formula = 'y ~ x', se=FALSE) + 
  lims(
    x = c(-2,2), 
    y = c(-1,1)) + 
  labs(title = 'Regression Fit to Distribution A')

data_points <- 200

base_plot_a <- rA(10) %>%  
  ggplot() + 
  aes(x=x, y=y) + 
  geom_point() + 
  scale_x_continuous(limits = c(-3, 3)) + 
  scale_y_continuous(limits = c(-1, 1))

for(i in 1:100) { 
    base_plot_a <- base_plot_a + rA(data_points) %>% 
      stat_smooth(
        mapping = aes(x=x, y=y), 
        method  = 'lm',         se = FALSE, 
        formula = 'y~x', fullrange = TRUE,
        color   = 'grey',    alpha = 0.5,
        linewidth    = 0.5
      )
}

base_plot_b <- rB(10) %>%  
  ggplot() + 
  aes(x=x, y=y) + 
  geom_point() + 
  scale_x_continuous(limits = c(-3, 3)) + 
  scale_y_continuous(limits = c(-1, 1))

for(i in 1:100) { 
    base_plot_b <- base_plot_b + rB(data_points) %>% 
      stat_smooth(
        mapping = aes(x=x, y=y), 
        method  = 'lm',         se = FALSE, 
        formula = 'y~x', fullrange = TRUE,
        color   = 'grey',    alpha = 0.5,
        linewidth    = 0.5
      )
  }

base_plot_c <- rC(10) %>%  
  ggplot() + 
  aes(x=x, y=y) + 
  geom_point() + 
  scale_x_continuous(limits = c(-3, 3)) + 
  scale_y_continuous(limits = c(-1, 1))

for(i in 1:100) { 
    base_plot_c <- base_plot_c + rC(data_points) %>% 
      stat_smooth(
        mapping = aes(x=x, y=y), 
        method  = 'lm',         se = FALSE, 
        formula = 'y~x', fullrange = TRUE,
        color   = 'grey',    alpha = 0.5,
        linewidth    = 0.5
      )
}

base_plot_a + base_plot_b + base_plot_c

9.6.2 Question 2

You have a sample from each distribution, A, B, and C, and you fit a regression of Y on X in each. Which will have the highest standard error for the slope coefficient? Which will have the lowest? Why? (You may want to experiment with the functions defined above.)

9.6.3 Question 3

For distribution A, perform a simulated experiment. Draw a large number of samples, and for each sample fit a linear regression. Store the slope coefficient from each regression in a vector. Finally, compute the standard deviation for the slope coefficients.

Repeat this process for distributions B and C. Do the results match your intuition?
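Here is one way the simulation might be set up, shown as a sketch for distribution A using the rA() helper defined above; distributions B and C only require swapping in rB() or rC().

# sketch: simulated sampling distribution of the slope under distribution A
n_sims   <- 1000
slopes_a <- rep(NA, n_sims)

for(i in 1:n_sims) { 
  sim_data    <- rA(n = 100, slope = 0)
  sim_model   <- lm(y ~ x, data = sim_data) 
  slopes_a[i] <- coef(sim_model)['x']
}

# the standard deviation of the simulated slopes is the simulated standard error
sd(slopes_a)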

9.7 Understanding Uncertainty

Under the stricter assumption of constant error variance (homoskedasticity), the variance of a slope coefficient is given by

\[ V(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1-R_j^2)} \]

A similar formulation is given in FOAS as definition 4.2.3,

\[ \hat{V}_{C}[\hat{\beta}] = \hat{\sigma}^2 \left( \mathbb{X}^{T} \mathbb{X} \right)^{-1}, \] where \(\hat{\sigma}^{2} = V[\hat{\epsilon}]\)

Explain why each term makes the variance higher or lower:

  • \(\hat{\sigma}^2\) is the estimated variance of the residuals \(\hat{\epsilon}\)
  • \(SST_j\) is the (unscaled) variance of \(X_j\)
  • \(R_j^2\) is the \(R^2\) from a regression of \(X_j\) on the other \(X\)’s
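To make these terms concrete, here is a sketch that rebuilds the classical (homoskedasticity-assuming) variance of the x1 coefficient from its pieces, using the data_two_x() simulation from earlier, and compares it with what lm() reports. (The exact numbers change run to run because the data are simulated.)

# rebuild V(beta_hat_x1) = sigma^2 / (SST_1 * (1 - R_1^2)) by hand
d     <- data_two_x(n = 100)
model <- lm(y ~ x1 + x2, data = d)

sigma_sq <- sum(resid(model)^2) / model$df.residual   # estimated error variance
sst_1    <- sum((d$x1 - mean(d$x1))^2)                # unscaled variance of x1
r_sq_1   <- summary(lm(x1 ~ x2, data = d))$r.squared  # R^2 of x1 on the other X's

sqrt(sigma_sq / (sst_1 * (1 - r_sq_1)))  # standard error assembled by hand
sqrt(diag(vcov(model)))['x1']            # classical standard error from lm()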

9.8 R Exercise

Real Estate in Boston

The file hprice1.RData contains 88 observations of homes in the Boston area, taken from the real estate pages of the Boston Globe during 1990. This data was provided by Wooldridge.

load('data/hprice1.RData') # provides 3 objects 

Last week, we fit a regression of price on square feet.

model_one <- lm(price ~ sqrft, data = data)
model_one$df.residual
[1] 86

Can you use the pieces that you’re familiar with to produce a p-value using robust standard errors?

library(sandwich)  # provides vcovHC()
library(lmtest)    # provides coeftest(), used below to check our answer

regression_p_value <- function(model, variable) { 
  ## this function takes a model 
  ## and computes a test-statistic, 
  ## then compares that test-statistic against the 
  ## appropriate t-distribution
  
  ## you can use the following helper functions: 
  ##  - coef()
  ##  - vcovHC()
  df <- model$df.residual
  
  # numerator   <- 'fill this in'
  # denominator <- 'fill this in'
  # p_val_      <- 'fill this in'
  
  numerator   <- coef(model)[variable]
  denominator <- sqrt(diag(vcovHC(model)))[variable]
  
  test_stat_  <- numerator / denominator
  # two-sided p-value: use |t| so the test also works for negative coefficients
  p_val_      <- pt(abs(test_stat_), df = df, lower.tail = FALSE) * 2
  
  return(p_val_)
}

If you want to confirm that what you have written is correct, you can compare against the value that you receive from the line below.

coeftest(model_one, vcov. = vcovHC(model_one))

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.204145  39.450563  0.2840    0.7771    
sqrft        0.140211   0.021111  6.6417 2.673e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(testthat)  # provides test_that() and expect_equal()

p_value_ <- broom::tidy(coeftest(model_one, vcov. = vcovHC(model_one))) %>% 
  filter(term == 'sqrft') %>% 
  select('p.value') %>% 
  as.numeric()

test_that(
  'test that hand coded p-value is the same as the pre-rolled', 
  expect_equal(
    object   = as.numeric(regression_p_value(model_one, 'sqrft')), 
    expected = p_value_
  )
)
Test passed 🥇

Questions

  1. Estimate a new model (and save it into another object) that includes the size of the lot and whether the house is a colonial. This will estimate the model:

\[ price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial? + e \]

  • BUT BEFORE YOU DO, make a prediction: What do you think is going to happen to the coefficient that relates square footage and price?
    • Will the coefficient increase, decrease, or stay the same?
    • Will the uncertainty about the coefficient increase, decrease, or stay the same?
    • Conduct an F-test that evaluates whether the model as a whole does better when the coefficients on colonial and lotsize are estimated freely, rather than restricted to be zero (i.e. \(\beta_{2} = \beta_{3} = 0\)).
  2. Use the function vcovHC from the sandwich package to estimate (a) the heteroskedasticity-consistent (i.e. “robust”) variance-covariance matrix; and (b) the robust standard errors for the intercept and slope of this regression. Recall: what is the relationship between the VCOV and the SE in a regression?
  3. Perform a hypothesis test to check whether the population relationship between sqrft and price is zero. Use coeftest() with the robust standard errors computed above.
  4. Use the robust standard error and qt() to compute a 95% confidence interval for the coefficient on sqrft in the second model that you estimated, \(price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial + e\). (See the sketch after this list for one way these pieces fit together.)

  5. Bootstrap. The book talks briefly about bootstrapping, which is the process of resampling the data with replacement and refitting the model. The idea behind the bootstrap is that, since the data were generated by an iid sample from the population, you can simulate re-running your analysis by drawing repeated samples (with replacement) from the data that you have.
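Before turning to the bootstrap, here is a sketch of how questions 1 through 4 might fit together. The name model_two is introduced here only for illustration, and the F-test is shown using lmtest::waldtest() with a robust variance estimator; treat this as one reasonable path rather than the required one.

# Q1: the expanded model, plus a robust F-test of beta_2 = beta_3 = 0
model_two <- lm(price ~ sqrft + lotsize + colonial, data = data)
waldtest(model_one, model_two, vcov = vcovHC, test = 'F')

# Q2: robust variance-covariance matrix; the robust standard errors are the
#     square roots of its diagonal
vcov_robust <- vcovHC(model_two)
se_robust   <- sqrt(diag(vcov_robust))

# Q3: test each coefficient against zero using the robust VCOV
coeftest(model_two, vcov. = vcov_robust)

# Q4: a 95% confidence interval for the sqrft coefficient
coef(model_two)['sqrft'] + 
  c(-1, 1) * qt(0.975, df = model_two$df.residual) * se_robust['sqrft']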

Below is code that conducts a bootstrap estimate of the uncertainty in the sqrft coefficient, for the simple regression of price on square footage.

bootstrap_sqft <- function(d = data, number_of_bootstraps = 1000) { 
  number_of_rows <- nrow(d)
  
  coef_sqft <- rep(NA, number_of_bootstraps)
  
  for(i in 1:number_of_bootstraps) { 
    # resample rows with replacement, then refit the simple regression
    bootstrap_data  <- d[sample(x=1:number_of_rows, size=number_of_rows, replace=TRUE), ]  
    estimated_model <- lm(price ~ sqrft, data = bootstrap_data)
    coef_sqft[i]    <- coef(estimated_model)['sqrft']
  }
  return(coef_sqft)
}
bootstrap_result <- bootstrap_sqft(d = data, number_of_bootstraps = 10000)

With this, it is possible to plot the distribution of these regression coefficients:

ggplot() + 
  aes(x = bootstrap_result) + 
  geom_histogram() + 
  labs(
    x = 'Estimated Coefficient', 
    y = 'Count', 
    title = 'Bootstrap coefficients for square footage'
  )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Compute the standard deviation of the bootstrapped regression coefficients. How does this compare to the robust standard errors you computed above?

coeftest(model_one, vcov. = vcovHC(model_one))

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.204145  39.450563  0.2840    0.7771    
sqrft        0.140211   0.021111  6.6417 2.673e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sd(bootstrap_result)
[1] 0.01979279