12  The Classical Linear Model

12.1 Learning Objectives

At the end of this week’s learning, students will be able to:

  1. Describe the assumptions of the classical linear model (sometimes referred to as the Gauss-Markov Assumptions) and what each assumption contributes to the estimator.
  2. Evaluate, using empirical methods, whether each of the assumptions is likely to be true of the population data generating function.
  3. Assess whether the guarantees that are provided by the classical linear model’s requirements are likely to ever be true, including within data the student is likely to encounter.

12.2 Class Announcements

  • Lab 2 Deliverable and Dates
    • Research Proposal (Today)
    • Within-Team Review (Week 14)
    • Final Report (Week 14)
    • Final Presentation (Week 14)

12.3 Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world.
  • The BLP is a useful summary for a relationship among random variables.
  • OLS regression is an estimator for the Best Linear Predictor (BLP).
  • For a large sample, we only need two mild assumptions to work with OLS
    • To know coefficients are consistent
    • To have valid standard errors, hypothesis tests

Today

  • The Classical Linear Model (CLM) allows us to apply regression to smaller samples.
  • The CLM requires more to be true of the data generating process, to make coefficients, standard errors, and tests meaningful in small samples.
  • Understanding whether the data meet these requirements (often called assumptions) requires considerable care.

Looking Ahead

  • The CLM – and the methods that we use to evaluate the CLM – are the basis of advanced models (inter alia time-series)
  • (Week 13) In regression studies (and other studies), false discovery is a widespread problem. Understanding its causes can make you a better member of the scientific community.

12.4 The Classical Linear Model

Comparing the Large Sample Model and the CLM

12.4.1 Part 1

  • We say that in small samples, more needs to be true of our data for OLS regression to “work.”
    • What do we mean when we say “work”?
      • If our goals are descriptive, how is a “working” estimator useful?
      • If our goals are explanatory, how is a “working” estimator useful?
      • If our goals are predictive, are the requirements the same?

12.4.2 Part 2

  • Suppose that you’re interested in understanding how subsidized school meals benefit under-resourced students in the San Francisco East Bay region.
    • Using the tools from DATASCI 201, refine this question to a data science question.
    • Suppose that there exist two possible data sources to answer the question you have formed:
      • A large amount (e.g. 10,000 data points) of individual-level data about income, nutrition and test scores, self-reported by individual families who have opted in to the study.
      • A relatively smaller amount (e.g. 500 data points) of government data about school district characteristics, including district-level college achievement, county-level home prices, and state-level tax receipts.
  • What are the tradeoffs to using one or the other data source?

12.4.3 Part 3

  • Suppose you elect to use the relatively larger sample of individual-level data.
    • Which of the large-sample assumptions do you expect are valid, and which are problematic?
  • Or, suppose that you elect to use the relatively smaller sample of school-district-level data.
    • Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
  • What was the research question that you identified?
  • What would a successful answer accomplish?

12.4.4 Part 4

  • Which data source, the individual or the district-level, do you think is more likely to produce a successful answer?

12.4.5 Part 5

Problems with the CLM Requirements

  • There are five requirements for the CLM

    1. IID Sampling
    2. Linear Conditional Expectation
    3. No Perfect Collinearity
    4. Homoskedastic Errors
    5. Normally Distributed Errors
  • For each of these requirements:

    • Identify one concrete way that the data might not satisfy the requirement.
    • Identify what the consequence of failing to satisfy the requirement would be.
    • Identify a path forward to satisfy the requirement.

12.5 R Exercise

library(tidyverse)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)
library(stargazer)

If you haven’t used the mtcars dataset, you haven’t been through an intro applied stats class!

In this analysis, we will use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The dataset is automatically available when you start R.

For more information about the dataset, use the R command: help(mtcars)

data(mtcars)
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

12.5.1 Questions:

  1. Using the mtcars data, quickly reason about the variables that we’re interested in studying:
mtcars %>% 
  ggplot() + 
  aes(x=mpg) +
  geom_histogram(bins=10)

mtcars |> pairs()

mtcars %>% 
  select(mpg, disp, hp, wt, drat) %>% 
  pairs(pch=19)

  2. Using the mtcars data, run a linear regression that models miles per gallon (mpg), on the left-hand side, as a function of displacement (disp), gross horsepower (hp), weight (wt), and rear axle ratio (drat) on the right-hand side. That is, fit a regression of the following form:

\[ \widehat{mpg} = \hat{\beta}_{0} + \hat{\beta}_{1} disp + \hat{\beta}_{2} horse\_power + \hat{\beta}_{3} weight + \hat{\beta}_{4} drive\_ratio \]

# Note: this fit also includes a squared weight term, I(wt^2), which the equation above omits.
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)
  3. For each of the following CLM assumptions, assess whether the assumption holds. Where possible, demonstrate multiple ways of assessing an assumption. When an assumption appears violated, state what steps you would take in response.
    • I.I.D. data
    • No perfect collinearity
    • Linear conditional expectation
    • Homoskedastic errors
    • Normally distributed errors
# I.I.D. data
# goal:
# consequence if violated:
# View(mtcars)
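
Independence cannot be tested directly from the data, but one informal check is to look for groups of observations that are plausibly related. The sketch below assumes that cars from the same manufacturer may share design choices. The `make` vector is a hypothetical helper built from the first word of each row name; it is only a rough proxy (for example, it treats "Hornet" as a manufacturer) and is not part of the original exercise.

```r
# Informal I.I.D. check: are some cars plausibly related to one another?
data(mtcars)

# Hypothetical helper: use the first word of each row name as a rough manufacturer label.
make <- sub(" .*", "", rownames(mtcars))

# Count cars per label; large counts hint at clustered, non-independent observations.
make_counts <- sort(table(make), decreasing = TRUE)
print(make_counts)
```

Seven of the 32 rows are Mercedes ("Merc") models, which is one concrete reason to doubt that these cars are independent draws from a population of automobiles.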
# Linear conditional expectation
# goal: we estimate the "right model"
# consequence if violated:
# residuals (y-axis) against fitted values (x-axis)
plot(predict(model), resid(model))

# plot(model)

# plot(mtcars$wt, resid(model))

# class(model)
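
One possible way to extend the residual plot above is to add a smoother and to compare nested models. This is a sketch, not the intended solution: `model_linear_wt` is a hypothetical comparison model that drops the squared weight term, and the F-test it feeds relies on the other CLM requirements, so treat it as suggestive only.

```r
# Residuals-versus-fitted check for a linear conditional expectation.
data(mtcars)
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)

# Fitted values on the x-axis, residuals on the y-axis.
plot(fitted(model), resid(model),
     xlab = "Fitted mpg", ylab = "Residual", pch = 19)
abline(h = 0, lty = 2)

# A smoother that stays close to zero is consistent with a linear conditional expectation.
lines(lowess(fitted(model), resid(model)), col = "red")

# Hypothetical comparison model without the squared weight term.
model_linear_wt <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)

# F-test of the restricted model against the model with I(wt^2).
quad_test <- anova(model_linear_wt, model)
print(quad_test)
p_quad <- quad_test[["Pr(>F)"]][2]
```

Because only one term is added, this F-test reproduces the p-value of the t-test on `I(wt^2)` in the regression summary.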
# No perfect collinearity
# goal:
# consequence if violated:

summary(model)

Call:
lm(formula = mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9378 -1.4353 -0.8102  1.5695  5.3393 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  44.709966   8.037242   5.563 7.66e-06 ***
disp         -0.002895   0.010018  -0.289 0.774904    
hp           -0.025661   0.010947  -2.344 0.026990 *  
wt          -10.164720   2.637504  -3.854 0.000684 ***
I(wt^2)       0.950808   0.348886   2.725 0.011340 *  
drat          0.498207   1.274372   0.391 0.699024    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.339 on 26 degrees of freedom
Multiple R-squared:  0.8737,    Adjusted R-squared:  0.8494 
F-statistic: 35.97 on 5 and 26 DF,  p-value: 6.977e-11
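
The summary above reports an estimate for every coefficient, which is what `lm()` does when no regressor is an exact linear combination of the others. A minimal sketch of a more explicit check follows. The `wt_kg` column is a hypothetical redundant variable (weight converted from thousands of pounds to kilograms), added only to show what a violation looks like.

```r
# Check for perfect collinearity in the fitted model.
data(mtcars)
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)

# The design matrix has full column rank when no column is redundant.
X <- model.matrix(model)
full_rank <- qr(X)$rank == ncol(X)

# lm() reports NA for any coefficient it had to drop.
any_dropped <- anyNA(coef(model))
print(c(full_rank = full_rank, any_dropped = any_dropped))

# Hypothetical violation: weight in kg is an exact multiple of weight in 1000 lbs.
mtcars_kg <- transform(mtcars, wt_kg = wt * 453.592)
model_redundant <- lm(mpg ~ wt + wt_kg, data = mtcars_kg)

# The redundant regressor receives an NA coefficient.
print(coef(model_redundant))
```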
# Homoskedastic errors
# goal:
# consequence if violated:

# We're sure they're not constant!

vcovHC(model, type = "const")
             (Intercept)          disp            hp            wt
(Intercept)  64.59725500 -2.014384e-02  2.502973e-02 -1.665306e+01
disp         -0.02014384  1.003602e-04 -7.134494e-05 -1.817698e-04
hp            0.02502973 -7.134494e-05  1.198420e-04 -7.251315e-03
wt          -16.65306393 -1.817698e-04 -7.251315e-03  6.956429e+00
I(wt^2)       1.99212647 -8.590083e-04  1.167804e-03 -8.558110e-01
drat         -8.99211329  4.928814e-03 -4.445799e-03  1.398580e+00
                  I(wt^2)         drat
(Intercept)  1.9921264710 -8.992113294
disp        -0.0008590083  0.004928814
hp           0.0011678043 -0.004445799
wt          -0.8558109527  1.398580259
I(wt^2)      0.1217211286 -0.162563317
drat        -0.1625633168  1.624023962
vcovHC(model, type = "HC2")
            (Intercept)          disp            hp           wt      I(wt^2)
(Intercept) 25.25670069 -1.290457e-02  1.462811e-02 -8.255767929  1.066319640
disp        -0.01290457  1.043367e-04 -9.728375e-05  0.003103165 -0.001286184
hp           0.01462811 -9.728375e-05  1.498401e-04 -0.005332482  0.001216345
wt          -8.25576793  3.103165e-03 -5.332482e-03  4.403737911 -0.630955601
I(wt^2)      1.06631964 -1.286184e-03  1.216345e-03 -0.630955601  0.105757194
drat        -2.57381895  2.188569e-03 -3.162583e-03  0.288839880 -0.022392165
                    drat
(Intercept) -2.573818955
disp         0.002188569
hp          -0.003162583
wt           0.288839880
I(wt^2)     -0.022392165
drat         0.514261588
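
The two covariance matrices above are easier to compare as standard errors. The sketch below, which assumes the `lmtest` and `sandwich` packages loaded earlier, places the classical and HC2 standard errors side by side and adds a Breusch-Pagan test. With only 32 observations the test has limited power, so a large p-value would not by itself establish homoskedasticity.

```r
library(lmtest)
library(sandwich)

data(mtcars)
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)

# Breusch-Pagan test: the null hypothesis is constant error variance.
bp <- bptest(model)
print(bp)

# Standard errors are the square roots of the diagonal of each covariance matrix.
se_compare <- cbind(
  classical  = sqrt(diag(vcovHC(model, type = "const"))),
  robust_hc2 = sqrt(diag(vcovHC(model, type = "HC2")))
)
print(round(se_compare, 4))
```

The `classical` column matches the `Std. Error` column of `summary(model)`, since `type = "const"` imposes a constant error variance.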
# Normally distributed errors
# goal:
# consequence if violated:

hist(resid(model))

mean(resid(model))
[1] -4.163336e-17
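
The mean of the residuals is zero by construction whenever the model includes an intercept, so it says nothing about their shape. Two common additions to the histogram are a normal Q-Q plot and a Shapiro-Wilk test, sketched below. Both are offered as possible checks rather than as the intended solution.

```r
# Assess whether the residuals look normally distributed.
data(mtcars)
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)
res <- resid(model)

# Q-Q plot: points close to the reference line are consistent with normality.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw)
```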
  4. In addition to the above, assess to what extent (imperfect) collinearity is affecting your inference.

  5. Interpret the coefficient on horsepower.

  6. Perform a hypothesis test to assess whether rear axle ratio has an effect on mpg. What assumptions need to be true for this hypothesis test to be informative? Are they?

  7. Choose variable transformations (if any) for each variable, and try to better meet the assumptions of the CLM (while also maintaining the readability of your model).

  8. (As time allows) report the results of both models in a nicely formatted regression table.
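
One possible starting point for the remaining questions is sketched below, using the packages loaded at the top of the exercise. The log specification in `model_log` is a hypothetical choice made only for illustration, and the use of HC2 standard errors is likewise an assumption. Neither is presented as the expected answer.

```r
library(car)
library(lmtest)
library(sandwich)
library(stargazer)

data(mtcars)
model <- lm(mpg ~ disp + hp + wt + I(wt^2) + drat, data = mtcars)

# Imperfect collinearity: variance inflation factors for each term.
vifs <- vif(model)
print(vifs)

# Hypothesis test on rear axle ratio using heteroskedasticity-robust standard errors.
robust_tests <- coeftest(model, vcov. = vcovHC(model, type = "HC2"))
print(robust_tests)
p_drat <- robust_tests["drat", "Pr(>|t|)"]

# Hypothetical transformed model: logs of the strictly positive variables.
model_log <- lm(log(mpg) ~ log(disp) + log(hp) + log(wt) + drat, data = mtcars)

# Both models in one plain-text regression table.
stargazer(model, model_log, type = "text")
```

Large variance inflation factors on `wt` and `I(wt^2)` are expected here, because a variable and its square are strongly correlated. That inflates their standard errors without violating the no-perfect-collinearity requirement.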