12.4 The Classical Linear Model

Comparing the Large Sample Model and the CLM

We say that in small samples, more needs to be true of our data for OLS regression to “work.”
- What do we mean when we say “work”?
  - If our goals are descriptive, how is a “working” estimator useful?
  - If our goals are explanatory, how is a “working” estimator useful?
  - If our goals are predictive, are the requirements the same?

Suppose that you’re interested in understanding how subsidized school meals benefit under-resourced students in the San Francisco East Bay region.
- Using the tools from DATASCI 201, refine this question to a data science question.
- Suppose that there exist two possible data sources to answer the question you have formed:
  - A large amount (e.g. 10,000 data points) of individual-level data about income, nutrition and test scores, self-reported by individual families who have opted in to the study.
  - A relatively smaller amount (e.g. 500 data points) of government data about school district characteristics, including district-level college achievement, county-level home prices, and state-level tax receipts.
What are the tradeoffs of using one or the other data source?

Suppose you elect to use the relatively larger sample of individual-level data.
- Which of the large-sample assumptions do you expect are valid, and which are problematic?
Or, suppose that you elect to use the relatively smaller sample of school-district-level data.
- Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
What was the research question that you identified?
What would a successful answer accomplish?

Which data source, the individual or the district-level, do you think is more likely to produce a successful answer?

Problems with the CLM Requirements

There are five requirements for the CLM:
1. IID Sampling
2. Linear Conditional Expectation
3. No Perfect Collinearity
4. Homoskedastic Errors
5. Normally Distributed Errors
For each of these requirements:
- Identify one concrete way that the data might not satisfy the requirement.
- Identify what the consequence of failing to satisfy the requirement would be.
- Identify a path forward to satisfy the requirement.
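
As a starting point for the first prompt, here is a minimal sketch of how violations of requirements 3, 4, and 5 might show up in residuals. Everything in it is an assumption made for illustration: the data are simulated, and the coefficients, the sample size of 500, the seed, and the helper names (`fit_simple_ols`, `residual_checks`, `simulate`) are hypothetical. The variance ratio is an informal comparison of residual spread between low and high values of the predictor, not a formal test.

```python
import random
import statistics


def fit_simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor least-squares fit."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # Requirement 3: a constant predictor is perfectly collinear
        # with the intercept, so the slope is not identified.
        raise ValueError("x is constant: perfectly collinear with the intercept")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_checks(x, y):
    """Informal residual summaries related to requirements 4 and 5."""
    intercept, slope = fit_simple_ols(x, y)
    # Sort residuals by x so the two halves correspond to low and high x.
    resid_by_x = sorted((xi, yi - intercept - slope * xi) for xi, yi in zip(x, y))
    resid = [r for _, r in resid_by_x]
    half = len(resid) // 2
    low_ms = statistics.fmean(r * r for r in resid[:half])
    high_ms = statistics.fmean(r * r for r in resid[-half:])
    # Residuals have mean zero (the model includes an intercept),
    # so raw moments serve as central moments.
    m2 = statistics.fmean(r * r for r in resid)
    m3 = statistics.fmean(r ** 3 for r in resid)
    return {
        "slope": slope,
        "variance_ratio": high_ms / low_ms,  # near 1 under homoskedasticity
        "skewness": m3 / m2 ** 1.5,          # near 0 for symmetric errors
    }


def simulate(noise, n=500, seed=203):
    """Simulate y = 2 + 0.5 x + noise (all values are illustrative assumptions)."""
    rng = random.Random(seed)
    x = [rng.uniform(1, 10) for _ in range(n)]
    y = [2 + 0.5 * xi + noise(rng, xi) for xi in x]
    return x, y


scenarios = {
    "homoskedastic normal": lambda rng, xi: rng.gauss(0, 1),
    "heteroskedastic": lambda rng, xi: rng.gauss(0, 0.3 * xi),
    "skewed": lambda rng, xi: rng.expovariate(1) - 1,
}

results = {name: residual_checks(*simulate(noise)) for name, noise in scenarios.items()}

for name, summary in results.items():
    print(
        f"{name:22s} slope={summary['slope']:.3f} "
        f"variance_ratio={summary['variance_ratio']:.2f} "
        f"skewness={summary['skewness']:.2f}"
    )
```

In this sketch, a variance ratio far from 1 hints at a problem with requirement 4, and a skewness far from 0 hints at a problem with requirement 5. Neither summary says anything about requirement 1 or 2, which depend on how the data were collected and on the shape of the conditional expectation.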