12.4 The Classical Linear Model
Comparing the Large Sample Model and the CLM
12.4.1 Part 1
- We say that in small samples, more needs to be true of our data for OLS regression to “work.”
- What do we mean when we say “work”?
- If our goals are descriptive, how is a “working” estimator useful?
- If our goals are explanatory, how is a “working” estimator useful?
- If our goals are predictive, are the requirements the same?
12.4.2 Part 2
- Suppose that you’re interested in understanding how subsidized school meals benefit under-resourced students in the San Francisco East Bay region.
- Using the tools from DATASCI 201, refine this question into a data science question.
- Suppose that there exist two possible data sources to answer the question you have formed:
- A large amount (e.g. 10,000 data points) of individual-level data about income, nutrition and test scores, self-reported by individual families who have opted in to the study.
- A relatively smaller amount (e.g. 500 data points) of government data about school district characteristics, including district-level college achievement, county-level home prices, and state-level tax receipts.
- What are the tradeoffs of using one data source rather than the other?
12.4.3 Part 3
- Suppose you elect to use the relatively larger sample of individual-level data.
- Which of the large-sample assumptions do you expect are valid, and which are problematic?
- Or, suppose that you elect to use the relatively smaller sample of school-district-level data.
- Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
- What was the research question that you identified?
- What would a successful answer accomplish?
12.4.4 Part 4
- Which data source, the individual or the district-level, do you think is more likely to produce a successful answer?
12.4.5 Part 5
Problems with the CLM Requirements
There are five requirements for the CLM:
- IID Sampling
- Linear Conditional Expectation
- No Perfect Collinearity
- Homoskedastic Errors
- Normally Distributed Errors
For each of these requirements:
- Identify one concrete way that the data might not satisfy the requirement.
- Identify what the consequence of failing to satisfy the requirement would be.
- Identify a path forward to satisfy the requirement.
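One way to make these questions concrete is to simulate data that breaks a requirement on purpose and see what a simple diagnostic shows. The following is a minimal Python sketch under stated assumptions, not a prescribed procedure: the sample size, coefficients, error structure, and the rescaled extra column are all hypothetical values chosen for illustration. It builds errors whose spread grows with the predictor (violating homoskedastic errors) and a design matrix with a redundant column (violating no perfect collinearity).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical sample size, echoing the smaller data source

# Hypothetical predictor, with errors whose spread grows with x
x = rng.uniform(1.0, 10.0, size=n)
errors = rng.normal(0.0, 1.0, size=n) * x  # sd proportional to x
y = 2.0 + 0.5 * x + errors  # hypothetical true coefficients

# Fit OLS by least squares on the design matrix [1, x]
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Crude homoskedasticity check: compare residual variance at low and high x
median_x = np.median(x)
low = residuals[x < median_x]
high = residuals[x >= median_x]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)

# Perfect collinearity: adding a rescaled copy of x leaves the rank unchanged
X_collinear = np.column_stack([X, 3.0 * x])
rank_full = np.linalg.matrix_rank(X)
rank_collinear = np.linalg.matrix_rank(X_collinear)

print(f"residual variance ratio (high x / low x): {variance_ratio:.2f}")
print(f"rank of X: {rank_full} with {X.shape[1]} columns")
print(f"rank with redundant column: {rank_collinear} with {X_collinear.shape[1]} columns")
```

A variance ratio far from 1 suggests the error variance depends on the predictor, and a rank smaller than the number of columns means the coefficients are not uniquely determined. Neither check is a formal test; they are only meant to ground the discussion of consequences and paths forward.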