12.4 The Classical Linear Model

Comparing the Large Sample Model and the CLM

12.4.1 Part 1

  • We say that in small samples, more needs to be true of our data for OLS regression to “work.”
    • What do we mean when we say “work”?
      • If our goals are descriptive, how is a “working” estimator useful?
      • If our goals are explanatory, how is a “working” estimator useful?
      • If our goals are predictive, are the requirements the same?

12.4.2 Part 2

  • Suppose that you’re interested in understanding how subsidized school meals benefit under-resourced students in the San Francisco East Bay region.
    • Using the tools from DATASCI 201, refine this question into a data science question.
    • Suppose that there exist two possible data sources to answer the question you have formed:
      • A large amount (e.g. 10,000 data points) of individual-level data about income, nutrition and test scores, self-reported by individual families who have opted in to the study.
      • A relatively smaller amount (e.g. 500 data points) of government data about school district characteristics, including district-level college achievement, county-level home prices, and state-level tax receipts.
  • What are the tradeoffs of using one data source rather than the other?

12.4.3 Part 3

  • Suppose you elect to use the relatively larger sample of individual-level data.
    • Which of the large-sample assumptions do you expect are valid, and which are problematic?
  • Or, suppose that you elect to use the relatively smaller sample of school-district-level data.
    • Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
  • What was the research question that you identified?
  • What would a successful answer accomplish?

12.4.4 Part 4

  • Which data source, the individual-level or the district-level, do you think is more likely to produce a successful answer?

12.4.5 Part 5

Problems with the CLM Requirements

  • There are five requirements for the CLM:

    1. IID Sampling
    2. Linear Conditional Expectation
    3. No Perfect Collinearity
    4. Homoskedastic Errors
    5. Normally Distributed Errors
  • For each of these requirements:

    • Identify one concrete way that the data might not satisfy the requirement.
    • Identify what the consequence of failing to satisfy the requirement would be.
    • Identify a path forward to satisfy the requirement.
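
As a starting point for the exercise above, here is a minimal sketch of how three of the five requirements could be probed numerically. Everything in it is an assumption made for illustration: the data are simulated rather than drawn from either data source, the helper names (`ols_fit`, `has_perfect_collinearity`, `breusch_pagan_lm`, `jarque_bera`) are hypothetical, and the Breusch-Pagan and Jarque-Bera statistics are just one common choice of diagnostic. IID sampling generally cannot be verified from the data alone and has to be argued from how the data were collected, and a linear conditional expectation is usually assessed with a residual-versus-fitted plot, so neither is covered here.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients and residuals; X must already include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def has_perfect_collinearity(X):
    """True when some column of X is an exact linear combination of the others."""
    return bool(np.linalg.matrix_rank(X) < X.shape[1])

def breusch_pagan_lm(X, resid):
    """LM statistic n * R^2 from regressing squared residuals on X (large values suggest heteroskedasticity)."""
    u2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    ss_res = np.sum((u2 - X @ gamma) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * (1.0 - ss_res / ss_tot)

def jarque_bera(resid):
    """Jarque-Bera statistic built from residual skewness and kurtosis (large values suggest non-normality)."""
    n = len(resid)
    c = resid - resid.mean()
    s2 = np.mean(c ** 2)
    skew = np.mean(c ** 3) / s2 ** 1.5
    kurt = np.mean(c ** 4) / s2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])  # intercept plus one regressor

# Outcome 1: homoskedastic, normal errors (satisfies requirements 4 and 5).
y_clm = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
# Outcome 2: error spread grows with x (violates requirement 4).
y_het = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n) * 0.5 * x
# Outcome 3: mean-zero but right-skewed errors (violates requirement 5).
y_skew = 2.0 + 0.5 * x + (rng.exponential(1.0, size=n) - 1.0)

beta_clm, resid_clm = ols_fit(X, y_clm)
_, resid_het = ols_fit(X, y_het)
_, resid_skew = ols_fit(X, y_skew)

bp_clm = breusch_pagan_lm(X, resid_clm)
bp_het = breusch_pagan_lm(X, resid_het)
jb_clm = jarque_bera(resid_clm)
jb_skew = jarque_bera(resid_skew)

# Requirement 3: adding a column that is exactly 2 * x makes X rank deficient.
X_bad = np.column_stack([X, 2.0 * x])

print("collinear design detected:", has_perfect_collinearity(X_bad))
print("Breusch-Pagan LM (homoskedastic vs. heteroskedastic):", bp_clm, bp_het)
print("Jarque-Bera (normal vs. skewed errors):", jb_clm, jb_skew)
```

With one regressor, the Breusch-Pagan statistic is usually compared against a chi-squared distribution with 1 degree of freedom (5% critical value of about 3.84), and the Jarque-Bera statistic against a chi-squared distribution with 2 degrees of freedom (about 5.99). These diagnostics only flag a possible violation. Deciding on the consequence and a path forward, such as robust standard errors or a transformed outcome, is still the substance of the exercise.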