Practice Problems

Problems

# theoretical

Problem 1: (Reason about Linear Regression)

This is a question that comes from An Introduction to Statistical Learning.

Suppose we have five predictors:

$X_{1} = GPA$
$X_{2} = IQ$
$X_{3} = EDU$ (1 = College, 0 = High School)
$X_{4} = GPA \times IQ$
$X_{5} = GPA \times EDU$

Further, suppose that the response is starting salary, measured in thousands of dollars, and that we fit the following linear regression model:

$$ \begin{align} \hat{Y} & = \hat{\beta}_0 + \hat{\beta}_1 GPA + \hat{\beta}_2 IQ + \hat{\beta}_3 EDU + \hat{\beta}_4 GPA \times IQ + \hat{\beta}_5 GPA \times EDU \\ \hat{Y} & = 50 + 20 \cdot GPA + 0.07 \cdot IQ + 35 \cdot EDU + 0.01 \cdot GPA \times IQ - 10 \cdot GPA \times EDU \end{align} $$

First, answer the following True/False questions and justify your answers:

For fixed IQ and GPA, HS grads earn more on average than college grads.
For fixed IQ and GPA, college grads earn more on average than HS grads.
For fixed IQ and GPA, HS grads earn more on average than college grads if GPA is high enough.
For fixed IQ and GPA, college grads earn more on average than HS grads if GPA is high enough.

Second, predict the salary of a college grad with IQ = 110 and GPA = 4.0.

Third, is the following statement True or False? Justify your answer.

Since the $GPA \times IQ$ coefficient is small, there is little evidence of interaction.
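To check your arithmetic on this problem, the fitted model can be written as a small function. This is only a sketch: the name `predict_salary` and its argument names are assumptions made for illustration, and the coefficients are copied from the fitted equation above.

```python
def predict_salary(gpa, iq, edu):
    """Predicted starting salary, in thousands of dollars.

    edu is 1 for a college graduate and 0 for a high school graduate.
    Coefficients are taken from the fitted model in the problem statement.
    """
    return (
        50
        + 20 * gpa
        + 0.07 * iq
        + 35 * edu
        + 0.01 * gpa * iq   # GPA x IQ interaction
        - 10 * gpa * edu    # GPA x EDU interaction
    )


# Example: a high school graduate with GPA 3.0 and IQ 100.
print(predict_salary(gpa=3.0, iq=100, edu=0))
```

Calling the function twice with the same `gpa` and `iq` but different `edu` gives the predicted gap between the two groups, which you can compare against your answers to the True/False questions.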

Solution

Moderate

Problem 2: (Linear or Not, Here I Come!)

This is a question that comes from An Introduction to Statistical Learning.

Suppose that you collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. You then fit two models to this data:

$$ \begin{align} Y &= \beta_{0} + \beta_{1} X + \epsilon \\ Y &= \beta_{0} + \beta_{1} X + \beta_{2} X^2 + \beta_{3} X^3 + \epsilon \end{align} $$

Suppose that the true relationship between $X$ and $Y$ is linear, i.e. $Y = \beta_{0} + \beta_{1} X + \epsilon$. Consider the training residual sum of squares (RSS) for the two models. Do you expect one to be lower than the other, about the same, or is there not enough information to tell? Justify your answer.
Answer the previous question again, this time considering the test RSS instead of the training RSS.
Suppose that the true relationship between $X$ and $Y$ is not linear, but we don't know "how far from linear" it is. That is, it might be quadratic, cubic, or something else. Consider the training RSS for the two models. Do you expect one to be lower than the other, about the same, or is there not enough information to tell? Justify your answer.
Consider the test RSS for the two models in the previous question. Do you expect one to be lower than the other, about the same, or is there not enough information to tell? Justify your answer.
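After reasoning through the four questions, you can test your answers empirically with a small simulation. The sketch below is one possible setup, not part of the original problem: the true coefficients, noise level, range of $X$, and the helper names `fit_polynomial`, `rss`, and `simulate` are all assumptions chosen for illustration. It fits the linear and cubic models by least squares on a training sample and reports the RSS on both the training sample and a fresh test sample.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first), via the normal equations."""
    k = degree + 1
    # Augmented matrix [X^T X | X^T y] for a design matrix with columns x^0 .. x^degree.
    aug = [
        [sum(x ** (r + c) for x in xs) for c in range(k)]
        + [sum((x ** r) * y for x, y in zip(xs, ys))]
        for r in range(k)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coefs = [0.0] * k
    for r in range(k - 1, -1, -1):
        known = sum(aug[r][c] * coefs[c] for c in range(r + 1, k))
        coefs[r] = (aug[r][k] - known) / aug[r][r]
    return coefs


def rss(xs, ys, coefs):
    """Residual sum of squares of a fitted polynomial on the data (xs, ys)."""
    return sum(
        (y - sum(b * x ** p for p, b in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )


def simulate(n=100, seed=0):
    """Return {model: (training RSS, test RSS)} when the truth is linear (assumed setup)."""
    rng = random.Random(seed)

    def draw():
        xs = [rng.uniform(-2, 2) for _ in range(n)]
        ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]  # assumed true model
        return xs, ys

    x_train, y_train = draw()
    x_test, y_test = draw()
    results = {}
    for name, degree in (("linear", 1), ("cubic", 3)):
        coefs = fit_polynomial(x_train, y_train, degree)
        results[name] = (rss(x_train, y_train, coefs), rss(x_test, y_test, coefs))
    return results


for model, (train_rss, test_rss) in simulate().items():
    print(f"{model}: training RSS = {train_rss:.2f}, test RSS = {test_rss:.2f}")
```

Rerunning with different `seed` values shows which comparisons hold every time and which vary from sample to sample. To explore the non-linear case, change the line marked as the assumed true model.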

Solution

Moderate

Problem 3: (Lost Intercept)

This is a question that comes from An Introduction to Statistical Learning.

Consider the fitted values that result from estimating a linear regression without an intercept. That is, suppose you fit the model $Y_{i} = \beta X_{i} + \epsilon_{i}$, so that the $i$-th fitted value is $\hat{y}_{i} = \hat{\beta} x_{i}$.

Using our definition of the least squares estimator, the coefficient estimate can be written in summation form:

$$ \begin{align} \hat{\beta} = \frac{\sum_{i=1}^{n} x_{i}y_{i}}{\sum_{i'=1}^{n} x_{i'}^{2}} \end{align} $$

(In this notation, we're just using $i$ and $i'$ to distinguish the two summations, but they are both summing over the same data points.)

Show that we can write the $i$-th fitted value from this regression as: $$\hat{y}_{i} = \sum_{i'=1}^{n} a_{i'}y_{i'}$$

In this format, what is $a_{i'}$?

Note that the fitted values from a linear regression are linear combinations of the response values.
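The estimator above is easy to compute directly, which can help you sanity-check your expression for $a_{i'}$. The following is a minimal sketch; the function name `no_intercept_fit` and the toy data are assumptions made for illustration.

```python
def no_intercept_fit(xs, ys):
    """Return (beta_hat, fitted values) for a regression through the origin."""
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    fitted = [beta_hat * x for x in xs]
    return beta_hat, fitted


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
beta_hat, fitted = no_intercept_fit(xs, ys)
print(beta_hat, fitted)
```

Once you have a candidate formula for $a_{i'}$, compute $\sum_{i'} a_{i'} y_{i'}$ for each observation and confirm that it matches the corresponding entry of `fitted`.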

Solution
