Chapter 6 Ordinary Least Squares

Let \(Y\) be our outcome random variable and \[ \v{X}=\begin{bmatrix} 1 & X_{[1]} & X_{[2]} & \ldots & X_{[k]} \end{bmatrix} \] be our predictor (or explanatory) vector containing \(k\) predictors and a constant. We denote the joint distribution of \((Y,\v{X})\) by \(F(y,\v{x})\), i.e., \[ F(y,\v{x})=\p{Y\leq y, \v{X}\leq\v{x}} =\p{Y\leq y,X_{[1]}\leq x_{[1]},\ldots,X_{[k]}\leq x_{[k]}}. \]

The dataset or sample is a collection of observations \(\{(Y_i,\v{X}_i): i=1,2,\ldots,n\}\). We assume that each observation \((Y_i,\v{X}_i)\) is a random (row) vector drawn from the common distribution, sometimes referred to as the population, \(F\).

For a given vector of (unknown) coefficients \(\v{\beta}=\begin{bmatrix}\beta_0 & \beta_1 & \ldots & \beta_k\end{bmatrix}^T\in\mathbb{R}^{k+1}\), we define the following cost function: \[ \widehat{S}(\v{\beta})=\frac{1}{n}\sum\limits_{i=1}^n(Y_i-\v{X}_i\v{\beta})^2. \] The cost function \(\widehat{S}({\v{\beta}})\) can also be thought of as the average of the squared residuals. In fact, \(\widehat{S}({\v{\beta}})\) is the moment (plug-in) estimator of the mean squared error, \[ S(\v{\beta})=\E{(Y-\v{X}\v{\beta})^2}. \]
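As a quick sketch of the cost function, the following Python snippet evaluates \(\widehat{S}(\v{\beta})\) on a small simulated dataset. The data, the seed, and the function name `S_hat` are illustrative choices, not part of the text; the only substance is the formula \(\frac{1}{n}\sum_i (Y_i - \v{X}_i\v{\beta})^2\).

```python
import numpy as np

# Hypothetical small dataset: n = 5 observations, k = 2 predictors.
# The first column of ones corresponds to the constant in the predictor vector.
rng = np.random.default_rng(0)
n, k = 5, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # shape (n, k+1)
Y = rng.normal(size=n)

def S_hat(beta, X, Y):
    """Average of squared residuals: (1/n) * sum_i (Y_i - X_i beta)^2."""
    residuals = Y - X @ beta
    return np.mean(residuals ** 2)

# At beta = 0 every residual is Y_i itself, so S_hat reduces to mean(Y**2).
beta = np.zeros(k + 1)
print(S_hat(beta, X, Y))
```

Note that \(\widehat{S}\) is always nonnegative, being an average of squares.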

We now minimize \(\widehat{S}({\v{\beta}})\) over all possible choices of \(\v{\beta}\in\mathbb{R}^{k+1}\). When the minimizer exists and is unique, we call it the least squares estimator, denoted \(\widehat{\v{\beta}}\).

Definition 6.1 ((Ordinary) Least Squares Estimator) The least squares estimator is \[ \widehat{\v{\beta}} =\underset{\v{\beta}\in\mathbb{R}^{k+1}}{\arg\min} \ \widehat{S}(\v{\beta}), \] provided it exists uniquely.
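The minimizer in Definition 6.1 can be computed numerically with standard linear-algebra routines. The sketch below, on simulated data with an assumed "true" coefficient vector chosen for illustration, obtains \(\widehat{\v{\beta}}\) via `numpy.linalg.lstsq` and checks that it achieves a lower cost \(\widehat{S}\) than a nearby perturbed coefficient vector, as the definition requires.

```python
import numpy as np

# Simulated data: n = 100 observations, k = 2 predictors plus a constant.
# true_beta is an illustrative choice, not something from the text.
rng = np.random.default_rng(1)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
true_beta = np.array([1.0, 2.0, -0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Least squares estimator: the minimizer of the average squared residual.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def S_hat(beta):
    """Cost function S_hat(beta) = (1/n) * sum_i (Y_i - X_i beta)^2."""
    return np.mean((Y - X @ beta) ** 2)

# beta_hat attains a cost no larger than that of any perturbed vector.
print(S_hat(beta_hat) <= S_hat(beta_hat + 0.01))
```

With \(n\) moderately large and little noise, \(\widehat{\v{\beta}}\) lands close to the coefficient vector used to generate the data, which is one way to sanity-check an implementation.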