Chapter 8 Large-Sample Regression

We assume that the best linear predictor, \(\mathscr{P}[Y|{\boldsymbol{X}}]\), of \(Y\) given \({\boldsymbol{X}}\) is \({\boldsymbol{X}}{\boldsymbol{\beta}}\), so that, writing \(\varepsilon=Y-{\boldsymbol{X}}{\boldsymbol{\beta}}\) for the projection error, \[ Y={\boldsymbol{X}}{\boldsymbol{\beta}}+\varepsilon. \] Theorem 4.3 then gives \[{\mathbb{E}\left[ \varepsilon \right]}=0,\mbox{ and }{\mathbb{E}\left[ {\boldsymbol{X}}^T\varepsilon \right]}={\boldsymbol{0}}.\]

We also assume that the dataset \(\{(Y_i,{\boldsymbol{X}}_i)\}\) is drawn i.i.d. from the joint distribution of \((Y,{\boldsymbol{X}})\). For each \(i\), we can write \[ Y_i={\boldsymbol{X}}_i{\boldsymbol{\beta}}+\varepsilon_i. \] In matrix notation, \[ {\boldsymbol{Y}}={\mathbb{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}. \] Then \[{\mathbb{E}\left[ {\boldsymbol{\varepsilon}} \right]}={\boldsymbol{0}},\text{ and } {\mathbb{E}\left[ {\mathbb{X}}^T{\boldsymbol{\varepsilon}} \right]}={\boldsymbol{0}}.\]
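The moment conditions above have exact sample analogs: when \({\boldsymbol{\beta}}\) is replaced by the least-squares estimate, the normal equations force the residual vector to be orthogonal to every column of \({\mathbb{X}}\) (and, with an intercept column, to sum to zero). A minimal simulation sketch, not taken from the text — the sample size, coefficients, and intercept-included design matrix are illustrative assumptions:

```python
import numpy as np

# Sketch: check the sample analogs of E[eps] = 0 and E[X^T eps] = 0
# for least-squares residuals. All numbers below are assumptions.
rng = np.random.default_rng(0)
n = 1000
# Design matrix with an intercept column plus two regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(size=n)          # noise, independent of X
Y = X @ beta + eps

# Least-squares estimate of the best-linear-predictor coefficients.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat

# The normal equations give X^T resid = 0 up to floating-point rounding;
# the intercept column makes resid.mean() = 0 as a special case.
print(np.abs(X.T @ resid).max())
print(abs(resid.mean()))
```

Both printed quantities are numerically zero (on the order of machine precision), illustrating that the orthogonality in Theorem 4.3 holds exactly in-sample for the fitted residuals, regardless of the noise distribution.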