--- title: "08 Regression" author: "David Housman" format: docx editor: visual fig-width: 6 fig-height: 4 --- ```{r, setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE) ``` ## Theory Given data for a single quantitative variable $$x_1,x_2,…,x_n,$$ we can choose to model the data with a single number $x$ that minimizes a measure of the error $$f(x) = \sum_{i=1}^{n} (x_i - x)^2.$$ We previously found that the mean $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$ minimizes $f$, and the standard deviation $$s = \sqrt{\frac{1}{n-1}f(\bar{x})} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$ is a consistent and natural measure of how well the model fits the data. In a similar manner, given data for two quantitative variables $$(x_1,y_1 ),(x_2,y_2 ),…,(x_n,y_n ),$$ we can choose to model the data with a linear equation $$y = \hat{b} + \hat{m}x$$ having the two parameters $\hat{b}$ and $\hat{m}$ that minimize a similar measure of error $$f(b, m) = \sum_{i=1}^{n} (y_i - b - mx_i)^2.$$ We prove later that the minimum is $$\begin{array}{lcl} \hat{b} &=& \bar{y} - \hat{m}\bar{x} \\ \hat{m} &=& r\dfrac{s_y}{s_x} \\ \end{array}$$ where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the standard deviations of the explanatory and response data, and $$r = \dfrac{1}{n-1}\left( \displaystyle\sum_{i=1}^{n} \dfrac{x_i-\bar{x}}{s_x}\dfrac{y_i-\bar{y}}{s_y}\right)$$ is the *correlation coefficient*: an adjusted average of the products of the data point coordinates expressed in units of standard deviations from the respective means, called their $z$*-scores*. As we will prove later, the correlation coefficient is always between $–1$ and $1$ with perfect correlation at the endpoints and no correlation when $r = 0$. This linear model is called the *regression equation* and the graph of this equation is called the *regression line*. The regression equation can be written $$y - \bar{y} = r\dfrac{s_y}{s_x}(x - \bar{x}).$$ This form has a natural interpretation. When the independent variable is the mean $\bar{x}$, the dependent variable is the mean $\bar{y}$, and for each standard deviation $s_x$ change in the independent variable, there is a change in the dependent variable equal to the standard deviation $s_y$ directed and attenuated by the correlation coefficient $r$. One measure for how well the model fits the data is the sum of the squared errors $$\text{SSE} = f(\hat{b},\hat{m}) = \sum_{i=1}^{n} (y_i - \hat{b} - \hat{m}x_i)^2.$$ A second measure of model fit is the standard error $$ \hat{\sigma} = \sqrt{\text{SSE} / (n-2)},$$ whose units match those of the response variable and can be interpreted as a "typical" deviation of the model predictions from the actual values. Notice that we can write $$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$ where $$\hat{y}_i = \hat{b} + \hat{m}x_i$$ is the model's $i$^th^ predicted value of the response variable; hence, SSE measures the variation in the response variable *not* explained by the model through variation of the independent variable. Analogously, a measure for the variation in the response variable explained by the model through variation of the independent variable is $$\text{SSM} = \sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2$$ where $\bar{\hat{y}}$ is the mean of the predicted values. 
Similarly, a measure for the total variation in the response variable is $$\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2.$$ We prove later that $$\text{SST} = \text{SSM} + \text{SSE},$$ that is, the variation of the response variable can be partitioned into the variation explained by the model and the variation not explained by the model. This equality motivates a third measure of model fit: the *coefficient of determination*, or *R-squared*, is the proportion of the variation in the response variable attributable to the variation in the explanatory variable as mediated by the model. Formally, $$R^2 = \dfrac{\text{SSM}}{\text{SST}} = 1 - \dfrac{\text{SSE}}{\text{SST}}.$$ Finally, we prove later that $$R^2 = r^2,$$ that is, the coefficient of determination is the square of the correlation coefficient.

## Example

For the bogus data set, we follow the theory with "by hand" computations.

| $x_i$ | $y_i$ | $x_i-\bar{x}$ | $(x_i-\bar{x})^2$ | $y_i-\bar{y}$ | $(y_i-\bar{y})^2$ | $z_{x_i}$ | $z_{y_i}$ | $z_{x_i}z_{y_i}$ |
|:-----:|:-----:|:-------------:|:-----------------:|:-------------:|:-----------------:|:---------:|:---------:|:----------------:|
| 0 | 0 | -2 | 4 | -5 | 25 | -1.265 | -1.387 | 1.754 |
| 1 | 4 | -1 | 1 | -1 | 1 | -0.632 | -0.277 | 0.175 |
| 2 | 5 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 |
| 3 | 6 | 1 | 1 | 1 | 1 | 0.632 | 0.277 | 0.175 |
| 4 | 10 | 2 | 4 | 5 | 25 | 1.265 | 1.387 | 1.754 |
| 10 | 25 | | 10 | SST | 52 | | | 3.859 |

$$\begin{array}{lll} \bar{x} = 10/5 = 2 & \qquad s_x = \sqrt{10/4} \approx 1.58 & \qquad\qquad r \approx 3.859/4 \approx 0.965\\ \bar{y} = 25/5 = 5 & \qquad s_y = \sqrt{52/4} \approx 3.61 & \qquad\qquad \hat{m} \approx 0.965(3.61/1.58) \approx 2.20 \\ & & \qquad\qquad \hat{b} \approx 5 - (2.2)(2) = 0.6 \end{array}$$

Thus, the regression equation can be written $$y = 2.2x + 0.6 \quad\text{ or }\quad y - 5 = 2.2(x - 2).$$

We now calculate the measures of model fit.

| $x_i$ | $y_i$ | $\hat{y}_i = 0.6 + 2.2x_i$ | $y_i-\hat{y}_i$ | $(y_i-\hat{y}_i)^2$ |
|:-----:|:-----:|:--------------------------:|:---------------:|:-------------------:|
| 0 | 0 | 0.6 | -0.6 | 0.36 |
| 1 | 4 | 2.8 | 1.2 | 1.44 |
| 2 | 5 | 5.0 | 0.0 | 0.00 |
| 3 | 6 | 7.2 | -1.2 | 1.44 |
| 4 | 10 | 9.4 | 0.6 | 0.36 |

The sum of squared errors is the sum of the last column: $$\text{SSE} = 0.36 + 1.44 + 0.00 + 1.44 + 0.36 = 3.60.$$ The standard error is $$\hat{\sigma} = \sqrt{3.60/(5-2)} \approx 1.095.$$ From the predicted values in the third column, the mean is $$\bar{\hat{y}} = (0.6 + 2.8 + 5.0 + 7.2 + 9.4) / 5 = 5,$$ and the sum of the squared deviations is $$\text{SSM} = \sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2 = (0.6-5)^2 + (2.8-5)^2 + (5-5)^2 + (7.2-5)^2 + (9.4-5)^2 = 48.4.$$ In the first table, we calculated the total sum of squares $$\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 = 52.$$ Observe that $\text{SSM} + \text{SSE} = 48.4 + 3.6 = 52 = \text{SST}$.
The coefficient of determination is $$R^2 = \frac{48.4}{52} \approx 0.931,$$ or, equivalently, using the correlation coefficient, $$R^2 \approx (0.965)^2 \approx 0.931.$$

## Example in base R

```{r}
# Enter the bogus data and fit the regression model.
bogus = data.frame(
  x = c(0,1,2,3,4),
  y = c(0,4,5,6,10)
)
model = lm(y ~ x, data = bogus)
coef(model)                                   # intercept b-hat and slope m-hat

# Measures of model fit.
sse = sum(residuals(model)^2); sse            # sum of squared errors
n = length(bogus$x)
se = sqrt(sse/(n-2)); se                      # standard error
rsq = var(fitted(model)) / var(bogus$y); rsq  # coefficient of determination
```

R provides a general numeric optimizer that can be used to find best-fit parameters when the error function to be minimized is a user-specified function: `optim(par, fn)` requires an initial guess `par` for the values of the parameters and a function `fn` to be minimized. The following code chunk finds the best-fit parameters $b$ and $m$ for the model $y = b + mx$ for the bogus data with respect to the sum of the absolute values of the errors.

```{r}
# Sum of the absolute values of the errors for parameters x = (b, m).
sum.errors = function(x) {
  b = x[1]; m = x[2]
  sum(abs(bogus$y - m*bogus$x - b))
}
optim(c(1,2), sum.errors)
```

## Proofs

From the definitions of the mean, variance, and correlation coefficient, we will often use the following equalities in our derivations:

- $n\bar{x} = \sum_{i=1}^{n} x_i$,
- $n\bar{y} = \sum_{i=1}^{n} y_i$,
- $(n-1)s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$,
- $(n-1)s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$, and
- $(n-1)s_xs_yr = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.

A necessary condition for $(\hat{b},\hat{m})$ to minimize $$f(b, m) = \sum_{i=1}^{n} (y_i - b - mx_i)^2$$ is that the first partial derivatives equal zero: $$\begin{array}{lclcl} 0 &=& \dfrac{\partial f}{\partial b} &=& \displaystyle\sum_{i=1}^{n} 2(y_i-\hat{b}-\hat{m}x_i)(-1) \\ 0 &=& \dfrac{\partial f}{\partial m} &=& \displaystyle\sum_{i=1}^{n} 2(y_i-\hat{b}-\hat{m}x_i)(-x_i). \end{array}$$

To verify that we are finding a minimum (rather than a maximum or a saddle point), we find the matrix of second partial derivatives: $$\begin{bmatrix} \frac{\partial^2 f}{\partial b^2} & \frac{\partial^2 f}{\partial b \partial m} \\ \frac{\partial^2 f}{\partial m \partial b} & \frac{\partial^2 f}{\partial m^2} \end{bmatrix} = \begin{bmatrix} 2n & 2\sum_{i=1}^{n} x_i \\ 2\sum_{i=1}^{n} x_i & 2\sum_{i=1}^{n} x_i^2 \end{bmatrix}.$$ The main diagonal elements are clearly positive, and the determinant $$4\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i \right)^2 \right) = 4n\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) = 4n(n-1)s_x^2$$ (the second equality is proved later) is also positive as long as the $x_i$ are not all equal, and so $f$ is concave up everywhere, which implies that the solution to the necessary conditions minimizes $f$.
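As a quick numeric check, the coefficients fitted by `lm()` in the example above satisfy these two necessary conditions: the residuals sum to zero, and the products of the residuals with the explanatory data also sum to zero. (This sketch assumes the `bogus` data frame and `model` from the base R example are still in the workspace.)

```{r}
# The first partial derivative conditions say that the residuals
# e_i = y_i - b-hat - m-hat*x_i satisfy sum(e_i) = 0 and sum(x_i * e_i) = 0.
e = residuals(model)
sum(e)            # zero up to rounding error
sum(bogus$x * e)  # zero up to rounding error
```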
The necessary condition equations are equivalent to $$\begin{array}{rcrcl} n\hat{b} &+& \left(\sum_{i=1}^{n} x_i\right)\hat{m} &=& \sum_{i=1}^{n} y_i \\ \left(\sum_{i=1}^{n} x_i\right)\hat{b} &+& \left(\sum_{i=1}^{n} x_i^2\right)\hat{m} &=& \sum_{i=1}^{n} x_iy_i \end{array}$$ The first equation is equivalent to $$\hat{b} = \bar{y} - \hat{m}\bar{x},$$ which implies that the model equation is of the form $$y - \bar{y} = \hat{m}(x - \bar{x}).$$ Substituting this expression for $\hat{b}$ into the second equation, we obtain $$n\bar{x}(\bar{y}-\hat{m}\bar{x}) + \left(\sum_{i=1}^{n} x_i^2\right)\hat{m} = \sum_{i=1}^{n} x_iy_i.$$ Solving for $\hat{m}$, we obtain $$\hat{m} = \dfrac{\left(\sum_{i=1}^{n} x_iy_i \right) - n\bar{x}\bar{y}}{\left(\sum_{i=1}^{n} x_i^2\right)-n\bar{x}^2}.$$

We digress for a moment to calculate $$\begin{array}{lcl} \sum_{i=1}^{n} (x_i-\bar{x})^2 &=& \sum_{i=1}^{n} (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\ &=& \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2 \\ &=& \sum_{i=1}^{n} x_i^2 - 2\bar{x}n\bar{x} + n\bar{x}^2 \\ &=& \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \end{array}$$ which tells us that the denominator of $\hat{m}$ is $$\begin{array}{lcl} \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 &=& \sum_{i=1}^{n} (x_i-\bar{x})^2 \\ &=& (n-1)s_x^2. \end{array}$$ This calculation also yields the second equality used earlier in computing the determinant of the matrix of second partial derivatives.

We digress again to calculate $$\begin{array}{lcl} \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) &=& \sum_{i=1}^{n} (x_iy_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) \\ &=& \sum_{i=1}^{n} x_iy_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \bar{x}\bar{y} \\ &=& \sum_{i=1}^{n} x_iy_i - \bar{y}n\bar{x} - \bar{x}n\bar{y} + n\bar{x}\bar{y} \\ &=& \sum_{i=1}^{n} x_iy_i - n\bar{x}\bar{y}, \end{array}$$ which tells us that the numerator of $\hat{m}$ is $$\begin{array}{lcl} \sum_{i=1}^{n} x_iy_i - n\bar{x}\bar{y} &=& \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) \\ &=& (n-1)s_xs_yr. \end{array}$$ Hence, $$\hat{m} = \dfrac{(n-1)s_xs_yr}{(n-1)s_x^2} = r\dfrac{s_y}{s_x}.$$

We defined the $i$^th^ prediction of the response variable by $$\hat{y}_i = \hat{b} + \hat{m}x_i,$$ which implies that the mean of the predictions is $$\begin{array}{lcl} \bar{\hat{y}} &=& \hat{b} + \hat{m}\bar{x} \\ &=& (\bar{y}-\hat{m}\bar{x}) + \hat{m}\bar{x} \\ &=& \bar{y}, \end{array}$$ that is, the mean of the response predictions is the mean of the response data.

Furthermore, $$\begin{array}{lcl} \text{SSM} &=& \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 \\ &=& \sum_{i=1}^{n} ((\hat{b} + \hat{m}x_i) - (\hat{b} + \hat{m}\bar{x}))^2 \\ &=& \sum_{i=1}^{n} (\hat{m}(x_i - \bar{x}))^2 \\ &=& \hat{m}^2\sum_{i=1}^{n} (x_i - \bar{x})^2 \\ &=& (rs_y/s_x)^2 (n-1)s_x^2 \\ &=& (n-1)r^2s_y^2 \end{array}$$ and $$\begin{array}{lcl} \text{SST} &=& \sum_{i=1}^{n} (y_i - \bar{y})^2 \\ &=& (n-1)s_y^2, \end{array}$$ which imply $$\begin{array}{lcl} R^2 &=& \dfrac{\text{SSM}}{\text{SST}} \\ &=& \dfrac{(n-1)r^2s_y^2}{(n-1)s_y^2} \\ &=& r^2, \end{array}$$ that is, the coefficient of determination is the square of the correlation coefficient.
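These identities are easy to check numerically for the bogus data (again reusing `bogus` and `model` from the base R example), computing SSM and SST directly from their definitions.

```{r}
# Check SSM = (n-1) r^2 s_y^2, SST = (n-1) s_y^2, and R^2 = r^2 for the bogus data.
r = cor(bogus$x, bogus$y)
n = nrow(bogus)
ssm = sum((fitted(model) - mean(fitted(model)))^2)
sst = sum((bogus$y - mean(bogus$y))^2)
c(ssm, (n - 1) * r^2 * var(bogus$y))  # both 48.4
c(sst, (n - 1) * var(bogus$y))        # both 52
c(ssm / sst, r^2)                     # both approximately 0.931
```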
We can now show that the variation of the response variable can be partitioned into the variation attributable to the model and the variation not attributable to the model (recall that $\bar{\hat{y}} = \bar{y}$): $$\begin{array}{lcl} \text{SST} &=& \sum_{i=1}^n (y_i - \bar{y})^2 \\ &=& \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\ &=& \sum_{i=1}^n (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \\ &=& \text{SSE} + 0 + \text{SSM} \end{array}$$ because the middle term $$\begin{array}{lcl} \displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) &=& \sum_{i=1}^n \left(y_i - (\bar{y} + \hat{m}(x_i - \bar{x}))\right)\left(\bar{y} + \hat{m}(x_i - \bar{x}) - \bar{y}\right) \\ &=& \sum_{i=1}^n \left((y_i - \bar{y}) - \hat{m}(x_i - \bar{x})\right)\hat{m}(x_i - \bar{x}) \\ &=& \hat{m}\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) - \hat{m}\sum_{i=1}^n (x_i - \bar{x})^2\right) \\ &=& \hat{m}\left((n-1)s_xs_yr - (rs_y/s_x)(n-1)s_x^2\right) \\ &=& 0. \end{array}$$

Finally, we prove that $-1 \leq r \leq 1$. Since the correlation coefficient is unchanged when the data are recentered and rescaled, we may replace each data value by its $z$-score and determine the extreme values of $$(n-1)r = \sum_{i=1}^n x_iy_i$$ subject to $$\sum_{i=1}^n x_i = \sum_{i=1}^n y_i = 0 \quad\text{ and }\quad \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 = n-1.$$ Consider the Lagrangian $$L = \sum_{i=1}^n x_i y_i + \alpha\sum_{i=1}^n x_i + \beta \sum_{i=1}^n y_i + \delta \sum_{i=1}^n x_i^2 + \gamma \sum_{i=1}^n y_i^2.$$ Necessary conditions for an extremum are $$\partial L/ \partial x_i = y_i + \alpha + 2\delta x_i = 0 \quad\text{ and }\quad \partial L/ \partial y_i = x_i + \beta + 2\gamma y_i = 0.$$ Summing each equation over $i$ and using the constraints $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i = 0$, we obtain $\alpha = \beta = 0$. The first equation can now be written $$y_i = -2\delta x_i.$$ Squaring both sides, summing over $i$, and using the constraints $\sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 = n-1$, we obtain $4\delta^2 = 1$, that is, $\delta = \pm 1/2$, and so $y_i = \pm x_i$ with the same sign for every $i$. Hence, $$(n-1)r = \sum_{i=1}^n x_iy_i = \pm\sum_{i=1}^n x_i^2 = \pm(n-1),$$ which shows that the extreme values of $r$ are $-1$ and $1$.
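As a purely numerical illustration of these last two results (not a substitute for the proofs), the following chunk checks the partition $\text{SST} = \text{SSM} + \text{SSE}$ for the bogus data and confirms that correlation coefficients of randomly generated data sets fall between $-1$ and $1$, with the endpoints attained for perfectly linear data.

```{r}
# Check SST = SSM + SSE for the bogus data.
sst = sum((bogus$y - mean(bogus$y))^2)
ssm = sum((fitted(model) - mean(fitted(model)))^2)
sse = sum(residuals(model)^2)
c(sst, ssm + sse)                    # both 52

# Correlations of randomly generated data sets always lie between -1 and 1.
set.seed(1)
r.values = replicate(1000, cor(rnorm(10), rnorm(10)))
range(r.values)

# The endpoints are attained for perfectly linear data.
x = 1:10
c(cor(x, 2 + 3*x), cor(x, 2 - 3*x))  # 1 and -1
```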