---
title: "C09 Multiple Regression"
author: "David Housman"
output:
  word_document: default
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

# Background

Our goal is to explain the variation of a response variable $y$ with one or more explanatory variables. We will continue to use the `lm` function to find best fit linear models. The table below shows examples of how the abbreviated forms used in the R `lm` function translate into mathematical functions under the assumption that $x$ is a numeric variable and $w$ is a categorical variable with levels $l0, l1, l2$.

| R form | Mathematical Form |
|:-----------------------------------|:-----------------------------------|
| $y \sim x$   | $y = c_0 + c_1x$   |
| $y \sim x_1 + x_2$   | $y = c_0 + c_1x_1 + c_2x_2$   |
| $y \sim x_1 * x_2$   | $y = c_0 + c_1x_1 + c_2x_2 + c_{12}x_1x_2$   |
| $y \sim x + I(x^2)$   | $y = c_0 + c_1x + c_2x^2$   |
| $y \sim 0 + x$   | $y = c_1x$   |
| $y \sim w$   | $y = c_0 + c_{l1}I_{l1}(w) + c_{l2}I_{l2}(w)$   |
| $y \sim x + w$   | $y = c_0 + c_{l1}I_{l1}(w) + c_{l2}I_{l2}(w) + c_1x$   |
| $y \sim x * w$   | $y = c_0 + c_{l1}I_{l1}(w) + c_{l2}I_{l2}(w) + c_1x + c_{1,l1}I_{l1}(w)x + c_{1,l2}I_{l2}(w)x$   |

As a measure of how well a model fits the data, we will primarily use

1.  The coefficient of determination $R^2$. This is the proportion of variation in the sample associated with the model.

2.  The adjusted R-squared $R_{\text{adj}}^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$. This is an estimated proportion of variation in the populations associated with the model (there are many different estimates in the literature). The formula penalizes for the use of additional parameters.

3.  The residual standard error $\hat{\sigma}$. This can be loosely interpreted as the value for which about 68% of the data are within their model predicted values and about 95% of the data are within $2\hat{\sigma}$ their model predicted values

With a single explanatory variable, we have nice visualizations: scatter plot for a numeric explanatory variable and grouped boxplots or violin plots for a categorical explanatory variable. If we have one numeric and one categorical explanatory variables, a scatter plot with colors representing the different levels of the categorical variable is a nice visualization. With more categorical variables, facets can be used, but it is usually difficult to obtain nice visualizations when there is more than one numeric explanatory variable, although size of the data points is one possibility.

It is sometimes useful to transform our data in some manner or to consider other than linear models. We will only touch upon these ideas here.

## Packages

Load the libraries containing useful functions for importing, wrangling, visualizing, tidying, and analyzing data.

```{r, warning=F, message=F}
library(tidyverse)
library(moderndive)
```

## Iris Petal Length = f(Petal Width)

```{r}
ggplot(iris, aes(y = Petal.Length, x = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
modpw = lm(Petal.Length ~ Petal.Width, iris)
coef(modpw)
```

```{r}
predict(modpw, tibble(Petal.Width = 0.1))
```


```{r}
tibble(
  r2  = summary(modpw)$r.squared,
  ar2 = summary(modpw)$adj.r.squared,
  rse = summary(modpw)$sigma
)
```

Interpretation: The regression model is $\widehat{\text{Petal.Length}} = 1.31 + 2.23(\text{Petal.Width} - 0.10)$. The model predicts that an iris with a petal width of 0.10 cm will have a petal length of 1.31 cm, and for each 1 cm increase in the petal width there will be a 2.23 cm increase in the petal length. The model predictions will typically deviate from the actual petal lengths by 0.48 cm. The variation in petal width explains 93% of the variation in the petal length.

## Iris Petal Length = f(Species)

```{r}
ggplot(iris, aes(y = Petal.Length, x = Species)) +
  geom_violin() +
  stat_summary(fun = mean, geom = "crossbar", width = 0.22, color = "blue")
```

```{r}
mod1 = lm(Petal.Length ~ Species, iris)
coef(mod1)
```

```{r}
tibble(
  r2  = summary(mod1)$r.squared,
  ar2 = summary(mod1)$adj.r.squared,
  rse = summary(mod1)$sigma
)
```

Interpretation: The model predicts that the petal length will be 1.46, 4.26, and 5.55 cm for a setosa, versicolor, and virginica iris. The model predictions will typically deviate from the actual petal lengths by 0.43 cm. The variation in petal width explains 94% of the variation in the petal length.

## Iris Petal Length = f(Species, Petal Width) With interaction

```{r}
ggplot(iris, aes(y = Petal.Length, x = Petal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
modpwsi = lm(Petal.Length ~ Petal.Width * Species, iris)
coef(modpwsi)
```

$$\widehat{\text{Petal.Length}} = 
\begin{cases}
1.33 + 0.55\text{Petal.Width} & \text{if Species = setosa} \\
1.78 + 1.87\text{Petal.Width} & \text{if Species = versicolor} \\
4.23 + 0.65\text{Petal.Width} & \text{if Species = virginica} \\
\end{cases}$$

```{r}
iris |> group_by(Species) |> summarise(min(Petal.Width))
```

```{r}
predict(modpwsi, tibble(Petal.Width = c(0.1,1.0,1.4), Species = c("setosa", "versicolor", "virginica")))
```

```{r}
tibble(
  r2  = summary(modpwsi)$r.squared,
  ar2 = summary(modpwsi)$adj.r.squared,
  rse = summary(modpwsi)$sigma
)
```

Interpretation: The model predicts that the petal length will be 1.38, 3.65, and 5.15 cm for a setosa, versicolor, and virginica iris of minimum petal width (0.1, 1.0, and 1.4 cm), and each additional 1 cm of petal width is associated with a 0.55, 1.87, and 0.65 cm increase in petal length. The model predictions will typically deviate from the actual petal lengths by 0.36 cm. The variation in petal width explains 96% of the variation in the petal length.

## Iris Petal Length = f(Species, Petal Width) With no interaction

```{r}
ggplot(iris, aes(y = Petal.Length, x = Petal.Width, color = Species)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)
```

```{r}
modpws = lm(Petal.Length ~ Petal.Width + Species, iris)
coef(modpws)
```

```{r}
predict(modpws, tibble(Petal.Width = c(0.1,1.0,1.4), Species = c("setosa", "versicolor", "virginica")))
```

```{r}
tibble(
  r2  = summary(modpws)$r.squared,
  ar2 = summary(modpws)$adj.r.squared,
  rse = summary(modpws)$sigma
  
)
```

Interpretation: The model predicts that the petal length will be 1.31, 3.93, and 4.91 cm for a setosa, versicolor, and virginica iris of minimum petal width (0.1, 1.0, and 1.4 cm), and each additional 1 cm of petal width is associated with a 2.28 cm increase in petal length. The model predictions will typically deviate from the actual petal lengths by 0.38 cm. The variation in petal width explains 96% of the variation in the petal length.

## Iris Petal.Length as a function of multiple variables

```{r}
analyze = function(formula) {
  model = lm(formula, iris)
  tibble(
    formula = deparse(formula),
    r2 = summary(model)$r.squared,
    ar2 = summary(model)$adj.r.squared,
    npar = length(coef(model))
  )
}
bind_rows(
  analyze(Petal.Length ~ Petal.Width),
  analyze(Petal.Length ~ Petal.Width + Sepal.Length),
  analyze(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width),
  analyze(Petal.Length ~ Petal.Width + Species),
  analyze(Petal.Length ~ Petal.Width + Sepal.Length + Species),
  analyze(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + Species),
  analyze(Petal.Length ~ Petal.Width * Sepal.Length * Sepal.Width * Species)
)
```