---
title: "C07 Quantitative vs. Quantitative"
author: "David Housman"
format: docx
editor: visual
fig-width: 6
fig-asp: 0.618
out-width: "70%"
fig-align: center
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

## Introduction

For a single quantitative variable a histogram, density, box, or ecdf plot are common visualizations, the mean and median are common statistical models, and the interquartile range, variance, and standard deviation are common measures of model fit. For a single quantitative response variable as a function of a qualitative explanatory variable, these visualizations, statistical models, and measures of model fit have natural generalizations. For a quantitative response variable as a function of a quantitative response variable, a scatter plot with a model curve is the primary visualization, an equation involving parameters is the primary statistical model, and the coefficient of determination, sum of squared errors, and standard error are common measures of model fit.

## By Hand Example

Complete the *Gas Usage Exercise* contained in a separate handout.

## R Tidyverse Example

Load the important packages.

```{r}
#| message: false
#| warning: false
library(tidyverse)
```

Rather than importing data from a file into a `data.frame` (part of base R) or a `tibble` (part of the `tidyverse`), here is how to create one in a `qmd` file using the gas usage example data.

```{r}
usage = tibble(
  month = seq(0.5, 11.5, 1),
  temp = c(25.4,28.2,30.2,47.9,58.5,65.3,73.3,75.7,71.6,65.1,52.1,36.5),
  gas  = c(11.9,11.8, 9.5, 4.9, 3.3, 1.7, 1.0, 1.1, 1.1, 2.4, 5.0, 7.8)
)
```

1.  Obtain a time series plot of gas usage as a function of month.

```{r}
ggplot(usage, aes(x = month,y = gas)) +
  geom_point() + 
  geom_line() +
  scale_x_continuous(breaks = 0:12)
```

For the remaining exercises, consider daily gas usage as a function of temperature.

2.  Obtain a scatter plot with the best-fit linear model.

```{r}
ggplot(usage, aes(x = temp, y = gas)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Bowen-Housman Home") +
  xlab("Temperature (°F)") + #degree symbol Alt-0176
  ylab("Gas Usage (ccf)")
```

3.  Obtain the best-fit linear model.

```{r}
model1 = lm(gas ~ temp, data = usage)
coef(model1)
```

4.  Interpret the best-fit linear model.

5.  Use the model to predict gas usage during a month in which the average temperature was 20°F.

```{r}
gas1 = 16.610875 - 0.218848*20
gas2 = predict(model1, tibble(temp = 20))
c(gas1, gas2)
```

```{r}
predict(model1, tibble(temp = c(20, 21, 22)))
```

6.  Obtain the coefficient of determination or R^2^. Interpret.

```{r}
rsq1 = var(fitted(model1)) / var(usage$gas)
rsq2 = summary(model1)$r.squared
variation = function(x) {
  s = 0
  for(xi in x) for(xj in x) s = s + (xi - xj)^2
  s
}
rsq3 = variation(fitted(model1)) / variation(usage$gas)
c(rsq1, rsq2, rsq3)
```

7.  Obtain the sum of squared errors. Interpret.

```{r}
sse1 = sum((fitted(model1) - usage$gas)^2)
sse2 = sum(residuals(model1)^2)
sse3 = anova(model1)$`Sum Sq`[-1]
c(sse1, sse2, sse3)
```

8.  Obtain the standard error. Interpret.

```{r}
n = nrow(usage)
se1 = sqrt(sse1/(n-2))
se2 = summary(model1)$sigma
c(se1, se2)
```

9.  Obtain the model summary and point out where the model coefficients and two of the measures of model fit can be found.

```{r}
summary(model1)
```

10. Obtain the ANOVA (analysis of variance) table and point out where one measure of model fit can be found.

```{r}
anova(model1)
```

11. Instead of the model $\text{gas} = c_0 + c_1\text{temp}$, try the model $\text{gas} = c_0 + c_1\text{temp} + c_2\text{temp}^2$,

```{r}
ggplot(usage, aes(x = temp, y = gas)) +
  geom_point() +
  geom_smooth(formula = y ~ I(x^2) + x, method = "lm", se = FALSE) + #poly(x,2) as an alternative
  ggtitle("Bowen-Housman Home") +
  xlab("Temperature (°F)") + #degree symbol Alt-0176
  ylab("Gas Usage (ccf)")
```

```{r}
model2 = lm(gas ~ temp + I(temp^2), data = usage)
coef(model2)
```

```{r}
rsq = var(fitted(model2)) / var(usage$gas)
sse = sum(residuals(model2)^2)
n = nrow(usage)
se = sqrt(sse/(n-2))
c(rsq, sse, se)
```

Is the R^2^ increase from 96.3% to 98.7% and the decrease in the standard error from 0.84 ccf to 0.50 ccf worth the added parameter and the difficulty in describing the meaning of the model?

A more interpretable model with three parameters would be of the form

$$\text{gas} = 
\begin{cases} 
c_0 + c_1(t_1 - \text{temp}) & \text{if temp} < t_1 \\
c_0                          & \text{if temp} \geq t_1
\end{cases}$$

The model means that for temperatures above $t_1$ °F, $c_0$ ccf of gas would be used, and for each one degree Fahrenheit decrease in temperature the gas usage would increase by $c_1$ ccf.