--- title: "C07 Quantitative vs. Quantitative" author: "David Housman" format: docx editor: visual fig-width: 6 fig-asp: 0.618 out-width: "70%" fig-align: center --- ```{r, setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE) ``` ## Introduction For a single quantitative variable a histogram, density, box, or ecdf plot are common visualizations, the mean and median are common statistical models, and the interquartile range, variance, and standard deviation are common measures of model fit. For a single quantitative response variable as a function of a qualitative explanatory variable, these visualizations, statistical models, and measures of model fit have natural generalizations. For a quantitative response variable as a function of a quantitative response variable, a scatter plot with a model curve is the primary visualization, an equation involving parameters is the primary statistical model, and the coefficient of determination, sum of squared errors, and standard error are common measures of model fit. ## By Hand Example Complete the *Gas Usage Exercise* contained in a separate handout. ## R Tidyverse Example Load the important packages. ```{r} #| message: false #| warning: false library(tidyverse) ``` Rather than importing data from a file into a `data.frame` (part of base R) or a `tibble` (part of the `tidyverse`), here is how to create one in a `qmd` file using the gas usage example data. ```{r} usage = tibble( month = seq(0.5, 11.5, 1), temp = c(25.4,28.2,30.2,47.9,58.5,65.3,73.3,75.7,71.6,65.1,52.1,36.5), gas = c(11.9,11.8, 9.5, 4.9, 3.3, 1.7, 1.0, 1.1, 1.1, 2.4, 5.0, 7.8) ) ``` 1. Obtain a time series plot of gas usage as a function of month. ```{r} ggplot(usage, aes(x = month,y = gas)) + geom_point() + geom_line() + scale_x_continuous(breaks = 0:12) ``` For the remaining exercises, consider daily gas usage as a function of temperature. 2. Obtain a scatter plot with the best-fit linear model. ```{r} ggplot(usage, aes(x = temp, y = gas)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + ggtitle("Bowen-Housman Home") + xlab("Temperature (°F)") + #degree symbol Alt-0176 ylab("Gas Usage (ccf)") ``` 3. Obtain the best-fit linear model. ```{r} model1 = lm(gas ~ temp, data = usage) coef(model1) ``` 4. Interpret the best-fit linear model. 5. Use the model to predict gas usage during a month in which the average temperature was 20°F. ```{r} gas1 = 16.610875 - 0.218848*20 gas2 = predict(model1, tibble(temp = 20)) c(gas1, gas2) ``` ```{r} predict(model1, tibble(temp = c(20, 21, 22))) ``` 6. Obtain the coefficient of determination or R^2^. Interpret. ```{r} rsq1 = var(fitted(model1)) / var(usage$gas) rsq2 = summary(model1)$r.squared variation = function(x) { s = 0 for(xi in x) for(xj in x) s = s + (xi - xj)^2 s } rsq3 = variation(fitted(model1)) / variation(usage$gas) c(rsq1, rsq2, rsq3) ``` 7. Obtain the sum of squared errors. Interpret. ```{r} sse1 = sum((fitted(model1) - usage$gas)^2) sse2 = sum(residuals(model1)^2) sse3 = anova(model1)$`Sum Sq`[-1] c(sse1, sse2, sse3) ``` 8. Obtain the standard error. Interpret. ```{r} n = nrow(usage) se1 = sqrt(sse1/(n-2)) se2 = summary(model1)$sigma c(se1, se2) ``` 9. Obtain the model summary and point out where the model coefficients and two of the measures of model fit can be found. ```{r} summary(model1) ``` 10. Obtain the ANOVA (analysis of variance) table and point out where one measure of model fit can be found. ```{r} anova(model1) ``` 11. Instead of the model $\text{gas} = c_0 + c_1\text{temp}$, try the model $\text{gas} = c_0 + c_1\text{temp} + c_2\text{temp}^2$, ```{r} ggplot(usage, aes(x = temp, y = gas)) + geom_point() + geom_smooth(formula = y ~ I(x^2) + x, method = "lm", se = FALSE) + #poly(x,2) as an alternative ggtitle("Bowen-Housman Home") + xlab("Temperature (°F)") + #degree symbol Alt-0176 ylab("Gas Usage (ccf)") ``` ```{r} model2 = lm(gas ~ temp + I(temp^2), data = usage) coef(model2) ``` ```{r} rsq = var(fitted(model2)) / var(usage$gas) sse = sum(residuals(model2)^2) n = nrow(usage) se = sqrt(sse/(n-2)) c(rsq, sse, se) ``` Is the R^2^ increase from 96.3% to 98.7% and the decrease in the standard error from 0.84 ccf to 0.50 ccf worth the added parameter and the difficulty in describing the meaning of the model? A more interpretable model with three parameters would be of the form $$\text{gas} = \begin{cases} c_0 + c_1(t_1 - \text{temp}) & \text{if temp} < t_1 \\ c_0 & \text{if temp} \geq t_1 \end{cases}$$ The model means that for temperatures above $t_1$ °F, $c_0$ ccf of gas would be used, and for each one degree Fahrenheit decrease in temperature the gas usage would increase by $c_1$ ccf.