--- title: "C07 Quantitative vs. Quantitative" author: "David Housman" format: docx editor: visual fig-width: 6 fig-asp: 0.618 execute: enabled: false --- ## Introduction For a single quantitative variable a histogram, density plot, violin plot, box plot, or ecdf plot are common visualizations, the mean and median are common statistical models, and the variance, standard deviation, average deviation from the median, interquartile range, and coverage intervals are common measures of model fit. For a single quantitative response variable as a function of a qualitative explanatory variable, these visualizations, statistical models, and measures of model fit are again used grouped by the values of the explanatory variable. For a quantitative response variable as a function of a quantitative response variable, a scatter plot with a model curve is the primary visualization, an equation involving parameters is the primary statistical model, and the coefficient of determination, sum of squared errors, and standard error are common measures of model fit. ## By Hand Example Complete the *Gas Usage Exercise* contained in a separate handout. ## R Tidyverse Example Load the needed packages for data wrangling and visualizing. ```{r} #| message: false #| warning: false library(dplyr) library(ggplot2) ``` Rather than importing data from a file into a `data.frame` (part of base R) or a `tibble` (part of the `tidyverse`), here is how to create one in a `qmd` file using the gas usage example data. ```{r} usage = tibble( month = seq(0.5, 11.5, 1), temp = c(25.4,28.2,30.2,47.9,58.5,65.3,73.3,75.7,71.6,65.1,52.1,36.5), gas = c(11.9,11.8, 9.5, 4.9, 3.3, 1.7, 1.0, 1.1, 1.1, 2.4, 5.0, 7.8) ) ``` 1. Obtain a time series plot of gas usage as a function of month. ```{r} ggplot(usage, aes(x = month,y = gas)) + geom_point() + geom_line() + scale_x_continuous(breaks = 0:12) + ggtitle("Bowen-Housman Home") + xlab("Month") + ylab("Gas Usage (ccf)") ``` For the remaining exercises, consider daily gas usage as a function of temperature. 2. Obtain a scatter plot with the best-fit linear model. ```{r} ggplot(usage, aes(x = temp, y = gas)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + ggtitle("Bowen-Housman Home") + xlab("Temperature (°F)") + #degree symbol Alt-0176 ylab("Gas Usage (ccf)") ``` 3. Obtain the best-fit linear model. ```{r} model1 = lm(gas ~ temp, data = usage) coef(model1) ``` ```{r} 16.610875/0.218848 ``` ```{r} usage |> summarise(mean(temp), mean(gas)) ``` 4. Interpret the best-fit linear model written in three different forms. 5. Use the model to predict gas usage during a month in which the average temperature was 20°F. ```{r} tibble( gas1 = 16.610875 - 0.218848*20, gas2 = predict(model1, tibble(temp = 20)) ) ``` 6. Obtain the coefficient of determination or R^2^. Interpret. ```{r} rsq1 = var(fitted(model1)) / var(usage$gas) rsq2 = summary(model1)$r.squared variation = function(x) { s = 0 for(xi in x) for(xj in x) s = s + (xi - xj)^2 s } rsq3 = variation(fitted(model1)) / variation(usage$gas) c(rsq1, rsq2, rsq3) ``` 7. Obtain the sum of squared errors. Interpret. ```{r} sse1 = sum((fitted(model1) - usage$gas)^2) sse2 = sum(residuals(model1)^2) sse3 = anova(model1)$`Sum Sq`[-1] c(sse1, sse2, sse3) ``` 8. Obtain the standard error. Interpret. ```{r} n = nrow(usage) se1 = sqrt(sse1/(n-2)) se2 = summary(model1)$sigma c(se1, se2) ``` 9. Obtain the model summary and point out where the model coefficients and two of the measures of model fit can be found. ```{r} summary(model1) ``` 10. Obtain the ANOVA (analysis of variance) table and point out where one measure of model fit can be found and how to easily calculate another measure of model fit. ```{r} anova(model1) ``` 11. Instead of the linear model $\text{gas} = c_0 + c_1\text{temp}$, try the the quadratic model $\text{gas} = c_0 + c_1\text{temp} + c_2\text{temp}^2$, ```{r} ggplot(usage, aes(x = temp, y = gas)) + geom_point() + geom_smooth(formula = y ~ I(x^2) + x, method = "lm", se = FALSE) + #poly(x,2) as an alternative ggtitle("Bowen-Housman Home") + xlab("Temperature (°F)") + #degree symbol Alt-0176 ylab("Gas Usage (ccf)") ``` ```{r} model2 = lm(gas ~ temp + I(temp^2), data = usage) coef(model2) ``` ```{r} rsq = var(fitted(model2)) / var(usage$gas) sse = sum(residuals(model2)^2) n = nrow(usage) se = sqrt(sse/(n-2)) c(rsq, sse, se) ``` A more interpretable model with three parameters would be of the form $$\text{gas} = \begin{cases} c_0 + c_1(t_1 - \text{temp}) & \text{if temp} < t_1 \\ c_0 & \text{if temp} \geq t_1 \end{cases}$$ The model means that for temperatures above $t_1$ °F, $c_0$ ccf of gas would be used, and for each one degree Fahrenheit decrease in temperature the gas usage would increase by $c_1$ ccf. ```{r} sum.sq.errors = function(x) { c0 = x[1]; c1 = x[2]; t1 = x[3] ypredict = if_else(usage$temp < t1, c0 + c1*(t1 - usage$temp), c0) sum((usage$gas - ypredict)^2) } optim(c(1,0.2,70), sum.sq.errors) ``` ```{r} sst = 181.702 + 7.021 sse = 4.25354 rsq = 1 - sse/sst rsq ``` This R^2^ of 97.7% can be compared with the linear model R\^2\$ of 96.3% and the quadratic model R^2^ of 98.7%. ```{r} fun3 = function(x) { if_else(x < 67.7594210, 1.0666850 + 0.2427267*(67.7594210 - x), 1.0666850) } ggplot(usage, aes(x = temp, y = gas)) + geom_point() + geom_function(fun = fun3, color = "blue") + ggtitle("Bowen-Housman Home") + xlab("Temperature (°F)") + #degree symbol Alt-0176 ylab("Gas Usage (ccf)") ```